Occasionally I write about debugging, for the edification of others and to try to explain to muggles what I do all day. I ran into a fun one the other day.
Unicode
Joel Spolsky’s explanation of Unicode is excellent, but long. In brief: on a computer, we represent letters (“a”, “b” and so on) as numbers. Computers work with zeroes and ones, binary digits (or bits), usually in groups of 8 bits called bytes. Back in the mists of time, someone came up with ASCII, a way to represent decent American letters by giving each letter a number. All those numbers fitted in a single byte (a byte can represent 256 different numbers), so one byte was one letter, and all was well… unless you weren’t American and wanted to represent funny foreign letters like “£”, or some non-Latin alphabet, or a frowning pile of poo.
The modern way of handling those foreign letters and poos is Unicode. Each different letter still has a number assigned to it, but there are a lot of them, so the numbers can be bigger than you can fit in a byte. Computers still like to work in bytes, so you need to represent a letter using a sequence of one or more bytes. A way of doing this is called an encoding. One popular encoding, UTF-8, has the handy feature that all those decent American letters have the same single byte representation as they did in ASCII, but other letters get longer sequences of bytes.
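To make that concrete, here is a small sketch (assuming Python 3.8 or later) which encodes three sample characters as UTF-8 and prints the bytes. The choice of characters is mine, purely for illustration.

```python
# Encode a few characters as UTF-8 and look at the bytes that come out.
samples = ["a", "£", "\N{PILE OF POO}"]

for ch in samples:
    encoded = ch.encode("utf-8")
    # ord() gives the Unicode number; hex(" ") shows each byte separately.
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

# How many bytes each sample needs in UTF-8.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
```

The decent American letter takes one byte, the pound sign two, and the poo four.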
The Internet
The series of tubes we call the Internet is a way of carrying bytes around. As a programmer, you often end up writing code to connect to other computers and read data. Suppose we just want to sit there forever doing something with a continuous stream of bytes the other computer is sending us[1]:
connection = connect_to_the_thing()
# loop forever
while True:
    # receive up to 1024 bytes from the other computer
    data = connection.recv(1024)
    do_something_with(data)
The data that comes back from the other computer is a series of bytes. What if you know it’s UTF-8 encoded text, and you want to turn those bytes into that text?
connection = connect_to_the_thing()
# loop forever
while True:
    # receive up to 1024 bytes from the other computer
    data = connection.recv(1024)
    # turn it into text
    text = data.decode("utf-8")
    do_something_with(text)
This seems to work fine, but very occasionally crashes on the decode() line with a mysterious error message: “UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0xe2 in position 1023: unexpected end of data”. Whaaat?
Some frantic Googling of “UnicodeDecodeError” turns up a bunch of people getting that error because they weren’t actually reading UTF-8 encoded text at all, but something else[2]. So, you check what the other side is sending, and in this case, you’re pretty sure it is sending UTF-8. Whaaat?
Squint at the error message a bit more, and you find it’s complaining about the last byte it’s read. You have to give the recv() a maximum number of bytes to read, so you picked 1024 (a handy power of 2, as is traditional). “Position 1023” is the 1024th byte received (since we start counting from 0, as is traditional). That “0xe2” thing is hexadecimal E2, equivalent to 11100010 in binary. Read the UTF-8 stuff a bit more, and you find that 11100010 means “this letter is made up of this byte and the two more bytes following this one”. It stopped in the middle of the sequence of bytes which represent a single letter, hence the “unexpected end of data” in the error message.
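Here is a rough sketch of that reasoning in code. The sequence_length() helper is my own illustrative name, not something from the standard library, and it only looks at the leading bits, so it does not reject every invalid byte. The euro sign is just a convenient letter whose encoding starts with 0xe2.

```python
def sequence_length(lead_byte):
    """Guess how many bytes a UTF-8 sequence has from its first byte."""
    if lead_byte < 0b10000000:      # 0xxxxxxx: plain ASCII, one byte
        return 1
    if lead_byte >> 5 == 0b110:     # 110xxxxx: this byte plus one more
        return 2
    if lead_byte >> 4 == 0b1110:    # 1110xxxx: this byte plus two more
        return 3
    if lead_byte >> 3 == 0b11110:   # 11110xxx: this byte plus three more
        return 4
    raise ValueError("not the first byte of a UTF-8 sequence")

print(format(0xE2, "08b"))       # 11100010
print(sequence_length(0xE2))     # 3

# Cutting a three-byte letter short reproduces the error message.
euro = "€".encode("utf-8")       # b'\xe2\x82\xac'
try:
    euro[:1].decode("utf-8")
except UnicodeDecodeError as err:
    reason = err.reason
    print(reason)                # unexpected end of data
```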
At this point, if you have control over the other computer, you might be thinking up cunning schemes to ensure that what it passes to each send() is always less than 1024 bytes at a time, without breaking up a multi-byte letter. After all, the data goes out in packets, so what you get when you invoke recv() must line up with the other side’s send()s, right? Wrong.
Avian carrier
The series of tubes is narrower in some places than others, and your data may be broken up to fit. A single carrier pigeon can only carry so much weight, you see, and the RSPB is pretty strict about that sort of thing. All that’s guaranteed is that you get the bytes out in the order they went in, not how many you get out at a time.
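You can play at being the pigeon fancier without a network. In this sketch, rechunk() is a hypothetical stand-in for whatever the tubes do to your data: the bytes always come out in order, but whether each piece decodes on its own depends on where the cuts happen to fall.

```python
def rechunk(payload, size):
    """Hypothetical stand-in for the network: re-slice a byte stream."""
    return [payload[i:i + size] for i in range(0, len(payload), size)]

def decodes_cleanly(chunks):
    """True if every chunk happens to be valid UTF-8 on its own."""
    try:
        for chunk in chunks:
            chunk.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

sent = "costs £5".encode("utf-8")   # 9 bytes; the £ takes up two of them

# Try every chunk size and record whether naive decoding survives it.
results = {
    size: decodes_cleanly(rechunk(sent, size))
    for size in range(1, len(sent) + 1)
}
for size, ok in results.items():
    print(size, "fine" if ok else "splits a letter")
```

Joining the chunks back together always gives the original bytes; it is only the chunk-at-a-time decoding that is at the mercy of the cut points.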
Fortunately, Guido thought of this and blessed us with IncrementalDecoder, which knows how to remember that it was part way through a letter when it left off, so that the next time around the loop, it’ll hopefully get the rest of the bytes and give you the letter you were hoping for:
import codecs

connection = connect_to_the_thing()
decoder_class = codecs.getincrementaldecoder("utf-8")
# Make a new instance of the decoder_class
decoder = decoder_class()
# loop forever
while True:
    # receive up to 1024 bytes from the other computer
    data = connection.recv(1024)
    text = decoder.decode(data)
    do_something_with(text)
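If you want to try this without a network, here is a self-contained sketch. decode_stream() is an illustrative helper of my own, and the list of chunks stands in for successive recv() calls. It also shows one option for the end of the stream: passing final=True makes the decoder complain if the data stopped part way through a letter.

```python
import codecs

def decode_stream(chunks, encoding="utf-8"):
    """Yield text from an iterable of byte chunks, whatever the boundaries."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:  # empty while the decoder is waiting for the rest of a letter
            yield text
    # final=True raises UnicodeDecodeError if a letter was left unfinished.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

# "€5" is b'\xe2\x82\xac5'; split the euro sign across three chunks.
pieces = list(decode_stream([b"\xe2", b"\x82", b"\xac5"]))
print(pieces)  # ['€5']
```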
[1] We’ll not worry about the other side closing the connection or the wifi packing up, for now.
[2] I do wonder whether questions on Stack Overflow about errors from Python’s Unicode handling have more views in the aggregate than the “How do I exit Vim?” question (which is at 2.1 million views as I write this).
Contra Internet (“shoe”) atheism: I don’t need to be able to prove a thing to you before I can rationally believe it. (tags: philosophy belief epistemology Atheism proof)
Make replacement in Python which finds file dependencies by using strace to work out which files the compiler reads. (tags: python tools build make programming)
Things to bear in mind before starting on your quest to replace Make, especially if you’re writing your own replacement. (tags: make build programming tools)
Social psychologist and author Carol Tavris on “Who’s Lying? Who’s Self-Justifying?: Origins of the He Said/She Said Gap in Sexual Communications”. Discusses sexual assault but is mainly about discussions of sexual assault and dissonance. (tags: sex sexism psychology scepticism cognitive-bias evidence)
Descriptions of strange and horrifying objects being held by a secret organisation. If you liked Stross's Laundry stuff, you might like this. Time sink warning, there are lots of them. Looks like it's a collaboration using a wiki. (tags: lovecraft sci-fi wiki science-fiction horror)
"In December 2010, the Irish government was told by the European Court of Human Rights to deal with exactly this kind of situation, either by making legislative changes or by issuing clear guidelines which acted to remove any and all ambiguities surround the question of when doctors are required to carry out terminations in order to save women’s lives.
To date, it has done nothing, largely, it seems, because Ireland’s anti-abortion lobby, and the Roman Catholic Church (naturally) have spent the last two years or so trying to shout down any notion that an abortion may be necessary to save a woman’s life in any circumstances.
What this sad case proves, definitively, is that they are lying and the real tragedy here is not just that a woman has died because they were lying but that woman has had to die, unnecessarily and in excruciating pain, to prove them wrong." (tags: medicine religion catholicism ireland law abortion)
Pelican is a Python static blog generator which works with Markdown. Looks nice. There's also Calepin.co, which is a service that'll publish your blog if you stick it in your Dropbox. Will I finally leave LJ? Maybe... (tags: markdown software blog python)
The Edge also did a feature on Kahneman a while back. Here it is, with more examples of ways in which our thinking fails, but also things we can do which we're finding difficult to program computers to do. (tags: psychology intuition daniel-kahneman cognition cognitive-bias rationality)
An HTTP library for Python that's less awful than urllib2. Hopefully someone will add it to the standard library at some point. Via Leonard Richardson. (tags: python http library requests programming)
OK, so remixing videos of Pentecostal services is like shooting fish in a barrel, but you've got to love the person who thought of turning it into a 90s video game. (tags: funny pentecostal video youtube charismatic christianity)
"Cardinal Keith O’Brien accused the Foreign Secretary of doubling overseas aid to Pakistan to more than £445 million without demanding religious freedom for Christians and other religious minorities, such as Shia Muslims." I think O'Brien has a point: nobody should be coerced into conversion, and it's clear that Christians need some protection from the Religion Of Peace. (tags: religion politics aid pakistan islam christianity)
C.S. Lewis wrote that "You would not call a man humane for ceasing to set mousetraps if he did so because he believed there were no mice in the house." Wrongbot points out that to behave ethically one must have correct beliefs as well as the right theory of normative ethics. (tags: ethics philosophy rationality morality wrongbot)
"Japan's nuclear powerplants have performed magnificently in the face of a disaster hugely greater than they were designed to withstand, remaining entirely safe throughout and sustaining only minor damage. The unfolding Fukushima story has enormously strengthened the case for advanced nations – including Japan – to build more nuclear powerplants, in the knowledge that no imaginable disaster can result in serious problems." (tags: science nuclear safety physics japan earthquake)
How not to do it: Atheist starts anonymous blog to tell some other outspoken atheists (PZ, Ophelia Benson, and so on) to cool it, or something. Eventually, someone notices that many commenters on the site are the same person. That person makes a flounce post about being "silenced" and makes their blog private. D'oh! (tags: blogging drama atheism internet)
Theo Hobson on events in Leicester, where the new Lord Mayor has appointed a secular chaplain and removed prayers before monthly council meetings. Hobson notes that the C of E is, perhaps wisely, not making much of a fuss about this: "establishment at all levels is more or less indefensible; the more discussed it is, the more obvious this is. The church can only hope that interest dies down." (tags: anglicanism religion leicester secularism)
"This project's goals are to develop the PyMite virtual machine, device drivers, high-level libraries and other tools to run a significant subset of the Python language on microcontrollers without an OS." Nice. (tags: python embedded programming hardware microcontrollers avr)
"This article attempts to fundamentally rethink what constitutes community and society on the web, and what possibilities exist for their maintenance and reconstruction in the face of scale and malicious users." I've mentioned this one before, but I've seen a couple of things about creating good comments recently, so I thought I'd wheel it out again. Warning: contains links to Encyclopedia Dramatica, which is very much not safe for work. (tags: community identity social internet moderation reputation kuro5hin)
One of the reasons I'm not a Christian any more is that I realised the God I was being asked to worship was evil. Jeffrey Amos explains what I mean with great clarity, and also addresses the "ah ha, but how do you know what's evil without God, eh?" argument. (tags: hell god evil christianity religion morality)
Turn anything into a jive (well, anything in 4/4 anyway): "The Swinger is a bit of python code that takes any song and makes it swing. It does this be taking each beat and time-stretching the first half of each beat while time-shrinking the second half. It has quite a magical effect." (tags: music python audio programming software swing jive)
As promised, the link blog stuff is now working. It's pulling links and descriptions from my Delicious bookmarks and posting them to LJ in batches of 10 or more, or when there's stuff to be posted and nothing's been posted for 4 days. Let me know if it becomes annoying.
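The posting rule is simple enough to sketch. This is not the actual script: should_post() and its parameter names are made up for illustration, and posting straight away when nothing has ever been posted is my assumption.

```python
from datetime import datetime, timedelta

BATCH_SIZE = 10                  # post once this many links are waiting
MAX_WAIT = timedelta(days=4)     # or once it has been this long since a post

def should_post(pending_count, last_posted, now):
    """Decide whether to post a batch of links (hypothetical helper)."""
    if pending_count == 0:
        return False             # nothing to post
    if pending_count >= BATCH_SIZE:
        return True              # a full batch is ready
    if last_posted is None:
        return True              # assumption: never posted before, so go ahead
    return now - last_posted >= MAX_WAIT

now = datetime(2012, 1, 10)
print(should_post(3, now - timedelta(days=5), now))   # True
print(should_post(3, now - timedelta(days=1), now))   # False
```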
Here comes the science
It turns out there's a PHP script called Delicious Glue to do this, but that would involve using PHP, so no (gateway drug: next thing you know, you'll be using Perl). It looks like that script also doesn't cope with the brave new world of Unicode terribly well, doesn't tag the LJ post using the tags from Delicious, and doesn't support the elaborate posting scheme described in the previous paragraph. Also, it wasn't invented here.
So I did it in Python. Mark Pilgrim's excellent Universal Feed Parser module does much of the heavy lifting. Posting to LJ using XML-RPC turns out to be surprisingly easy using the built-in xmlrpclib. Most of the faff comes in getting it to persist state between runs of the script, which I'm doing using pickle. Here's the code: you'd need to be a programmer to adapt it for your own use, but if you are, it shouldn't be hard. I'll probably run it daily using cron.
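For a flavour of the state-persisting faff, here is a minimal sketch in Python 3. It is not the script itself: the function names, the shape of the state dictionary and the file name are all assumptions made for illustration.

```python
import os
import pickle
import tempfile

def load_state(path):
    """Load state saved by a previous run, or start afresh (hypothetical)."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        # First run: nothing seen, nothing posted yet.
        return {"seen_links": set(), "last_posted": None}

def save_state(path, state):
    """Write to a temporary file first, so a crash can't leave half a pickle."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)

# Two pretend runs of the script sharing one state file.
with tempfile.TemporaryDirectory() as tmpdir:
    state_file = os.path.join(tmpdir, "linkblog.pickle")
    first_run = load_state(state_file)
    first_run["seen_links"].add("http://example.com/a")
    save_state(state_file, first_run)
    second_run = load_state(state_file)
```

Only unpickle files you wrote yourself: loading a pickle from an untrusted source can run arbitrary code.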