Encoding in Python 2.x

One of the annoying things where I always forget the specifics. So here it is…

Reading a file line-by-line in python and writing it to another file is easy:

input_file = open("input.txt")
outputFile = open("output.txt", "w")
for line in input_file:
   outputFile.write(line + "\n")

But whenever encodings are involved, everything gets complicated. What is the encoding of line? Actually, not really anything, without specification, line is just a Byte-String (type 'str') with no encoding.

Because this is usually not what we want, the first step is to convert the read line to Unicode. This is done with the method decode. The method has two parameters. The fist is the encoding that should be used. This is a predefined String value which you can guess it for the more common encodings (or look it up in the documentation). If left out, ASCII is assumed. The second parameter defines how to handle unknown byte patterns. The value 'strict' will abort with UnicodeDecodeError, 'ignore' will leave the character out of the Unicode result, and 'replace' will replace every unknown pattern with U+FFFD. Let’s assume our input file and therefor the line we read from there is in Latin-1. We convert this to Unicode with:

lineUnicode = line.decode('latin-1','strict')

or equivalently

lineUnicode = unicode(line, encoding='latin-1', errors='strict')

After decoding, we have sometihng of type 'unicode'. If we try to print this and it is not simple English, it will probably give an error (UnicodeEncodeError: 'ascii' codec can't encode characters in position 63-66: ordinal not in range(128)). This is because Python will try to convert the characters to ASCII, which is not possible for characters that are not ASCII. So, to print out a Unicode text, we have to convert it to some encoding. Let’s say we want UTF-8 (there is no reason not to use UTF-8, so this is what you should always want):

lineUtf8 = lineUnicode.encode('utf-8')
print(lineUtf8)

Here again, there is a second parameter which defines how to handle characters that cannot be represented (which shouldn’t happen too often with UTF-8). Happy coding!

Further reading:
Unicode HOWTO in the Python documentation, Overcoming frustration: Correctly using unicode in python2 from the Python Kitchen project, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky (not specific to Python, but gives a good background).

This entry was posted in Programming and tagged , , , by swk. Bookmark the permalink.

About swk

I am a computational linguist, teacher of computer science and above all a huge fan of LaTeX. I use LaTeX for everything, including things you never wanted to do with LaTeX. My latest love is lilypond, aka LaTeX for music. I'll post at irregular intervals about cool stuff, stupid hacks and annoying settings I want to remember for the future.