Find file names with invalid encoding on Linux

I have files copied from Windows computers in ancient times. The filenames contain special characters and they have been messed up somewhere along the way. For example I got a file named 9.5.2 Modelo de aceptaci??n (espa??ol).doc in the folder 9 Garant??a del Estado.

First, I want to find and list these files. Stackexchange tells us how to do that:

LC_ALL=C find . -name '*[! -~]*'

This will find all names that have non-ASCII letters, not only those that are broken. But in my case I have folders where ALL of the names are broken, so I don’t mind.

Second, I want to fix the names. I did it manually, but for future reference, if I ever were to do anything like that again, I might use one of the solutions proposed in this thread on serverfault.com.

Fun with newlines

Use a typewriter lately? No? Well, who cares… except when you encounter stupidities left over from the early days of computing where people were still used to typewriters. Because typewriters had two ways of going to a new line, ASCII knows two ways of representing the newline:

  • LF (line feed, German Zeilenvorschub), represented as Unicode code point 0x0A, ASCII 00001100 and escape character \n
  • CR (carriage return, German Wagenrücklauf), represented as Unicode code point 0x0D, ASCII 00001101 and escape character \r

ASCII was the first-ever invented encoding for representing text in bits. It’s from the 1960s and at the time someone probably thought it is a good idea to have two characters for the concept of a new line. We’d think "who cares about stuff from the 1960s", it’s 2017, right? But unfortunately many later encodings base themselves on ASCII, most notably those from the Unicode family, e.g., the widely used UTF-8. So – thank you, 1960s! /sarcasm

Two characters for a new line would not be too bad if they were used consistently, but that is where the fun begins. Of course they are not! Differnt operating systems use different conventions to mark the end of a line:

  • Linux and Mac OSX use LF
  • Windows uses CR LF
  • (and to make the chaos complete, Mac OS from before version X uses CR)

So have fun reading "plain text" files! /sarcasm

Encoding in Python 2.x

One of the annoying things where I always forget the specifics. So here it is…

Reading a file line-by-line in python and writing it to another file is easy:

input_file = open("input.txt")
outputFile = open("output.txt", "w")
for line in input_file:
   outputFile.write(line + "\n")

But whenever encodings are involved, everything gets complicated. What is the encoding of line? Actually, not really anything, without specification, line is just a Byte-String (type 'str') with no encoding.

Because this is usually not what we want, the first step is to convert the read line to Unicode. This is done with the method decode. The method has two parameters. The fist is the encoding that should be used. This is a predefined String value which you can guess it for the more common encodings (or look it up in the documentation). If left out, ASCII is assumed. The second parameter defines how to handle unknown byte patterns. The value 'strict' will abort with UnicodeDecodeError, 'ignore' will leave the character out of the Unicode result, and 'replace' will replace every unknown pattern with U+FFFD. Let’s assume our input file and therefor the line we read from there is in Latin-1. We convert this to Unicode with:

lineUnicode = line.decode('latin-1','strict')

or equivalently

lineUnicode = unicode(line, encoding='latin-1', errors='strict')

After decoding, we have sometihng of type 'unicode'. If we try to print this and it is not simple English, it will probably give an error (UnicodeEncodeError: 'ascii' codec can't encode characters in position 63-66: ordinal not in range(128)). This is because Python will try to convert the characters to ASCII, which is not possible for characters that are not ASCII. So, to print out a Unicode text, we have to convert it to some encoding. Let’s say we want UTF-8 (there is no reason not to use UTF-8, so this is what you should always want):

lineUtf8 = lineUnicode.encode('utf-8')
print(lineUtf8)

Here again, there is a second parameter which defines how to handle characters that cannot be represented (which shouldn’t happen too often with UTF-8). Happy coding!

Further reading:
Unicode HOWTO in the Python documentation, Overcoming frustration: Correctly using unicode in python2 from the Python Kitchen project, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky (not specific to Python, but gives a good background).

Encoding question mark in TikZ

I’m trying to draw the question mark that is sometimes displayed when there are encoding issues: �

T-shirt Sch... Encoding

This is my solution:

\tikz[baseline=(wi.base)]{
\node[fill=black,rotate=45,inner sep=.1ex,text height=1.8ex,text width=1.8ex] {};
\node[font=\color{white}] (wi) {?};
}