# Find file names with invalid encoding on Linux

I have files copied from Windows computers in ancient times. The filenames contain special characters and they have been messed up somewhere along the way. For example I got a file named 9.5.2 Modelo de aceptaci??n (espa??ol).doc in the folder 9 Garant??a del Estado.

First, I want to find and list these files. Stackexchange tells us how to do that:

LC_ALL=C find . -name '*[! -~]*'


This will find all names that have non-ASCII letters, not only those that are broken. But in my case I have folders where ALL of the names are broken, so I don’t mind.

Second, I want to fix the names. I did it manually, but for future reference, if I ever were to do anything like that again, I might use one of the solutions proposed in this thread on serverfault.com.

# Fun with newlines

Use a typewriter lately? No? Well, who cares… except when you encounter stupidities left over from the early days of computing where people were still used to typewriters. Because typewriters had two ways of going to a new line, ASCII knows two ways of representing the newline:

• LF (line feed, German Zeilenvorschub), represented as Unicode code point 0x0A, ASCII 00001100 and escape character \n
• CR (carriage return, German Wagenrücklauf), represented as Unicode code point 0x0D, ASCII 00001101 and escape character \r

ASCII was the first-ever invented encoding for representing text in bits. It’s from the 1960s and at the time someone probably thought it is a good idea to have two characters for the concept of a new line. We’d think "who cares about stuff from the 1960s", it’s 2017, right? But unfortunately many later encodings base themselves on ASCII, most notably those from the Unicode family, e.g., the widely used UTF-8. So – thank you, 1960s! /sarcasm

Two characters for a new line would not be too bad if they were used consistently, but that is where the fun begins. Of course they are not! Differnt operating systems use different conventions to mark the end of a line:

• Linux and Mac OSX use LF
• Windows uses CR LF
• (and to make the chaos complete, Mac OS from before version X uses CR)

So have fun reading "plain text" files! /sarcasm

# Encoding in Python 2.x

One of the annoying things where I always forget the specifics. So here it is…

Reading a file line-by-line in python and writing it to another file is easy:

input_file = open("input.txt")
outputFile = open("output.txt", "w")
for line in input_file:
outputFile.write(line + "\n")


But whenever encodings are involved, everything gets complicated. What is the encoding of line? Actually, not really anything, without specification, line is just a Byte-String (type 'str') with no encoding.

Because this is usually not what we want, the first step is to convert the read line to Unicode. This is done with the method decode. The method has two parameters. The fist is the encoding that should be used. This is a predefined String value which you can guess it for the more common encodings (or look it up in the documentation). If left out, ASCII is assumed. The second parameter defines how to handle unknown byte patterns. The value 'strict' will abort with UnicodeDecodeError, 'ignore' will leave the character out of the Unicode result, and 'replace' will replace every unknown pattern with U+FFFD. Let’s assume our input file and therefor the line we read from there is in Latin-1. We convert this to Unicode with:

lineUnicode = line.decode('latin-1','strict')


or equivalently

lineUnicode = unicode(line, encoding='latin-1', errors='strict')


After decoding, we have sometihng of type 'unicode'. If we try to print this and it is not simple English, it will probably give an error (UnicodeEncodeError: 'ascii' codec can't encode characters in position 63-66: ordinal not in range(128)). This is because Python will try to convert the characters to ASCII, which is not possible for characters that are not ASCII. So, to print out a Unicode text, we have to convert it to some encoding. Let’s say we want UTF-8 (there is no reason not to use UTF-8, so this is what you should always want):

lineUtf8 = lineUnicode.encode('utf-8')
print(lineUtf8)


Here again, there is a second parameter which defines how to handle characters that cannot be represented (which shouldn’t happen too often with UTF-8). Happy coding!

Unicode HOWTO in the Python documentation, Overcoming frustration: Correctly using unicode in python2 from the Python Kitchen project, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky (not specific to Python, but gives a good background).

# Using babel with Hebrew in texlive

Many articles say that to use Hebrew with LaTeX, you should use xelatex instead of Tex Live (which is default on Ubuntu). It IS possible to write Hebrew using Tex Live, here is how. Working minimal example:

\documentclass{article}
\usepackage[utf8x]{inputenc}
\usepackage[hebrew,english]{babel}

\begin{document}
test
\sethebrew
\end{document}


On my Ubuntu 12.10 installation this fails with this error (even though I have the packages ‘culmus’ and ‘texlive-lang-hebrew’):

kpathsea: Running mktextfm jerus10
mktextfm: Running mf-nowin -progname=mf \mode:=ljfour; mag:=1; nonstopmode; input jerus10
This is METAFONT, Version 2.718281 (TeX Live 2012/Debian)
kpathsea: Running mktexmf jerus10
! I can't find file jerus10'.


Solution:

1. Get the ‘jerus10.mf’ file from the hebtex LaTeX package (available on CTAN). Don’t install the package, it is deprecated (as in REALLY old).
2. Put the ‘jerus10.mf’ file in the folder ~/texmf/fonts/source/hebrew/
3. In terminal, run the command ‘texhash’. If it says ‘done’ without error, everything is fine.

The above file should work. It’s not particularily pretty and there are some compatibility issues with some packages that I might address another time. There are two possible error sources if you didn’t copy/paste carefully enough:

1. Enable UTF8X input

! Package inputenc Error: Keyboard character used is undefined
(inputenc)                in inputencoding 8859-8'.


or

! Package inputenc Error: Unicode char \u8:×© not set up for use with LaTeX.


If you want to write unicode Hebrew or copy-paste Hebrew from somewhere, you need to define the input encoding [utf8x], not only [utf8]:

\usepackage[utf8x]{inputenc}


2. Be sure to change the language to Hebrew

! LaTeX Error: Command \hebshin unavailable in encoding OT1.


Fix this by declaring that the following text is Hebrew with

\sethebrew