Legacy Documents and TeX Live Docker Images

Over the past few years there have been a number of major changes to the LaTeX kernel that have unfortunately broken code in old packages, classes and documents. The changes are beneficial to new documents and packages, but before the introduction of the new kernel features it was necessary for packages to hack the old kernel internal commands in order to achieve the desired result and, for the most part, the instabilities come from these hacks.

Some of the affected packages have been updated to work with the new kernels, but in some cases the original hack may have been too complicated to untangle or the package author may no longer be available to update the code.

My own packages have been affected, starting with flowfram.sty in 2015, then datatool in 2019 (which had a knock-on effect on mfirstuc.sty, which requires datatool-base.sty, and on glossaries.sty, which relies on mfirstuc), and jmlrbook.cls in 2020.

So what happens if you have a legacy document that compiled without a problem when it was first created but now goes wrong? This may not necessarily mean that an error occurs, but it could silently cause unexpected output.

For example, when I wrote LaTeX for Administrative Work (volume 3 of the Dickimaw LaTeX series) I had TeX Live 2014 installed. Below is an image of page 35 (from the A4 PDF version). The page starts with the page header, there is then a figure (which has floated to the top of the page) and then the page body. All looks fine.

Image of page 35 with correctly typeset content.
Page 35 Built with TeX Live 2014

However, if I rebuild the document with a newer TeX distribution then the page content becomes mangled, as shown below. The figure is now at the top of the page, and the page header has been shunted down so that it overlaps the figure caption. The text body height doesn’t take the figure into account, which causes the text to overflow the bottom of the page.

Image of page 35 with textual content shunted down the page.
Page 35 Built with TeX Live 2021

The hack of adding \def\f@depth{1sp} suggested in bug report #105 works in some cases, but unfortunately in this case it leads to the “Too many unprocessed floats” error. So, until I can find a reliable fix for flowfram.sty, how can I rebuild this document? I do have some old versions of TeX Live installed, but not that far back.

The solution lies with the Docker images provided by the Island of TeX. Docker basically allows you to run an application inside an isolated container. So, instead of hunting for my TeX Live 2014 DVD and installing TL2014, I can fetch the Docker image and build my document inside a container.

If you don’t already have Docker installed, you will first need to install it. Once it’s installed, you can use the docker command line tool. On Unix-like systems, you may need to use sudo. You can view the currently installed Docker images using docker images (or sudo docker images).

I need the TeX Live 2014 image so I have to pull the TL2014-historic release from the Island of TeX:

sudo docker pull registry.gitlab.com/islandoftex/images/texlive:TL2014-historic

Now I need to change to the directory (cd) where my document source is located (change the path as applicable):

cd path/to/directory

If my document source code is in the file called document.tex and I would ordinarily compile it using:

pdflatex document.tex

then when I want to compile it inside a TL2014 Docker image container I would need to do:

sudo docker run -i --rm --name latex -v "$PWD":/usr/src/app -w /usr/src/app registry.gitlab.com/islandoftex/images/texlive:TL2014-historic pdflatex document.tex

In this case, my document source file is called admin-report.tex and the build process is rather complicated as it consists of multiple pdflatex, bibtex, makeglossaries and makeindex invocations. Rather than using docker run for each step, it’s simpler to use an automated process, such as arara. First I need to ensure that I have the appropriate arara directives at the start of admin-report.tex:

% arara: pdflatex
% arara: bibtex
% arara: makeglossaries
% arara: pdflatex
% arara: makeglossaries
% arara: pdflatex
% arara: makeindex: { style: admin-index.ist, options: -c }
% arara: pdflatex
% arara: pdflatex

Now I just need one docker run instance:

sudo docker run -i --rm --name latex -v "$PWD":/usr/src/app -w /usr/src/app registry.gitlab.com/islandoftex/images/texlive:TL2014-historic arara --verbose admin-report.tex

This is rather lengthy to type, so I wrote a simple bash script called dockerbuild:

#!/bin/sh

docker run -i --rm --name latex -v "$PWD":/usr/src/app -w /usr/src/app registry.gitlab.com/islandoftex/images/texlive:TL2014-historic arara --verbose "$@"

So I can now just do:

sudo ./dockerbuild admin-report

Unfortunately using sudo means that I end up with files owned by root, but this can be fixed by adding chown to the bash script (change the username and groupname to your own):

chown username:groupname "$(basename "$1" .tex)".*

There were two further problems. Firstly, the version of flowfram.sty bundled with TeX Live 2014 doesn’t have a couple of options used by the document. This is probably because I added those options while I was working on the book but I uploaded the package to CTAN after the version of TL 2014 captured in the Docker image. I needed to copy the newer v1.17 into my current directory to ensure the document compiled correctly.

Secondly, I can’t input files on my hard drive that are outside of the current working directory from inside the container. Volume 3 cross-references the previous two books in the series using:

\externaldocument[nov-]{../novices/novices-report}
\externaldocument[thesis-]{../thesis/thesis-report}

This picks up the cross-referencing information from the aux files of volumes 1 and 2, but this won’t work inside the Docker container. Instead, I need to copy those files into my current directory. (Note that symbolic links to files outside the current directory won’t work.)

Ideally the best solution is for me to find a way to fix all my affected packages, but this is proving to be non-trivial. The historic TeX Live Docker images at least provide a workaround.

Binary Files, Text Files and File Encodings

The TeX distribution comes with a mixture of binary files and text files. The source code for your document is written in a text file and you need a text editor to create and modify it, but you need to make sure the file (or input) encoding is correct otherwise you can end up with error messages, warnings and strange characters in your PDF file. This can be very confusing to new users without a computer science background who might ask, “what’s the difference between a binary file and a text file, and what does file encoding mean?” It can also confuse people with a computer science background who might blithely inform you that, naturally, a binary file is a file that has binary content and a text file is a file that contains plain text.

So what actually is the difference between a binary file and a text file, and what causes weird symbols to appear and “missing character” warnings?

This isn’t intended to be a lecture on hardware, so I’m going to simplify things somewhat, but digital devices (such as laptops, tablets and smartphones) essentially treat everything as binary data. Binary in this context means one of two states, so you can view the internals of a computer as a series of tiny switches that can either be on or off.

Row of switches: up, down, up, up, down, down, up, up, up.

We could call these two states “on”/“off” or “up”/“down” or “true”/“false” but the most compact form for a human to visualise the two states of a tiny electronic switch is to use the digits 1 (on or up or true) and 0 (off or down or false).

Row of switches with a digit below each one: up (1), down (0), up (1), up (1), down (0), down (0), up (1), up (1), up (1).

Each switch is one bit and a sequence of eight bits is one octet. With 8-bit systems, eight bits is also one byte. (Half a byte, or four bits, is a nybble, but that’s not often used.) Your hard drive (or USB stick etc) is essentially full of bits. The device’s filing system contains an index of where each file starts and ends.

Five rows of twenty 0 or 1 digits with three blocks highlighted and annotated file1, file2 and file3, respectively.

If you delete a file, its entry in the index is removed but the bits remain.

As the previous image except that the second highlighted region and its annotation have been removed.

So each file contains a sequence of bits and the file size is measured in bytes. In other words, all files have binary content.

The file format determines how the binary content should be interpreted. The format is basically a set of rules. If a file is identified as having a particular format but its content doesn’t follow the rules for that format, then the declared format is incorrect or invalid (which is what triggers an “invalid format” error if you try to open it).

Suppose I have an application (called, say, FooBar) that allows me to draw either a rectangle or an ellipse. It’s very restrictive and only has a limited set of options: vertical/horizontal (is the shape’s long axis vertical? true/false), large/small (is the shape large? true/false), filled/open (is the shape filled? true/false), ellipse/rectangle (is the shape elliptical? true/false). Each setting is binary so the options can be compactly written as a series of bits. For example: vertical, small, open, ellipse can be written as 1001. This needs to be zero-padded to make it up to 8 bits (since most digital storage measurements are in bytes): 00001001.

Binary data is difficult for humans to read and write. The longer the sequence of bits, the harder it becomes, so programmers usually convert the value to hexadecimal (base 16) to make it more compact and easier to read. Each nybble (4 bits) can be represented by one hexadecimal digit (0–9, A–F) so one byte (8 bits) can be represented by two hexadecimal digits. Instead of writing 00001001, I can write the equivalent hexadecimal value: 09. (In order to clarify that the value is a hexadecimal not a decimal representation, it’s often prefixed with “0x”: 0x09. In this case, because the number is less than ten, it happens to be the same as decimal 9.)
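As a rough illustration, the conversion from the four settings to binary and hexadecimal can be sketched in Python. FooBar is an imaginary application, so the function name pack_settings and the bit order (taken from the order in which the options are listed above) are assumptions made purely for this example.

```python
def pack_settings(vertical, large, filled, ellipse):
    """Pack the four true/false settings into the low four bits of an integer."""
    bits = 0
    for flag in (vertical, large, filled, ellipse):
        # Shift the existing bits left and append the next flag as 0 or 1.
        bits = (bits << 1) | int(flag)
    return bits


# vertical, small, open, ellipse
settings = pack_settings(vertical=True, large=False, filled=False, ellipse=True)

binary_form = format(settings, "08b")  # zero-padded to eight bits
hex_form = format(settings, "02X")     # two hexadecimal digits

print(binary_form, hex_form)
```

The zero-padding is done by the format specification rather than by the packing function, which mirrors the way the padding is only needed when the value is written out as a whole byte.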

Let’s now suppose that FooBar allows me to specify a colour for the shape as a combination of red, green and blue (RGB). Each of these three colours has a numerical value indicating how much of that colour to add, where 0 indicates none of the colour and the maximum value indicates all of that colour. There are different scales for quantifying a colour, such as a decimal number between 0.0 and 1.0 or a percentage between 0% and 100% or an integer between 0 (0x00) and 255 (0xFF). The last scale is convenient for FooBar because it means that each of the three colour components can be stored in a single byte so the complete colour specification takes up three bytes.

I’d like my vertical, small, open, ellipse to be drawn in a sort of greyish-blue colour. After playing around with the colour selector I’ve found the shade I like: 0x42 (red), 0x6F (green) and 0x6F (blue).

A greyish-blue, vertical, small, open ellipse.

Having created my work of art, I need to go off and do something else, but I’d like to save my ellipse so I can look at it again later. The most compact way of saving the information is in four bytes (the settings, followed by the red, green and blue values): 00001001 01000010 01101111 01101111. I’ve put a space between each group of eight bits here for clarity, but from the computer’s point of view this is just a sequence of 32 bits. From a human point of view the information looks better in hexadecimal: 09 42 6F 6F.

I need to think of a file name but I’m not very good at naming schemes so I’m just going to call it “image1”. The format is FooBar’s native binary format. The rules for this format are: the file must contain exactly four bytes, the first byte has the settings information stored in the last (least significant) four bits, the second byte is the amount of red, the third byte is the amount of green and the fourth byte is the amount of blue.

If I try to open a file in FooBar that only contains, say, three bytes, then this breaks the rules, so FooBar will pop up an “invalid format” error message. What happens if the file has four bytes but the first four bits aren’t 0? Should they simply be ignored or should this trigger an invalid format error? The rules don’t say, so the file format has an ambiguity in it.
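A reader for this imaginary format might look something like the following Python sketch. The names Shape and parse_foobar and the strict parameter are all hypothetical. The strict parameter is simply one way of making the ambiguity explicit: the application author has to choose between the lax and the strict interpretation because the format rules don't.

```python
from typing import NamedTuple


class Shape(NamedTuple):
    vertical: bool
    large: bool
    filled: bool
    ellipse: bool
    red: int
    green: int
    blue: int


def parse_foobar(data, strict=False):
    """Interpret four bytes according to the (imaginary) FooBar rules."""
    if len(data) != 4:
        raise ValueError("invalid format: expected exactly four bytes")
    settings, red, green, blue = data
    # The rules say nothing about the first four bits, so only a strict
    # reader rejects a file in which any of them is set.
    if strict and settings & 0xF0:
        raise ValueError("invalid format: first four bits must be 0")
    return Shape(
        vertical=bool(settings & 0b1000),
        large=bool(settings & 0b0100),
        filled=bool(settings & 0b0010),
        ellipse=bool(settings & 0b0001),
        red=red,
        green=green,
        blue=blue,
    )


image1 = bytes([0x09, 0x42, 0x6F, 0x6F])
print(parse_foobar(image1))
```

With this sketch, a three-byte input always raises an error, whereas a four-byte input with stray bits in the first nybble is only rejected when strict=True.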

An application can only read a file if it has been provided with the rules for the file format.

So if the content of all files is just a sequence of bits, what is a text file? A text file is simply a file that obeys one of the known text file formats or encodings. The most well known text encoding is the American Standard Code for Information Interchange (US-ASCII or, more colloquially, ASCII). The ASCII rules are: each byte must be in the range 0x00 to 0x7F (00000000 to 01111111, note that the most significant bit is always 0) and each byte either represents a control character (an instruction) or a printable character (letter, digit or punctuation). The ASCII table describes what each of the 128 allowed bytes represents.

For example, the byte 0x0A (00001010) is the line feed instruction. This means that whatever application is trying to interpret the data must move down one line. However, there is some ambiguity here as some systems will also move back to the first column (the start of the line) when encountering this line feed instruction but others require a carriage return instruction (0x0D) as well. (For those of you who remember using a typewriter, when you reached the end of a line, you had to hit a lever, which rotated the barrel one line, and also push the carriage across, which brought you back to the start of the line. Both actions were performed simultaneously with a single sweep of the hand. The line feed and carriage return terminology have carried over to the digital world.)

Another control code is 0x09 (00001001) which is the horizontal tab instruction. This means to move to the next tab stop, but it’s up to the application reading the data to define the tab stops. The space character (0x20) can also be considered a control code as it’s an instruction to move on one “space” without actually displaying anything.

The bytes in the range 0x21 to 0x7E are printable characters. Each of these values (or codes) has an associated shape (or glyph) that needs to be displayed. This shape is obtained from the font table, but the ASCII format doesn’t provide any information about what font should be used. That’s again up to the application reading the data.

For example, the byte 0x42 (01000010) represents the upper case Latin B and the byte 0x6F (01101111) represents the lower case Latin o. So if I create a file in a text editor that contains a tab followed by the word “Boo” and save it (as ASCII) then the file will contain four bytes: 00001001 01000010 01101111 01101111 (or 09 42 6F 6F).

These four bytes may look familiar. They are the same four bytes that make up the earlier “image1” file. So this file is both a FooBar binary file and an ASCII text file. It obeys the rules of both formats.
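The double life of these four bytes can be checked with a few lines of Python. This is only a sketch: the FooBar interpretation is imaginary, so unpacking the bytes into settings, red, green and blue just follows the rules described earlier.

```python
data = bytes([0x09, 0x42, 0x6F, 0x6F])

# Interpretation 1: ASCII text (a tab followed by the word "Boo").
text = data.decode("ascii")

# Interpretation 2: FooBar image (settings byte, then red, green, blue).
settings, red, green, blue = data

print(repr(text))
print(format(settings, "04b"), hex(red), hex(green), hex(blue))
```

Nothing in the bytes themselves says which interpretation is the intended one. That information has to come from somewhere else.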

What happens if I replace the tab character with an upper case Latin I 0x49 (01001001)? This still obeys the ASCII format, but is it still a valid FooBar binary file? Remember that the FooBar format doesn’t say anything about the first four bits. If an application chooses to simply ignore the value of those first four bits then the file content will still be interpreted as a greyish-blue, vertical, small, open, ellipse.

Let’s suppose I increase the amount of blue and save the file so that it now contains the four bytes: 00001001 01000010 01101111 11111111 (or 09 42 6F FF). This is a valid FooBar binary file but it’s no longer valid ASCII as ASCII doesn’t allow a 1 in the first (most significant) bit of any of the 8-bit bytes.

ASCII only provides rules for 128 values (0x00 to 0x7F). This is quite a limited set of characters. It doesn’t include, for example, accented characters (such as é) or more aesthetic punctuation such as “smart quotes” or various length dashes — such as the em-dash. What if I want to add a pound sterling symbol (£)? The ASCII format doesn’t allow it, just as the FooBar format doesn’t allow a triangle. A different format is required.

The ISO-8859-1 encoding (or latin1) also has each character represented by an 8-bit byte but the range goes up to 0xFF (11111111). The first 128 values are identical to ASCII, but there are extra characters available (where the first — most significant — bit is 1) including the pound £ symbol (0xA3) and ÿ (0xFF). This means that my modified FooBar binary file with the extra blue (09 42 6F FF) is also a valid ISO-8859-1 text file. If I open the file in a text editor and stipulate the ISO-8859-1 encoding then it will interpret the contents as a tab followed by the characters “Boÿ”.
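As a quick check (a sketch using Python's built-in codecs, with variable names chosen just for this example), the modified file is rejected by the ASCII decoder but accepted by the ISO-8859-1 decoder:

```python
data = bytes([0x09, 0x42, 0x6F, 0xFF])

# ASCII only allows bytes up to 0x7F, so the final byte breaks the rules.
try:
    data.decode("ascii")
    is_ascii = True
except UnicodeDecodeError:
    is_ascii = False

# ISO-8859-1 assigns a character to 0xFF, so decoding succeeds.
latin1_text = data.decode("iso-8859-1")

print(is_ascii, repr(latin1_text))
```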

Although ISO-8859-1 provides some accented characters and some extra punctuation (such as guillemets « and ») there are still many characters that are unavailable. A more comprehensive format (text encoding) is UTF-8, which is a variable-width character encoding. This means that some characters are represented by more than one byte.

Just as ASCII is a subset of ISO-8859-1, ASCII is also a subset of UTF-8, which means that all the bytes from 0x00 to 0x7F in UTF-8 are identical to ASCII so, for example, 01101111 (6F) still represents a lower case Latin o. However, unlike ISO-8859-1, the non-ASCII characters are identified by two or more bytes in UTF-8. For example, the pound £ symbol requires two bytes: 11000010 10100011 (C2 A3).

Let’s suppose I now have a file containing the four bytes: 11000010 10100011 00110001 00110010 (C2 A3 31 32). If I open this in a text editor, identifying the text encoding as UTF-8, then the first two bytes will be interpreted as the single character £, the next byte is the digit 1 and the final byte is the digit 2, so I have three characters in total “£12” and the file is four bytes long. If I instead identify the text encoding as ISO-8859-1 then each byte is a separate character, where the first byte is the upper case Latin A with circumflex (Â), so I now have four characters in total “Â£12”.
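The two readings of these four bytes can be reproduced with the following Python sketch (the variable names are only for illustration):

```python
data = bytes([0xC2, 0xA3, 0x31, 0x32])

# The same bytes decoded under two different sets of rules.
utf8_text = data.decode("utf-8")
latin1_text = data.decode("iso-8859-1")

print(len(data), "bytes")
print(len(utf8_text), "characters as UTF-8:", utf8_text)
print(len(latin1_text), "characters as ISO-8859-1:", latin1_text)
```

The byte count never changes. Only the number of characters does, because UTF-8 combines the first two bytes into one character whereas ISO-8859-1 treats every byte as a character in its own right.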

Is this still a valid FooBar binary file? Yes, it is, provided we are adopting the lax approach of ignoring the first four bits.

Reddish ellipse with long axis horizontal.

Is this a valid ASCII file? No, because it contains bytes outside of the valid range.

Let’s go back to the original “image1” file and reduce the green to 0x08 and save the image as a file called, say, “testfile”. This contains the four bytes: 00001001 01000010 00001000 01101111 (09 42 08 6F). This is a FooBar binary file but is it also a text file? ASCII defines 00001000 (0x08) as the backspace control code, which is an instruction to move back one space. So this is also a valid ASCII file, but let’s see how it looks if we view it in a text editor. My preferred editor is vim:

Image of vim with a black background showing file contents: a space 8 characters wide followed by the upper case letter B (in white), and then (in blue) the caret symbol followed by the upper case letter H, and then the lower case letter o (in white).

This shows a space eight characters wide (which is the result of the tab 0x09), the upper case letter B (0x42), but this is followed by a sequence in cyan consisting of ^H. It’s in cyan to highlight the fact that it’s not the two characters ^ and H but is a control code with the value 0x08 (H is the eighth letter of the alphabet). This is caret notation and is used to denote control codes. This is followed by the final character (lower case o).
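The relationship between a control code and its caret notation is a simple one: the letter is the character whose code is 0x40 higher (equivalently, the code with bit 0x40 flipped). The Python sketch below assumes ASCII input, and the function name caret_notation is made up for the example. Unlike vim's default display, it shows the tab in caret notation as well (^I) rather than as white space.

```python
def caret_notation(data):
    """Return a printable form of ASCII bytes, using ^X for control codes."""
    pieces = []
    for byte in data:
        if byte < 0x20 or byte == 0x7F:
            # Flipping bit 0x40 maps 0x08 to 0x48 (H) and 0x7F to 0x3F (?).
            pieces.append("^" + chr(byte ^ 0x40))
        else:
            pieces.append(chr(byte))
    return "".join(pieces)


testfile = bytes([0x09, 0x42, 0x08, 0x6F])
print(caret_notation(testfile))
```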

Not all text editors use caret notation. Here’s how this file looks in gedit:

Image of the gedit (black text on white background): there is a space 8 characters wide (the tab), followed by the upper case letter B, followed by a rectangle containing the digits 0008, followed by a lower case letter o.

In this case the control code is shown using a rectangle with the control code’s hexadecimal value inside it (in this case padded to four digits 0008).

If, on the other hand, I display the file contents using cat in a bash terminal then the result is just a space eight characters wide followed by the lower case o. This is because cat writes the bytes out unchanged and the terminal obeys the control code instructions. It first moves the cursor to the next tab stop (which creates the initial space), then it prints the letter B, then it moves the cursor back one space, then it prints the letter o, which overwrites the B.

The purpose of a text editor is to create and edit files. If I type Tab Shift+B Backspace o in the text editor then it will interpret the backspace as an instruction to remove the previous character from the buffer. If I then save the file, it will only contain 00001001 01101111 (the tab character and the lower case Latin o). It won’t contain the unwanted B and the backspace character. Therefore, if I open a file in a text editor that contains a control code, such as backspace, the editor will assume that I want a visual representation of the character and won’t interpret it as an instruction.

Although this file is valid ASCII, it would normally be considered just a binary file not a text file because it looks weird if you open it in a text editor.

An application may be able to read a text file (that is, it knows the file format rules), but that doesn’t mean that it will follow the actions assigned to control codes (such as backspace), and there is no guarantee that the font the application is using has an associated glyph for a particular printable character.

UTF-8 also has control characters, such as the zero width joiner, which consists of three bytes (0xE2 0x80 0x8D), and the “variation selector-16” character, which also consists of three bytes (0xEF 0xB8 0x8F). These are used to apply attributes to emoji. For example, the superhero character 🦸 consists of four bytes (0xF0 0x9F 0xA6 0xB8) and the female sign ♀ consists of three bytes (0xE2 0x99 0x80). The sequence of thirteen bytes 0xF0 0x9F 0xA6 0xB8 0xE2 0x80 0x8D 0xE2 0x99 0x80 0xEF 0xB8 0x8F (superhero, zero width joiner, female sign, variation selector-16) identifies the female superhero emoji 🦸‍♀️. However, some applications may not have the function required to implement this and may end up displaying the superhero and female symbols: 🦸♀ (if the font being used has a corresponding glyph for those characters).
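The thirteen bytes can be split back into their four characters with a short Python sketch. The character names come from Python's unicodedata module, so they depend on the Unicode version bundled with the Python installation. The fallback string "<unnamed>" is only there in case an older version doesn't know the superhero character.

```python
import unicodedata

sequence = bytes([
    0xF0, 0x9F, 0xA6, 0xB8,  # superhero
    0xE2, 0x80, 0x8D,        # zero width joiner
    0xE2, 0x99, 0x80,        # female sign
    0xEF, 0xB8, 0x8F,        # variation selector-16
])

text = sequence.decode("utf-8")
code_points = [ord(character) for character in text]

for character in text:
    encoded = character.encode("utf-8")
    name = unicodedata.name(character, "<unnamed>")
    print("U+%04X" % ord(character), len(encoded), "bytes", name)
```

Whether the decoded string is then displayed as a single female superhero or as two separate symbols is up to the application and font, not the decoder.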

If the font doesn’t have a glyph for a particular character then a “not defined character” glyph may be used instead. This could be a rectangle with the hexadecimal value inside (as with the gedit example above) or it could simply be an open rectangle ▯ or a rectangle containing a question mark. (If a byte is invalid — that is, it’s not valid for the given text format — then the replacement character � is typically used.)

So, just because a file has a valid text format, it doesn’t necessarily mean that an application that is ordinarily able to read text files won’t encounter some difficulty with certain characters in that file.

What happens if I try to input my original “image1” file into a LaTeX document:

\documentclass{article}
\begin{document}
\input{image1}
\end{document}

The \input{} command expects the file identified in the braces to be a LaTeX file. This means that it expects the file to be a text file that contains LaTeX markup. So the contents of “image1” won’t be interpreted as a FooBar image but will be interpreted as the characters Tab B o o. LaTeX doesn’t interpret the Tab control code as a tabulation instruction but instead treats it as a space. It also ignores any spaces at the start of a line (which allows you to indent your source code to make it easier to read without introducing spurious spaces). The result is a PDF file with the word “Boo”.

Now let’s replace \input{image1} with \input{testfile} (the file shown in vim and gedit above) and try compiling (building) the document with pdflatex. This triggers the following error:

! Package inputenc Error: Unicode character ^^H (U+0008)
(inputenc)                not set up for use with LaTeX.

See the inputenc package documentation for explanation.
Type  H <return>  for immediate help.
 ...                                              
                                                  
l.1 	B^^H
         o
? 

This is complaining about the third byte in “testfile”, which LaTeX is interpreting as the Unicode character U+0008 (the backspace character). LaTeX has no instructions regarding this character as it’s not a character that would ordinarily be found in LaTeX source code. LaTeX knows what this character is (U+0008), but it doesn’t know what to do with it. If I type h Return at the prompt (the question mark at the bottom) then I get the following message:

You may provide a definition with
\DeclareUnicodeCharacter

This is telling me that if I want to use this character then I need to declare it and provide instructions as to how LaTeX should deal with it. If I press Return at this point and let LaTeX carry on processing then it will ignore the backspace character, so the resulting PDF will simply display “Bo”.

In neither of the above cases do I get a PDF with an image of a small ellipse because \input expects the file to contain (La)TeX instructions and will parse it as such.

You may have noticed that the error message above mentions the inputenc package even though the document hasn’t loaded it. In that example, I was using the TeX Live 2021 distribution. If I use an older distribution, say, TeX Live 2016 then I get a different error message and a different help message:

l.1 	B^^H
         o
? h
A funny symbol that I can't read has just been input.
Continue, and I'll forget that it ever happened.

? 

In both cases the backspace character is ignored and the result is the same.

Now let’s try the four-byte file that can be interpreted as the UTF-8 characters “£12”: 11000010 10100011 00110001 00110010 (C2 A3 31 32). With pdflatex from TeX Live 2016, there’s no error message but the log file contains the following warnings:

Missing character: There is no Â in font cmr10!
Missing character: There is no £ in font cmr10!

So this is interpreting the first two bytes of the file as two separate characters, Â (0xC2) and £ (0xA3), but there’s no glyph available for either of these characters in the default font (cmr10). So the PDF just contains “12”. With TeX Live 2021, there are no errors or warnings and the PDF contains “£12”.

Donald Knuth first released TeX in 1978, and Leslie Lamport released LaTeX in 1985. ISO-8859-1 was first published in 1987 (the ECMA-94 standard on which it is based dates from 1985), and UTF-8 was designed in 1992. (ASCII was first published in 1963.) So it’s not surprising that the original versions of TeX and LaTeX were designed for single byte text encodings.

The great advantage of UTF-8 is that it covers all Unicode characters (as opposed to ISO-8859-1, which is limited to 256 characters, and ASCII, which is limited to 128 characters). It’s natural that users who wanted to be able to type extended Latin or non-Latin characters into their LaTeX document source code were keen to adopt UTF-8. This is awkward for (La)TeX, which treats each byte as a separate token. The inputenc package (with the utf8 setting) provides a workaround: it makes the first byte of a multi-byte sequence an active character which takes the subsequent byte as its argument. This can be demonstrated by the following UTF-8 document:

\documentclass{article}
\usepackage[utf8]{inputenc}
\begin{document}
\show £
\end{document}

This produces the following message in the transcript:

> �=macro:
->\UTFviii@two@octets �.
l.4 \show �
           �

The \show command displays the definition of the token that follows it. In this case, the command is followed by two tokens: the two bytes (0xC2 and 0xA3) that indicate the £ symbol. So \show picks up the first token (0xC2) and shows the definition. This token (0xC2) is written to the transcript, but the transcript is being viewed in a bash terminal that’s expecting UTF-8 content. The 0xC2 byte isn’t followed by a legitimate byte (as defined by the UTF-8 format) so it’s flagged with the replacement symbol � to denote that it’s invalid. If I view the transcript in vim with the binary mode on, I can see the value of the bytes.

Image of the above transcript message shown in vim in binary mode: the invalid bytes C2 and A3 are shown as <c2> and <a3> (in cyan).

This shows that the 0xC2 byte (octet) has been defined as a macro (command) that expands to \UTFviii@two@octets followed by the byte 0xC2. This internal command is defined to take two arguments: the first is provided (0xC2, the first byte in the two-byte pair) and the second is the token that follows (the second byte in the two-byte pair).
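The reason the first byte can be made responsible for collecting the rest is that, in UTF-8, the value of the first byte determines how many bytes the character occupies. The Python sketch below only illustrates those byte ranges. It is not how inputenc is implemented, and the function name expected_length is hypothetical.

```python
def expected_length(lead):
    """Return the number of bytes in a UTF-8 character that starts with lead."""
    if lead <= 0x7F:
        return 1  # ASCII: a single byte
    if 0xC2 <= lead <= 0xDF:
        return 2  # two octets, as with 0xC2 0xA3
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    # Anything else (including continuation bytes such as 0xA3)
    # can't start a character.
    raise ValueError("invalid UTF-8 first byte 0x%02X" % lead)


print(expected_length(0xC2))
```

A byte such as 0xA3 is only valid as a continuation of an earlier byte, which is why it is reported as invalid once it has become detached from its 0xC2.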

If I press Return to continue processing the document I encounter an error because the second byte (0xA3) of the two-byte pair has become detached. The transcript in the bash terminal at this point is:

! Package inputenc Error: Invalid UTF-8 byte "A3.

See the inputenc package documentation for explanation.
Type  H <return>  for immediate help.
 ...                                              
                                                  
l.4 \show £

If I look at the log file in vim again, but this time without the binary mode on, then vim determines that the file can’t be a UTF-8 file (because it breaks the UTF-8 rules) so it decides that it must be an ISO-8859-1 file:

> Â=macro:
->\UTFviii@two@octets Â.
l.4 \show Â
           £
?

! Package inputenc Error: Invalid UTF-8 byte "A3.

See the inputenc package documentation for explanation.
Type  H <return>  for immediate help.
 ...

l.4 \show £

Changes to the LaTeX kernel in the past few years mean that UTF-8 is now the default encoding for LaTeX document source files, but this trick is still employed. If you want the multi-byte UTF-8 characters to be treated as a single token then you need to switch to a modern TeX engine (XeLaTeX or LuaLaTeX) that natively supports UTF-8.

So how do you tell what format (or encoding) a file is in? That is unfortunately quite difficult. You can parse the file and find out if it breaks a rule (is invalid) to determine if it’s not a particular format. For example, if the file contains a byte larger than 0x7F then it’s definitely not ASCII, or if the file contains a byte such as 0xC2 that isn’t followed by a byte (or bytes) that results in a valid UTF-8 character then the file isn’t UTF-8. However, as illustrated by the “image1” file, just because the file contents satisfy the rules of one format doesn’t mean that it doesn’t also coincidentally satisfy the rules of another format.
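This process of elimination can be sketched in Python by attempting to decode the bytes with each candidate encoding and discarding the ones that fail. The function name candidate_encodings and the choice of three encodings are assumptions for this example. Note that Python's ISO-8859-1 decoder accepts every possible byte value, so it can never be ruled out this way.

```python
def candidate_encodings(data, encodings=("ascii", "utf-8", "iso-8859-1")):
    """Return the encodings whose rules the bytes don't break."""
    candidates = []
    for name in encodings:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # rule broken: definitely not this encoding
        candidates.append(name)
    return candidates


print(candidate_encodings(bytes([0x09, 0x42, 0x6F, 0x6F])))
print(candidate_encodings(bytes([0xC2, 0xA3, 0x31, 0x32])))
print(candidate_encodings(bytes([0x09, 0x42, 0x6F, 0xFF])))
```

The result is a list of possibilities rather than an answer: the check can only say what a file isn't.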

One general rule of thumb is that if the file contains a certain proportion of bytes that represent control characters (such as the earlier backspace example) then it’s likely to be a binary file. However, just because a file only contains bytes that represent printable characters doesn’t mean that the file isn’t a binary file.
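A crude version of this rule of thumb might look like the following Python sketch. The function name looks_binary and the 10% threshold are arbitrary assumptions, as is the decision to exempt tab, line feed and carriage return, since they routinely occur in text files.

```python
def looks_binary(data, threshold=0.1):
    """Guess whether the data is binary from its share of control codes."""
    if not data:
        return False
    # Control codes that are common in ordinary text files.
    allowed = {0x09, 0x0A, 0x0D}
    control = sum(
        1 for byte in data
        if (byte < 0x20 and byte not in allowed) or byte == 0x7F
    )
    return control / len(data) > threshold


print(looks_binary(bytes([0x09, 0x42, 0x08, 0x6F])))  # "testfile"
print(looks_binary(bytes([0x09, 0x42, 0x6F, 0x6F])))  # "image1"
```

The second call shows the weakness of the heuristic: “image1” passes as text even though it was written as a FooBar binary file.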

Some formats use a “magic marker”: a sequence of bytes at the start of the file that identifies the format. For example, a PDF 1.3 file will start with the bytes 0x25 0x50 0x44 0x46 0x2D 0x31 0x2E 0x33 (which represents the ASCII characters “%PDF-1.3”). However, there’s nothing to stop me from starting a LaTeX file with the lines:

%PDF-1.3
\pdfmajorversion=1
\pdfminorversion=3

In this case, the first line is a comment to remind the author (or anyone else reading the source code) that the next two lines are stipulating that the resulting PDF file must be version 1.3, but this is enough to confuse some applications into thinking that this LaTeX source file is a PDF file.
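A naive magic marker test, of the kind that would be fooled by this, can be sketched in Python as follows. The function name looks_like_pdf is made up, and real applications may well perform more thorough checks.

```python
PDF_MAGIC = bytes([0x25, 0x50, 0x44, 0x46, 0x2D])  # the ASCII characters %PDF-


def looks_like_pdf(data):
    """Naively identify a PDF file by its first five bytes."""
    return data.startswith(PDF_MAGIC)


latex_source = b"%PDF-1.3\n\\pdfmajorversion=1\n\\pdfminorversion=3\n"

print(looks_like_pdf(latex_source))
```

The test only looks at the start of the data, so it has no way of knowing that the rest of the content is LaTeX source code.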

The byte order mark (BOM) is another form of magic marker that’s used to indicate the byte-endianness of a UTF-16 or UTF-32 file. For example, a big endian byte order UTF-16 file will start with the bytes 0xFE 0xFF. UTF-8, on the other hand, only has one byte order so, although the BOM character is defined in UTF-8, there’s no point in using it to indicate the byte order.

Despite this, the BOM character is sometimes used at the start of a file to simply indicate that the file is UTF-8, but this can be problematic. If a text editor automatically inserts it at the start of every file that it saves, then it forces the file to be UTF-8 even if there are no non-ASCII characters in the content. This makes it less compatible with other applications, particularly when the file is a script for a language that has its own magic marker (such as a bash file that must start with #!).
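The following Python sketch looks for a BOM using the constants provided by the standard codecs module. The function name and the sample script are made up for illustration. The UTF-32 marks are tested first because the little endian UTF-32 BOM begins with the same two bytes as the little endian UTF-16 BOM.

```python
import codecs

# Order matters: check the four-byte marks before the two-byte ones.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32 big endian"),
    (codecs.BOM_UTF32_LE, "UTF-32 little endian"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16 big endian"),
    (codecs.BOM_UTF16_LE, "UTF-16 little endian"),
]


def detect_bom(data):
    """Return a description of the BOM at the start of the data, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(detect_bom(b"\xfe\xff\x00A"))  # UTF-16 big endian

# A hypothetical bash script saved by an editor that always adds a BOM.
script = codecs.BOM_UTF8 + b"#!/bin/bash\necho hello\n"
print(detect_bom(script))            # UTF-8
print(script.startswith(b"#!"))      # False: the #! is no longer at the start
```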

Another way of indicating the file format is to incorporate the information in the file name. This is typically done in the form of a suffix that starts with a dot (the file extension). For example, “image.jpg” indicates a JPEG file and “image.png” indicates a PNG file. There can be multiple extensions. For example, the file “myDoc.synctex.gz” has the extension “.gz” which means that it uses the gzip compression format. If I uncompress it (gunzip myDoc.synctex.gz) then I will have the file “myDoc.synctex”, which is in the synctex format.
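As a small illustration, Python's pathlib module can pick apart the extensions of the synctex example. This only inspects the name. It doesn't touch (or uncompress) any file.

```python
from pathlib import PurePath

name = PurePath("myDoc.synctex.gz")

print(name.suffix)           # .gz
print(name.suffixes)         # ['.synctex', '.gz']
print(name.with_suffix(""))  # myDoc.synctex
```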

Some filing systems hide the extension when showing the list of files. This can be particularly annoying for LaTeX users because a directory (folder) can become full of files with the same basename but different extensions. For example, if I create a document called, say, “myDoc.tex” then using pdflatex will, at the very least, create the files “myDoc.log”, “myDoc.aux” and “myDoc.pdf”. Once a table of contents, list of figures, list of tables, bibliography, index, glossaries etc are added, the file list increases significantly and it can be hard to tell which is which if the extensions are hidden.

More generally, hiding the file extensions can have serious security implications. An executable file called, say, “notes.txt.exe” will be displayed as “notes.txt” which gives the impression that it’s just a text file, but if you double-click on it, expecting to open it in a text editor, the file will instead be executed. This is one way in which users can be tricked into running a malicious executable file.

Unfortunately there’s nothing to stop anyone from renaming the file so that it has a different extension. For example, if I rename a PNG file from “image.png” to “image.jpg” then this doesn’t alter the file content — it’s still a PNG file — but it misrepresents the file, making it look like a JPEG file when it’s actually a PNG file. This can confuse an application that tries to determine the file type from the file name extension, and it will try to read the file using the wrong set of rules.
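A more cautious application could compare the extension with the start of the file content. Below is a minimal Python sketch that only knows about the PNG and JPEG signatures. The function name is hypothetical, and the sample data is just the PNG signature followed by some filler rather than a real image.

```python
from pathlib import PurePath

# Byte sequences found at the start of PNG and JPEG files.
SIGNATURES = {
    ".png": b"\x89PNG\r\n\x1a\n",
    ".jpg": b"\xff\xd8\xff",
}


def extension_matches_content(filename, data):
    """True or False for known extensions, None if the extension is unknown."""
    expected = SIGNATURES.get(PurePath(filename).suffix.lower())
    if expected is None:
        return None
    return data.startswith(expected)


png_data = b"\x89PNG\r\n\x1a\n" + b"(rest of the image data)"

print(extension_matches_content("image.png", png_data))  # True
print(extension_matches_content("image.jpg", png_data))  # False: renamed PNG
```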

Another way of identifying the file format is with the MIME type but, as with file extensions, the MIME type can be incorrect (either through accident or deliberately).
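One common source of a wrong MIME type is that it's often guessed from the file name. As a sketch, Python's standard mimetypes module does exactly that, so it will happily report a JPEG type for anything named "image.jpg" regardless of the content. (The helper function name is my own.)

```python
import mimetypes

# A fresh MimeTypes object uses Python's built-in extension table.
table = mimetypes.MimeTypes()


def guess_from_name(filename):
    """Guess the MIME type from the file name alone (content is never read)."""
    mime_type, _encoding = table.guess_type(filename)
    return mime_type


print(guess_from_name("image.png"))  # image/png
print(guess_from_name("image.jpg"))  # image/jpeg, even for a renamed PNG
```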

Returning to my FooBar “image1” file (which doesn’t have an extension), if I forget about it and stumble on it months later, the chances are that I will have forgotten what it was and what I used to create it. My first step will be to try to identify it with the file tool. This returns “image1: ASCII text, with no line terminators” so my next step will be to open it in a text editor, where I will find the message: Tab Boo. Therefore the developer of the FooBar application really needs to modify the file format so that it includes a magic marker at the start and also decide on a file extension to help identify what type of file it is.

Can I include my FooBar ellipse in my LaTeX document? Not in its FooBar binary format. I would first need to convert it to a graphics format that \includegraphics recognises.

So in summary:

  • All files contain binary data.
  • The file format is the algorithm or set of rules needed to understand the data contained in the file.
  • The term “text file” is used to indicate a file that is written in one of the standard text file formats (such as ASCII, ISO-8859-1 or UTF-8) that is intended to be readable in a text editor (that is, it doesn’t have a high proportion of non-whitespace control codes).
  • The term “binary file” is used to indicate a file that is not a text file.
  • The file (or input) encoding of a text file is the particular text format used to store the textual data in the file.
  • An application can only properly parse or process a file if it recognises and understands the file’s format.
  • If the format is mis-identified then this can either cause outright failure (“invalid format”) or incorrect instructions (such as placing an unwanted Â before a £).
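Both outcomes in the final point can be reproduced with a pound sign in Python. This is just a sketch: the pound sign is saved as UTF-8 and then read back using the wrong encoding.

```python
pound = "£"
utf8_bytes = pound.encode("utf-8")
print(utf8_bytes)  # b'\xc2\xa3'

# Incorrect instructions: reading the UTF-8 bytes as ISO-8859-1 (Latin-1)
# turns the 0xC2 byte into a separate character.
misread = utf8_bytes.decode("latin-1")
print(misread)  # Â£

# Outright failure: the same bytes are invalid if the file is assumed
# to be ASCII.
try:
    utf8_bytes.decode("ascii")
except UnicodeDecodeError:
    print("invalid format")
```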

TeX Live and Fedora

I’ve been using TeX Live on Fedora for years, but today I encountered an odd error when trying to perform the usual sudo tlmgr update package. I tried an Internet search of the error message but it didn’t provide any helpful clues. I finally worked out what had happened and, since it’s possible someone else might stumble on the same thing, I thought it might be useful to post about it in case it helps others.

First a little background information to supply some context. I normally use dnf to install or update software on Fedora, but not when it comes to TeX because I have found in the past that the Linux distros tend to have outdated TeX packages. Instead, I install TeX Live from the DVD (which I automatically receive as a joint member of UK TUG and TUG) as I have an iffy broadband connection, and I also have to update the TeX Live distributions for other family members. It’s easy to slap the DVD in the drive and set the installer going regardless of whether the computer has Linux or Windows. On my own device, I keep the TeX Live installations from the previous couple of years as it’s useful to be able to switch to an older version when trying to investigate a bug that has appeared with a new TeX Live release. (I have a symbolic link /usr/local/texlive/default that points to the release I want to use. All I need to do is change the link to switch to a different release.)

I don’t like automatic updates (it can be confusing if an update occurs without my noticing and causes an unexpected conflict) so I just periodically run sudo tlmgr update --all but today this resulted in an unexpected error. (The message suggests it’s a warning but the process fails.)

*** WARNING ***: Performing this action will likely destroy the Fedora TeXLive install on your system.
*** WARNING ***: This is almost NEVER what you want to do.
*** WARNING ***: Try using dnf install/update instead.
*** WARNING ***: If performing this action is really what you want to do, pass the "ignore-warning" option.
*** WARNING ***: But please do not file any bugs with the OS Vendor.

As is often the case, the problem is obvious in hindsight but it flummoxed me for a while. Why was the TeX Live manager suddenly telling me to use dnf when I’d installed it from the DVD? It had worked fine the last time (not that long ago), so what had changed since then? A few days ago I’d upgraded to Fedora 31.

It turned out that I now have an extra TeX Live installation that I didn’t know about in /usr/share/texlive/ with its own tlmgr in /bin (which is a symbolic link to /usr/bin). To add to the confusion, my normal user PATH has /usr/local/texlive/default/bin/x86_64-linux near the start of the list, but the /etc/sudoers file had it at the end:

Defaults    secure_path = /sbin:/bin:/usr/sbin:/usr/bin:/usr/local/texlive/default/bin/x86_64-linux

This means that when I use TeX as a normal user it’s picking up the installation from the DVD, but with sudo it’s picking up the other installation, which requires dnf rather than tlmgr to update TeX packages. The question is, how did that other TeX Live installation suddenly appear?
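The effect of the ordering can be simulated with a short Python sketch. It doesn't touch the real file system: the set of installed commands is a stand-in for what was on my device, and the user PATH shown is an assumed simplification (the real one has more entries).

```python
def first_match(path, command, installed):
    """Return the first PATH entry holding the command, as a lookup would."""
    for directory in path.split(":"):
        candidate = f"{directory}/{command}"
        if candidate in installed:
            return candidate
    return None


DVD_BIN = "/usr/local/texlive/default/bin/x86_64-linux"

# Stand-in for the file system: the DVD install plus the Fedora one.
installed = {f"{DVD_BIN}/tlmgr", "/bin/tlmgr", "/usr/bin/tlmgr"}

user_path = f"{DVD_BIN}:/usr/bin:/bin"  # assumed, simplified user PATH
secure_path = f"/sbin:/bin:/usr/sbin:/usr/bin:{DVD_BIN}"

print(first_match(user_path, "tlmgr", installed))    # the DVD installation
print(first_match(secure_path, "tlmgr", installed))  # /bin/tlmgr (Fedora's)
```

Moving the TeX Live directory to the start of the secure path makes the second lookup return the DVD installation as well.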

Some years ago I installed texlive-dummy to satisfy dependencies in the event that I had to install software that required a TeX distribution. As far as I can tell, that texlive-dummy RPM no longer exists. My guess is that when I upgraded to Fedora 31, the upgrade process detected the TeX Live dependency, but texlive-dummy had disappeared, so it installed the complete TeX Live distribution instead. For now, I’ve simply edited the /etc/sudoers file so that /usr/local/texlive/default/bin/x86_64-linux is listed first in the path.

Localisation

[Previously posted on Goodreads 2018-07-26.] The chances are that you’re reading this in a web browser. Perhaps it has a menu bar along the top with words like ‘Bookmarks’ or ‘History’, or perhaps it has a hamburger style menu that appears when you click on a button with three horizontal lines. However you interact with an application, the instructions are provided in words or pictures (or a combination). Commonly known icons, such as a floppy disk or printer, are easy to understand for those familiar with computers, but more complex actions, prompts, and warning or error messages need to be written in words.

For example, if you want to check your email, there might be a message that says ‘1 unread email(s)’ or ‘2 unread email(s)’. If the software is sophisticated, it might be able to say ‘1 unread email’ or ‘2 unread emails’. Naturally, you’ll want this kind of information to be in a language you can understand. Another user may be using the same application in, say, France or Germany, in which case they’ll probably want the messages in French or German.

An application that supports localisation is one that is designed to allow such textual information to be displayed in different languages, and (where necessary) to format certain elements, such as dates or currency, according to a particular region. This support is typically provided in a file that contains a list of all possible messages, each identified by a unique key. Adding a new language is simply a matter of finding someone who can translate those messages and creating a new file with the appropriate name.
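Here is a minimal sketch of the idea in Python, using dictionaries in place of the language files. The keys, the function name and the simple one/other plural rule are all assumptions for illustration. That plural rule suits English and German, but other languages may need more forms.

```python
# Each language maps the same unique keys to its own translations.
MESSAGES = {
    "en": {
        "unread.one": "{n} unread email",
        "unread.other": "{n} unread emails",
    },
    "de": {
        "unread.one": "{n} ungelesene E-Mail",
        "unread.other": "{n} ungelesene E-Mails",
    },
}


def unread_message(language, count):
    """Look up the message by key, falling back to English."""
    table = MESSAGES.get(language, MESSAGES["en"])
    key = "unread.one" if count == 1 else "unread.other"
    return table[key].format(n=count)


print(unread_message("en", 1))  # 1 unread email
print(unread_message("en", 2))  # 2 unread emails
print(unread_message("de", 2))  # 2 ungelesene E-Mails
```

Adding another language only requires a new table with the same keys. The code that displays the message doesn't change.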

The recommended way of identifying a particular language or region is with an ISO code. The ISO 639-1 two-letter code is the most commonly used code to identify root languages, such as ‘en’ for English, ‘fr’ for French and ‘de’ for German. (Languages can also be identified by three-letter codes or numeric codes.) The language code can be combined with an ISO 3166 country code. For example, ‘en-GB’ indicates British English (so a printer dialogue box might ask if you want the ‘colour’ setting), ‘en-US’ indicates US English (‘color’) and ‘fr-CA’ indicates Canadian French (‘couleur’).
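The combined codes can be split back into their language and region parts, which is handy when there's no translation for a specific region. The Python below is only a sketch: the lookup table and the fallback from ‘fr-CA’ to plain ‘fr’ are my own assumptions about how an application might behave, not a description of any particular one.

```python
# Hypothetical table of translations for a printer dialogue label.
LABELS = {"en-GB": "colour", "en-US": "color", "fr": "couleur"}


def split_tag(tag):
    """Split a tag such as 'en-GB' into its language and region parts."""
    language, _, region = tag.partition("-")
    return language.lower(), region.upper()


def lookup(tag, table):
    """Try the language-region pair first, then the root language."""
    language, region = split_tag(tag)
    for candidate in (f"{language}-{region}", language):
        if candidate in table:
            return table[candidate]
    return None


print(split_tag("en-GB"))       # ('en', 'GB')
print(lookup("en-US", LABELS))  # color
print(lookup("fr-CA", LABELS))  # couleur (falls back to 'fr')
```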

On Friday 20th July 2018, Paulo Cereda presented the newly released version 4.0 of his arara tool at the TeX User Group (TUG) 2018 conference in Rio de Janeiro. For those of you who have read my LaTeX books, I mentioned arara in Using LaTeX to Write a PhD Thesis and provided further information in LaTeX for Administrative Work. This very useful tool for automating document builds has localisation support for English, German, Italian, Dutch, Brazilian Portuguese, and — Broad Norfolk.

Wait! What was that?

Broad Norfolk is the dialect spoken in the county of Norfolk in East Anglia. There’s a video of Paulo’s talk available. If you find it a bit too technical but are interested in the language support, skip to around time-frame 18:50. Below are some screenshots of arara in action. (It’s a command line application, so there’s no fancy point and click graphical interface.)

Here’s arara reporting a successful job (converting the file test.tex to test.pdf) with the language set to Broad Norfolk:

Image of arara output (reproduced below).

For those who can’t see the image, the transcript is as follows:

Hold yew hard, ole partner, I'm gornta hev a look at 'test.tex'
(thass 693 bytes big, that is, and that was last chearnged on
07/26/2018 12:09:08 in case yew dunt remember).

(PDFLaTeX) PDFLaTeX engine ..... THASS A MASTERLY JOB, MY BEWTY
(Bib2Gls) The Bib2Gls sof....... THASS A MASTERLY JOB, MY BEWTY
(PDFLaTeX) PDFLaTeX engine ..... THASS A MASTERLY JOB, MY BEWTY

Wuh that took 1.14 seconds but if thass a slight longer than you
expected, dunt yew go mobbing me abowt it cors that ent my fault.
My grandf'ar dint have none of these pearks. He had to use a pen
and a bit o' pearper, but thass bin nice mardling wi' yew. Dew
yew keep a troshin'!

For comparison, the default English setting produces:

Image of arara output (reproduced below).

For those who can’t see the image, the transcript is as follows:

Processing 'test.tex' (size: 693 bytes, last modified: 07/26/2018
12:09:08), please wait.

(PDFLaTeX) PDFLaTeX engine .............................. SUCCESS
(Bib2Gls) The Bib2Gls software .......................... SUCCESS
(PDFLaTeX) PDFLaTeX engine .............................. SUCCESS

Total: 1.18 seconds

For a bit of variety, I then introduced an error that causes the second task (Bib2Gls) to fail. Here’s the Broad Norfolk response:

Image of arara output (reproduced below).

For those who can’t see the image, the transcript is as follows:

Hold yew hard, ole partner, I'm gornta hev a look at 'test.tex'
(thass 694 bytes big, that is, and that was last chearnged on
07/26/2018 12:23:42 in case yew dunt remember).

(PDFLaTeX) PDFLaTeX engine ..... THASS A MASTERLY JOB, MY BEWTY
(Bib2Gls) The Bib2Gls sof....... THAT ENT GORN RIGHT, OLE PARTNER

Wuh that took 0.91 seconds but if thass a slight longer than you
expected, dunt yew go mobbing me abowt it cors that ent my fault.
My grandf'ar dint have none of these pearks. He had to use a pen
and a bit o' pearper, but thass bin nice mardling wi' yew. Dew
yew keep a troshin'!

For comparison, the default English setting produces:

Image of arara output (reproduced below).

For those who can’t see the image, the transcript is as follows:

Processing 'test.tex' (size: 694 bytes, last modified: 07/26/2018
12:23:42), please wait.

(PDFLaTeX) PDFLaTeX engine .............................. SUCCESS
(Bib2Gls) The Bib2Gls software .......................... FAILURE

Total: 0.91 seconds

Here’s the help message in Broad Norfolk:

Image of arara output (reproduced below).

For those who can’t see the image, the transcript is as follows:

arara 4.0 (revision 1)
Copyright (c) 2012-2018, Paulo Roberto Massa Cereda
Orl them rights are reserved, ole partner

usage: arara [file [--dry-run] [--log] [--verbose | --silent] [--timeout
 N] [--max-loops N] [--language L] [ --preamble P ] [--header]
 | --help | --version]
 -h,--help          wuh, cor blast me, my bewty, but that'll tell
                    me to dew jist what I'm dewun rite now
 -H,--header        wuh, my bewty, that'll only peek at directives
                    what are in the file header
 -l,--log           that'll make a log file wi' orl my know dew
                    suffin go wrong
 -L,--language      that'll tell me what language to mardle in
 -m,--max-loops     wuh, yew dunt want me to run on forever, dew
                    you, so use this to say when you want me to
                    stop
 -n,--dry-run       that'll look like I'm dewun suffin, but I ent
 -p,--preamble      dew yew git hold o' that preamble from the
                    configuration file
 -s,--silent        that'll make them system commands clam up and
                    not run on about what's dewin
 -t,--timeout       wuh, yew dunt want them system commands to run
                    on forever dew suffin' go wrong, dew you, so
                    use this to set the execution timeout (thass in
                    milliseconds)
 -V,--version       dew yew use this dew you want my know abowt
                    this version
 -v,--verbose       thass dew you want ter system commands to hav'
                    a mardle wi'yew an'orl

For comparison, the default English setting produces:

Image of arara output (reproduced below).

For those who can’t see the image, the transcript is as follows:

arara 4.0 (revision 1)
Copyright (c) 2012-2018, Paulo Roberto Massa Cereda
All rights reserved

usage: arara [file [--dry-run] [--log] [--verbose | --silent] [--timeout
 N] [--max-loops N] [--language L] [ --preamble P ] [--header]
 | --help | --version]
 -h,--help          print the help message
 -H,--header        extract directives only in the file header
 -l,--log           generate a log output
 -L,--language      set the application language
 -m,--max-loops     set the maximum number of loops
 -n,--dry-run       go through all the motions of running a
                    command, but with no actual calls
 -p,--preamble      set the file preamble based on the
                    configuration file
 -s,--silent        hide the command output
 -t,--timeout       set the execution timeout (in milliseconds)
 -V,--version       print the application version
 -v,--verbose       print the command output

In case you’re wondering why Broad Norfolk was included, Paulo originally asked me if I could add a slang version of English as an Easter egg, but I decided to take advantage of this request and introduce Broad Norfolk to the international TeX community as it’s been sadly misrepresented in film and television, much to the annoyance of those who speak it. As far as we know, it’s the only application that includes Broad Norfolk localisation support. (If you know of any other, please say!)

Having decided to add Broad Norfolk, we needed to consider what code to use. The ISO 3166-1 set includes a sub-set of user-assigned codes provided for non-standard territories for in-house application use. These codes are AA, QM to QZ, XA to XZ, and ZZ. I chose ‘QN’ and decided it’s an abbreviation for Queen’s Norfolk, as the Queen has a home in Norfolk.
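The user-assigned ranges listed above are easy to check programmatically. This Python sketch (with a made-up function name) simply encodes those ranges.

```python
def is_user_assigned(code):
    """True if the two-letter code is AA, QM to QZ, XA to XZ, or ZZ."""
    code = code.upper()
    if len(code) != 2 or not code.isascii() or not code.isalpha():
        return False
    if code in ("AA", "ZZ"):
        return True
    if code[0] == "Q":
        return "M" <= code[1] <= "Z"
    return code[0] == "X"


print(is_user_assigned("QN"))  # True
print(is_user_assigned("GB"))  # False
```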

Turbot the Witch

[Previously posted on Goodreads 2018-04-29.] I had an interesting encounter with a couple of children as I was heading back into the village after walking around the muddy footpaths and byways around the area. (This is not only setting the scenic background detail, but also noting that I might’ve had a slightly dishevelled and windswept appearance as a result.) In general, I find it a bit awkward when unknown children want to strike up a conversation as on the one hand I don’t want to encourage them to talk to strangers, but on the other hand I don’t want to appear rude, so when they called out a friendly greeting, I gave a friendly acknowledgement without breaking my stride, but the girl called me back.

‘Hello, whoever you are. Who are you?’ she asked.

‘I live in the village,’ I replied, non-committally. Since she seemed to require more detail, I added: ‘My son used to go to the village school.’

‘Is he So-and-so?’ she asked.

(I don’t think I ought to disclose names in a public post, so let’s just stick with So-and-so.)

‘No,’ I said. ‘My son’s grown up and has left school now.’

‘Are you So-and-so’s granny?’

‘No.’

‘So-and-so’s granny is 68.’

‘I’m not that old,’ I said. ‘I’m not even 50.’

‘Are you 49?’ the boy asked.

I could see that this was going to lead to a guessing game, and he was only two off, so I decided to just cut straight in there with the answer.

‘No, I’m 47.’

‘I hope you don’t mind me saying this,’ the boy said, in a very polite tone of voice, ‘but you look much older.’

‘Are you a witch?’ the girl asked.

‘No,’ I said, ‘but if I was a witch, I might not admit it.’

I’m not sure if they grasped the sub-text there: people aren’t always what they claim to be (or not be).

‘Do you know So-and-so?’ the girl asked, reverting the subject back to whoever he is, but apparently he’s a boy in their school.

‘No, I don’t know So-and-so, and I think you should be careful about talking to strangers.’

‘Are you a stranger? What’s your name?’

‘I have two names,’ I replied. ‘My real name is Nicola Cawley, but my writing name is Talbot.’

‘Turbot?’

‘No, Talbot.’

Clearly, they haven’t yet heard of a local village author of children’s stories that are charmingly illustrated by a talented artist from nearby Poringland.

‘If you’re a witch,’ the girl said, ‘you could turn me into a dog.’

‘Witches don’t exist,’ the boy said.

‘Well, either I’m not a witch or I don’t exist,’ I replied.

All those years studying mathematics haven’t been wasted. I can still apply logical reasoning in a conversation with kids. As I finally walked away, a voice called after me:

‘Goodbye, Whatever-your-name-is Turbot.’

Sir Quackalot

So now I feel that Turbot the Witch has to appear in a story. Perhaps she should join Sir Quackalot, Dickie Duck, José Arara and friends. Sir Quackalot, for those of you who don’t know, started life in the TeX.SE chatroom in a little story containing TeX-related jokes to amuse my friend Paulo who likes ducks and is the creator of an application called arara, which means macaw in Portuguese. The story was called ‘Sir Quackalot and the Golden Arara.’ The image of Sir Quackalot on the left is created using the tikzducks package. The code is:

\documentclass{article}

\usepackage[T1]{fontenc}
\usepackage{tikzducks}

\begin{document}
\begin{tikzpicture}
\begin{scope}[rotate=-15,shift={(-0.5,0.2)}]
\draw[fill=black!40] 
 (1,0.5) -- (0.2,0.5) -- (0, 0.55) -- (0.2,0.6) -- (1, 0.6) -- cycle;
\end{scope}
\duck[cape=darkgray,shorthair=darkgray]
\begin{scope}[rotate=-20,shift={(.25,0.25)}]
\draw[fill=black!50] 
 (0,1) .. controls (0.05, 0.57) and (0.23, 0.23) .. (0.5, 0)
 .. controls (0.77, 0.23) and (0.95, 0.57) .. (1, 1)
 .. controls (0.83, 0.9) and (0.67, 0.9) .. (0.5, 1)
 .. controls (0.33, 0.9) and (0.07, 0.9) .. cycle;
\node[orange,at={(0.5,0.5)}] {\bfseries\large Q};
\end{scope}
\end{tikzpicture}
\end{document}

Sir Quackalot next made an appearance in LaTeX for Administrative Work as the author of titles such as ‘The Adventures of Duck and Goose’, ‘The Return of Duck and Goose’ and ‘More Fun with Duck and Goose’ in one of the sample datasets that accompanies the textbook. The more adventurous reader can, in Exercise 12 (Chapter 4), try to programmatically fetch the titles from the database to typeset an invoice for José Arara’s book order.

The sample data also includes a list of people, such as Dickie Duck, Polly Parrot, Mabel Canary and (to test UTF-8 support) José Arara of São Paulo. At various times in the textbook, they are customers (as in the above invoice exercise), letter recipients (Chapter 3, typesetting correspondence), job applicants (Chapter 5, typesetting a CV), and members of the Secret Lab of Experimental Stuff (and their co-researchers in the Department of Stripy Confectioners) who have to write memos, press releases, and minutes. They also have to redact classified information, use hierarchical numbering in their terms and conditions, prepare presentations, a z-fold leaflet advertising their highly classified projects, and collaborate on documents.

Dickie Duck also moonlights as the author of ‘Oh No! The Chickens have Escaped!’ illustrated by José Arara, whose paintings bear an uncanny resemblance to digitally manipulated photos of my mum’s chickens. In Chapter 10, they have to create a postcard and design an advance information sheet to advertise the book.

Sir Quackalot reappears in my testidx package, which is designed for testing indexing applications with LaTeX. My original plan was to use dummy text, but I’ve grown bored of lorem ipsum and I wanted the first few paragraphs to be informative. I also needed the index to cover the full Basic Latin letter groups A, …, Z as well as some extended Latin characters commonly used in European languages, such as Ð (eth), Þ (thorn) and Ø. After five pages of filler text, I discovered that some of the letter groups were still missing, so I added the story of ‘Sir Quackalot and the Golden Arara’, which provided an extra page of text and conveniently helped with the rather sparse Q letter group. The code to produce the document is quite simple:

\documentclass{article}

\usepackage{imakeidx}
\usepackage{testidx}
\makeindex

\begin{document}
\testidx
\printindex
\end{document}

For those who don’t have a TeX distribution, here’s a PDF I made earlier. That example only has the Basic Latin groups. There’s a fancier example with hyperlinks, extended letter groups, digraphs (IJ, Ll, etc) and a trigraph (Dzs): source code and the final PDF created from it (using XeLaTeX and bib2gls).

Turbot the Witch

So if you read my textbooks or manuals, watch out for a cameo from Turbot the Witch. What does she look like? I think tikzducks can supply the answer again:

\documentclass{article}

\usepackage{tikzducks}

\begin{document}
\begin{tikzpicture}
\duck[witch=black!70,longhair=brown!60!gray,jacket=black!70,magicwand]
\end{tikzpicture}