Legacy Documents and TeX Live Docker Images

Over the past few years there have been a number of major changes to the LaTeX kernel that have unfortunately broken code in old packages, classes and documents. The changes are beneficial to new documents and packages, but before the introduction of the new kernel features it was necessary for packages to hack the old kernel internal commands in order to achieve the desired result and, for the most part, the instabilities come from these hacks.

Some of the affected packages have been updated to work with the new kernels, but in some cases the original hack may have been too complicated to untangle or the package author may no longer be available to update the code.

My own packages have been affected, starting with flowfram.sty in 2015, then datatool in 2019 (which had a knock-on problem for mfirstuc.sty, which requires datatool-base.sty, and for glossaries.sty, which relies on mfirstuc), and jmlrbook.cls in 2020.

So what happens if you have a legacy document that compiled without a problem when it was first created but now goes wrong? This doesn't necessarily mean that an error occurs; the document may instead silently produce unexpected output.

For example, when I wrote LaTeX for Administrative Work (volume 3 of the Dickimaw LaTeX series) I had TeX Live 2014 installed. Below is an image of page 35 (from the A4 PDF version). The page starts with the page header, followed by a figure (which has floated to the top of the page) and then the page body. All looks fine.

Image of page 35 with correctly typeset content.
Page 35 Built with TeX Live 2014

However, if I rebuild the document with a newer TeX distribution then the page content becomes mangled, as shown below. The figure is now at the very top of the page, and the page header has been shunted down so that it overlaps the figure caption. The text body height doesn't take the figure into account, which causes the text to overflow the bottom of the page.

Image of page 35 with textual content shunted down the page.
Page 35 Built with TeX Live 2021

The hack of adding \def\f@depth{1sp} suggested in bug report #105 works in some cases, but unfortunately in this case it leads to the “Too many unprocessed floats” error. So, until I can find a reliable fix for flowfram.sty, how can I rebuild this document? I do have some old versions of TeX Live installed, but not that far back.

The solution lies with the Docker images provided by the Island of TeX. Docker basically allows you to run an application inside an isolated container. So, instead of hunting for my TeX Live 2014 DVD and installing TL2014, I can fetch the Docker image and build my document inside a container.

If you don’t already have Docker installed, you will first need to install it. Once it’s installed, you can use the docker command line tool. On Unix-like systems, you may need to use sudo. You can view the currently installed Docker images using docker images (or sudo docker images).

I need the TeX Live 2014 image so I have to pull the TL2014-historic release from the Island of TeX:

sudo docker pull registry.gitlab.com/islandoftex/images/texlive:TL2014-historic

Now I need to change to the directory (cd) where my document source is located (change the path as applicable):

cd path/to/directory

If my document source code is in a file called document.tex and I would ordinarily compile it using:

pdflatex document.tex

then when I want to compile it inside a TL2014 Docker image container I would need to do:

sudo docker run -i --rm --name latex -v "$PWD":/usr/src/app -w /usr/src/app registry.gitlab.com/islandoftex/images/texlive:TL2014-historic pdflatex document.tex
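
For reference, here's what each part of that command does (these are standard Docker options):

# -i                       keep STDIN attached to the container
# --rm                     delete the container once the command finishes
# --name latex             assign the container a name
# -v "$PWD":/usr/src/app   mount the current directory as /usr/src/app inside the container
# -w /usr/src/app          use that mounted directory as the container's working directory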

In this case, my document source file is called admin-report.tex and the build process is rather complicated as it consists of multiple pdflatex, bibtex, makeglossaries and makeindex invocations. Rather than using docker run for each step, it’s simpler to use an automated process, such as arara. First I need to ensure that I have the appropriate arara directives at the start of admin-report.tex:

% arara: pdflatex
% arara: bibtex
% arara: makeglossaries
% arara: pdflatex
% arara: makeglossaries
% arara: pdflatex
% arara: makeindex: { style: admin-index.ist, options: -c }
% arara: pdflatex
% arara: pdflatex

Now I just need one docker run instance:

sudo docker run -i --rm --name latex -v "$PWD":/usr/src/app -w /usr/src/app registry.gitlab.com/islandoftex/images/texlive:TL2014-historic arara --verbose admin-report.tex

This is rather lengthy to type, so I wrote a simple bash script called dockerbuild:

#!/bin/bash

# Run arara on the given file(s) inside the TL2014 container.
docker run -i --rm --name latex -v "$PWD":/usr/src/app -w /usr/src/app registry.gitlab.com/islandoftex/images/texlive:TL2014-historic arara --verbose "$@"

So I can now just do:

sudo ./dockerbuild admin-report

Unfortunately using sudo means that I end up with files owned by root, but this can be fixed by adding a chown line to the bash script (change username:groupname to your own user and group):

chown username:groupname "$(basename "$1" .tex)".*
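
Putting it all together, the whole dockerbuild script then looks something like this (a sketch; username:groupname and the image tag are specific to my setup):

#!/bin/bash
# Build the given document inside the TL2014 container using arara.
docker run -i --rm --name latex -v "$PWD":/usr/src/app -w /usr/src/app registry.gitlab.com/islandoftex/images/texlive:TL2014-historic arara --verbose "$@"
# Running under sudo leaves the output files owned by root, so hand them back.
chown username:groupname "$(basename "$1" .tex)".*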

There were two further problems. Firstly, the version of flowfram.sty bundled with TeX Live 2014 doesn't have a couple of options used by the document. This is probably because I added those options while I was working on the book but didn't upload the package to CTAN until after the TeX Live 2014 snapshot captured in the Docker image was taken. I needed to copy the newer v1.17 into my current directory to ensure the document compiled correctly.

Secondly, from inside the container I can't input files that are outside the current working directory (the only directory mounted in the container). Volume 3 cross-references the previous two books in the series using:

\externaldocument[nov-]{../novices/novices-report}
\externaldocument[thesis-]{../thesis/thesis-report}

This picks up the cross-referencing information from the aux files of volumes 1 and 2, but this won't work inside the Docker container. Instead, I need to copy those files into my current directory. (Note that symbolic links to files outside the current directory won't work either, since the link target isn't mounted in the container.)

Ideally, I'd find a way to fix all my affected packages, but this is proving to be non-trivial. The historic TeX Live Docker images at least provide a workaround in the meantime.

Binary Files, Text Files and File Encodings

The TeX distribution comes with a mixture of binary files and text files. The source code for your document is written in a text file and you need a text editor to create and modify it, but you need to make sure the file (or input) encoding is correct otherwise you can end up with error messages, warnings and strange characters in your PDF file. This can be very confusing to new users without a computer science background who might ask, “what’s the difference between a binary file and a text file, and what does file encoding mean?” It can also confuse people with a computer science background who might blithely inform you that, naturally, a binary file is a file that has binary content and a text file is a file that contains plain text.

So what actually is the difference between a binary file and a text file, and what causes weird symbols to appear and “missing character” warnings?

This isn’t intended to be a lecture on hardware, so I’m going to simplify things somewhat, but digital devices (such as laptops, tablets and smartphones) essentially treat everything as binary data. Binary in this context means one of two states, so you can view the internals of a computer as a series of tiny switches that can either be on or off.

Row of switches: up, down, up, up, down, down, up, up, up.

We could call these two states “on”/“off” or “up”/“down” or “true”/“false” but the most compact form for a human to visualise the two states of a tiny electronic switch is to use the digits 1 (on or up or true) and 0 (off or down or false).

Row of switches with a digit below each one: up (1), down (0), up (1), up (1), down (0), down (0), up (1), up (1), up (1).

Each switch is one bit and a sequence of eight bits is one octet. With 8-bit systems, eight bits is also one byte. (Half a byte, or four bits, is a nybble, but that’s not often used.) Your hard drive (or USB stick etc) is essentially full of bits. The device’s filing system contains an index of where each file starts and ends.

Five rows of twenty 0 or 1 digits with three blocks highlighted and annotated file1, file2 and file3, respectively.

If you delete a file the index is removed but the bits remain.

As the previous image except that the second highlighted region and its annotation have been removed.

So each file contains a sequence of bits and the file size is measured in bytes. In other words, all files have binary content.

The file format determines how the binary content should be interpreted. The format is basically a set of rules. If a file is identified as having a particular format but its content doesn’t follow the rules for that format, then the declared format is incorrect or invalid (which is what triggers an “invalid format” error if you try to open it).

Suppose I have an application (called, say, FooBar) that allows me to draw either a rectangle or an ellipse. It’s very restrictive and only has a limited set of options: vertical/horizontal (is the shape’s long axis vertical? true/false), large/small (is the shape large? true/false), filled/open (is the shape filled? true/false), ellipse/rectangle (is the shape elliptical? true/false). Each setting is binary so the options can be compactly written as a series of bits. For example: vertical, small, open, ellipse can be written as 1001. This needs to be zero-padded to make it up to 8 bits (since most digital storage measurements are in bytes): 00001001.

Binary data is difficult for humans to read and write. The longer the sequence of bits, the harder it becomes, so programmers usually convert the value to hexadecimal (base 16) to make it more compact and easier to read. Each nybble (4 bits) can be represented by one hexadecimal digit (0–9, A–F) so one byte (8 bits) can be represented by two hexadecimal digits. Instead of writing 00001001, I can write the equivalent hexadecimal value: 09. (In order to clarify that the value is a hexadecimal not a decimal representation, it’s often prefixed with “0x”: 0x09. In this case, because the number is less than ten, it happens to be the same as decimal 9.)
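
If you want to experiment with these conversions, bash can perform them for you (a quick sketch using bash's base notation, where 2#... denotes a binary value):

printf '%d\n' "$((2#00001001))"     # binary to decimal: prints 9
printf '0x%02X\n' "$((2#00001001))" # binary to hexadecimal: prints 0x09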

Let's now suppose that FooBar allows me to specify a colour for the shape as a combination of red, green and blue (RGB). Each of these three colours has a numerical value indicating how much of that colour to add, where 0 indicates none of the colour and the maximum value indicates all of it. There are different scales for quantifying a colour, such as a decimal number between 0.0 and 1.0 or a percentage between 0% and 100% or an integer between 0 (0x00) and 255 (0xFF). The last scale is convenient for FooBar because it means that each of the three colour components can be stored in a single byte, so the complete colour specification takes up three bytes.

I’d like my vertical, small, open, ellipse to be drawn in a sort of greyish-blue colour. After playing around with the colour selector I’ve found the shade I like: 0x42 (red), 0x6F (green) and 0x6F (blue).

A greyish-blue, vertical, small, open ellipse.

Having created my work of art, I need to go off and do something else, but I’d like to save my ellipse so I can look at it again later. The most compact way of saving the information is in four bytes (the settings, followed by the red, green and blue values): 00001001 01000010 01101111 01101111. I’ve put a space between each group of eight bits here for clarity, but from the computer’s point of view this is just a sequence of 32 bits. From a human point of view the information looks better in hexadecimal: 09 42 6F 6F.

I need to think of a file name but I’m not very good at naming schemes so I’m just going to call it “image1”. The format is FooBar’s native binary format. The rules for this format are: the file must contain exactly four bytes, the first byte has the settings information stored in the last (least significant) four bits, the second byte is the amount of red, the third byte is the amount of green and the fourth byte is the amount of blue.
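
FooBar is, of course, imaginary, but the bytes are real. On a Unix-like system you can create and inspect this four-byte file yourself (a sketch using printf and the xxd hex viewer, which ships with vim):

printf '\x09\x42\x6f\x6f' > image1   # write the four bytes 09 42 6F 6F
xxd image1                           # view them in hexadecimal: 0942 6f6f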

If I try to open a file in FooBar that only contains, say, three bytes, then this breaks the rules, so FooBar will pop up an “invalid format” error message. What happens if the file has four bytes but the first four bits aren't 0? Should they simply be ignored or should this trigger an invalid format error? The rules don't say, so the file format has an ambiguity in it.

An application can only read a file if it has been provided with the rules for the file format.

So if the content of all files is just a sequence of bits, what is a text file? A text file is simply a file that obeys one of the known text file formats or encodings. The best known text encoding is the American Standard Code for Information Interchange (US-ASCII or, more colloquially, ASCII). The ASCII rules are: each byte must be in the range 0x00 to 0x7F (00000000 to 01111111, note that the most significant bit is always 0) and each byte either represents a control character (an instruction) or a printable character (letter, digit or punctuation). The ASCII table describes what each of the 128 allowed bytes represents.

For example, the byte 0x0A (00001010) is the line feed instruction. This means that whatever application is trying to interpret the data must move down one line. However, there is some ambiguity here as some systems will also move back to the first column (the start of the line) when encountering this line feed instruction but others require a carriage return instruction (0x0D) as well. (For those of you who remember using a typewriter, when you reached the end of a line, you had to hit a lever, which rotated the barrel one line, and also push the carriage across, which brought you back to the start of the line. Both actions were performed simultaneously with a single sweep of the hand. The line feed and carriage return terminology has carried over to the digital world.)

Another control code is 0x09 (00001001) which is the horizontal tab instruction. This means to move to the next tab stop, but it’s up to the application reading the data to define the tab stops. The space character (0x20) can also be considered a control code as it’s an instruction to move on one “space” without actually displaying anything.

The bytes in the range 0x21 to 0x7E are printable characters. Each of these values (or codes) has an associated shape (or glyph) that needs to be displayed. This shape is obtained from the font table, but the ASCII format doesn’t provide any information about what font should be used. That’s again up to the application reading the data.

For example, the byte 0x42 (01000010) represents the upper case Latin B and the byte 0x6F (01101111) represents the lower case Latin o. So if I create a file in a text editor that contains a tab followed by the word “Boo” and save it (as ASCII) then the file will contain four bytes: 00001001 01000010 01101111 01101111 (or 09 42 6F 6F).

These four bytes may look familiar. They are the same four bytes that make up the earlier “image1” file. So this file is both a FooBar binary file and an ASCII text file. It obeys the rules of both formats.

What happens if I replace the tab character with an upper case Latin I 0x49 (01001001)? This still obeys the ASCII format, but is it still a valid FooBar binary file? Remember that the FooBar format doesn’t say anything about the first four bits. If an application chooses to simply ignore the value of those first four bits then the file content will still be interpreted as a greyish-blue, vertical, small, open, ellipse.

Let’s suppose I increase the amount of blue and save the file so that it now contains the four bytes: 00001001 01000010 01101111 11111111 (or 09 42 6F FF). This is a valid FooBar binary file but it’s no longer valid ASCII as ASCII doesn’t allow a 1 in the first (most significant) bit of any of the 8-bit bytes.

ASCII only provides rules for 128 values (0x00 to 0x7F). This is quite a limited set of characters. It doesn’t include, for example, accented characters (such as é) or more aesthetic punctuation such as “smart quotes” or various length dashes — such as the em-dash. What if I want to add a pound sterling symbol (£)? The ASCII format doesn’t allow it, just as the FooBar format doesn’t allow a triangle. A different format is required.

The ISO-8859-1 encoding (or latin1) also has each character represented by an 8-bit byte but the range goes up to 0xFF (11111111). The first 128 values are identical to ASCII, but there are extra characters available (where the first — most significant — bit is 1) including the pound £ symbol (0xA3) and ÿ (0xFF). This means that my modified FooBar binary file with the extra blue (09 42 6F FF) is also a valid ISO-8859-1 text file. If I open the file in a text editor and stipulate the ISO-8859-1 encoding then it will interpret the contents as a tab followed by the characters “Boÿ”.

Although ISO-8859-1 provides some accented characters and some extra punctuation (such as guillemets « and ») there are still many characters that are unavailable. A more comprehensive format (text encoding) is UTF-8, which is a variable-width character encoding. This means that some characters are represented by more than one byte.

Just as ASCII is a subset of ISO-8859-1, ASCII is also a subset of UTF-8, which means that all the bytes from 0x00 to 0x7F in UTF-8 are identical to ASCII so, for example, 01101111 (6F) still represents a lower case Latin o. However, unlike ISO-8859-1, the non-ASCII characters are identified by two or more bytes in UTF-8. For example, the pound £ symbol requires two bytes: 11000010 10100011 (C2 A3).

Let's suppose I now have a file containing the four bytes: 11000010 10100011 00110001 00110010 (C2 A3 31 32). If I open this in a text editor, identifying the text encoding as UTF-8, then the first two bytes will be interpreted as the single character £, the next byte is the digit 1 and the final byte is the digit 2, so I have three characters in total “£12” and the file is four bytes long. If I instead identify the text encoding as ISO-8859-1 then each byte is a separate character, where the first byte is the upper case Latin A with circumflex (Â), so I now have four characters in total “Â£12”.
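
You can watch this dual interpretation happen on the command line with the iconv conversion tool (a sketch; the file name is just for illustration):

printf '\xC2\xA3\x31\x32' > pound-test    # the four bytes C2 A3 31 32
iconv -f UTF-8 -t UTF-8 pound-test        # treated as UTF-8: £12 (three characters)
iconv -f ISO-8859-1 -t UTF-8 pound-test   # treated as ISO-8859-1: Â£12 (four characters)

(The second iconv command converts each latin1 byte to its UTF-8 equivalent so that a UTF-8 terminal can display it.)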

Is this still a valid FooBar binary file? Yes, it is, provided we are adopting the lax approach of ignoring the first four bits.

Reddish ellipse with long axis horizontal.

Is this a valid ASCII file? No, because it contains bytes outside of the valid range.

Let’s go back to the original “image1” file and reduce the green to 0x08 and save the image as a file called, say, “testfile”. This contains the four bytes: 00001001 01000010 00001000 01101111 (09 42 08 6F). This is a FooBar binary file but is it also a text file? ASCII defines 00001000 (0x08) as the backspace control code, which is an instruction to move back one space. So this is also a valid ASCII file, but let’s see how it looks if we view it in a text editor. My preferred editor is vim:

Image of vim with a black background showing file contents: a space 8 characters wide followed by the upper case letter B (in white), and then (in blue) the caret symbol followed by the upper case letter H, and then the lower case letter o (in white).

This shows a space eight characters wide (which is the result of the tab 0x09), the upper case letter B (0x42), but this is followed by a sequence in cyan consisting of ^H. It’s in cyan to highlight the fact that it’s not the two characters ^ and H but is a control code with the value 0x08 (H is the eighth letter of the alphabet). This is caret notation and is used to denote control codes. This is followed by the final character (lower case o).

Not all text editors use caret notation. Here's how this file looks in gedit:

Image of the gedit (black text on white background): there is a space 8 characters wide (the tab), followed by the upper case letter B, followed by a rectangle containing the digits 0008, followed by a lower case letter o.

In this case the control code is shown using a rectangle with the control code’s hexadecimal value inside it (in this case padded to four digits 0008).

If, on the other hand, I display the file contents using cat in a bash terminal then the result is just a space eight characters wide followed by the lower case o. This is because the terminal obeys the control code instructions (cat simply passes the bytes through). The cursor first moves to the next tab stop (which creates the initial space), then the letter B is printed, then the cursor moves back one space, and then the letter o is printed, which overwrites the B.
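
You can reproduce this in a terminal (a sketch; cat's -v option displays control characters in caret notation instead of letting the terminal obey them):

printf '\x09\x42\x08\x6f' > testfile   # the four bytes 09 42 08 6F
cat testfile      # the terminal obeys the codes: a gap followed by "o"
cat -v testfile   # the backspace is shown in caret notation: B^Ho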

The purpose of a text editor is to create and edit files. If I type Tab Shift+B Backspace o in the text editor then it will interpret the backspace as an instruction to remove the previous character from the buffer. If I then save the file, it will only contain 00001001 01101111 (the tab character and the lower case Latin o). It won't contain the unwanted B and the backspace character. Therefore, if I open a file that already contains a control code, such as backspace, the editor will assume that I want a visual representation of that character and won't interpret it as an instruction.

Although this file is valid ASCII, it would normally be considered just a binary file not a text file because it looks weird if you open it in a text editor.

An application may be able to read a text file (that is, it knows the file format rules), but that doesn’t mean that it will follow the actions assigned to control codes (such as backspace), and there is no guarantee that the font the application is using has an associated glyph for a particular printable character.

Unicode (and therefore UTF-8) also has invisible formatting characters, such as the zero width joiner, which is encoded in UTF-8 as three bytes (0xE2 0x80 0x8D), and the “variation selector-16” character, which is also encoded as three bytes (0xEF 0xB8 0x8F). These are used to apply attributes to emoji. For example, the superhero character 🦸 consists of four bytes (0xF0 0x9F 0xA6 0xB8) and the female sign ♀ consists of three bytes (0xE2 0x99 0x80). The sequence of thirteen bytes 0xF0 0x9F 0xA6 0xB8 0xE2 0x80 0x8D 0xE2 0x99 0x80 0xEF 0xB8 0x8F (superhero, zero width joiner, female sign, variation selector-16) identifies the female superhero emoji 🦸‍♀️. However, some applications may not support this combining mechanism and may end up displaying the separate superhero and female symbols: 🦸♀ (if the font being used has corresponding glyphs for those characters).
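
You can test how well your own terminal copes with this sequence by printing the thirteen bytes listed above (a quick sketch):

printf '\xF0\x9F\xA6\xB8\xE2\x80\x8D\xE2\x99\x80\xEF\xB8\x8F\n'   # female superhero, or superhero + female sign if unsupported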

If the font doesn’t have a glyph for a particular character then a “not defined character” glyph may be used instead. This could be a rectangle with the hexadecimal value inside (as with the gedit example above) or it could simply be an open rectangle ▯ or a rectangle containing a question mark. (If a byte is invalid — that is, it’s not valid for the given text format — then the replacement character � is typically used.)

So, just because a file has a valid text format, it doesn’t necessarily mean that an application that is ordinarily able to read text files won’t encounter some difficulty with certain characters in that file.

What happens if I try to input my original “image1” file into a LaTeX document:

\documentclass{article}
\begin{document}
\input{image1}
\end{document}

The \input{} command expects the file identified in the braces to be a LaTeX file. This means that it expects the file to be a text file that contains LaTeX markup. So the contents of “image1” won’t be interpreted as a FooBar image but will be interpreted as the characters Tab B o o. LaTeX doesn’t interpret the Tab control code as a tabulation instruction but instead treats it as a space. It also ignores any spaces at the start of a line (which allows you to indent your source code to make it easier to read without introducing spurious spaces). The result is a PDF file with the word “Boo”.

Now let’s replace \input{image1} with \input{testfile} (the file shown in vim and gedit above) and try compiling (building) the document with pdflatex. This triggers the following error:

! Package inputenc Error: Unicode character ^^H (U+0008)
(inputenc)                not set up for use with LaTeX.

See the inputenc package documentation for explanation.
Type  H <return>  for immediate help.
 ...                                              
                                                  
l.1 	B^^H
         o
? 

This is complaining about the third byte in “testfile”, which LaTeX is interpreting as the Unicode character U+0008 (the backspace character). LaTeX has no instructions regarding this character as it’s not a character that would ordinarily be found in LaTeX source code. LaTeX knows what this character is (U+0008), but it doesn’t know what to do with it. If I type h Return at the prompt (the question mark at the bottom) then I get the following message:

You may provide a definition with
\DeclareUnicodeCharacter

This is telling me that if I want to use this character then I need to declare it and provide instructions as to how LaTeX should deal with it. If I press Return at this point and let LaTeX carry on processing then it will ignore the backspace character, so the resulting PDF will simply display “Bo”.

In neither of the above cases do I get a PDF with an image of a small ellipse because \input expects the file to contain (La)TeX instructions and will parse it as such.

You may have noticed that the error message above mentions the inputenc package even though the document hasn’t loaded it. In that example, I was using the TeX Live 2021 distribution. If I use an older distribution, say, TeX Live 2016 then I get a different error message and a different help message:

l.1 	B^^H
         o
? h
A funny symbol that I can't read has just been input.
Continue, and I'll forget that it ever happened.

? 

In both cases the backspace character is ignored and the result is the same.

Now let’s try the four-byte file that can be interpreted as the UTF-8 characters “£12”: 11000010 10100011 00110001 00110010 (C2 A3 31 32). With pdflatex from TeX Live 2016, there’s no error message but the log file contains the following warnings:

Missing character: There is no Â in font cmr10!
Missing character: There is no £ in font cmr10!

So this is interpreting the first two bytes of the file as two separate characters, Â (0xC2) and £ (0xA3), but there’s no glyph available for either of these characters in the default font (cmr10). So the PDF just contains “12”. With TeX Live 2021, there are no errors or warnings and the PDF contains “£12”.

Donald Knuth first released TeX in 1978, and Leslie Lamport released LaTeX in 1985. ISO-8859-1 was also first published in 1985, but UTF-8 was designed in 1992. (ASCII was first published in 1963.) So it’s not surprising that the original versions of TeX and LaTeX were designed for single byte text encodings.

The great advantage of UTF-8 is that it covers all Unicode characters (as opposed to ISO-8859-1, which is limited to 256 characters, and ASCII, which is limited to 128 characters). It's natural that users who wanted to be able to type extended Latin or non-Latin characters into their LaTeX document source code were keen to adopt UTF-8. This is awkward for (La)TeX, which treats each byte as a separate token. The inputenc package (with the utf8 option) provides a workaround: it makes the first byte of a multi-byte sequence an active character that takes the subsequent byte as its argument. This can be demonstrated by the following UTF-8 document:

\documentclass{article}
\usepackage[utf8]{inputenc}
\begin{document}
\show £
\end{document}

This produces the following message in the transcript:

> �=macro:
->\UTFviii@two@octets �.
l.4 \show �
           �

The \show command displays the definition of the token that follows it. In this case, the command is followed by two tokens: the two bytes (0xC2 and 0xA3) that indicate the £ symbol. So \show picks up the first token (0xC2) and shows the definition. This token (0xC2) is written to the transcript, but the transcript is being viewed in a bash terminal that’s expecting UTF-8 content. The 0xC2 byte isn’t followed by a legitimate byte (as defined by the UTF-8 format) so it’s flagged with the replacement symbol � to denote that it’s invalid. If I view the transcript in vim with the binary mode on, I can see the value of the bytes.

Image of the above transcript message shown in vim in binary mode: the invalid bytes C2 and A3 are shown as <c2> and <a3> (in cyan).

This shows that the 0xC2 byte (octet) has been defined as a macro (command) that expands to \UTFviii@two@octets followed by the byte 0xC2. This internal command is defined to take two arguments: the first is provided (0xC2, the first byte in the two-byte pair) and the second is the token that follows (the second byte in the two-byte pair).

If I press Return to continue processing the document I encounter an error because the second byte (0xA3) of the two-byte pair has become detached. The transcript in the bash terminal at this point is:

! Package inputenc Error: Invalid UTF-8 byte "A3.

See the inputenc package documentation for explanation.
Type  H <return>  for immediate help.
 ...                                              
                                                  
l.4 \show £

If I look at the log file in vim again, but this time without the binary mode on, then vim determines that the file can’t be a UTF-8 file (because it breaks the UTF-8 rules) so it decides that it must be an ISO-8859-1 file:

> Â=macro:
->\UTFviii@two@octets Â.
l.4 \show Â
           £
?

! Package inputenc Error: Invalid UTF-8 byte "A3.

See the inputenc package documentation for explanation.
Type  H <return>  for immediate help.
 ...

l.4 \show £

Changes to the LaTeX kernel in the past few years mean that UTF-8 is now the default encoding for LaTeX document source files, but this trick is still employed. If you want multi-byte UTF-8 characters to be treated as single tokens then you need to switch to a modern TeX engine (XeLaTeX or LuaLaTeX), which natively supports UTF-8.

So how do you tell what format (or encoding) a file is in? That is unfortunately quite difficult. You can parse the file and find out if it breaks a rule (is invalid) to determine if it’s not a particular format. For example, if the file contains a byte larger than 0x7F then it’s definitely not ASCII, or if the file contains a byte such as 0xC2 that isn’t followed by a byte (or bytes) that results in a valid UTF-8 character then the file isn’t UTF-8. However, as illustrated by the “image1” file, just because the file contents satisfy the rules of one format doesn’t mean that it doesn’t also coincidentally satisfy the rules of another format.

One general rule of thumb is that if the file contains a certain proportion of bytes that represent control characters (such as the earlier backspace example) then it’s likely to be a binary file. However, just because a file only contains bytes that represent printable characters doesn’t mean that the file isn’t a binary file.
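
Command line tools apply exactly these kinds of checks; for example (a sketch, where somefile is a placeholder):

file somefile                                         # guesses the format from the file's content
iconv -f UTF-8 -t UTF-8 somefile > /dev/null && echo "valid UTF-8"
# iconv exits with an error if it encounters an invalid UTF-8 byte sequence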

Some formats use a “magic marker”: a sequence of bytes at the start of the file that identifies the format. For example, a PDF 1.3 file will start with the bytes 0x25 0x50 0x44 0x46 0x2D 0x31 0x2E 0x33 (which represents the ASCII characters “%PDF-1.3”). However, there’s nothing to stop me from starting a LaTeX file with the lines:

%PDF-1.3
\pdfmajorversion=1
\pdfminorversion=3

In this case, the first line is a comment to remind the author (or anyone else reading the source code) that the next two lines are stipulating that the resulting PDF file must be version 1.3, but this is enough to confuse some applications into thinking that this LaTeX source file is a PDF file.

The byte order mark (BOM) is another form of magic marker that’s used to indicate the byte-endianness of a UTF-16 or UTF-32 file. For example, a big endian byte order UTF-16 file will start with the bytes 0xFE 0xFF. UTF-8, on the other hand, only has one byte order so, although the BOM character is defined in UTF-8, there’s no point in using it to indicate the byte order.

Despite this, the BOM character is sometimes used at the start of a file simply to indicate that the file is UTF-8, but this can be problematic. If a text editor automatically inserts it at the start of every file that it saves, then it forces the file to be UTF-8 even if there are no non-ASCII characters in the content. This makes the file less compatible with other applications, particularly when the file is a script for a language that has its own magic marker (such as a bash script, which must start with #!).
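
The UTF-8 encoding of the BOM character (U+FEFF) is the three-byte sequence 0xEF 0xBB 0xBF, so you can check whether a file starts with one by inspecting its first three bytes (a quick sketch; somefile is a placeholder):

head -c 3 somefile | xxd   # a UTF-8 BOM shows up as: efbb bf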

Another way of indicating the file format is to incorporate the information in the file name. This is typically done in the form of a suffix that starts with a dot (the file extension). For example, “image.jpg” indicates a JPEG file and “image.png” indicates a PNG file. There can be multiple extensions. For example, the file “myDoc.synctex.gz” has the extension “.gz” which means that it uses the gzip compression format. If I uncompress it (gunzip myDoc.synctex.gz) then I will have the file “myDoc.synctex”, which is in the synctex format.

Some file managers hide the extension when showing the list of files. This can be particularly annoying for LaTeX users because a directory (folder) can become full of files with the same basename but different extensions. For example, if I create a document called, say, “myDoc.tex” then using pdflatex will, at the very least, create the files “myDoc.log”, “myDoc.aux” and “myDoc.pdf”. Once a table of contents, list of figures, list of tables, bibliography, index, glossaries etc are added, the file list increases significantly and it can be hard to tell which file is which if the extensions are hidden.

More generally, hiding the file extensions can have serious security implications. An executable file called, say, “notes.txt.exe” will be displayed as “notes.txt” which gives the impression that it’s just a text file, but if you double-click on it, expecting to open it in a text editor, the file will instead be executed. This is one way in which users can be tricked into running a malicious executable file.

Unfortunately there’s nothing to stop anyone from renaming the file so that it has a different extension. For example, if I rename a PNG file from “image.png” to “image.jpg” then this doesn’t alter the file content — it’s still a PNG file — but it misrepresents the file, making it look like a JPEG file when it’s actually a PNG file. This can confuse an application that tries to determine the file type from the file name extension, and it will try to read the file using the wrong set of rules.

Another way of identifying the file format is with the MIME type but, as with file extensions, the MIME type can be incorrect (either through accident or deliberately).

Returning to my FooBar “image1” file (which doesn't have an extension), if I forget about it and stumble on it months later, the chances are that I will have forgotten what it was and what I used to create it. My first step will be to try to identify it with the file tool. This returns “image1: ASCII text, with no line terminators”, so my next step will be to open it in a text editor, where I will find a tab followed by “Boo”. Therefore the developer of the FooBar application really needs to modify the file format so that it includes a magic marker at the start and also decide on a file extension to help identify what type of file it is.

Can I include my FooBar ellipse in my LaTeX document? Not in its FooBar binary format. I would first need to convert it to a graphics format that \includegraphics recognises.

So in summary:

  • All files contain binary data.
  • The file format is the algorithm or set of rules needed to understand the data contained in the file.
  • The term “text file” is used to indicate a file that is written in one of the standard text file formats (such as ASCII, ISO-8859-1 or UTF-8) and that is intended to be readable in a text editor (that is, it doesn’t have a high proportion of non-whitespace control codes).
  • The term “binary file” is used to indicate a file that is not a text file.
  • The file (or input) encoding of a text file is the particular text format used to store the textual data in the file.
  • An application can only properly parse or process a file if it recognises and understands the file’s format.
  • If the format is mis-identified then this can either cause outright failure (“invalid format”) or incorrect instructions (such as placing an unwanted Â before a £).

Good Bots and Bad Bots

You’ve probably come across websites that want you to prove that you’re human and not a robot. This may come in the form of a picture challenge (for example, select all the squares with bicycles) or it may simply require you to check a box to assert that you’re not a robot. Perhaps you’re wondering why you need to do this. Why is the website so concerned about being visited by robots? Alternatively, perhaps you’re a website developer and are determined to find a way to keep out all bots.

What is a bot? Are all bots bad?

As with cookies, bots are important tools in the digital world. However, as with cookies, bots can also be used for unwholesome purposes.

“Bot” is short for robot and is simply a piece of software (an application) that visits websites. A bot may follow one link after another, crawling through pages across the World Wide Web. For this reason, they are often called “crawlers” or “spiders”.

Good Bots

If you go to your favourite search engine and type in a keyword or phrase (or use a voice activated request on your mobile device) then the results usually come up fairly quickly. This is only possible because the search engine has an index that has been compiled by bots that have followed link after link, gathering information. Without this index, it would take a very long time to scour all the millions of pages that make up the web to find something relevant.

Not all bots are crawlers. For example, Facebook has a bot that’s used when a post contains a link. The bot is used to check that the link exists and it reads any Open Graph markup. This allows Facebook to include an image and short excerpt to arouse the interest of anyone who views the post. Unlike the search engine bots, this bot doesn’t roam free about the Internet but instead restricts itself to links posted on Facebook pages.

Well behaved bots commonly identify themselves in the user agent string in the form:

bot-name/version URL

For example, “facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)” identifies the Facebook bot (facebookexternalhit), its version number (1.1) and a way of finding out information about the bot.

So these are useful bots that help users to discover interesting sites.

However, even the good bots don’t always honestly identify themselves. For example, if you post a link in the Signal Messaging app, the bot used to fetch the preview information identifies itself as WhatsApp, and this basically seems to be a rehash of the “all browsers identify themselves as Mozilla” problem.

Not So Good Bots

Although the crawlers used by search engines are useful, some crawlers that index sites to provide certain types of information for their users (who may require free or paid accounts to access it) can be a nuisance because they’re not well-behaved. For example, they may not follow the robot instructions stipulated by the website (robots.txt), they may try to access pages that are only intended for human visitors or they may hit the site so hard (that is, they look up pages so fast) that they slow down the site and it becomes unusable for everyone else.

This could be because the bot’s developer made a mistake (a bug in the bot’s code or an inexperienced programmer) or it could be because the developer simply doesn’t care and wants the information quickly regardless of the inconvenience to others (perhaps to satisfy the demands of paying customers). In the long run, this is counter-productive as it will lead to the bot (which is identified in the user agent string) being banned.

Scrapers

Web scraping (or harvesting) is when a bot extracts data from a webpage. In the earlier case of search engines and social media, this data can just be keywords or phrases or the URL for the page image, but some bots are designed to gather all information from a page in order to reproduce it verbatim on another site. This is often done to lure visitors to their own copycat site, which will most likely be stuffed full of adverts and tracking (which makes a profit for their owner). This is usually a violation of intellectual property. Even where the original page is available under a permissive licence, attribution is usually required but is often omitted. This happens a lot for question and answer sites, such as Stack Exchange, or forums.

These bots may well have the user agent string empty or set to the default value for the given API that they are built with.

Trolls and Spambots

These are the types of bots that the pages requiring you to identify yourself as a human are mostly trying to block. The user agent string is typically set to a common browser and platform to make the bot appear as though it is a human visitor. These bots search for forms to fill in, such as contact forms to send spam messages or comment forms to advertise dubious products and sites.

While spambots are the digital equivalent of fly-posters, trollbots are the equivalent of poison-pen letter writers. They are created by individuals who take a puckish delight in causing hurt and discord. These bots are designed to search for certain keywords on a page and craft an offensive or divisive comment that relates to the topic. The creators of these bots may have a particular hatred towards a certain group of people, but they can also be chaotic nihilists with a set of offensive comments for every group.

The expression “don’t feed the trolls” has been around for a long time. I remember first encountering it on Usenet back in the early 1990s (accompanied by some ASCII art). It’s very good advice. Don’t give trolls the attention that they are looking for, but, in some cases, the troll posting the offensive comments isn’t human. It’s a bot that has no ability to reason, no feelings, no embarrassment. Its function is solely to post content that its creator programmed into it.

Chatbots can come under both this category and the next. Chatbots in general are just a tool that simulates conversation, and are often used for legitimate services, such as online help, but they are also used by criminals to deceive people. For example, a fraudster might create a fake account on an Internet dating site and use a chatbot to hook victims who believe they are chatting with a human. Once the chatbot has gained the victim’s trust, the fraudster takes over.

Malware Bots

The worst of the bad bots are the ones created by cyber-criminals and they are designed to wreak havoc, stealing data and installing malware. These bots look for dynamic web pages that use parameters and will try to inject malicious code into the parameter values.

For example, the page https://www.dickimaw-books.com/booklist.php?book_id=11 has a parameter (book_id) that identifies a particular edition of a book. (In this case, the second paperback edition of The Private Enemy.) The parameter value (11) uniquely identifies this edition in the database that contains all the title information.

A malicious bot will try altering the parameter value to break into the database. For example, it may start out by simply appending an apostrophe (book_id=11'). If this triggers a syntax error then the site is vulnerable to SQL injection and the bot can then try something far nastier to access the contents of the database.

Another possibility is that the parameter value may be printed on the web page, so the bot will try replacing the value with JavaScript. For example, the bot may start out with a simple alert. If the bot detects an alert box then the site is vulnerable to cross-site scripting (XSS) and the bot can try something more damaging.

Or the parameter value may be the name of a template file, which is used for the main body of the web page, so the bot will try replacing the parameter value with /etc/passwd (or ../etc/passwd etc) in order to trick that web page into revealing the contents of the password file instead.

Bad bots can also disrupt a website by repeatedly accessing pages in rapid succession (a denial of service attack or, where an army of bots are working together, a distributed denial of service attack). This can make the site completely inaccessible to anyone else.

These types of bots rarely identify themselves honestly. The user agent string is typically empty or contains a common browser and platform combination (as with the trolls and spambots). I've also encountered attempts at SQL injection where the user agent string was the same as the aforementioned Facebook bot. At first glance, this gives the impression that a Facebook bot has gone rogue (or followed a bad link), but the IP address was registered to somewhere in Russia, which seems an unlikely origin for a Facebook bot. So bad bots not only pretend to be human but also try to pass themselves off as legitimate bots.

Sometimes the user agent string will contain “sqlmap”. This is a legitimate pen testing application. However, in many jurisdictions, penetration testing can only be performed by mutual consent between the pen tester and the website owner. If you are a website developer and your organisation has hired a pen tester, don't simply block bots with this user agent string: the site needs to withstand bots that don't identify themselves, because most bad bots won't conveniently do so. If a pen tester hasn't been engaged then the tool is being used illegally (which is par for the course with criminals).

So, if you’re a website developer and you want to stop bad bots, remember that you can’t rely on the user agent string. Bots pretend to be human and some humans blank their user agent string for privacy reasons. The first line of defence is to filter (e.g. ensure that a numeric value is actually a number), escape special characters (e.g. htmlentities) and use prepared statements.
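
The filtering idea is language-agnostic (the site itself uses PHP, which isn't shown here); here it is sketched in bash using the book_id parameter from the earlier example:

book_id="11'"   # a value supplied by a visitor (or a bot)
if [[ "$book_id" =~ ^[0-9]+$ ]]; then
  echo "OK: look up book $book_id"
else
  echo "Reject: book_id is not a valid numeric ID"
fi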

If you’re just a regular website user, don’t assume that every comment you read was actually posted by a human and, while captchas may be frustrating, your web browsing experience may be far worse without them.


Update 2021-08-08: added paragraph on Signal in Good Bots section and paragraph on chatbots in Trolls and Spambots section.

Another Migration

In the first post of this blog, I wrote about my decision to migrate to a new web hosting provider back in 2019. Last week, the site migrated again, but this time I stayed with the same web hosting provider. I moved from the cloud hosting platform (which uses a server cluster) to a newer single server platform.

Migrating a web site is rather like moving house. The removal company moves the content and will connect the large appliances in your new home, but there are a lot of little bits and pieces that you have to do yourself. You have to let everyone know you’ve moved and you need to get used to the new layout. Those handy tools that were in a certain location in the old place are now somewhere else. Gadgets need re-configuring. A convenient local service isn’t available and another one needs to be found.

In an analogous way, the web hosting company's migration team moved the files and databases from the old servers onto the new one and set things up, but there are different paths and configurations on the new server that needed to be taken into account. Certain files lost their executable bit, which had to be restored. Some code that worked in the old location doesn't work in the new environment and had to be modified. The mail boxes had to be created manually, DNS records needed changing, and custom cron jobs had to be checked and set up.

The Domain Name System (DNS) provides public records associated with every domain. When you type an address in your browser, the browser needs to know where to go to fetch the file associated with that address. The DNS records provide the route to the server for the given domain (dickimaw-books.com in this case) and the information is cached (usually for around 24 hours) so that the browser doesn’t have to keep looking up the information as you move from one page to the next. Similarly, when you send an email, your mail server has to look up the appropriate entry in the DNS record to find out how to route your message.

When a site moves to a new server, all these records need to be updated, but there’s an additional delay as a result of caching. For a while, emails can’t be delivered, and visitors are directed to the old server and then, when the old site certificate becomes invalid, they find themselves confronted with a big scary warning message from the browser until the new certificate is sorted out.

There was a moment last week when I wondered why I’d been mad enough to consider migrating the site. Sure, the old cloud hosting package had its problems and it could be a little slow, but at least it had worked and I knew what tools were available and where to find them. However, eventually things were sorted out, the new server is much faster, and the stricter PHP settings flagged up a few bugs that I’ve now fixed.

Once the migration was successfully completed, the final step was to cancel the old cloud hosting package, but just before I did that I learnt that it had been marked for obsolescence and I would have had to migrate in a month's time anyway. So it all worked out for the best in the end. I'm sorry if you encountered any problems while trying to access the site last week, but it should mostly be operational now (except for the shop, which requires some further testing before it can be reopened).

If you are a regular visitor to the site, you may have noticed that there’s a new “Account” link in the main navigation bar. This is something I’ve been working on for some months now, and it was while working on it that I became so frustrated with the limitations of the cloud hosting package that I decided to move. I’ll describe it in more detail in the next post.

Farewell to the Hedgehog and Little Duck

Ingram (the parent company of Lightning Source, who print and distribute paperback titles published by Dickimaw Books) have announced that their saddle stitch format is being retired on 1st March 2021 because the software and equipment used to print that format have become obsolete. This means that the first editions of “The Foolish Hedgehog” and “Quack, Quack, Quack, Give My Hat Back!” will be going out of print on that date.

The saddle stitch format is where the pages are held together by staples down the spine. This works well for these illustrated children's books, particularly “Quack, Quack, Quack, Give My Hat Back!”, which has double spread images. The other paperback titles published by Dickimaw Books all use perfect binding (and so aren't affected by this change), which produces a stiff, flat, rectangular spine. Perfect binding doesn't work well for young children's books, which are often opened out flat.

If you have been thinking about buying a copy of either of these books then you will have to order them from an online book seller before 1st March 2021. After that date, there may still be copies available from the Dickimaw Books store (once it reopens) until existing stock runs out.

SmashWords Ebook Sale

The DRM-free ebook retailer SmashWords has a sale from 18th December 2020 to 1st January 2021. My crime novel “The Private Enemy” has a 75% discount and my crime fiction short story “I’ve Heard the Mermaid Sing” has a 100% discount (i.e. free!) for the duration of the sale. Did you know that you can gift ebooks on SmashWords? If you’re stuck for a present for a book lover this provides a cheap and convenient option, especially if they’re far away or isolating.

Book Samples

The Dickimaw Books site now has a new book samples area. This provides a collection of sample images taken from pages of the selected paperback book with an accompanying audio track. At the moment there’s only one book listed (The Private Enemy) although I plan to add other fiction paperbacks at a later date.

The sample starts with an image of the jacket. You can navigate to the next available sample page using the “next page” icon (a right-pointing arrowhead), which can be found at the top right of the page image. Below the page image are links to further details about the book and to the book's listing in the Dickimaw Books store. If there is an ebook edition (which there is for The Private Enemy), then there will also be a link to the ebook's HTML sample.

The audio file accompanying the jacket image is simply an introduction. The audio files for the actual sample pages are of me reading out that page. This means that you can see how the text is actually typeset on the page of the paperback edition and you can hear the page content. If you prefer to just read the text then follow the link to the ebook sample instead. The navigational icons can be changed in the site settings page (see also the Sticky Hamburgers post).

Remember that the ebook edition of The Private Enemy is free for the duration of the “Authors Give Back” SmashWords sale (ends 31st May 2020), so now’s a good time to try it out!