Hello E-Hedgehog

A little hedgehog promises his grandmother that he won’t go onto the wasteland (the road) where the dragons (vehicles) live, but one day he’s tempted onto the road by a hungry crow. Will he manage to get home safely?

Last year, my illustrated children’s books went out of print when Ingram retired their saddle stitch format, but now The Foolish Hedgehog is back again, this time as an ebook. What’s more, it’s out just in time for the SmashWords end of year sale, which runs from 12:01 am on 15th December 2022 to 11:59 pm on 1st January 2023 (Pacific Time). So if you’re looking for a short illustrated story for young children to keep them entertained over the holiday, have a read of the preview on SmashWords and buy it while it’s half-price!

Ongoing Email Issues on Website

For some weeks now, the forms on this site, such as the contact page, have been unable to send an email. It seems to be caused by an SSL issue outside of my control. All support channels to the web hosting company used by this site are down, which means I can’t even report the issue, let alone get it fixed. You can still post bug reports, but you will encounter an error message when the site attempts to email me that a new report has been posted. The report will be saved in the database, but (unless your account is permitted to bypass the reviewing system) it won’t show up in the bug tracker until its status has been changed from Pending to Open. Similarly for comments, feature requests, and other notifications. I will pick up the error messages in the server’s log file, which will include the email message, but there will be a delay as it’s not something I can constantly monitor.

I will update this post when I have further information.

Update 2022-10-08: the problem with reporting this issue to the web hosting company was quite frustrating. They had removed their telephone support number, the online chat wasn’t available, and whenever I tried to open a ticket I got an access denied error message (even though I was logged in and could access the rest of the site’s dashboard). It turned out that in order to use the online chat I had to delete all cookies and then accept them all (which wasn’t particularly intuitive, but I should’ve thought of it sooner as that seems to be an increasingly common problem with websites). The ticket system still wasn’t working, but it’s possible I had to log out and in again, and maybe shut down, reboot, switch things off and back on again, and perhaps count to 10 or whatever. Anyway, the chat operator has logged the issue and with a bit of luck it will be fixed soon. I’m sorry for any inconvenience this has been causing.

I’ve re-enabled the contact form in eager anticipation of the issue being fixed soon. If you get an error about the failure to send an email, don’t worry. Your message will be logged and I’ll find it in the log file, but there will likely be a delay of up to 24 hours before then.

Book Shop Closed Indefinitely Due to PayPal Removing Support for Encrypted Website Payments

The Dickimaw Books store has unfortunately closed until further notice. The reason is that PayPal has removed support for encryption with its PayPal Payments Standard option. This is the option where an online store redirects the customer to PayPal’s site in order to make the payment. PayPal is still providing this payment option, but the store will now only work if I switch off encryption, which I’m not prepared to do.

For those who want more detail, the way that this works is as follows. The customer adds products to the basket and proceeds through the checkout process until they arrive at the final checkout page that confirms the price of each item, any discount applied, postage and packaging, final total, invoice address and shipping address. All this information needs to be sent to PayPal so that the correct amount can be charged. Once the transaction is successfully completed, PayPal then sends a notification back to the store to confirm that the payment has been made.

Without encryption, the transaction data at the checkout page is contained in plain text within the form parameters and is sent as plain text to PayPal when the customer clicks on the continue button.

There are two problems with using plain text. The first is that these private details about the customer and their transaction can be intercepted by a third party eavesdropper.¹ The second is that a dishonest customer can open the developer tools in their web browser and alter the payment details, awarding themselves a hefty discount and defrauding the merchant. Under those circumstances, it’s hard for the merchant to prove that they didn’t have the products temporarily listed at a lower price when the transaction was made.

Encryption helps to protect both the customer’s private details and the merchant. The way that this is done is through public/private key encryption. At the checkout page, all the transaction details are stored within a single form parameter with an encrypted value. This prevents any tampering and also protects the data when it’s transmitted.

There is a two-way communication between the merchant’s site and PayPal. In order for the encryption to work, the merchant’s store needs a copy of PayPal’s public certificate (which the merchant used to be able to download from their PayPal business account). PayPal, in turn, needs the merchant’s public certificate. The encryption and decryption can’t be performed without a valid public/private key pair.

Certificates have an expiry date. This is a precaution in case the private key is stolen. Whilst stolen keys can be revoked, there’s a chance that this may not be noticed. An expiry date at least limits the length of time a stolen key can be used for.

The certificate for the Dickimaw Books store expired last Sunday. I had set myself a reminder to create a new pair and did so the day before, but when I tried to upload the new public certificate to PayPal, I encountered a 404 page not found error. I raised an issue with their merchant technical support and was informed that the encrypted option was no longer available. The checkout will now only work if I disable the encryption from the store’s admin page.

I have no idea why PayPal would intentionally remove a security feature, particularly without giving any prior warning. This will obviously impact all small merchants who use this method, although they may not discover this until their certificate expires and they try to upload a new one. I’m hoping that this issue will turn out to be a miscommunication within PayPal’s technical support department and an inadvertent broken link. Until they restore the ability to use encryption or until I find an alternative payment provider, the store will remain closed.

Meanwhile, if you want any of my paperback books, you can purchase them from a third-party book seller, such as Amazon.

¹Using https instead of http does, of course, add a layer of protection against eavesdropping, but it doesn’t protect against fraudulently altering the information before it’s sent.

Muirgealia: A Tale of Temporal Enchantment

If you’ve looked at the book list recently, you may have noticed a new pending title called Muirgealia. Muirgealia is a land that was home to a mixture of enchanters (with various magical abilities, including telepathy, telekinesis and the power to create areas of stasis), mystics (who can see through time in the form of visions), and handcræfters (people with no magical ability, who are typically inventors, engineers or artisans). Five hundred years before the start of the story, two calamities (one natural and the other contrived) led to the usurpation of power by the enchanter¹ Pelouana, a traitor who formed an alliance with the warlords from the neighbouring land of Acralund. A warning from the future allowed most of the Muirgealians to escape into exile before a curse hit both Pelouana and Acralund.

At the start of the story, Muirgealia is deserted, except for Pelouana who has been trapped there for five centuries as a result of a magical explosion that fractured a stasis field (where time is almost stopped) causing drifting pockets of stasis. Acralund is sparsely populated by the remnant descendants of the warlords, who are still affected by the curse that prevents them from rebuilding their civilisation. The descendants of the Muirgealians in exile are in hiding, waiting for their future ally to be born and grow up to help them regain their land.

The people of Langdene, to the north of Muirgealia, unwittingly became caught up in events when their king ordered a mass migration south to escape a savage winter. They passed through Muirgealia just after Pelouana had taken over. She murdered the king, suspecting him of being the one who had cursed her (it was, in fact, one of his descendants), and a rift formed between the king’s two surviving sons, who ended up creating two new nations, Farlania and Barlaneland, in the land to the south of Acralund.

Five hundred years later, Luciana is the daughter of the king of Barlaneland and Rupert is the second son of the king of Farlania. Pelouana is finally freed by the Acran, who first target Rupert, believing him to be the one who cursed them, and then target Luciana, believing that they can get to Rupert through her. Meanwhile, the Muirgealians want revenge and their land back (which contains all their books and other cultural works locked up in time). More importantly, they need to destroy the stasis bubbles that will cause havoc if they break free from the loose tethers that are keeping them in Muirgealia and drift around the rest of the world.

The “world” in question is an Earth-like planet that has a single moon and orbits a single star (or perhaps it is Earth in some forgotten past or other timeline). Langdene, Muirgealia, Acralund, Farlania and Barlaneland are all located along an eastern seaboard, bordered to the west by a mountain range. Well over a thousand years earlier, the entire continent was ruled by Potensmunda, a land beyond the mountains. The Potensmundan Empire was much like the Roman Empire. If you know any Latin, you might pick out the words “potens” (powerful) and “munda” (clean). The empire was like bleach, erasing all local languages and replacing them with Potensmundan. The empire has long since fallen, but the language and a few structures, including old roads, remain.

In-universe, the map above was created long ago by people who didn’t have the modern surveyor’s tools that we have in our own world, so it’s not accurate. The three different fonts represent three different sets of handwriting, as the map has been modified over time. The translation convention applies here. English is used for modern dialects of Potensmundan, and the few words of ancient high Potensmundan are in Latin. The character names (which are a mixture of Roman, Anglo-Saxon and Germanic) can also be considered as having been translated into the closest match.

Incidentally, in case you’re wondering why I didn’t choose “Potensmundi” (powerful world) instead, there are two reasons. The first is aesthetic: I prefer the sound of Potensmunda. (I was looking for a four or five syllable name ending in an ahh sound, but don’t ask me why.) The second is pragmatic: an Internet search of “potensmundi” produces a huge load of hits (which isn’t particularly surprising, given its meaning) whereas “potensmunda” (or “potens munda”) isn’t so common, so I’m less likely to step into someone else’s digital footprints.

If you’re keen to know when Muirgealia is published and you have a Dickimaw Books site account, you can sign up to be notified. Alternatively, you can subscribe to the RSS feed for this blog or the main news RSS feed, or follow the Dickimaw Books Facebook page.

What happened to The Fourth Protectorate, which I’ve previously posted about and which is still listed as pending? Life happened. When I wrote it, the social disorder of the 1980s seemed firmly in the past. Recent years have put me off it. The Fourth Protectorate is without doubt the biggest and bleakest of all my novels. Muirgealia is the shortest and lightest. The Private Enemy comes somewhere in between.


¹Muirgealians use “enchanter” as a gender-neutral term. They don’t use “enchantress”. The Acran and Langdeners use “sorceress” instead.

Legacy Documents and TeX Live Docker Images

Over the past few years there have been a number of major changes to the LaTeX kernel that have unfortunately broken code in old packages, classes and documents. The changes are beneficial to new documents and packages, but before the introduction of the new kernel features it was necessary for packages to hack the old kernel internal commands in order to achieve the desired result and, for the most part, the instabilities come from these hacks.

Some of the affected packages have been updated to work with the new kernels, but in some cases the original hack may have been too complicated to untangle or the package author may no longer be available to update the code.

My own packages have been affected, starting with flowfram.sty in 2015, then datatool in 2019 (which had a knock-on effect on mfirstuc.sty, which requires datatool-base.sty, and on glossaries.sty, which relies on mfirstuc), and jmlrbook.cls in 2020.

So what happens if you have a legacy document that compiled without a problem when it was first created but now goes wrong? This may not necessarily mean that an error occurs; it could instead silently cause unexpected output.

For example, when I wrote LaTeX for Administrative Work (volume 3 of the Dickimaw LaTeX series) I had TeX Live 2014 installed. Below is an image of page 35 (from the A4 PDF version). The page starts with the page header, there is then a figure (which has floated to the top of the page) and then the page body. All looks fine.

Image of page 35 with correctly typeset content.
Page 35 Built with TeX Live 2014

However, if I rebuild the document with a newer TeX distribution then the page content becomes mangled, as shown below. The figure is now at the top of the page, and the page header has been shunted down so that it overlaps the figure caption. The text body height doesn’t take the figure into account, which causes the text to overflow the bottom of the page.

Image of page 35 with textual content shunted down the page.
Page 35 Built with TeX Live 2021

The hack of adding \def\f@depth{1sp} suggested in bug report #105 works in some cases, but unfortunately in this case it leads to the “Too many unprocessed floats” error. So, until I can find a reliable fix for flowfram.sty, how can I rebuild this document? I do have some old versions of TeX Live installed, but not that far back.

The solution lies with the Docker images provided by the Island of TeX. Docker basically allows you to run an application inside an isolated container. So, instead of hunting for my TeX Live 2014 DVD and installing TL2014, I can fetch the Docker image and build my document inside a container.

If you don’t already have Docker installed, you will first need to install it. Once it’s installed, you can use the docker command line tool. On Unix-like systems, you may need to use sudo. You can view the currently installed Docker images using docker images (or sudo docker images).

I need the TeX Live 2014 image so I have to pull the TL2014-historic release from the Island of TeX:

sudo docker pull registry.gitlab.com/islandoftex/images/texlive:TL2014-historic

Now I need to change to the directory (cd) where my document source is located (change the path as applicable):

cd path/to/directory

If my document source code is in the file called document.tex and I would ordinarily compile it using:

pdflatex document.tex

then when I want to compile it inside a TL2014 Docker image container I would need to do:

sudo docker run -i --rm --name latex -v "$PWD":/usr/src/app -w /usr/src/app registry.gitlab.com/islandoftex/images/texlive:TL2014-historic pdflatex document.tex

In this case, my document source file is called admin-report.tex and the build process is rather complicated as it consists of multiple pdflatex, bibtex, makeglossaries and makeindex invocations. Rather than using docker run for each step, it’s simpler to use an automated process, such as arara. First I need to ensure that I have the appropriate arara directives at the start of admin-report.tex:

% arara: pdflatex
% arara: bibtex
% arara: makeglossaries
% arara: pdflatex
% arara: makeglossaries
% arara: pdflatex
% arara: makeindex: { style: admin-index.ist, options: -c }
% arara: pdflatex
% arara: pdflatex

Now I just need one docker run instance:

sudo docker run -i --rm --name latex -v "$PWD":/usr/src/app -w /usr/src/app registry.gitlab.com/islandoftex/images/texlive:TL2014-historic arara --verbose admin-report.tex

This is rather lengthy to type, so I wrote a simple bash script called dockerbuild:

#!/bin/sh

docker run -i --rm --name latex -v "$PWD":/usr/src/app -w /usr/src/app registry.gitlab.com/islandoftex/images/texlive:TL2014-historic arara --verbose "$@"

So I can now just do:

sudo ./dockerbuild admin-report

Unfortunately using sudo means that I end up with files owned by root, but this can be fixed by adding chown to the bash script (change the username and groupname to your own):

chown username:groupname `basename "$@" .tex`.*

There were two further problems. Firstly, the version of flowfram.sty bundled with TeX Live 2014 doesn’t have a couple of options used by the document. This is probably because I added those options while I was working on the book but didn’t upload the new version to CTAN until after the snapshot of TL2014 captured in the Docker image was taken. I needed to copy the newer v1.17 into my current directory to ensure the document compiled correctly.

Secondly, from inside the container I can’t input files that are outside of the current working directory, because only the current directory is mounted as a volume. Volume 3 cross-references the previous two books in the series using:

\externaldocument[nov-]{../novices/novices-report}
\externaldocument[thesis-]{../thesis/thesis-report}

This picks up the cross-referencing information from the aux files of volumes 1 and 2, but this won’t work inside the Docker container. Instead, I need to copy those files into my current directory. (Note that symbolic links to files outside the current directory won’t work.)

The best solution would be for me to find a way to fix all my affected packages, but this is proving to be non-trivial. The historic TeX Live Docker images at least provide a workaround.

Binary Files, Text Files and File Encodings

The TeX distribution comes with a mixture of binary files and text files. The source code for your document is written in a text file and you need a text editor to create and modify it, but you need to make sure the file (or input) encoding is correct otherwise you can end up with error messages, warnings and strange characters in your PDF file. This can be very confusing to new users without a computer science background who might ask, “what’s the difference between a binary file and a text file, and what does file encoding mean?” It can also confuse people with a computer science background who might blithely inform you that, naturally, a binary file is a file that has binary content and a text file is a file that contains plain text.

So what actually is the difference between a binary file and a text file, and what causes weird symbols to appear and “missing character” warnings?

This isn’t intended to be a lecture on hardware, so I’m going to simplify things somewhat, but digital devices (such as laptops, tablets and smartphones) essentially treat everything as binary data. Binary in this context means one of two states, so you can view the internals of a computer as a series of tiny switches that can either be on or off.

Row of switches: up, down, up, up, down, down, up, up, up.

We could call these two states “on”/“off” or “up”/“down” or “true”/“false” but the most compact form for a human to visualise the two states of a tiny electronic switch is to use the digits 1 (on or up or true) and 0 (off or down or false).

Row of switches with a digit below each one: up (1), down (0), up (1), up (1), down (0), down (0), up (1), up (1), up (1).

Each switch is one bit and a sequence of eight bits is one octet. With 8-bit systems, eight bits is also one byte. (Half a byte, or four bits, is a nybble, but that’s not often used.) Your hard drive (or USB stick etc) is essentially full of bits. The device’s filing system contains an index of where each file starts and ends.

Five rows of twenty 0 or 1 digits with three blocks highlighted and annotated file1, file2 and file3, respectively.

If you delete a file the index is removed but the bits remain.

As the previous image except that the second highlighted region and its annotation have been removed.

So each file contains a sequence of bits and the file size is measured in bytes. In other words, all files have binary content.

The file format determines how the binary content should be interpreted. The format is basically a set of rules. If a file is identified as having a particular format but its content doesn’t follow the rules for that format, then the declared format is incorrect or invalid (which is what triggers an “invalid format” error if you try to open it).

Suppose I have an application (called, say, FooBar) that allows me to draw either a rectangle or an ellipse. It’s very restrictive and only has a limited set of options: vertical/horizontal (is the shape’s long axis vertical? true/false), large/small (is the shape large? true/false), filled/open (is the shape filled? true/false), ellipse/rectangle (is the shape elliptical? true/false). Each setting is binary so the options can be compactly written as a series of bits. For example: vertical, small, open, ellipse can be written as 1001. This needs to be zero-padded to make it up to 8 bits (since most digital storage measurements are in bytes): 00001001.

Binary data is difficult for humans to read and write. The longer the sequence of bits, the harder it becomes, so programmers usually convert the value to hexadecimal (base 16) to make it more compact and easier to read. Each nybble (4 bits) can be represented by one hexadecimal digit (0–9, A–F) so one byte (8 bits) can be represented by two hexadecimal digits. Instead of writing 00001001, I can write the equivalent hexadecimal value: 09. (In order to clarify that the value is a hexadecimal not a decimal representation, it’s often prefixed with “0x”: 0x09. In this case, because the number is less than ten, it happens to be the same as decimal 9.)
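
If you want to experiment with these conversions yourself, here’s a quick illustration in Python (the variable name is just for illustration):

settings = 0b00001001        # vertical, small, open, ellipse (zero-padded to one byte)
print(f"{settings:08b}")     # 00001001 -- the full byte in binary
print(f"{settings:#04x}")    # 0x09     -- the same value in hexadecimal
print(int("09", 16))         # 9        -- the hexadecimal digits read back as a number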

Let’s now suppose that FooBar allows me to specify a colour for the shape as a combination of red, green and blue (RGB). Each of these three colours has a numerical value indicating how much of that colour to add, where 0 indicates none of the colour and the maximum value indicates all of that colour. There are different scales for quantifying a colour, such as a decimal number between 0.0 and 1.0 or a percentage between 0% and 100% or an integer between 0 (0x00) and 255 (0xFF). The last scale is convenient for FooBar because it means that each of the three colour components can be stored in a single byte so the complete colour specification takes up three bytes.

I’d like my vertical, small, open, ellipse to be drawn in a sort of greyish-blue colour. After playing around with the colour selector I’ve found the shade I like: 0x42 (red), 0x6F (green) and 0x6F (blue).

A greyish-blue, vertical, small, open ellipse.

Having created my work of art, I need to go off and do something else, but I’d like to save my ellipse so I can look at it again later. The most compact way of saving the information is in four bytes (the settings, followed by the red, green and blue values): 00001001 01000010 01101111 01101111. I’ve put a space between each group of eight bits here for clarity, but from the computer’s point of view this is just a sequence of 32 bits. From a human point of view the information looks better in hexadecimal: 09 42 6F 6F.

I need to think of a file name but I’m not very good at naming schemes so I’m just going to call it “image1”. The format is FooBar’s native binary format. The rules for this format are: the file must contain exactly four bytes, the first byte has the settings information stored in the last (least significant) four bits, the second byte is the amount of red, the third byte is the amount of green and the fourth byte is the amount of blue.
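
FooBar is only a made-up application, but here’s a small Python sketch of what saving and reading a file under those rules might look like (the code, like FooBar itself, is purely illustrative):

# Save the settings byte followed by the red, green and blue bytes.
with open("image1", "wb") as f:
    f.write(bytes([0b00001001, 0x42, 0x6F, 0x6F]))

# Read it back, applying the FooBar rules.
with open("image1", "rb") as f:
    data = f.read()
if len(data) != 4:
    raise ValueError("invalid format")   # the file must contain exactly four bytes
settings, red, green, blue = data
is_vertical = bool(settings & 0b1000)    # is the long axis vertical?
is_large    = bool(settings & 0b0100)    # is the shape large?
is_filled   = bool(settings & 0b0010)    # is the shape filled?
is_ellipse  = bool(settings & 0b0001)    # is the shape elliptical?
print(data.hex())                        # 09426f6f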

If I try to open a file in FooBar that only contains, say, three bytes, then this breaks the rules, so FooBar will pop up an “invalid format” error message. What happens if the file has four bytes but the first four bits aren’t 0? Should they simply be ignored or should this trigger an invalid format error? The rules don’t say, so the file format has an ambiguity in it.

An application can only read a file if it has been provided with the rules for the file format.

So if the content of all files is just a sequence of bits, what is a text file? A text file is simply a file that obeys one of the known text file formats or encodings. The most well-known text encoding is the American Standard Code for Information Interchange (US-ASCII or, more colloquially, ASCII). The ASCII rules are: each byte must be in the range 0x00 to 0x7F (00000000 to 01111111, note that the most significant bit is always 0) and each byte either represents a control character (an instruction) or a printable character (letter, digit or punctuation). The ASCII table describes what each of the 128 allowed bytes represents.

For example, the byte 0x0A (00001010) is the line feed instruction. This means that whatever application is trying to interpret the data must move down one line. However, there is some ambiguity here as some systems will also move back to the first column (the start of the line) when encountering this line feed instruction but others require a carriage return instruction (0x0D) as well. (For those of you who remember using a typewriter, when you reached the end of a line, you had to hit a lever, which rotated the barrel one line, and also push the carriage across, which brought you back to the start of the line. Both actions were performed simultaneously with a single sweep of the hand. The line feed and carriage return terminology has carried over to the digital world.)

Another control code is 0x09 (00001001) which is the horizontal tab instruction. This means to move to the next tab stop, but it’s up to the application reading the data to define the tab stops. The space character (0x20) can also be considered a control code as it’s an instruction to move on one “space” without actually displaying anything.

The bytes in the range 0x21 to 0x7E are printable characters. Each of these values (or codes) has an associated shape (or glyph) that needs to be displayed. This shape is obtained from the font table, but the ASCII format doesn’t provide any information about what font should be used. That’s again up to the application reading the data.

For example, the byte 0x42 (01000010) represents the upper case Latin B and the byte 0x6F (01101111) represents the lower case Latin o. So if I create a file in a text editor that contains a tab followed by the word “Boo” and save it (as ASCII) then the file will contain four bytes: 00001001 01000010 01101111 01101111 (or 09 42 6F 6F).

These four bytes may look familiar. They are the same four bytes that make up the earlier “image1” file. So this file is both a FooBar binary file and an ASCII text file. It obeys the rules of both formats.
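
You can verify the coincidence in Python:

print("\tBoo".encode("ascii").hex())   # 09426f6f -- the same four bytes as the FooBar file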

What happens if I replace the tab character with an upper case Latin I 0x49 (01001001)? This still obeys the ASCII format, but is it still a valid FooBar binary file? Remember that the FooBar format doesn’t say anything about the first four bits. If an application chooses to simply ignore the value of those first four bits then the file content will still be interpreted as a greyish-blue, vertical, small, open, ellipse.

Let’s suppose I increase the amount of blue and save the file so that it now contains the four bytes: 00001001 01000010 01101111 11111111 (or 09 42 6F FF). This is a valid FooBar binary file but it’s no longer valid ASCII as ASCII doesn’t allow a 1 in the first (most significant) bit of any of the 8-bit bytes.

ASCII only provides rules for 128 values (0x00 to 0x7F). This is quite a limited set of characters. It doesn’t include, for example, accented characters (such as é) or more aesthetic punctuation such as “smart quotes” or various length dashes — such as the em-dash. What if I want to add a pound sterling symbol (£)? The ASCII format doesn’t allow it, just as the FooBar format doesn’t allow a triangle. A different format is required.

The ISO-8859-1 encoding (or latin1) also has each character represented by an 8-bit byte but the range goes up to 0xFF (11111111). The first 128 values are identical to ASCII, but there are extra characters available (where the first — most significant — bit is 1) including the pound £ symbol (0xA3) and ÿ (0xFF). This means that my modified FooBar binary file with the extra blue (09 42 6F FF) is also a valid ISO-8859-1 text file. If I open the file in a text editor and stipulate the ISO-8859-1 encoding then it will interpret the contents as a tab followed by the characters “Boÿ”.

Although ISO-8859-1 provides some accented characters and some extra punctuation (such as guillemets « and ») there are still many characters that are unavailable. A more comprehensive format (text encoding) is UTF-8, which is a variable-width character encoding. This means that some characters are represented by more than one byte.

Just as ASCII is a subset of ISO-8859-1, ASCII is also a subset of UTF-8, which means that all the bytes from 0x00 to 0x7F in UTF-8 are identical to ASCII so, for example, 01101111 (6F) still represents a lower case Latin o. However, unlike ISO-8859-1, the non-ASCII characters are identified by two or more bytes in UTF-8. For example, the pound £ symbol requires two bytes: 11000010 10100011 (C2 A3).

Let’s suppose I now have a file containing the four bytes: 11000010 10100011 00110001 00110010 (C2 A3 31 32). If I open this in a text editor, identifying the text encoding as UTF-8, then the first two bytes will be interpreted as the single character £, the next byte is the digit 1 and the final byte is the digit 2, so I have three characters in total “£12” and the file is four bytes long. If I instead identify the text encoding as ISO-8859-1 then each byte is a separate character, where the first byte is the upper case Latin A with circumflex (Â), so I now have four characters in total “Â£12”.
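
This is easy to demonstrate in Python, which lets you decode the same four bytes under different encodings:

data = bytes([0xC2, 0xA3, 0x31, 0x32])
print(data.decode("utf-8"))     # £12  (three characters)
print(data.decode("latin-1"))   # Â£12 (four characters)
# data.decode("ascii") would raise UnicodeDecodeError: ASCII doesn't allow bytes above 0x7F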

Is this still a valid FooBar binary file? Yes, it is, provided we are adopting the lax approach of ignoring the first four bits.

Reddish ellipse with long axis horizontal.

Is this a valid ASCII file? No, because it contains bytes outside of the valid range.

Let’s go back to the original “image1” file and reduce the green to 0x08 and save the image as a file called, say, “testfile”. This contains the four bytes: 00001001 01000010 00001000 01101111 (09 42 08 6F). This is a FooBar binary file but is it also a text file? ASCII defines 00001000 (0x08) as the backspace control code, which is an instruction to move back one space. So this is also a valid ASCII file, but let’s see how it looks if we view it in a text editor. My preferred editor is vim:

Image of vim with a black background showing file contents: a space 8 characters wide followed by the upper case letter B (in white), and then (in blue) the caret symbol followed by the upper case letter H, and then the lower case letter o (in white).

This shows a space eight characters wide (which is the result of the tab 0x09), the upper case letter B (0x42), but this is followed by a sequence in cyan consisting of ^H. It’s in cyan to highlight the fact that it’s not the two characters ^ and H but is a control code with the value 0x08 (H is the eighth letter of the alphabet). This is caret notation and is used to denote control codes. This is followed by the final character (lower case o).

Not all text editors use caret notation. Here’s what this file looks like in gedit:

Image of the gedit (black text on white background): there is a space 8 characters wide (the tab), followed by the upper case letter B, followed by a rectangle containing the digits 0008, followed by a lower case letter o.

In this case the control code is shown using a rectangle with the control code’s hexadecimal value inside it (in this case padded to four digits 0008).

If, on the other hand, I display the file contents using cat in a bash terminal then the result is just a space eight characters wide followed by the lower case o. This is because cat obeys the control code instruction. It first moves the cursor to the next tab stop (which creates the initial space), then it prints the letter B, then it moves the cursor back one space, then it prints the letter o, which overwrites the B.

The purpose of a text editor is to create and edit files. If I type Tab Shift+B Backspace o in the text editor then it will interpret the backspace as an instruction to remove the previous character from the buffer. If I then save the file, it will only contain 00001001 01101111 (the tab character and the lower case Latin o). It won’t contain the unwanted B and the backspace character. Therefore, if I open a file that contains a control code, such as backspace, in a text editor, the editor will assume that I want a visual representation of the character and won’t interpret it as an instruction.

Although this file is valid ASCII, it would normally be considered a binary file rather than a text file because it looks weird if you open it in a text editor.

An application may be able to read a text file (that is, it knows the file format rules), but that doesn’t mean that it will follow the actions assigned to control codes (such as backspace), and there is no guarantee that the font the application is using has an associated glyph for a particular printable character.

UTF-8 also has control characters, such as the zero width joiner, which consists of three bytes (0xE2 0x80 0x8D), and the “variation selector-16” character, which also consists of three bytes (0xEF 0xB8 0x8F). These are used to apply attributes to emoji. For example, the superhero character 🦸 consists of four bytes (0xF0 0x9F 0xA6 0xB8) and the female sign ♀ consists of three bytes (0xE2 0x99 0x80). The sequence of thirteen bytes 0xF0 0x9F 0xA6 0xB8 0xE2 0x80 0x8D 0xE2 0x99 0x80 0xEF 0xB8 0x8F (superhero, zero width joiner, female sign, variation selector-16) identifies the female superhero emoji 🦸‍♀️. However, some applications may not have the function required to implement this and may end up displaying the superhero and female symbols: 🦸♀ (if the font being used has a corresponding glyph for those characters).
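
In Python, you can build the same thirteen-byte sequence from the individual characters:

superhero = "\U0001F9B8"   # 🦸 (F0 9F A6 B8 in UTF-8)
zwj       = "\u200D"       # zero width joiner (E2 80 8D)
female    = "\u2640"       # female sign ♀ (E2 99 80)
vs16      = "\uFE0F"       # variation selector-16 (EF B8 8F)
emoji = superhero + zwj + female + vs16
print(emoji)                          # 🦸‍♀️ if the font and application support it
print(len(emoji.encode("utf-8")))     # 13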

If the font doesn’t have a glyph for a particular character then a “not defined character” glyph may be used instead. This could be a rectangle with the hexadecimal value inside (as with the gedit example above) or it could simply be an open rectangle ▯ or a rectangle containing a question mark. (If a byte is invalid — that is, it’s not valid for the given text format — then the replacement character � is typically used.)

So even if a file has a valid text format, an application that can ordinarily read text files may still encounter some difficulty with certain characters in that file.

What happens if I try to input my original “image1” file into a LaTeX document:

\documentclass{article}
\begin{document}
\input{image1}
\end{document}

The \input{} command expects the file identified in the braces to be a LaTeX file. This means that it expects the file to be a text file that contains LaTeX markup. So the contents of “image1” won’t be interpreted as a FooBar image but will be interpreted as the characters Tab B o o. LaTeX doesn’t interpret the Tab control code as a tabulation instruction but instead treats it as a space. It also ignores any spaces at the start of a line (which allows you to indent your source code to make it easier to read without introducing spurious spaces). The result is a PDF file with the word “Boo”.

Now let’s replace \input{image1} with \input{testfile} (the file shown in vim and gedit above) and try compiling (building) the document with pdflatex. This triggers the following error:

! Package inputenc Error: Unicode character ^^H (U+0008)
(inputenc)                not set up for use with LaTeX.

See the inputenc package documentation for explanation.
Type  H <return>  for immediate help.
 ...                                              
                                                  
l.1 	B^^H
         o
? 

This is complaining about the third byte in “testfile”, which LaTeX is interpreting as the Unicode character U+0008 (the backspace character). LaTeX has no instructions regarding this character as it’s not a character that would ordinarily be found in LaTeX source code. LaTeX knows what this character is (U+0008), but it doesn’t know what to do with it. If I type h Return at the prompt (the question mark at the bottom) then I get the following message:

You may provide a definition with
\DeclareUnicodeCharacter

This is telling me that if I want to use this character then I need to declare it and provide instructions as to how LaTeX should deal with it. If I press Return at this point and let LaTeX carry on processing then it will ignore the backspace character, so the resulting PDF will simply display “Bo”.

In neither of the above cases do I get a PDF with an image of a small ellipse because \input expects the file to contain (La)TeX instructions and will parse it as such.

You may have noticed that the error message above mentions the inputenc package even though the document hasn’t loaded it. In that example, I was using the TeX Live 2021 distribution. If I use an older distribution, say, TeX Live 2016 then I get a different error message and a different help message:

l.1 	B^^H
         o
? h
A funny symbol that I can't read has just been input.
Continue, and I'll forget that it ever happened.

? 

In both cases the backspace character is ignored and the result is the same.

Now let’s try the four-byte file that can be interpreted as the UTF-8 characters “£12”: 11000010 10100011 00110001 00110010 (C2 A3 31 32). With pdflatex from TeX Live 2016, there’s no error message but the log file contains the following warnings:

Missing character: There is no Â in font cmr10!
Missing character: There is no £ in font cmr10!

So this is interpreting the first two bytes of the file as two separate characters, Â (0xC2) and £ (0xA3), but there’s no glyph available for either of these characters in the default font (cmr10). So the PDF just contains “12”. With TeX Live 2021, there are no errors or warnings and the PDF contains “£12”.

Donald Knuth first released TeX in 1978, and Leslie Lamport released LaTeX in 1985. ISO-8859-1 was also first published in 1985, but UTF-8 was designed in 1992. (ASCII was first published in 1963.) So it’s not surprising that the original versions of TeX and LaTeX were designed for single byte text encodings.

The great advantage of UTF-8 is that it covers all Unicode characters (as opposed to ISO-8859-1, which is limited to 256 characters, and ASCII, which is limited to 128 characters). It’s natural that users who wanted to be able to type extended Latin or non-Latin characters into their LaTeX document source code were keen to adopt UTF-8. This is awkward for (La)TeX, which treats each byte as a separate token. The inputenc package (with the utf8 setting) provides a workaround: it makes the first byte of a multi-byte sequence an active character which takes the subsequent byte as its argument. This can be demonstrated by the following UTF-8 document:

\documentclass{article}
\usepackage[utf8]{inputenc}
\begin{document}
\show £
\end{document}

This produces the following message in the transcript:

> �=macro:
->\UTFviii@two@octets �.
l.4 \show �
           �

The \show command displays the definition of the token that follows it. In this case, the command is followed by two tokens: the two bytes (0xC2 and 0xA3) that indicate the £ symbol. So \show picks up the first token (0xC2) and shows the definition. This token (0xC2) is written to the transcript, but the transcript is being viewed in a bash terminal that’s expecting UTF-8 content. The 0xC2 byte isn’t followed by a legitimate byte (as defined by the UTF-8 format) so it’s flagged with the replacement symbol � to denote that it’s invalid. If I view the transcript in vim with the binary mode on, I can see the value of the bytes.

Image of the above transcript message shown in vim in binary mode: the invalid bytes C2 and A3 are shown as <c2> and <a3> (in cyan).

This shows that the 0xC2 byte (octet) has been defined as a macro (command) that expands to \UTFviii@two@octets followed by the byte 0xC2. This internal command is defined to take two arguments: the first is provided (0xC2, the first byte in the two-byte pair) and the second is the token that follows (the second byte in the two-byte pair).

If I press Return to continue processing the document I encounter an error because the second byte (0xA3) of the two-byte pair has become detached. The transcript in the bash terminal at this point is:

! Package inputenc Error: Invalid UTF-8 byte "A3.

See the inputenc package documentation for explanation.
Type  H <return>  for immediate help.
 ...                                              
                                                  
l.4 \show £

If I look at the log file in vim again, but this time without the binary mode on, then vim determines that the file can’t be a UTF-8 file (because it breaks the UTF-8 rules) so it decides that it must be an ISO-8859-1 file:

> Â=macro:
->\UTFviii@two@octets Â.
l.4 \show Â
           £
?

! Package inputenc Error: Invalid UTF-8 byte "A3.

See the inputenc package documentation for explanation.
Type  H <return>  for immediate help.
 ...

l.4 \show £

Changes to the LaTeX kernel over the past few years mean that UTF-8 is now the default encoding for LaTeX document source files, but this trick is still employed. If you want multi-byte UTF-8 characters to be treated as single tokens then you need to switch to a modern TeX engine (XeLaTeX or LuaLaTeX), which natively supports UTF-8.

So how do you tell what format (or encoding) a file is in? That is unfortunately quite difficult. You can parse the file and find out if it breaks a rule (is invalid) to determine if it’s not a particular format. For example, if the file contains a byte larger than 0x7F then it’s definitely not ASCII, or if the file contains a byte such as 0xC2 that isn’t followed by a byte (or bytes) that results in a valid UTF-8 character then the file isn’t UTF-8. However, as illustrated by the “image1” file, just because the file contents satisfy the rules of one format doesn’t mean that it doesn’t also coincidentally satisfy the rules of another format.
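
For example, a rough-and-ready check in Python might try each encoding in turn; bear in mind that success only means the bytes could be in that encoding, not that they are:

def guess_encoding(data):
    # Try the stricter formats first; ISO-8859-1 accepts any byte, so it never fails.
    for encoding in ("ascii", "utf-8", "latin-1"):
        try:
            data.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return "unknown"

print(guess_encoding(bytes([0x09, 0x42, 0x6F, 0x6F])))  # ascii (but it's also the FooBar image!)
print(guess_encoding(bytes([0xC2, 0xA3, 0x31, 0x32])))  # utf-8
print(guess_encoding(bytes([0x09, 0x42, 0x6F, 0xFF])))  # latin-1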

One general rule of thumb is that if the file contains a certain proportion of bytes that represent control characters (such as the earlier backspace example) then it’s likely to be a binary file. However, just because a file only contains bytes that represent printable characters doesn’t mean that the file isn’t a binary file.

Some formats use a “magic marker”: a sequence of bytes at the start of the file that identifies the format. For example, a PDF 1.3 file will start with the bytes 0x25 0x50 0x44 0x46 0x2D 0x31 0x2E 0x33 (which represents the ASCII characters “%PDF-1.3”). However, there’s nothing to stop me from starting a LaTeX file with the lines:

%PDF-1.3
\pdfmajorversion=1
\pdfminorversion=3

In this case, the first line is a comment to remind the author (or anyone else reading the source code) that the next two lines are stipulating that the resulting PDF file must be version 1.3, but this is enough to confuse some applications into thinking that this LaTeX source file is a PDF file.
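
A simple magic-marker check only inspects the first few bytes, which is why a comment like the one above is enough to fool it. Here’s a sketch in Python:

def looks_like_pdf(path):
    # A PDF 1.3 file starts with the bytes for the ASCII characters "%PDF-1.3".
    with open(path, "rb") as f:
        return f.read(8) == b"%PDF-1.3"

# A LaTeX source file that happens to start with the comment line %PDF-1.3
# passes this test just as well as a genuine PDF file does.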

The byte order mark (BOM) is another form of magic marker that’s used to indicate the byte-endianness of a UTF-16 or UTF-32 file. For example, a big endian byte order UTF-16 file will start with the bytes 0xFE 0xFF. UTF-8, on the other hand, only has one byte order so, although the BOM character is defined in UTF-8, there’s no point in using it to indicate the byte order.

Despite this, the BOM character is sometimes used at the start of a file to simply indicate that the file is UTF-8, but this can be problematic. If a text editor automatically inserts it at the start of every file that it saves, then it forces the file to be UTF-8 even if there are no non-ASCII characters in the content. This makes it less compatible with other applications, particularly when the file is a script for a language that has its own magic marker (such as a bash script, which must start with #!).
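
If you need to check for (or strip) a UTF-8 BOM, the three bytes to look for are EF BB BF. For example, in Python:

import codecs

def has_utf8_bom(path):
    # codecs.BOM_UTF8 is the byte string b'\xef\xbb\xbf'
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8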

Another way of indicating the file format is to incorporate the information in the file name. This is typically done in the form of a suffix that starts with a dot (the file extension). For example, “image.jpg” indicates a JPEG file and “image.png” indicates a PNG file. There can be multiple extensions. For example, the file “myDoc.synctex.gz” has the extension “.gz” which means that it uses the gzip compression format. If I uncompress it (gunzip myDoc.synctex.gz) then I will have the file “myDoc.synctex”, which is in the synctex format.

Some operating systems and file managers hide the extension when showing the list of files. This can be particularly annoying for LaTeX users because a directory (folder) can become full of files with the same basename but different extensions. For example, if I create a document called, say, “myDoc.tex” then using pdflatex will, at the very least, create the files “myDoc.log”, “myDoc.aux” and “myDoc.pdf”. Once a table of contents, list of figures, list of tables, bibliography, index, glossaries etc are added, the file list increases significantly and it can be hard to tell which is which if the extensions are hidden.

More generally, hiding the file extensions can have serious security implications. An executable file called, say, “notes.txt.exe” will be displayed as “notes.txt” which gives the impression that it’s just a text file, but if you double-click on it, expecting to open it in a text editor, the file will instead be executed. This is one way in which users can be tricked into running a malicious executable file.

Unfortunately there’s nothing to stop anyone from renaming the file so that it has a different extension. For example, if I rename a PNG file from “image.png” to “image.jpg” then this doesn’t alter the file content — it’s still a PNG file — but it misrepresents the file, making it look like a JPEG file when it’s actually a PNG file. This can confuse an application that tries to determine the file type from the file name extension, and it will try to read the file using the wrong set of rules.

Another way of identifying the file format is with the MIME type but, as with file extensions, the MIME type can be incorrect (either through accident or deliberately).

Returning to my FooBar “image1” file (which doesn’t have an extension), if I forget about it and stumble on it months later, the chances are that I will have forgotten what it was and what I used to create it. My first step will be to try to identify it with the file tool. This returns “image1: ASCII text, with no line terminators” so my next step will be to open it in a text editor, where I will see a tab followed by the word “Boo”. Therefore the developer of the FooBar application really needs to modify the file format so that it includes a magic marker at the start and also decide on a file extension to help identify what type of file it is.

Can I include my FooBar ellipse in my LaTeX document? Not in its FooBar binary format. I would first need to convert it to a graphics format that \includegraphics recognises.

So in summary:

  • All files contain binary data.
  • The file format is the algorithm or set of rules needed to understand the data contained in the file.
  • The term “text file” is used to indicate a file that is written in one of the standard text file formats (such as ASCII, ISO-8859-1 or UTF-8) that is intended to be readable in a text editor (that is, it doesn’t have a high proportion of non-whitespace control codes).
  • The term “binary file” is used to indicate a file that is not a text file.
  • The file (or input) encoding of a text file is the particular text format used to store the textual data in the file.
  • An application can only properly parse or process a file if it recognises and understands the file’s format.
  • If the format is mis-identified then this can either cause outright failure (“invalid format”) or incorrect instructions (such as placing an unwanted Â before a £).

Good Bots and Bad Bots

You’ve probably come across websites that want you to prove that you’re human and not a robot. This may come in the form of a picture challenge (for example, select all the squares with bicycles) or it may simply require you to check a box to assert that you’re not a robot. Perhaps you’re wondering why you need to do this. Why is the website so concerned about being visited by robots? Alternatively, perhaps you’re a website developer and are determined to find a way to keep out all bots.

What is a bot? Are all bots bad?

As with cookies, bots are important tools in the digital world. However, as with cookies, bots can also be used for unwholesome purposes.

“Bot” is short for robot and is simply a piece of software (an application) that visits websites. A bot may follow one link after another, crawling through pages across the World Wide Web. For this reason, they are often called “crawlers” or “spiders”.

Good Bots

If you go to your favourite search engine and type in a keyword or phrase (or use a voice activated request on your mobile device) then the results usually come up fairly quickly. This is only possible because the search engine has an index that has been compiled by bots that have followed link after link, gathering information. Without this index, it would take a very long time to scour all the millions of pages that make up the web to find something relevant.

Not all bots are crawlers. For example, Facebook has a bot that’s used when a post contains a link. The bot is used to check that the link exists and it reads any Open Graph markup. This allows Facebook to include an image and short excerpt to arouse the interest of anyone who views the post. Unlike the search engine bots, this bot doesn’t roam free about the Internet but instead restricts itself to links posted on Facebook pages.

Well behaved bots commonly identify themselves in the user agent string in the form:

bot-name/version URL

For example, “facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)” identifies the Facebook bot (facebookexternalhit), its version number (1.1) and a way of finding out information about the bot.

So these are useful bots that help users to discover interesting sites.

However, even the good bots don’t always honestly identify themselves. For example, if you post a link in the Signal Messaging app, the bot used to fetch the preview information identifies itself as WhatsApp, and this basically seems to be a rehash of the “all browsers identify themselves as Mozilla” problem.

Not So Good Bots

Although the crawlers used by search engines are useful, some crawlers that index sites to provide certain types of information for their users (who may require free or paid accounts to access it) can be a nuisance because they’re not well-behaved. For example, they may not follow the robot instructions stipulated by the website (robots.txt), they may try to access pages that are only intended for human visitors or they may hit the site so hard (that is, they look up pages so fast) that they slow down the site and it becomes unusable for everyone else.

This could be because the bot’s developer made a mistake (a bug in the bot’s code or an inexperienced programmer) or it could be because the developer simply doesn’t care and wants the information quickly regardless of the inconvenience to others (perhaps to satisfy the demands of paying customers). In the long run, this is counter-productive as it will lead to the bot (which is identified in the user agent string) being banned.

Scrapers

Web scraping (or harvesting) is when a bot extracts data from a webpage. In the earlier case of search engines and social media, this data can just be keywords or phrases or the URL for the page image, but some bots are designed to gather all information from a page in order to reproduce it verbatim on another site. This is often done to lure visitors to their own copycat site, which will most likely be stuffed full of adverts and tracking (which makes a profit for their owner). This is usually a violation of intellectual property. Even where the original page is available under a permissive licence, attribution is usually required but is often omitted. This happens a lot for question and answer sites, such as Stack Exchange, or forums.

These bots may well have the user agent string empty or set to the default value for the given API that they are built with.

Trolls and Spambots

These are the types of bots that pages requiring you to identify yourself as a human are mostly trying to block. The user agent string is typically set to a common browser and platform to make the bot appear as though it is a human visitor. These bots search for forms to fill in, such as contact forms to send spam messages or comment forms to advertise dubious products and sites.

While spambots are the digital equivalent of fly-posters, trollbots are the equivalent of poison-pen letter writers. They are created by individuals who take a puckish delight in causing hurt and discord. These bots are designed to search for certain keywords on a page and craft an offensive or divisive comment that relates to the topic. The creators of these bots may have a particular hatred towards a certain group of people, but they can also be chaotic nihilists with a set of offensive comments for every group.

The expression “don’t feed the trolls” has been around for a long time. I remember first encountering it on Usenet back in the early 1990s (accompanied by some ASCII art). It’s very good advice. Don’t give trolls the attention that they are looking for, but, in some cases, the troll posting the offensive comments isn’t human. It’s a bot that has no ability to reason, no feelings, no embarrassment. Its function is solely to post content that its creator programmed into it.

Chatbots can come under both this category and the next. Chatbots in general are just a tool that simulates conversation, and are often used for legitimate services, such as online help, but they are also used by criminals to deceive people. For example, a fraudster might create a fake account on an Internet dating site and use a chatbot to hook victims who believe they are chatting with a human. Once the chatbot has gained the victim’s trust, the fraudster takes over.

Malware Bots

The worst of the bad bots are the ones created by cyber-criminals and they are designed to wreak havoc, stealing data and installing malware. These bots look for dynamic web pages that use parameters and will try to inject malicious code into the parameter values.

For example, the page https://www.dickimaw-books.com/booklist.php?book_id=11 has a parameter (book_id) that identifies a particular edition of a book. (In this case, the second paperback edition of The Private Enemy.) The parameter value (11) uniquely identifies this edition in the database that contains all the title information.

A malicious bot will try altering the parameter value to break into the database. For example, it may start out by simply appending an apostrophe (book_id=11'). If this triggers a syntax error then the site is vulnerable to SQL injection and the bot can then try something far nastier to access the contents of the database.

Another possibility is that the parameter value may be printed on the web page, so the bot will try replacing the value with JavaScript. For example, the bot may start out with a simple alert. If the bot detects an alert box then the site is vulnerable to cross-site scripting (XSS) and the bot can try something more damaging.

Or the parameter value may be the name of a template file, which is used for the main body of the web page, so the bot will try replacing the parameter value with /etc/passwd (or ../etc/passwd etc) in order to trick that web page into revealing the contents of the password file instead.

Bad bots can also disrupt a website by repeatedly accessing pages in rapid succession (a denial of service attack or, where an army of bots are working together, a distributed denial of service attack). This can make the site completely inaccessible to anyone else.

These types of bots rarely identify themselves honestly. The user agent string is typically empty or contains a common browser and platform combination (as with the trolls and spambots). I’ve also encountered attempts at SQL injection where the user agent string was the same as the aforementioned Facebook bot. At first glance, it gives the impression that a Facebook bot has gone rogue (or followed a bad link) but the IP was registered to somewhere in Russia, which seems an unlikely origin for a Facebook bot, so bad bots not only pretend to be human but also try to pass themselves off as legitimate bots.

Sometimes the user agent string will contain “sqlmap”. This is a legitimate pen testing application. However, in many jurisdictions, penetration testing can only be performed by mutual consent between the pen tester and the website owner. If you are a website developer and a pen tester has been hired by your organisation, don’t block bots with this user agent string: the site needs to be tested by an unblocked bot, since most bad bots don’t conveniently identify themselves (and so can’t be blocked that way). If a pen tester hasn’t been engaged, then the tool is being used illegally (which is par for the course with criminals).

So, if you’re a website developer and you want to stop bad bots, remember that you can’t rely on the user agent string: bots pretend to be human, and some humans blank their user agent string for privacy reasons. The first line of defence is to filter input (e.g. ensure that a value that should be numeric really is a number), escape special characters before output (e.g. with htmlentities) and use prepared statements for database queries.
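
This site’s own pages are PHP (hence the htmlentities example), but here’s a sketch of those three defences in Python (the table and column names are made up for illustration):

import html
import sqlite3

def show_book(conn, raw_book_id):
    # Filter: the parameter should be numeric, so reject anything else outright.
    if not raw_book_id.isdigit():
        return "Invalid book id"
    # Prepared statement: the value is passed separately from the SQL, so input
    # such as "11'" or "11 OR 1=1" can't change the structure of the query.
    row = conn.execute(
        "SELECT title FROM books WHERE book_id = ?", (int(raw_book_id),)
    ).fetchone()
    if row is None:
        return "Not found"
    # Escape before echoing the value back into HTML, which blocks cross-site scripting.
    return "<h1>" + html.escape(row[0]) + "</h1>"

The same idea applies in PHP with PDO or mysqli prepared statements.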

If you’re just a regular website user, don’t assume that every comment you read was actually posted by a human and, while captchas may be frustrating, your web browsing experience may be far worse without them.


Update 2021-08-08: added paragraph on Signal in Good Bots section and paragraph on chatbots in Trolls and Spambots section.