When you upload your project files on this site, there is an
encoding selector which gives you a choice of ASCII, Latin 1 or
UTF-8. Set this to the encoding that your files actually use. If you set it
incorrectly, some characters may appear incorrectly in the PDF
(or not appear at all), and the mismatch will trigger warnings and error messages.
Computers internally store characters (such as a
or
$
) as numbers (or, more specifically, as binary information). The
encoding specifies the mapping between a number (as a byte or sequence of
bytes) and the corresponding character. For example, the binary value
01001110
(78 in decimal) represents the Latin capital
N
and 01101110
(110 in decimal) represents the Latin
small n
. Computer programmers commonly write these values in hexadecimal:
for example, 0x4E for Latin capital N
and 0x6E for Latin small
n
.
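The mapping described above can be inspected directly in Java. The following is only a minimal sketch: the class name `CharCodes` and the helper `describe` are hypothetical names chosen for illustration, not part of PASS.

```java
class CharCodes {
    // Formats a character's code as 8-bit binary, decimal and hexadecimal.
    static String describe(char c) {
        // toBinaryString drops leading zeros, so pad to 8 digits.
        String binary = String.format("%8s", Integer.toBinaryString(c)).replace(' ', '0');
        return String.format("%s = %s (binary), %d (decimal), 0x%X (hex)",
                c, binary, (int) c, (int) c);
    }

    public static void main(String[] args) {
        System.out.println(describe('N')); // N = 01001110 (binary), 78 (decimal), 0x4E (hex)
        System.out.println(describe('n')); // n = 01101110 (binary), 110 (decimal), 0x6E (hex)
    }
}
```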
ASCII (or, more strictly, US-ASCII) provides mappings for 128
characters, ranging from 0x0 (the null character) to 0x7F (the delete
character). Anything outside this range is invalid.
The characters from 0x0 to 0x1F and the final character
0x7F are non-printable characters (control codes) that include
the horizontal tab (0x9), line feed (0xA), form feed (0xC) and carriage
return (0xD). The code 0x20 is the space character.
The codes from 0x21 to
0x2F represent the following punctuation characters:
!
(exclamation mark), "
(straight quote),
#
(hash), $
(dollar), %
(percent),
&
(ampersand), '
(straight
apostrophe), (
(left parenthesis), )
(right
parenthesis), *
(asterisk), +
(plus),
,
(comma), -
(hyphen-minus),
.
(full stop or period), /
(solidus or
forward slash).
The codes from 0x30 to 0x40 start with the decimal
digits: 0
to 9
then
:
(colon), ;
(semi-colon), <
(less than),
=
(equals), >
(greater than),
?
(question mark), @
(at).
The codes from 0x41 to 0x5A represent the Latin capitals
A
to Z
. These are followed by
[
(left square bracket), \
(backslash),
]
(right square bracket), ^
(circumflex),
_
(underscore), `
(grave or backtick).
The codes from 0x61 to 0x7A represent the Latin lower case
a
to z
. (So you can obtain the lower case by
simply adding 0x20 to the corresponding capital.) Then follows:
{
(left curly bracket), |
(vertical line
or pipe), }
(right curly bracket) and ~
(tilde).
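The 0x20 offset between capitals and lower case letters can be sketched as follows. The class `AsciiCase` and the method `toLowerAscii` are hypothetical names used only for this illustration, and the trick applies only to the ASCII letters `A` to `Z`. For general text, the standard `Character.toLowerCase` is the safer choice.

```java
class AsciiCase {
    // Converts an ASCII capital (0x41 to 0x5A) to lower case by adding 0x20.
    // Any other character is returned unchanged.
    static char toLowerAscii(char c) {
        if (c >= 0x41 && c <= 0x5A) {
            return (char) (c + 0x20);
        }
        return c;
    }

    public static void main(String[] args) {
        System.out.println(toLowerAscii('N')); // n
        System.out.println(toLowerAscii('#')); // # (not a capital, so unchanged)
    }
}
```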
Note that ASCII doesn’t include accented characters (such as
é
), other currency symbols (such as £
), long dashes
(such as —
), or “smart quotes”. (Some fonts may render the
straight quote "
and straight apostrophe '
with a
curl so they have the appearance of smart closing quotes but they are different
characters.)
ASCII is a very limited set of characters, but it is a subset
of many other encodings, so it’s the most portable. Only select
ASCII if you’re sure that you have no non-ASCII characters in your
code or written to STDOUT/STDERR. For example, suppose
your (Java) code contains:
System.out.println("\u00A3");
then this source code only contains ASCII but a non-ASCII character will be
written to STDOUT. In this case, select UTF-8 not ASCII.
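The difference between the source text and the text it prints can be checked with a small range test. This is only a sketch: `AsciiCheck` and `isAscii` are hypothetical names introduced for illustration.

```java
class AsciiCheck {
    // Returns true if every character in the text lies in the ASCII range (0x0 to 0x7F).
    static boolean isAscii(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The line of source code exactly as typed in the editor (escape left unexpanded).
        String asTyped = "System.out.println(\"\\u00A3\");";
        // The string that the program actually writes: the pound sign.
        String asPrinted = "\u00A3";
        System.out.println(isAscii(asTyped));   // true
        System.out.println(isAscii(asPrinted)); // false
    }
}
```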
Latin 1 (or ISO-8859-1) has mappings ranging from 0x0 to 0xFF.
As with ASCII, every character is represented by a single byte.
The first 128 characters are identical to ASCII. The codes in the range 0x80 to
0x9F aren’t printable. The remaining characters from 0xA0 to 0xFF
consist of additional punctuation and symbols (such as £
) and
extended Latin characters (such as é
and ø
).
This doesn’t include characters such as “smart quotes”, long
dashes or emoticons and doesn’t include any non-Latin alphabets.
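One way to test whether some text fits within Latin 1 is to ask a Java charset encoder. In this sketch the class `Latin1Check` and the method `fitsLatin1` are hypothetical names. `StandardCharsets.ISO_8859_1` and `CharsetEncoder.canEncode` are part of the standard Java library.

```java
import java.nio.charset.StandardCharsets;

class Latin1Check {
    // Returns true if the text can be encoded as Latin 1 (ISO-8859-1).
    static boolean fitsLatin1(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(fitsLatin1("caf\u00E9 \u00A33"));   // true: e-acute and pound sign
        System.out.println(fitsLatin1("\u201Cquoted\u201D")); // false: smart quotes
        System.out.println(fitsLatin1("a \u2014 b"));         // false: long dash
    }
}
```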
UTF-8 is a variable-width character encoding that uses one to four one-byte code units
to identify each Unicode code point, and it covers all Unicode characters.
The first 128 characters are identical to ASCII,
with each character identified by a single byte. Outside of that range, characters
are identified by multiple bytes.
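The variable width can be seen by counting the bytes that UTF-8 uses for different characters. This is a minimal sketch with hypothetical names (`Utf8Width`, `utf8Length`). The sample characters are written as Unicode escapes so that the source itself stays within ASCII.

```java
import java.nio.charset.StandardCharsets;

class Utf8Width {
    // Returns the number of bytes UTF-8 uses to encode the given text.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("#"));            // 1 (ASCII range)
        System.out.println(utf8Length("\u00A3"));       // 2 (pound sign)
        System.out.println(utf8Length("\u2014"));       // 3 (long dash)
        System.out.println(utf8Length("\uD83D\uDE00")); // 4 (U+1F600, a surrogate pair in Java)
    }
}
```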
For example, the hash character #
is represented by a single
byte 00100011
with all three types of encoding
(ASCII, Latin 1 and UTF-8). By contrast, the character £
can’t be
represented in ASCII, but is represented in Latin 1 with the single byte
10100011
(0xA3) and is represented in UTF-8 with two bytes
11000010
(0xC2) and 10100011
(0xA3). So if you
misidentify a UTF-8 file as Latin 1, those two bytes will be treated as two
separate characters (Â
and £
)
instead of as a single character (£
). If you misidentify the file as
ASCII then both bytes are invalid, as they are both outside the valid range
(> 0x7F).
So if you selected ASCII or Latin 1 and any non-ASCII character appears
as two or more characters (such as “Â£” instead of “£”) then you
should’ve chosen UTF-8.
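The misidentification described above can be reproduced by encoding the pound sign as UTF-8 and then decoding those bytes as Latin 1. This is a sketch under the assumption that the tool decoding the file behaves like Java's `ISO_8859_1` decoder. The names `Mojibake` and `utf8ReadAsLatin1` are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class Mojibake {
    // Encodes text as UTF-8, then (wrongly) decodes the bytes as Latin 1.
    static String utf8ReadAsLatin1(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String pound = "\u00A3";
        byte[] utf8Bytes = pound.getBytes(StandardCharsets.UTF_8);
        // Mask with 0xFF because Java bytes are signed.
        System.out.printf("0x%02X 0x%02X%n", utf8Bytes[0] & 0xFF, utf8Bytes[1] & 0xFF); // 0xC2 0xA3

        String garbled = utf8ReadAsLatin1(pound);
        System.out.println(garbled.length());                // 2
        System.out.println(garbled.equals("\u00C2\u00A3")); // true: A-circumflex, then pound sign
    }
}
```

If output looks like the garbled string here, the file was most likely saved as UTF-8 but read as Latin 1.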
UTF-8 is by far the most common encoding on the web and if
you have any non-ASCII characters this is the best encoding to use.
There are many other encodings, such as UTF-16, but these aren’t supported by PASS.
As a general rule of thumb, choose UTF-8, but make
sure that your IDE is also set to save files as UTF-8.