Localisation with datatool v3.0+

The article Localisation with tracklang.tex describes the reason why I created the tracklang package. The article Integrating tracklang into Language Packages gives an example of how to integrate tracklang into a language package. The article Using tracklang in Packages with Localisation Features is for those who are writing a package that needs to detect the document’s localisation settings.

This article describes how to write a datatool v3.0+ language resource file. I recommend you first read the previous articles (particularly Using tracklang in Packages with Localisation Features) to understand the purpose of tracklang and how it’s designed to allow packages to input the appropriate language resource files.

Note that if you are using a pre-3.0 version of datatool, there’s no localisation support and the ldf files described below will be ignored.

Overview

As of version 3.0, the datatool package provides localisation support, although it’s actually the underlying base package, datatool-base, that deals with loading tracklang and using its interface to find and input the relevant ldf files.

Unlike datetime2, which has files such as datetime2-en-GB.ldf that are tied to a particular language and region combination, the datatool localisation support is split into language-independent region files, such as datatool-GB.ldf (provided in the separate datatool-regions bundle), and regionless language files, such as datatool-english.ldf (provided in a separate datatool-english module).

This separation of language and region means that, if a language isn’t supported, then at least the region file (if provided) can be loaded. Likewise, if the region isn’t supported but the language is, then the language file can still be loaded. This allows for partial support and it also means that you can mix and match language and region.

The datatool-lang-region.ldf file is only required in cases where a setting is specifically tied to both the language and region. For example, Canadian English has a decimal dot whereas Canadian French has a decimal comma. So the region file datatool-CA.ldf defines the currency symbol (CAD) and the language file datatool-english.ldf provides the English language rules for sorting alphabetically, but an additional file datatool-en-CA.ldf is needed to set up the default number symbols.

This means a slightly different approach is needed for finding and loading the required files. The file search performed by \TrackLangRequireDialect will find the region file, if installed, and will then stop the search, which means that the language file won’t be found. For example, if both datatool-GB.ldf and datatool-english.ldf are installed, then for the tracked locale “en-GB” only datatool-GB.ldf will be found and input.

Therefore datatool-base.sty will instead use \TrackLangRequireDialectOmitDialectLabelOmitOnlyRegion to find the separate region and language files. This command is new to tracklang v1.6.3, so if the command isn’t defined datatool-base.sty will fall back on \TrackLangRequireDialect with a warning.

The command \RequireDatatoolDialect is essentially:

\newcommand\RequireDatatoolDialect[1]{%
 \TrackLangRequireDialectOmitDialectLabelOmitOnlyRegion
 [%
   \ifdefempty{\CurrentTrackedRegion}{}%
     {\TrackLangRequireResource{\CurrentTrackedRegion}}%
   \TrackLangRequireResource{\CurrentTrackedTag}%
 ]%
 {datatool}{#1}%
}
(The actual definition is slightly different to allow for downgrading to use \TrackLangRequireDialect.)
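
As an illustration only, the downgrade could be arranged along the following lines. This is a sketch rather than the code in datatool-base.sty: it assumes etoolbox is loaded (for \ifdef and \ifdefempty), and the warning text is invented for the example.

\ifdef\TrackLangRequireDialectOmitDialectLabelOmitOnlyRegion
{% tracklang is new enough: find the region and language files separately
  \newcommand\RequireDatatoolDialect[1]{%
    \TrackLangRequireDialectOmitDialectLabelOmitOnlyRegion
    [%
      \ifdefempty{\CurrentTrackedRegion}{}%
        {\TrackLangRequireResource{\CurrentTrackedRegion}}%
      \TrackLangRequireResource{\CurrentTrackedTag}%
    ]%
    {datatool}{#1}%
  }%
}
{% older tracklang: the search stops at the first file found
  \PackageWarning{datatool-base}{tracklang is too old to find separate
    region and language files; falling back on
    \string\TrackLangRequireDialect}%
  \newcommand\RequireDatatoolDialect[1]{%
    \TrackLangRequireDialect{datatool}{#1}%
  }%
}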

For example, if the dialect is “british” then \CurrentTrackedRegion will be “GB” so this will first do:

\TrackLangRequireResource{GB}
This means that, if datatool-GB.ldf is installed (and hasn’t already been loaded), then it will be input. After that, the more usual
\TrackLangRequireResource{\CurrentTrackedTag}
is implemented. However, the use of \TrackLangRequireDialectOmitDialectLabelOmitOnlyRegion will cause datatool-british.ldf and datatool-GB.ldf to be omitted from the search. This means that datatool-english.ldf can be found and also input (if it hasn’t already been loaded).

In the case of en-CA, this will first do:

\TrackLangRequireResource{CA}
which will load datatool-CA.ldf (if installed and not already loaded). The current tracked tag is “en-CA” so the next step is:
\TrackLangRequireResource{en-CA}
This will find and input datatool-en-CA.ldf (if installed and not already loaded). This file is provided with datatool-english and ensures that the root language file is also loaded:
\TrackLangRequireResource{english}
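
From the document author’s point of view, the en-CA case might look like the following minimal sketch. It assumes that datatool v3.0+, datatool-regions and datatool-english are all installed, and it uses the locales package option to track the locale.

\documentclass{article}
% Tracks en-CA. The search described above should then input:
%   datatool-CA.ldf      (region: currency)
%   datatool-en-CA.ldf   (language and region: number symbols)
%   datatool-english.ldf (language: required by datatool-en-CA.ldf)
\usepackage[locales={en-CA}]{datatool}
\begin{document}
Sample text.
\end{document}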

Each tracked dialect is iterated over with:

\AnyTrackedLanguages
{%
  \ForEachTrackedDialect{\@dtl@thisdialect}%
  {%
    \RequireDatatoolDialect{\@dtl@thisdialect}%
  }%
}
{}%
(Again, the actual definition is slightly different to allow for downgrading for old versions of tracklang.)

The language code must always be included when tracking with tracklang. An error will occur if the language part of the tag is missing. However, datatool-base.sty will allow just the region code in its locales package option. In this case, it will automatically insert und- at the start of the tag when using \TrackLanguageTag. The “und” code corresponds to “undetermined” and datatool includes datatool-undetermined.ldf which will be loaded in this case. This allows for region-specific localisation without any language support.
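
For example, a document that only wants regional settings could use something like the following sketch (assuming datatool-regions is installed so that datatool-GB.ldf can be found):

\documentclass{article}
% Only the region is given, so datatool-base tracks the tag und-GB.
% This should input datatool-GB.ldf and datatool-undetermined.ldf
% but no language-specific file.
\usepackage[locales={GB}]{datatool}
\begin{document}
Sample text.
\end{document}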

The region files deal with setting the default currency and number characters (group separator and decimal symbol). They also provide a way of parsing numeric dates and times with the default DMY, MDY or YMD applicable to the region. This doesn’t require too much knowledge of the region as the information is easily available from sources such as Wikipedia. The files are much the same (although some may require the currency to be defined) so it’s relatively easy for me to add them to the datatool-regions bundle.

The language files not only deal with defining token list type commands to expand to the appropriate word (or words) but also need to deal with the rather more complex lexicography associated with the language. This is beyond my skill set for anything other than English. Additionally, support can be added for parsing dates that include textual elements, such as month names. These files are therefore distributed as separate modules, which will need to be installed if any of the language elements are required. I have provided datatool-english, which can be used as an example.

This separation of language and region not only makes it easier to divide maintenance across those with the best skills but also makes it possible to mix and match language and region as well as allowing minimal regional support even if the language support is missing.

Supplementary Packages

The supplementary packages databib.sty and person.sty also have localisation support, but in this case there’s no need for the separation of language and region.

The person package only needs \TrackLangRequireDialect to find the files:

\newcommand*{\RequirePersonDialect}[1]{%
 \TrackLangRequireDialect{person}{#1}%
}

And the code to iterate over all dialects:

\AnyTrackedLanguages
{%
  \ForEachTrackedDialect{\@dtl@thisdialect}%
  {%
    \RequirePersonDialect{\@dtl@thisdialect}%
  }%
}%
{}%

Similarly for databib.sty:

\newcommand*{\RequireDataBibDialect}[1]{%
 \TrackLangRequireDialect{databib}{#1}%
} 
and:
\AnyTrackedLanguages
{%
  \ForEachTrackedDialect{\@dtl@thisdialect}%
  {%
    \RequireDataBibDialect{\@dtl@thisdialect}%
  }%
}%
{}%

Region Files

The region files are all bundled in datatool-regions, which needs to be installed separately. Originally, I was going to include them with datatool but even a small update to the datatool bundle requires a time-consuming testing and build sequence to make it ready for upload, and if the package is already undergoing modifications pending a new version, those changes need to be completed first. By distributing datatool-regions separately, support for a new region can quickly be added, without having to wait for the completion of a new version of datatool.

As described above, the region filename should be in the form datatool-region.ldf where region is the two-letter uppercase region code. The first line of the file should use \TrackLangProvidesResource to identify the file and version.

If applicable, the number group and decimal characters should be set up. This part should be omitted if the number format for the region also depends on the language. To allow for extra flexibility an intermediate command is defined, which will be added to the captions hook. For example, datatool-GB.ldf defines \datatoolGBSetNumberChars.

The currency symbol should then be defined, except for “EUR” which is already defined in datatool-base.sty. Once the currency symbol is defined, the captions hook should be adjusted to switch to that currency. Again, intermediate commands allow for extra flexibility. For example, datatool-GB.ldf defines:

\newcommand\datatoolGBsetcurrency{%
  \DTLsetdefaultcurrency{GBP}%
  \renewcommand\DTLCurrentLocaleCurrencyDP{2}%
}
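
The intermediate command is what makes this customizable. As a sketch of the idea, a document could keep the GB number characters but switch the default currency to euros by redefining \datatoolGBsetcurrency in the preamble. This assumes that babel is loaded, so that the captions hook is run again at the start of the document and picks up the new definition. Without a language package the redefinition may come too late to have any effect.

\documentclass{article}
\usepackage[british]{babel}
\usepackage{datatool}
% Override the currency part of the GB region settings.
% EUR is already defined by datatool-base.
\renewcommand\datatoolGBsetcurrency{%
  \DTLsetdefaultcurrency{EUR}%
  \renewcommand\DTLCurrentLocaleCurrencyDP{2}%
}
\begin{document}
Sample text.
\end{document}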

The numeric date and time parsing commands can also be defined, but this feature is still experimental. For example, datatool-GB.ldf defines \datatoolGBsettemporalparsers and \datatoolGBsettemporalformatters.

The token list variable \l_datatool_current_region_tl should be set (in the captions hook) to the region code.

Again, intermediate commands allow for extra flexibility, so a single command is defined to add all region settings to the captions hook. For example, datatool-GB.ldf defines \DataToolBaseGB:

\newcommand \DataToolBaseGB
 {
  \datatoolGBSetNumberChars
  \datatoolGBsetcurrency
  \datatoolGBsettemporalparsers
  \datatoolGBsettemporalformatters
  \tl_set:Nn \l_datatool_current_region_tl { GB }
 }
and then adds it to the captions hook:
\TrackLangAddToCaptions{\DataToolBaseGB}
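
Putting these steps together, a region file for a hypothetical region code XX might be laid out as in the following sketch. The region code, the version information and the choice of number characters and currency are all invented for illustration, the experimental temporal commands are omitted, and the files actually distributed in datatool-regions may be implemented differently. The only datatool commands assumed beyond those shown above are \DTLsetnumberchars, whose arguments are the number group character and the decimal character.

% datatool-XX.ldf (hypothetical region XX)
\TrackLangProvidesResource{XX}[2025/01/01 v1.0 (illustrative)]

\ExplSyntaxOn

% Number group character "." and decimal character ","
\newcommand \datatoolXXSetNumberChars
 {
  \DTLsetnumberchars { . } { , }
 }

% EUR is already defined by datatool-base, so there is
% no need to define the currency in this example
\newcommand \datatoolXXsetcurrency
 {
  \DTLsetdefaultcurrency { EUR }
  \renewcommand \DTLCurrentLocaleCurrencyDP { 2 }
 }

% Single command that applies all the region settings
\newcommand \DataToolBaseXX
 {
  \datatoolXXSetNumberChars
  \datatoolXXsetcurrency
  \tl_set:Nn \l_datatool_current_region_tl { XX }
 }

\ExplSyntaxOff

\TrackLangAddToCaptions{\DataToolBaseXX}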

Language Files

The root language filename should be in the form datatool-language.ldf where language is tracklang’s root language label. The first line of the file should use \TrackLangProvidesResource to identify the file and version.

The word handler hook \DTLCurrentLocaleWordHandler needs to be defined to ensure that word sorting will use the ordering that matches the language’s alphabet. This depends on the file encoding. Although LaTeX now defaults to UTF-8, it’s helpful to also provide support for other encodings that might be used with the language. For example, datatool-english.ldf supports UTF-8 and Latin-1. In the event that a different encoding is used, US-ASCII is provided as a fallback.

This is implemented by providing the files datatool-english-utf8.ldf, datatool-english-latin1.ldf and datatool-english-ascii.ldf. The code to load the applicable file is:

% Try loading datatool-english-<encoding>.ldf
\TrackLangRequestResource{english-\TrackLangEncodingName}
{
% Not found, fallback on datatool-english-ascii.ldf
  \TrackLangRequireResource{english-ascii}
}
Again, intermediary commands are provided. Each encoding file defines \DTLenLocaleHandler. The language hook then needs to set \DTLCurrentLocaleWordHandler to this command.

In the case of datatool-english.ldf, only the locale handler macro is sensitive to the file encoding. The rest of the definitions can all be placed in datatool-english.ldf.

The commands below use LaTeX3 syntax, so they need to be preceded by \ExplSyntaxOn. Don’t forget to switch it off again with \ExplSyntaxOff at the end.

To obtain the first letter of a word using English orthography:

\newcommand \DTLenLocaleGetInitialLetter [ 2 ]
 {
   \datatool_get_first_letter:nN { #1 } #2
 }

The letter group commands:

\newcommand \DTLenSetLetterGroups
 {
  \renewcommand \dtllettergroup [ 1 ]
    { \text_titlecase_first:n { ##1 } }
  \renewcommand \dtlnonlettergroup [ 1 ] { Symbols }
  \renewcommand \dtlnumbergroup [ 1 ] { Numbers }
  \renewcommand \dtlcurrencygroup [ 2 ] { Currency }
 }

The date and time parsing commands can also be defined, but this feature is still experimental.

The token list variable \l_datatool_current_language_tl should be set (in the captions hook) to the language code.

As with the region file, there is a single command to set up everything that will be added to the captions hook:

\newcommand \DataToolBaseEnglish
 {
  \let
    \DTLCurrentLocaleWordHandler
    \DTLenLocaleHandler
  \let
    \DTLCurrentLocaleGetInitialLetter
    \DTLenLocaleGetInitialLetter
  \DTLenSetLetterGroups
  \let
   \DTLCurrentLocaleGetMonthNameMap
   \datatool_en_get_monthname_map:n
  \let
   \DTLCurrentLocaleIfpmTF
   \datatool_en_if_pm:nTF
  \tl_set:Nn \l_datatool_current_language_tl { en }
  \renewcommand \DTLandname { and }
 }

\ExplSyntaxOff

\TrackLangAddToCaptions{\DataToolBaseEnglish}

In the case of specific language and region combinations, the file should be provided with the other ldf files for that language. For example, datatool-english includes datatool-en-CA.ldf and datatool-en-ZA.ldf.

Again, the file should first identify itself with \TrackLangProvidesResource. Then it should ensure the root language is loaded. The region file should automatically be loaded before this file is loaded, but you can make sure that it has been. For example, datatool-en-CA.ldf has:

\TrackLangRequireResource{CA}
\TrackLangRequireResource{english}

Then just add the applicable commands to the captions hook. Again, intermediary commands make it easier to customize. For example, datatool-en-CA.ldf defines \datatoolEnglishCASetNumberChars to set the number group and decimal characters and \DataToolBaseEnglishCA, which is added to the captions hook:

\TrackLangAddToCaptions{\DataToolBaseEnglishCA}
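
The overall shape of such a file is sketched below. The command names are the ones mentioned above, but the version information is invented and the bodies are a guess at the minimum required: they use \DTLsetnumberchars with a comma as the number group character (an assumption) and a dot as the decimal character. The datatool-en-CA.ldf file actually distributed with datatool-english may do more than this.

% Sketch of a combined language and region file
\TrackLangProvidesResource{en-CA}[2025/01/01 v1.0 (illustrative)]

% Make sure the region and root language files have been loaded
\TrackLangRequireResource{CA}
\TrackLangRequireResource{english}

% Canadian English: decimal dot
\newcommand\datatoolEnglishCASetNumberChars{%
  \DTLsetnumberchars{,}{.}%
}

% Single command that applies all the en-CA settings
\newcommand\DataToolBaseEnglishCA{%
  \datatoolEnglishCASetNumberChars
}

\TrackLangAddToCaptions{\DataToolBaseEnglishCA}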

The language bundle should also provide support for databib.sty and person.sty (supplementary packages supplied with datatool). These are more straightforward as they don’t typically require the region, so only databib-language.ldf and person-language.ldf need be provided.

For example, datatool-english provides databib-english.ldf and person-english.ldf.