Gallery: datatool performance
Documents that use the datatool package can take a long time to build, but there are various ways to speed things up a bit. The test files described here can be found in the src/tests/performance subdirectory of datatooltk’s GitHub repository.
The tests described here were compiled with datatool.sty v2.29 and datatooltk v1.8. Older versions may cause an undefined control sequence error or unknown option error.
1. Load Speed
Simply loading data can take a while, especially for large amounts of data. The tests in this section load data using various methods, and then simply iterate over the data, displaying the first three columns:
\DTLforeach*{testdata}
{\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
{\Fieldi, \Fieldii, \Fieldiii.\par}

or (where \testdata has been defined to expand to the database label):

\DTLforeach*{\testdata}
{\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
{\Fieldi, \Fieldii, \Fieldiii.\par}
Each test document has an associated bash script to perform the complete document build. The build time is then calculated using the GNU time command:
/usr/bin/time -f "$filename:\t%e\t%E" --append --output=$logfile $filename
where $filename is the name of the document build bash script ($logfile is just the name of the file containing the results of all the tests). The results are summarised in Table 1.
Table 1: Load speed

| Test | Build Time (minutes:seconds) |
|---|---|
| 1.1 \DTLloaddb (1000x3) | 0:12.71 |
| 1.2 \DTLloadrawdb (1000x3) | 0:12.00 |
| 1.3 \DTLnewdb (1000x3) | 0:09.58 |
| 1.4 datatooltk --nomap-tex-specials --csv (1000x3) | 0:01.03 |
| 1.5 datatooltk --map-tex-specials --csv (1000x3) | 0:01.14 |
| 1.6 datatooltk --in with interpreted \DTLnewdb (1000x3) | 0:01.68 |
| 1.7 datatooltk skipped (1000x3) | 0:00.71 |
| 1.8 \DTLloaddb (1000x6) | 0:39.00 |
| 1.9 \DTLloadrawdb (1000x6) | 0:42.30 |
| 1.10 \DTLnewdb (1000x6) | 0:39.62 |
| 1.11 datatooltk --nomap-tex-specials --csv (1000x6) | 0:01.52 |
| 1.12 datatooltk --map-tex-specials --csv (1000x6) | 0:01.65 |
| 1.13 datatooltk --in with interpreted \DTLnewdb (1000x6) | 0:02.34 |
| 1.14 fp.sty with datatooltk skipped (1000x6) | 0:01.11 |
| 1.15 pgfmath.sty with datatooltk skipped (1000x6) | 0:01.11 |
| 1.16 datatooltk --nomap-tex-specials --remove-columns (1000x6) | 0:01.44 |
| 1.17 datatooltk --remove-columns skipped (1000x6) | 0:01.10 |
1.1 \DTLloaddb (1000x3)

This test uses \DTLloaddb to load the CSV file sample-1000x3.csv, which has a header row:

Field1,Field2,Field3

and 1000 rows and 3 columns containing randomly generated words. This data includes \& in one of the entries:

A\&M,amaurosis,world-plaguing
The document (test-csv-doc1.tex) simply loads this data using:
\DTLloaddb{testdata}{sample-1000x3.csv}
The build process is simply:
pdflatex test-csv-doc1.tex
1.2 \DTLloadrawdb (1000x3)
This test is much like the previous one, but it uses a slightly
modified CSV file sample-1000x3-spchar.csv.
This only differs by one character from sample-1000x3.csv as it
doesn’t escape the &
special character:
A&M,amaurosis,world-plaguing
The test document (test-csv-doc3.tex) loads this data using:
\DTLloadrawdb{testdata}{sample-1000x3-spchar.csv}

which converts & to \&.
This method is intended for CSV files that have been generated by a
non-TeX aware tool, so the file shouldn’t contain any commands.
Neither \DTLloaddb
nor \DTLloadrawdb
can
load CSV files that contain multiline cells.
The build process is simply
pdflatex test-csv-doc3.tex
1.3 \DTLnewdb (1000x3)
For this test, the CSV file from Test 1.1 (sample-1000x3.csv) was converted to sample-1000x3-newdb.tex, which contains datatool’s user commands provided to create a database. The database and first few rows are created using (line breaks added for clarity):
\newcommand{\testdata}{testdata}
\DTLnewdb{\testdata}
\DTLnewrow{\testdata}
\DTLnewdbentry{\testdata}{Field1}{graced}
\DTLnewdbentry{\testdata}{Field2}{premankind}
\DTLnewdbentry{\testdata}{Field3}{bachelor}
\DTLnewrow{\testdata}
\DTLnewdbentry{\testdata}{Field1}{toad-housing}
\DTLnewdbentry{\testdata}{Field2}{undersighted}
\DTLnewdbentry{\testdata}{Field3}{levels}
\DTLnewrow{\testdata}
\DTLnewdbentry{\testdata}{Field1}{A\&M}
\DTLnewdbentry{\testdata}{Field2}{amaurosis}
\DTLnewdbentry{\testdata}{Field3}{world-plaguing}
The test document (test-newdb-doc.tex) loads this data using:
\input{sample-1000x3-newdb.tex}
The build process is simply:
pdflatex test-newdb-doc.tex
1.4 datatooltk --nomap-tex-specials --csv (1000x3)
For this test, the CSV file from Test 1.1
(sample-1000x3.csv)
is first converted to a .dbtex file by datatooltk. Since
the CSV file contains \&
the settings
--nomap-tex-specials
and --nocsv-escape
are required.
The resulting file is then input in the document (test-dbtex-doc.tex) using:
\DTLloaddbtex{\testdata}{data.dbtex}
The build process is now:
datatooltk --nomap-tex-specials --nocsv-escape --csv sample-1000x3.csv -o data.dbtex
pdflatex -jobname test-dbtex-doc1 test-dbtex-doc
1.5 datatooltk --map-tex-specials --csv (1000x3)
This is like the previous example, except that it uses the CSV file
with the unescaped & special character
(sample-1000x3-spchar.csv). This means that the
--map-tex-specials
switch is required. The CSV file
doesn’t contain any backslash characters so no escape character is
required, but I’ve set it anyway for consistency. The document is
exactly the same as for the previous test. This build process is
now:
datatooltk --map-tex-specials --csv-escape '\' --csv sample-1000x3-spchar.csv -o data.dbtex
pdflatex -jobname test-dbtex-doc4 test-dbtex-doc
1.6 datatooltk --in with interpreted \DTLnewdb (1000x3)
As from version 1.7, datatooltk will first assume that an input file (optionally identified by --in; not an imported file, such as a CSV file) is in the format created by \DTLsaverawdb. If the contents of the file don't match this exact format, then datatooltk will use the TeX Parser Library to try to interpret the contents. This is slower, but it means that datatooltk can now read files like sample-1000x3-newdb.tex that contain the database construction commands \DTLnewdb, \DTLnewrow and \DTLnewdbentry.
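For reference, a database that has been constructed within a document can be written out in that \DTLsaverawdb format so that datatooltk (or a later run) can load it quickly. A minimal sketch (the output file name here is arbitrary):

% write the database whose label is stored in \testdata to file
% using datatool's internal (raw) format
\DTLsaverawdb{\testdata}{testdata.dbtex}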
This test still uses the same document test-dbtex-doc.tex from the previous two tests. The build process is now:
datatooltk --in sample-1000x3-newdb.tex -o data.dbtex
pdflatex -jobname test-dbtex-doc5 test-dbtex-doc
1.7 datatooltk skipped (1000x3)
The advantage of using datatooltk is that this step can be skipped in the build process if the data hasn’t changed. This is useful if you can apply conditionals in an automated build sequence. For example, with arara version 4.0 you can add the following comment lines:
% arara: datatooltk: {input: sample-1000x3-newdb.tex, output: data.dbtex} if changed(toFile("sample-1000x3-newdb.tex")) || missing(toFile("data.dbtex"))
% arara: pdflatex
This will only run datatooltk if the database file (data.dbtex) needs updating. Compare this with the first three tests that use \DTLloaddb, \DTLloadrawdb and \DTLnewdb, where on every LaTeX run the data must be parsed and converted into datatool's internal format.
This test was performed immediately after the previous test, so the file data.dbtex is present and contains 1000 rows and 3 columns. This mimics a subsequent build where the original data hasn’t been modified and therefore the datatooltk invocation can be skipped, so the build process is simply:
pdflatex -jobname test-dbtex-doc6 test-dbtex-doc
1.8 \DTLloaddb (1000x6)
This test is much like Test 1.1, except that it loads a different CSV file (sample-1000x6.csv) that has 1000 rows and 6 columns of randomly generated data. In this case there are no special characters to worry about. The test document (test-csv-1000x6-doc1.tex) still only accesses the first three columns, but all six columns must be parsed when the data is loaded. The build process is simply:
pdflatex test-csv-1000x6-doc1
1.9 \DTLloadrawdb (1000x6)
This is almost identical to the previous test but uses \DTLloadrawdb instead:

\DTLloadrawdb{testdata}{sample-1000x6.csv}

The same CSV file is used. The build process is simply
pdflatex test-csv-1000x6-doc2
1.10 \DTLnewdb (1000x6)
For this test, the CSV file from the previous test (sample-1000x6.csv) was converted to sample-1000x6-newdb.tex, which contains datatool’s user commands provided to create a database. Again, the data contains six columns but the document (test-newdb-1000x6-doc.tex) only uses the first three. The build process is simply
pdflatex test-newdb-1000x6-doc
1.11 datatooltk --nomap-tex-specials --csv (1000x6)
For this test, the CSV file sample-1000x6.csv is first converted to a .dbtex file by datatooltk. The test document is test-dbtex-doc.tex (the same as for the other tests using datatooltk). There are no special characters to worry about so no mapping is required. The build process is:
datatooltk --nomap-tex-specials --nocsv-escape --csv sample-1000x6.csv -o data.dbtex
pdflatex -jobname test-dbtex-1000x6-doc1 test-dbtex-doc
1.12 datatooltk --map-tex-specials --csv (1000x6)
Although sample-1000x6.csv doesn’t contain any special characters,
this test uses --map-tex-specials
for comparison. The
build process is:
datatooltk --map-tex-specials --csv-escape '\' --csv sample-1000x6.csv -o data.dbtex
pdflatex -jobname test-dbtex-1000x6-doc2 test-dbtex-doc
1.13 datatooltk --in with interpreted \DTLnewdb (1000x6)
This uses datatooltk to process the file sample-1000x6-newdb.tex (from Test 1.10) with the TeX Parser Library. The build process is:
datatooltk --in sample-1000x6-newdb.tex -o data.dbtex
pdflatex -jobname test-dbtex-1000x6-doc3 test-dbtex-doc
1.14 datatooltk skipped (1000x6)
Again it’s useful to consider a conditional build where the datatooltk step can be omitted if the original data hasn’t changed. So this test just builds the document (test-dbtex-doc.tex) using the data.dbtex file created from the previous test. The build process is simply:
pdflatex -jobname test-dbtex-1000x6-doc4 test-dbtex-doc
1.15 pgfmath with datatooltk skipped (1000x6)
The datatool package will either load the
fp or
pgfmath package to perform
floating point arithmetic. The default is to use fp. This
test is the same as the previous test except that the test document
(test-dbtex-pgfmath-doc.tex) loads
datatool with the math=pgfmath
option:
\usepackage[math=pgfmath]{datatool}
Again this test skips the datatooltk call in the build process as the data.dbtex file doesn't need updating.
pdflatex -jobname test-dbtex-1000x6-doc5 test-dbtex-pgfmath-doc

Comparing the build times between this test and the previous one shows the difference in package loading time, which turns out to be insignificant. (No numerical computations or comparisons are made using either of those packages in these tests, though, so there may still be some difference in performance if those commands are required.)
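For example, a command such as \DTLadd performs its arithmetic with whichever math package was selected, so documents that use such commands may behave differently. A minimal illustration (\result is an arbitrary command name introduced here):

% adds the two values with the selected math package (fp or pgfmath)
% and stores the result in \result
\DTLadd{\result}{1.5}{2.25}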
1.16 datatooltk --nomap-tex-specials --remove-columns (1000x6)
The document only requires the first three columns, so this test strips the unrequired columns before saving to the .dbtex file. The column removal is always performed last by datatooltk, which allows a column to be used for sorting, merging or filtering purposes even if that column isn't actually required in the document. Although it takes time to remove the unwanted columns, time is saved by not writing those extra columns to the .dbtex file, which in turn reduces the amount of data the document has to load to that of the smaller 1000x3 database from the earlier tests.
datatooltk --nomap-tex-specials --nocsv-escape --csv sample-1000x6.csv -o data.dbtex --remove-columns "4-"
pdflatex -jobname test-dbtex-1000x6-doc8 test-dbtex-doc
1.17 datatooltk --remove-columns skipped (1000x6)
For comparison with Test 1.14, the data.dbtex from the previous test was used without recreating it. The build process is simply:
pdflatex -jobname test-dbtex-1000x6-doc9 test-dbtex-doc
This shows the reduction in build time when the unrequired columns are absent (which reduces the database size to that of the one in Test 1.6).
2. Sort Speed
Sorting is typically the most time-consuming operation performed on
a database. The datatool package provides \dtlsort
that
takes a comparison handler macro as the final argument. There are
four pre-defined handler commands, but they all use a fairly
primitive character code comparison, so if you are sorting according
to a language that contains extended Latin or non-Latin characters,
then you simply can’t sort the data in a locale-sensitive manner within
TeX unless you provide a sort handler macro that can compare
characters according to your alphabet. This would really require a
Unicode-aware engine (XeLaTeX or LuaLaTeX) as TeX considers UTF-8
characters as two tokens. So not only is it faster to sort using an
external tool, it’s also the only practical way of sorting according
to a particular locale.
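For reference, an in-document sort with one of the pre-defined handlers looks like this (a sketch based on the test database; \dtlcompare is the case-sensitive handler, \dtlicompare the case-insensitive one, and, if I've remembered the syntax correctly, a sort key may be suffixed with =descending to reverse the order):

% sort by Field1 using a case-sensitive character code comparison
\dtlsort{Field1}{testdata}{\dtlcompare}
% or case-insensitively, in descending order
\dtlsort{Field1=descending}{testdata}{\dtlicompare}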
Since this section is comparing \dtlsort against datatooltk --sort, the tests using sample-1000x3-newdb.tex and sample-1000x6-newdb.tex are omitted.
As in the previous section, the test documents simply iterate over the data, displaying the first three columns. The build is again performed using bash scripts in order to time each test. The results are summarised in Table 2.
Note there is a significant saving in the document build time if datatooltk is used to sort the data and (see Tests 1.6 and 1.13 in Table 1) even more so if a conditional is used in the document build to skip datatooltk if the data hasn’t been modified.
Table 2: Sort speed

| Test | Build Time (minutes:seconds) |
|---|---|
| 2.1 \DTLloaddb and \dtlsort (1000x3) | 15:31.54 |
| 2.2 datatooltk --sort-case-insensitive --sort-locale none (1000x3) | 0:01.04 |
| 2.3 datatooltk --sort-locale en-GB (1000x3) | 0:01.05 |
| 2.4 UTF-8 \DTLloaddb and \dtlsort (1000x3) | 15:51.16 |
| 2.5 UTF-8 \DTLloadrawdb and \dtlsort (1000x3) | 15:25.92 |
| 2.6 UTF-8 datatooltk --nomap-tex-specials --sort en-GB (1000x3) | 0:01.03 |
| 2.7 UTF-8 datatooltk --map-tex-specials --sort en-GB (1000x3) | 0:01.23 |
| 2.8 \DTLloaddb and \dtlsort (1000x6) | 30:07.43 |
| 2.9 datatooltk --sort en-GB (1000x6) | 0:01.63 |
| 2.10 datatooltk --sort en-GB --remove-columns (1000x6) | 0:01.57 |
2.1 \DTLloaddb and \dtlsort (1000x3)
This test is much like Test 1.1. The same CSV file sample-1000x3.csv is loaded using
\DTLloaddb{testdata}{sample-1000x3.csv}
but in this case the data is additionally sorted in the document using the case-insensitive handler \dtlicompare:
\dtlsort{Field1}{testdata}{\dtlicompare}
The document build is simply:
pdflatex test-csv-doc2
2.2 datatooltk --sort-case-insensitive --sort-locale none (1000x3)
This test is much like Test 1.4 but additionally sorts the data using datatooltk’s case-insensitive character code comparison. The build process is:
datatooltk --nomap-tex-specials --nocsv-escape --sort Field1 --sort-locale none --sort-case-insensitive --csv sample-1000x3.csv -o data.dbtex
pdflatex -jobname test-dbtex-doc2 test-dbtex-doc
The compile time (shown in Table 2) shows that this method is significantly faster and, when compared with the result for Test 1.4 in Table 1, that the added sort action has an insignificant effect on the overall document build time.
2.3 datatooltk --sort-locale en-GB (1000x3)
The previous test uses the case-insensitive character code
sort method to provide an analogous comparison to
Test 2.1 (which
uses the \dtlicompare
handler). This test uses a
locale-sensitive sort instead. Since the data consists of randomly
generated words within the ASCII set, the resulting order is
much the same (punctuation may be treated differently).
The build process is now:
datatooltk --nomap-tex-specials --nocsv-escape --sort Field1 --sort-locale en-GB --csv sample-1000x3.csv -o data.dbtex
pdflatex -jobname test-dbtex-doc3 test-dbtex-doc
This is marginally slower than the previous test, but the increase is merely in fractions of a second.
2.4 UTF-8 \DTLloaddb and \dtlsort (1000x3)
The 1000x3 CSV file from earlier was copied and some of the entries were replaced in the first column:
Field1,Field2,Field3
Øresund,premankind,bachelor
Þríhyrningsvatn,undersighted,levels
A\&M,amaurosis,world-plaguing
A\textasciitilde M,childminder,whitestraits
Ängelholm,mortalist,improducible
ætherial,profitmongering,femorofibular
œsophagus,Meigs,fluxes
Łobez,overindustrialization,acanthi
Ćmielów,tetracosane,semipacifistic
Żelechów,thermistors,polysaprobic
Ångström,emotionalise,Ulyanovsk
lackluster,Longport,zinked
The new CSV file was saved as sample-1000x3-utf8.csv and is UTF-8 encoded. Note that in addition to \& there is also a similar entry with \textasciitilde.
This is again loaded in the document using \DTLloaddb
but the document (test-csv-utf8-doc1.tex) requires UTF-8 support:
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{datatool}

\DTLloaddb{testdata}{sample-1000x3-utf8.csv}
\dtlsort{Field1}{testdata}{\dtlicompare}

\begin{document}
\DTLforeach*{testdata}
{\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
{\Fieldi, \Fieldii, \Fieldiii.\par}
\end{document}
The build process is simply:
pdflatex test-csv-utf8-doc1
2.5 UTF-8 \DTLloadrawdb and \dtlsort (1000x3)
This test won’t work with old versions of datatool. The problem is
that \DTLloadrawdb
uses expansion to perform the
mapping, which is why it’s not supposed to be used with CSV files
containing TeX code. Unfortunately, the UTF-8 support provided by
inputenc works by turning the first octet of each UTF-8 character
into an active character. (This is a symptom of TeX’s pre-Unicode
origins, and is not a problem with modern Unicode-aware engines.)
These active octets caused expansion problems. This is fixed now
that \DTLloadrawdb
has switched to using protected
expansion, but it’s still important to remember what’s happening
behind the scenes. When comparing, say, “AMS” with “Ångström”, the
handler isn’t comparing the strings “AMS” and “Ångström”, but is
actually comparing “AMS” with “\IeC {\r A}ngstr\IeC {\"o}m”.
The document (test-csv-utf8-doc2.tex) is now:
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{datatool}

\DTLloadrawdb{testdata}{sample-1000x3-utf8-spchar.csv}
\dtlsort{Field1}{testdata}{\dtlicompare}

\begin{document}
\DTLforeach*{testdata}
{\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
{\Fieldi, \Fieldii, \Fieldiii.\par}
\end{document}
The build process again just involves a single LaTeX run:
pdflatex test-csv-utf8-doc2
2.6 UTF-8 datatooltk --nomap-tex-specials --sort-locale en-GB (1000x3)
This test uses the sample-1000x3-utf8.csv file containing UTF-8 characters and LaTeX commands (\& and \textasciitilde). Again the document (test-dbtex-utf8-doc.tex) needs to include UTF-8 support:
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{datatool}

\DTLloaddbtex{\testdata}{data.dbtex}

\begin{document}
\DTLforeach*{\testdata}
{\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
{\Fieldi, \Fieldii, \Fieldiii.\par}
\end{document}
The build process is now:
datatooltk --nomap-tex-specials --nocsv-escape --sort Field1 --sort-locale en-GB --csv sample-1000x3-utf8.csv -o data.dbtex
pdflatex -jobname test-dbtex-utf8-doc1 test-dbtex-utf8-doc
Note that datatooltk (unlike bib2gls) doesn't try interpreting the database contents while it's sorting. (The TeX Parser Library, which is also used by bib2gls, is only used when datatooltk's quick load fails or when importing probsoln files.) This means that \& is seen as two characters (\ and &) by datatooltk and \textasciitilde is seen as 15 characters.
2.7 UTF-8 datatooltk --map-tex-specials --sort-locale en-GB (1000x3)
This is like the previous test except that it uses sample-1000x3-utf8-spchar.csv, which contains the special characters & and ~. Note, however, that --map-tex-specials is performed while the data is being imported. This means that by the time the sorting occurs, the special characters will already have been replaced by their corresponding map values (\& and \textasciitilde by default), so the ordering is the same as the previous test.
The document build process is:
datatooltk --map-tex-specials --csv-escape '\' --sort Field1 --sort-locale en-GB --csv sample-1000x3-utf8-spchar.csv -o data.dbtex
pdflatex -jobname test-dbtex-utf8-doc2 test-dbtex-utf8-doc
2.8 \DTLloaddb and \dtlsort (1000x6)
To compare how well the methods scale, this test is like Test 2.1 but the document (test-csv-1000x6-doc3.tex) uses the larger sample-1000x6.csv file.
The document build is again just a single LaTeX run:
pdflatex test-csv-1000x6-doc3
2.9 datatooltk --sort-locale en-GB (1000x6)
This test is like Test 2.3 but uses the larger sample-1000x6.csv file.
The document build is:
datatooltk --nomap-tex-specials --nocsv-escape --sort Field1 --sort-locale en-GB --csv sample-1000x6.csv -o data.dbtex
pdflatex -jobname test-dbtex-1000x6-doc6 test-dbtex-doc
2.10 datatooltk --sort-locale en-GB --remove-columns (1000x6)
This test is like the previous one but uses
--remove-columns
to remove the columns that aren’t used
in the document. As with Test 1.17,
the column removal reduces the build time.
The document build is:
datatooltk --nomap-tex-specials --nocsv-escape --sort Field1 --sort-locale en-GB --csv sample-1000x6.csv -o data.dbtex --remove-columns "4-"
pdflatex -jobname test-dbtex-1000x6-doc7 test-dbtex-doc
3. Filter Speed
Filtering is quite often performed in conjunction with sorting, but for the purposes of comparison, no sorting is performed in these tests. If sorting is also required, see the previous section. As with the removal of unrequired columns, removing unwanted rows also reduces the overall build time.
Table 3: Filter speed

| Test | Build Time (minutes:seconds) |
|---|---|
| 3.1 UTF-8 \DTLloaddb and \DTLifstringlt* (1000x3) | 0:12.49 |
| 3.2 UTF-8 datatooltk --filter (1000x3) | 0:00.43 |
3.1 UTF-8 \DTLloaddb and \DTLifstringlt* (1000x3)
This test is much like Test 1.1 except that it uses the UTF-8 CSV file sample-1000x3-utf8.csv, which is loaded using
\DTLloaddb{testdata}{sample-1000x3-utf8.csv}
This means that UTF-8 support must be enabled:
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
The loop is now changed to filter the entries so that
only those rows where the text in the first column is less than “B”
are displayed in the document. This uses the starred version of
\DTLifstringlt
for a case-insensitive comparison:
\DTLforeach*{testdata}
{\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
{%
 \DTLifstringlt*{\Fieldi}{B}%
 {\Fieldi, \Fieldii, \Fieldiii.\DTLpar}%
 {}%
}

This internally uses \dtlicompare, which is the comparison handler used in the earlier sort tests with \dtlsort. Note that \DTLifstringlt performs a protected expansion on its arguments before passing them to the handler, so the UTF-8 characters will be expanded into their \IeC form. As a result, all the rows that start with an extended character are included because the sort handler considers \IeC less than “A”.

Since \DTLifstringlt is a short command, \par can't be used, so the paragraph break has been changed to \DTLpar.
The document build again is just a single LaTeX call:
pdflatex test-csv-filter-doc1
Note that the entire CSV file has to be parsed even though only a small portion of it is actually required in the document.
3.2 datatooltk --filter (1000x3)
This method uses datatooltk to filter the same data
(sample-1000x3-utf8.csv). The --filter
option uses
the same comparison as the --sort
option, so even
though datatooltk isn’t being used to sort the data, the
--sort-locale
switch is still used.
The document (test-dbtex-utf8-doc.tex) just requires UTF-8 support. There’s no change to the loop as the filtering is performed by datatooltk:
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{datatool}

\DTLloaddbtex{\testdata}{data.dbtex}

\begin{document}
\DTLforeach*{\testdata}
{\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
{\Fieldi, \Fieldii, \Fieldiii.\par}
\end{document}
The document build is
datatooltk --nomap-tex-specials --nocsv-escape --filter Field1 lt B --filter-include --sort-locale en-GB --csv sample-1000x3-utf8.csv -o data.dbtex
pdflatex -jobname test-dbtex-filter-doc1 test-dbtex-utf8-doc
The resulting document includes the rows containing Ängelholm, ætherial and Ångström, but not those containing any of the other extended characters.
Different locales produce different results. For example, if
--sort-locale en-GB
is changed to
--sort-locale is
(Icelandic) then “ætherial” is no
longer included in the document. Whereas --sort-locale
sv
(Swedish) omits Ängelholm, ætherial and Ångström.
4. Iteration Speed
The tests so far have compared using TeX vs datatooltk to process the data. This section now looks at the various ways of iterating over the data within the document. Since the load time isn’t being considered here, all the test documents in this section use the data.dbtex file created in Test 1.13 using:
datatooltk --in sample-1000x6-newdb.tex -o data.dbtex
This system call isn’t included in the times listed in Table 4. For easier comparison, the results from Test 1.14 and Test 1.17 are included. The build process is a single LaTeX call.
Table 4: Iteration speed

| Test | Build Time (minutes:seconds) |
|---|---|
| 1.14 \DTLforeach* | 0:01.11 |
| 1.17 \DTLforeach* (1000x6 reduced to 1000x3) | 0:01.10 |
| 4.1 Unstarred \DTLforeach | 0:07.78 |
| 4.2 \loop and \dtlgetrow | 0:05.09 |
| 4.3 \loop and \DTLassign | 0:05.14 |
| 4.4 \loop and \DTLgetvalue | 0:07.05 |
4.1 Unstarred \DTLforeach
This is a slightly modified version of Test 1.13 that uses the unstarred version of \DTLforeach:
\DTLforeach{\testdata}
{\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
{\Fieldi, \Fieldii, \Fieldiii.\par}
Other than that it’s the same as Test 1.13, but since the datatooltk call is omitted from the build times in this section, it needs to be compared with the result of Test 1.14 instead. This takes longer than the starred version, as the unstarred version permits editing on the current row, which means that the current row needs to be updated at the end of the iteration.
4.2 \loop and \dtlgetrow
This test uses \loop
to iterate over all the rows in
the database. In each iteration, \dtlgetrow
is used to
access the current row and \dtlgetentryfromcurrentrow
is used to access the given column of the current row:
\newcount\rowctr
\loop
 \advance\rowctr by 1\relax
 \dtlgetrow{\testdata}{\rowctr}%
 \dtlgetentryfromcurrentrow{\Fieldi}{1}%
 \dtlgetentryfromcurrentrow{\Fieldii}{2}%
 \dtlgetentryfromcurrentrow{\Fieldiii}{3}%
 \Fieldi, \Fieldii, \Fieldiii.\par
\ifnum\rowctr<\DTLrowcount{\testdata}
\repeat
This takes longer than using \DTLforeach*
but not as
long as the unstarred version.
4.3 \loop and \DTLassign
This method also uses \loop but now the data is accessed using \DTLassign:
\newcount\rowctr
\loop
 \advance\rowctr by 1\relax
 \DTLassign{\testdata}{\rowctr}{\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}%
 \Fieldi, \Fieldii, \Fieldiii.\par
\ifnum\rowctr<\DTLrowcount{\testdata}
\repeat
The difference in build time between this and the previous test is insignificant.
4.4 \loop and \DTLgetvalue
This method also uses \loop but now the data is accessed using \DTLgetvalue:
\newcount\rowctr
\loop
 \advance\rowctr by 1\relax
 \DTLgetvalue{\Fieldi}{\testdata}{\rowctr}{1}%
 \DTLgetvalue{\Fieldii}{\testdata}{\rowctr}{2}%
 \DTLgetvalue{\Fieldiii}{\testdata}{\rowctr}{3}%
 \Fieldi, \Fieldii, \Fieldiii.\par
\ifnum\rowctr<\DTLrowcount{\testdata}
\repeat
This test is slower than the previous two tests, but is
slightly faster than the unstarred \DTLforeach
.
5. Single Lookup
Accessing an individual row of data is a special case of filtering. The datatool package provides a number of commands for looking up a particular row or cell. If you know both the column index and row index you can use \DTLgetvalue{cs}{db}{row idx}{column idx}, which fetches the value and stores it in the control sequence cs. The indexes start from 1 for the first row/column of data. If you want to use the column label, \dtlcolumnindex{db}{column label} expands to the column index.
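For example, to fetch the entry in row 327 of the column labelled Field1 from the test database (\myEntry is an arbitrary command name introduced here):

% \dtlcolumnindex expands to the numeric index of the Field1 column
\DTLgetvalue{\myEntry}{testdata}{327}{\dtlcolumnindex{testdata}{Field1}}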
If you want to access an entire row, you can use commands like \DTLassign{db}{row idx}{assign list} or \DTLassignfirstmatch{db}{col key}{value}{assign list}.
The tests here all look up the row from the sample-1000x3.csv data that contains the word “grammatically” in the first column (identified by the label Field1). This row has the index 327 (of 1000 rows of data, which corresponds to line 328 of the CSV file) and is just under a third of the way through the data. If the data happened to contain multiple matching rows, all the tests here, except for Test 5.1, would only match the first instance. A real-world set of data would typically have a column containing a unique identifier, such as a customer number or product code, which would be used to identify the required row.
(In general, don’t assume that the row indexes will remain constant if it’s likely that the data source may be edited. For example, data representing students may have John Smith’s data on row 25 in one semester, but if a student higher up in the data drops out and is deleted, the next semester Smith’s data may be on row 24 instead.)
The results for this section are summarised in Table 5. Again, the fastest method is to use datatooltk to preprocess the data.
Table 5: Single lookup

| Test | Build Time (minutes:seconds) |
|---|---|
| 5.1 \DTLloaddb and filtered loop | 0:12.04 |
| 5.2 \DTLloaddb and filtered loop with break | 0:11.56 |
| 5.3 \DTLloaddb and \DTLassignfirstmatch | 0:11.33 |
| 5.4 datatooltk and \DTLassign | 0:00.41 |
| 5.5 datatooltk and \DTLforeach* with break | 0:00.42 |
5.1 \DTLloaddb and filtered loop
This is just a variation of the filtered example in Test 3.1. As before, the test data sample-1000x3.csv is loaded using:
\DTLloaddb{testdata}{sample-1000x3.csv}
The data is then iterated over as with the filtered example.
The string comparison command \ifdefstring
is
provided by etoolbox,
which is automatically loaded by datatool.
\DTLforeach*{testdata}
{\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
{%
 \ifdefstring{\Fieldi}{grammatically}%
 {\Fieldi, \Fieldii, \Fieldiii.\par}%
 {}%
}
This is a rather inefficient method as it iterates over the entire set of data. The document build is simply:
pdflatex test-csv-filter-doc1
5.2 \DTLloaddb and filtered loop with break
This is an improvement on the previous example as it uses
\dtlbreak
to break out of the loop once the required
row has been found:
\DTLforeach*{testdata}
{\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
{%
 \ifdefstring{\Fieldi}{grammatically}{\dtlbreak}{}%
}
\Fieldi, \Fieldii, \Fieldiii.
Knowing that only one row is required means that the code doesn't have to iterate over all 1000 rows. (In this case, the loop is broken when it's only about 32% through.) This also means that the \DTLforeach block of code can be moved to the preamble and the assigned commands \Fieldi, \Fieldii and \Fieldiii can be used throughout the document.
The document build is simply:
pdflatex test-csv-filter-doc1
The drawback with this method is that if the row isn't found, \Fieldi, \Fieldii and \Fieldiii will have the value of the last row. You can add a test using the internal conditional \if@endfor provided by xfor (which is used by datatool for loop management).
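Alternatively, if you'd rather not rely on an internal conditional, the same check can be made with an explicitly defined switch. A minimal sketch (\ifRowFound is an arbitrary name introduced here, not a datatool command):

% define a boolean switch that records whether the row was found
\newif\ifRowFound
\RowFoundfalse
\DTLforeach*{testdata}
{\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
{%
 % set the switch globally (in case the loop body is grouped) and break out
 \ifdefstring{\Fieldi}{grammatically}{\global\RowFoundtrue\dtlbreak}{}%
}
\ifRowFound
 \Fieldi, \Fieldii, \Fieldiii.
\else
 No match found.
\fi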
5.3 \DTLloaddb and \DTLassignfirstmatch
This test uses \DTLassignfirstmatch, which uses \dtlgetrowindex to find the index of the first row to match the given value. Once the row index is found, the row information is accessed using the same internal code as \DTLassign. This means that no loop is required.
\DTLassignfirstmatch{testdata}{Field1}{grammatically}
{\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
\Fieldi, \Fieldii, \Fieldiii.
As with the previous test, the assignment command may be moved to the
preamble and the values will be available for the whole document.
An error occurs (“Undefined control sequence” for \Fieldi, \Fieldii and \Fieldiii, and “No match found for \DTLassignfirstmatch”) if the match is unsuccessful.
The document build is simply:
pdflatex test-csv-filter-doc1
5.4 datatooltk and \DTLassign
This test uses datatooltk to fetch only the required row. Since the document is now loading a database containing only one row, I know the row index will be 1 (regardless of the row number of the original data), so the data can simply be fetched with \DTLassign:
\DTLassign{\testdata}{1}{\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
\Fieldi, \Fieldii, \Fieldiii.
Again the assignment command may be moved to the preamble. The document build is:
datatooltk --nomap-tex-specials --nocsv-escape --csv sample-1000x3.csv --filter Field1 eq grammatically -o data.dbtex
pdflatex -jobname test-dbtex-single-doc1 test-dbtex-single-doc
If there’s no match, the database will be empty and the document will generate the error:
! Argument of \@dtl@getrow has an extra }.
<inserted text>
\par
l.9 ...di=Field1,\Fieldii=Field2,\Fieldiii=Field3}
So you may want to consider checking the row count first. For example:
\ifcase\DTLrowcount{\testdata}
 \newcommand{\Fieldi}{??}
 \newcommand{\Fieldii}{??}
 \newcommand{\Fieldiii}{??}
\or
 \DTLassign{\testdata}{1}{\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
\else
 \GenericWarning{}{Multiple matches found}
 \DTLassign{\testdata}{1}{\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
\fi
5.5 datatooltk and \DTLforeach* with break
This is like the previous test but uses \DTLforeach* instead of \DTLassign:

\DTLforeach*{\testdata}{\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
{\dtlbreak}
\Fieldi, \Fieldii, \Fieldiii.
Remember that unlike Test 5.1 and Test 5.2, the database loaded by the document only has a single row, so even without \dtlbreak, there's only a single iteration. This is fractionally longer than Test 5.4 as a result of the loop initialisation.
The document build process is:
datatooltk --nomap-tex-specials --nocsv-escape --csv sample-1000x3.csv --filter Field1 eq grammatically -o data.dbtex
pdflatex -jobname test-dbtex-single-doc2 test-dbtex-single-doc
In this case, if there's no match the database will be empty, so there are no rows to iterate over. This means that \Fieldi, \Fieldii and \Fieldiii will be undefined.
Summary
To improve the document build process:
- Use datatooltk to parse the data and convert it to the more efficient .dbtex format, which can be input in the document using \DTLloaddbtex.
- Remove unnecessary data by filtering, truncating or column removal.
- If the data needs sorting or column/row removal, combine this in the first step.
- Adjust the document build process so that datatooltk is only run when the data has changed (or if the sorting or filtering conditions change).
- If you have to iterate over a large database in your document, use the starred form \DTLforeach* (where possible) rather than the slower unstarred form \DTLforeach.