Bug Tracker

ID: 297
Date: 2025-03-13 09:15:40
Last update: 2025-03-14 16:13:56
Status: Open
Category: datatool
Version: 3.1
Summary: Compilation time increase by factor 7 with new datatool version 3.1



I'm using datatool to read a CSV file and use its values in my LaTeX documents. I've noticed that compilation takes significantly longer with version 3.1: an increase by a factor of 7, in fact.

I cannot be certain of the reason for this increase; however, I've noticed based on console output that datatool 3.1 seems to get "stuck" on creating the .aux file for a good while. This is noticeable even with relatively short and simple CSVs.

For example, an MWE with a CSV of 15 keys and 30 lines of values leads to the following compilation times:

Run 1: 0:04.65
Run 2: 0:04.80
Run 3: 0:05.08

Now this doesn't seem like much with an MWE, but it's nevertheless a factor 7 increase in compilation time. This would probably not be too much of a problem if I were compiling just one document (though still inconvenient), but I'm compiling over 60 documents in total, so that's quite a time difference. For me, the increase means the difference between 450 seconds (or 7 minutes, 30 seconds) and 3200 seconds (or 53 minutes, 20 seconds).

I would very much appreciate it if you could look into the reason for the increase and, if possible, provide a fix.

Note: I also have an MWE CSV file, but I cannot provide it through your upload form.



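\documentclass{article}
\usepackage{datatool}
\usepackage{lipsum} % For generating lorem ipsum text

% Load the CSV file
% (database label and file name are assumed; the original line is not preserved)
\DTLloaddb{mydata}{sample-data.csv}

\begin{document}
\section*{Data from CSV}
\begin{tabular}{*{15}{l}}
  \textbf{Key1} & \textbf{Key2} & \textbf{Key3} & \textbf{Key4} & \textbf{Key5} & \textbf{Key6} & \textbf{Key7} & \textbf{Key8} & \textbf{Key9} & \textbf{Key10} & \textbf{Key11} & \textbf{Key12} & \textbf{Key13} & \textbf{Key14} & \textbf{Key15} \\
  % Loop reconstructed from the surviving row below (assumed form)
  \DTLforeach*{mydata}{\keyone=Key1,\keytwo=Key2,\keythree=Key3,\keyfour=Key4,\keyfive=Key5,
    \keysix=Key6,\keyseven=Key7,\keyeight=Key8,\keynine=Key9,\keyten=Key10,\keyeleven=Key11,
    \keytwelve=Key12,\keythirteen=Key13,\keyfourteen=Key14,\keyfifteen=Key15}{%
    \keyone & \keytwo & \keythree & \keyfour & \keyfive & \keysix & \keyseven & \keyeight & \keynine & \keyten & \keyeleven & \keytwelve & \keythirteen & \keyfourteen & \keyfifteen \\
  }
\end{tabular}

\section*{Lorem Ipsum}
\lipsum
\end{document}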


Update 2025-03-14: I've just uploaded datatool v3.2 to CTAN. It should reach the TeX distributions in a few days. When it's available, try:

This won't do any parsing of the cells and will assume the content is just text (with valid LaTeX syntax or no special characters), which should make loading faster.
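
A minimal sketch of what such a call might look like. The option value csv-content=no-parse, the database label mydata and the file name are all assumptions here, so check the v3.2 manual for the exact spelling before relying on it:

```latex
\documentclass{article}
\usepackage{datatool}% this sketch needs v3.2 or later
% Assumed option value: csv-content=no-parse
% (no parsing of the cells; the content is taken to be plain text)
\DTLread[name=mydata,format=csv,csv-content=no-parse]{sample-data.csv}
\begin{document}
Rows loaded: \DTLrowcount{mydata}.
\end{document}
```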

Original evaluation:

If you want to include a csv file in your upload you can embed it in the filecontents environment (as long as it's not too big). Just put it before the \documentclass line. For example:
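
A minimal sketch of that layout, using a made-up three-column CSV (the file name, the label mydata and the data are placeholders, not the original test file). The starred filecontents* form is used here because the unstarred environment writes comment lines at the top of the file, which would end up in the CSV:

```latex
% Written to disk when the document is compiled
% (an existing file of the same name is left untouched).
\begin{filecontents*}{sample-data.csv}
Key1,Key2,Key3
1,2.5,3.75
4,5.5,6.25
\end{filecontents*}
\documentclass{article}
\usepackage{datatool}
\DTLloaddb{mydata}{sample-data.csv}
\begin{document}
\DTLdisplaydb{mydata}
\end{document}
```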


I've created a test CSV file with 30 rows and 15 columns of randomly generated numbers and called the file sample-data.csv:


Starting with your example and datatool v3.1 on TeX Live 2025, with no aux file, the processing times for a single pdflatex call are:

real	0m7.720s
user	0m7.618s
sys	0m0.061s
The aux file simply contains:
\gdef \@abspage@last{2}
So I can't see any reason why there should be any delay in either reading or writing the aux file. I next tried with rollback to v2.32:
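
Rollback is requested with the trailing optional argument of \usepackage. A minimal sketch, assuming the release label is v2.32 (the last 2.x release) and reusing the placeholder names mydata and sample-data.csv:

```latex
\documentclass{article}
\usepackage{lipsum}
% The [=...] argument asks for an earlier release of the package;
% the label v2.32 is an assumption.
\usepackage{datatool}[=v2.32]
\DTLloaddb{mydata}{sample-data.csv}
\begin{document}
\lipsum
\end{document}
```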

The processing times were:

real	0m0.819s
user	0m0.697s
sys	0m0.057s
so I agree there is certainly a difference. I think the place where it seems to be stuck is actually on the \DTLloaddb line just before the aux file is read.

To test the package load time, I just loaded the package without doing anything other than printing the lipsum text:
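
A test document of that kind could be as small as the following sketch. Nothing is loaded into a database, so any time beyond a plain lipsum document is down to the package itself:

```latex
\documentclass{article}
\usepackage{lipsum}
\usepackage{datatool}% package loaded, but no datatool command is used
\begin{document}
\lipsum
\end{document}
```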


The processing time was:

real	0m0.563s
user	0m0.521s
sys	0m0.039s
With rollback, the processing time was:
real	0m0.445s
user	0m0.392s
sys	0m0.050s
So the package loading time has gone up, but this is because new features have been added.

Now testing the time to just load the data:
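
A sketch of such a load-only test, assuming the placeholder label mydata and that sample-data.csv is in the working directory. The data is read but never typeset, which isolates the cost of \DTLloaddb:

```latex
\documentclass{article}
\usepackage{lipsum}
\usepackage{datatool}
% Load the data but never use it
\DTLloaddb{mydata}{sample-data.csv}
\begin{document}
\lipsum
\end{document}
```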


The processing time for v3.1 was:

real	0m7.483s
user	0m7.410s
sys	0m0.042s

With rollback:

real	0m0.712s
user	0m0.661s
sys	0m0.047s

The CSV file loading has been rewritten to allow for additional features and improved parsing. The parser now uses regular expressions to match scientific notation, and regular expressions do unfortunately slow things down. The old \DTLloaddb command has been rewritten to use the new \DTLread command. Using \DTLread explicitly (with mapping):

(this matches the behaviour of the old \DTLloadrawdb command). The processing time was:
real	0m9.391s
user	0m9.317s
sys	0m0.038s
If you don't have any special characters in the CSV file that need converting, it's quicker to use:
(this matches the behaviour of \DTLloaddb, so this is more directly comparable to your MWE). The processing time:
real	0m7.460s
user	0m7.381s
sys	0m0.050s
Using \DTLsetup{store-datum} before \DTLread also slows things down a little:
real	0m7.592s
user	0m7.520s
sys	0m0.042s
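
The three \DTLread variants timed above might be written along the following lines. This is a sketch only: the option values csv-content=literal and csv-content=tex and the database labels are assumptions, and in a real timing test each variant would go in its own document:

```latex
\documentclass{article}
\usepackage{datatool}% v3.0 or later

% (1) With mapping of special characters, as the old \DTLloadrawdb did
\DTLread[name=rawdata,format=csv,csv-content=literal]{sample-data.csv}

% (2) Content taken as LaTeX code, as \DTLloaddb does
\DTLread[name=texdata,format=csv,csv-content=tex]{sample-data.csv}

% (3) As (2), but with store-datum switched on beforehand
\DTLsetup{store-datum}
\DTLread[name=datumdata,format=csv,csv-content=tex]{sample-data.csv}

\begin{document}
Rows loaded: \DTLrowcount{texdata}.
\end{document}
```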

The changes in v3.0 were primarily focused on improving the support for parsing data types (including allowing scientific notation and better localisation support) and rewriting the sorting functions.

You can speed things up a bit if the data doesn't need parsing. For example, if the first column contains only integers and the other 14 columns all contain decimals with a decimal point and no number group separator:


The processing times:

real	0m5.951s
user	0m5.894s
sys	0m0.034s

Now consider the numeric-data.csv test data, which has 1000 rows and four columns: integer, decimal, currency and scientific notation. The test document is now:


Processing time:

real	0m45.237s
user	0m45.014s
sys	0m0.053s
Again, this is much longer than with the equivalent using rollback:

Processing time:

real	0m19.381s
user	0m19.259s
sys	0m0.055s

However, with rollback the final column, which contains scientific notation, is treated as containing only strings.

The improvements come when the data needs to be processed in some way after it has been loaded. Staying with the above example using rollback, but adding \dtlsort to sort the data according to the second field (which is a decimal value):
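
A small sketch of a sort test of this kind, using a three-row stand-in for numeric-data.csv. The column keys, the file name and the rollback label v2.32 are assumptions:

```latex
\begin{filecontents*}{numeric-sample.csv}
Integer,Decimal
1,3.25
2,1.5
3,2.75
\end{filecontents*}
\documentclass{article}
\usepackage{datatool}[=v2.32]% rollback; release label assumed
\DTLloaddb{mydata}{numeric-sample.csv}
% Sort on the second field (key Decimal), ascending by default
\dtlsort{Decimal}{mydata}{\dtlcompare}
\begin{document}
\DTLdisplaydb{mydata}
\end{document}
```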


Processing time:

real	46m5.909s
user	45m54.246s
sys	0m0.660s
Now with v3.1:

Processing time:

real	0m57.222s
user	0m56.912s
sys	0m0.078s
This has reduced a single pdflatex run by 45 minutes.

Testing column aggregate (calculating standard deviation). First with rollback:

Standard deviation: \result.
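
A sketch of how the rollback version of this test might be written with \DTLsdforkeys, which stores the result in a command. The data file, the column key Decimal and the release label are assumptions:

```latex
\begin{filecontents*}{numeric-sample.csv}
Integer,Decimal
1,3.25
2,1.5
3,2.75
\end{filecontents*}
\documentclass{article}
\usepackage{datatool}[=v2.32]% rollback; release label assumed
\DTLloaddb{mydata}{numeric-sample.csv}
\begin{document}
% Store the standard deviation of the Decimal column in \result
\DTLsdforkeys{mydata}{Decimal}{\result}
Standard deviation: \result.
\end{document}
```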


Processing time:

real	0m31.807s
user	0m31.627s
sys	0m0.067s
Now with v3.1:
Standard deviation: \result.


Processing time:

real	0m46.172s
user	0m45.950s
sys	0m0.049s
This is slower, by about 14 seconds. Alternatively:
Standard deviation: \DTLuse{sd}.
Mean: \DTLuse{mean}.
Sum: \DTLuse{sum}.
Number of items: \DTLuse{count}.
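
The \DTLuse lines above read the results of a preceding action. A sketch of what that action call might look like, assuming the aggregate action takes key to pick the column and options to list the functions wanted (the data file and column key are placeholders):

```latex
\begin{filecontents*}{numeric-sample.csv}
Integer,Decimal
1,3.25
2,1.5
3,2.75
\end{filecontents*}
\documentclass{article}
\usepackage{datatool}% v3.0 or later
\DTLread[name=mydata,format=csv]{numeric-sample.csv}
\begin{document}
% Assumed settings: key names the column, options lists the functions
\DTLaction[name=mydata,key=Decimal,options={sd}]{aggregate}

Standard deviation: \DTLuse{sd}.
Mean: \DTLuse{mean}.
Sum: \DTLuse{sum}.
Number of items: \DTLuse{count}.
\end{document}
```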


Processing time:

real	0m45.994s
user	0m45.746s
sys	0m0.063s

There isn't much difference in this case, except that this not only obtains the standard deviation, but also the intermediate calculations.

So the changes made for v3.0 have the most significant improvements for documents with large databases that require sorting. The increase in load time is the result of improvements in parsing to detect scientific notation and localised formatting.

Since parsing CSV files has an impact on build time, an alternative approach is to use datatooltk to convert the CSV file to a file that can be quickly loaded. The version currently available on CTAN only supports the old dbtex v2.0 format. There's a new version of datatooltk that's currently under development that supports the newer dbtex v3.0 format. The document needs to be changed slightly:
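
A sketch of what the changed document might look like, assuming numeric-data.dbtex has already been produced by datatooltk and that \DTLloaddbtex is still available in the installed version. The command name \mydata is a placeholder:

```latex
\documentclass{article}
\usepackage{datatool}
% Input the pre-built database instead of parsing the CSV file;
% \mydata is defined to expand to the database label stored in the file.
\DTLloaddbtex{\mydata}{numeric-data.dbtex}
\begin{document}
Rows loaded: \DTLrowcount{\mydata}.
\end{document}
```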


The document build is now (using the current development version of datatooltk):

datatooltk -o numeric-data.dbtex --csv-sep ',' --noliteral --csv numeric-data.csv --output-format dbtex-3
pdflatex mwe.tex

The processing time is now:

real	0m1.736s
user	0m3.934s
sys	0m0.167s
Bear in mind that the datatooltk call is only required when the CSV file is changed, so if your document requires multiple LaTeX calls then this will help to reduce the overall build time.

I'll look into providing an option to switch off all the extra parsing for documents that don't need it.




Page permalink: https://www.dickimaw-books.com/bugtracker.php?key=297
