Tex2txt: a flexible LaTeX filter

General description | Principal limitations | Selected actions | Command line | Usage under Windows | Tool integration | Encoding problems | Declaration of LaTeX macros | Handling of displayed equations | Application as Python module | Remarks on implementation

THIS REPOSITORY IS ARCHIVED. Please note the follow-on project YaLafi:
- Editor Emacs can be used via plug-in Emacs-langtool
- Editor Vim can be used via several plug-ins
- Emulation of LanguageTool server with integrated LaTeX filter
- Filter implementation more flexible, based on scanner / parser

Summary and example. This Python program extracts plain text from LaTeX documents. Due to the following characteristics, it may be integrated with a proofreading software:

For instance, the LaTeX input

Only few people\footnote{We use
\textcolor{red}{redx colour.}}
is lazy.

will lead to the following output from the example application script shell.py, described in section Application examples below. The script invokes LanguageTool as proofreading software, using a local installation or the Web server hosted by LanguageTool.

1.) Line 2, column 17, Rule ID: MORFOLOGIK_RULE_EN_GB
Message: Possible spelling mistake found
Suggestion: red; Rex; reds; redo; Red; Rede; redox; red x
Only few people is lazy.    We use redx colour. 
                                   ^^^^
2.) Line 3, column 1, Rule ID: PEOPLE_VBZ[1]
Message: If 'people' is plural here, don't use the third-person singular verb.
Suggestion: am; are; aren
Only few people is lazy.    We use redx colour. 
                ^^

Run with option --html, the script produces an HTML report:

HTML report

Back to top

General description

Tex2txt.py is a modest, self-contained Python script or module for the extraction of plain text from LaTeX documents. In some sense, it relates to projects like OpenDetex, pandoc, plasTeX, pylatexenc, TeXtidote, and tex2txt. For the naming conflict with the latter tool, we want to apologise.

While virtually no text should be dropped by the filter, our aim is to provoke as few “false” warnings as possible when the result is fed into proofreading software. This goal applies especially to documents containing displayed equations. Problems with punctuation and capitalisation would arise if equation environments were simply removed or replaced by fixed text. Altogether, the script can help to create a compact report from language examination of a single file or a complete document tree. Simple and more complete applications are addressed in sections Tool integration and Application as Python module below.

For ease of problem localisation, we implement a mechanism that tracks line number changes during the text manipulations. Unnecessary creation of empty lines can therefore be avoided, so that sentences and paragraphs remain intact. This is demonstrated in file Example.md. Reconstruction of both line and column numbers is possible with script option --char, which activates position tracking for each single character of input. File Example2.md shows such an application.

The first part of the Python script gathers LaTeX macros and environments with tailored treatment, which is briefly described in section Declaration of LaTeX macros. Some standard macros and environments are already included, but the collections will most likely have to be extended. With option --defs, definitions can also be extended by an additional file.

Unknown LaTeX macros and environments are silently ignored while keeping their arguments and bodies, respectively; script option --unkn will list them. Declared macros can be used recursively. As in TeX, macro expansion consumes white space (possibly including a line break) between macro name and next non-space character within the current paragraph.

Extra text flows like footnotes are normally appended to the end of the main text flow, each one separated by blank lines. The introductory summary above shows an example. Activation of this behaviour is demonstrated for macro \caption{...} in section Declaration of LaTeX macros. Script option --extr provides another possibility that is also useful for the extraction of foreign-language text.
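The appending of extra text flows can be pictured with a minimal sketch. It is not the script's implementation: the function name extract_footnotes is hypothetical, and footnotes containing nested braces are deliberately not handled.

```python
import re

def extract_footnotes(text):
    """Move footnote arguments to the end of the text (no nested braces)."""
    notes = []

    def collect(match):
        notes.append(match.group(1))
        return ''          # the footnote disappears from the main text flow

    main = re.sub(r'\\footnote\{([^{}]*)\}', collect, text)
    # each extra text flow is appended, separated by blank lines
    return main + ''.join('\n\n' + note for note in notes)

result = extract_footnotes(
    'Only few people\\footnote{We use red colour.}\nis lazy.\n')
```

Because the footnote text leaves the main flow, the sentence "Only few people is lazy." stays intact for the proofreader.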

An optional speciality is some parsing of LaTeX environments for displayed equations. This makes it possible to check embedded \text{...} parts (macro from LaTeX package amsmath), and trailing punctuation of these equations can be taken into account during the language check of the main text flow. Further details are given in section Handling of displayed equations. An example is shown in file Example.md; the operation is summarised in the script at label LAB:EQUATIONS.

Interface and examples for application as Python module are described in section Application as Python module below.

The Python script may be seen as an exercise in application of regular expressions. Its internal design could be more orderly. Currently, it is mainly structured by comments, and it mixes definitions of variables and functions with statements that actually perform text replacement operations. Moreover, it uses many global variables without clear naming convention, and some of them are even manipulated by the central module function. In section Remarks on implementation, some general techniques and problems are addressed.

If you use this tool and encounter a bug or have other suggestions for improvement, please leave a note under category Issues, or initiate a pull request. Many thanks in advance.

Happy TeXing!

Back to top

Principal limitations

The implemented parsing mechanism can only roughly approximate the behaviour of a real LaTeX system. Apart from many minor shortcomings, a list of major incompatibilities must contain at least the following points.

Please compare section Remarks on implementation, too.

Back to top

Selected actions

Here is a list of the most important script operations.

Back to top

Command line

The script expects the following parameters.

python3 tex2txt.py [--nums file] [--char] [--repl file] [--defs file]
                   [--extr list] [--lang xy] [--ienc enc] [--unkn]
                   [texfile]

Back to top

Usage under Windows

If Python is installed under Windows, then the main Python program tex2txt.py may be directly used in a Windows command console or script. Furthermore, at least the application script shell.py from section Application examples can be run, if option '--server lt' is used, or if Java and the LanguageTool software are locally present. For example, this could look like

py -3 shell.py --html t.tex > t.html

or

"c:\Program Files\Python\Python37\python.exe" shell.py --html t.tex > t.html

if the Python launcher has not been installed. The file tex2txt.py should reside in the current directory. Variable 'ltdirectory' in script shell.py has to be customised, unless option '--server lt' is used.

The software has been developed under Linux and additionally tested under Cygwin on Windows 7. In the latter case, a Windows Java installation is sufficient. Some possible encoding problems related to Windows are addressed in section Encoding problems.

Back to top

Tool integration

The Python script is meant as a small utility that performs a limited task with good quality. Integration with proofreading software and features like tracking of \input{...} directives have to be implemented “on top”. Apart from application in Bash scripts, extension is also possible as described in section Application as Python module.

Simple scripts

A first Bash script that checks a single LaTeX file is given in file shell.sh. The command

bash shell.sh file_name

will read the specified LaTeX file and create plain text and line number files with additional extensions .txt and .lin, respectively. Then it will call LanguageTool and filter line numbers in output messages. File Example.md demonstrates the script.

A variant correcting both line and column numbers is given in file shell2.sh with application example in file Example2.md.

We assume that Java is installed, and that the directory with relative path ../LT/ contains an unzipped archive of the LanguageTool software. This archive, for example LanguageTool-4.4.zip, can be obtained from here.

More complete integration

A Bash script for language checking of a whole document tree is proposed in file checks.sh. For instance, the command

bash checks.sh Banach/*.tex > errs

will check the main text and extracted foreign-language parts in all these files. The result file 'errs' will contain names of files with problems together with filtered messages from the proofreader.

With option --recurse, file inclusions such as \input{...} will be tracked recursively. Exceptions are listed at LAB:RECURSE in the Bash script. Note, however, the limitation sketched in issue #12.

It is assumed that the Bash script is invoked at the “root directory” of the LaTeX project, and that all LaTeX documents are placed directly there or in subdirectories. For safety, the script will refuse to create auxiliary files outside of the directory specified by variable $txtdir (see below). Thus, an inclusion like \input{../../generics.tex} probably won't work with option --recurse.
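The safety rule can be illustrated by a small Python sketch. The check in checks.sh itself is written in Bash and is not reproduced here; the function name is_inside and the directory name are hypothetical.

```python
import os

def is_inside(directory, path):
    """Return True if 'path', taken relative to 'directory', stays inside it."""
    directory = os.path.realpath(directory)
    target = os.path.realpath(os.path.join(directory, path))
    return os.path.commonpath([directory, target]) == directory

ok = is_inside('/tmp/txtdir', 'chapters/intro.tex')
escaped = is_inside('/tmp/txtdir', '../../generics.tex')
```

A path that climbs out of the auxiliary directory, as in the \input{../../generics.tex} case, is rejected by such a test.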

Apart from Python, the Bash script uses Java together with LanguageTool's desktop version for offline use, Hunspell, and some standard Linux tools. Before application, variables in the script have to be customised. For placement of intermediate text and line number files, the script uses an auxiliary directory designated by variable $txtdir. This directory and possibly necessary subdirectories will be created without request. They can be deleted with option --delete.

Actions of the Bash script

Usage of the Bash script

bash checks.sh [--recurse] [--adapt-lt] [--no-lt]
               [--columns] [--delete] [files]

Back to top

Encoding problems

For script tex2txt.py, the encoding of LaTeX input may be set with option --ienc; output encoding is fixed to UTF-8. In application Python script shell.py from section Application examples, this corresponds to option --encoding. The Bash scripts from section Tool integration currently expect plain ASCII or UTF-8 input.

Files with Windows style line endings (CRLF) are accepted, but the text output will be Unix style (LF only), unless a Windows Python interpreter is used. The output filters as in Bash script shell2.sh will work properly, however.

Under Cygwin with Java from the Windows installation, LanguageTool will produce Latin-1 output, even if option '--encoding utf-8' is specified. Therefore, a translator to UTF-8 has to be placed in front of a Python filter for line or column numbers. This is shown in Bash function LTfilter() in file checks.sh. A similar approach is taken in example Python script shell2.py.
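The translation step amounts to re-encoding the byte stream. The following Python sketch only shows the idea; the function name latin1_to_utf8 is hypothetical, and the Bash function LTfilter() may solve this differently.

```python
def latin1_to_utf8(data):
    """Re-encode Latin-1 bytes as UTF-8 bytes."""
    return data.decode('latin-1').encode('utf-8')

converted = latin1_to_utf8(b'Stra\xdfe')
```

Every byte sequence is valid Latin-1, so the decoding step cannot fail; the result is only meaningful, however, if the input really was Latin-1.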

With option --json, LanguageTool always delivers UTF-8 encoded text. JSON output is used in application script shell.py.

Similarly, Python's version for Windows by default prints Latin-1 encoded text to standard output. As this ensures proper work in a Windows command console, we do not change it for the example script shell.py when generating a text report. On option --html, we enforce UTF-8 output in order to determine the encoding of the generated HTML page. The stand-alone script tex2txt.py will produce UTF-8 output, too.

Back to top

Declaration of LaTeX macros

The first section of the Python script consists of collections for LaTeX macros and environments. The central “helper function” Macro() declares a LaTeX macro, see the synopsis below, and is applied in the collections parms.project_macros and parms.system_macros. Here is a short extract from the definition of standard LaTeX macros already included. (The lambda construct allows us to use variables and functions introduced only later.)

parms.system_macros = lambda: (
    Macro('caption', 'OA', extr=r'\2'),         # extract to end of text
    Macro('cite', 'A', '[1]'),
    Macro('cite', 'PA', r'[1, \1]'),
    Macro('color', 'A'),
    Macro('colorbox', 'AA', r'\2'),
    Macro('documentclass', 'OA'),
    ...

Other collections, e.g. for LaTeX environments, use functions similar to Macro(). Project specific extension of all these collections is possible with option --defs and an additional Python file. The corresponding collections there, for instance defs.project_macros, have to be defined using simple tuples without lambda construct; compare the example in section Command line.
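To illustrate how a declaration such as Macro('colorbox', 'AA', r'\2') can drive a text replacement, here is a strongly simplified sketch. It is not the script's code: the function expand_macro is hypothetical, the letters 'A' and 'O' are assumed to mean a mandatory {...} and an optional [...] argument, and nested braces as well as other argument types are not handled.

```python
import re

def expand_macro(text, name, args, repl=''):
    """Replace a macro and its arguments according to a string like 'OA'."""
    # the look-ahead keeps \color from matching the start of \colorbox
    pattern = r'\\' + re.escape(name) + r'(?![A-Za-z])'
    for kind in args:
        if kind == 'A':        # assumed: mandatory {...} argument
            pattern += r'\s*\{([^{}]*)\}'
        elif kind == 'O':      # assumed: optional [...] argument
            pattern += r'(?:\s*\[([^\]]*)\])?'
        else:
            raise ValueError('unsupported argument type: ' + kind)
    return re.sub(pattern, repl, text)

boxed = expand_macro(r'a \colorbox{red}{box} b', 'colorbox', 'AA', r'\2')
```

In the replacement string, \1 and \2 refer to the first and second argument, as in the extract above.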

Synopsis of Macro(name, args, repl='', extr=''):

Back to top

Handling of displayed equations

Displayed equations should be part of the text flow and include the necessary punctuation. The German version of LanguageTool (LT) will detect a missing dot in the following snippet. For English texts, see the comments in section Equation replacements in English documents ahead.

Wir folgern
\begin{align}
    a   &= b \\
    c   &= d
\end{align}
Daher ...

Here, 'a' to 'd' stand for arbitrary mathematical terms (meaning: “We conclude <maths> Therefore, ...”). In fact, LT complains about the capital “Daher” that should start a new sentence.

Trivial version

With the entry

    EnvRepl('align', repl=''),

in parms.environments of the Python script (but no 'align' entry in parms.equation_environments), the equation environment is simply removed. We get the following script output that will probably cause a problem, even if the equation ends with a correct punctuation mark.

Wir folgern
Daher ...

Simple version

With the entry

    EquEnv('align', repl='  Relation'),

in parms.equation_environments of the script, one gets:

Wir folgern
  Relation
Daher ...

Adding a dot '= d.' in the equation will lead to 'Relation.' in the output. This also holds if the punctuation mark is followed by maths space or by macros such as \label and \nonumber.
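The behaviour of the simple version can be approximated as follows. This is a sketch under simplifying assumptions, not the code at LAB:EQUATIONS: the function replace_align is hypothetical, only the align environment is treated, and only a few trailing macros and maths spaces are recognised.

```python
import re

def replace_align(text, repl='  Relation'):
    """Replace align environments by fixed text, keeping trailing punctuation."""
    environment = re.compile(r'\\begin\{align\}(.*?)\\end\{align\}', re.DOTALL)
    # trailing \label{...}, \nonumber, maths space and white space are ignored
    trailing = r'(?:\\label\{[^{}]*\}|\\nonumber|\\[,;:!]|\\q?quad|\s)+\Z'

    def substitute(match):
        body = re.sub(trailing, '', match.group(1))
        punct = body[-1] if body and body[-1] in '.,;:!?' else ''
        return repl + punct

    return environment.sub(substitute, text)

with_dot = replace_align(
    'Wir folgern\n'
    '\\begin{align}\n'
    '    a   &= b \\\\\n'
    '    c   &= d. \\label{eq}\n'
    '\\end{align}\n'
    'Daher ...\n')
```

The line breaks before and after the environment are kept, so that the line count of the output does not change more than necessary.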

Full version

With the entry

    EquEnv('align'),

we obtain (“gleich” means equal, and option --lang en will print “equal”):

Wir folgern
  U-U-U  gleich V-V-V 
  V-V-V  gleich W-W-W 
Daher ...

The replacements 'U-U-U' to 'W-W-W' are taken from the collection in script variable parms.display_math that depends on option --lang, too. Now, LT will additionally complain about repetition of 'V-V-V'. Finally, writing '= b,' and '= d.' in the equation leads to the output:

Wir folgern
  U-U-U  gleich V-V-V, 
  W-W-W  gleich X-X-X. 
Daher ...

The rules for this equation parsing are described at LAB:EQUATIONS in the Python script. They ensure that variations like

    a   &= b \\
        &= c.

and

    a   &= b \\
        &\qquad -c.

also will work properly. In contrast, the text

    a   &= b \\
    -c  &= d.

will again produce an LT warning due to the missing comma after 'b', since the script replaces both 'b' and '-c' by 'V-V-V' without intermediate text.

In rare cases, manipulation with \LTadd{} or \LTskip{} may be necessary to avoid false warnings from the proofreader. See also file Example.md.

Inclusion of “normal” text

In variant “Full version”, the argument of \text{...} (variable for macro name in script: parms.text_macro) is directly copied. Outside of \text, only maths space like \; and \quad is considered as space. Therefore, one will get warnings from the proofreading program, if subsequent \text and maths parts are not properly separated. See file Example.md.

Equation replacements in English documents

The replacement collection in variable parms.display_math does not work well, if single letters are taken as replacements, compare Issue #22. We now have chosen replacements as 'B-B-B' for German and English texts.

Furthermore, the English version of LanguageTool (like other proofreading tools) rarely detects mistakenly capitalised words inside of a sentence; they are probably considered as proper names. Therefore, a missing dot at the end of a displayed equation is hardly ever found. An experimental hack is provided by option --equation-punctuation of application script shell.py described in section Application examples.

Back to top

Application as Python module

The script can be extended with Python's module mechanism. In order to use import tex2txt, this module has to reside in the same directory as the importing script, or environment variable PYTHONPATH has to be set accordingly.

Module interface

The module provides the following central function.

(plain, nums) = tex2txt.tex2txt(latex, options)

Argument 'latex' is the LaTeX text as string, return element 'plain' is the plain text as string. Array 'nums' contains the estimated original line or character positions, counting from one. Negative values indicate that the actual position may be larger. Argument 'options' can be created with class

tex2txt.Options(...)

that takes arguments similar to the command-line options of the script. They are documented at the definition of class 'Options', see LAB:OPTIONS. The parameters 'defs' and 'repl' for this class can be set using functions tex2txt.read_definitions(fn, enc) and tex2txt.read_replacements(fn, enc), both expecting 'None' or a file name as argument 'fn', and an encoding name for 'enc'.

Remark. Since the function tex2txt() modifies globals in its module, an application must not run several calls concurrently; only one call may be active at any point in time.

Two additional functions support translation of line and column numbers in case of character position tracking. Translation is performed by

ret = tex2txt.translate_numbers(latex, plain, nums, starts, lin, col)

with strings 'latex' and 'plain' containing LaTeX and derived plain texts. Argument 'nums' is the number array returned by function tex2txt(), 'lin' and 'col' are the integers to be translated. Argument 'starts' has to be obtained beforehand by the call

starts = tex2txt.get_line_starts(plain)

and contains positions in string 'plain' that start a new line. The return value 'ret' above is 'None', if translation was not successful. On success, 'ret' is a small object. Integers 'ret.lin' and 'ret.col' indicate line and column numbers, and boolean 'ret.flag' equals 'True', if the actual position may be larger.
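What such a translation involves can be shown with a self-contained sketch. It does not reproduce the module's functions: the names line_starts and translate are hypothetical, a plain tuple is returned instead of an object, and 'nums' is assumed to hold, for each character of 'plain', its position in 'latex' counting from one.

```python
import bisect

def line_starts(text):
    """Return the 0-based offsets at which the lines of 'text' start."""
    starts = [0]
    for index, char in enumerate(text):
        if char == '\n':
            starts.append(index + 1)
    return starts

def translate(latex, plain, nums, starts, lin, col):
    """Map 1-based (lin, col) in 'plain' to (lin, col, flag) in 'latex'."""
    if lin < 1 or lin > len(starts) or col < 1:
        return None
    pos = starts[lin - 1] + col - 1        # 0-based offset in 'plain'
    if pos >= len(nums):
        return None
    flag = nums[pos] < 0                   # negative: position may be larger
    offset = abs(nums[pos]) - 1            # 0-based offset in 'latex'
    latex_starts = line_starts(latex)
    line_index = bisect.bisect_right(latex_starts, offset) - 1
    return (line_index + 1, offset - latex_starts[line_index] + 1, flag)

latex = 'a \\emph{b}\nc\n'
plain = 'a b\nc\n'
nums = [1, 2, 9, 11, 12, 13]               # written by hand for this example
starts = line_starts(plain)
position = translate(latex, plain, nums, starts, 1, 3)
```

Here, column 3 of the first plain-text line (the letter 'b') is mapped back to column 9 of the first LaTeX line.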

Finally, function

tex2txt.myopen(filename, encoding, mode='r')

is similar to standard function open(), but it requires an explicit encoding specification and converts a possible exception into an error message.

Application examples

The module interface is demonstrated in function main() that is activated when running the script tex2txt.py directly.

Example Python script shell.py will generate a proofreading report in text or HTML format from filtering the LaTeX input and application of LanguageTool (LT). On option '--server lt', LT's Web server is contacted. Otherwise, Java has to be present, and the path to LT has to be customised in script variable 'ltdirectory'; compare the corresponding comment in script. Note that from version 4.8, LT does not fully support 32-bit systems any more. File tex2txt.py should reside in the current directory, see also the beginning of this section. Both LT and the script will print some progress messages to stderr. They can be suppressed with python3 shell.py ... 2>/dev/null.

python3 shell.py [--html] [--link] [--context number]
                 [--include] [--skip regex] [--plain] [--list-unknown]
                 [--language lang] [--t2t-lang lang] [--encoding ienc]
                 [--replace file] [--define file] [--extract macros]
                 [--disable rules] [--lt-options opts]
                 [--single-letters accept] [--equation-punctuation mode]
                 [--server mode] [--lt-server-options opts]
                 [--textgears apikey]
                 latex_file [latex_file ...] [> text_or_html_file]

Option names may be abbreviated. If present, options are also read from a configuration file designated by script variable config_file (one option per line, possibly with argument). Default option values are set at the Python script beginning.

Dictionary adaptation. LT evaluates the two files 'spelling.txt' and 'prohibit.txt' in directory

.../LanguageTool-?.?/org/languagetool/resource/<lang-code>/hunspell/

Additional words and words that shall raise an error can be appended here. LT version 4.8 introduced additional files 'spelling_custom.txt' and 'prohibit_custom.txt'.

HTML report. The idea of an HTML report goes back to Sylvain Hallé, who developed TeXtidote. Opened in a Web browser, the report displays excerpts from the original LaTeX text, highlighting the problems indicated by LT. The corresponding LT messages can be viewed when hovering the mouse over these marked places, see the introductory example above. With option --link, Web links provided by LT can be directly opened with left-click. Script option --context controls the number of lines displayed around each tagged region; a negative option value will show the complete LaTeX input text. If the localisation of a problem is unsure, highlighting will use yellow instead of orange colour. For simplicity, marked text regions that intertwine with other ones are separately repeated at the end. In case of multiple input files, the HTML report starts with an index.

Simpler demonstration script. A simpler Python application is shell2.py. It resembles Bash script shell2.sh from section Simple scripts, but it accepts multiple inputs and does not create auxiliary files.

Back to top

Remarks on implementation

Parsing with regular expressions is fun, but it remains a rather coarse approximation of the “real thing”. Nevertheless, it seems to work quite well for our purposes, and it inherits high flexibility from the Python environment. A stricter approach could be based on software like plasTeX or pylatexenc. Another attempt has been started with YaLafi.

In order to parse nested structures, some regular expressions are constructed by iteration. At the beginning, we therefore check, for instance, whether nested {} braces of the current input text exceed the depth covered by the corresponding regular expression. In that case, an error message is generated, and the variable parms.max_depth_br for maximum brace nesting depth has to be changed. Setting the control variables to 100, for instance, does work, but also increases resource usage.
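The construction by iteration can be sketched like this. The functions braced and nesting_depth are hypothetical and only demonstrate the technique; the expressions actually used in the script may look different.

```python
import re

def braced(max_depth):
    """Regex source matching {...} with at most max_depth nesting levels."""
    inner = r'[^{}]*'                      # innermost level: no braces at all
    for _ in range(max_depth - 1):
        # each iteration allows one more level of nested {...}
        inner = r'(?:[^{}]|\{' + inner + r'\})*'
    return r'\{' + inner + r'\}'

def nesting_depth(text):
    """Return the maximum nesting depth of {} braces in 'text'."""
    depth = deepest = 0
    for char in text:
        if char == '{':
            depth += 1
            deepest = max(deepest, depth)
        elif char == '}':
            depth -= 1
    return deepest

matches = re.fullmatch(braced(2), '{a{b}c}') is not None
too_deep = nesting_depth('{a{b{c}}}') > 2
```

The generated expression grows with each level, which is one reason why a very large maximum depth costs more resources.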

A severe general problem is the order of macro expansion. While TeX strictly evaluates from left to right, the order of treatment by regular expressions is completely different. Additionally, we mimic TeX's behaviour in skipping white space between macro name and next non-space character. This calls for hacks like the regular expression in variable skip_space_macro together with the temporary placeholder in mark_begin_env. It prevents a macro without arguments from consuming leading space inside of an already resolved following environment. Besides, it protects a line break, for instance in front of an equation environment. Another issue emerges with input text like '\y{a\z} b' that can lead to the output 'ab', if macro \z is expanded after macro \y{...} taking an argument. The workaround inserts the temporary placeholder in variable mark_deleted for each closing } brace or ] bracket, when a macro argument is expanded.

Our mechanism for line number tracking relies on a partial reimplementation of the substitution function re.sub() from the standard Python module for regular expressions. Here, the manipulated text string is replaced by a pair of this same string and an array of integers. These represent the estimated original line numbers of the lines in the current text string part. During substitution, the line number array is adjusted upon deletion or inclusion of line breaks. The tracking of character positions for option --char works similarly.
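The principle can be shown by a reduced sketch. It is not the script's function: the name sub_tracked is hypothetical, only string templates are accepted as replacement, and the rule which number a merged line receives is an assumption made for this example.

```python
import re

def sub_tracked(pattern, repl, text, nums):
    """Substitute like re.sub() and keep one line number per output line.

    nums[k] is the estimated original line number of line k in 'text'.
    """
    out = []
    new_nums = [nums[0]]
    pos = 0      # end of the part of 'text' copied so far
    line = 0     # index of the line that contains position 'pos'
    for m in re.finditer(pattern, text):
        # unchanged text before the match: its line numbers are copied
        kept = text.count('\n', pos, m.start())
        new_nums.extend(nums[line + 1:line + 1 + kept])
        line += kept
        out.append(text[pos:m.start()])
        start_line = line
        replacement = m.expand(repl)
        out.append(replacement)
        # line breaks deleted with the match merge the affected lines
        line += m.group(0).count('\n')
        # line breaks inserted by the replacement start new lines
        added = replacement.count('\n')
        if added:
            new_nums.extend([nums[start_line]] * (added - 1) + [nums[line]])
        pos = m.end()
    kept = text.count('\n', pos)
    new_nums.extend(nums[line + 1:line + 1 + kept])
    out.append(text[pos:])
    return ''.join(out), new_nums

plain, numbers = sub_tracked(r'\\footnote\{[^{}]*\}', '',
                             'Text\\footnote{a\nb} more\nnext', [1, 2, 3])
```

After the deletion of the two-line footnote, the text has only two lines left, but the second one still carries the original line number 3.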

Since creation of new empty lines may break the text flow, we avoid it with a simple scheme. Whenever a LaTeX macro is expanded or an environment frame is deleted, the mark from variable mark_deleted is left in the text string. At the very end, these marks are deleted, and lines only consisting of space and such marks are removed completely. This also means that initially blank lines remain in the text (except those only containing a % comment).
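A minimal sketch of this scheme follows. The placeholder string and the function names are hypothetical; the real mark is held in variable mark_deleted.

```python
import re

MARK = '@@DEL@@'   # hypothetical placeholder, assumed not to occur in the input

def delete_marked(pattern, text):
    """Delete all matches, but leave a mark at each place."""
    return re.sub(pattern, MARK, text)

def remove_marks(text):
    """Drop lines consisting only of space and marks, then all other marks."""
    mark = re.escape(MARK)
    only_marks = r'^[ \t]*' + mark + r'(?:[ \t]|' + mark + r')*\n'
    text = re.sub(only_marks, '', text, flags=re.MULTILINE)
    return text.replace(MARK, '')

text = 'A\n\\label{x}\nB\n\nC \\foo D\n'
text = delete_marked(r'\\label\{[^{}]*\}', text)
text = delete_marked(r'\\foo(?![A-Za-z])', text)
cleaned = remove_marks(text)
```

The line that held only \label{x} vanishes completely, whereas the line that was blank from the start is kept, because it contains no mark.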

Under category Issues, some known shortcomings are listed. Additionally, we have marked several problems as BUG in the script.

Back to top