ocrd-page-to-alto
Convert PAGE (v. 2019) to ALTO (v. 2.0 - 4.2)
Introduction
This software converts PAGE XML files to the ALTO XML OCR result format. It lets you use software that generates PAGE XML in contexts where ALTO is needed to display the results, e.g. in libraries.
Installation
In a Python virtualenv:
make install # or pip install .
# or to install from PyPI
pip install ocrd_page_to_alto
Usage
To convert the PAGE XML document example.xml to ALTO:
page-to-alto example.xml > example.alto.xml
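To convert a whole directory of PAGE files outside an OCR-D workspace, the single-file command above can be wrapped in a loop. This is only a minimal sketch and not part of page-to-alto: the function name convert_dir and the PAGE_TO_ALTO variable are made up for illustration, and it assumes every *.xml file in the source directory is a PAGE document.

```shell
#!/bin/sh
# Command used for the conversion (hypothetical override hook, not a
# page-to-alto feature); defaults to the page-to-alto executable.
PAGE_TO_ALTO="${PAGE_TO_ALTO:-page-to-alto}"

# convert_dir SRC_DIR DEST_DIR
# Converts each SRC_DIR/<name>.xml to DEST_DIR/<name>.alto.xml.
convert_dir() {
    src_dir=$1
    dest_dir=$2
    mkdir -p "$dest_dir" || return 1
    for page_file in "$src_dir"/*.xml; do
        # Skip the unexpanded pattern when the directory has no XML files.
        [ -e "$page_file" ] || continue
        name=$(basename "$page_file" .xml)
        alto_file="$dest_dir/$name.alto.xml"
        # Same invocation as the single-file example: ALTO goes to stdout.
        if ! "$PAGE_TO_ALTO" "$page_file" > "$alto_file"; then
            # Do not leave a truncated ALTO file behind on failure.
            rm -f "$alto_file"
            return 1
        fi
    done
}
```

After sourcing the snippet, `convert_dir page/ alto/` would write one ALTO file per PAGE file. Use a separate destination directory so that a second run does not pick up the generated .alto.xml files as input.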
You can get an exhaustive list of page-to-alto's many options with --help.
To process an OCR-D workspace, use ocrd_fileformat, which uses page-to-alto by default:
ocrd-fileformat-transform -I OCRD-OCR-OUTPUT-PAGE -O OCRD-OCR-OUTPUT-ALTO \
-P script-args "--dummy-word --no-check-words --no-check-border"
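The options in script-args are handed on to page-to-alto, so the same lenient settings can presumably be used when calling the tool directly. The wrapper below is a sketch under that assumption: convert_lenient and PAGE_TO_ALTO are hypothetical names, and the flags are taken verbatim from the command above, so check them against --help for your installed version.

```shell
#!/bin/sh
# Command used for the conversion (hypothetical override hook, not a
# page-to-alto feature); defaults to the page-to-alto executable.
PAGE_TO_ALTO="${PAGE_TO_ALTO:-page-to-alto}"

# convert_lenient PAGE_FILE
# Runs the converter with the same flags that the ocrd-fileformat-transform
# example passes via script-args; ALTO is written to stdout.
convert_lenient() {
    "$PAGE_TO_ALTO" --dummy-word --no-check-words --no-check-border "$1"
}
```

For example, `convert_lenient example.xml > example.alto.xml` mirrors the basic usage with the extra flags applied.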
TODO
- AlternativeImage
- unmappable regions
- handle Border
- TextStyle
- ParagraphStyle
- table regions
- recursive regions for *Region
- Set PAGECLASS from pc:Page/@type #4
- Layers / z-level via StructureTag? #4
- <SP/>
- <HYP/>
- rotation
- reading order
- input PAGE-XML not having words #5
- multiple pc:TextEquivs
- language
- script: no equivalent in ALTO :(
- kerning: no equivalent in ALTO :(
- underlineStyle: no equivalent in ALTO :(
- bgColour: no equivalent in ALTO :(
- bgColourRgb: no equivalent in ALTO :(
- reverseVideo: no equivalent in ALTO :(
- xHeight: no equivalent in ALTO :(
- letterSpaced: no equivalent in ALTO :(
- ProcessingStep
- differentiate/select ALTO versions