OCR conversion
Collection of scripts and stylesheets for conversion between various OCR formats.
You may also want to check out the excellent ocr-fileformat by @UB-Mannheim.
ABBYY
- abbyy2hocr.xsl - ABBYY FineReader XML to hOCR converter by @Rod Page
- abbyy2hocr.xsl - ABBYY FineReader XML to hOCR converter by @Rod Page - updated by @OCR-D
- abbyy-to-hocr - ABBYY FineReader XML to hOCR converter by @merlijn
- teip5-v5.xsl - Transform ABBYY FineReader XML into TEI @UPEI
- ABBYY_to_TEI_by_XMLReader.php - Convert ABBYY XML to TEI using PHP's XMLReader @able-project
- ocr_to_teifacsimile.xsl - Generate page-level TEI facsimile from ABBYY OCR XML or METS/ALTO @readux
- AbbyyToAlto.php - PHP5 script to convert ABBYY FineReader XML into ALTO XML @ironymark
- AbbyyToAltoConverter.java - Java library to convert abbyy.xml (v10) to alto.xml (v2) @abbyy-to-alto
ALTO
- alto2tei.xsl - Output TEI from ALTO input format @OpenConvert
- AltoToTeiA.xsl - For Gale OCR XML or 18thConnect Typewright XML files @typewright
- ocr_to_teifacsimile.xsl - Generate page-level TEI facsimile from ABBYY OCR XML or METS/ALTO @readux
- alto2hocr.xsl - Convert ALTO 2.0 / ALTO 2.1 to hOCR @filak
- alto2text.xsl - Convert ALTO 2.0 / ALTO 2.1 to plain text @filak
- alto_ocr_text.py - Extracts the text from an ALTO file and writes it to stdout @cneud
- ALTO2HTML.bat - Batch script to convert ALTO files to HTML @altomator
- dinglehopper-extract - Extracts the text from ALTO and PAGE XML files @qurator-spk
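
For orientation, here is a minimal sketch of what ALTO-to-text extraction involves. It is not the code of alto_ocr_text.py or dinglehopper-extract. It only assumes the standard ALTO layout, where each TextLine holds String elements whose text is stored in the CONTENT attribute. The function names are hypothetical. Namespaces are stripped, so the ALTO version should not matter.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix so that any ALTO version matches."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(source):
    """Return the text of an ALTO document, one TextLine per output line.

    `source` is a file path or a binary file object (hypothetical helper,
    not part of any tool listed above).
    """
    root = ET.parse(source).getroot()
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Each word is a String child; its text lives in the CONTENT attribute.
        words = [
            child.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return "\n".join(lines)
```

Calling `print(alto_to_text("page.xml"))` (with a file name of your own) writes the text to stdout. Hyphenation (HYP) and spacing (SP) elements are ignored in this sketch, so hyphenated words are not rejoined.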
hOCR
- hOCR2ALTO.xsl - Utilities to process and handle hOCR @ONB-RD
- hocr2alto2.0.xsl - Convert hOCR to ALTO 2.0 @filak
- hocr2alto2.1.xsl - Convert hOCR to ALTO 2.1 @filak
- hocr2tei.xsl - Convert hOCR from Tesseract to basic TEI output @DH2015
- hocr2tei.xsl - Convert hOCR from Tesseract to basic TEI output from @DH2015 - updated by @OCR-D
- hocr2text.xsl - Convert hOCR to plain text @filak
- HocrConverter.py - Create a PDF from an hOCR file and an image @jbrinley
PAGE
- PageConverter.java - Convert ALTO XML, FineReader XML, Google CV, and hOCR to the latest PAGE XML format @prima
- xml_to_box.xsl - Convert PAGE XML to Tesseract box file @eMOP
- page_to_text.py - Extracts the text from a PAGE file and writes it to stdout @cneud
- PageToPdfConverter.java - Convert PAGE XML files with layout and text content to PDF @prima
- page2tei-0.xsl - Convert PAGE XML to TEI @dariok
- PageToAlto.xsl - Convert PAGE XML to ALTO @Transkribus
- page-to-alto - Convert PAGE XML to ALTO (all versions) @kba
- dinglehopper-extract - Extracts the text from ALTO and PAGE XML files @qurator-spk
TEI
- tei2txt.xsl - Convert DTA TEI-P5 to plain text @haoess
- tei2hocr.xsl - Convert DTA TEI-P5 to hOCR @jbaiter
Other
- iw2alto.xsl - Convert ImageWare MyBib eL OCR to ALTO @karkraeg
- transkribus-xslt - Various stylesheets from Transkribus @readcoop
- transkribus-to-prima - Convert Transkribus dialect to official PAGE XML format @kba
- textract2page - Convert Amazon AWS Textract to PAGE XML @slub
- gcv2hocr - Convert Google Cloud Vision to hOCR @dinosauria123