# img2table
img2table is a simple, easy-to-use Python library for table identification and extraction, based on OpenCV image processing, that supports most common image file formats as well as PDF files.
Thanks to its design, it provides a practical and lighter alternative to neural-network-based solutions, especially for usage on CPU.
## Table of contents
1. [Installation](#installation)
2. [Features](#features)
3. [Supported file formats](#supported-file-formats)
4. [Usage](#usage)
5. [Examples](#examples)
## Installation <a name="installation"></a>
The library can be installed via pip:
- <code>pip install img2table</code>: Standard installation, supporting Tesseract
- <code>pip install img2table[paddle]</code>: For usage with Paddle OCR
- <code>pip install img2table[easyocr]</code>: For usage with EasyOCR
- <code>pip install img2table[surya]</code>: For usage with Surya OCR
- <code>pip install img2table[gcp]</code>: For usage with Google Vision OCR
- <code>pip install img2table[aws]</code>: For usage with AWS Textract OCR
- <code>pip install img2table[azure]</code>: For usage with Azure Cognitive Services OCR
## Features <a name="features"></a>
- Table identification for images and PDF files, including bounding boxes at the table cell level
- Handling of complex table structures such as merged cells
- Handling of implicit content - see [example](/examples/Implicit.ipynb)
- Table content extraction by providing support for OCR services / tools
- Extracted tables are returned as a simple object, including a Pandas DataFrame representation
- Export extracted tables to an Excel file, preserving their original structure
## Supported file formats <a name="supported-file-formats"></a>
Images <a name="images-formats"></a>
Images are loaded using the <code>opencv-python</code> library; all image formats readable by OpenCV are supported, such as <code>.bmp</code>, <code>.jpg</code>, <code>.png</code>, <code>.tiff</code> and <code>.webp</code>.
PDF <a name="pdf-formats"></a>
Both native and scanned PDF files are supported.
## Usage <a name="usage"></a>
### Documents <a name="documents"></a>
Images <a name="images-doc"></a>
Images are instantiated as follows:
```python
from img2table.document import Image

image = Image(src,
              detect_rotation=False)
```
<h4>Parameters</h4> <dl> <dt>src : str, <code>pathlib.Path</code>, bytes or <code>io.BytesIO</code>, required</dt> <dd style="font-style: italic;">Image source</dd> <dt>detect_rotation : bool, optional, default <code>False</code></dt> <dd style="font-style: italic;">Detect and correct skew/rotation of the image</dd> </dl><br> The implemented method to handle skewed/rotated images supports skew angles up to 45° and is based on the publication by <a href="https://www.mdpi.com/2079-9292/9/1/55">Huang, 2020</a>.<br> When the <code>detect_rotation</code> parameter is set to <code>True</code>, image coordinates and bounding boxes returned by other methods might not correspond to the original image.
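For illustration, the accepted source types can be used interchangeably. A minimal sketch (the file name below is hypothetical):
```python
from io import BytesIO
from pathlib import Path

from img2table.document import Image

# Equivalent ways of providing the image source
image = Image("document.png")            # str path
image = Image(Path("document.png"))      # pathlib.Path
with open("document.png", "rb") as f:
    content = f.read()
image = Image(content)                   # raw bytes
image = Image(BytesIO(content))          # io.BytesIO buffer
```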
PDF <a name="pdf-doc"></a>
PDF files are instantiated as follows:
```python
from img2table.document import PDF

pdf = PDF(src,
          pages=[0, 2],
          detect_rotation=False,
          pdf_text_extraction=True)
```
<h4>Parameters</h4> <dl> <dt>src : str, <code>pathlib.Path</code>, bytes or <code>io.BytesIO</code>, required</dt> <dd style="font-style: italic;">PDF source</dd> <dt>pages : list, optional, default <code>None</code></dt> <dd style="font-style: italic;">List of PDF page indexes to be processed. If None, all pages are processed</dd> <dt>detect_rotation : bool, optional, default <code>False</code></dt> <dd style="font-style: italic;">Detect and correct skew/rotation of extracted images from the PDF</dd> <dt>pdf_text_extraction : bool, optional, default <code>True</code></dt> <dd style="font-style: italic;">Extract text from the PDF file for native PDFs</dd> </dl>
PDF pages are converted to images at 200 DPI for table identification.
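Since pages are rendered at 200 DPI while native PDF coordinates are expressed in points (1/72 inch), pixel coordinates returned for a PDF can be mapped back to PDF points. A minimal sketch of this conversion, assuming the 200 DPI rendering mentioned above:
```python
def px_to_pdf_points(px: float, dpi: int = 200) -> float:
    # 1 PDF point = 1/72 inch; a page image rendered at `dpi` has `dpi` pixels per inch
    return px * 72 / dpi

x_pts = px_to_pdf_points(400)  # 400 px at 200 DPI -> 144.0 points
```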
OCR <a name="ocr"></a>
img2table provides an interface for several OCR services and tools in order to parse table content.<br>
If possible (i.e. for native PDFs), text will be extracted directly from the PDF file and the OCR service/tool will not be called.
<details> <summary>Tesseract<a name="tesseract"></a></summary> <br>
```python
from img2table.ocr import TesseractOCR

ocr = TesseractOCR(n_threads=1,
                   lang="eng",
                   psm=11,
                   tessdata_dir="...")
```
<h4>Parameters</h4> <dl> <dt>n_threads : int, optional, default <code>1</code></dt> <dd style="font-style: italic;">Number of concurrent threads used to call Tesseract</dd> <dt>lang : str, optional, default <code>"eng"</code></dt> <dd style="font-style: italic;">Lang parameter used in Tesseract for text extraction</dd> <dt>psm : int, optional, default <code>11</code></dt> <dd style="font-style: italic;">PSM parameter used in Tesseract, run <code>tesseract --help-psm</code> for details</dd> <dt>tessdata_dir : str, optional, default <code>None</code></dt> <dd style="font-style: italic;">Directory containing Tesseract traineddata files. If None, the <code>TESSDATA_PREFIX</code> env variable is used.</dd> </dl>
Usage of Tesseract-OCR requires prior installation. Check the documentation for instructions. <br> For Windows users getting environment variable errors, you can check this tutorial. <br>
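As a sketch of a possible workaround (the install path below is an assumption and depends on your setup), the variable can also be set from Python before instantiating the OCR, or the <code>tessdata_dir</code> parameter shown above can be used instead:
```python
import os

# Hypothetical Windows install path; adjust to your Tesseract installation
os.environ["TESSDATA_PREFIX"] = r"C:\Program Files\Tesseract-OCR\tessdata"

ocr = TesseractOCR(lang="eng")
```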
</details> <details> <summary>PaddleOCR<a name="paddle"></a></summary> <br><a href="https://github.com/PaddlePaddle/PaddleOCR">PaddleOCR</a> is an open-source OCR based on Deep Learning models.<br> At first use, relevant language models will be downloaded.
```python
from img2table.ocr import PaddleOCR

ocr = PaddleOCR(lang="en",
                kw={"kwarg": kw_value, ...})
```
<h4>Parameters</h4> <dl> <dt>lang : str, optional, default <code>"en"</code></dt> <dd style="font-style: italic;">Lang parameter used in Paddle for text extraction, check <a href="https://github.com/Mushroomcat9998/PaddleOCR/blob/main/doc/doc_en/multi_languages_en.md#5-support-languages-and-abbreviations">documentation for available languages</a></dd> <dt>kw : dict, optional, default <code>None</code></dt> <dd style="font-style: italic;">Dictionary containing additional keyword arguments passed to the PaddleOCR constructor.</dd> </dl><br> <b>NB:</b> For usage of PaddleOCR with GPU, the CUDA-specific version of paddlepaddle-gpu has to be installed manually by the user as stated in this <a href="https://github.com/PaddlePaddle/PaddleOCR/issues/7993">issue</a>.
```bash
# Example of installation with CUDA 11.8
pip install paddlepaddle-gpu==2.5.0rc1.post118 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
pip install paddleocr img2table
```
If you get an error trying to run PaddleOCR on Ubuntu, please check this <a href="https://github.com/PaddlePaddle/PaddleOCR/discussions/9989#discussioncomment-6642037">issue</a> for a working solution.
<br> </details> <details> <summary>EasyOCR<a name="easyocr"></a></summary> <br><a href="https://github.com/JaidedAI/EasyOCR">EasyOCR</a> is an open-source OCR based on Deep Learning models.<br> At first use, relevant language models will be downloaded.
```python
from img2table.ocr import EasyOCR

ocr = EasyOCR(lang=["en"],
              kw={"kwarg": kw_value, ...})
```
<h4>Parameters</h4> <dl> <dt>lang : list, optional, default <code>["en"]</code></dt> <dd style="font-style: italic;">Lang parameter used in EasyOCR for text extraction, check <a href="https://www.jaided.ai/easyocr">documentation for available languages</a></dd> <dt>kw : dict, optional, default <code>None</code></dt> <dd style="font-style: italic;">Dictionary containing additional keyword arguments passed to the EasyOCR <code>Reader</code> constructor.</dd> </dl><br> </details> <details> <summary>docTR<a name="docTR"></a></summary> <br>
<a href="https://github.com/mindee/doctr">docTR</a> is an open-source OCR based on Deep Learning models.<br> In order to be used, docTR has to be installed by the user beforehand. Installation procedures are detailed in the package documentation
```python
from img2table.ocr import DocTR

ocr = DocTR(detect_language=False,
            kw={"kwarg": kw_value, ...})
```
<h4>Parameters</h4> <dl> <dt>detect_language : bool, optional, default <code>False</code></dt> <dd style="font-style: italic;">Parameter indicating if language prediction is run on the document</dd> <dt>kw : dict, optional, default <code>None</code></dt> <dd style="font-style: italic;">Dictionary containing additional keyword arguments passed to the docTR <code>ocr_predictor</code> method.</dd> </dl><br> </details> <details> <summary>Surya OCR<a name="surya"></a></summary> <br>
<b><i>Only available for <code>python >= 3.10</code></i></b><br> <a href="https://github.com/VikParuchuri/surya">Surya</a> is an open-source OCR based on Deep Learning models.<br> At first use, relevant models will be downloaded.
```python
from img2table.ocr import SuryaOCR

ocr = SuryaOCR(langs=["en"])
```
<h4>Parameters</h4> <dl> <dt>langs : list, optional, default <code>["en"]</code></dt> <dd style="font-style: italic;">Lang parameter used in Surya OCR for text extraction</dd> </dl><br> </details> <details> <summary>Google Vision<a name="vision"></a></summary> <br>
Authentication to GCP can be done by setting the standard <code>GOOGLE_APPLICATION_CREDENTIALS</code> environment variable.<br>
If this variable is missing, an API key should be provided via the <code>api_key</code> parameter.
```python
from img2table.ocr import VisionOCR

ocr = VisionOCR(api_key="api_key", timeout=15)
```
<h4>Parameters</h4> <dl> <dt>api_key : str, optional, default <code>None</code></dt> <dd style="font-style: italic;">Google Vision API key</dd> <dt>timeout : int, optional, default <code>15</code></dt> <dd style="font-style: italic;">API requests timeout, in seconds</dd> </dl><br> </details> <details> <summary>AWS Textract<a name="textract"></a></summary> <br>
When using AWS Textract, only the <code>DetectDocumentText</code> API is called.<br>
Authentication to AWS can be done by passing credentials to the <code>TextractOCR</code> class.<br>
If credentials are not provided, authentication is done using environment variables or configuration files. Check the <code>boto3</code> documentation for more details.
```python
from img2table.ocr import TextractOCR

ocr = TextractOCR(aws_access_key_id="***",
                  aws_secret_access_key="***",
                  aws_session_token="***",
                  region="eu-west-1")
```
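Alternatively, if credentials are already configured through environment variables or AWS configuration files, a minimal sketch relying on boto3's default credential resolution:
```python
# Credentials are resolved by boto3 (environment variables, ~/.aws/credentials, ...)
ocr = TextractOCR(region="eu-west-1")
```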
<h4>Parameters</h4> <dl> <dt>aws_access_key_id : str, optional, default <code>None</code></dt> <dd style="font-style: italic;">AWS access key id</dd> <dt>aws_secret_access_key : str, optional, default <code>None</code></dt> <dd style="font-style: italic;">AWS secret access key</dd> <dt>aws_session_token : str, optional, default <code>None</code></dt> <dd style="font-style: italic;">AWS temporary session token</dd> <dt>region : str, optional, default <code>None</code></dt> <dd style="font-style: italic;">AWS server region</dd> </dl><br> </details> <details> <summary>Azure Cognitive Services<a name="azure"></a></summary> <br>
```python
from img2table.ocr import AzureOCR

ocr = AzureOCR(endpoint="abc.azure.com",
               subscription_key="***")
```
<h4>Parameters</h4> <dl> <dt>endpoint : str, optional, default <code>None</code></dt> <dd style="font-style: italic;">Azure Cognitive Services endpoint. If None, inferred from the <code>COMPUTER_VISION_ENDPOINT</code> environment variable.</dd> <dt>subscription_key : str, optional, default <code>None</code></dt> <dd style="font-style: italic;">Azure Cognitive Services subscription key. If None, inferred from the <code>COMPUTER_VISION_SUBSCRIPTION_KEY</code> environment variable.</dd> </dl><br> </details>
### Table extraction <a name="table-extract"></a>
Multiple tables can be extracted at once from a PDF page or an image using the <code>extract_tables</code> method of a document.
```python
from img2table.ocr import TesseractOCR
from img2table.document import Image

# Instantiation of OCR
ocr = TesseractOCR(n_threads=1, lang="eng")

# Instantiation of document, either an image or a PDF
doc = Image(src)

# Table extraction
extracted_tables = doc.extract_tables(ocr=ocr,
                                      implicit_rows=False,
                                      implicit_columns=False,
                                      borderless_tables=False,
                                      min_confidence=50)
```
<h4>Parameters</h4> <dl> <dt>ocr : OCRInstance, optional, default <code>None</code></dt> <dd style="font-style: italic;">OCR instance used to parse document text. If None, cell content will not be extracted</dd> <dt>implicit_rows : bool, optional, default <code>False</code></dt> <dd style="font-style: italic;">Boolean indicating if implicit rows should be identified - check related <a href="/examples/Implicit.ipynb" target="_self">example</a></dd> <dt>implicit_columns : bool, optional, default <code>False</code></dt> <dd style="font-style: italic;">Boolean indicating if implicit columns should be identified - check related <a href="/examples/Implicit.ipynb" target="_self">example</a></dd> <dt>borderless_tables : bool, optional, default <code>False</code></dt> <dd style="font-style: italic;">Boolean indicating if <a href="/examples/borderless.ipynb" target="_self">borderless tables</a> are extracted <b>on top of</b> bordered tables.</dd> <dt>min_confidence : int, optional, default <code>50</code></dt> <dd style="font-style: italic;">Minimum confidence level from OCR in order to process text, from 0 (worst) to 99 (best)</dd> </dl>
<b>NB</b>: Borderless table extraction can, by design, only extract tables with 3 or more columns.
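For instance, borderless table extraction can be enabled on top of the default bordered extraction (reusing the <code>doc</code> and <code>ocr</code> objects from above):
```python
# Extract borderless tables in addition to bordered tables
extracted_tables = doc.extract_tables(ocr=ocr,
                                      borderless_tables=True)
```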
#### Method return
The <code>ExtractedTable</code> class is used to model tables extracted from documents.
<h4>Attributes</h4> <dl> <dt>bbox : <code>BBox</code></dt> <dd style="font-style: italic;">Table bounding box</dd> <dt>title : str</dt> <dd style="font-style: italic;">Extracted title of the table</dd> <dt>content : <code>OrderedDict</code></dt> <dd style="font-style: italic;">Dict with row indexes as keys and list of <code>TableCell</code> objects as values</dd> <dt>df : <code>pd.DataFrame</code></dt> <dd style="font-style: italic;">Pandas DataFrame representation of the table</dd> <dt>html : <code>str</code></dt> <dd style="font-style: italic;">HTML representation of the table</dd> </dl><br>
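As an illustration of these attributes (assuming at least one table was extracted above):
```python
table = extracted_tables[0]

print(table.title)   # extracted title, if any
print(table.bbox)    # table bounding box
print(table.df)      # pandas DataFrame representation
html = table.html    # HTML representation of the table
```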
In order to access bounding boxes at the cell level, you can use the following code snippet:
```python
for id_row, row in enumerate(table.content.values()):
    for id_col, cell in enumerate(row):
        # Cell bounding box coordinates
        x1 = cell.bbox.x1
        y1 = cell.bbox.y1
        x2 = cell.bbox.x2
        y2 = cell.bbox.y2
        # Text content of the cell
        value = cell.value
```
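Since img2table is based on OpenCV, these bounding boxes can for instance be visualized by drawing them on the source image. A minimal sketch, assuming <code>src</code> is a readable image path:
```python
import cv2

img = cv2.imread(src)
for row in table.content.values():
    for cell in row:
        # Draw each cell bounding box as a 2px rectangle (BGR color)
        cv2.rectangle(img,
                      (cell.bbox.x1, cell.bbox.y1),
                      (cell.bbox.x2, cell.bbox.y2),
                      (0, 0, 255), 2)
cv2.imwrite("cells.png", img)
```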
<h5 style="color:grey">Images</h5>
The <code>extract_tables</code> method of the <code>Image</code> class returns a list of <code>ExtractedTable</code> objects.
```python
output = [ExtractedTable(...), ExtractedTable(...), ...]
```
<h5 style="color:grey">PDF</h5>
The <code>extract_tables</code> method of the <code>PDF</code> class returns an <code>OrderedDict</code> with page indexes as keys and lists of <code>ExtractedTable</code> objects as values.
```python
output = {
    0: [ExtractedTable(...), ...],
    1: [],
    ...
    last_page: [ExtractedTable(...), ...]
}
```
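The returned dictionary can then be iterated page by page, for example:
```python
for page_idx, tables in output.items():
    for table in tables:
        print(f"Page {page_idx}: {table.df.shape[0]} rows extracted")
```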
### Excel export <a name="xlsx"></a>
Tables extracted from a document can be exported to an xlsx file. The resulting file is composed of one worksheet per extracted table.<br>
Method arguments are mostly shared with the <code>extract_tables</code> method.
```python
from img2table.ocr import TesseractOCR
from img2table.document import Image

# Instantiation of OCR
ocr = TesseractOCR(n_threads=1, lang="eng")

# Instantiation of document, either an image or a PDF
doc = Image(src)

# Extraction of tables and creation of an xlsx file containing tables
doc.to_xlsx(dest=dest,
            ocr=ocr,
            implicit_rows=False,
            implicit_columns=False,
            borderless_tables=False,
            min_confidence=50)
```
<h4>Parameters</h4> <dl> <dt>dest : str, <code>pathlib.Path</code> or <code>io.BytesIO</code>, required</dt> <dd style="font-style: italic;">Destination for xlsx file</dd> <dt>ocr : OCRInstance, optional, default <code>None</code></dt> <dd style="font-style: italic;">OCR instance used to parse document text. If None, cell content will not be extracted</dd> <dt>implicit_rows : bool, optional, default <code>False</code></dt> <dd style="font-style: italic;">Boolean indicating if implicit rows should be identified - check related <a href="/examples/Implicit.ipynb" target="_self">example</a></dd> <dt>implicit_columns : bool, optional, default <code>False</code></dt> <dd style="font-style: italic;">Boolean indicating if implicit columns should be identified - check related <a href="/examples/Implicit.ipynb" target="_self">example</a></dd> <dt>borderless_tables : bool, optional, default <code>False</code></dt> <dd style="font-style: italic;">Boolean indicating if <a href="/examples/borderless.ipynb" target="_self">borderless tables</a> are extracted. An OCR instance must be provided to the method for this feature - <b>feature in alpha version</b></dd> <dt>min_confidence : int, optional, default <code>50</code></dt> <dd style="font-style: italic;">Minimum confidence level from OCR in order to process text, from 0 (worst) to 99 (best)</dd> </dl> <h4>Returns</h4> If an <code>io.BytesIO</code> buffer is passed as the <code>dest</code> argument, it is returned containing the xlsx data
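For example, tables can be exported to an in-memory buffer instead of a file (reusing the <code>doc</code> and <code>ocr</code> objects from above):
```python
from io import BytesIO

# A BytesIO destination is returned filled with the xlsx data
buffer = doc.to_xlsx(dest=BytesIO(), ocr=ocr)

with open("tables.xlsx", "wb") as f:
    f.write(buffer.getvalue())
```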
## Examples <a name="examples"></a>
Several Jupyter notebooks with examples are available:
<ul> <li> <a href="/examples/Basic_usage.ipynb" target="_self">Basic usage</a>: generic library usage, including examples with images, PDF and OCRs </li> <li> <a href="/examples/borderless.ipynb" target="_self">Borderless tables</a>: specific examples dedicated to the extraction of borderless tables </li> <li> <a href="/examples/Implicit.ipynb" target="_self">Implicit content</a>: illustrated effect of the parameter <code>implicit_rows</code>/<code>implicit_columns</code> of the <code>extract_tables</code> method </li> </ul>