Awesome
DocBank Loader
DocBank Loader is a dataset loader for DocBank, and can convert DocBank to the Object Detection models' format.
The DocBank GitHub repositoriy is https://github.com/doc-analysis/DocBank
Usage
Dependency
pip install -U Pillow tqdm
Installation
git clone https://github.com/doc-analysis/DocBankLoader.git
cd DocBankLoader
pip install .
DocBankLoader
from docbank_loader import DocBankLoader
txt_dir = '/path/to/DocBank_500K_txt'
img_dir = '/path/to/DocBank_500K_ori_img'
loader = DocBankLoader(txt_dir=txt_dir, img_dir=img_dir)
# Load all the examples
examples = loader.read_all()
# <list of docbank_loader.docbank_loader.Example>
# Sample N examples
examples = loader.sample_n(n=5)
# <list of docbank_loader.docbank_loader.Example>
# Load example by basename/filename
example = loader.get_by_filename('295.tar_1712.06217.gz_main_6_ori.jpg')
# <docbank_loader.docbank_loader.Example>
# Load examples by the index file
examples = loader.read_by_index('path/to/index/file')
# <list of docbank_loader.docbank_loader.Example>
DocBankConverter
from docbank_loader import DocBankLoader, DocBankConverter
txt_dir = '/path/to/DocBank_500K_txt'
img_dir = '/path/to/DocBank_500K_ori_img'
loader = DocBankLoader(txt_dir=txt_dir, img_dir=img_dir)
converter = DocBankConverter(loader)
# Convert all the examples
examples = converter.read_all()
# <list of docbank_loader.docbank_converter.CVExample>
# Sample N examples and convert
examples = converter.sample_n(n=5)
# <list of docbank_loader.docbank_converter.CVExample>
# Convert example by basename/filename
example = converter.get_by_filename('295.tar_1712.06217.gz_main_6_ori.jpg')
# <docbank_loader.docbank_converter.CVExample>
Example
Each document page and the corresponding DocBank annotation or CV annotation form an Example.
Example for DocBank
example = loader.sample_n(n=1)[0] # Sample a example
type(example)
# <docbank_loader.docbank_loader.Example>
im = example.plot() # Plot the DocBank annotation and return a PIL.Image.Image
bboxes = example.denormalized_bboxes() # return the denormalized bounding boxes of tokens
example.filepath # The image filepath
example.pagesize # The image size
example.words # The tokens
example.bboxes # The normalized bboxes
example.rgbs # The RGB values
example.fontnames # The fontnames
example.structures # The structure labels
Example for Object Detection models' format
example = converter.sample_n(n=1)[0] # Sample a example
type(example)
# <docbank_loader.docbank_converter.CVExample>
im = example.plot() # Plot the DocBank annotation and return a PIL.Image.Image
im = example.plot_bbox() # Plot the CV annotation and return a PIL.Image.Image
print(example.print_bbox()) # Print the bounding boxes and labels
example.filepath # The image filepath
example.pagesize # The image size
example.bboxes # The bboxes in normalized coordinates