This work is licensed under a Creative Commons Attribution 4.0 International License.

CORD: A Consolidated Receipt Dataset for Post-OCR Parsing

We introduce a novel dataset called CORD, which stands for a COnsolidated Receipt Dataset for post-OCR parsing.

Abstract [paper]

OCR is inevitably linked to NLP since its final output is text. Advances in document intelligence are driving the need for a unified technology that integrates OCR with various NLP tasks, especially semantic parsing. Since OCR and semantic parsing have so far been studied as separate tasks, the datasets for each task on its own are rich, while those for the integrated post-OCR parsing tasks are relatively insufficient. In this study, we publish a consolidated dataset for receipt parsing as the first step towards post-OCR parsing tasks. The dataset consists of thousands of Indonesian receipts and contains images and box/text annotations for OCR, as well as multi-level semantic labels for parsing. The proposed dataset can be used to address various OCR and parsing tasks.

Updates

Key Features

Data Specification (for the whole dataset)

Class Definition (total 30)

| No | Category | Tag field (subclasses) | Description |
|----|----------|------------------------|-------------|
| 1 | menu (14) | menu.nm | name of menu |
| 2 | | menu.num | identification # of menu |
| 3 | | menu.unitprice | unit price of menu |
| 4 | | menu.cnt | quantity of menu |
| 5 | | menu.discountprice | discounted price of menu |
| 6 | | menu.price | total price of menu |
| 7 | | menu.itemsubtotal | price of each menu after discount applied |
| 8 | | menu.vatyn | whether the price includes tax or not |
| 9 | | menu.etc | others |
| 10 | | menu.sub_nm | name of submenu |
| 11 | | menu.sub_num | identification # of submenu |
| 12 | | menu.sub_unitprice | unit price of submenu |
| 13 | | menu.sub_cnt | quantity of submenu |
| 14 | | menu.sub_discountprice | discounted price of submenu |
| 15 | | menu.sub_price | total price of submenu |
| 16 | | menu.sub_etc | others |
| 17 | void menu (2) | void_menu.nm | name of menu |
| 18 | | voidmenu.num | identification # of menu |
| 19 | | voidmenu.unitprice | unit price of menu |
| 20 | | voidmenu.cnt | quantity of menu |
| 21 | | void_menu.price | total price of menu |
| 22 | | voidmenu.etc | others |
| 23 | subtotal (6) | subtotal.subtotal_price | subtotal price |
| 24 | | subtotal.discount_price | discounted price in total |
| 25 | | subtotal.subtotal_count | Total number of items |
| 26 | | subtotal.service_price | service charge |
| 27 | | subtotal.othersvc_price | added charge other than service charge |
| 28 | | subtotal.tax_price | tax amount |
| 29 | | subtotal.tax_and_service | tax + service |
| 30 | | subtotal.etc | others |
| 31 | void total (0) | voidtotal.subtotal_price | void subtotal price |
| 32 | | voidtotal.tax_price | void tax price |
| 33 | | voidtotal.total_price | total void price |
| 34 | | voidtotal.etc | void etc information |
| 35 | total (8) | total.total_price | total price |
| 36 | | total.total_etc | others |
| 37 | | total.cashprice | amount of price paid in cash |
| 38 | | total.changeprice | amount of change in cash |
| 39 | | total.creditcardprice | amount of price paid in credit/debit card |
| 40 | | total.emoneyprice | amount of price paid in emoney, point |
| 41 | | total.menutype_cnt | total count of type of menu |
| 42 | | total.menuqty_cnt | total count of quantity |
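
Each tag in the table above has the form category.subclass, so a label can be split at its first dot to recover the category it belongs to. The following is a minimal Python sketch of that idea; `split_tag` and `group_by_category` are hypothetical helper names introduced here for illustration and are not part of any CORD tooling.

```python
from collections import defaultdict


def split_tag(tag):
    """Split a tag such as 'menu.sub_nm' into ('menu', 'sub_nm')."""
    category, sep, subclass = tag.partition(".")
    if not sep or not category or not subclass:
        raise ValueError(f"expected 'category.subclass', got {tag!r}")
    return category, subclass


def group_by_category(tags):
    """Map each category to its subclasses, preserving the input order."""
    groups = defaultdict(list)
    for tag in tags:
        category, subclass = split_tag(tag)
        groups[category].append(subclass)
    return dict(groups)


if __name__ == "__main__":
    # A few tags copied from the class table; any subset works the same way.
    sample = ["menu.nm", "menu.cnt", "menu.price",
              "total.total_price", "total.cashprice"]
    for category, subclasses in group_by_category(sample).items():
        print(category, subclasses)
```

Splitting at the first dot only keeps subclasses such as sub_nm intact, since underscores rather than dots separate the remaining parts of a tag.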

Json Hierarchy

| Attribute Name | | | Description |
|----------------|---|---|-------------|
| valid_line | words | quad | Four coordinates of quadrilateral |
| | | is_key | Flag indicating whether the text is used as a key |
| | | row_id | Line index |
| | | text | Text of the corresponding box |
| | category | | Parse class label |
| | group_id | | Group id to which the valid_line belongs |
| meta | version | | Dataset version |
| | image_id | | Corresponding image id |
| | split | | 'train' or 'dev' or 'test' |
| | image_size | | Size of the image (in pixels) |
| roi* | | | Four coordinates that encompass the receipt region |
| repeating_symbol | quad | | Four coordinates of quadrilateral |
| | text | | The repeated symbol: = or - or . or etc. |

*A blank 'roi' value means the entire area of the image.
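
The hierarchy above can be walked with a few lines of Python. The sketch below groups the text of each valid_line by group_id and category, and resolves a blank roi to the full image as described in the note. The sample record is written by hand for illustration and is not taken from the dataset. The key names inside quad and roi (x1, y1, ..., x4, y4) and inside image_size (width, height) are assumptions, since the table does not spell them out. `parse_fields`, `roi_bounds`, and the underscore-prefixed builders are hypothetical helpers.

```python
def _quad(left, top, right, bottom):
    """Build an axis-aligned quad (assumed key layout: x1, y1, ..., x4, y4)."""
    return {"x1": left, "y1": top, "x2": right, "y2": top,
            "x3": right, "y3": bottom, "x4": left, "y4": bottom}


def _word(text, box, is_key=0, row_id=1):
    """Build one entry of the `words` list."""
    return {"quad": _quad(*box), "is_key": is_key, "row_id": row_id, "text": text}


# Hypothetical annotation, hand-written for illustration only.
SAMPLE = {
    "valid_line": [
        {"words": [_word("NASI", (10, 10, 60, 30)),
                   _word("GORENG", (65, 10, 140, 30))],
         "category": "menu.nm", "group_id": 1},
        {"words": [_word("25,000", (300, 10, 380, 30))],
         "category": "menu.price", "group_id": 1},
        {"words": [_word("TOTAL", (10, 50, 80, 70), is_key=1, row_id=2),
                   _word("25,000", (300, 50, 380, 70), row_id=2)],
         "category": "total.total_price", "group_id": 2},
    ],
    "meta": {"version": "v0", "image_id": 0, "split": "train",
             "image_size": {"width": 400, "height": 600}},  # assumed layout
    "roi": {},  # blank: the whole image is the receipt region
    "repeating_symbol": [],
}


def parse_fields(annotation, include_keys=False):
    """Return {group_id: {category: [text, ...]}} built from valid_line."""
    groups = {}
    for line in annotation["valid_line"]:
        # Words flagged with is_key are labels (e.g. "TOTAL"), not values.
        words = [w["text"] for w in line["words"]
                 if include_keys or not w["is_key"]]
        if not words:
            continue
        fields = groups.setdefault(line["group_id"], {})
        fields.setdefault(line["category"], []).append(" ".join(words))
    return groups


def roi_bounds(annotation):
    """Return (left, top, right, bottom); a blank roi means the whole image."""
    roi = annotation.get("roi")
    if not roi:
        size = annotation["meta"]["image_size"]
        return (0, 0, size["width"], size["height"])
    xs = [roi[f"x{i}"] for i in range(1, 5)]
    ys = [roi[f"y{i}"] for i in range(1, 5)]
    return (min(xs), min(ys), max(xs), max(ys))


if __name__ == "__main__":
    print(parse_fields(SAMPLE))
    print(roi_bounds(SAMPLE))
```

Grouping by group_id keeps the fields of one menu item together (here the name and the price share group 1), which is usually what a downstream parser wants. If the real files use a different key layout for quad or image_size, only `_quad` and `roi_bounds` need to change.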

Download Link

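| Version | Name | Total | # train | # dev | # test | Release date |
|---------|------|-------|---------|-------|--------|--------------|
| v0 | sample (zip) | 1,000 | 800 | 100 | 100 | 26 Dec 2019 |
| v1 | Hugging Face Datasets Link | 1,000 | 800 | 100 | 100 | 20 Jul 2022 |
| v2 | Hugging Face Datasets Link | 1,000 | 800 | 100 | 100 | 20 Jul 2022 |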

Citation

CORD: A Consolidated Receipt Dataset for Post-OCR Parsing

@inproceedings{park2019cord,
  title={CORD: A Consolidated Receipt Dataset for Post-OCR Parsing},
  author={Park, Seunghyun and Shin, Seung and Lee, Bado and Lee, Junyeop and Surh, Jaeheung and Seo, Minjoon and Lee, Hwalsuk},
  booktitle={Document Intelligence Workshop at Neural Information Processing Systems},
  year={2019}
}

Post-OCR parsing: building simple and robust parser via BIO tagging

@inproceedings{hwang2019post,
  title={Post-OCR parsing: building simple and robust parser via BIO tagging},
  author={Hwang, Wonseok and Kim, Seonghyeon and Yim, Jinyeong and Seo, Minjoon and Park, Seunghyun and Park, Sungrae and Lee, Junyeop and Lee, Bado and Lee, Hwalsuk},
  booktitle={Document Intelligence Workshop at Neural Information Processing Systems},
  year={2019}
}

OCR-free Document Understanding Transformer 🍩

@article{kim2021donut,
   title={OCR-free Document Understanding Transformer},
   author={Kim, Geewook and Hong, Teakgyu and Yim, Moonbin and Nam, JeongYeon and Park, Jinyoung and Yim, Jinyeong and Hwang, Wonseok and Yun, Sangdoo and Han, Dongyoon and Park, Seunghyun},
   journal={arXiv preprint arXiv:2111.15664},
   year={2021}
}