To digitize a document, we actually need to do two things:
- extract the text contained in an image as characters (OCR)
- label the extracted text and give it a meaning within the document (labeling/classification)
To label the content, we can use natural language processing (based on text extracted with the OCR) and/or document layout analysis (based on the input image).
The objective of this research is to extract the content of train tickets and label it correctly (origin station, destination station, ticket price, company, line, date of issue, validity dates, etc.).
The dataset contains around 150 images of train tickets. Some are scanned and some are photographed, but all are of relatively good quality (good enough to apply OCR at least partially).
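As an illustration of the text-based labeling route, here is a minimal sketch of a rule that pulls the ticket price out of OCR output. It assumes the OCR text writes prices as `510円` or `¥510`; that assumption, the pattern list, and the function name `extract_price` are hypothetical and not part of the challenge. A real solution would likely need more patterns or a learned model.

```python
import re
from typing import Optional

# Hypothetical patterns: assumes prices look like "¥510" or "510円" in the OCR text.
PRICE_PATTERNS = [
    re.compile(r"[¥￥]\s*(\d[\d,]*)"),  # currency sign before the amount
    re.compile(r"(\d[\d,]*)\s*円"),      # amount followed by the yen character
]


def extract_price(ocr_text: str) -> Optional[str]:
    """Return the first price found in the text as a digit string, or None."""
    for pattern in PRICE_PATTERNS:
        match = pattern.search(ocr_text)
        if match:
            # Drop thousands separators so "14,240" becomes "14240".
            return match.group(1).replace(",", "")
    return None
```

For example, `extract_price("立川 510円")` gives `"510"`, and a text without any price gives `None`. Similar rules could be written for the date fields, while station and line names probably call for a dictionary lookup or a classifier.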
Examples:
Labels:
"origin","destination","line","company","price","issued_on","available_from","expire_on"
"甲府","町田","はまかいじ号","JR","5190","2007-09-16","2007-09-16","2007-09-16"
Labels:
"origin","destination","line","company","price","issued_on","available_from","expire_on"
"立川","","中央ライナー","","510","2018-09-21","2018-09-21","2018-09-21"
Labels:
"origin","destination","line","company","price","issued_on","available_from","expire_on"
"","","","東京都交通局","1590","2019-01-13","2019-01-13","2019-01-13"
Labels:
"origin","destination","line","company","price","issued_on","available_from","expire_on"
"池袋","所沢","","西武鉄道","400","2018-12-23","2018-12-23","2018-12-23"
Labels:
"origin","destination","line","company","price","issued_on","available_from","expire_on"
"三田","尼崎","","","630","2011-03-25","2011-03-25","2011-03-25"
Labels:
"origin","destination","line","company","price","issued_on","available_from","expire_on"
"盛岡","東京","新幹線","JR","14240","2012-10-07","2012-10-08","2012-10-08"
We ask you to submit a script that predicts the entity values for the test images, which are downloaded to data/test
during the CI pipeline. The results must be formatted in the specified way so that we can evaluate them.
Thus, the script has to:
- walk through the directory data/test and load the images found there (named *.png)
- extract the content of each image as text and classify it
- write the results and the original file names to standard output in CSV format (see below)
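The first two steps could be sketched as follows. This is only a skeleton under stated assumptions: `find_images` and `predict` are hypothetical names, and `predict` is a placeholder that returns empty values where the actual OCR and classification would go. The search is recursive, which is one reading of "walk through the directory".

```python
from pathlib import Path
from typing import Dict, List

# Entity names taken from the label examples in this document.
FIELDS = ["origin", "destination", "line", "company", "price",
          "issued_on", "available_from", "expire_on"]


def find_images(test_dir: str = "data/test") -> List[Path]:
    """Return all *.png files under test_dir (recursively), sorted by path."""
    root = Path(test_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*.png") if p.is_file())


def predict(image_path: Path) -> Dict[str, str]:
    """Hypothetical placeholder: run OCR and labeling on one image.

    Returns an empty string for every entity until a real model is plugged in.
    """
    return {field: "" for field in FIELDS}
```

Keeping discovery and prediction in separate functions makes it possible to test the directory handling without running any OCR.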
The standard output must contain the results formatted as follows:
filename,origin,destination,line,company,price,issued_on,available_from,expire_on
1.png,都区内,都区内,,JR,730,2010-01-19,2010-01-19,2010-01-19
2.png,甲府,町田,はまかいじ号,JR,5190,2007-09-16,2007-09-16,2007-09-16
4.png,町田,土合,楊浜線,JR,3570,2007-11-04,2007-11-23,2007-11-25
5.png,水上,大宮,EL&SL奥利根号,JR,510,2007-11-04,2007-11-23,2007-11-23
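A minimal sketch of producing this output with Python's standard `csv` module is shown below. The function name `write_results` and the choice to emit an empty string for any missing entity are assumptions; the column order follows the header shown above.

```python
import csv
import sys
from typing import Dict, Iterable, TextIO

# Column order of the expected output header.
COLUMNS = ["filename", "origin", "destination", "line", "company",
           "price", "issued_on", "available_from", "expire_on"]


def write_results(rows: Iterable[Dict[str, str]], out: TextIO = sys.stdout) -> None:
    """Write the header and one CSV line per prediction to `out`."""
    writer = csv.DictWriter(
        out,
        fieldnames=COLUMNS,
        restval="",             # entities that were not predicted stay empty
        extrasaction="ignore",  # keys outside COLUMNS are dropped
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
```

Calling `write_results([{"filename": "1.png", "origin": "都区内", "price": "730"}])` prints the header followed by `1.png,都区内,,,,730,,,`. Using a CSV writer rather than joining strings by hand means that a value containing a comma is quoted correctly.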