Name		Name	Last commit message	Last commit date
parent directory ..
README.md		README.md

README.md

鉄道乗車券画像からの情報抽出 / Entity Extraction from Train Ticket Images

背景 / Background

When we want to digitalize a document, we need to do actually two things:

extract the text contained in an image as characters (OCR)
label the content of the text and give it a signification within a document (labeling/classification)

To label the content, we can use natural language processing (based on text extracted with the OCR) and/or document layout analysis (based on the input image).

課題 / Task

The objective of this research is to extract the content of train tickets and correctly label it (origin station, destination station, price of the ticket, company, line, date of emission, date of validity, etc.)

テストデータ / Data

The dataset contains around 150 images of train tickets. Some are scanned, some are photographed, but they are all in a relatively good quality (enough to apply OCR at least partially).

Click here for examples

Labels:

"origin","destination","line","company","price","issued_on","available_from","expire_on"
"甲府","町田","はまかいじ号","JR","5190","2007-09-16","2007-09-16","2007-09-16"

Labels:

"origin","destination","line","company","price","issued_on","available_from","expire_on"
"立川","","中央ライナー","","510","2018-09-21","2018-09-21","2018-09-21"

Labels:

"origin","destination","line","company","price","issued_on","available_from","expire_on"
"","","","東京都交通局","1590","2019-01-13","2019-01-13","2019-01-13"

Labels:

"origin","destination","line","company","price","issued_on","available_from","expire_on"
"池袋","所沢","","西武鉄道","400","2018-12-23","2018-12-23","2018-12-23"

Labels:

"origin","destination","line","company","price","issued_on","available_from","expire_on",""
"三田","尼崎","","","630","2011-03-25","2011-03-25","2011-03-25"

"origin","destination","line","company","price","issued_on","available_from","expire_on"
"盛岡","東京","新幹線","JR","14240","2012-10-07","2012-10-08","2012-10-08"

提出 / Submission

We ask to submit a script to predict values for the entities from the test images, that will be downloaded under data/test during the CI pipeline. The results must be formatted in a specified way so that we can evaluate it.

Thus, the script has to:

walk through the directory data/test, and load found images (named *.png)
extract the content of the image as text and classify it
output to the standard output the results and the original file names in CSV format (see below)

出力例 / Sample Output

Stdout must return a result formatted as:

filename,origin,destination,line,company,price,issued_on,available_from,expire_on
1.png,都区内,都区内,,JR,730,2010-01-19,2010-01-19,2010-01-19
2.png,甲府,町田,はまかいじ号,JR,5190,2007-09-16,2007-09-16,2007-09-16
4.png,町田,土合,楊浜線,JR,3570,2007-11-04,2007-11-23,2007-11-25
5.png,水上,大宮,EL&SL奥利根号,JR,510,2007-11-04,2007-11-23,2007-11-23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

02-entity_extraction

02-entity_extraction

README.md

鉄道乗車券画像からの情報抽出 / Entity Extraction from Train Ticket Images

背景 / Background

課題 / Task

テストデータ / Data

提出 / Submission

出力例 / Sample Output

Files

02-entity_extraction

Directory actions

More options

Directory actions

More options

Latest commit

History

02-entity_extraction

Folders and files

parent directory

README.md

鉄道乗車券画像からの情報抽出 / Entity Extraction from Train Ticket Images

背景 / Background

課題 / Task

テストデータ / Data

提出 / Submission

出力例 / Sample Output