Marginalia and Machine Learning: Handwritten text recognition for Marginalia Collections

ektavats/Project-Marginalia

 
 

Marginalia-HTR

A PyTorch implementation of a Handwritten Text Recognition (HTR) system for the automatic detection and recognition of handwritten marginalia, i.e., text written in the margins of a page or other handwritten notes. A Faster R-CNN network detects the marginalia, and AttentionHTR recognizes the words. The data comes from early printed book collections held by the Uppsala University Library that contain handwritten marginalia.

For more details, refer to our paper (also available on arXiv):

Liang Cheng, Jonas Frankemölle, Adam Axelsson and Ekta Vats. "Uncovering the Handwritten Text in the Margins: End-to-end Handwritten Text Detection and Recognition." In Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024), pp. 111-120, co-located with the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2024).

Dependencies

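To set up a virtual environment and install the dependencies, run the following commands from the directory that contains the Project-Marginalia folder: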

python3 -m venv marginalia-env
source marginalia-env/bin/activate
pip install --upgrade pip
python3 -m pip install -r Project-Marginalia/requirements.txt
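After installation, a quick check can confirm that the virtual environment is active and that the main packages are importable. The following is a minimal sketch using only the standard library. The helper names and the package list (torch, torchvision) are assumptions made for illustration; requirements.txt remains the authoritative list.

```python
import importlib.util
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def missing_packages(names):
    """Return the top-level package names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed package list; requirements.txt is the authoritative source.
    required = ["torch", "torchvision"]
    print("virtual environment active:", in_virtualenv())
    print("missing packages:", missing_packages(required))
```

An empty list of missing packages suggests the install step succeeded for the packages that were checked.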

Demo of our pre-trained model

Marginalia prediction

  • Download the dataset from here.
  • Download the pre-trained model faster_r_cnn_weights.pt from here and place it in Project-Marginalia/model/.
  • Create the folder Project-Marginalia/model/results/.
  • To detect and visualize the marginalia, run python3 model/test.py
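The folder and weight-file steps above can be sketched as a small pre-flight helper. This is an illustrative sketch, not part of the repository: the function name prepare_demo is hypothetical, and only the paths named in the steps above are used.

```python
from pathlib import Path


def prepare_demo(repo_root):
    """Create model/results/ and report whether the weights file is in place."""
    model_dir = Path(repo_root) / "model"
    results_dir = model_dir / "results"
    # parents=True also creates model/ if it is missing;
    # exist_ok=True makes repeated calls safe.
    results_dir.mkdir(parents=True, exist_ok=True)
    weights = model_dir / "faster_r_cnn_weights.pt"
    return results_dir, weights.is_file()
```

Calling prepare_demo("Project-Marginalia") from the directory that contains the clone would create the results folder and return False as the second value until the weights have been downloaded.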


Word recognition using AttentionHTR

  • To recognise the words with AttentionHTR, follow the instructions from here
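This README does not describe how detected regions are handed over to the recognizer, so the following is only a hypothetical sketch of one possible bridging step. It assumes detections arrive as (x1, y1, x2, y2) pixel boxes with one confidence score each, which is the format torchvision's Faster R-CNN models return. The function name select_regions and the 0.5 threshold are assumptions. Splitting the regions into word images is outside the scope of the sketch.

```python
import math


def select_regions(boxes, scores, image_size, min_score=0.5):
    """Keep confident boxes and clamp them to integer pixel bounds.

    boxes: iterable of (x1, y1, x2, y2) in pixels.
    scores: one confidence value per box.
    image_size: (width, height) of the page image.
    min_score: assumed confidence threshold, not taken from the repository.
    """
    width, height = image_size
    regions = []
    for (x1, y1, x2, y2), score in zip(boxes, scores):
        if score < min_score:
            continue  # skip low-confidence detections
        # Round outwards so the crop never cuts into the detected box,
        # then clamp to the image borders.
        left = max(0, math.floor(x1))
        top = max(0, math.floor(y1))
        right = min(width, math.ceil(x2))
        bottom = min(height, math.ceil(y2))
        if right > left and bottom > top:  # drop empty boxes
            regions.append((left, top, right, bottom))
    return regions
```

Each returned tuple follows the (left, upper, right, lower) order that Pillow's Image.crop accepts, so the regions could be cropped from the page image before further processing.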

Acknowledgements

  • This work was partially supported by the Uppsala-Durham Strategic Development Fund: "Marginalia and Machine Learning: a Study of Durham University and Uppsala University Marginalia Collections".
  • The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) partially funded by the Swedish Research Council through grant agreement no. 2022-06725.
  • The authors would like to thank Raphaela Heil and Peter Heslin for valuable suggestions and feedback.
  • The authors would like to thank Uppsala University Library (Alvin) for providing the dataset and Vasiliki Sampa for help in preparing the dataset annotations.

References

[1] Dmitrijs Kass and Ekta Vats. "AttentionHTR: Handwritten Text Recognition Based on Attention Encoder-Decoder Networks." International Workshop on Document Analysis Systems. Springer, Cham, 2022.

Contact

Adam Axelsson (adam.axelssons@gmail.com)

Liang Cheng (chengliang653@gmail.com)

Jonas Frankemölle (frankemoelle.jonas@gmail.com)

Ekta Vats (ektavats@gmail.com)
