Text area detection in handwritten documents scanned for further processing


Pach Jakub Leszek
Krupa Artur
Antoniuk Izabella

Keywords: text area detection, handwritten text, machine learning, optical character recognition, text recognition

In this paper we present an approach to text area detection that uses binary images, the Constrained Run Length Algorithm (CRLA), and additional noise reduction methods for removing artefacts. Text processing involves various activities, most of which concern preparing the input data so that it does not hinder the OCR algorithms applied later. This is especially true of handwritten manuscripts, and even more so of very old documents. We present our methodology for the text area detection problem, which is capable of removing most irrelevant objects, including page edges, stains, and folds. At the same time, the presented method can handle multi-column texts and varying line thickness. The generated mask can accurately mark the actual text area, so the output image can be used directly in further text processing steps.
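To illustrate the kind of run-length smoothing the abstract refers to, the following is a minimal sketch in Python. It is not the authors' implementation. It assumes the image is already binarised (1 = ink, 0 = background), that only background gaps bounded by ink on both sides are filled, and that the horizontal and vertical results are combined with a logical AND, as in classic run-length smoothing schemes. The function names and the gap thresholds `h_gap` and `v_gap` are hypothetical; the abstract gives no parameter values and does not describe the additional constraints or noise reduction steps.

```python
def smooth_row(row, max_gap):
    """Fill background runs (0) of length <= max_gap that lie
    between two foreground pixels (1). Assumed rule, see lead-in."""
    out = list(row)
    last_fg = None  # index of the most recent foreground pixel
    for i, px in enumerate(row):
        if px == 1:
            gap = i - last_fg - 1 if last_fg is not None else 0
            if 0 < gap <= max_gap:
                for j in range(last_fg + 1, i):
                    out[j] = 1
            last_fg = i
    return out


def smooth_image(image, h_gap, v_gap):
    """Smooth rows and columns independently, then keep only the
    pixels set in both results (logical AND)."""
    horizontal = [smooth_row(r, h_gap) for r in image]
    # Transpose, smooth the columns as rows, and transpose back.
    columns = [list(c) for c in zip(*image)]
    vertical_t = [smooth_row(c, v_gap) for c in columns]
    vertical = [list(r) for r in zip(*vertical_t)]
    return [[h & v for h, v in zip(hr, vr)]
            for hr, vr in zip(horizontal, vertical)]


if __name__ == "__main__":
    page = [
        [1, 0, 0, 1, 0, 0, 0, 0, 1],
        [1, 0, 1, 1, 0, 0, 0, 0, 1],
    ]
    # Hypothetical thresholds chosen only for this toy example.
    for line in smooth_image(page, h_gap=2, v_gap=1):
        print(line)
```

With a small threshold, gaps between letters and words close into solid blocks while wider gaps, such as the space between columns, stay open. That behaviour is what makes this family of methods a natural fit for building a text-area mask on multi-column pages.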

Article Details

How to Cite
Pach, J. L., Krupa, A., & Antoniuk, I. (2020). Text area detection in handwritten documents scanned for further processing. Machine Graphics and Vision, 29(1/4), 21–31. https://doi.org/10.22630/MGV.2020.29.1.2



