Radiotekhnika
Publishing house Radiotekhnika

Methods of post-correction of OCR errors in problems of automatic processing of archival documents

Keywords:

S.V. Smirnov – Post-graduate Student, SPIIRAS


Historical documents form a body of material for mass recognition for which existing recognition systems alone are not sufficient. If the recognized data contain spelling errors or recognition inaccuracies, they are hard to process, and no standard full-text search system will return them to the user. To meet the needs of an application, the output of the recognition system must therefore undergo post-processing that corrects recognition errors. The quality of correction depends largely on how accurately errors are detected and how correctly they are classified. All errors fall into two categories: word errors and non-word errors. Methods for preprocessing recognition results include pattern replacement, correction of alphanumeric errors, removal of noise and excess punctuation, and correction of undetected hyphenation. Recognition errors must be detected before any attempt is made to correct them. Detection methods are divided into two groups: methods for detecting word errors and methods for detecting non-word errors. The most important methods for detecting word errors are OCR likelihood estimation, n-gram analysis, and dictionary lookup. Methods for correcting non-word errors fall into the following categories: minimum distance between words, similarity between words, rule-based character replacement, character n-grams, probabilistic methods, and neural networks. All OCR error correction methods can be divided into two groups: correction based on comparing the output of several OCR systems, and correction using the output of a single OCR system. The most commonly used methods and algorithms for OCR error correction are Levenshtein distance, anagram hashing, OCR-key, n-gram analysis, and neural networks.
The best results are achieved by systems that combine the methods listed above.
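
As a rough illustration of how two of the listed methods can be combined, the sketch below uses dictionary lookup to detect a suspect token and Levenshtein distance to pick a replacement. It is a minimal example and not the system described in the article: the function names, the toy lexicon, and the distance threshold of 2 are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca -> cb (or match)
            ))
        previous = current
    return previous[-1]


def correct_token(token: str, lexicon: set, max_distance: int = 2) -> str:
    """Hypothetical helper: detect an error by dictionary lookup, then
    return the closest lexicon entry if it is near enough."""
    if token in lexicon or not lexicon:
        return token  # known word (or nothing to compare against)
    # Ties are broken alphabetically so the result is deterministic.
    best = min(lexicon, key=lambda word: (levenshtein(token, word), word))
    if levenshtein(token, best) <= max_distance:
        return best
    return token  # no plausible candidate: leave the token unchanged


if __name__ == "__main__":
    toy_lexicon = {"archive", "document", "record"}  # assumed sample data
    for raw in ["docurnent", "archive", "xyzzy"]:
        print(raw, "->", correct_token(raw, toy_lexicon))
```

Scanning the whole lexicon for every token is only practical for small vocabularies. Techniques such as anagram hashing or OCR-key, both named in the abstract, are typically used to narrow the candidate set before an edit distance is computed.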

