Publishing house Radiotekhnika


Tel.: +7 (495) 625-9241

 

Benchmarking OCR systems for the task of building a search engine for archival document images

Keywords:

S. V. Smirnov - Ph.D. (Eng.), St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS). E-mail: serge.smir@gmail.com


This paper considers the topical task of selecting an OCR system suited to a specific project. The comparative analysis and selection are carried out in the context of recognizing the documents of a Russian archival fund. The recognition results are intended for subsequent full-text indexing, so that the content of the documents can be searched. The comparison covers two recognized leaders among commercial systems («ABBYY FineReader», «Nuance OmniPage»), one less popular commercial system («IRIS Readiris»), and all free systems that support recognition of the Russian language («Cuneiform Linux», «Cuneiform Windows», «Tesseract»). The analysis of the results focuses on evaluation criteria that operate on words, since such criteria are better suited to the task of searching over images. «ABBYY FineReader» achieves the best results on all data sets and is the undisputed leader. «Cuneiform Linux», by contrast, shows the worst recognition results. The remaining systems occupy an intermediate position, with minor differences among them. «Tesseract» is the only freely distributed system that provides the coordinates of words in the image, and it shows high quality relative to the other systems, with the exception of «ABBYY FineReader». The system «Tesseract» is therefore a good choice for building a search engine for images. The quality of the final recognition results depends largely on the preprocessing and post-correction procedures. Improving the quality of both open-source and commercial systems requires developing and implementing methods for pre- and post-processing. A more detailed assessment of recognition quality in the context of building a search engine for images requires additional criteria that take into account the needs and characteristics of the search engine.
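As an illustration of what a criterion that operates on words might look like, the following is a minimal Python sketch. It is not the set of criteria used in the paper. It assumes a simple bag-of-words comparison between a ground-truth transcription and the OCR output, and the names `tokenize` and `word_scores` are hypothetical. Word recall is the more relevant figure for search, since a word that was not recognized correctly cannot be found by a full-text query.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    In Python 3, \\w is Unicode-aware for str patterns, so Cyrillic
    words are matched as well as Latin ones.
    """
    return re.findall(r"\w+", text.lower())


def word_scores(reference, recognized):
    """Return (precision, recall) computed over word multisets.

    Assumption: word order and position are ignored; only the number
    of exactly matching words is counted.
    """
    ref = Counter(tokenize(reference))
    hyp = Counter(tokenize(recognized))
    # Multiset intersection keeps the minimum count of each shared word.
    matched = sum((ref & hyp).values())
    precision = matched / sum(hyp.values()) if hyp else 0.0
    recall = matched / sum(ref.values()) if ref else 0.0
    return precision, recall


if __name__ == "__main__":
    truth = "архивный фонд документов 1925 года"
    ocr = "архивный фопд документов 1925 года"  # one misrecognized word
    print(word_scores(truth, ocr))
```

A character-level measure would score the single wrong letter in this example as a small error, whereas the word-level measure counts the whole word as lost, which matches how a full-text index would treat it.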

 


© Publishing house Radiotekhnika, 2004-2017