Creating special-purpose text corpora on the basis of an extended TXM platform


A.M. Lavrentiev – Ph.D.(Philol.), CNRS & ENS de Lyon (France)
I.V. Smirnov – Ph.D.(Phys.-Math.), Head of Department, FRC «Computer Science and Control» of RAS (Moscow)
M.I. Suvorova – Research Scientist, FRC «Computer Science and Control» of RAS (Moscow)
F.N. Solov'ev – Research Scientist, Institute of Physical and Technical Informatics (Protvino)
A.I. Fokina – Student, HSE (Moscow)
A.M. Chepovskiy – Dr.Sc.(Eng.), Professor, HSE (Moscow)

The TXM platform offers a wide range of corpus analysis capabilities, including correspondence analysis, clustering, lexical table construction, and parametrized subcorpus selection. The default structural unit of analysis in TXM is the token. However, each token can be supplied with a number of features, which makes more sophisticated yet flexible corpus analysis possible. The only extension available by default is TreeTagger, which adds automatic morphological analysis of tokens to the platform. In this work we present a number of tools for more extensive and complex corpus analysis, relying both on tools we developed previously and on publicly available ones.
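The idea of tokens carrying features can be illustrated with a minimal sketch. It is not TXM's API or data model: the `Token` class, the `select` helper, and the feature names `pos` and `lemma` are hypothetical and serve only to show how per-token features make feature-based selection possible.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """A token with an open-ended set of features (hypothetical model)."""
    word: str
    features: dict = field(default_factory=dict)


def select(tokens, **conditions):
    """Return the tokens whose features match every name=value condition."""
    return [
        t for t in tokens
        if all(t.features.get(name) == value for name, value in conditions.items())
    ]


# Illustrative data; the feature values are made up for this example.
corpus = [
    Token("corpora", {"pos": "NOUN", "lemma": "corpus"}),
    Token("are", {"pos": "VERB", "lemma": "be"}),
    Token("analysed", {"pos": "VERB", "lemma": "analyse"}),
]

verbs = select(corpus, pos="VERB")
print([t.word for t in verbs])
```

Every additional annotation tool would, in this picture, simply add further keys to the feature dictionary of each token, and the same selection mechanism would then apply to them.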

