text in normal form
notional hierarchy of a text
thematic clustering of texts
P.P. Kokorin, Boumedyen A.N. Shannaq, E. V. Schelkunova
Internet technologies automate the collection, accumulation, distribution and processing of data presented in various text formats. All of them are intended to process documents rather than the meaning of the text inside them. These technologies cannot be applied to provide intelligent services such as e-learning and automatic text annotation, or to establish the semantic, associative and notional equivalence of texts.
When a user interacts with an information system, a conflict arises between the notional terminology of the fields of knowledge and the thesauri of the databases and knowledge bases as they are represented in information systems. Interpretation and understanding arise as an iterative process: a dialogue (interface) that starts from a prior terminological basis (dictionary) and develops a specific notional terminology.
Interface interpretation and the equivalence of notions are supported by the hierarchical notional cores of thematic text topics in a knowledge domain. Moreover, this requires continuous adaptation of the general-usage glossary to the glossary of the knowledge domain.
In this paper we propose approaches and methods of thematic clustering that are part of an infological approach. The infological approach consists of an iterative process of forming thematic knowledge by identifying thematic anthologies, revealing their thesauri and glossaries, creating a hierarchy of ontological notions, and forming the semantic clouds of selected texts.
The representation of texts in normal form (TNF) is the basis of the proposed approach. A text in normal form is plain text (TXT) in which every word is in its normal (base) form and from which uninformative words such as conjunctions, prepositions and pronouns (known as stop-words) are excluded. Text in normal form is used to build the hierarchies of notions of texts, which are used in the proposed method of thematic text clustering. The proposed method is based on subgraph search methods. The measure of thematic proximity of texts is the closeness of their notion hierarchies.
The infological approach proposed in this paper was tested on the following information sources: clustering of research topics into scientific areas, a news system, and an abstract system (museum system, self-reference).
Conclusion: Testing the thematic clustering system on these information sources (scientific resources, news, and museum objects) showed the applicability and effectiveness of the proposed normalized text format (TNF). Experiments showed high efficiency of the proposed clustering method for technical texts and news in Russian. Using the proposed method in a news system allows thematic clustering of the news stream, thereby reducing the redundancy of the news stream and the volume of input data streams.
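The conversion to text in normal form and a proximity measure over notion hierarchies can be illustrated with a minimal sketch. It is not the authors' implementation. The stop-word list, the lemma dictionary, and the function names `to_tnf` and `edge_overlap` are hypothetical, and the Jaccard overlap of hierarchy edges is only a simple stand-in for the subgraph-search-based closeness, which the text does not specify in detail.

```python
import re

# Hypothetical, tiny resources for illustration only; a real system would use
# a full morphological dictionary and a stop-word list for the target language.
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "in", "on", "to", "is", "are"}
LEMMAS = {
    "texts": "text",
    "clustered": "cluster",
    "hierarchies": "hierarchy",
    "notions": "notion",
}


def to_tnf(text):
    """Return the text as a list of base-form words with stop-words removed."""
    # Keep alphabetic tokens only (Unicode-aware, so Cyrillic words also match).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # Map each token to its base form; unknown words are kept unchanged.
    normalized = [LEMMAS.get(token, token) for token in tokens]
    return [token for token in normalized if token not in STOP_WORDS]


def edge_overlap(hierarchy_a, hierarchy_b):
    """Jaccard overlap of two hierarchies given as sets of (parent, child) edges.

    Assumption: shared edges approximate the closeness of notion hierarchies.
    """
    union = hierarchy_a | hierarchy_b
    if not union:
        return 0.0
    return len(hierarchy_a & hierarchy_b) / len(union)
```

For example, `to_tnf("The texts are clustered in hierarchies of notions")` yields `['text', 'cluster', 'hierarchy', 'notion']`, and two hierarchies that share one of three distinct edges have an overlap of 1/3.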