New approach to text analysis
Let us outline the principles of the new approach in more detail. In the new hybrid method, the text is treated as a sequence of symbols organized into words and sentences. A window of variable length (showing from two to twenty symbols at a time) slides along this sequence, shifting by one symbol at each step. Snapshots of the text fragments visible through the window are recorded in dynamically added neurons. The resulting hierarchical neural network contains several layers: fragments that occur in the text more than once are stored in neurons belonging to the higher levels of the network. This neural network realizes frequency-based multi-level dictionaries of different text elements (letters, syllables, stems, morphemes, words, and phrases). Words are selected as the basic operational elements, while the other elements serve as auxiliary information during the analysis.
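The sliding-window fragment counting described above can be sketched in a few lines of Python. This is only an illustration of the windowing idea, not the actual neural implementation; the function name and the toy input are ours:

```python
from collections import Counter

def fragment_frequencies(text, min_len=2, max_len=20):
    """Slide windows of every length in [min_len, max_len] along the
    text one symbol at a time and count each fragment seen."""
    counts = Counter()
    for n in range(min_len, min(max_len, len(text)) + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

freqs = fragment_frequencies("banana", max_len=4)
# Fragments occurring more than once would be promoted to the higher
# network levels; here we simply list them.
repeated = {f: c for f, c in freqs.items() if c > 1}
# repeated == {"an": 2, "na": 2, "ana": 2}
```

In the real system the counts are stored in dynamically added neurons rather than in a dictionary, but the frequency information they carry is the same.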
Ideally, one wishes to get rid of all supplementary and commonplace words, which carry no semantic meaning. One would also like to identify the stems of words, separating off prefixes, suffixes, and endings (morphemes). This step is called preprocessing. All further work can then be carried out with stems only, improving the quality of the analysis. For example, the words "mean" and "meaningful" will be identified by the system as having the same stem.
In fact, obtaining an efficient preprocessing mechanism requires fine-tuning the system to a specific language so that supplementary words and morphemes native to that language are filtered out effectively. The same hierarchical neural network can be used to build a filter for these unwanted elements. When a large corpus of texts on diverse subjects is processed, supplementary words and morphemes are the fragments that appear most frequently in the text. By working with various fragments of words, the hierarchical neural network automatically catches both supplementary words and morphemes at the same time. Note that this preprocessing is the only place where language dependency enters the new analysis technique and where some guidance from a human analyst is desirable. All other components of this technology are language independent and work equally well with texts in any alphabet-based language. Applying a threshold to the neural network developed on such a corpus yields a filter that can later be used to separate out the stems of semantically important words for further analysis. While the analysis is performed with individual stems, the network still holds the information about complete words.
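The threshold-based filter can be sketched as follows. The toy corpus, the threshold value, and the function names are illustrative assumptions; the real system applies the threshold to fragments held in the neural network rather than to whole tokens:

```python
from collections import Counter

def build_filter(corpus_tokens, threshold):
    """Fragments that appear in a diverse corpus more often than
    `threshold` are treated as supplementary (stop elements)."""
    counts = Counter(corpus_tokens)
    return {tok for tok, c in counts.items() if c > threshold}

def keep_significant(tokens, stop_set):
    """Retain only the semantically important elements."""
    return [t for t in tokens if t not in stop_set]

corpus = ["the", "the", "the", "of", "of", "network", "semantic", "the"]
stops = build_filter(corpus, threshold=1)
# "the" (4 occurrences) and "of" (2) exceed the threshold
significant = keep_significant(["the", "semantic", "network", "of", "text"], stops)
# significant == ["semantic", "network", "text"]
```

Because the filter is derived purely from corpus frequencies, retraining it on a corpus in another alphabet-based language requires no code changes.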
Let us assume that the meaningless elements have been filtered out and the significant information processed. The nodes of the developed neural network now hold all important words and word combinations from the text, together with the frequencies of their occurrence. Simultaneously, the same network assesses the frequencies of joint occurrence of different semantic elements within certain structural text units, for example sentences. One obtains a graph-like structure with the statistical weights of words in the nodes and the statistical weights of joint occurrences of these words in the links.
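Such a graph-like structure can be sketched directly from sentence-segmented input. This is a minimal stand-in for the network's statistics, with raw counts playing the role of statistical weights; the data structures are our own choice:

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence_graph(sentences):
    """Nodes hold word frequencies; links hold joint-occurrence
    counts of word pairs within the same sentence."""
    nodes, links = Counter(), Counter()
    for sent in sentences:
        words = sorted(set(sent))      # each word counted once per sentence
        nodes.update(words)
        links.update(combinations(words, 2))
    return nodes, links

sentences = [["neural", "network", "text"],
             ["semantic", "network", "text"]]
nodes, links = build_cooccurrence_graph(sentences)
# nodes["network"] == 2; links[("network", "text")] == 2
```

Sorting the words before forming pairs ensures each unordered pair is stored under a single canonical key.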
This graph does not yet provide an accurate semantic picture of the analyzed text. One still needs to adjust the individual statistical weights of the words and of the relations between them to obtain a consistent text representation. The weights of words that are strongly related to other frequent words in the text should be boosted, and vice versa. This is accomplished by assigning the statistical weights of individual words to the nodes of a one-dimensional Hopfield-like neural network in which all neurons are fully interconnected, while the statistical weights of the relations between words are assigned to the links between the nodes. When released, this Hopfield-like network evolves, changing the weights assigned to the nodes and links, toward a stable configuration corresponding to the minimum of an energy-like function characterizing the network. The renormalized weights of words and of the relations between them are called semantic weights, and the resulting reshaped graph-like structure is called a semantic network: a list of the most important words and word combinations from the text together with the relations between them. Since the analysis of the text is performed without recourse to any background knowledge of the subject of interest, the meaning of a word in the created semantic network is defined purely by the other words related to it in the network. Correspondingly, the words and word combinations comprising a semantic network have a special name: semantic concepts.
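The boosting effect of the relaxation can be illustrated with a much simpler iterative scheme than the actual Hopfield-like dynamics, whose energy function is not specified here. In this sketch (entirely our own stand-in) each word's weight is repeatedly mixed with the link-weighted sum of its neighbours' weights, so words tied to other strong words are boosted while isolated words fade:

```python
def renormalize(weights, links, alpha=0.5, iters=50):
    """Illustrative relaxation toward 'semantic weights'; NOT the
    actual energy-minimizing dynamics of the described system."""
    # adjacency: word -> [(neighbour, link weight)]
    adj = {w: [] for w in weights}
    for (a, b), lw in links.items():
        adj[a].append((b, lw))
        adj[b].append((a, lw))
    w = dict(weights)
    for _ in range(iters):
        new = {word: (1 - alpha) * weights[word]
                     + alpha * sum(lw * w[n] for n, lw in adj[word])
               for word in w}
        norm = max(new.values()) or 1.0   # keep weights bounded
        w = {word: v / norm for word, v in new.items()}
    return w

weights = {"alpha": 1.0, "beta": 1.0, "gamma": 1.0}
links = {("alpha", "beta"): 1.0}     # gamma is unrelated to the others
semantic = renormalize(weights, links)
# semantic["alpha"] > semantic["gamma"]: linked words are boosted
```

The normalization step plays the role of the stable configuration: iteration stops changing the relative weights once the mixture of own weight and neighbour support settles.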
The semantic network represents a linguistically accurate and concise picture of the analyzed text. This construction can serve as the foundation of many further analysis techniques implementing the text processing functionality users need.
TextAnalyst: natural language text analysis software
The new text mining system, TextAnalyst, implements a variety of important analysis functions based on an automatically created semantic network of the investigated text. The system is built on the results of twenty years of research and development of a new paradigm by a team of mathematical linguists. The key advantage of TextAnalyst over other text analysis and information retrieval systems is that it can distill the semantic network of a text completely autonomously, without the prior development of a subject-specific dictionary by a human expert. The user does not have to provide TextAnalyst with any background knowledge of the subject: the system acquires this knowledge automatically.
TextAnalyst empowers the user with the following functionality:
In TextAnalyst, the concepts stored in the semantic network are hyperlinked to the sentences in which they were encountered, and the sentences are in turn hyperlinked to the places in the original text from which they were retrieved. The automatically created semantic network thus provides efficient navigation through the texts stored in the textbase. Keeping in mind that thousands of texts can be processed simultaneously, this semantic navigation can turn out to be a very handy capability.
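The concept-to-sentence hyperlinking amounts to an inverted index. A minimal sketch, with our own tokenization and naming, could look like this:

```python
def build_concept_index(sentences):
    """Map each word to the indices of the sentences containing it,
    mimicking the concept -> sentence -> source-text hyperlinks."""
    index = {}
    for i, sent in enumerate(sentences):
        for word in set(sent.lower().strip(".").split()):
            index.setdefault(word, []).append(i)
    return index

sentences = ["The semantic network stores concepts.",
             "Concepts link back to sentences."]
index = build_concept_index(sentences)
# index["concepts"] == [0, 1]: clicking the concept reaches both sentences
```

A second index from sentences to character offsets in the source documents would complete the navigation chain down to the original text.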
The system can identify the most important concepts from the semantic network and transform the network into a tree-like list of nested topics of descending importance by breaking links representing weak relations and substituting certain indirect relations with direct ones. This transformation reveals the hierarchy of themes in the investigated text.
This function goes a step further and eliminates those links in the topic structure whose strength falls below a certain threshold value. In this way the joint topic structure of a collection of texts breaks into islands representing largely independent themes, which helps one understand the clusters of information in the investigated textbase. Individual documents can then be assigned to different thematic groups, facilitating clustering of the documents in a textbase. Of course, large documents occasionally have several parts corresponding to different thematic clusters. Such documents can be treated as multi-topic, or they can be split into separate parts.
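Breaking the structure into islands is, in graph terms, taking the connected components that remain after weak links are dropped. A sketch under that reading, with an invented toy network:

```python
def thematic_islands(words, links, threshold):
    """Drop links weaker than `threshold`, then return the connected
    components -- the largely independent themes -- of what remains."""
    adj = {w: set() for w in words}
    for (a, b), strength in links.items():
        if strength >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    seen, islands = set(), []
    for w in words:                    # depth-first search per component
        if w in seen:
            continue
        stack, comp = [w], set()
        while stack:
            cur = stack.pop()
            if cur in comp:
                continue
            comp.add(cur)
            stack.extend(adj[cur] - comp)
        seen |= comp
        islands.append(comp)
    return islands

words = ["neuron", "network", "stock", "market"]
links = {("neuron", "network"): 0.9, ("stock", "market"): 0.8,
         ("network", "stock"): 0.1}   # weak cross-theme link
islands = thematic_islands(words, links, threshold=0.5)
# two islands: {"neuron", "network"} and {"stock", "market"}
```

A document touching concepts in more than one island would then be flagged as multi-topic.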
The semantic network can also be used to score individual sentences in the investigated text. The more important semantic concepts a sentence contains, and the more strongly these concepts are related to each other, the higher the semantic weight of the sentence. The system then collects only those sentences whose semantic weight exceeds a certain adjustable threshold value (Figure 2), resulting in a summary of the investigated text. The size of the summary is controlled by changing the sentence selection threshold. The advanced algorithm used for developing an accurate semantic network ensures the high quality and relevance of the created summary.
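The sentence-scoring rule can be sketched as a sum over concept weights plus a sum over the relations whose both endpoints occur in the sentence. The exact scoring formula is not given in the text, so this is one plausible reading with invented example weights:

```python
def summarize(sentences, concept_weights, relation_weights, threshold):
    """Score each sentence by the semantic weights of its concepts plus
    the strengths of relations between concept pairs it contains, then
    keep the sentences scoring above the threshold."""
    summary = []
    for sent in sentences:
        words = set(sent.lower().strip(".").split())
        score = sum(concept_weights.get(w, 0.0) for w in words)
        score += sum(rw for (a, b), rw in relation_weights.items()
                     if a in words and b in words)
        if score > threshold:
            summary.append(sent)
    return summary

concept_weights = {"semantic": 0.9, "network": 0.8, "weather": 0.1}
relation_weights = {("semantic", "network"): 0.7}
sentences = ["The semantic network summarizes text.",
             "The weather is nice."]
summary = summarize(sentences, concept_weights, relation_weights, 1.0)
# only the first sentence (score 2.4) clears the threshold
```

Raising or lowering the threshold argument directly shrinks or grows the summary, mirroring the adjustable selection threshold described above.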
Natural language information retrieval
The system determines whether an issued natural language query contains words present in the developed semantic network of the investigated text; the sentences containing the identified words are then retrieved. Thus one does not have to come up with a predetermined list of key words for a search: the system automatically extracts from a natural language query the best words to use (Figure 3). More importantly, the system displays a subtree of concepts related to the theme of the query in the context of the analyzed text. These concepts are taken from the immediate neighborhood, in the semantic network of the text, of the words distilled from the query. This feature allows the user to view the immediate semantic context of the searched theme in the textbase and to dive into related subjects to refine the search.
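The retrieval step can be sketched as follows: keep only the query words that exist in the network, fetch the sentences containing them, and collect their immediate neighbours as the semantic context. All names and the toy network are illustrative:

```python
def retrieve(query, sentences, concept_weights, links):
    """Filter the query against the semantic network, retrieve matching
    sentences, and gather the neighbouring concepts of the query words."""
    q = {w for w in query.lower().split() if w in concept_weights}
    hits = [s for s in sentences
            if q & set(s.lower().strip(".").split())]
    context = set()
    for (a, b) in links:               # immediate network neighbourhood
        if a in q:
            context.add(b)
        if b in q:
            context.add(a)
    return hits, context - q

concept_weights = {"semantic": 0.9, "network": 0.8, "analysis": 0.6}
links = {("semantic", "network"): 0.7, ("network", "analysis"): 0.5}
sentences = ["The semantic network is built automatically.",
             "Weather report."]
hits, context = retrieve("what is a semantic network",
                         sentences, concept_weights, links)
# hits contains the first sentence; context == {"analysis"}
```

Filler words such as "what" and "is" drop out automatically because they are absent from the network, which is exactly why no predetermined keyword list is needed.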
In addition to these important functions, the described technology of automated creation of an accurate semantic network can provide the user with many other crucial text analysis capabilities. Currently the TextAnalyst development team is working on implementing automated classification of documents; measuring the similarity of individual texts is another feature under consideration.