Text corpus analysis tool

This tool lets you analyse the words, terms and collocations that appear in a set of text documents. It displays lists of the words found in the texts, ranked by frequency of occurrence or by generality, together with statistics for each word or collocation.

Lists of words are displayed for:

  1. the sum of all texts, ordered by frequency

  2. the sum of all texts, ordered by generality (the number of different texts in which a word occurs)

  3. the difference between one text and the sum of the others

  4. the commonality between one text and the sum of the others
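
As an illustration of the first two rankings, here is a minimal Python sketch. It is not the tool's own implementation: the function names, the simple lowercase tokenizer and the sample texts are all assumptions made for the example. Frequency counts every occurrence of a word, while generality counts each word at most once per text.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_and_generality(texts):
    """Return (frequency, generality) counters over a list of texts."""
    frequency = Counter()   # total occurrences across all texts
    generality = Counter()  # number of different texts containing the word
    for text in texts:
        tokens = tokenize(text)
        frequency.update(tokens)
        generality.update(set(tokens))  # count each word once per text
    return frequency, generality


# Hypothetical sample corpus, for illustration only.
texts = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang",
]
frequency, generality = frequency_and_generality(texts)

# Rank by descending count, breaking ties alphabetically.
by_frequency = sorted(frequency.items(), key=lambda kv: (-kv[1], kv[0]))
by_generality = sorted(generality.items(), key=lambda kv: (-kv[1], kv[0]))
```

In this sample, "the" occurs four times but in only two of the three texts, so its position can differ between the two rankings.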

The user can specify particular words or morphemes of interest, and in such cases two-word collocations can be analysed. This is useful for discovering multi-word terms and testing their grammatical category.
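
Two-word collocation counting for a word of interest could be sketched roughly as follows. This is an assumption-laden illustration, not the tool's actual method: `collocations` and `tokenize` are hypothetical names, matching is on whole words only (matching morphemes would need substring or stem matching), and pairs are taken from adjacent tokens without regard to sentence boundaries.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def collocations(texts, targets):
    """Count adjacent word pairs in which either word is a word of interest."""
    targets = {t.lower() for t in targets}
    pairs = Counter()
    for text in texts:
        tokens = tokenize(text)
        # zip pairs each token with its right-hand neighbour
        for left, right in zip(tokens, tokens[1:]):
            if left in targets or right in targets:
                pairs[(left, right)] += 1
    return pairs


# Hypothetical sample texts, for illustration only.
texts = [
    "The machine translation system uses a translation memory.",
    "A translation memory stores earlier sentences.",
]
pairs = collocations(texts, ["translation"])
```

A pair that recurs across texts, such as ("translation", "memory") here, is a candidate multi-word term.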

All of the lists are useful for classifying texts without reading and understanding them, for finding spelling errors, or for examining the vocabulary of a particular author or subject field. The commonality and differencing functions are useful for finding the words in a new text that have already been seen in previous texts (say by a translator), and for finding those words that are new (and so may need research).
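
The commonality and differencing functions can be pictured as set operations on vocabularies. The sketch below is one possible reading, not the tool's confirmed behaviour: it assumes that "difference" means words found only in the new text and "commonality" means words shared with at least one earlier text. The names `vocabulary` and `compare` are hypothetical.

```python
import re


def vocabulary(text):
    """Return the set of distinct lowercase words in a text (a simplifying assumption)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def compare(new_text, previous_texts):
    """Split the new text's vocabulary into (already seen, new) word sets."""
    seen = set()
    for text in previous_texts:
        seen |= vocabulary(text)
    new_vocab = vocabulary(new_text)
    common = new_vocab & seen      # commonality: words met before
    difference = new_vocab - seen  # difference: words that may need research
    return common, difference


# Hypothetical sample texts, for illustration only.
previous = ["the pump drives the rotor", "the rotor turns the shaft"]
common, new_words = compare("the impeller turns the shaft", previous)
```

For a translator, the second set (here only "impeller") is the list of terms not yet encountered in earlier work.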

The generality ranking is useful for separating general terms from specialist terms regardless of frequency.

Here are some screenshots of the text corpus analysis tool.

This tool has been implemented and is available to members of Club Cycom.