
CopyCatch Analyst is a set of programs which together offer detailed numerical, statistical, phrasal and vocabulary analysis of texts.

The suite is used in our consultancy work and is of interest to the public and private legal sectors, and in particular to Forensic Linguistics specialists.

The programs have been used for authorship identification, witness and suspect statement analysis and stylometry.

History
Under the name of the Vocalyse Toolkit, CopyCatch Analyst has been in use within CFL's consultancy operation for the last 10 years, and has been assembled into a single package since 2003. The suite of programs allows the rapid analysis of suspect documents from 250 words upwards, producing statistical, vocabulary and phrasal information about the texts. It is specifically designed for use with the shorter texts associated with Forensic Linguistics but has also been used on much longer documents, particularly in the area of historical investigation of anonymous texts. The programs are not designed to undertake statistical authorship attribution in themselves, but they do supply the data which can be used in such systems.

Analyser
The initial screen allows the choice of any number of files, which can be different document types and from multiple directories. Clicking the Read button gets the program to read each of the files and produce a set of statistics relating to file composition. Each file also has associated word lists and phrasal similarity information.

Function words
The program comes with a list of English Function words built in, although it is simple to change to another list or indeed another language. The lists are plain text, so are easily maintained. The built-in list contains around 450 words, selected in terms of their functional role, not simply on the basis of frequency. All the programs in the suite make use of this list in some way.

Core Vocabulary
This list is also built-in and at present there is no way to use another list, although that will shortly be available as well. The words on this list are the most common content words which have appeared in writing for children over the last century. That is to say, they are not necessarily the most frequent words in general corpora or in writing for adults. They are included because empirical testing has shown that they appear with different total frequencies in a wide range of writing, and that the different usage is frequently associated with authorship. In this way they act as another objective discriminator.

Token/Type ratio
This is the standard division of all the words in a text (Tokens) by the vocabulary used (Types), giving an average word usage value. Vocalyse shows the overall TTR and the TTRs for Content words and Function words.
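As an illustration only, the calculation can be sketched in Python. The tokenizer, the tiny function word set and the names `tokenize`, `token_type_ratio` and `ttr_report` are assumptions made for this example; they are not taken from the program itself.

```python
import re

# Hypothetical stand-in for the built-in function word list (the real list has around 450 words).
FUNCTION_WORDS = {"the", "a", "of", "and", "to", "in", "on", "it", "was"}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens (an assumed tokenization)."""
    return re.findall(r"[a-z']+", text.lower())

def token_type_ratio(tokens):
    """Tokens divided by types: the average number of uses per vocabulary item."""
    if not tokens:
        return 0.0
    return len(tokens) / len(set(tokens))

def ttr_report(text, function_words=FUNCTION_WORDS):
    """Return the overall ratio plus separate ratios for Content and Function words."""
    tokens = tokenize(text)
    function = [t for t in tokens if t in function_words]
    content = [t for t in tokens if t not in function_words]
    return {
        "overall": token_type_ratio(tokens),
        "content": token_type_ratio(content),
        "function": token_type_ratio(function),
    }

report = ttr_report("The cat sat on the mat and the dog sat on the rug")
```

For this sample sentence there are 13 tokens and 8 types, so the overall value is 13 / 8 = 1.625.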

Proportions
You can see how different texts have different proportions of Content to Function words both at the full text (Token) level and at the vocabulary (Type) level.
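A minimal sketch of one way such proportions could be computed, expressing Content words as a percentage of all words at each level. The exact definition used by the program is not stated here, so the percentage form, the small function word set and the name `content_function_proportions` are all assumptions.

```python
import re

# Hypothetical stand-in for the built-in function word list.
FUNCTION_WORDS = {"the", "a", "of", "and", "to", "in", "on", "it", "was"}

def content_function_proportions(text, function_words=FUNCTION_WORDS):
    """Content words as a percentage of all words, at Token level and at Type level."""
    tokens = re.findall(r"[a-z']+", text.lower())
    types = set(tokens)
    if not tokens:
        return {"content_token_pct": 0.0, "content_type_pct": 0.0}
    content_tokens = sum(1 for t in tokens if t not in function_words)
    content_types = sum(1 for t in types if t not in function_words)
    return {
        "content_token_pct": 100.0 * content_tokens / len(tokens),
        "content_type_pct": 100.0 * content_types / len(types),
    }

proportions = content_function_proportions(
    "The cat sat on the mat and the dog sat on the rug"
)
```

The Function word share at each level is simply 100 minus the corresponding Content value.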

Richness
Three measures of Richness are calculated. Richness itself uses the Honoré measurement, adjusting hapax legomena for text length. Content Richness and Content Only are designed for use with short forensic texts by discounting the effect of the function words.
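For reference, Honoré's measure is usually given as R = 100 × ln(N) / (1 − V1/V), where N is the number of tokens, V the number of types and V1 the number of types used only once. The sketch below implements that published formula; the function name `honore_r` is chosen for this example, and the Content Richness and Content Only variants are not shown because their exact definitions are not given here.

```python
import math
from collections import Counter

def honore_r(tokens):
    """Honoré's R = 100 * ln(N) / (1 - V1 / V).

    Returns None when the measure is undefined: an empty text,
    or a text in which every type is used exactly once (V1 == V).
    """
    counts = Counter(tokens)
    n = len(tokens)                                   # N: total tokens
    v = len(counts)                                   # V: vocabulary size (types)
    v1 = sum(1 for c in counts.values() if c == 1)    # V1: hapax legomena
    if n == 0 or v1 == v:
        return None
    return 100.0 * math.log(n) / (1.0 - v1 / v)

sample = "the cat sat on the mat and the dog sat on the rug".split()
richness = honore_r(sample)
```

The undefined case matters for very short texts, where every word may occur only once.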

Twice ratio
This is the Sichel measurement: the number of vocabulary types used exactly twice, as a percentage of the full vocabulary.
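A minimal sketch of this ratio, assuming the text has already been split into lowercase tokens. The function name `twice_ratio` is chosen for this example.

```python
from collections import Counter

def twice_ratio(tokens):
    """Types used exactly twice, as a percentage of all types."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    used_twice = sum(1 for c in counts.values() if c == 2)
    return 100.0 * used_twice / len(counts)

sample = "the cat sat on the mat and the dog sat on the rug".split()
ratio = twice_ratio(sample)
```

In the sample, "sat" and "on" are each used twice out of 8 types, giving 25%.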

Core percentage
This is the percentage of all the content words represented by the total occurrences of the words on the Core vocabulary list.
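The definition above can be sketched as follows. Both word lists are tiny hypothetical stand-ins for the built-in lists, and treating every word that is not on the Function list as a content word is an assumption of this example.

```python
import re

# Hypothetical stand-ins for the built-in Function and Core vocabulary lists.
FUNCTION_WORDS = {"the", "a", "of", "and", "to", "in", "on"}
CORE_WORDS = {"cat", "dog", "house", "tree"}

def core_percentage(text, function_words=FUNCTION_WORDS, core_words=CORE_WORDS):
    """Occurrences of Core list words as a percentage of all content word tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    content = [t for t in tokens if t not in function_words]
    if not content:
        return 0.0
    core_hits = sum(1 for t in content if t in core_words)
    return 100.0 * core_hits / len(content)

pct = core_percentage("The cat sat on the mat and the dog sat on the rug")
```

Here 2 of the 6 content tokens ("cat" and "dog") are on the sample Core list.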

Saving Results
The results for all the files under review can be saved together. The output is a comma delimited text file which can be loaded directly into a spreadsheet or statistical analysis package.
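As a rough illustration of this kind of output, the sketch below writes one row per file using Python's standard `csv` module. The column names and the sample row are invented for the example and do not reflect the program's actual output layout.

```python
import csv
import io

# Hypothetical column names; the program's real output columns are not documented here.
FIELDNAMES = ["file", "tokens", "types", "ttr"]

def write_results(rows, out):
    """Write one comma delimited row per analysed file to an open text stream."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)

# Writing to an in-memory buffer; a file opened with newline="" works the same way.
buffer = io.StringIO()
write_results(
    [{"file": "statement1.txt", "tokens": 13, "types": 8, "ttr": 1.625}],
    buffer,
)
```

A file in this form can be opened directly in a spreadsheet or read by a statistics package.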

Vocabulary listings
These are available for all three vocabulary types: Function, Core and Content.

Phrase listings
Phrases comprising mainly Content words and phrases comprising mainly Function words are shown in separately scrollable windows on the Phrases page. These listings are useful when looking for authorial habits or unexpected repetitions.

Hapax distribution
One of the central principles behind the Richness calculations is that the distribution of hapax items through a text will be reasonably regular. Since many texts even up to 50,000 words have at least 50% of their vocabulary types used only once, this is not an unreasonable assumption. However, it is always prudent to check that the texts under consideration are not disrupted unduly at any point. The hapax distribution program provides either a full listing of the hapax legomena in each window of text, with the window size set by the user, or a sample of such windows, for use in statistical analysis programs.
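A minimal sketch of a windowed hapax listing, assuming consecutive non-overlapping windows and hapax status judged over the whole text. Both choices, and the name `hapax_per_window`, are assumptions made for this example.

```python
from collections import Counter

def hapax_per_window(tokens, window_size):
    """List the hapax legomena falling in each consecutive window of the text."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    counts = Counter(tokens)
    hapax = {word for word, c in counts.items() if c == 1}
    windows = []
    for start in range(0, len(tokens), window_size):
        window = tokens[start:start + window_size]
        windows.append([t for t in window if t in hapax])
    return windows

sample = "the cat sat on the mat and the dog sat on the rug".split()
listing = hapax_per_window(sample, 5)
# Hapax count per window, suitable for checking how evenly they are spread.
counts_per_window = [len(w) for w in listing]
```

A markedly uneven run of counts across windows would suggest the text is disrupted at that point and that the Richness figures should be treated with caution.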

Full text mark up
This page shows you the full text with the different word types and usages shown in colour. Words are marked by type (Content or Function) and by whether they are used once or twice. This allows you to see at a glance the composition and to some extent the relative complexity of a text.

More details from CFL Software Development.
