Powerful plagiarism and collusion detection
Used by the professions, where data security is a prime concern.
Uses 'fuzzy-matching' to detect re-writing as well as direct copying.
Sophisticated searching
A completely new way of searching large numbers of documents
No keywords, no proximity settings, no ANDs or ORs.
Just use a whole document or set of documents for the search entry.
Investigator does the rest by identifying similar sentences in the documents, and presenting the results ordered by the strongest sentence links between documents.
Fast
Scalable
GUI or automatic
Multi-threaded, multi-processor capability
Multi-platform - written in Java.
Purpose
CopyCatch Investigator looks for similarity between sentences in documents without using keywords or any other user-entered search patterns. This makes it different from most search engines and databases in three ways.
- Whole documents or sets of documents are used as the search data instead of keywords, phrases or Boolean operators.
- The program uses the level of similarity required by the user to examine each document against the index selected.
- It looks for similarity, not identity, so it is insensitive to changes in word order, the use of a thesaurus to change some words, and the insertion or deletion of material. It finds identity as well, of course, as the extreme case of similarity.
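The idea of similarity without regard to word order can be illustrated with a minimal sketch. It assumes a simple word-set overlap (the Jaccard measure) between two sentences; the measure Investigator actually uses is not described here, and the function names `content_words` and `sentence_similarity` are hypothetical.

```python
import re


def content_words(sentence, function_words=frozenset()):
    """Lower-case the sentence and return its set of words, minus function words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return {w for w in words if w not in function_words}


def sentence_similarity(a, b, function_words=frozenset()):
    """Jaccard overlap of the two word sets (an assumed measure, not the product's).

    1.0 means the word sets are identical; 0.0 means no words are shared.
    """
    wa = content_words(a, function_words)
    wb = content_words(b, function_words)
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


original = "The committee rejected the proposal after a long debate"
reordered = "After a long debate the committee rejected the proposal"
reworded = "The committee rejected the plan after a lengthy debate"

print(sentence_similarity(original, reordered))  # 1.0: word order is ignored
print(sentence_similarity(original, reworded))   # 0.6: two words were swapped
```

Because the comparison works on sets of words, a reordered sentence scores the same as an exact copy, while a sentence with a few substituted words still scores well above unrelated text.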
Interface
The interface has been designed in consultation with users, and is extremely simple to operate, with only two main screens.
- The Searching tab allows you to choose the files and indexes, set parameters and review the results.
- The Indexing tab allows you to create indexes, switch languages and use other word lists.
- The Content Words tab shows how many words are shared and how many are only in one file or another.
- The Statistics tab gives a summary of the amount of sharing of words and sentences in the pair.
Presentation
Sentences identified as similar are shown side by side, in the order of the document being used as a query. The similar sentences are cross-referenced to the position in the current indexed file which has been found to share material. You have the option of seeing both files side by side fully marked up, so that related sentences can be seen in the context of the different or less similar sentences. In the screenshot above, you can see that sentence 17 on the right is a modified cut and paste of 46 on the left (or vice versa), whereas 19 is almost certainly a contraction of 48. You can also see that neither example has long successive runs of words in common. The program does not take account of word order, either, so substantial re-writing can be identified.
Indexing
Investigator is built with the recognition that different users have different requirements:
- Forensic Analysts and Plagiarism Investigators might need to index every word and get reports at a low similarity level.
- Lawyers might have very long and complex documents, and only need to index longer sentences, with some common terminology ignored.
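The two indexing profiles above can be sketched as parameters to a single index builder. This is an illustration only: the sentence splitter is deliberately naive, and `build_index`, `min_sentence_words` and `ignored_words` are hypothetical names, not Investigator settings.

```python
import re
from collections import defaultdict


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_index(documents, min_sentence_words=1, ignored_words=frozenset()):
    """Map each word to the (document name, sentence number) pairs containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for number, sentence in enumerate(split_sentences(text), start=1):
            words = re.findall(r"[a-z']+", sentence.lower())
            if len(words) < min_sentence_words:
                continue  # skip sentences shorter than the chosen minimum
            for word in words:
                if word not in ignored_words:
                    index[word].add((name, number))
    return index


docs = {
    "contract.txt": "The party of the first part shall indemnify "
                    "the supplier against all claims. Agreed."
}

# Forensic-style profile: every word of every sentence is indexed.
forensic = build_index(docs)

# Legal-style profile: longer sentences only, common terminology ignored.
legal = build_index(
    docs,
    min_sentence_words=5,
    ignored_words=frozenset({"the", "of", "party", "part"}),
)

print(sorted(forensic))
print(sorted(legal))
```

The second index is smaller because the one-word sentence is skipped and the ignored terms never become index entries.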
Reporting
Levels of reporting are also chosen by the user.
- The minimum number of sentences in common can be selected, when limited use of source material is known or expected.
- The minimum number of words which must match in a sentence can also be set. If you only want longer sentences, set this high; if you want all sentences, set it low.
- The level of sentence similarity can also be set. Do you want at least 50% matching or do you need 30%? This depends on what you are looking for and what you know or find out about the way the indexed material is being used, so you just move the slider to the required level.
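The three reporting settings above can be read as successive filters on the sentence links found between a pair of documents. The sketch below assumes that reading; `SentenceLink`, `filter_links` and the sample values are hypothetical and are not taken from the program or its output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentenceLink:
    query_sentence: int    # sentence number in the query document
    indexed_sentence: int  # sentence number in the indexed document
    shared_words: int      # number of words the two sentences share
    similarity: float      # 0.0 (nothing shared) to 1.0 (identical word sets)


def filter_links(links, min_sentences=1, min_shared_words=3, min_similarity=0.5):
    """Apply the three thresholds; return the surviving links, or [] if too few."""
    kept = [
        link for link in links
        if link.shared_words >= min_shared_words
        and link.similarity >= min_similarity
    ]
    # Report the document pair only if enough sentences survive the filters.
    return kept if len(kept) >= min_sentences else []


# Made-up links for one document pair, for illustration only.
links = [
    SentenceLink(3, 12, shared_words=9, similarity=0.82),
    SentenceLink(4, 13, shared_words=5, similarity=0.41),
    SentenceLink(9, 2, shared_words=2, similarity=0.67),
]

strict = filter_links(links, min_sentences=1, min_shared_words=3, min_similarity=0.5)
loose = filter_links(links, min_sentences=2, min_shared_words=2, min_similarity=0.3)

print(len(strict))  # 1
print(len(loose))   # 3
```

Moving the similarity slider corresponds to changing `min_similarity` in this sketch: at 50% only the strongest link is reported, while at 30% with a lower word minimum all three appear.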
Search
Both web search engines and database search engines are very fast at delivering answers once you have formulated the questions. What users and the suppliers of the search software tend to overlook is that the total search time involves:
- Constructing an appropriate query
- Searching
- Considering the answers returned
Multilingual
The program uses lists of function words to assist the discrimination process, so if you have such a list in a plain text file then you can switch languages with a couple of mouse clicks. We have a number of such lists available on request. Note: It cannot find similarities between documents written in two different languages.
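As a rough illustration of switching languages through a word list, the sketch below loads function words from a plain text file and removes them from a sentence. The one-word-per-line layout and UTF-8 encoding are assumptions, since the file format Investigator expects is not specified here, and both function names are hypothetical.

```python
from pathlib import Path


def load_function_words(path, encoding="utf-8"):
    """Read a plain text word list, assumed here to hold one word per line."""
    text = Path(path).read_text(encoding=encoding)
    return frozenset(
        line.strip().lower() for line in text.splitlines() if line.strip()
    )


def strip_function_words(words, function_words):
    """Keep only the words that are not in the active function word list."""
    return [w for w in words if w.lower() not in function_words]
```

Switching language then amounts to loading a different file, for example a French list instead of an English one, before the documents are compared.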