peterlane/uhferret-gem: Copy-detection tool to analyse large sets of documents to find pairs of documents with substantial amounts of lexical copying.

https://rubygems.org/gems/uhferret

= UHFerret

homepage:: https://peterlane.codeberg.page/ferret/
source:: https://codeberg.org/peterlane/uhferret-gem/tags

== Description

UHFerret is a copy-detection tool, supporting the analysis of large sets of
documents to find pairs of documents with substantial amounts of lexical
copying. Documents containing either natural language (e.g. English) or
computer programs (in C-family languages) may be processed.

This library provides a Ruby wrapper around uhferret suitable for 
scripting, a command-line executable, 'uhferret', and a simple 
server version, 'uhferret-server'.

NB: to install uhferret, Ruby must be able to compile and build C extensions.

== Use 

=== Command Line

    Usage: uhferret [options] file1 file2 ...
        -h, --help                       help message
        -c, --code                       process documents as code
        -t, --text                       process documents as text (default)
        -d, --data-table                 output similarity table (default)
        -l, --list-trigrams              output trigram list
        -a, --all-comparisons            output list of all comparisons
        -x, --xml-report FILE            generate xml report from two documents
        -f, --definition-file FILE       read document names from file

To compute the similarities of a set of files, use:

   $ uhferret file1.txt file2.txt ...

An XML report can be generated for a pair of files using:

   $ uhferret -x outfile.xml file1.txt file2.txt

The XML report can be displayed in a browser using the style sheet
'uhferret.xsl' in the examples folder, and then printed from the browser.
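
When comparing a large set of documents, it can be convenient to assemble the
command line from a script. The following is a minimal sketch, not part of
the gem: the helper name 'uhferret_command', its keyword arguments and the
'essays' folder are hypothetical, and only the flags are taken from the usage
listing above.

```ruby
# Build the argument list for the 'uhferret' executable.
# Hypothetical helper: only the flags come from the usage listing.
def uhferret_command(files, code: false, xml_report: nil)
  # The usage text says the XML report is generated from two documents.
  if xml_report && files.size != 2
    raise ArgumentError, 'an XML report needs exactly two documents'
  end

  args = ['uhferret']
  args << '--code' if code                           # process documents as code
  args.push('--xml-report', xml_report) if xml_report
  args + files
end

# Example: compare every .txt file in a (hypothetical) 'essays' folder.
files = Dir.glob('essays/*.txt').sort
command = uhferret_command(files)
puts command.join(' ')

# To actually run the comparison (requires the gem to be installed):
#   system(*command)
```

Passing the arguments to 'system' as a list, rather than as one string,
avoids shell quoting problems with file names that contain spaces.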

=== Program

Ferret can also be used as a library, and called from within a program. 
For example:

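  ferret = Ferret.new              # create a new Ferret instance
  ferret.add 'filename1.txt'       # add the two documents to compare
  ferret.add 'filename2.txt'
  ferret.run                       # run the comparison
  ferret.output_similarity_table   # output the similarity table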

This creates a new instance of Ferret, adds two documents, runs the
comparison, and then outputs the similarity between the two.
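
To give a feel for what a trigram-based similarity score measures (the
acknowledgements below credit the trigram concept), here is a small
self-contained sketch. It is an illustration only and not UHFerret's actual
implementation: the tokenizer, the use of word trigrams, the Jaccard-style
ratio and all method names are assumptions made for this example.

```ruby
require 'set'

# Split text into lowercase word tokens (an assumed, simplified tokenizer).
def tokens(text)
  text.downcase.scan(/[a-z0-9']+/)
end

# Return the set of word trigrams (three consecutive tokens) in the text.
def trigrams(text)
  tokens(text).each_cons(3).map { |words| words.join(' ') }.to_set
end

# Ratio of shared trigrams to all distinct trigrams in either text.
def trigram_similarity(text_a, text_b)
  a = trigrams(text_a)
  b = trigrams(text_b)
  union = a | b
  return 0.0 if union.empty?

  (a & b).size.to_f / union.size
end

original = 'the quick brown fox jumps over the lazy dog'
copied   = 'a quick brown fox jumps over the sleeping cat'
puts trigram_similarity(original, copied)
```

In this example each sentence yields 7 trigrams, 4 of which are shared, so
the score is 4 out of 10 distinct trigrams, i.e. 0.4. Texts with no trigram
in common score 0.0 and identical texts score 1.0.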

=== Server

    Usage: uhferret-server [options]
        -h, --help                       help message
        -p, --port n                     port number
        -f, --folder FOLDER              base folder

The folder used to store the processed files defaults to 'FerretFiles',
and the port defaults to 2000. With the default port, the initial address
is http://localhost:2000/ferret/home

NB: The server uses some *nix commands, and so currently does not work
under Windows.

== Acknowledgements

UHFerret has been developed at the University of Hertfordshire by members of
the Plagiarism Detection Group.  The original concept of using trigrams for
measuring copying was developed by Caroline Lyon and James Malcolm.  JunPeng
Bao, Ruth Barrett and Bob Dickerson also contributed to the development of
earlier versions of Ferret.