- Libtextcat is a library with functions that implement the
- classification technique described in Cavnar & Trenkle, "N-Gram-Based
- Text Categorization". It was primarily developed for language
- guessing, a task on which it is known to perform with near-perfect
- accuracy.
- The central idea of the Cavnar & Trenkle technique is to calculate a
- "fingerprint" of a document with an unknown category, and compare this
- with the fingerprints of a number of documents of which the categories
- are known. The categories of the closest matches are output as the
- classification. A fingerprint is a list of the most frequent n-grams
- occurring in a document, ordered by frequency. Fingerprints are
- compared with a simple out-of-place metric. See the article for more
- details.
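
The fingerprinting and out-of-place comparison described above can be sketched in a few lines of Python. This is only an illustration of the technique, not libtextcat's actual code or API: the function names, the `_` word-boundary padding, the n-gram lengths (1 to 5), the fingerprint size (300) and the penalty for missing n-grams are all assumptions made for the example, and it returns only the single closest category.

```python
from collections import Counter


def fingerprint(text, max_n=5, top=300):
    """Return the `top` most frequent n-grams (n = 1..max_n), most frequent first.

    max_n, top and the "_" padding are illustrative assumptions.
    """
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending frequency; ties are broken alphabetically
    # so that the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_fp, cat_fp):
    """Sum, over the document's n-grams, of how far each one is from its
    rank in the category fingerprint. N-grams missing from the category
    fingerprint cost a fixed maximum penalty (assumed: its length)."""
    rank = {gram: i for i, gram in enumerate(cat_fp)}
    max_penalty = len(cat_fp)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_fp)
    )


def classify(text, category_fps):
    """Return the category whose fingerprint is closest to that of `text`."""
    doc_fp = fingerprint(text)
    return min(category_fps, key=lambda c: out_of_place(doc_fp, category_fps[c]))


# Hypothetical usage: build one fingerprint per known category, then classify.
profiles = {
    "english": fingerprint("the quick brown fox jumps over the lazy dog"),
    "dutch": fingerprint("de snelle bruine vos springt over de luie hond"),
}
guess = classify("the quick brown fox jumps over the lazy dog", profiles)
```

In practice the category fingerprints would be built from much larger training texts; the one-sentence samples here only keep the sketch short.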
- Considerable effort went into making this implementation fast and
- efficient. The language guesser processes over 100 documents/second on
- a simple PC, which makes it practical for many uses. It was developed
- for use in our webcrawler and search engine software, in which it
- handles millions of documents a day.