Libtextcat is a library with functions that implement the
classification technique described in Cavnar & Trenkle, "N-Gram-Based
Text Categorization". It was primarily developed for language
guessing, a task on which it is known to perform with near-perfect
accuracy.

The central idea of the Cavnar & Trenkle technique is to calculate a
"fingerprint" of a document with an unknown category, and compare this
with the fingerprints of a number of documents of which the categories
are known. The categories of the closest matches are output as the
classification. A fingerprint is a list of the most frequent n-grams
occurring in a document, ordered by frequency. Fingerprints are
compared with a simple out-of-place metric. See the article for more
details.

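The idea above can be illustrated with a minimal Python sketch. This is
not libtextcat's C code or API; the function names (fingerprint,
out_of_place, classify), the n-gram lengths (1 to 5), the "_" word
padding, the cutoff of 300 n-grams and the penalty for missing n-grams
are all assumptions chosen for illustration.

```python
from collections import Counter


def fingerprint(text, max_n=5, top=300):
    """Return the `top` most frequent n-grams, most frequent first.

    Assumed details: lowercase input, whitespace tokenization,
    words padded with "_" on both sides, n-gram lengths 1..max_n.
    """
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Descending frequency; ties broken alphabetically so the
    # result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_fp, cat_fp):
    """Sum, over the document's n-grams, of how far each one's rank
    is from its rank in the category fingerprint. N-grams absent
    from the category get a fixed penalty (assumed: len(cat_fp))."""
    rank = {gram: i for i, gram in enumerate(cat_fp)}
    penalty = len(cat_fp)
    return sum(
        abs(i - rank[gram]) if gram in rank else penalty
        for i, gram in enumerate(doc_fp)
    )


def classify(text, category_fps):
    """Return category names ordered from closest to farthest."""
    doc_fp = fingerprint(text)
    return sorted(
        category_fps,
        key=lambda name: out_of_place(doc_fp, category_fps[name]),
    )


if __name__ == "__main__":
    # Toy training texts; real fingerprints need far more data.
    samples = {
        "english": "the quick brown fox jumps over the lazy dog",
        "german": "der schnelle braune fuchs springt und der hund ruht",
    }
    fps = {name: fingerprint(text) for name, text in samples.items()}
    print(classify("the dog and the fox", fps))
```

A lower out-of-place score means a closer match, so the first name
returned by classify is the best guess. The toy training texts are far
too short for reliable results.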
Considerable effort went into making this implementation fast and
efficient. The language guesser processes over 100 documents/second on
a simple PC, which makes it practical for many uses. It was developed
for use in our webcrawler and search engine software, in which it
handles millions of documents a day.