skvl/rapidfuzz-cpp

No Description

Max Bachmann 5ed9a7f318 add unicode benchmark		1 year ago
.github	571925a22d Ignore invalid warning on gcc < 10 and add gcc test to CI (#79)	2 years ago
bench	5ed9a7f318 add unicode benchmark	1 year ago
cmake	138b289190 improve cmake config	2 years ago
docs	ad2bcd7c9c replace bitpal with lcs in InDel distance	2 years ago
examples	4bd72a2e9e reduce compiler warnings (#78)	2 years ago
extras	1406b083d5 fix banded implementation	1 year ago
fuzzing	ea90dd443b disable broken banded Levenshtein implementation	1 year ago
rapidfuzz	1406b083d5 fix banded implementation	1 year ago
test	ea90dd443b disable broken banded Levenshtein implementation	1 year ago
tools	2d99e3f5f9 fix recursion error	1 year ago
.clang-format	5e8931e761 use ranges to wrap iterators (#86)	1 year ago
.gitattributes	b2240f75cb fix syntax highlighting	2 years ago
.gitignore	5e8931e761 use ranges to wrap iterators (#86)	1 year ago
CHANGELOG.md	2278c3d737 release v1.1.0	1 year ago
CMakeLists.txt	2278c3d737 release v1.1.0	1 year ago
Doxyfile	ad2bcd7c9c replace bitpal with lcs in InDel distance	2 years ago
LICENSE	8c147308b0 refactor SentenceView	3 years ago
README.md	4bd72a2e9e reduce compiler warnings (#78)	2 years ago

Rapid fuzzy string matching in C++ using the Levenshtein Distance

Description • Installation • Usage • License

Description

RapidFuzz is a fast string matching library for Python and C++, which is using the string similarity calculations from FuzzyWuzzy. However there are two aspects that set RapidFuzz apart from FuzzyWuzzy: 1) It is MIT licensed so it can be used whichever License you might want to choose for your project, while you're forced to adopt the GPL license when using FuzzyWuzzy 2) It is mostly written in C++ and on top of this comes with a lot of Algorithmic improvements to make string matching even faster, while still providing the same results. More details on these performance improvements in form of benchmarks can be found here

The Library is splitted across multiple repositories for the different supported programming languages:

The C++ version is versioned in this repository
The Python version can be found at maxbachmann/rapidfuzz

CMake Integration

There are severals ways to integrate rapidfuzz in your CMake project.

By Installing it

git clone https://github.com/maxbachmann/rapidfuzz-cpp.git rapidfuzz-cpp
cd rapidfuzz-cpp
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build .
cmake --build . --target install

Then in your CMakeLists.txt:

find_package(rapidfuzz REQUIRED)
add_executable(foo main.cpp)
target_link_libraries(foo rapidfuzz::rapidfuzz)

Add this repository as a submodule

git submodule add https://github.com/maxbachmann/rapidfuzz-cpp.git 3rdparty/RapidFuzz

Then you can either:

include it as a subdirectory cmake add_subdirectory(3rdparty/RapidFuzz) add_executable(foo main.cpp) target_link_libraries(foo rapidfuzz::rapidfuzz)

build it at configure time with FetchContent

FetchContent_Declare(
  rapidfuzz
  SOURCE_DIR ${CMAKE_SOURCE_DIR}/3rdparty/RapidFuzz
  PREFIX ${CMAKE_CURRENT_BINARY_DIR}/rapidfuzz
  CMAKE_ARGS -DCMAKE_INSTALL_PREFIX:PATH=<INSTALL_DIR> "${CMAKE_OPT_ARGS}"
)
FetchContent_MakeAvailable(rapidfuzz)
add_executable(foo main.cpp)
target_link_libraries(foo PRIVATE rapidfuzz::rapidfuzz)

Download it at configure time

If you don't want to add rapidfuzz-cpp as a submodule, you can also download it with FetchContent:

FetchContent_Declare(rapidfuzz
  GIT_REPOSITORY https://github.com/maxbachmann/rapidfuzz-cpp.git
  GIT_TAG master)
FetchContent_MakeAvailable(rapidfuzz)
add_executable(foo main.cpp)
target_link_libraries(foo PRIVATE rapidfuzz::rapidfuzz)

It will be downloaded each time you run CMake in a blank folder.

CMake option

There are CMake options available:

BUILD_TESTS : to build test (default OFF and requires Catch2)
BUILD_BENCHMARKS : to build benchmarks (default OFF and requires Google Benchmark)

Usage

#include "rapidfuzz/fuzz.hpp"

Simple Ratio

using rapidfuzz::fuzz::ratio;

// score is 96.55171966552734
double score = rapidfuzz::fuzz::ratio("this is a test", "this is a test!");

Partial Ratio

// score is 100
double score = rapidfuzz::fuzz::partial_ratio("this is a test", "this is a test!");

Token Sort Ratio

// score is 90.90908813476562
double score = rapidfuzz::fuzz::ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

// score is 100
double score = rapidfuzz::fuzz::token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

Token Set Ratio

// score is 83.8709716796875
double score = rapidfuzz::fuzz::token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")

// score is 100
double score = rapidfuzz::fuzz::token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")

Process

In the Python implementation there is a module process, which is used to compare e.g. a string to a list of strings. In Python this both saves the time to implement those features yourself and can be a lot more efficient than repeated type conversions between Python and C++. Implementing a similar function in C++ using templates is not easily possible and probably slower than implementing them on your own. Thats why this section describes how users can implement those features with a couple lines of code using the C++ library.

extract

The following function compares a query string to all strings in a list of choices. It returns all elements with a similarity over score_cutoff. Generally make use of the cached implementations when comparing a string to multiple strings.

template <typename Sentence1,
          typename Iterable, typename Sentence2 = typename Iterable::value_type>
std::vector<std::pair<Sentence2, double>>
extract(const Sentence1& query, const Iterable& choices, const double score_cutoff = 0.0)
{
  std::vector<std::pair<Sentence2, double>> results;

  rapidfuzz::fuzz::CachedRatio<typename Sentence1::value_type> scorer(query);

  for (const auto& choice : choices) {
    double score = scorer.similarity(choice, score_cutoff);

    if (score >= score_cutoff) {
      results.emplace_back(choice, score);
    }
  }

  return results;
}

extractOne

The following function compares a query string to all strings in a list of choices.

template <typename Sentence1,
          typename Iterable, typename Sentence2 = typename Iterable::value_type>
std::optional<std::pair<Sentence2, double>>
extractOne(const Sentence1& query, const Iterable& choices, const double score_cutoff = 0.0)
{
  bool match_found = false;
  double best_score = score_cutoff;
  Sentence2 best_match;

  rapidfuzz::fuzz::CachedRatio<typename Sentence1::value_type> scorer(query);

  for (const auto& choice : choices) {
    double score = scorer.similarity(choice, best_score);

    if (score >= best_score) {
      match_found = true;
      best_score = score;
      best_match = choice;
    }
  }

  if (!match_found) {
    return nullopt;
  }

  return std::make_pair(best_match, best_score);
}

multithreading

It is very simple to use those scorers e.g. with open OpenMP to achieve better performance.

template <typename Sentence1,
          typename Iterable, typename Sentence2 = typename Iterable::value_type>
std::vector<std::pair<Sentence2, double>>
extract(const Sentence1& query, const Iterable& choices, const double score_cutoff = 0.0)
{
  std::vector<std::pair<Sentence2, double>> results(choices.size());

  rapidfuzz::fuzz::CachedRatio<typename Sentence1::value_type> scorer(query);

  #pragma omp parallel for
  for (size_t i = 0; i < choices.size(); ++i) {
    double score = scorer.similarity(choices[i], score_cutoff);
    results[i] = std::make_pair(choices[i], score);
  }

  return results;
}

License

RapidFuzz is licensed under the MIT license since I believe that everyone should be able to use it without being forced to adopt the GPL license. Thats why the library is based on an older version of fuzzywuzzy that was MIT licensed as well. This old version of fuzzywuzzy can be found here.

README.md