Skip to content

pluots/stringmetrics

Repository files navigation

Stringmetrics

This is a Rust library for approximate string matching that implements simple algorithms such has Hamming distance, Levenshtein distance, Jaccard similarity, and more.

Here are some useful quick links:

Algorithms

The main purpose of this library is to provide a variety of string metric functions. Included algorithms are:

  • Levenshtein Distance
  • Limited & Weighted Levenshtein Distance
  • Jaccard Similarity
  • Hamming Distance

See the documentation for full information. Some examples are below:

// Basic levenshtein distance
use stringmetrics::levenshtein;

assert_eq!(levenshtein("kitten", "sitting"), 3);
// Levenshtein distance with a limit to save computation time
use stringmetrics::levenshtein_limit;

assert_eq!(levenshtein_limit("a very long string", "short!", 4), 4);
// Set custom weights
use stringmetrics::{levenshtein_weight, LevWeights};

// This struct holds insertion, deletion, and substitution costs
let weights = LevWeights::new(4, 3, 2);
assert_eq!(levenshtein_weight("kitten", "sitting", 100, &weights), 8);
// Basic hamming distance
use stringmetrics::hamming;

let a = "abcdefg";
let b = "aaadefa";
assert_eq!(hamming(a, b), Ok(3));

Future Algorithms & Direction

Eventually, this library aims to add support for more algorithms. Intended work includes:

  1. Update levenshtein distance to have a more performant algorithm for short (<64 characters) and long (>100 characters) strings
  2. Add the Damerau–Levenshtein distance
  3. Add the Jaro–Winkler distance
  4. Add the Tversky index
  5. Add Cosine similarity
  6. Add some useful tokenizers with examples

License

See the LICENSE file for license information. The provided license does allow for proprietary use and adaptation; that being said, I kindly suggest that if you come up with an improvement, you submit a pull request and help us all out :)