liblevenshtein-cpp

A library for generating Finite State Transducers based on Levenshtein Automata.

NOTE: This library is currently in the release-candidate (rc) phase. I'll have it production-ready as soon as possible. Currently, there is >90% test coverage over the sources, and the library is usable as described below.

Due to limited resources on my part, this library requires C++20 features (or whichever standard is the latest). If you need compatibility with an older standard, please either submit a pull request (preferably) or create an issue stating the standard you need compatibility with, and I will accommodate it if I can.

For a demonstration, please reference the example app.

For API documentation, please reference the GitHub Pages.

Development and Installation

For instructions on how to develop and install liblevenshtein, please reference the wiki.

Usage

Algorithms

liblevenshtein supports three variations of Levenshtein distance, where each variation is defined by the elementary operations it supports. An elementary operation is an edit operation that incurs a penalty of 1 unit.

  1. liblevenshtein::Algorithm::STANDARD (insertions, deletions, and substitutions)
  2. liblevenshtein::Algorithm::TRANSPOSITION (the standard operations plus transposition of adjacent characters)
  3. liblevenshtein::Algorithm::MERGE_AND_SPLIT (the standard operations plus merging two characters into one and splitting one character into two)
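
The algorithm is selected at compile time as the first template argument of liblevenshtein::Transducer. Below is a minimal sketch, assuming a dictionary pointer built as in the full example further down; the helper function build_transducers is hypothetical and exists only to keep the sketch self-contained.

#include <string>

#include <liblevenshtein/collection/dawg.h>
#include <liblevenshtein/transducer/algorithm.h>
#include <liblevenshtein/transducer/transducer.h>

namespace ll = liblevenshtein;

// `dawg` is assumed to be a ll::Dawg* constructed as in the example below.
void build_transducers(ll::Dawg *dawg) {
  // Each variation is a distinct instantiation of the same class template.
  ll::Transducer<ll::Algorithm::STANDARD, std::string> standard(dawg->root());
  ll::Transducer<ll::Algorithm::TRANSPOSITION, std::string> transposition(dawg->root());
  ll::Transducer<ll::Algorithm::MERGE_AND_SPLIT, std::string> merge_and_split(dawg->root());
}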

Results

liblevenshtein supports returning results in two formats:

  1. std::string
  2. liblevenshtein::Candidate (an alias for std::pair<std::string, std::size_t>)
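
The result type is the second template argument of the transducer and determines what each query yields: bare terms or (term, distance) pairs. Below is a minimal sketch, assuming a dictionary, query term, and maximum edit distance as in the full example further down; the helper function iterate_results is hypothetical.

#include <cstddef>
#include <string>

#include <liblevenshtein/collection/dawg.h>
#include <liblevenshtein/transducer/algorithm.h>
#include <liblevenshtein/transducer/transducer.h>

namespace ll = liblevenshtein;

void iterate_results(ll::Dawg *dawg, const std::string &query_term,
                     std::size_t max_distance) {
  // Yields only the matching terms.
  ll::Transducer<ll::Algorithm::STANDARD, std::string> terms_only(dawg->root());
  for (const std::string &term : terms_only(query_term, max_distance)) {
    // term is within max_distance of query_term
  }

  // Yields ll::Candidate, an alias for std::pair<std::string, std::size_t>.
  ll::Transducer<ll::Algorithm::STANDARD, ll::Candidate> with_distances(dawg->root());
  for (const ll::Candidate &candidate : with_distances(query_term, max_distance)) {
    const std::string &term = candidate.first;  // spelling candidate
    std::size_t distance = candidate.second;    // its edit distance from query_term
  }
}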

Example

# file: CMakeLists.txt

cmake_minimum_required(VERSION 3.20 FATAL_ERROR)

project(liblevenshtein-demo
  VERSION 1.0.0
  DESCRIPTION "Demonstrates how to use liblevenshtein-cpp."
  HOMEPAGE_URL "https://github.com/universal-automata/liblevenshtein-cpp"
  LANGUAGES CXX)

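# liblevenshtein-cpp currently requires C++20 (see the note above).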
set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_EXTENSIONS OFF)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

set(CMAKE_CXX_FLAGS_DEBUG "-g -O0")
set(CMAKE_C_FLAGS_DEBUG "-g -O0")

set(CMAKE_COMPILE_WARNING_AS_ERROR ON)

set(CMAKE_VERBOSE_MAKEFILE ON)

include(GNUInstallDirs)

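# Protobuf is required because liblevenshtein serializes dictionaries with
# Protocol Buffers.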
find_package(Protobuf REQUIRED)
find_package(liblevenshtein REQUIRED)

add_executable(${PROJECT_NAME}
  "main.cpp")

target_link_libraries(${PROJECT_NAME}
  PRIVATE
    protobuf::libprotobuf
    levenshtein)

// file: main.cpp

#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

#include <google/protobuf/stubs/common.h>

#include <liblevenshtein/collection/dawg.h>
#include <liblevenshtein/collection/sorted_dawg.h>
#include <liblevenshtein/transducer/algorithm.h>
#include <liblevenshtein/transducer/transducer.h>
#include <liblevenshtein/serialization/serializer.h>

namespace ll = liblevenshtein;

int main(int argc, char *argv[]) {
  // Verify that the version of the library that we linked against is
  // compatible with the version of the headers we compiled against.
  GOOGLE_PROTOBUF_VERIFY_VERSION;

  // path to file containing serialized dictionary
  std::string serialization_path;

  ll::Dawg *dawg = ll::deserialize_protobuf(serialization_path);

  if (dawg == nullptr) {
    std::vector<std::string> terms; // populate with your spelling candidates
    std::sort(terms.begin(), terms.end()); // must be sorted for now

    // NOTE: If sorted_dawg returns nullptr, construction of the dictionary
    // failed, most likely because terms was not sorted lexicographically in
    // ascending order.
    dawg = ll::sorted_dawg(terms.begin(), terms.end());
  }

  /**
   * Template arguments:
   * 1. ll::Algorithm to use for searching (options: STANDARD, TRANSPOSITION, or MERGE_AND_SPLIT)
   * 2. Return type for spelling candidates (options: std::string or ll::Candidate)
   *
   * NOTE: ll::Candidate is an alias for std::pair<std::string, std::size_t>
   */
  ll::Transducer<ll::Algorithm::TRANSPOSITION, ll::Candidate> transduce(dawg->root());

  std::string query_term;    // assign the term whose spelling you wish to correct

  std::size_t max_distance = 2;  // maximum number of operations allowed to transform
                                 // a spelling candidate into query_term (edit distance)

  // NOTE: ll::Candidate is an alias for std::pair<std::string, std::size_t>
  for (const ll::Candidate& candidate : transduce(query_term, max_distance)) {
    const std::string& term = candidate.first;     // spelling candidate for query_term

    const std::size_t& distance = candidate.second;  // minimum number of operations required
                                                     // to transform query_term into term
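    // use term and distance here, e.g., to rank or display suggestions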
  }

  /**
   * If you had initialized the transducer as
   * ll::Transducer<ll::Algorithm::TRANSPOSITION, std::string>, you'd iterate
   * over the results as follows:
   * for (const std::string& term : transduce(query_term, max_distance)) {
   *   // do something with term, which is guaranteed to require no more
   *   // than max_distance operations to transform it into the query_term.
   * }
   */

  // save the dictionary for reuse
  ll::serialize_protobuf(dawg, serialization_path);

  delete dawg;

  // Optional:  Delete all global objects allocated by libprotobuf.
  google::protobuf::ShutdownProtobufLibrary();

  return 0;
}