TextDistance
TextDistance -- a Python library for measuring the distance between two or more sequences using many algorithms.
Features:
30+ algorithms
Pure python implementation
Simple usage
Comparing more than two sequences
Some algorithms have more than one implementation in one class.
Optional numpy usage for maximum speed.
Algorithms
Edit based
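| Algorithm | Class | Functions |
|---|---|---|
| Hamming | Hamming | hamming |
| MLIPNS | Mlipns | mlipns |
| Levenshtein | Levenshtein | levenshtein |
| Damerau-Levenshtein | DamerauLevenshtein | damerau_levenshtein |
| Jaro-Winkler | JaroWinkler | jaro_winkler, jaro |
| Strcmp95 | StrCmp95 | strcmp95 |
| Needleman-Wunsch | NeedlemanWunsch | needleman_wunsch |
| Gotoh | Gotoh | gotoh |
| Smith-Waterman | SmithWaterman | smith_waterman |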
Token based
| Algorithm | Class | Functions |
|---|---|---|
| Jaccard | Jaccard | jaccard |
| Sorensen-Dice | Sorensen | sorensen, sorensen_dice, dice |
| Tversky | Tversky | tversky |
| Monge-Elkan | MongeElkan | monge_elkan |
Sequence based
Compression based
Normalized compression distance with different compression algorithms.
Classic compression algorithms:
| Algorithm | Class | Function |
|---|---|---|
| Arithmetic coding | ArithNCD | arith_ncd |
| RLE | RLENCD | rle_ncd |
| BWT RLE | BWTRLENCD | bwtrle_ncd |
Normal compression algorithms:
| Algorithm | Class | Function |
|---|---|---|
| Square Root | SqrtNCD | sqrt_ncd |
| Entropy | EntropyNCD | entropy_ncd |
Work-in-progress algorithms that compare two strings as arrays of bits:
| Algorithm | Class | Function |
|---|---|---|
| BZ2 | BZ2NCD | bz2_ncd |
| LZMA | LZMANCD | lzma_ncd |
| ZLib | ZLIBNCD | zlib_ncd |
See blog post for more details about NCD.
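To make the idea of normalized compression distance concrete, here is a minimal sketch of the standard two-sequence formula, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. It uses zlib from the standard library as the compressor. The function name ncd and the sample inputs are assumptions made for illustration, and this is not necessarily how the TextDistance classes above compute their values.

```python
import zlib


def ncd(x: bytes, y: bytes) -> float:
    """Two-sequence normalized compression distance, with zlib as compressor."""
    cx = len(zlib.compress(x))        # compressed size of x alone
    cy = len(zlib.compress(y))        # compressed size of y alone
    cxy = len(zlib.compress(x + y))   # compressed size of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


# Illustrative inputs (assumptions, not taken from the library's tests)
repetitive = b"the quick brown fox jumps over the lazy dog " * 10
unrelated = bytes(range(256)) * 2

ncd_same = ncd(repetitive, repetitive)  # shares almost all information
ncd_diff = ncd(repetitive, unrelated)   # shares almost nothing
print(ncd_same, ncd_diff)
```

The more information two inputs share, the less the concatenation costs to compress beyond the larger input alone, so similar inputs score closer to 0 than unrelated ones.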
Phonetic
| Algorithm | Class | Functions |
|---|---|---|
| MRA | MRA | mra |
| Editex | Editex | editex |
Simple
| Algorithm | Class | Functions |
|---|---|---|
| Prefix similarity | Prefix | prefix |
| Postfix similarity | Postfix | postfix |
| Length distance | Length | length |
| Identity similarity | Identity | identity |
| Matrix similarity | Matrix | matrix |
Installation
Stable
Pure Python implementation only:
pip install textdistance
With extra libraries for maximum speed:
pip install textdistance[extras]
With all libraries (required for benchmarking and testing):
pip install textdistance[benchmark]
With algorithm specific extras:
pip install textdistance[Hamming]
Algorithms with available extras: DamerauLevenshtein, Hamming, Jaro, JaroWinkler, Levenshtein.
Dev
Via pip:
pip install -e git+https://github.com/orsinium/textdistance.git#egg=textdistance
Or clone repo and install with some extras:
git clone https://github.com/orsinium/textdistance.git
pip install -e .[benchmark]
Usage
All algorithms have 2 interfaces:
Class with algorithm-specific params for customizing.
Class instance with default params for quick and simple usage.
All algorithms have some common methods:
.distance(*sequences) -- calculate distance between sequences.
.similarity(*sequences) -- calculate similarity for sequences.
.maximum(*sequences) -- maximum possible value for distance and similarity. For any sequence: distance + similarity == maximum.
.normalized_distance(*sequences) -- normalized distance between sequences. The return value is a float between 0 and 1, where 0 means equal, and 1 totally different.
.normalized_similarity(*sequences) -- normalized similarity for sequences. The return value is a float between 0 and 1, where 0 means totally different, and 1 equal.
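The relationship between these methods can be sketched with a small stand-in class. HammingSketch is a hypothetical name, and the code below is an illustration of the contract described above (distance + similarity == maximum, with normalized values in the 0..1 range), not the library's own implementation.

```python
from itertools import zip_longest


class HammingSketch:
    """Hypothetical stand-in that mirrors the common method names."""

    def distance(self, *sequences):
        # Count positions where the aligned items are not all identical;
        # zip_longest pads shorter sequences with None, which counts as a mismatch.
        return sum(len(set(items)) > 1 for items in zip_longest(*sequences))

    def maximum(self, *sequences):
        # The largest possible distance is the length of the longest sequence.
        return max(len(s) for s in sequences)

    def similarity(self, *sequences):
        # Defined so that distance + similarity == maximum.
        return self.maximum(*sequences) - self.distance(*sequences)

    def normalized_distance(self, *sequences):
        maximum = self.maximum(*sequences)
        return self.distance(*sequences) / maximum if maximum else 0.0

    def normalized_similarity(self, *sequences):
        return 1.0 - self.normalized_distance(*sequences)


sketch = HammingSketch()
print(sketch.distance('test', 'text'))             # 1
print(sketch.similarity('test', 'text'))           # 3
print(sketch.normalized_distance('test', 'text'))  # 0.25
```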
Most common init arguments:
qval -- q-value for splitting sequences into q-grams. Possible values:
1 (default) -- compare sequences by chars.
2 or more -- transform sequences to q-grams.
None -- split sequences by words.
as_set -- for token-based algorithms:
True -- t and ttt are equal.
False (default) -- t and ttt are different.
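The effect of these two arguments can be sketched in plain Python. The helpers to_qgrams and jaccard_similarity are hypothetical names introduced here for illustration; they show what the three qval modes produce and why as_set changes the result for t versus ttt, under the assumption of a simple Jaccard ratio. They are not the library's code.

```python
from collections import Counter


def to_qgrams(sequence, qval=1):
    """Split a string the way the qval description above outlines."""
    if qval is None:
        return sequence.split()          # split by words
    if qval == 1:
        return list(sequence)            # compare by characters
    # Sliding window of length qval
    return [sequence[i:i + qval] for i in range(len(sequence) - qval + 1)]


def jaccard_similarity(a, b, qval=1, as_set=False):
    """Jaccard ratio over tokens, counted either as a set or as a multiset."""
    tokens_a, tokens_b = to_qgrams(a, qval), to_qgrams(b, qval)
    if as_set:
        set_a, set_b = set(tokens_a), set(tokens_b)
        intersection, union = len(set_a & set_b), len(set_a | set_b)
    else:
        count_a, count_b = Counter(tokens_a), Counter(tokens_b)
        intersection = sum((count_a & count_b).values())  # min of the counts
        union = sum((count_a | count_b).values())         # max of the counts
    return intersection / union if union else 1.0


print(to_qgrams('test', 2))                          # ['te', 'es', 'st']
print(to_qgrams('hello world', None))                # ['hello', 'world']
print(jaccard_similarity('t', 'ttt', as_set=True))   # 1.0
print(jaccard_similarity('t', 'ttt', as_set=False))  # one shared 't' out of three
```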
Example
import textdistance
textdistance.hamming('test', 'text')
# 1
textdistance.hamming.distance('test', 'text')
# 1
textdistance.hamming.similarity('test', 'text')
# 3
textdistance.hamming.normalized_distance('test', 'text')
# 0.25
textdistance.hamming.normalized_similarity('test', 'text')
# 0.75
textdistance.Hamming(qval=2).distance('test', 'text')
# 2
All other algorithms have the same interface.
Extra libraries
For the main algorithms, textdistance tries to call known external libraries (fastest first) if they are available (installed on your system) and applicable (the external implementation can compare the given type of sequences). Install textdistance with extras to enable this feature.
You can disable this by passing external=False on init:
import textdistance
hamming = textdistance.Hamming(external=False)
hamming('text', 'testit')
# 3
Supported libraries:
abydos
distance
jellyfish
Levenshtein
py_stringmatching
pylev
pyxdameraulevenshtein
Algorithms:
DamerauLevenshtein
Hamming
Jaro
JaroWinkler
Levenshtein
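The "fastest first, fall back to pure Python" behaviour described in this section can be sketched roughly as follows. The module and function names in CANDIDATES are the ones that appear in the Benchmarks section; the helper names pure_hamming and hamming, and the way failures are handled, are assumptions made for illustration and may differ from what the library actually does.

```python
import importlib
from itertools import zip_longest


def pure_hamming(s1, s2):
    """Pure Python fallback: count mismatched positions, padding the shorter input."""
    return sum(a != b for a, b in zip_longest(s1, s2))


# (module name, function name), ordered fastest first per the benchmark table
CANDIDATES = [("Levenshtein", "hamming"), ("jellyfish", "hamming_distance")]


def hamming(s1, s2, external=True):
    if external:
        for module_name, func_name in CANDIDATES:
            try:
                module = importlib.import_module(module_name)
                func = getattr(module, func_name)
                return func(s1, s2)
            except (ImportError, AttributeError, ValueError, TypeError):
                # Not installed, or this library cannot handle these inputs:
                # try the next candidate.
                continue
    return pure_hamming(s1, s2)


print(hamming('text', 'testit', external=False))  # 3
```

With external=False the loop is skipped entirely, so the result never depends on which libraries happen to be installed.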
Benchmarks
Without extras installation:
| algorithm | library | function | time |
|---|---|---|---|
| DamerauLevenshtein | jellyfish | damerau_levenshtein_distance | 0.00965294 |
| DamerauLevenshtein | pyxdameraulevenshtein | damerau_levenshtein_distance | 0.151378 |
| DamerauLevenshtein | pylev | damerau_levenshtein | 0.766461 |
| DamerauLevenshtein | textdistance | DamerauLevenshtein | 4.13463 |
| DamerauLevenshtein | abydos | damerau_levenshtein | 4.3831 |
| Hamming | Levenshtein | hamming | 0.0014428 |
| Hamming | jellyfish | hamming_distance | 0.00240262 |
| Hamming | distance | hamming | 0.036253 |
| Hamming | abydos | hamming | 0.0383933 |
| Hamming | textdistance | Hamming | 0.176781 |
| Jaro | Levenshtein | jaro | 0.00313561 |
| Jaro | jellyfish | jaro_distance | 0.0051885 |
| Jaro | py_stringmatching | jaro | 0.180628 |
| Jaro | textdistance | Jaro | 0.278917 |
| JaroWinkler | Levenshtein | jaro_winkler | 0.00319735 |
| JaroWinkler | jellyfish | jaro_winkler | 0.00540443 |
| JaroWinkler | textdistance | JaroWinkler | 0.289626 |
| Levenshtein | Levenshtein | distance | 0.00414404 |
| Levenshtein | jellyfish | levenshtein_distance | 0.00601647 |
| Levenshtein | py_stringmatching | levenshtein | 0.252901 |
| Levenshtein | pylev | levenshtein | 0.569182 |
| Levenshtein | distance | levenshtein | 1.15726 |
| Levenshtein | abydos | levenshtein | 3.68451 |
| Levenshtein | textdistance | Levenshtein | 8.63674 |
Total: 24 libs.
Yeah, so slow. Use TextDistance in production only with extras.
TextDistance uses the benchmark results to optimize algorithm selection and tries to call the fastest external library first (if possible).
You can run benchmark manually on your system:
pip install textdistance[benchmark]
python3 -m textdistance.benchmark
TextDistance shows a benchmark results table for your system and saves the library priorities to a libraries.json file in TextDistance's folder. textdistance then uses this file to call the fastest implementation of each algorithm. A default libraries.json is already included in the package.
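As a rough illustration of the idea of timing candidate implementations and persisting their order, here is a small self-contained sketch. Everything in it is hypothetical: the two candidate functions, the rank_by_speed helper, the priorities.json file name, and the JSON layout are assumptions for illustration and do not reflect the real format of libraries.json.

```python
import json
import tempfile
import timeit
from itertools import zip_longest
from pathlib import Path


def hamming_zip(s1, s2):
    """Candidate 1: pad with zip_longest and count mismatches."""
    return sum(a != b for a, b in zip_longest(s1, s2))


def hamming_index(s1, s2):
    """Candidate 2: index-based loop over the longer length."""
    longest = max(len(s1), len(s2))
    return sum(
        1 for i in range(longest)
        if i >= len(s1) or i >= len(s2) or s1[i] != s2[i]
    )


def rank_by_speed(candidates, args, number=1000):
    """Return candidate names sorted from fastest to slowest."""
    timings = {
        name: timeit.timeit(lambda f=func: f(*args), number=number)
        for name, func in candidates.items()
    }
    return sorted(timings, key=timings.get)


candidates = {"hamming_zip": hamming_zip, "hamming_index": hamming_index}
priorities = {"Hamming": rank_by_speed(candidates, ("text", "testit"))}

# Save the ordering (hypothetical layout) and read it back
path = Path(tempfile.mkdtemp()) / "priorities.json"
path.write_text(json.dumps(priorities, indent=2))
loaded = json.loads(path.read_text())
print(loaded)
```

The order in the saved file depends on the machine it was measured on, which is why re-running the benchmark locally can change which implementation is tried first.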
Test
You can run tests via tox:
sudo pip3 install tox
tox