TextDistance
TextDistance is a Python library for comparing the distance between two or more sequences using many algorithms.
Features:
30+ algorithms
Pure Python implementation
Simple usage
Comparison of more than two sequences
Some algorithms have more than one implementation in one class.
Optional numpy usage for maximum speed.
Algorithms
Edit based
Algorithm | Class | Functions
---|---|---
Hamming | Hamming | hamming
MLIPNS | Mlipns | mlipns
Levenshtein | Levenshtein | levenshtein
Damerau-Levenshtein | DamerauLevenshtein | damerau_levenshtein
Jaro-Winkler | JaroWinkler | jaro_winkler, jaro
Strcmp95 | StrCmp95 | strcmp95
Needleman-Wunsch | NeedlemanWunsch | needleman_wunsch
Gotoh | Gotoh | gotoh
Smith-Waterman | SmithWaterman | smith_waterman
Token based
Algorithm | Class | Functions
---|---|---
Jaccard index | Jaccard | jaccard
Sørensen–Dice coefficient | Sorensen | sorensen, sorensen_dice, dice
Tversky index | Tversky | tversky
Overlap coefficient | Overlap | overlap
Tanimoto distance | Tanimoto | tanimoto
Cosine similarity | Cosine | cosine
Monge-Elkan | MongeElkan | monge_elkan
Bag distance | Bag | bag
Sequence based
Algorithm | Class | Functions
---|---|---
longest common subsequence similarity | LCSSeq | lcsseq
longest common substring similarity | LCSStr | lcsstr
Ratcliff-Obershelp similarity | RatcliffObershelp | ratcliff_obershelp
Compression based
Normalized compression distance with different compression algorithms.
Classic compression algorithms:
Algorithm | Class | Function
---|---|---
Arithmetic coding | ArithNCD | arith_ncd
RLE | RLENCD | rle_ncd
BWT RLE | BWTRLENCD | bwtrle_ncd
Normal compression algorithms:
Algorithm | Class | Function
---|---|---
Square Root | SqrtNCD | sqrt_ncd
Entropy | EntropyNCD | entropy_ncd
Work-in-progress algorithms that compare two strings as arrays of bits:
Algorithm | Class | Function
---|---|---
BZ2 | BZ2NCD | bz2_ncd
LZMA | LZMANCD | lzma_ncd
ZLib | ZLIBNCD | zlib_ncd
See blog post for more details about NCD.
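As a rough illustration of the idea behind normalized compression distance, here is a minimal sketch using the standard-library zlib compressor. It uses the common textbook formula NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. The helper names compressed_size and ncd are made up for this example, and the library's own classes may compute the value differently.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (hypothetical helper)."""
    return len(zlib.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """Textbook NCD; an illustration, not textdistance's implementation."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


text = b"the quick brown fox jumps over the lazy dog " * 20
# A byte pattern that shares nothing with the text above.
other = bytes((i * 7 + 13) % 256 for i in range(880))

# Identical inputs compress well together, unrelated inputs do not.
print(ncd(text, text))
print(ncd(text, other))
```

The first value should be close to 0 and the second close to 1, because concatenating unrelated data gives the compressor nothing to reuse.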
Phonetic
Algorithm | Class | Functions
---|---|---
MRA | MRA | mra
Editex | Editex | editex
Simple
Algorithm | Class | Functions
---|---|---
Prefix similarity | Prefix | prefix
Postfix similarity | Postfix | postfix
Length distance | Length | length
Identity similarity | Identity | identity
Matrix similarity | Matrix | matrix
Installation
Stable
Pure Python implementation only:
pip install textdistance
With extra libraries for maximum speed:
pip install textdistance[extras]
With all libraries (required for benchmarking and testing):
pip install textdistance[benchmark]
With algorithm-specific extras:
pip install textdistance[Hamming]
Algorithms with available extras: DamerauLevenshtein, Hamming, Jaro, JaroWinkler, Levenshtein.
Dev
Via pip:
pip install -e git+https://github.com/orsinium/textdistance.git#egg=textdistance
Or clone the repo and install with some extras:
git clone https://github.com/orsinium/textdistance.git
cd textdistance
pip install -e .[benchmark]
Usage
All algorithms have 2 interfaces:
Class with algorithm-specific params for customizing.
Class instance with default params for quick and simple usage.
All algorithms have some common methods:
.distance(*sequences) -- calculate distance between sequences.
.similarity(*sequences) -- calculate similarity for sequences.
.maximum(*sequences) -- maximum possible value for distance and similarity. For any sequence: distance + similarity == maximum.
.normalized_distance(*sequences) -- normalized distance between sequences. The return value is a float between 0 and 1, where 0 means equal, and 1 totally different.
.normalized_similarity(*sequences) -- normalized similarity for sequences. The return value is a float between 0 and 1, where 0 means totally different, and 1 equal.
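The relationship between these methods can be sketched in pure Python. The class below is an illustrative stand-in written for this README, assuming a Hamming-style comparison; the name HammingSketch is hypothetical and this is not the library's own Hamming class.

```python
from itertools import zip_longest


class HammingSketch:
    """Illustrative stand-in showing how the common methods relate."""

    def maximum(self, *sequences):
        # The distance cannot exceed the length of the longest sequence.
        return max(len(s) for s in sequences)

    def distance(self, *sequences):
        # Count positions where the sequences do not all agree;
        # zip_longest pads shorter sequences with None.
        return sum(len(set(column)) > 1 for column in zip_longest(*sequences))

    def similarity(self, *sequences):
        # By definition here: distance + similarity == maximum.
        return self.maximum(*sequences) - self.distance(*sequences)

    def normalized_distance(self, *sequences):
        maximum = self.maximum(*sequences)
        return self.distance(*sequences) / maximum if maximum else 0.0

    def normalized_similarity(self, *sequences):
        return 1.0 - self.normalized_distance(*sequences)


hamming = HammingSketch()
print(hamming.distance("test", "text"))               # 1
print(hamming.similarity("test", "text"))             # 3
print(hamming.normalized_distance("test", "text"))    # 0.25
print(hamming.normalized_similarity("test", "text"))  # 0.75
```

Because every method accepts *sequences, the same sketch also works for three or more inputs, for example hamming.distance("test", "text", "tent").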
Most common init arguments:
qval -- q-value for splitting sequences into q-grams. Possible values:
  1 (default) -- compare sequences by chars.
  2 or more -- transform sequences to q-grams.
  None -- split sequences by words.
as_set -- for token-based algorithms:
  True -- t and ttt are equal.
  False (default) -- t and ttt are different.
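To make the effect of these two arguments concrete, here is a small pure-Python sketch. The helpers split_sequence and jaccard_similarity are hypothetical names invented for this example; they mimic the behaviour described above and are not the library's internals.

```python
from collections import Counter


def split_sequence(sequence, qval=1):
    """Hypothetical helper mirroring the three qval modes."""
    if qval is None:
        return sequence.split()  # split by words
    if qval == 1:
        return list(sequence)    # compare by single characters
    # Overlapping q-grams of length qval.
    return [sequence[i:i + qval] for i in range(len(sequence) - qval + 1)]


def jaccard_similarity(left, right, qval=1, as_set=False):
    """Jaccard index on token counts (or on token sets when as_set=True)."""
    a = Counter(split_sequence(left, qval))
    b = Counter(split_sequence(right, qval))
    if as_set:
        # Ignore how often a token occurs, keep only whether it occurs.
        a, b = Counter(set(a)), Counter(set(b))
    intersection = sum((a & b).values())
    union = sum((a | b).values())
    return intersection / union if union else 1.0


print(split_sequence("test", qval=2))            # ['te', 'es', 'st']
print(split_sequence("hello world", qval=None))  # ['hello', 'world']
print(jaccard_similarity("t", "ttt", as_set=True))   # 1.0
print(jaccard_similarity("t", "ttt", as_set=False))
```

With as_set=False the last call gives one third, since one of the three t tokens is shared.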
Example
import textdistance
textdistance.hamming('test', 'text')
# 1
textdistance.hamming.distance('test', 'text')
# 1
textdistance.hamming.similarity('test', 'text')
# 3
textdistance.hamming.normalized_distance('test', 'text')
# 0.25
textdistance.hamming.normalized_similarity('test', 'text')
# 0.75
textdistance.Hamming(qval=2).distance('test', 'text')
# 2
All other algorithms have the same interface.
Extra libraries
For the main algorithms, textdistance tries to call known external libraries
(fastest first) when they are available (installed on your system) and
applicable (the external implementation can compare the given type of
sequences). Install textdistance with extras to enable this feature.
You can disable this by passing the external=False argument on init:
import textdistance
hamming = textdistance.Hamming(external=False)
hamming('text', 'testit')
# 3
Supported libraries:
abydos
distance
jellyfish
Levenshtein
py_stringmatching
pylev
pyxdameraulevenshtein
Algorithms:
DamerauLevenshtein
Hamming
Jaro
JaroWinkler
Levenshtein
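The "fastest first, fall back if unavailable" behaviour can be sketched as follows. This is only an illustration under assumptions: the helper load_first_available, the candidate list format, and the module name hypothetical_fast_lib are all invented here and do not describe textdistance's internal code.

```python
import importlib


def pure_python_hamming(a, b):
    """Fallback: count differing positions of two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def load_first_available(candidates, fallback):
    """Return the first importable function from candidates, else fallback.

    candidates is an ordered list of (module_name, function_name) pairs,
    fastest first. Both the helper and this format are hypothetical.
    """
    for module_name, function_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # library not installed, try the next one
        function = getattr(module, function_name, None)
        if callable(function):
            return function
    return fallback


hamming = load_first_available(
    [("hypothetical_fast_lib", "hamming")],  # placeholder name, assumed absent
    fallback=pure_python_hamming,
)
print(hamming("test", "text"))
```

When none of the listed modules can be imported, the pure Python fallback is used, which is the situation external=False forces in the real library.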
Benchmarks
Without extras installation:
algorithm | library | function | time
---|---|---|---
DamerauLevenshtein | jellyfish | damerau_levenshtein_distance | 0.00965294
DamerauLevenshtein | pyxdameraulevenshtein | damerau_levenshtein_distance | 0.151378
DamerauLevenshtein | pylev | damerau_levenshtein | 0.766461
DamerauLevenshtein | textdistance | DamerauLevenshtein | 4.13463
DamerauLevenshtein | abydos | damerau_levenshtein | 4.3831
Hamming | Levenshtein | hamming | 0.0014428
Hamming | jellyfish | hamming_distance | 0.00240262
Hamming | distance | hamming | 0.036253
Hamming | abydos | hamming | 0.0383933
Hamming | textdistance | Hamming | 0.176781
Jaro | Levenshtein | jaro | 0.00313561
Jaro | jellyfish | jaro_distance | 0.0051885
Jaro | py_stringmatching | jaro | 0.180628
Jaro | textdistance | Jaro | 0.278917
JaroWinkler | Levenshtein | jaro_winkler | 0.00319735
JaroWinkler | jellyfish | jaro_winkler | 0.00540443
JaroWinkler | textdistance | JaroWinkler | 0.289626
Levenshtein | Levenshtein | distance | 0.00414404
Levenshtein | jellyfish | levenshtein_distance | 0.00601647
Levenshtein | py_stringmatching | levenshtein | 0.252901
Levenshtein | pylev | levenshtein | 0.569182
Levenshtein | distance | levenshtein | 1.15726
Levenshtein | abydos | levenshtein | 3.68451
Levenshtein | textdistance | Levenshtein | 8.63674
Total: 24 libs.
Yeah, so slow. Use TextDistance in production only with extras.
TextDistance uses the benchmark results to optimize algorithm selection and
tries to call the fastest external library first (if possible).
You can run benchmark manually on your system:
pip install textdistance[benchmark]
python3 -m textdistance.benchmark
TextDistance shows a benchmark results table for your system and saves the
library priorities into the libraries.json file in TextDistance's folder.
textdistance then uses this file to call the fastest algorithm implementation.
A default libraries.json is already included in the package.
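The general idea of ranking implementations by speed and storing the order can be sketched with the standard library alone. Everything here is an assumption made for illustration: the two toy Hamming functions, the rank_by_speed helper, and the JSON layout are invented and do not reflect the actual format of libraries.json.

```python
import json
import timeit


def hamming_zip(a, b):
    """Toy implementation: compare paired items."""
    return sum(x != y for x, y in zip(a, b))


def hamming_index(a, b):
    """Toy implementation: compare by index."""
    return sum(1 for i in range(min(len(a), len(b))) if a[i] != b[i])


def rank_by_speed(implementations, args, number=1000):
    """Time each callable and return (name, seconds) pairs, fastest first."""
    timings = [
        (name, timeit.timeit(lambda f=func: f(*args), number=number))
        for name, func in implementations.items()
    ]
    return sorted(timings, key=lambda item: item[1])


ranking = rank_by_speed(
    {"hamming_zip": hamming_zip, "hamming_index": hamming_index},
    args=("test", "text"),
)
# Invented layout: algorithm name mapped to implementations, fastest first.
priorities = {"Hamming": [name for name, _ in ranking]}
print(json.dumps(priorities, indent=2))
```

The printed order depends on the machine, which is why rerunning the real benchmark on your own system can change the stored priorities.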
Test
You can run tests via tox:
sudo pip3 install tox
tox