dedupe (Java version) CHANGELOG.md

Latest recommended article published on 2022-11-04 15:51:10

weixin_39793794

Article tags: dedupe (Java version)

Article link: https://blog.csdn.net/weixin_39793794/article/details/114964273

# 2.0.6

- Fixed a bug that prevented learning of index predicates in Dedupe mode

# 2.0.3

- Improved memory performance of connected components
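
For readers unfamiliar with the term, connected components here means grouping records that are linked, directly or transitively, by matched pairs. The following is a minimal union-find sketch of that idea, not the library's implementation; the function name and the plain list of id pairs are assumptions made for illustration.

```python
from collections import defaultdict


def connected_components(pairs):
    """Group record ids into components linked by at least one matched pair.

    Illustrative only: `pairs` is assumed to be an iterable of (id_a, id_b).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        # Path halving keeps the trees shallow without recursion.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    components = defaultdict(set)
    for record_id in list(parent):
        components[find(record_id)].add(record_id)
    return sorted(sorted(members) for members in components.values())


matched_pairs = [(1, 2), (2, 3), (4, 5)]
print(connected_components(matched_pairs))  # [[1, 2, 3], [4, 5]]
```

Only one parent entry is stored per record, which is one reason this style of algorithm tends to be memory-friendly on large inputs.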

# 2.0

- Python 3 only

- Static typing and type hints

- Incorporate sqlite to extend normal API to millions of records

- Multiprocessing enabled for Windows

- Multiprocessing mode changed to spawn for Mac OS X

- Moved from CamelCase to lowercase_with_underscore for method names.

- Dropped ability to save indices in save settings.

- Moved from Dedupe.match -> Dedupe.partition, RecordLink.match -> RecordLink.join, Gazetteer.match -> Gazetteer.search

- Renamed Matching.blocker -> Matching.fingerprinter

- Moved to autodoc for documentation

- Dropped threshold methods

- matchBlocks has been replaced by score, which takes pairs of records not blocks
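
When migrating code across the 2.0 naming change, a mechanical conversion from CamelCase to lowercase_with_underscore can help find candidate new names. The helper below is a hypothetical sketch, not part of the library. It does not cover methods that were renamed by meaning (such as `match` becoming `partition`), and whether a given method kept its mechanically converted name should be checked against the 2.0 documentation.

```python
import re


def camel_to_snake(name):
    """Convert a camelCase identifier to lowercase_with_underscore."""
    # Put an underscore before each capital that follows a lowercase letter
    # or digit, then lowercase the whole identifier.
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


# Method names taken from older entries in this changelog.
for old_name in ["writeSettings", "writeTraining", "consoleLabel"]:
    print(old_name, "->", camel_to_snake(old_name))
```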

# 1.10.0

- Dropped python 2.7 support

# 1.9.4

- Cleaned up block learning

# 1.9.3

- Improved performance of connected components algorithm with very large components

- Fixed pickling/unpickling bug of Index predicate classes

# 1.9.0

- Implemented a disagreement based active labeler to improve blocking recall

# 1.8.2

- Removed shelve-backed persistence of blocking data in favor of an improved in-memory implementation

# 1.8.0

- matchBlocks is not a generator; match is now optionally a generator. If the generator option is turned off for the Gazetteer, match is lazy

# 1.7.8

- Speed up blocking, on our way to 3-predicates

# 1.7.5

- Significantly reduced memory footprint during connected_components

# 1.7.3

- Significantly reduced memory footprint during scoreDuplicates

# 1.7.2

- Improper release

# 1.7.1

- TempShelve class that addresses various bugs related to cleaning up temporary shelves

# 1.7.0

- Added `target` argument to blocker and predicates for changing the behavior of the predicates for the target and source dataset if we are linking.

# 1.6.8

- Use file-backed blocking with dbm, which dramatically increases the size of data that can be handled without special programming
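
To illustrate the general idea of file-backed blocking, the sketch below stores a mapping from block key to record ids in a `dbm` file from the Python standard library, so the index lives on disk rather than in memory. The function names, the JSON encoding of the id lists, and the "first three characters" block key are all assumptions for illustration; the library's own storage layout is not described in this changelog.

```python
import dbm
import json
import os
import tempfile


def build_block_index(records, key_func, path):
    """Write block key -> list of record ids into a dbm file at `path`."""
    with dbm.open(path, "c") as db:
        for record_id, record in records.items():
            key = key_func(record).encode("utf-8")
            ids = json.loads(db[key]) if key in db else []
            ids.append(record_id)
            db[key] = json.dumps(ids).encode("utf-8")


def read_block(path, block_key):
    """Return the record ids stored under `block_key`, or an empty list."""
    key = block_key.encode("utf-8")
    with dbm.open(path, "r") as db:
        return json.loads(db[key]) if key in db else []


records = {
    1: {"name": "bob smith"},
    2: {"name": "bob jones"},
    3: {"name": "ann lee"},
}
with tempfile.TemporaryDirectory() as tmp:
    index_path = os.path.join(tmp, "blocks")
    build_block_index(records, lambda r: r["name"][:3], index_path)
    print(read_block(index_path, "bob"))  # [1, 2]
```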

# 1.6.7

- Reduce memory footprint of matching

# 1.6.0

- Simplify .train method

# 1.5.5

- Levenshtein search based index predicates thanks to @mattandahalfew
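
As a rough illustration of what a Levenshtein search does, the sketch below computes edit distance with the standard dynamic-programming recurrence and scans a list for values within a given distance of a query. This is a brute-force stand-in with hypothetical names; the data structure the library uses for its index is not described here.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def search(indexed_values, query, max_distance=1):
    """Return indexed values within `max_distance` edits of `query`."""
    return [v for v in indexed_values if levenshtein(v, query) <= max_distance]


print(search(["smith", "smyth", "jones"], "smith"))  # ['smith', 'smyth']
```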

# 1.5.0

- Simplified the sample API; this might be a breaking change for some

- the active learner interface is now more modular to allow for a different learner

- random sampling of pairs has been improved for both the linking case and the dedupe case, h/t to @MarkusShepherd

## 1.4.15

- frozendicts have finally been removed

- first N char predicates return their entire length if length is less than N, instead of nothing

- crossvalidation is skipped in active learning if using default rlr learner
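
The change to first N char predicates noted above can be pictured with two small functions. Both are hypothetical sketches written for this note (the names and the tuple return style are assumptions), contrasting the earlier "return nothing" behavior with the 1.4.15 behavior.

```python
def first_n_chars_old(value, n):
    """Earlier behavior: no block key when the value is shorter than n."""
    if len(value) < n:
        return ()
    return (value[:n],)


def first_n_chars(value, n):
    """1.4.15 behavior: a short value yields itself as the block key."""
    if not value:
        return ()
    # Slicing a string shorter than n simply returns the whole string.
    return (value[:n],)


print(first_n_chars_old("abc", 5))  # ()
print(first_n_chars("abc", 5))      # ('abc',)
```

With the newer behavior, short values such as abbreviations still land in a block and can be compared with each other.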

## 1.4.5

- Block indexes can now be persisted by using the index=True argument in the writeSettings method

## 1.4.1

- Now uses C version of double metaphone for speed

- Much faster compounding of blocks in block learning

## 1.4.0

- Block learning now tries to minimize the total number of comparisons, not just the comparisons of distinct records. This decouples block learning from classifier learning. This change requires new, different arguments to the train method.

## 1.3.8

- Console labeler now shows fields in the order they are defined in the data model. The labeler also reports the number of labeled examples

- `pud` (proportion of uncovered dupes) argument added to the `train` method. This deprecates the `uncovered_dupes` argument

## 1.3.0

- If we have enough training data, consider Compound predicates of length 3 in addition to predicates of length 2
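
A compound predicate can be thought of as a conjunction of simple predicates: two records share a compound block only if they agree on every part. The sketch below enumerates compounds of length 2 and 3 with `itertools.combinations`. The function names and the toy predicates are assumptions, and the threshold for "enough training data" is not given in this entry, so the maximum length is left as a parameter.

```python
from itertools import combinations


def compound_predicates(simple_predicates, max_length=3):
    """Enumerate all compounds of length 2 up to `max_length`."""
    compounds = []
    for length in range(2, max_length + 1):
        compounds.extend(combinations(simple_predicates, length))
    return compounds


def compound_key(record, predicates):
    """Join the keys of each simple predicate into one block key."""
    return ":".join(predicate(record) for predicate in predicates)


# Toy simple predicates, each returning a single string key.
def first_letter(record):
    return record["name"][:1]


def city(record):
    return record["city"]


def zip_prefix(record):
    return record["zip"][:3]


simple = [first_letter, city, zip_prefix]
print(len(compound_predicates(simple, max_length=2)))  # 3
print(len(compound_predicates(simple, max_length=3)))  # 4

record = {"name": "bob", "city": "chicago", "zip": "60614"}
print(compound_key(record, (first_letter, city, zip_prefix)))  # b:chicago:606
```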

## 1.1.1

- `None` is now treated as the missing data indicator. Older types of missing data indicators now trigger deprecation warnings
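
One way to prepare data for this convention is to normalize blank values to `None` before passing records on. The helper below is a hypothetical preprocessing sketch. It assumes that empty or whitespace-only strings are what should count as missing in your data; this entry does not name the older indicators.

```python
def normalize_missing(record):
    """Return a copy of `record` with blank string values replaced by None."""
    cleaned = {}
    for field, value in record.items():
        if isinstance(value, str) and not value.strip():
            cleaned[field] = None  # blank string treated as missing
        else:
            cleaned[field] = value
    return cleaned


print(normalize_missing({"name": "Bob", "address": "  ", "phone": ""}))
# {'name': 'Bob', 'address': None, 'phone': None}
```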

## 1.1.0

Features

- Handle FuzzyCategoricalType in datamodel

## 1.0.0

Features

- Speed up learning

- Parallelize sampling

- Optional [CRF Edit Distance](https://dedupe.readthedocs.io/en/latest/Variable-definition.html#optional-edit-distance)

## 0.8.0

Support for Python 3.4 added. Support for Python 2.6 dropped.

Features

- Windows OS supported

- train method has argument for not considering index predicates

- TfIDFNGram Index Predicate added (for shorter strings)

- SuffixArray Predicate

- Double Metaphone Predicates

- Predicates for numbers, OrderOfMagnitude, Round

- Set Predicate OrderOfCardinality

- Final, learned predicates list will now often be smaller without loss of coverage

- Variables refactored to support external extensions like https://github.com/datamade/dedupe-variable-address

- Categorical distance, regularized logistic regression, affine gap distance, canonicalization have been turned into separate libraries.

- Simplejson is now a dependency
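
Among the number predicates listed above, an order-of-magnitude predicate is easy to picture: numbers whose base-10 logarithm rounds to the same integer fall into the same block. The function below is a sketch of that idea under assumed names and return conventions, not the library's code.

```python
import math


def order_of_magnitude(value):
    """Block key from the rounded base-10 logarithm of a positive number."""
    if value > 0:
        return (str(int(round(math.log10(value)))),)
    # Zero and negative numbers have no real logarithm, so no key is produced.
    return ()


print(order_of_magnitude(1234))  # ('3',)
print(order_of_magnitude(7000))  # ('4',)
print(order_of_magnitude(0))     # ()
```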

## 0.7.5

Features

- Individual record cluster membership scores

- New predicates

- New Exists Variable Type

Bug Fixes

- Latlong predicate fixed

- Set TFIDF canopy working properly

## 0.7.4

Features

- Sampling methods now use blocked sampling

## 0.7.0

Version 0.7.0 is backwards compatible, except for the match method of Gazetteer class

Features

- new index, unindex, and match methods in Gazetteer matching. Useful for streaming matching

## 0.6.0

Version 0.6.0 is *not* backwards compatible.

Features :

- new Text, ShortString, and exact string types

- multiple variables can be defined on same field

- new Gazetteer linker for matching dirty records against a master list

- performance improvements, particularly in memory usage

- canonicalize function in dedupe.convenience for creating a canonical representation of a cluster of records

- tons of bugfixes

API breaks

- when initializing an ActiveMatching object, `variable_definition` replaces `field_definition` and is a list of dictionaries instead of a dictionary. See the documentation for details

- also when initializing a Matching object, `num_processes` has been replaced by `num_cores`, which now defaults to the number of CPUs on the machine

- when initializing a StaticMatching object, `settings_file` is now expected to be a file object not a string. The `readTraining`, `writeTraining`, `writeSettings` methods also all now expect file objects
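
The switch from a dictionary `field_definition` to a list-of-dictionaries `variable_definition` can be sketched as a small conversion. The shapes used below (a dict keyed by field name before, and a list of dicts carrying a `field` key after) are assumptions based on the description above, and the helper name is hypothetical; see the documentation for the exact format.

```python
def to_variable_definition(field_definition):
    """Convert a {field_name: spec} dict into a list of specs with a 'field' key."""
    return [dict(spec, field=name) for name, spec in field_definition.items()]


old_style = {
    "name": {"type": "String"},
    "address": {"type": "String"},
}
print(to_variable_definition(old_style))
```

A list also allows the same field to appear more than once, which fits the 0.6.0 feature that multiple variables can be defined on the same field.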

## 0.5

Version 0.5 is *not* backwards compatible.

Features :

- Special case code for linking two datasets that are individually unique

- Parallel processing using python standard library multiprocessing

- Much faster canopy creation using zope.index

- Asynchronous active learning methods

API breaks :

- `duplicateClusters` has been removed; it has been replaced by `match` and `matchBlocks`

- `goodThreshold` has been removed; it has been replaced by `threshold` and `thresholdBlocks`

- the meaning of `train` has changed. To train from training file use `readTraining`. To use console labeling, pass a dedupe instance to the `consoleLabel` function

- The convenience function `dataSample` has been removed. It has been replaced by the `sample` methods

- It is no longer necessary to pass `frozendicts` to `Matching` classes

- `blockingFunction` has been removed and replaced by the `blocker` method
