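Fuzzy keyword extraction for Chinese text in Python with flashtext: FlashText. This module can be used to replace keywords in sentences or extract keywords from sentences. ...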

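FlashText is a Python library for quickly replacing or extracting keywords in sentences. It is based on the FlashText algorithm, supports case-sensitive matching, and can report the span of each keyword found. Once installed, it is used through the KeywordProcessor class to add and remove keywords and to extract and replace them.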

FlashText

This module can be used to replace keywords in sentences or extract keywords from sentences. It is based on the FlashText algorithm.
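To give a feel for the general idea, here is a minimal sketch of trie-based keyword matching using only the standard library. It is not the library's implementation: the names build_trie, extract and the END marker are hypothetical, and the matching rules (lowercasing, longest match, ASCII word characters) are simplifying assumptions.

```python
import string

# Assumption: word characters are ASCII letters, digits and underscore.
WORD_CHARS = set(string.ascii_letters + string.digits + "_")
END = "_end_"  # hypothetical marker key meaning "a keyword ends here"


def build_trie(keywords):
    """Build a character trie from {keyword: clean_name}, lowercased."""
    root = {}
    for keyword, clean_name in keywords.items():
        node = root
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node[END] = clean_name
    return root


def extract(sentence, trie):
    """Return clean names of the longest keywords found at word boundaries."""
    text = sentence.lower()
    found, i, n = [], 0, len(text)
    while i < n:
        # Only start matching at the beginning of a word.
        if i > 0 and text[i - 1] in WORD_CHARS:
            i += 1
            continue
        node, j, best = trie, i, None
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            # Accept a match only if it ends at a word boundary.
            if END in node and (j == n or text[j] not in WORD_CHARS):
                best = (node[END], j)
        if best:
            found.append(best[0])
            i = best[1]  # continue scanning after the matched keyword
        else:
            i += 1
    return found


trie = build_trie({"Big Apple": "New York", "Bay Area": "Bay Area"})
print(extract("I love Big Apple and Bay Area.", trie))
```

The sentence is walked once from left to right while following the trie, so the work per character does not involve looping over the whole keyword list.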

Installation

$ pip install flashtext

API doc

Documentation can be found at FlashText Read the Docs.

Usage

Extract keywords

>>> from flashtext import KeywordProcessor

>>> keyword_processor = KeywordProcessor()

>>> # keyword_processor.add_keyword(<unclean name>, <clean name>)

>>> keyword_processor.add_keyword('Big Apple', 'New York')

>>> keyword_processor.add_keyword('Bay Area')

>>> keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area.')

>>> keywords_found

>>> # ['New York', 'Bay Area']

Replace keywords

>>> keyword_processor.add_keyword('New Delhi', 'NCR region')

>>> new_sentence = keyword_processor.replace_keywords('I love Big Apple and new delhi.')

>>> new_sentence

>>> # 'I love New York and NCR region.'

Case Sensitive example

>>> from flashtext import KeywordProcessor

>>> keyword_processor = KeywordProcessor(case_sensitive=True)

>>> keyword_processor.add_keyword('Big Apple', 'New York')

>>> keyword_processor.add_keyword('Bay Area')

>>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.')

>>> keywords_found

>>> # ['Bay Area']

Span of keywords extracted

>>> from flashtext import KeywordProcessor

>>> keyword_processor = KeywordProcessor()

>>> keyword_processor.add_keyword('Big Apple', 'New York')

>>> keyword_processor.add_keyword('Bay Area')

>>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.', span_info=True)

>>> keywords_found

>>> # [('New York', 7, 16), ('Bay Area', 21, 29)]
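The span values can be read as ordinary Python slice indices into the original sentence (start inclusive, end exclusive). The short check below uses only the output shown above and does not need flashtext; the variable names are just for illustration.

```python
sentence = 'I love big Apple and Bay Area.'

# Output copied from the span_info=True example above.
keywords_found = [('New York', 7, 16), ('Bay Area', 21, 29)]

matched = []
for clean_name, start, end in keywords_found:
    # Slice the original sentence to recover the text that was matched.
    matched.append((clean_name, sentence[start:end]))

print(matched)
```

This is handy when the clean name differs from what was actually written in the text, as with 'New York' and 'big Apple' here.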

Get Extra information with keywords extracted

>>> from flashtext import KeywordProcessor

>>> kp = KeywordProcessor()

>>> kp.add_keyword('Taj Mahal', ('Monument', 'Taj Mahal'))

>>> kp.add_keyword('Delhi', ('Location', 'Delhi'))

>>> kp.extract_keywords('Taj Mahal is in Delhi.')

>>> # [('Monument', 'Taj Mahal'), ('Location', 'Delhi')]

>>> # NOTE: replace_keywords feature won't work with this.
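Because each result is a plain tuple, it can be post-processed with ordinary Python. The sketch below groups the output shown above by its first element; the (type, name) layout is simply the convention chosen in this example, and by_type is a hypothetical name.

```python
from collections import defaultdict

# Output copied from the extract_keywords call above.
results = [('Monument', 'Taj Mahal'), ('Location', 'Delhi')]

by_type = defaultdict(list)
for entity_type, name in results:
    by_type[entity_type].append(name)

print(dict(by_type))
```

The note about replace_keywords presumably follows from the same fact: a replacement has to be a string that can be spliced back into the sentence, and a tuple is not.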

No clean name for Keywords

>>> from flashtext import KeywordProcessor

>>> keyword_processor = KeywordProcessor()

>>> keyword_processor.add_keyword('Big Apple')

>>> keyword_processor.add_keyword('Bay Area')

>>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.')

>>> keywords_found

>>> # ['Big Apple', 'Bay Area']

Add Multiple Keywords simultaneously

>>> from flashtext import KeywordProcessor

>>> keyword_processor = KeywordProcessor()

>>> keyword_dict = {
...     "java": ["java_2e", "java programing"],
...     "product management": ["PM", "product manager"]
... }

>>> # {'clean_name': ['list of unclean names']}

>>> keyword_processor.add_keywords_from_dict(keyword_dict)

>>> # Or add keywords from a list:

>>> keyword_processor.add_keywords_from_list(["java", "python"])

>>> keyword_processor.extract_keywords('I am a product manager for a java_2e platform')

>>> # output ['product management', 'java']

To Remove keywords

>>> from flashtext import KeywordProcessor

>>> keyword_processor = KeywordProcessor()

>>> keyword_dict = {
...     "java": ["java_2e", "java programing"],
...     "product management": ["PM", "product manager"]
... }

>>> keyword_processor.add_keywords_from_dict(keyword_dict)

>>> print(keyword_processor.extract_keywords('I am a product manager for a java_2e platform'))

>>> # output ['product management', 'java']

>>> keyword_processor.remove_keyword('java_2e')

>>> # you can also remove keywords using a list or a dictionary

>>> keyword_processor.remove_keywords_from_dict({"product management": ["PM"]})

>>> keyword_processor.remove_keywords_from_list(["java programing"])

>>> keyword_processor.extract_keywords('I am a product manager for a java_2e platform')

>>> # output ['product management']

To check Number of terms in KeywordProcessor

>>> from flashtext import KeywordProcessor

>>> keyword_processor = KeywordProcessor()

>>> keyword_dict = {
...     "java": ["java_2e", "java programing"],
...     "product management": ["PM", "product manager"]
... }

>>> keyword_processor.add_keywords_from_dict(keyword_dict)

>>> print(len(keyword_processor))

>>> # output 4
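The count is 4 rather than 2 because each unclean name is a separate term, while the two clean names are only the values they map to. A rough stdlib-only way to see this is to invert the dictionary; term_to_clean is a hypothetical name, and the lowercasing mirrors the default case-insensitive mode as an assumption.

```python
keyword_dict = {
    "java": ["java_2e", "java programing"],
    "product management": ["PM", "product manager"],
}

# Invert {'clean_name': [unclean names]} into {unclean name: clean_name}.
term_to_clean = {
    unclean.lower(): clean
    for clean, unclean_names in keyword_dict.items()
    for unclean in unclean_names
}

print(len(term_to_clean))
print(term_to_clean["pm"])
```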

To check if term is present in KeywordProcessor

>>> from flashtext import KeywordProcessor

>>> keyword_processor = KeywordProcessor()

>>> keyword_processor.add_keyword('j2ee', 'Java')

>>> 'j2ee' in keyword_processor

>>> # output: True

>>> keyword_processor.get_keyword('j2ee')

>>> # output: Java

>>> keyword_processor['colour'] = 'color'

>>> keyword_processor['colour']

>>> # output: color

Get all keywords in dictionary

>>> from flashtext import KeywordProcessor

>>> keyword_processor = KeywordProcessor()

>>> keyword_processor.add_keyword('j2ee', 'Java')

>>> keyword_processor.add_keyword('colour', 'color')

>>> keyword_processor.get_all_keywords()

>>> # output: {'colour': 'color', 'j2ee': 'Java'}

When detecting word boundaries, any character outside \w, that is [A-Za-z0-9_], is currently treated as a word boundary.

To set or add characters as part of word characters

>>> from flashtext import KeywordProcessor

>>> keyword_processor = KeywordProcessor()

>>> keyword_processor.add_keyword('Big Apple')

>>> print(keyword_processor.extract_keywords('I love Big Apple/Bay Area.'))

>>> # ['Big Apple']

>>> keyword_processor.add_non_word_boundary('/')

>>> print(keyword_processor.extract_keywords('I love Big Apple/Bay Area.'))

>>> # []
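The change in output can be reasoned about without the library. In the sketch below, ends_on_boundary is a hypothetical helper and word_chars is an assumed stand-in for the set of word characters described above; adding '/' to it plays the role of add_non_word_boundary('/').

```python
import string

# Assumed default word characters: [A-Za-z0-9_]
word_chars = set(string.ascii_letters + string.digits + "_")


def ends_on_boundary(sentence, end, word_chars):
    """True if the character at position `end` is not a word character."""
    return end == len(sentence) or sentence[end] not in word_chars


sentence = 'I love Big Apple/Bay Area.'
end = sentence.index('Big Apple') + len('Big Apple')  # index of '/'

before = ends_on_boundary(sentence, end, word_chars)
word_chars.add('/')  # '/' now counts as part of a word
after = ends_on_boundary(sentence, end, word_chars)

print(before, after)
```

Before the change 'Big Apple' is followed by a boundary and can match; afterwards 'Apple/Bay' behaves like one longer word, so the keyword no longer ends on a boundary.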

Test

$ git clone https://github.com/vi3k6i5/flashtext

$ cd flashtext

$ pip install pytest

$ python setup.py test

Build Docs

$ git clone https://github.com/vi3k6i5/flashtext

$ cd flashtext/docs

$ pip install sphinx

$ make html

$ # open _build/html/index.html in browser to view it locally

Why not Regex?

Time taken by FlashText to find terms in comparison to Regex.

Time taken by FlashText to replace terms in comparison to Regex.

Link to code for benchmarking the Find Feature and Replace Feature.
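The benchmark code itself is only linked, not shown here, so the following is merely a guess at what a regex-based equivalent of the extract and replace examples might look like. It uses only the standard re module; regex_extract and regex_replace are hypothetical helper names, and the keyword mapping is taken from the usage examples.

```python
import re

# Lowercased keyword -> clean name, as in the usage examples.
keywords = {"big apple": "New York", "bay area": "Bay Area"}

# One alternation over all terms, longest first, anchored at word boundaries.
pattern = re.compile(
    r"\b(?:"
    + "|".join(re.escape(k) for k in sorted(keywords, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def regex_extract(sentence):
    """Return the clean names of all keywords found in the sentence."""
    return [keywords[m.group(0).lower()] for m in pattern.finditer(sentence)]


def regex_replace(sentence):
    """Replace every keyword in the sentence with its clean name."""
    return pattern.sub(lambda m: keywords[m.group(0).lower()], sentence)


print(regex_extract('I love Big Apple and Bay Area.'))
print(regex_replace('I love big apple.'))
```

Note that the pattern contains every term, so it has to be rebuilt whenever the keyword list changes.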

The idea for this library came from the following StackOverflow question.

Citation

The original paper on the FlashText algorithm:

@ARTICLE{2017arXiv171100046S,
  author = {{Singh}, V.},
  title = "{Replace or Retrieve Keywords In Documents at Scale}",
  journal = {ArXiv e-prints},
  archivePrefix = "arXiv",
  eprint = {1711.00046},
  primaryClass = "cs.DS",
  keywords = {Computer Science - Data Structures and Algorithms},
  year = 2017,
  month = oct,
  adsurl = {http://adsabs.harvard.edu/abs/2017arXiv171100046S},
  adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}

The article published on Medium freeCodeCamp.

Contribute

License

The project is licensed under the MIT license.
