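readability: a library for measuring the readability of English text

The text readability formulas that readability implements were originally developed for English, so this post covers English text data only.

Documentation: https://pypi.org/project/readability/

Installation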

pip install readability
Looking in indexes: https://mirrors.aliyun.com/pypi/simple/
Collecting readability
  Downloading https://mirrors.aliyun.com/pypi/packages/26/70/6f8750066255d4d2b82b813dd2550e0bd2bee99d026d14088a7b977cd0fc/readability-0.3.1.tar.gz (34 kB)
Building wheels for collected packages: readability
  Building wheel for readability (setup.py) ... done
  Created wheel for readability: filename=readability-0.3.1-py3-none-any.whl size=35459 sha256=e920a8d6510bd1211df79a944ff03c94f2fea220ae4e5f430e930a52d75595ee
  Stored in directory: /Users/thunderhit/Library/Caches/pip/wheels/90/29/a7/726a69748065b8c306b4a935ac2c57e9bc492cb23f355c8e03
Successfully built readability
Installing collected packages: readability
Successfully installed readability-0.3.1

Quick start

import readability

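# Input must be pre-tokenized: tokens separated by spaces, one sentence per line.
text = ('This is an example sentence .\n'
        'Note that tokens are separated by spaces and sentences by newlines .\n')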
results = readability.getmeasures(text, lang='en')
results
OrderedDict([('readability grades',
              OrderedDict([('Kincaid', 7.442500000000003),
                           ('ARI', 5.825624999999999),
                           ('Coleman-Liau', 9.532550312500003),
                           ('FleschReadingEase', 55.95250000000002),
                           ('GunningFogIndex', 10.700000000000001),
                           ('LIX', 39.25),
                           ('SMOGIndex', 9.70820393249937),
                           ('RIX', 2.5),
                           ('DaleChallIndex', 9.954550000000001)])),
             ('sentence info',
              OrderedDict([('characters_per_word', 4.9375),
                           ('syll_per_word', 1.6875),
                           ('words_per_sentence', 8.0),
                           ('sentences_per_paragraph', 2.0),
                           ('type_token_ratio', 0.9375),
                           ('characters', 79),
                           ('syllables', 27),
                           ('words', 16),
                           ('wordtypes', 15),
                           ('sentences', 2),
                           ('paragraphs', 1),
                           ('long_words', 5),
                           ('complex_words', 3),
                           ('complex_words_dc', 6)])),
             ('word usage',
              OrderedDict([('tobeverb', 2),
                           ('auxverb', 0),
                           ('conjunction', 1),
                           ('pronoun', 2),
                           ('preposition', 2),
                           ('nominalization', 1)])),
             ('sentence beginnings',
              OrderedDict([('pronoun', 1),
                           ('interrogative', 0),
                           ('article', 0),
                           ('subordination', 0),
                           ('conjunction', 0),
                           ('preposition', 0)]))])

The returned information includes:

  • readability grades: readability metrics

  • sentence info: sentence-level statistics

  • word usage: word usage counts

  • sentence beginnings: how sentences begin
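When the four groups need to go into a single table row or CSV line, the nested structure can be flattened. The sketch below is an assumption-laden illustration: `flatten_measures` is a hypothetical helper, and `sample` is a hand-copied subset of the output above so the code runs without the library installed.

```python
from collections import OrderedDict

def flatten_measures(results):
    """Flatten {group: {name: value}} into {"group/name": value}."""
    flat = OrderedDict()
    for group, measures in results.items():
        for name, value in measures.items():
            # Prefix with the group name: 'pronoun' occurs in both
            # 'word usage' and 'sentence beginnings'.
            flat[f"{group}/{name}"] = value
    return flat

# Hand-copied subset of the result shown earlier (not a live library call).
sample = OrderedDict([
    ('readability grades', OrderedDict([('Kincaid', 7.442500000000003),
                                        ('LIX', 39.25)])),
    ('sentence info', OrderedDict([('words', 16), ('sentences', 2)])),
    ('word usage', OrderedDict([('pronoun', 2)])),
    ('sentence beginnings', OrderedDict([('pronoun', 1)])),
])

flat = flatten_measures(sample)
for key, value in flat.items():
    print(key, value)
```

Keeping the group name in the key avoids silently overwriting measures that share a name across groups.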

Readability metrics

results['readability grades']
OrderedDict([('Kincaid', 7.442500000000003),
             ('ARI', 5.825624999999999),
             ('Coleman-Liau', 9.532550312500003),
             ('FleschReadingEase', 55.95250000000002),
             ('GunningFogIndex', 10.700000000000001),
             ('LIX', 39.25),
             ('SMOGIndex', 9.70820393249937),
             ('RIX', 2.5),
             ('DaleChallIndex', 9.954550000000001)])
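To see where these grades come from, three of them can be recomputed from the 'sentence info' values shown earlier. This is a sketch that assumes the standard published coefficients for Flesch-Kincaid, Flesch Reading Ease, and ARI; the results agree with the numbers above, but the library's source code has not been checked here. The function names are illustrative, not library API.

```python
import math

# Values copied from the 'sentence info' group in the output above.
info = {
    'characters_per_word': 4.9375,
    'syll_per_word': 1.6875,
    'words_per_sentence': 8.0,
}

def flesch_kincaid_grade(words_per_sentence, syll_per_word):
    # Standard Flesch-Kincaid grade level formula.
    return 0.39 * words_per_sentence + 11.8 * syll_per_word - 15.59

def flesch_reading_ease(words_per_sentence, syll_per_word):
    # Standard Flesch Reading Ease formula (higher means easier).
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syll_per_word

def automated_readability_index(characters_per_word, words_per_sentence):
    # Standard ARI formula.
    return 4.71 * characters_per_word + 0.5 * words_per_sentence - 21.43

kincaid = flesch_kincaid_grade(info['words_per_sentence'], info['syll_per_word'])
ease = flesch_reading_ease(info['words_per_sentence'], info['syll_per_word'])
ari = automated_readability_index(info['characters_per_word'],
                                  info['words_per_sentence'])

print(round(kincaid, 4))  # 7.4425
print(round(ease, 4))     # 55.9525
print(round(ari, 4))      # 5.8256
```

Note that Kincaid and ARI are grade-level estimates (lower is easier), while FleschReadingEase runs the other way.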

The Kincaid readability metric

results['readability grades']['Kincaid']
7.442500000000003

Similarly, every other metric can be retrieved by key, dictionary-style.
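As a small illustration of key-based access, the sketch below pulls several metrics in a loop and reports a clearer error for a mistyped name. `get_grade` is a hypothetical helper, and `grades` is copied by hand from the 'readability grades' output above rather than produced by a live call.

```python
# Copied from the 'readability grades' output above (not a live library call).
grades = {
    'Kincaid': 7.442500000000003,
    'ARI': 5.825624999999999,
    'Coleman-Liau': 9.532550312500003,
    'FleschReadingEase': 55.95250000000002,
    'GunningFogIndex': 10.700000000000001,
    'LIX': 39.25,
    'SMOGIndex': 9.70820393249937,
    'RIX': 2.5,
    'DaleChallIndex': 9.954550000000001,
}

def get_grade(grades, name):
    """Look up one metric, listing the valid names if the key is wrong."""
    try:
        return grades[name]
    except KeyError:
        available = ', '.join(grades)
        raise KeyError(f"unknown metric {name!r}; available: {available}") from None

for name in ('ARI', 'LIX', 'SMOGIndex'):
    print(f"{name}: {get_grade(grades, name):.2f}")
```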

