01. Python: counting the length of a sentence

拙小拙

Last modified 2023-05-05 21:29:31


First published 2021-01-17 12:29:22

Article link: https://blog.csdn.net/qq_35182128/article/details/112733703


Overall idea: split the string into words with split(), use set() to drop duplicates, and count the result with len().

Code:

Counting the length of an English sentence (with or without removing duplicate words):

# Assume the English sentence is t
t = ('After updating from 2.0.40 to 2.0.42, all POST-request to the cgi-bin are '
     'broken, and return the script source-code! GET-request to the same scripts '
     'function normal. '
     'This is not a config issue, worked up to 2.0.40, and works for GET in 2.0.42')

# split() with no arguments splits on whitespace and returns a list; len() gives the word count
len(t.split())

# Sentence length with duplicate words removed; set() discards repeated values
len(set(t.split()))
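
Note that split() only separates on whitespace, so punctuation stays attached to the words: '2.0.42,' and '2.0.42' are counted as two different words, as are 'This' and 'this'. Below is a minimal sketch of one way to normalise tokens before counting, assuming it is acceptable to strip leading and trailing punctuation and to ignore case. The helper name count_words and the sample sentence are hypothetical and not part of the original post.

```python
import string

def count_words(text, unique=False):
    """Count whitespace-separated words after light normalisation."""
    words = []
    for token in text.split():
        # Strip punctuation from both ends only, so '2.0.42,' -> '2.0.42'
        # and 'source-code!' -> 'source-code'; inner characters are kept.
        word = token.strip(string.punctuation).lower()
        if word:  # skip tokens that were pure punctuation
            words.append(word)
    return len(set(words)) if unique else len(words)

# Illustrative sentence (an assumption, not the author's data)
sample = 'Works in 2.0.40, broken in 2.0.42, and fixed after 2.0.42'
print(count_words(sample))               # 10
print(count_words(sample, unique=True))  # 8
print(len(set(sample.split())))          # 9, because '2.0.42,' != '2.0.42'
```

Whether this normalisation is wanted depends on the task; for a rough length estimate the plain split() version above is usually enough.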

Counting the length of a Chinese sentence:

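import jieba

# Assume the Chinese text is t_c (the full text is elided here)
t_c = '1中文语句。。。。。。。。。。。。省略。。。。。。'


# Segment with jieba; cut() returns a generator, so store it as a list for reuse
words = list(jieba.cut(t_c))
# Join the segments with a single space
res = ' '.join(words)

# Character count, including the joining spaces
len(res)  # the author's full text gives 242, matching Word's count (per the screenshot in the original post)
# To count without the joining spaces, join with an empty string instead: res = ''.join(words)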
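
If only a character count is needed, segmentation is optional, since the count can be taken from the string directly. The sketch below uses only the standard library and counts all characters, characters excluding whitespace, and Chinese ideographs only. The sample string and the helper names are illustrative, and the ideograph test assumes the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) covers the text at hand.

```python
def count_non_space(text):
    """Number of characters that are not whitespace."""
    return sum(1 for ch in text if not ch.isspace())

def count_cjk(text):
    """Number of characters in the basic CJK Unified Ideographs block."""
    return sum(1 for ch in text if '\u4e00' <= ch <= '\u9fff')

# Illustrative mixed Chinese/English sentence (an assumption, not the author's text)
sample = '我喜欢 Python，也喜欢机器学习。'
print(len(sample))              # 19: every character, including the space
print(count_non_space(sample))  # 18: the single space is excluded
print(count_cjk(sample))        # 10: Latin letters and punctuation are excluded
```

Which of these figures corresponds to a word processor's count depends on the option chosen there (with or without spaces), so it is worth checking against a short known string first.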

Counting the text length of each row in a CSV file:

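import pandas as pd

data = pd.read_csv('G:/Pycharm/key/dataset/GCC.csv')  # path to my file; replace with your own
data[['Bug ID','Summary']]  # select two columns: the ID and the text

# Define a function that returns the length, i.e. the code from the first part wrapped as a function
# (it counts unique words; drop set() to count all words)
def getLen(t):
    t = str(t)  # missing values (NaN) become the string 'nan'
    return len(set(t.split()))

# Store the computed lengths in a new column named Len
data['Len'] = data.Summary.apply(getLen)  # apply() calls the function on every value; see the pandas documentation for details

# Inspect the result
data[['Bug ID','Summary','Len']].head()
# Or write it to a CSV file
data.to_csv('./res.csv')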
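
For small files, the same per-row count can also be done with the standard-library csv module instead of pandas. This is a minimal sketch, assuming a CSV with 'Bug ID' and 'Summary' columns as above. The in-memory sample rows are invented for illustration and do not come from the GCC dataset, and get_len is simply a renamed copy of the function above.

```python
import csv
import io

def get_len(text):
    """Number of unique whitespace-separated words in text."""
    return len(set(str(text).split()))

# Stand-in for a real file; with a file on disk, use open(path, newline='') instead
sample_csv = io.StringIO(
    'Bug ID,Summary\n'
    '1,compiler crashes on empty input\n'
    '2,the build fails when the cache is cold\n'
)

rows = []
for row in csv.DictReader(sample_csv):
    row['Len'] = get_len(row['Summary'])  # add the computed length to each row
    rows.append(row)

# Write the rows, including the new Len column, back out as CSV text
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=['Bug ID', 'Summary', 'Len'])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

The second sample row contains 'the' twice, so its Len is 7 rather than 8, which shows the effect of set() in the length function.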