[Python] Counting (Top-K) word frequencies with pandas `groupby` + `collections.Counter`

FeatureOverload

Categories: Pandas (data processing), Python. Tags: Python, code optimization, word frequency, data processing, Counter

Article link: https://blog.csdn.net/qq_29757283/article/details/118081052

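A few days ago I reviewed some code that counts word frequencies and suggested a few optimizations. I think they should be useful to people who are still fairly new to Python, so I am writing them down here.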

Overview

The submitted code

def word_frequency(data, top):
  """
      Generate the top-20 most frequent words
  """
  if data is None or data.empty:
    return None
  # ...some code...

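  # Count word frequencies.
  # Columns: '数据来源' = data source, '发生时间' = occurrence time,
  # '分词' = list of tokens per row, '词语' = word, '词频' = frequency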
  df_res = pd.DataFrame()
  for s in data['数据来源'].unique():
    temp = data[data['数据来源'] == s]
    res_temp = pd.DataFrame()
    for t in temp['发生时间'].unique():
      temp1 = temp[temp['发生时间'] == t]
      wordlist = []
      for i in temp1['分词']:
        wordlist.extend(i)

      wordatare = {}
      for i in wordlist:
        if i not in stopword_list:
          wordatare.setdefault(i, 0)
          wordatare[i] += 1

      res = pd.DataFrame(pd.Series(wordatare), columns=["词频"]).reset_index()
      res.rename(columns={'index': '词语'}, inplace=True)
      res['发生时间'] = t
      res['数据来源'] = s

      res = res.sort_values(by=['词频'], ascending=False)[:top]
      res_temp = res_temp.append(res)
    df_res = df_res.append(res_temp)

  df_res = df_res.reset_index(drop=True)
  return df_res
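As a side note on the inner loop above: the manual `setdefault` counting followed by sort-and-slice is what `collections.Counter` and its `most_common` method provide out of the box. The sketch below compares the two on a made-up token list (`wordlist` here is hypothetical sample data, not the article's input).

```python
from collections import Counter

# Hypothetical token list, for illustration only.
wordlist = ['b', 'a', 'b', 'c', 'a', 'b']

# Manual counting, in the style of the submitted code.
manual_counts = {}
for word in wordlist:
    manual_counts.setdefault(word, 0)
    manual_counts[word] += 1

# Counter performs the same counting in a single call.
counter = Counter(wordlist)

# most_common(k) returns the k (word, count) pairs with the highest counts,
# largest first, so no separate sort-and-slice step is needed.
top2 = counter.most_common(2)
print(top2)
```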

Some other code you may care about:

stopword_list = ['-',
                 '+',
                 ...,
                 '......',
                 ...,
                 '完成',
                 '使用',
                 ...]

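I suggested changing `stopword_list` to a `set`: the counting loop tests `i not in stopword_list` for every token, and membership in a set is a hash lookup rather than a linear scan of a list.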
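A minimal sketch of that change, assuming the stop words are plain strings. The short list and the `tokens` sample below are hypothetical stand-ins, since the real list is elided above.

```python
# Hypothetical stop words and tokens, for illustration only.
stopword_list = ['-', '+', '完成', '使用']
stopword_set = set(stopword_list)  # one-time conversion

tokens = ['pandas', '-', '使用', 'groupby', 'pandas']

# `in` on a set is a hash lookup; on a list it compares element by element.
kept = [word for word in tokens if word not in stopword_set]
print(kept)
```

The filtering code itself does not change; only the container type does, which is why this is a cheap fix to ask for in review.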

Final result


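from collections import Counter

import pandas as pd


def word_frequency(data, top):
  """Generate the `top` most frequent words for each (数据来源, 发生时间) group."""
  if data is None or data.empty:
    return None
  # ...some code...

  # Count word frequencies, one block per (data source, occurrence time) pair
  topk_dfs = []
  for _, df_block in data.groupby(['数据来源', '发生时间']):
    # Flatten the per-row token lists of this block into one list of words
    words = [word for word_list in df_block['分词'] for word in word_list]
    wordatare = Counter(filter(lambda word: word not in stopword_list, words))
    topk_df = pd.DataFrame(wordatare.most_common(top), columns=['词语', '词频'])
    topk_df['发生时间'] = df_block.iloc[0]['发生时间']
    topk_df['数据来源'] = df_block.iloc[0]['数据来源']

    topk_dfs.append(topk_df)

  # DataFrame.append was removed in pandas 2.0, so concatenate the blocks once at the end
  if not topk_dfs:
    return pd.DataFrame()
  word_cnt_df = pd.concat(topk_dfs, ignore_index=True)
  return word_cnt_df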
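To see the `groupby` + `Counter` pattern end to end, here is a small self-contained sketch on made-up data. Only the column names come from the code above; the rows, the one-element `stopword_set`, and the choice of top 1 are assumptions for illustration. It unpacks the group key directly instead of reading it back with `iloc[0]`, which is an optional variation rather than what the reviewed code does.

```python
from collections import Counter

import pandas as pd

# Hypothetical sample data; only the column names come from the article.
data = pd.DataFrame({
    '数据来源': ['A', 'A', 'A', 'B'],
    '发生时间': ['2021-06', '2021-06', '2021-07', '2021-06'],
    '分词': [['x', 'y', 'x'], ['x', 'z', '-'], ['y'], ['z', 'z', 'x']],
})
stopword_set = {'-'}

frames = []
# Each iteration yields the group key and the rows that share it.
for (source, time), df_block in data.groupby(['数据来源', '发生时间']):
    words = (w for word_list in df_block['分词'] for w in word_list)
    counts = Counter(w for w in words if w not in stopword_set)
    topk_df = pd.DataFrame(counts.most_common(1), columns=['词语', '词频'])
    topk_df['发生时间'] = time
    topk_df['数据来源'] = source
    frames.append(topk_df)

result = pd.concat(frames, ignore_index=True)
print(result)
```

With this input, the top word is `x` (3 occurrences) for source A in 2021-06, `y` (1) for source A in 2021-07, and `z` (2) for source B in 2021-06.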
