Writing a Batch Dataset Preprocessing Script in Python

PhoneixYANG

Article link: https://blog.csdn.net/weixin_47267405/article/details/124122952

Without further ado, here is the code first:

import json

datadir = 'temp.json'      # input: one JSON tweet per line
datatarget = 'test.json'   # output: two JSON lines per kept tweet
rating = 0                 # constant value written to the 'Rating' field
long_count = 0             # number of kept tweets longer than 100 characters
all_review = []            # texts already kept, used for deduplication

with open(datadir, 'r') as f1:
    for line in f1:
        review = json.loads(line)
        temp_text = [review['text']]
        # Skip text-only tweets (media == "[]"), duplicate texts, and videos
        if len(review['media']) != 2 and review['text'] not in all_review and review['media'].find('video') == -1:
            uid = str(review['user'])
            tid = str(review['tweet_id'])
            text = review['text']
            all_review.extend(temp_text)
            picture = review['media']
            if len(review['text']) > 100:
                long_count += 1
            i = 1
            l = []
            t_str = str(tid) + '_1'
            t_list = [{"_id": t_str}]
            l.extend(t_list)
            for ch in picture:  # count images: each comma separates two links
                if ch == ',':
                    i += 1
                    t_str = str(tid) + '_' + str(i)
                    t_list = [{"_id": t_str}]  # list[dict] holding the image ids
                    l.extend(t_list)

            temp_d = {'Rating': rating,
                      'UserId': uid,
                      'Text': text,
                      'Photos': l,
                      'tweet_id': tid,
            }
            pic_d = {
                'photo': picture
            }
            with open(datatarget, 'a') as f2:
                json.dump(temp_d, f2)
                f2.write('\n')
                json.dump(pic_d, f2)
                f2.write('\n')
    print(long_count)

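Next, the dataset format. The dataset is in JSON format: the source data are tweets scraped from Twitter and converted to JSON, one record per line. Here is one record: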

{"user": 1497469538526543873, "tweet_id": 1507470350900203522, "time": "2022-03-25 21:31:53", "text": "RT @Gadhwara27: #UkraineRussianWar #Ukraine #UkraineWar somewhere in Ukraine, UA soldiers captured Russian Grad. https://t.co/meJNTUeMZl", "media": "['http://pbs.twimg.com/ext_tw_video_thumb/1507397882306961408/pu/img/PyQVTaa_Z0UQ5n5p.jpg']"}

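As you can see, each record contains the user ID, the tweet ID, the timestamp, the text, and the image links. Note that the media field is a string holding a list literal, not a JSON array.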

What we need are the text, the user ID, the tweet ID, and the image links. A processed record looks like this:

{"Rating": 0, "UserId": "2738363938", "Text": "RT @Sputnik_Not: LEGO releases set recreating famous Ukraine war scene https://t.co/ZYJvLJgvYL", "Photos": [{"_id": "1507470361042067456_1"}], "tweet_id": "1507470361042067456"}
{"photo": "['http://pbs.twimg.com/media/FOnvSyQXMAgLk9T.jpg']"}

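This is basically the format I need. What remains is some text cleanup (removing @mentions and the like), downloading each image from its link and naming the file after its image ID, and finally adding the labels.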
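The remaining steps are not shown in the post, so the following is only a minimal sketch of what they might look like. The function names clean_text and plan_downloads, the cleanup rules (dropping a leading RT, @mentions, and URLs), and the .jpg file extension are all assumptions made for illustration.

```python
import ast
import re


def clean_text(text):
    """Strip a leading 'RT', @mentions, and URLs, then collapse whitespace.

    These cleanup rules are assumptions; the post only says "remove @ etc.".
    """
    text = re.sub(r"^RT\s+", "", text)
    text = re.sub(r"@\w+:?", "", text)
    text = re.sub(r"https?://\S+", "", text)
    return " ".join(text.split())


def plan_downloads(main_record, photo_record):
    """Pair each image URL with a file name built from its image id.

    The 'photo' field is a string holding a Python list literal,
    so ast.literal_eval turns it back into a real list of URLs.
    The '.jpg' extension is an assumption based on the sample links.
    """
    urls = ast.literal_eval(photo_record["photo"])
    ids = [photo["_id"] for photo in main_record["Photos"]]
    return [(url, image_id + ".jpg") for url, image_id in zip(urls, ids)]


if __name__ == "__main__":
    main_record = {"Photos": [{"_id": "1507470361042067456_1"}]}
    photo_record = {"photo": "['http://pbs.twimg.com/media/FOnvSyQXMAgLk9T.jpg']"}
    for url, file_name in plan_downloads(main_record, photo_record):
        print(url, "->", file_name)
```

Each (url, file_name) pair could then be handed to a downloader, for example urllib.request.urlretrieve(url, file_name) from the standard library.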

Now let's look closely at the important parts of the code:

with open(datadir, 'r') as f1:
    for line in f1:
        review = json.loads(line)

First, the input is read line by line, and each line is parsed as JSON.
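The same line-by-line reading can be wrapped in a small generator. This is a sketch rather than the script's actual code: the name iter_records, the explicit UTF-8 encoding, the skipping of blank lines, and the error message format are assumptions.

```python
import json


def iter_records(path):
    """Yield one dict per non-empty line of a file with one JSON object per line."""
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines instead of failing on them
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report which line was malformed to make debugging easier
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc})") from exc
```

Because it is a generator, only one record is held in memory at a time, which matters when the scraped file is large.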

if len(review['media']) != 2 and review['text'] not in all_review and review['media'].find('video') == -1:  # drop duplicates, videos, and text-only tweets

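Then the records are filtered according to my needs. The length check drops tweets whose media field is the empty list string "[]" (two characters), the all_review check drops tweets whose text has already been kept, and the find('video') check drops tweets whose media link points to a video thumbnail.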
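The three conditions can also be expressed on the parsed list of links rather than on the raw string. The sketch below assumes that the media field is always a Python list literal as in the sample record; the names has_usable_photos and keep_record are hypothetical, and a set replaces the list so the duplicate check does not slow down as the data grows.

```python
import ast


def has_usable_photos(media_field):
    """True if the media string lists at least one link and none mentions 'video'."""
    urls = ast.literal_eval(media_field)  # "['a', 'b']" -> ['a', 'b']
    return bool(urls) and not any("video" in url for url in urls)


def keep_record(review, seen_texts):
    """Apply the same three filters as the script; remember texts that are kept."""
    text = review["text"]
    if text in seen_texts or not has_usable_photos(review["media"]):
        return False
    seen_texts.add(text)
    return True
```

As in the original condition, a text is recorded as seen only when its tweet is actually kept, so a rejected video tweet does not block a later photo tweet with the same text.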

uid = str(review['user'])
tid = str(review['tweet_id'])
text = review['text']
picture = review['media']

Extract the necessary fields.

t_str = str(tid) + '_1'
t_list = [{"_id": t_str}]
l.extend(t_list)
for ch in picture:  # count images: each comma separates two links
    if ch == ',':
        i += 1
        t_str = str(tid) + '_' + str(i)
        t_list = [{"_id": t_str}]  # list[dict] holding the image ids
        l.extend(t_list)

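Image IDs are generated automatically from the number of images and the tid (the tid is never repeated, so it can serve as a primary key). The number of images is taken to be the number of commas in the media string plus one.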
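Counting commas works for the sample data but would miscount if a link ever contained a comma. A possible alternative, sketched under the assumption that the media field is a Python list literal, is to parse the string and number the links directly; make_photo_ids is a hypothetical helper name.

```python
import ast


def make_photo_ids(tweet_id, media_field):
    """Build [{"_id": "<tweet_id>_1"}, {"_id": "<tweet_id>_2"}, ...], one per link."""
    urls = ast.literal_eval(media_field)
    return [{"_id": f"{tweet_id}_{i}"} for i in range(1, len(urls) + 1)]


if __name__ == "__main__":
    media = "['http://pbs.twimg.com/media/FOnvSyQXMAgLk9T.jpg']"
    print(make_photo_ids("1507470361042067456", media))
```

One difference from the loop above: an empty media list yields no IDs here, whereas the comma-counting version always emits at least one. Since text-only tweets are filtered out beforehand, this does not change the output for kept records.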

temp_d = {'Rating': rating,
          'UserId': uid,
          'Text': text,
          'Photos': l,
          'tweet_id': tid,
}
pic_d = {
    'photo': picture
}

Pack the fields into dicts as required.

with open(datatarget, 'a') as f2:
    json.dump(temp_d, f2)
    f2.write('\n')
    json.dump(pic_d, f2)
    f2.write('\n')

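Write the formatted output, and we're done. Each kept tweet produces two lines: the main record, followed by the record holding its image links.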
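Because the script opens the output in append mode, running it twice on the same input doubles the contents of the output file. One possible variation, shown here only as a sketch, opens the file once in write mode and adds a reader for the two-line layout. The names write_pairs and read_pairs are hypothetical, and switching from 'a' to 'w' is a deliberate change from the original behavior.

```python
import json


def write_pairs(path, pairs):
    """Write each (main_record, photo_record) pair as two consecutive JSON lines."""
    # 'w' overwrites any earlier output, unlike the 'a' mode used in the script
    with open(path, "w", encoding="utf-8") as f:
        for main_record, photo_record in pairs:
            f.write(json.dumps(main_record) + "\n")
            f.write(json.dumps(photo_record) + "\n")


def read_pairs(path):
    """Yield (main_record, photo_record) tuples from a file written as above."""
    with open(path, "r", encoding="utf-8") as f:
        lines = [line for line in f if line.strip()]
    # Lines 0 and 1 form the first pair, lines 2 and 3 the second, and so on
    for i in range(0, len(lines) - 1, 2):
        yield json.loads(lines[i]), json.loads(lines[i + 1])
```

Reading the pairs back this way is convenient for the later steps, since each main record arrives together with the links of its images.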
