Reading a TXT data file containing multiple dictionaries with pandas

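This article shows how to use pandas to read a TXT file of movie data that was scraped by a crawler and contains multiple dictionaries, and how to convert it into a DataFrame suitable for statistical analysis, so that later data processing and visualization become easier.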

The IMDB movie data file produced by the crawler is a TXT file with one dictionary per line, as follows:

{"movie_id": 111161, "movie_name": "The Shawshank Redemption", "year": 1994, "movie_link": "/title/tt0111161/", "movie_rate": 9.222352298932535}
{"movie_id": 68646, "movie_name": "The Godfather", "year": 1972, "movie_link": "/title/tt0068646/", "movie_rate": 9.149459830968244}
{"movie_id": 71562, "movie_name": "The Godfather: Part II", "year": 1974, "movie_link": "/title/tt0071562/", "movie_rate": 8.982003465767905}
{"movie_id": 468569, "movie_name": "The Dark Knight", "year": 2008, "movie_link": "/title/tt0468569/", "movie_rate": 8.969818219072174}
{"movie_id": 50083, "movie_name": "12 Angry Men", "year": 1957, "movie_link": "/title/tt0050083/", "movie_rate": 8.92419107925773}
{"movie_id": 108052, "movie_name": "Schindler's List", "year": 1993, "movie_link": "/title/tt0108052/", "movie_rate": 8.903202955255063}
{"movie_id": 167260, "movie_name": "The Lord of the Rings: The Return of the King", "year": 2003, "movie_link": "/title/tt0167260/", "movie_rate": 8.880342018614696}
{"movie_id": 110912, "movie_name": "Pulp Fiction", "year": 1994, "movie_link": "/title/tt0110912/", "movie_rate": 8.85114611887013}
{"movie_id": 60196, "movie_name": "Il buono, il brutto, il cattivo", "year": 1966, "movie_link": "/title/tt0060196/", "movie_rate": 8.801450831661567}
……

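#!/usr/bin/python
# -*- coding: utf-8 -*-

# Read directly with pandas, the file looks like this
import pandas as pd
pd.set_option('display.max_colwidth', 150)  # show up to 150 characters per cell (default is 50) so long values are not cut off

data = pd.read_csv("movie12.txt", sep="\t", header=None)  # returns a DataFrame; header=None means the file has no header row
data.head()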
0
0 {"movie_id": 111161, "movie_name": "The Shawshank Redemption", "year": 1994, "movie_link": "/title/tt0111161/", "movie_rate": 9.222352298932535}
1 {"movie_id": 68646, "movie_name": "The Godfather", "year": 1972, "movie_link": "/title/tt0068646/", "movie_rate": 9.149459830968244}
2 {"movie_id": 71562, "movie_name": "The Godfather: Part II", "year": 1974, "movie_link": "/title/tt0071562/", "movie_rate": 8.982003465767905}
3 {"movie_id": 468569, "movie_name": "The Dark Knight", "year": 2008, "movie_link": "/title/tt0468569/", "movie_rate": 8.969818219072174}
4 {"movie_id": 50083, "movie_name": "12 Angry Men", "year": 1957, "movie_link": "/title/tt0050083/", "movie_rate": 8.92419107925773}
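
Each cell in that single column is still a plain string, not a dictionary. As a minimal sketch of how those strings could be turned into columns, the example below parses them with json.loads. The two-line sample and the in-memory buffer are stand-ins for movie12.txt, and passing quoting=csv.QUOTE_NONE is an assumption on my part, added so that pandas leaves the double quotes alone.

```python
import csv
import io
import json

import pandas as pd

# Two sample records standing in for movie12.txt (assumed: one JSON object per line)
sample = io.StringIO(
    '{"movie_id": 111161, "movie_name": "The Shawshank Redemption", "year": 1994}\n'
    '{"movie_id": 68646, "movie_name": "The Godfather", "year": 1972}\n'
)

# sep="\t" keeps each line in one column; QUOTE_NONE leaves the double quotes untouched
raw = pd.read_csv(sample, sep="\t", header=None, quoting=csv.QUOTE_NONE)

# Every cell is a str, so it has to be parsed before it behaves like a dict
records = raw[0].apply(json.loads).tolist()
table = pd.DataFrame(records)
```

With the real file, the buffer would be replaced by the path "movie12.txt".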

The format we want looks like this, which makes the next step of statistical analysis in Python straightforward:

data = pd.read_csv("movie12.csv")
data.head()
Unnamed: 0 movie_id movie_name year movie_link movie_rate
0 0 111161 The Shawshank Redemption 1994 /title/tt0111161/ 9.222352
1 1 68646 The Godfather 1972 /title/tt0068646/ 9.149460
2 2 71562 The Godfather: Part II 1974 /title/tt0071562/ 8.982003
3 3 468569 The Dark Knight 2008 /title/tt0468569/ 8.969818
4 4 50083 12 Angry Men 1957 /title/tt0050083/ 8.924191
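
The Unnamed: 0 column in this output is most likely the row index that was written into the CSV and then read back as an ordinary column. A small sketch of two ways to avoid it follows; the in-memory buffer and the two sample rows are illustrative only and replace a real file.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {
        "movie_id": [111161, 68646],
        "movie_name": ["The Shawshank Redemption", "The Godfather"],
    }
)

# Option 1: do not write the row index at all
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
no_index = pd.read_csv(buf)

# Option 2: write the index, then tell read_csv that the first column is the index
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf, index_col=0)
```

Either way, the frame that is read back contains only the movie_id and movie_name columns.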

# Here is the code

#!/usr/bin/python
# -*- coding: utf-8 -*-

import json
import pandas as pd

# Each line of the file is one JSON object, so parse it line by line.
# json.loads keeps apostrophes in titles such as "Schindler's List" intact.
with open("movie12.txt", "r", encoding="utf-8") as f:  # open the file
    list_dict = [json.loads(line) for line in f if line.strip()]  # list of dicts: [{'Key1': 'Value1_1', 'Key2': 'Value2_1', ...}, {}, {}, {}]
df = pd.DataFrame(list_dict)
df.to_csv("movie12out.csv", encoding="utf-8-sig")
df.head()
movie_id movie_name year movie_link movie_rate
0 111161 The Shawshank Redemption 1994 /title/tt0111161/ 9.222352
1 68646 The Godfather 1972 /title/tt0068646/ 9.149460
2 71562 The Godfather: Part II 1974 /title/tt0071562/ 8.982003
3 468569 The Dark Knight 2008 /title/tt0468569/ 8.969818
4 50083 12 Angry Men 1957 /title/tt0050083/ 8.924191
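
Since every line of the file is a standalone JSON object, pandas can also read it in one call. This is an alternative sketch rather than the method used in this post: pd.read_json with lines=True treats each line as one record. The in-memory sample is a stand-in for movie12.txt.

```python
import io

import pandas as pd

# Two sample records standing in for movie12.txt
sample = io.StringIO(
    '{"movie_id": 108052, "movie_name": "Schindler\'s List", "year": 1993}\n'
    '{"movie_id": 110912, "movie_name": "Pulp Fiction", "year": 1994}\n'
)

# lines=True tells pandas that each line is a separate JSON object
df = pd.read_json(sample, lines=True)
```

With the real file, the call would be pd.read_json("movie12.txt", lines=True).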

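Reading a dictionary-like TXT file with pandas

Sometimes the data file we get is a TXT file in which all column names and values are laid out vertically, one pair per line.

It resembles the dictionary format: each key is separated from its value by ":\t", but the {} and "" are missing.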

movie_id: 111161
movie_name: The Shawshank Redemption
year: 1994
movie_link: /title/tt0111161/
movie_rate: 9.222352298932535}
movie_id: 68646
movie_name: The Godfather
year: 1972
movie_link: /title/tt0068646/
movie_rate: 9.149459830968244}
movie_id: 71562
movie_name: The Godfather: Part II
year: 1974
movie_link: /title/tt0071562/
movie_rate: 8.982003465767905}
movie_id: 468569
movie_name: The Dark Knight
year: 2008
movie_link: /title/tt0468569/
movie_rate: 8.969818219072174}
movie_id: 50083
movie_name: 12 Angry Men
year: 1957
movie_link: /title/tt0050083/
movie_rate: 8.92419107925773}
……

#!/usr/bin/python
# -*- coding: utf-8 -*-

# Read with pandas, the file looks like this:
import pandas as pd
data = pd.read_csv("movie13.txt", sep="\t", header=None)  # returns a DataFrame; header=None means the file has no header row
#data = pd.read_clipboard(sep="\t", header=None)
data.head()
0
0 movie_id: 111161
1 movie_name: The Shawshank Redemption
2 year: 1994
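
One possible way to turn this vertical layout into a table is sketched below. It rests on several assumptions of mine: a new record starts whenever a key repeats, keys and values are split at the first colon, and the trailing "}" seen in the sample values is stripped. The helper name parse_vertical is hypothetical, and the in-memory sample stands in for movie13.txt.

```python
import io

import pandas as pd


def parse_vertical(lines):
    """Group 'key: value' lines into one dict per record.

    Assumption: a new record starts whenever a key repeats.
    """
    records, current = [], {}
    for line in lines:
        line = line.strip()
        if ":" not in line:
            continue  # skip blank lines and anything that is not a key/value pair
        key, value = line.split(":", 1)  # split at the first colon only
        key = key.strip()
        value = value.strip().rstrip("}")  # some sample values end with "}"
        if key in current:
            records.append(current)
            current = {}
        current[key] = value
    if current:
        records.append(current)
    return records


sample = io.StringIO(
    "movie_id:\t111161\n"
    "movie_name:\tThe Shawshank Redemption\n"
    "year:\t1994\n"
    "movie_link:\t/title/tt0111161/\n"
    "movie_rate:\t9.222352298932535}\n"
    "movie_id:\t71562\n"
    "movie_name:\tThe Godfather: Part II\n"
    "year:\t1974\n"
    "movie_link:\t/title/tt0071562/\n"
    "movie_rate:\t8.982003465767905}\n"
)

df = pd.DataFrame(parse_vertical(sample))
# Values arrive as strings, so convert the numeric columns explicitly
for col in ["movie_id", "year", "movie_rate"]:
    df[col] = pd.to_numeric(df[col])
```

Splitting at the first colon only matters for titles such as "The Godfather: Part II", which contain a colon themselves.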