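Reading a txt data file that contains multiple dictionaries with pandas
The IMDB movie data collected by the crawler is stored in a txt file with one dictionary per line, as shown below: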
{"movie_id": 111161, "movie_name": "The Shawshank Redemption", "year": 1994, "movie_link": "/title/tt0111161/", "movie_rate": 9.222352298932535}
{"movie_id": 68646, "movie_name": "The Godfather", "year": 1972, "movie_link": "/title/tt0068646/", "movie_rate": 9.149459830968244}
{"movie_id": 71562, "movie_name": "The Godfather: Part II", "year": 1974, "movie_link": "/title/tt0071562/", "movie_rate": 8.982003465767905}
{"movie_id": 468569, "movie_name": "The Dark Knight", "year": 2008, "movie_link": "/title/tt0468569/", "movie_rate": 8.969818219072174}
{"movie_id": 50083, "movie_name": "12 Angry Men", "year": 1957, "movie_link": "/title/tt0050083/", "movie_rate": 8.92419107925773}
{"movie_id": 108052, "movie_name": "Schindler's List", "year": 1993, "movie_link": "/title/tt0108052/", "movie_rate": 8.903202955255063}
{"movie_id": 167260, "movie_name": "The Lord of the Rings: The Return of the King", "year": 2003, "movie_link": "/title/tt0167260/", "movie_rate": 8.880342018614696}
{"movie_id": 110912, "movie_name": "Pulp Fiction", "year": 1994, "movie_link": "/title/tt0110912/", "movie_rate": 8.85114611887013}
{"movie_id": 60196, "movie_name": "Il buono, il brutto, il cattivo", "year": 1966, "movie_link": "/title/tt0060196/", "movie_rate": 8.801450831661567}
……
#!/usr/bin/python
# -*- coding: utf-8 -*-
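# After reading the file with pandas, the data looks like this
import pandas as pd
pd.set_option('display.max_colwidth', 150)  # avoid truncated columns: show up to 150 characters per value (default is 50)
data = pd.read_csv("movie12.txt", sep="\t", header=None)  # returns a DataFrame; header=None means the file has no header row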
data.head()
| | 0 |
---|---|
0 | {"movie_id": 111161, "movie_name": "The Shawshank Redemption", "year": 1994, "movie_link": "/title/tt0111161/", "movie_rate": 9.222352298932535} |
1 | {"movie_id": 68646, "movie_name": "The Godfather", "year": 1972, "movie_link": "/title/tt0068646/", "movie_rate": 9.149459830968244} |
2 | {"movie_id": 71562, "movie_name": "The Godfather: Part II", "year": 1974, "movie_link": "/title/tt0071562/", "movie_rate": 8.982003465767905} |
3 | {"movie_id": 468569, "movie_name": "The Dark Knight", "year": 2008, "movie_link": "/title/tt0468569/", "movie_rate": 8.969818219072174} |
4 | {"movie_id": 50083, "movie_name": "12 Angry Men", "year": 1957, "movie_link": "/title/tt0050083/", "movie_rate": 8.92419107925773} |
The format we actually want looks like this, which makes the next step of statistical analysis in Python much easier:
data = pd.read_csv("movie12.csv")
data.head()
| | Unnamed: 0 | movie_id | movie_name | year | movie_link | movie_rate |
---|---|---|---|---|---|---|
0 | 0 | 111161 | The Shawshank Redemption | 1994 | /title/tt0111161/ | 9.222352 |
1 | 1 | 68646 | The Godfather | 1972 | /title/tt0068646/ | 9.149460 |
2 | 2 | 71562 | The Godfather: Part II | 1974 | /title/tt0071562/ | 8.982003 |
3 | 3 | 468569 | The Dark Knight | 2008 | /title/tt0468569/ | 8.969818 |
4 | 4 | 50083 | 12 Angry Men | 1957 | /title/tt0050083/ | 8.924191 |
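The Unnamed: 0 column in the output above is what pandas typically produces when a CSV was written together with its index (the default behaviour of to_csv) and is then read back without declaring that column as the index. Below is a minimal sketch of the two usual ways to avoid it. The two sample rows and the in-memory StringIO buffers are illustrative assumptions that stand in for movie12.csv.

```python
import io
import pandas as pd

# Two sample rows taken from the listing above; the buffers stand in for files on disk.
df = pd.DataFrame({
    "movie_id": [111161, 68646],
    "movie_name": ["The Shawshank Redemption", "The Godfather"],
    "year": [1994, 1972],
})

# By default to_csv also writes the index, as a first column with an empty name.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)
print(list(with_index.columns))

# Option 1: tell read_csv that the first column is the index.
buf.seek(0)
as_index = pd.read_csv(buf, index_col=0)

# Option 2: do not write the index in the first place.
buf2 = io.StringIO()
df.to_csv(buf2, index=False)
buf2.seek(0)
no_index = pd.read_csv(buf2)
```

Either option should give a frame whose columns are only movie_id, movie_name and year.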
# Straight to the code
#!/usr/bin/python
# -*- coding: utf-8 -*-
import json
import pandas as pd
# Open the file (assumes it is UTF-8 encoded); the with-block closes it automatically
with open("movie12.txt", "r", encoding="utf-8") as f:
    # Each non-empty line is one dictionary in JSON syntax; json.loads parses it safely,
    # without eval(), and keeps apostrophes inside titles such as "Schindler's List"
    list_dict = [json.loads(line) for line in f if line.strip()]  # [{'Key1': 'Value1_1', 'Key2': 'Value2_1', ...}, {...}, ...]
df = pd.DataFrame(list_dict)
df.to_csv("movie12out.csv", encoding="utf-8-sig")
df.head()
| | movie_id | movie_name | year | movie_link | movie_rate |
---|---|---|---|---|---|
0 | 111161 | The Shawshank Redemption | 1994 | /title/tt0111161/ | 9.222352 |
1 | 68646 | The Godfather | 1972 | /title/tt0068646/ | 9.149460 |
2 | 71562 | The Godfather: Part II | 1974 | /title/tt0071562/ | 8.982003 |
3 | 468569 | The Dark Knight | 2008 | /title/tt0468569/ | 8.969818 |
4 | 50083 | 12 Angry Men | 1957 | /title/tt0050083/ | 8.924191 |
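Because each line of the file is a complete JSON object, the same table can probably be obtained in a single call with pd.read_json and lines=True. The sketch below feeds three of the lines shown earlier through an in-memory buffer instead of movie12.txt, so it runs without the file. It assumes that every line of the real file is strictly valid JSON.

```python
import io
import pandas as pd

# Three lines copied from the sample data; in practice the path "movie12.txt" would be passed instead.
raw = """{"movie_id": 111161, "movie_name": "The Shawshank Redemption", "year": 1994, "movie_link": "/title/tt0111161/", "movie_rate": 9.222352298932535}
{"movie_id": 68646, "movie_name": "The Godfather", "year": 1972, "movie_link": "/title/tt0068646/", "movie_rate": 9.149459830968244}
{"movie_id": 108052, "movie_name": "Schindler's List", "year": 1993, "movie_link": "/title/tt0108052/", "movie_rate": 8.903202955255063}
"""

# lines=True tells pandas to treat every line as a separate JSON record.
df = pd.read_json(io.StringIO(raw), lines=True)
print(df.head())
```

This avoids building the intermediate list of dictionaries by hand, at the cost of less control over how a malformed line is handled.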
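Reading a dictionary-like txt file with pandas
Sometimes the data file arrives as a txt file in which all column names and values are laid out vertically, one key-value pair per line.
It resembles the dictionary format: each key is separated from its value by a colon (:) and a tab (\t), but the braces {} and the double quotes "" are missing.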
movie_id: 111161
movie_name: The Shawshank Redemption
year: 1994
movie_link: /title/tt0111161/
movie_rate: 9.222352298932535}
movie_id: 68646
movie_name: The Godfather
year: 1972
movie_link: /title/tt0068646/
movie_rate: 9.149459830968244}
movie_id: 71562
movie_name: The Godfather: Part II
year: 1974
movie_link: /title/tt0071562/
movie_rate: 8.982003465767905}
movie_id: 468569
movie_name: The Dark Knight
year: 2008
movie_link: /title/tt0468569/
movie_rate: 8.969818219072174}
movie_id: 50083
movie_name: 12 Angry Men
year: 1957
movie_link: /title/tt0050083/
movie_rate: 8.92419107925773}
……
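One possible way to turn this vertical layout into a table is sketched below. It rests on several assumptions: every record starts with a movie_id line, the key ends at the first colon, and the trailing } on the movie_rate lines is stray and can be stripped. The helper name parse_vertical and the inline sample are illustrative and not part of the original script.

```python
import pandas as pd

# Two records copied from the sample above, kept inline so the sketch runs without movie13.txt.
sample = """movie_id: 111161
movie_name: The Shawshank Redemption
year: 1994
movie_link: /title/tt0111161/
movie_rate: 9.222352298932535}
movie_id: 71562
movie_name: The Godfather: Part II
year: 1974
movie_link: /title/tt0071562/
movie_rate: 8.982003465767905}
"""

def parse_vertical(text, first_key="movie_id"):
    """Group 'key: value' lines into one dict per record (hypothetical helper)."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first colon only, so values such as "The Godfather: Part II" stay intact.
        key, _, value = line.partition(":")
        key = key.strip()
        # Drop surrounding whitespace (including tabs) and the assumed-stray closing brace.
        value = value.strip().rstrip("}").strip()
        if key == first_key or not records:
            records.append({})  # a new record begins
        records[-1][key] = value
    return records

df = pd.DataFrame(parse_vertical(sample))
# All values are parsed as strings, so the numeric columns are converted explicitly.
df = df.astype({"movie_id": int, "year": int, "movie_rate": float})
print(df)
```

Starting a new record at each movie_id line, rather than counting five lines per record, keeps the sketch tolerant of a record that is missing a field.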
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Reading the file with pandas gives the following result:
import pandas as pd
data = pd.read_csv("movie13.txt", sep="\t", header=None)  # returns a DataFrame; header=None means the file has no header row
#data = pd.read_clipboard(sep = "\t",header=None)
data.head()
| | 0 |
---|---|
0 | movie_id: 111161 |
1 | movie_name: The Shawshank Redemption |
2 | year: 1994 |