Some common pandas methods and small problems I ran into
1. Reading a JSON file with pandas
import pandas as pd
pathfile = 'xxx.json'
data = pd.read_json(pathfile)
data is of type <class 'pandas.core.frame.DataFrame'>
Reference: Python Machine Learning (83): Reading JSON Data with Pandas
2. Date conversion in pandas
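As a minimal sketch of date conversion, the snippet below parses timestamp strings laid out like the created_at and update_time fields used later in this post. The sample values, the format string FMT and the variable names are assumptions made for illustration, and the month and weekday names are assumed to be in English.

```python
import pandas as pd

# Hypothetical sample values in the same layout as the timestamps in section 5
raw = pd.Series([
    "Mon Oct 29 11:30:05 +0800 2012",
    "Sun Jan 06 23:07:51 +0800 2013",
])

# %a weekday, %b month name, %d day, %H:%M:%S time, %z UTC offset, %Y year
FMT = "%a %b %d %H:%M:%S %z %Y"

# Convert the strings into timezone-aware datetimes
parsed = pd.to_datetime(raw, format=FMT)

# Convert the datetimes back into strings with a simpler layout
as_text = parsed.dt.strftime("%Y-%m-%d %H:%M:%S")
print(as_text.tolist())
```

Passing an explicit format avoids leaving pandas to guess the layout, which is usually slower and can be ambiguous.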
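3. Fixing garbled Chinese text when pandas writes a CSV file
- utf-8 and utf-8-sig are not the same, and the difference often shows up with CSV files: utf-8-sig writes a byte order mark (BOM) at the start of the file, which Excel uses to recognize the file as UTF-8.
- Reference: Fixing garbled Chinese when writing CSV files with the Python 3 pandas library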
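The difference between the two encodings can be checked directly. The sketch below writes the same small table twice and looks at the first three bytes of each file; the table contents and file names are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical table containing Chinese text
df = pd.DataFrame({"name": ["张三", "李四"], "city": ["北京", "上海"]})

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "plain.csv")
    bom_path = os.path.join(tmp, "bom.csv")

    df.to_csv(plain_path, index=False, encoding="utf-8")
    df.to_csv(bom_path, index=False, encoding="utf-8-sig")

    # Read the first three bytes of each file
    with open(plain_path, "rb") as f:
        plain_head = f.read(3)
    with open(bom_path, "rb") as f:
        bom_head = f.read(3)

    # Reading with utf-8-sig strips the BOM again, so column names stay clean
    restored = pd.read_csv(bom_path, encoding="utf-8-sig")

UTF8_BOM = b"\xef\xbb\xbf"
print("utf-8 starts with BOM:    ", plain_head == UTF8_BOM)
print("utf-8-sig starts with BOM:", bom_head == UTF8_BOM)
print(restored.columns.tolist())
```

Only the utf-8-sig file starts with the BOM. If the CSV is meant to be opened in Excel, utf-8-sig is usually the safer choice.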
4. pd.DataFrame
Code:
import pandas as pd
a = [['a','b','c','d'], ['e','f','g','h']]
a_df = pd.DataFrame(a)
print(a_df)
print(type(a_df))
Result:
   0  1  2  3
0  a  b  c  d
1  e  f  g  h
<class 'pandas.core.frame.DataFrame'>
Code:
# A flat list becomes a single column when converted to a DataFrame; transpose it to get a single row
b = ['a','b','c','d']
b_df = pd.DataFrame(b)
print(b_df)
print(type(b_df))
b_df_T = b_df.T
print(b_df_T)
print(type(b_df_T))
Result:
   0
0  a
1  b
2  c
3  d
<class 'pandas.core.frame.DataFrame'>
   0  1  2  3
0  a  b  c  d
<class 'pandas.core.frame.DataFrame'>
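- A real example:
Sometimes a single list has to be written to a CSV file as one row. After all_content = pd.DataFrame(all_content) the data has become one column, so all_content.to_csv() goes wrong: the values are written as a column, and a header with one name per field no longer matches the single column. So I add a flag and transpose the DataFrame when the data is a single column.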
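import numpy as np

# The first row is stored as-is (a flat list); later rows are stacked under it
if len(all_content) == 0:
    all_content = row
    flag = True   # only one row so far
else:
    # np.vstack does the same job as np.row_stack, which is deprecated in NumPy 2.0
    all_content = np.vstack((all_content, row))
    flag = False  # all_content is already two-dimensional
all_content = pd.DataFrame(all_content)
if flag:
    # A flat list became one column, so transpose it back into one row
    all_content = all_content.T
all_content.to_csv(out_file, index=False, header=header, encoding='utf-8-sig')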
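One possible way to avoid the flag is to make the data two-dimensional before building the DataFrame. The sketch below uses np.atleast_2d for that; the helper name rows_to_csv_text and the column names are hypothetical, and it writes to an in-memory buffer instead of out_file so that it can run on its own.

```python
import io

import numpy as np
import pandas as pd


def rows_to_csv_text(all_content, header):
    """Return CSV text for one row or for several rows (hypothetical helper)."""
    # np.atleast_2d turns a flat list of shape (n,) into shape (1, n)
    # and leaves data that is already two-dimensional unchanged.
    table = pd.DataFrame(np.atleast_2d(all_content))
    buffer = io.StringIO()
    table.to_csv(buffer, index=False, header=header)
    return buffer.getvalue()


header = ["c1", "c2", "c3", "c4"]  # hypothetical column names

single = rows_to_csv_text(["a", "b", "c", "d"], header)
several = rows_to_csv_text([["a", "b", "c", "d"], ["e", "f", "g", "h"]], header)

print(single)
print(several)
```

In both cases each input list ends up as one CSV row, so no transpose is needed.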
5. pd.read_json()
I recently processed some data (already anonymized) in the following format:
# Raw data
[
    {
        "reposts_count": 0,
        "favorited": 0,
        "update_time": "Sun Jan 06 23:07:51 +0800 2000",
        "original_pic": "",
        "text": " 嘻嘻@123123123",
        "created_at": "Mon Oct 29 11:30:05 +0800 2000",
        "mid": 123123123123123123,
        "annotations": "",
        "source": "<a href=\"http:/>",
        "user": {
            "id": 123123123,
            "idstr": "123123123",
            "screen_name": "xxxxxx",
            "name": "xxxxxxxx",
            "location": "China",
            "gender": "m",
            "statuses_count": 133,
            "favourites_count": 0
        },
        "in_reply_to_screen_name": "",
        "in_reply_to_user_id": 0,
        "comments_count": 2
    },
    {
        "reposts_count": 0,
        "favorited": 0,
        "update_time": "Sun Jan 06 23:07:51 +0800 2010",
        "original_pic": "",
        "text": " 哈哈哈!你好!!",
        "created_at": "Mon Oct 29 11:30:05 +0800 2010",
        "mid": 456456456465456456,
        "annotations": "",
        "source": "<a href=\"http:/>",
        "user": {
            "id": 456456456,
            "idstr": "456456456",
            "screen_name": "yyyyyyyy",
            "name": "yyyyyyyy",
            "location": "China",
            "gender": "f",
            "statuses_count": 133,
            "favourites_count": 0
        },
        "in_reply_to_screen_name": "",
        "in_reply_to_user_id": 0,
        "comments_count": 2
    }
]
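Some fields need to be extracted from the file 123456.json (or 123456.txt) shown above, for example the content of "text" and the "id" inside "user". The extraction works as follows: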
import pandas as pd

datafile = pd.read_json("123456.json", encoding='utf-8')  # omitting encoding='utf-8' sometimes raises an error
# datafile = pd.read_json("123456.txt", encoding='utf-8')  # a .txt file with the same content can also be read by pd.read_json
print("type(datafile):", type(datafile))  # <class 'pandas.core.frame.DataFrame'>
print("datafile:\n", datafile)
num_shape = datafile.shape[0]  # number of rows, one per record
print(f"\nThis file contains {num_shape} records!")
data_text = datafile['text']
print("\ntype(data_text):", type(data_text))  # <class 'pandas.core.series.Series'>
print("data_text:\n", data_text)
data_user = datafile['user']  # each element is a dict holding the nested "user" object
print("\ntype(data_user):", type(data_user))  # <class 'pandas.core.series.Series'>
print("data_user:\n", data_user)
for i in range(num_shape):
    print(f"\nRecord {i}:")
    text = data_text[i]
    print("type(text):", type(text))
    print("text:", text)
    uid = str(data_user[i]['id'])  # convert the numeric id to a string
    print("type(uid):", type(uid))
    print("uid:", uid)
Result:
type(datafile): <class 'pandas.core.frame.DataFrame'>
datafile:
   reposts_count  favorited  ...  in_reply_to_user_id  comments_count
0              0          0  ...                    0               2
1              0          0  ...                    0               2
[2 rows x 13 columns]
This file contains 2 records!
type(data_text): <class 'pandas.core.series.Series'>
data_text:
0    嘻嘻@123123123
1    哈哈哈!你好!!
Name: text, dtype: object
type(data_user): <class 'pandas.core.series.Series'>
data_user:
0    {'id': 123123123, 'idstr': '123123123', 'scree...}
1    {'id': 456456456, 'idstr': '456456456', 'scree...}
Name: user, dtype: object
Record 0:
type(text): <class 'str'>
text: 嘻嘻@123123123
type(uid): <class 'str'>
uid: 123123123
Record 1:
type(text): <class 'str'>
text: 哈哈哈!你好!!
type(uid): <class 'str'>
uid: 456456456
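As an alternative to looping over the rows, the nested "user" fields can be flattened into their own columns with pd.json_normalize. This is only a sketch: the records list below is a hypothetical, trimmed-down version of the sample data above, defined inline so the snippet runs without the file.

```python
import pandas as pd

# Hypothetical records: a trimmed-down version of the sample data above
records = [
    {
        "text": "嘻嘻@123123123",
        "user": {"id": 123123123, "screen_name": "xxxxxx"},
        "comments_count": 2,
    },
    {
        "text": "哈哈哈!你好!!",
        "user": {"id": 456456456, "screen_name": "yyyyyyyy"},
        "comments_count": 2,
    },
]

# Nested dict keys become dotted column names such as "user.id"
flat = pd.json_normalize(records)

# Keep only the wanted fields and store the id as a string, like uid in the loop
result = flat[["text", "user.id"]].astype({"user.id": str})
print(result)
```

With real data, the list of records could come from json.load on the file, and result can then be written out with to_csv as in section 3.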