Notes on everyday pandas data handling

锲启

Last modified on 2024-07-18 11:44:04

Article tags: machine learning, python, data mining

First published on 2019-10-30 19:18:21

Article link: https://blog.csdn.net/weixin_44166997/article/details/102822691

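Reading a CSV file with pandas when the path contains Chinese characters

import pandas as pd
# Open the file yourself (with its encoding) and pass the handle to pandas
with open("C:/Users/liubb/Desktop/机器学习/U_REEData.csv", encoding='gbk') as f:
    dataframe = pd.read_csv(f)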
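
As a small self-contained check of the approach above, the sketch below writes a throwaway GBK-encoded CSV into a temporary folder under a file name that contains Chinese characters, then reads it back in two ways: through an open file handle, and by passing `encoding` straight to `pd.read_csv`. The file name and the sample rows are made up for illustration. Whether the direct form works for you may depend on your pandas version and platform.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name containing Chinese characters
    csv_path = os.path.join(tmp_dir, "机器学习_demo.csv")
    with open(csv_path, "w", encoding="gbk") as f:
        f.write("name,score\n张三,90\n李四,85\n")

    # Option 1: open the file yourself, then hand the handle to pandas
    with open(csv_path, encoding="gbk") as f:
        df_from_handle = pd.read_csv(f)

    # Option 2: let pandas open the path and decode the content as GBK
    df_from_path = pd.read_csv(csv_path, encoding="gbk")

print(df_from_handle.equals(df_from_path))
```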

Things to try when pandas raises an error while reading a file

df_clean = pd.read_csv('test_error.csv',lineterminator='\n')

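1. If a particular row causes the error

# pandas >= 1.3; older versions used error_bad_lines=False, which was removed in pandas 2.0
pd.read_csv(path, on_bad_lines='skip')

2. If Chinese characters cause the error, open the file first and then read it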

data = pd.read_csv(open(r'D:\Mywork\work\2020\0226\mycode\新款last\aaa.csv',encoding='utf_8'))
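
To illustrate item 1, here is a minimal sketch that parses an in-memory CSV whose second data row has one field too many. It assumes pandas 1.3 or later, where `on_bad_lines="skip"` is the replacement for the removed `error_bad_lines=False`. The sample data is invented.

```python
import io

import pandas as pd

# Made-up data: the second data row has four fields instead of three
raw = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

# on_bad_lines="skip" drops rows with too many fields instead of raising
df = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")
print(df)
```

Skipping rows silently can hide data problems, so it may be worth comparing the row count against the raw file afterwards.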

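Counting value frequencies in a row or a column

zero_col_count = dict(df[0].value_counts())  # count how often each value occurs in the column labelled 0
three_row_count = dict(df.loc[3].value_counts())  # count how often each value occurs in the row labelled 3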
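
A tiny worked example may make the two calls easier to picture. The DataFrame below is made up; its columns and rows carry the default integer labels, so `df[0]` is the first column and `df.loc[3]` is the last row.

```python
import pandas as pd

# Made-up data with default integer row and column labels
df = pd.DataFrame([[1, 2, 1],
                   [1, 3, 3],
                   [2, 2, 2],
                   [1, 1, 4]])

# Column 0 holds 1, 1, 2, 1
zero_col_count = df[0].value_counts().to_dict()
# Row 3 holds 1, 1, 4
three_row_count = df.loc[3].value_counts().to_dict()

print(zero_col_count)
print(three_row_count)
```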

Sorting (descending)

data_ret.sort_values(by='id_count', ascending=False)

Rebuilding the index

# drop=True discards the old index instead of keeping it as a column
data_ret = data_ret.reset_index(drop=True)
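
The sketch below combines sorting and index rebuilding on invented data. It shows that `sort_values` returns a new DataFrame that keeps the old row labels, and that `reset_index(drop=True)` then renumbers the rows from 0.

```python
import pandas as pd

# Made-up data
data_ret = pd.DataFrame({"id": ["a", "b", "c"], "id_count": [2, 5, 3]})

# sort_values returns a new, sorted DataFrame; the old row labels travel with the rows
sorted_ret = data_ret.sort_values(by="id_count", ascending=False)
print(sorted_ret.index.tolist())

# reset_index(drop=True) renumbers the rows and throws the old labels away
reindexed = sorted_ret.reset_index(drop=True)
print(reindexed)
```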

Saving a table to a txt file

dd.to_csv('dd.txt',index=False)

Saving a table to an xlsx file

data.to_excel('test.xlsx',index=False)

Reading txt data into a table

pd.read_table('dd.txt',sep=',')
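
As a quick round-trip check, the following sketch saves a small invented table to a comma-separated txt file in a temporary folder and reads it back with `pd.read_table`. The column names are placeholders.

```python
import os
import tempfile

import pandas as pd

# Made-up table
dd = pd.DataFrame({"keyword": ["pandas", "numpy"], "rank": [1, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    txt_path = os.path.join(tmp_dir, "dd.txt")
    dd.to_csv(txt_path, index=False)             # writes comma-separated text
    restored = pd.read_table(txt_path, sep=",")  # read_table defaults to tabs, so sep is needed

print(restored)
```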

Building a dict from two columns of a table

dict(zip(bb['关键字'], bb['排名']))

Sum of each column

data_all.loc["sum"] = data_all.sum()
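
One thing to watch with the line above is text columns: a plain `sum()` will try to add strings together. The sketch below, on invented data, restricts the totals to numeric columns with `numeric_only=True`; the text column then simply gets a missing value in the new "sum" row.

```python
import pandas as pd

# Made-up data with one text column and two numeric columns
data_all = pd.DataFrame({"name": ["x", "y"], "a": [1, 2], "b": [10, 20]})

# Sum only the numeric columns and append the totals as a row labelled "sum"
data_all.loc["sum"] = data_all.sum(numeric_only=True)
print(data_all)
```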

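Melt

# var_name is given as a plain string; recent pandas versions reject a list here for ordinary (non-MultiIndex) columns
pd.melt(data_all,id_vars=["name","team_name",'userid'],value_vars=['用户句子数','价格'],var_name='类型')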
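
To show what melting does to the shape of a table, here is a small sketch with invented data and English placeholder column names (`sentences`, `price`). Each value column becomes a block of rows, with the original column name stored in the `type` column.

```python
import pandas as pd

# Made-up wide table: one row per user, one column per measure
data_all = pd.DataFrame({
    "name": ["A", "B"],
    "team_name": ["t1", "t2"],
    "userid": [1, 2],
    "sentences": [10, 20],
    "price": [1.5, 2.5],
})

# Long format: one row per (user, measure) pair
long_df = pd.melt(
    data_all,
    id_vars=["name", "team_name", "userid"],
    value_vars=["sentences", "price"],
    var_name="type",
    value_name="value",
)
print(long_df)
```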

Pivot table

aa = pd.pivot_table(data_all,index=["name","team_name",'userid'],values=['风格','辱骂','其他','无标签'])



df1['序号'] = df1.groupby(['second_category'])['number'].rank(method='first', ascending=False)
df_result = pd.pivot_table(df1, index=['序号'], columns=['second_category'], values=['product_id'])
df_result
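
The rank-then-pivot pattern above can be tried on a few invented rows. In this sketch each category's products are ranked by `number` (largest first), and the pivot then lines the categories up side by side, one rank per row. Note that `pivot_table` aggregates with the mean by default, so the product ids come back as floats.

```python
import pandas as pd

# Made-up data: two categories with two products each
df1 = pd.DataFrame({
    "second_category": ["x", "x", "y", "y"],
    "number": [5, 9, 3, 7],
    "product_id": [101, 102, 201, 202],
})

# Rank within each category, largest number first
df1["rank"] = df1.groupby("second_category")["number"].rank(method="first", ascending=False)

# One row per rank, one column per category
df_result = pd.pivot_table(df1, index="rank", columns="second_category", values="product_id")
print(df_result)
```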

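Plotting a ROC curve

y_true: the true sample labels, {0, 1} or {-1, 1} by default. For any other label set, pos_label must be set explicitly. For example, with labels {1, 2} where 2 marks the positive class, use pos_label=2.
y_score: the predicted score for each sample.
pos_label: the label of the positive class.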

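import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc  # functions for the ROC curve and the AUC
# y_test holds the true labels and y_score the predicted scores
fpr, tpr, threshold = roc_curve(y_test, y_score)  # false positive rate and true positive rate
roc_auc = auc(fpr, tpr)  # area under the curve

plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')  # diagonal: a random classifier
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.xlabel('False Positive Rate')  # the x-axis is fpr
plt.ylabel('True Positive Rate')  # the y-axis is tpr
plt.title('Receiver operating characteristic example')
plt.show()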
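
The `pos_label` behaviour described above can be checked without plotting. The sketch below uses four invented samples labelled {1, 2}, treats 2 as the positive class, and computes the AUC numerically.

```python
from sklearn.metrics import auc, roc_curve

# Made-up labels in {1, 2}; 2 is the positive class
y_true = [1, 1, 2, 2]
y_score = [0.1, 0.4, 0.35, 0.8]

# pos_label tells roc_curve which label counts as positive
fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label=2)
roc_auc = auc(fpr, tpr)
print(round(roc_auc, 2))
```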

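Using cross-validation to choose a parameter

# Find the best max_depth with xgboost and cross-validation
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor
import warnings
warnings.filterwarnings("ignore")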

params = [1,2,3,4,5,6]
test_scores=[]
for param in params:
    clf = XGBRegressor(max_depth = param)
    test_score = np.sqrt(-cross_val_score(clf,x_train,y_train,cv =10,scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))

plt.plot(params,test_scores)
plt.grid(True,axis='both',color='red',alpha=0.1)
plt.title('max_depth vs CV Error')
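
The same loop can be tried end to end without xgboost installed. The sketch below swaps in scikit-learn's `DecisionTreeRegressor` as a stand-in model and a synthetic dataset from `make_regression`; both are assumptions made for illustration, not part of the original workflow. It also picks the depth with the lowest mean cross-validated RMSE instead of reading it off the plot.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic training data (stand-in for x_train / y_train)
x_train, y_train = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

params = [1, 2, 3, 4, 5, 6]
test_scores = []
for param in params:
    # Stand-in model; the notes above use XGBRegressor(max_depth=param)
    model = DecisionTreeRegressor(max_depth=param, random_state=0)
    # scoring returns negative MSE, so negate it before taking the square root
    rmse_per_fold = np.sqrt(-cross_val_score(model, x_train, y_train, cv=5,
                                             scoring="neg_mean_squared_error"))
    test_scores.append(rmse_per_fold.mean())

# The depth with the smallest mean RMSE across folds
best_depth = params[int(np.argmin(test_scores))]
print(best_depth)
```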

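Intersection, union, complement, and difference of DataFrames in pandas

import pandas as pd

df1 = pd.DataFrame([['a', 10, '男'],
                    ['b', 11, '男'],
                    ['c', 11, '女'],
                    ['a', 10, '女'],
                    ['c', 11, '男']],
                   columns=['name', 'age', 'sex'])

df2 = pd.DataFrame([['a', 10, '男'],
                    ['b', 11, '女']],
                   columns=['name', 'age', 'sex'])

Intersection: print(pd.merge(df1, df2, on=['name', 'age', 'sex']))
Union: print(pd.merge(df1, df2, on=['name', 'age', 'sex'], how='outer'))
Difference (drop from df1 the rows that also exist in df2):
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
# df2 is added twice so that rows found only in df2 also count as duplicates and get dropped.
df1 = pd.concat([df1, df2, df2])
df1 = df1.drop_duplicates(subset=['name', 'age', 'sex'], keep=False)
print(df1)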
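
An alternative way to get the difference, sketched here under the same sample data, is a left merge with `indicator=True`. pandas then adds a `_merge` column saying where each row came from, and the rows marked `left_only` are the ones that exist only in df1. This is a different technique from the double-append trick, not a restatement of it.

```python
import pandas as pd

df1 = pd.DataFrame([['a', 10, '男'],
                    ['b', 11, '男'],
                    ['c', 11, '女'],
                    ['a', 10, '女'],
                    ['c', 11, '男']],
                   columns=['name', 'age', 'sex'])

df2 = pd.DataFrame([['a', 10, '男'],
                    ['b', 11, '女']],
                   columns=['name', 'age', 'sex'])

# indicator=True adds a "_merge" column: "both", "left_only", or "right_only"
merged = df1.merge(df2, on=['name', 'age', 'sex'], how='left', indicator=True)

# Keep the rows that have no match in df2
only_in_df1 = merged.loc[merged['_merge'] == 'left_only', ['name', 'age', 'sex']]
print(only_in_df1)
```

Unlike the duplicate-dropping approach, this keeps repeated rows of df1 that have no match in df2.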

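Difference between consecutive rows within each group

dd = pd.DataFrame({'A':['a','a','a','c','c','d'],'B':[3,2,1,4,5,8]})
# GroupBy.diff keeps the original row index, so the result can be assigned back directly
dd['C'] = dd.groupby('A')['B'].diff(1)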
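
Here is the same computation as a runnable sketch that calls `GroupBy.diff` directly, which sidesteps index-alignment issues that `apply` can run into on newer pandas versions. The first row of every group has no predecessor, so it gets NaN.

```python
import pandas as pd

dd = pd.DataFrame({'A': ['a', 'a', 'a', 'c', 'c', 'd'],
                   'B': [3, 2, 1, 4, 5, 8]})

# Within each group of A: current B minus the previous B
dd['C'] = dd.groupby('A')['B'].diff(1)
print(dd)
```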

Cumulative sum of a column

df['yhzb'] = df['yhzb'].cumsum()

Converting time strings to datetime

df1['pay_time'] = pd.to_datetime(df1['pay_time'])
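
When the column may contain malformed values, one option is to state the expected format and let pandas turn anything that does not parse into `NaT`. The sketch below uses invented timestamps; the format string is an assumption about how the data looks.

```python
import pandas as pd

# Made-up data: two valid timestamps and one broken value
df1 = pd.DataFrame({"pay_time": ["2024-07-18 11:44:04",
                                 "2019-10-30 19:18:21",
                                 "not a time"]})

# errors="coerce" turns unparseable strings into NaT instead of raising
df1["pay_time"] = pd.to_datetime(df1["pay_time"],
                                 format="%Y-%m-%d %H:%M:%S",
                                 errors="coerce")

# Once the column is datetime, the .dt accessor exposes its parts
df1["pay_year"] = df1["pay_time"].dt.year
print(df1)
```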

Counting the elements of a list

import collections
cc = collections.Counter(['a','b','a','a','c'])
cc.most_common()  # returns a list of (element, count) tuples, most frequent first

Binning

# equal number of samples per bin
data1['FareCut'] = pd.qcut(data1['Fare'], 4)
# equal-width bins
data1['FareCut'] = pd.cut(data1['Fare'], 4)
# bin edges taken from an explicit list
bins = [float('-inf'),1,3,7,float('inf')]
data1['FareCut'] = pd.cut(data1['Fare'], bins )
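
The three binning styles behave quite differently on skewed data. The sketch below runs them on eight invented fares with one large outlier and uses `labels=False` so each value is reported as a bin number. With explicit edges, intervals are closed on the right by default, so a fare of exactly 1 lands in the first bin.

```python
import pandas as pd

# Made-up fares with one large outlier
fares = pd.Series([0.5, 2, 5, 10, 1, 3, 7, 100])

# Quantile bins: roughly the same number of samples in each bin
by_quantile = pd.qcut(fares, 4, labels=False)

# Equal-width bins: the outlier stretches the range, so most values share bin 0
by_width = pd.cut(fares, 4, labels=False)

# Explicit edges: (-inf, 1], (1, 3], (3, 7], (7, inf]
bins = [float("-inf"), 1, 3, 7, float("inf")]
by_bins = pd.cut(fares, bins, labels=False)

print(by_quantile.tolist())
print(by_width.tolist())
print(by_bins.tolist())
```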

Generating a date sequence

days = pd.date_range('2020-05-01',periods=31, freq='D')
days = pd.date_range('2022-01-02',periods=60, freq='D').date.astype(str).tolist()
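
If the goal is a list of date strings, `strftime` on the `DatetimeIndex` is another way to get there and lets you pick the output format. A short sketch with three days:

```python
import pandas as pd

# Three consecutive days formatted as YYYY-MM-DD strings
days = pd.date_range("2022-01-02", periods=3, freq="D").strftime("%Y-%m-%d").tolist()
print(days)
```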

Counting the elements of a list with Counter

from collections import Counter
test_counter_data = ['cat', 'dog', 'sheep', 'cat', 'dog']
counter_data = Counter(test_counter_data)
print(counter_data)
print(dict(counter_data))
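
A few more `Counter` features that tend to be handy, shown on the same sample list: `most_common(n)` limits the result to the top n entries, `update` adds further observations, and looking up a key that was never seen returns 0 instead of raising.

```python
from collections import Counter

counter_data = Counter(['cat', 'dog', 'sheep', 'cat', 'dog'])

# Only the two most frequent elements
top_two = counter_data.most_common(2)
print(top_two)

# Add more observations to an existing counter
counter_data.update(['sheep', 'sheep'])
print(counter_data['sheep'])

# Missing keys count as zero
print(counter_data['bird'])
```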

PyPI mirrors

Alibaba Cloud https://mirrors.aliyun.com/pypi/simple/
University of Science and Technology of China (USTC) https://pypi.mirrors.ustc.edu.cn/simple/
Douban https://pypi.douban.com/simple/
Tsinghua University https://pypi.tuna.tsinghua.edu.cn/simple/

cc = bb.groupby(['color'])[['color']].count().rename(columns={'color':'count'}).reset_index()
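
The line above counts how many rows each colour has. A shorter way to get the same kind of table, sketched on invented data, is `size()` followed by `reset_index(name=...)`:

```python
import pandas as pd

# Made-up data
bb = pd.DataFrame({"color": ["red", "blue", "red", "green", "red"]})

# size() counts the rows per group; reset_index(name=...) names the count column
cc = bb.groupby("color").size().reset_index(name="count")
print(cc)
```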

Capturing error messages

import traceback
try:
    a = 1/0
except Exception as e:
    print(e)
    print(traceback.format_exc())
    
# Output:
division by zero
Traceback (most recent call last):
  File "D:/mywork/test.py", line 271, in <module>
    a = 1/0
ZeroDivisionError: division by zero
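
In a longer script it can be convenient to return the traceback text instead of printing it, so the caller decides whether to log it or carry on. The helper below is hypothetical (the name `safe_divide` is made up for this sketch) and only wraps the same `traceback.format_exc()` call.

```python
import traceback


def safe_divide(a, b):
    """Hypothetical helper: return (result, None) on success, (None, traceback text) on failure."""
    try:
        return a / b, None
    except ZeroDivisionError:
        # format_exc() returns the full traceback of the exception being handled
        return None, traceback.format_exc()


result, tb_text = safe_divide(1, 0)
print(result)
print(tb_text)
```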

Converting a PDF to Word format with pdf2docx (works on text-based PDFs; no OCR is performed)

from pdf2docx import parse

pdf_file = '123.pdf'
docx_file = '123.docx'

# convert pdf to docx
parse(pdf_file, docx_file)

Getting the metadata of an image

Online tool: https://nullgo.com/web/image-exif

    from PIL import Image, ExifTags
    img = Image.open(r'C:\Desktop\image.jpg')

    # Method 1
    # Image.info holds the information stored with the image
    print(img.info.keys())
    for one in img.info.keys():
        one_data = img.info.get(one)
        print(one,'****',one_data)
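
The notes import `ExifTags` but stop after the first method. One possible way to use it, offered here as an assumption and not as the author's second method, is to read the EXIF block with `getexif()` and translate the numeric tag ids into names through `ExifTags.TAGS`. To stay self-contained, the sketch builds a tiny JPEG in memory with a made-up camera make instead of opening a file from disk. It assumes a reasonably recent Pillow.

```python
import io

from PIL import ExifTags, Image

# Build a small in-memory JPEG that carries one EXIF field (made-up value)
exif = Image.Exif()
exif[0x010F] = "DemoCam"  # 0x010F is the EXIF "Make" tag
buf = io.BytesIO()
Image.new("RGB", (8, 8), "white").save(buf, format="JPEG", exif=exif)
buf.seek(0)

# Read the EXIF data back and map numeric tag ids to readable names
img = Image.open(buf)
named_exif = {ExifTags.TAGS.get(tag_id, tag_id): value
              for tag_id, value in img.getexif().items()}
print(named_exif)
```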