Learning pandas

The most recently recommended article was published on 2022-12-29 14:32:38

要做个棒的人

Article link: https://blog.csdn.net/yxf2505/article/details/115597274

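1. Introduction to pandas

df = pd.read_csv() reads the data

pd.read_csv	read data from a CSV file
help(pd.read_csv)	show the read_csv documentation
df.info()	summary of the DataFrame (columns, non-null counts, dtypes)
df.index	row index
df.columns	column names (features)
df.dtypes	data type of each column
df.values	underlying values as a NumPy array
df.describe()	basic summary statistics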
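
The inspection calls in the table can be tried without a file on disk. The sketch below is only an illustration: the CSV text and its column names (Name, Age, Fare) are made up, and `io.StringIO` stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """Name,Age,Fare
Ann,22,7.25
Bob,38,71.28
Cid,26,7.92
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

df.info()             # prints column names, non-null counts, and dtypes
print(df.index)       # the row index
print(df.columns)     # the column names
print(df.dtypes)      # one dtype per column
print(df.values)      # the data as a 2-D NumPy array
print(df.describe())  # count, mean, std, min, quartiles, max of numeric columns
```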

Creating a DataFrame

data = {'country':['aaa','bbb','ccc'],
       'population':[10,12,14]}
df_data = pd.DataFrame(data)
df_data

Output:

  country  population
0     aaa          10
1     bbb          12
2     ccc          14

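Selecting specific data

Series: a single row or column of a DataFrame

age = df['Age']

age.index
age.name   # a Series has no .columns attribute; .name holds the column label

age.min()
age.max()
age.mean()

We can also choose the index ourselves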

df = df.set_index('Name')
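
As a small sketch of the two ideas above (a column is a Series, and any column can become the index), assuming a made-up three-row table:

```python
import pandas as pd

# Hypothetical data; the original notes use a passenger table that is not shown.
df = pd.DataFrame({'Name': ['Ann', 'Bob', 'Cid'], 'Age': [22, 38, 26]})

age = df['Age']                    # selecting one column returns a Series
print(age.min(), age.max(), age.mean())

by_name = df.set_index('Name')     # the Name column becomes the row labels
print(by_name.loc['Bob', 'Age'])   # look up Bob's age by label
```

Note that set_index returns a new DataFrame, which is why the notes assign the result back to df.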

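2. Indexing in pandas

loc	select by label
iloc	select by integer position

df.iloc[0]   first row, all columns.
df.iloc[1:3,:] second and third rows, all columns.
df.iloc[2,1] the value in the third row, second column.

df.loc[] works like df.iloc[], except that it takes index labels, so it is most useful once the index holds meaningful names.
To set the index: df = df.set_index('Name')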
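
To make the loc/iloc difference concrete, here is a minimal sketch on a made-up table indexed by Name. One detail worth noticing: position slices exclude the end, while label slices include it.

```python
import pandas as pd

# Hypothetical data, indexed by the Name column.
df = pd.DataFrame(
    {'Name': ['Ann', 'Bob', 'Cid', 'Dee'],
     'Age': [22, 38, 26, 35],
     'Fare': [7.25, 71.28, 7.92, 53.10]}
).set_index('Name')

first_row = df.iloc[0]          # first row, all columns
middle = df.iloc[1:3, :]        # rows at positions 1 and 2 (end position excluded)
cell = df.iloc[2, 1]            # third row, second column: Cid's Fare

same_row = df.loc['Ann']        # the same row as df.iloc[0], selected by label
labelled = df.loc['Bob':'Cid']  # label slices include both endpoints
```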

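Boolean indexing

df['Fare'] > 40   # boolean Series: True where Fare exceeds 40

df[df['Fare'] > 40][:5]
df[df['Sex'] == 'male'][:5]
df.loc[df['Sex'] == 'male', 'Age'].mean()   # mean age of male passengers
(df['Sex'] == 'male').sum()   # number of male passengers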
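
A runnable version of the boolean filters above, using a small made-up table with the same column names. The last line is an extra illustration: two conditions are combined with & and each one needs its own parentheses.

```python
import pandas as pd

# Hypothetical passenger data with the column names used in the notes.
df = pd.DataFrame({
    'Sex': ['male', 'female', 'male', 'female'],
    'Age': [22, 38, 30, 35],
    'Fare': [7.25, 71.28, 50.00, 53.10],
})

mask = df['Fare'] > 40                   # boolean Series, one entry per row
expensive = df[mask]                     # keep only the rows where mask is True
male_mean_age = df.loc[df['Sex'] == 'male', 'Age'].mean()
n_male = (df['Sex'] == 'male').sum()     # each True counts as 1
both = df[(df['Sex'] == 'male') & (df['Fare'] > 40)]
```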

3. groupby

import pandas as pd

df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],
                  'data':[0,5,10,5,10,15,10,15,20]})

df.groupby('key')['data'].sum()
df.groupby('key').sum()
df.groupby('key').aggregate('mean')   # same result as .mean(); no NumPy import needed

df.groupby('Sex')['Age'].mean()
df.groupby('Sex')['Survived'].mean()
Mean survival rate for each sex
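
Here is the key/data example from this section run end to end. The `agg` call with a list of function names is an addition for illustration; it computes several statistics per group at once.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})

totals = df.groupby('key')['data'].sum()                      # one total per key
stats = df.groupby('key')['data'].agg(['sum', 'mean', 'max']) # several statistics per key

print(totals)
print(stats)
```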

4. Numeric operations

df.sum()
df.sum(axis=0)
df.sum(axis=1)
df.sum(axis='columns')
df.median()
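
The axis argument is easy to mix up, so here is a tiny sketch on a made-up 2x3 table: axis=0 (the default) collapses the rows and gives one result per column, while axis=1 or axis='columns' collapses the columns and gives one result per row.

```python
import pandas as pd

# Hypothetical 2x3 table.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['a', 'b'], columns=['A', 'B', 'C'])

col_totals = df.sum()                  # default axis=0: one total per column
row_totals = df.sum(axis=1)            # one total per row
row_totals_named = df.sum(axis='columns')  # same as axis=1
col_medians = df.median()              # one median per column
```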

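Examining relationships between variables (features)
The correlation coefficient ranges over [-1, +1]

cov()	df.cov() covariance
corr()	df.corr() correlation coefficient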
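
A quick way to see the [-1, +1] range is to build columns with known relationships. In this sketch (made-up data), y is exactly 2x and z falls as x rises, so the correlations come out at the two extremes.

```python
import pandas as pd

# Hypothetical data: y = 2 * x, and z decreases linearly as x increases.
df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [2, 4, 6, 8], 'z': [8, 6, 4, 2]})

cov = df.cov()     # pairwise covariance matrix
corr = df.corr()   # pairwise Pearson correlation matrix

print(corr)
```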

df.Age.value_counts()
df['Age'].value_counts(ascending=True)  # sort the counts in ascending order
df['Age'].value_counts(ascending=True, bins=5)  # split the values into 5 equal-width bins
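
A short sketch of value_counts with and without bins, on a made-up list of ages:

```python
import pandas as pd

# Hypothetical ages.
ages = pd.Series([22, 38, 26, 35, 35, 54, 2, 27])

counts = ages.value_counts()                        # most frequent value first
binned = ages.value_counts(bins=5, ascending=True)  # counts per equal-width interval

print(counts)
print(binned)
```

With bins set, the result is indexed by intervals rather than by the raw values.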

5. Object operations

Series: add, delete, update, query

data = [10, 11, 12]
index = ['a', 'b', 'c']
s = pd.Series(data=data, index=index)

# Query
s.iloc[0]   # by position; plain s[0] on a labeled index is deprecated in recent pandas
s[0:2]
mask = [True, False, True]
s[mask]
s.loc['a']

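# Update
s1 = s.copy()  # work on a copy so that s stays unchanged
s1['a'] = 100
s1.replace(to_replace=100, value=101, inplace=True)
s1.index = ['a', 'b', 'd']  # replace the whole index; practical when there are only a few labels to type out
s1.rename(index={'a': 'A'}, inplace=True)  # rename a single label

data = [100, 110]
index = ['h', 'k']
s2 = pd.Series(data=data, index=index)

# Add
s3 = pd.concat([s, s2])   # Series.append was removed in pandas 2.0; pd.concat replaces it
s3['j'] = 50   # s3 had no label 'j' before, so this appends a new entry
pd.concat([s1, s2], ignore_index=True)   # s1 followed by s2, relabeled 0, 1, ...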


# Delete
del s1['A']
s1.drop(['b','d'],inplace = True)
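
Putting the Series operations together in one runnable sketch. It uses pd.concat where older tutorials call Series.append, since append no longer exists in pandas 2.0 and later, and it uses the non-inplace forms so each step's result is visible.

```python
import pandas as pd

s = pd.Series([10, 11, 12], index=['a', 'b', 'c'])
s2 = pd.Series([100, 110], index=['h', 'k'])

s1 = s.copy()                                 # changes to s1 leave s untouched
s1['a'] = 100                                 # update by label
s1 = s1.rename(index={'a': 'A'})              # rename one label; returns a new Series

s3 = pd.concat([s, s2])                       # keeps the labels a, b, c, h, k
s3['j'] = 50                                  # assigning to a new label adds an entry
s4 = pd.concat([s1, s2], ignore_index=True)   # relabel the result 0, 1, 2, ...

s1 = s1.drop(['b'])                           # delete by label
```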

Output of s:

a    10
b    11
c    12
dtype: int64

DataFrame add, delete, update, and query operations work much like their Series counterparts

data = [[1, 2, 3], [4, 5, 6]]
index = ['a', 'b']
columns = ['A', 'B', 'C']

df = pd.DataFrame(data=data, index=index, columns=columns)


# Query (same pattern as Series)
df['A']
df.loc['a']   # first row, by label
df.iloc[0]   # first row, by position
df.loc['a','A']   # a single cell

# Update
df.loc['a', 'A'] = 150   # one .loc call; chained df.loc['a']['A'] = ... may modify a copy instead
df.index = ['g', 'j']


data = [[1, 2, 3], [4, 5, 6]]
index = ['j', 'k']
columns = ['A', 'B', 'C']
df2 = pd.DataFrame(data=data, index=index, columns=columns)


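# Add
df.loc['c'] = [1, 2, 3]    # add a row
df3 = pd.concat([df, df2], axis=0)  # more samples: stack rows, aligned on column names
df4 = pd.concat([df, df2], axis=1)  # more features: place side by side, aligned on index labels
df['tang'] = [1, 2, 3]    # add a feature column; needs one value per row (three rows after adding 'c')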

# Delete
df.drop(['j'], axis=0, inplace=True)  # drop row j
df.drop(['A', 'B', 'C'], axis=1, inplace=True)   # drop columns A, B, and C
del df['A']   # drop column A (an alternative to drop; del works only on columns)
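
The same DataFrame operations as a self-contained sketch, starting from fresh copies of df and df2 so every line runs in order. The shapes in the comments show what each concat direction does when the two indexes do not overlap.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['a', 'b'], columns=['A', 'B', 'C'])
df2 = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['j', 'k'], columns=['A', 'B', 'C'])

df.loc['a', 'A'] = 150                 # update one cell with a single .loc call

df3 = pd.concat([df, df2], axis=0)     # stack rows: 4 rows x 3 columns
df4 = pd.concat([df, df2], axis=1)     # side by side: 4 rows x 6 columns, NaN where labels differ

df['tang'] = [1, 2]                    # new column, one value per row

trimmed = df3.drop(['j'], axis=0)      # drop a row by label
narrow = df3.drop(['A', 'B'], axis=1)  # drop columns by label
```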

df as originally constructed:

   A  B  C
a  1  2  3
b  4  5  6

df2:

   A  B  C
j  1  2  3
k  4  5  6

7. merge

left and right are the two DataFrames being merged.

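pd.merge(left, right, on='key')  # inner join on the shared key column
pd.merge(left, right)  # no `on`: join on all columns the two frames have in common
pd.merge(left, right, on=['key1', 'key2'], how='outer')  # keep every row from both sides; fill the gaps with NaN
pd.merge(left, right, how='left')  # keep every row of left; fill missing right-hand values with NaN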
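
Since the original left and right tables are not shown, the sketch below makes up two small ones that share a key column but do not have identical keys. The indicator=True argument is an extra for illustration: it adds a _merge column saying which side each row came from.

```python
import pandas as pd

# Hypothetical tables: K2 exists only in left, K3 only in right.
left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K3'], 'B': ['B0', 'B1', 'B3']})

inner = pd.merge(left, right, on='key')                   # only keys present on both sides
outer = pd.merge(left, right, on='key', how='outer')      # every key from either side
left_join = pd.merge(left, right, on='key', how='left')   # every key from left
flagged = pd.merge(left, right, on='key', how='outer', indicator=True)

print(flagged)
```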

8. pivot

df.pivot(index = '?',columns= '?',values = '?')

df.pivot_table(index='Sex', columns='Pclass', values='Fare')  # the default aggregation is the mean
df.pivot_table(index='Pclass', columns='Sex', values='Survived', aggfunc='mean')  # mean
df.pivot_table(index='Sex', columns='Pclass', values='Fare', aggfunc='count')  # count
df.pivot_table(index='Sex', columns='Pclass', values='Fare', aggfunc='max')  # maximum
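
A minimal pivot_table sketch on six made-up rows, small enough to check the numbers by hand:

```python
import pandas as pd

# Hypothetical passenger data with the column names used in the notes.
df = pd.DataFrame({
    'Sex':    ['male', 'male', 'female', 'female', 'male', 'female'],
    'Pclass': [1, 3, 1, 3, 3, 1],
    'Fare':   [50.0, 8.0, 80.0, 10.0, 12.0, 100.0],
})

# Rows are Sex values, columns are Pclass values, cells aggregate Fare.
mean_fare = df.pivot_table(index='Sex', columns='Pclass', values='Fare')
counts = df.pivot_table(index='Sex', columns='Pclass', values='Fare', aggfunc='count')

print(mean_fare)
print(counts)
```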

9. Time series operations

import datetime
dt = datetime.datetime(year=2017, month=11, day=24, hour=10, minute=30)
print(dt)  # 2017-11-24 10:30:00

pd.Series(pd.date_range(start='2017-11-24',periods = 10,freq = '12H'))

ts = pd.Timestamp('2017-11-24')
ts + pd.Timedelta('5 days')
pd.to_datetime('2017-11-24')
pd.to_datetime('24/11/2017', dayfirst=True)  # day/month/year input

data['Time'] = pd.to_datetime(data['Time'])
data = data.set_index('Time')  # make Time the index, replacing the default 0, 1, 2, ... index

data[pd.Timestamp('2012-01-01 09:00'):pd.Timestamp('2012-01-01 19:00')]

data[('2012-01-01 09:00'):('2012-01-01 19:00')]
data.loc['2013']  # all rows from 2013 (2013-01, 2013-02, ...); pandas 2.x requires .loc for this
data['2012-01':'2012-03']
data[data.index.month == 1]
data[(data.index.hour > 8) & (data.index.hour <12)]
data.between_time('08:00','12:00')  # rows whose time of day is from 08:00 to 12:00; both endpoints are included by default
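
The selections above can be checked on a synthetic hourly series. Everything here is made up for illustration (48 hourly rows starting 2012-01-01 and a single value column); the frequency is passed as a Timedelta rather than a string alias so that the sketch does not depend on alias spelling.

```python
import pandas as pd

# Hypothetical data: 48 hourly rows covering 2012-01-01 and 2012-01-02.
idx = pd.date_range(start='2012-01-01', periods=48, freq=pd.Timedelta(hours=1))
data = pd.DataFrame({'value': list(range(48))}, index=idx)

morning = data.loc['2012-01-01 09:00':'2012-01-01 11:00']       # string slice, both ends included
day_two = data.loc['2012-01-02']                                # partial-string match: one whole day
by_hour = data[(data.index.hour > 8) & (data.index.hour < 12)]  # hours 9, 10, 11 on every day
window = data.between_time('08:00', '12:00')                    # 08:00 through 12:00 on every day
```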

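data.resample('D').mean().head()  # daily mean (resampling); the old how='mean' argument is no longer supported
data.resample('D').max().head()   # daily maximum
data.resample('3D').mean().head() # mean over 3-day windows
data.resample('M').mean().head()  # monthly mean (pandas 2.2+ spells this alias 'ME')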
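
A small resampling sketch, again on made-up hourly data (six days, values 0 to 143), so the daily and 3-day results are easy to verify by hand:

```python
import pandas as pd

# Hypothetical data: six days of hourly rows with increasing values.
idx = pd.date_range(start='2012-01-01', periods=6 * 24, freq=pd.Timedelta(hours=1))
data = pd.DataFrame({'value': list(range(6 * 24))}, index=idx)

daily_mean = data.resample('D').mean()      # one row per calendar day
three_day_max = data.resample('3D').max()   # one row per 3-day window

print(daily_mean)
print(three_day_max)
```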

10. Common operations

sort_values	sorting
drop_duplicates()	removing duplicates

data.sort_values(by=['group','data'], ascending=[False,True], inplace=True)  # group descending, data ascending
data.sort_values(by='k2')

data.drop_duplicates()
data.drop_duplicates(subset='k1')
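
Sorting and de-duplication on a made-up five-row table. The column names group and data follow the notes; the values are invented.

```python
import pandas as pd

# Hypothetical data with two duplicated rows.
data = pd.DataFrame({'group': ['a', 'b', 'a', 'b', 'a'],
                     'data': [4, 3, 1, 3, 4]})

ordered = data.sort_values(by=['group', 'data'], ascending=[False, True])
unique_rows = data.drop_duplicates()                    # drop rows that repeat in every column
first_per_group = data.drop_duplicates(subset='group')  # keep the first row of each group value
```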

apply with axis='columns' calls food_map once per row and stores the results in a new column:

data['food_map'] = data.apply(food_map,axis = 'columns')
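
The notes do not show food_map itself, so the version below is a guess at its shape: a function that receives one row and looks its food value up in a dictionary. The dictionary food_upper and all the data are hypothetical.

```python
import pandas as pd

# Hypothetical data and lookup table; the original mapping is not shown.
data = pd.DataFrame({'food': ['A1', 'A2', 'B1', 'C1'], 'data': [1, 2, 3, 4]})
food_upper = {'A1': 'A', 'A2': 'A', 'B1': 'B', 'C1': 'C'}

def food_map(row):
    """Return the category for one row (a Series holding that row's values)."""
    return food_upper[row['food']]

data['food_map'] = data.apply(food_map, axis='columns')   # row-by-row function call
data['food_map_direct'] = data['food'].map(food_upper)    # same result via Series.map
```

When the function only needs one column, Series.map with a dictionary is usually the simpler choice.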

Data discretization

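The notes do not say which call was used for discretization. One common option is pd.cut, which assigns each value to an interval; the sketch below assumes that approach, with made-up ages, bin edges, and labels.

```python
import pandas as pd

# Hypothetical ages and bin edges.
ages = pd.Series([15, 18, 20, 21, 22, 34, 41, 52, 63, 79])
bins = [10, 40, 80]                                   # two intervals: (10, 40] and (40, 80]

groups = pd.cut(ages, bins)                           # an interval for each value
named = pd.cut(ages, bins, labels=['young', 'old'])   # the same bins with hypothetical labels
counts = named.value_counts()                         # how many values fall in each bin

print(counts)
```
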

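Detecting missing values

df.isnull()   # True marks a missing value
df.isnull().any()  # whether each column contains any missing value
df.isnull().any(axis=1)  # boolean Series: whether each row contains any missing value
df[df.isnull().any(axis=1)]  # rows that contain missing values

Filling missing values

fillna()
df.fillna(5)  # replace NaN with 5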
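
A runnable sketch of the missing-value checks on a made-up table. Filling with the column means in the last line is an extra illustration; the notes only show a constant fill.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one NaN in column A and one in column B.
df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [4.0, 5.0, np.nan],
                   'C': [7.0, 8.0, 9.0]})

has_nan_by_col = df.isnull().any()            # one True/False per column
rows_with_nan = df[df.isnull().any(axis=1)]   # only the rows that contain a NaN
filled_const = df.fillna(5)                   # replace every NaN with 5
filled_mean = df.fillna(df.mean())            # replace each NaN with its column's mean
```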
