Author: 老杨啊小强
Source: Python数据科学家修炼之路
This article walks through the Python commands most commonly used during data analysis.
Without further ado, straight to the good stuff:
Contents
Inspecting the data
Renaming columns
Removing duplicates
Merging, sorting, splitting columns, group labels
Extracting and selecting data
Filtering data
Grouping and aggregating
Descriptive statistics
Exporting data
01
Inspecting the data
First, let's create some sample data:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {"id": ['1001', '1002', '1003', '1004', '1005', '1006'],
     "date": pd.date_range("20200102", periods=6),
     "city": ['Beijing', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING'],
     "age": [23, 44, 54, 43, 32, 32],
     "category": ['100-A', '100-B', '110-A', '110-C', '210-A', '130-F'],
     "price": [1200, np.nan, 2133, 533, np.nan, 4432]},
    columns=["id", "date", "city", "age", "category", "price"]
)
df
Here is the data we created: 6 rows and 6 columns.
Size of the data: how many rows and columns
df.shape  # number of rows and columns
Returns (6, 6): 6 rows, 6 columns
Check the data types
Method 1: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
id 6 non-null object
date 6 non-null datetime64[ns]
city 6 non-null object
age 6 non-null int64
category 6 non-null object
price 4 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(1), object(3)
memory usage: 368.0+ bytes
Method 2: df.dtypes
id object
date datetime64[ns]
city object
age int64
category object
price float64
dtype: object
Method 3: df['date'].dtypes  # a single column
dtype('<M8[ns]')
Missing values
df.isnull()
id date city age category price
0 False False False False False False
1 False False False False False True
2 False False False False False False
3 False False False False False False
4 False False False False False True
5 False False False False False False
This tells you where the missing values are (the True cells), but with tens or hundreds of thousands of rows it is hard to spot them all this way. A per-column count is much clearer:
df.isnull().sum().sort_values(ascending=False)
price 2
category 0
age 0
city 0
date 0
id 0
dtype: int64
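isnull().sum() gives raw counts; dividing by the row count, equivalently isnull().mean(), gives each column's missing share, which is easier to compare across datasets of different sizes. A minimal sketch on a small frame mirroring the example above:

```python
import pandas as pd
import numpy as np

# a tiny frame with two gaps in price, like the example data
df = pd.DataFrame({
    "id": ['1001', '1002', '1003', '1004', '1005', '1006'],
    "price": [1200, np.nan, 2133, 533, np.nan, 4432],
})

# fraction of missing values per column, most-missing first
missing_ratio = df.isnull().mean().sort_values(ascending=False)
print(missing_ratio)
```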
How many distinct cities are there:
df['city'].unique()
array(['Beijing', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING'],
dtype=object)
Extract the values, converting the DataFrame into a NumPy array:
df.values
array([['1001', Timestamp('2020-01-02 00:00:00'), 'Beijing', 23, '100-A',
1200.0],
['1002', Timestamp('2020-01-03 00:00:00'), 'SH', 44, '100-B', nan],
['1003', Timestamp('2020-01-04 00:00:00'), ' guangzhou ', 54,
'110-A', 2133.0],
['1004', Timestamp('2020-01-05 00:00:00'), 'Shenzhen', 43,
'110-C', 533.0],
['1005', Timestamp('2020-01-06 00:00:00'), 'shanghai', 32,
'210-A', nan],
['1006', Timestamp('2020-01-07 00:00:00'), 'BEIJING', 32, '130-F',
4432.0]], dtype=object)
After conversion, regular NumPy slicing works:
arr = df.values
arr[:3, -3:]  # first 3 rows, last 3 columns
array([[23, '100-A', 1200.0],
[44, '100-B', nan],
[54, '110-A', 2133.0]], dtype=object)
Extract all the column names:
df.columns
Index(['id', 'date', 'city', 'age', 'category', 'price'], dtype='object')
View the first three and the last three rows:
df.head(3)
df.tail(3)
Handling missing values
df.dropna(how='any')  # drop every row that contains a missing value
df.fillna(value=0)    # or fill missing values with 0
print(df['price'].mean())  # here we fill price with its column mean instead
df['price'] = df['price'].fillna(df['price'].mean())
df
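Filling with the overall column mean is the simplest option; when the data has natural groups, filling each gap with its own group's mean is often more faithful. A sketch (the frame here is made up for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'city': ['a', 'a', 'b', 'b'],
                   'price': [100.0, np.nan, 300.0, np.nan]})

# transform('mean') broadcasts each group's mean back to its rows,
# so fillna replaces each gap with the mean of its own city
df['price'] = df['price'].fillna(df.groupby('city')['price'].transform('mean'))
print(df['price'].tolist())
```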
String operations and type conversion:
# the city values carry stray spaces; strip leading/trailing whitespace
df['city'].map(str.strip)
0 Beijing
1 SH
2 guangzhou
3 Shenzhen
4 shanghai
5 BEIJING
Name: city, dtype: object
# lowercase every city
df['city'].str.lower()
0 beijing
1 sh
2 guangzhou
3 shenzhen
4 shanghai
5 beijing
Name: city, dtype: object
# uppercase every city
df['city'].map(str.upper)
0 BEIJING
1 SH
2 GUANGZHOU
3 SHENZHEN
4 SHANGHAI
5 BEIJING
Name: city, dtype: object
# convert price from float to int
df['price'].astype('int')
0    1200
1    2074
2    2133
3     533
4    2074
5    4432
Name: price, dtype: int32
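Note that astype('int') only works here because the NaNs were filled first; on a column that still contains NaN it raises an error. If you want integers while keeping the gaps, pandas' nullable 'Int64' dtype is one option, sketched below:

```python
import pandas as pd
import numpy as np

price = pd.Series([1200, np.nan, 2133], name='price')

# astype('int') would fail on the NaN; the nullable 'Int64' dtype
# stores integers alongside a proper missing-value marker
price_int = price.astype('Int64')
print(price_int)
```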
02
Renaming columns
df.rename(columns={'category':'category-size'})
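rename() returns a new DataFrame and leaves df untouched, so assign the result back (or pass inplace=True) if you want to keep the change. A quick sketch:

```python
import pandas as pd

df = pd.DataFrame({'category': ['100-A'], 'price': [1200]})

# assign the renamed frame back; without this, df keeps its old columns
df = df.rename(columns={'category': 'category-size'})
print(df.columns.tolist())
```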
03
Removing duplicates
df['city'].str.lower().drop_duplicates()
0 beijing
1 sh
2 guangzhou
3 shenzhen
4 shanghai
Name: city, dtype: object
df['city'].str.lower().drop_duplicates(keep='last')
1 sh
2 guangzhou
3 shenzhen
4 shanghai
5 beijing
Name: city, dtype: object
df['city'].str.lower().replace('sh','shanghai').drop_duplicates(keep='last')
2 guangzhou
3 shenzhen
4 shanghai
5 beijing
Name: city, dtype: object
To remove duplicate data, call the drop_duplicates() method; keep='last' keeps the last occurrence instead of the first.
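The examples above deduplicate a single Series; on a whole DataFrame, drop_duplicates() compares entire rows by default, and subset= restricts the comparison to chosen columns. A sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({'city': ['beijing', 'sh', 'beijing'],
                   'price': [1200, 2074, 4432]})

# full-row dedup would keep all three rows (the prices differ);
# subset=['city'] keeps only the first row per city
deduped = df.drop_duplicates(subset=['city'], keep='first')
print(deduped)
```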
04
Merge | Sort | Split columns | Group labels
df1 = pd.DataFrame(
    {"id": ['1001', '1002', '1003', '1004', '1005', '1006', '1007', '1008'],
     'gender': ['male', 'female', 'male', 'female', 'male', 'female', 'male', 'female'],
     'pay': ['N', 'N', 'Y', 'N', 'Y', 'Y', 'N', 'Y'],
     'm-point': [10, 12, 20, 40, 40, 30, 20, 40]})
df1
id gender pay m-point
0 1001 male N 10
1 1002 female N 12
2 1003 male Y 20
3 1004 female N 40
4 1005 male Y 40
5 1006 female Y 30
6 1007 male N 20
7 1008 female Y 40
We build a second frame, df1, and merge it with the original df using merge():
df_inner=pd.merge(df,df1,how='inner')
df_inner
merge() also supports other join types: outer, left and right
# pd.merge(df,df1,how="outer")
# pd.merge(df,df1,how="left")
# pd.merge(df,df1,how="right")
pd.merge(df,df1,left_on='id',right_on='id',how='left')
# set the merged frame's index to id
df_inner.set_index('id')
# sort by age, ascending
df_inner.sort_values(by=['age'],ascending=True)
# sort by index, which restores the original order
df_inner.sort_index()
Create a new column, group: mark rows with price above 3000 as high, the rest as low
# label the data by group
df_inner['group']=np.where(df_inner['price']>3000,'high','low')
df_inner
Create a new column, sign: where city is beijing and price is at least 4000, mark it 1
df_inner['city']=df_inner['city'].map(str.lower)  # lowercase city first
df_inner.loc[(df_inner['city']=='beijing') & (df_inner['price']>=4000),'sign']=1
df_inner
Splitting a column
# split category into parts
split=pd.DataFrame([x.split("-") for x in df_inner['category']],index=df_inner.index,columns=['category-1','category-2'])
split
category-1 category-2
0 100 A
1 100 B
2 110 A
3 110 C
4 210 A
5 130 F
# merge the split columns back into the original data
df_inner=pd.merge(df_inner,split,right_index=True,left_index=True)
df_inner
You must specify the right table's index or a join key, otherwise merge raises "MergeError: Must pass right_on or right_index=True":
pd.merge(df_inner,split,left_index=True)  # wrong: raises MergeError
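When both sides are already aligned on their index, DataFrame.join() is a shorthand for the same index-on-index merge and cannot forget the right_index flag. A sketch with two tiny frames:

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2]}, index=[0, 1])
right = pd.DataFrame({'b': [3, 4]}, index=[0, 1])

# join() aligns on the index by default, equivalent to
# pd.merge(left, right, left_index=True, right_index=True)
joined = left.join(right)
print(joined)
```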
05
Extracting and selecting data
Extract by label with loc
df_inner.loc[3:5]  # rows with index labels 3 through 5 (inclusive)
# set date as the index
df_inner=df_inner.set_index('date')
# slice by date: everything up to '2020-01-04'
df_inner[:'2020-01-04']
id city age category price gender pay m-point group sign category-1 category-2
date
2020-01-02 1001 beijing 23 100-A 1200.0 male N 10 low NaN 100 A
2020-01-03 1002 sh 44 100-B 2074.5 female N 12 low NaN 100 B
2020-01-04 1003 guangzhou 54 110-A 2133.0 male Y 20 low NaN 110 A
# rows up to '2020-01-04', first 4 columns
df_inner.ix[:'2020-01-04',:4]
D:\Anaconda\lib\site-packages\ipykernel_launcher.py:2: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
id city age category
date
2020-01-02 1001 beijing 23 100-A
2020-01-03 1002 sh 44 100-B
2020-01-04 1003 guangzhou 54 110-A
Indexing with DataFrame.ix[] raises a deprecation warning telling us to use loc for label-based access and iloc for positional access, but it still returns a result: .ix accepted both labels and positions. (In current pandas versions .ix has been removed entirely.)
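Since .ix is gone in current pandas, the same slice is written with loc: a label slice over the rows and a label slice over the columns, both ends inclusive. A minimal sketch with a date index (the columns mirror the example above):

```python
import pandas as pd

df = pd.DataFrame(
    {'id': ['1001', '1002', '1003'], 'age': [23, 44, 54]},
    index=pd.to_datetime(['2020-01-02', '2020-01-03', '2020-01-04']))

# loc replacement for the deprecated .ix: label slices are
# inclusive on both ends, for rows and for columns alike
subset = df.loc[:'2020-01-03', 'id':'age']
print(subset)
```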
Extract by position
# first 3 rows, first 2 columns
df_inner.iloc[:3,:2]
id city
date
2020-01-02 1001 beijing
2020-01-03 1002 sh
2020-01-04 1003 guangzhou
# rows 1, 3 and 6; columns 5 and 6
df_inner.iloc[[0,2,5],[4,5]]
price gender
date
2020-01-02 1200.0 male
2020-01-04 2133.0 male
2020-01-07 4432.0 female
# filter with a boolean mask
df_inner['city'].isin(['beijing'])
date
2020-01-02 True
2020-01-03 False
2020-01-04 False
2020-01-05 False
2020-01-06 False
2020-01-07 True
Name: city, dtype: bool
df_inner.loc[df_inner['city'].isin(['beijing'])]
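isin() needs exact values; for substring matches, Series.str.contains() builds the boolean mask instead. A sketch:

```python
import pandas as pd

city = pd.Series(['beijing', 'sh', 'guangzhou', 'shenzhen'])

# True wherever the value contains the substring 'shen'
mask = city.str.contains('shen')
print(city[mask].tolist())
```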
String slicing
category=df_inner['category']
category
date
2020-01-02 100-A
2020-01-03 100-B
2020-01-04 110-A
2020-01-05 110-C
2020-01-06 210-A
2020-01-07 130-F
Name: category, dtype: object
pd.DataFrame(category.str[:3])  # first three characters
category
date
2020-01-02 100
2020-01-03 100
2020-01-04 110
2020-01-05 110
2020-01-06 210
2020-01-07 130
pd.DataFrame(category.str[-1:])  # last character only
category
date
2020-01-02 A
2020-01-03 B
2020-01-04 A
2020-01-05 C
2020-01-06 A
2020-01-07 F
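Fixed-width str slicing works here because every category value has the same shape; str.split(expand=True) handles variable-length parts and replaces the list comprehension used earlier for splitting. A sketch:

```python
import pandas as pd

category = pd.Series(['100-A', '100-B', '110-C'])

# expand=True returns a DataFrame, one column per split part
split = category.str.split('-', expand=True)
split.columns = ['category-1', 'category-2']
print(split)
```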
06
Filtering data
df_inner['city']=df_inner['city'].replace('sh','shanghai')
df_inner
1. Rows where age is above 25 and city is shanghai, keeping the 'id','city','age','category','gender' columns
df_inner.loc[(df_inner['age']>25) & (df_inner['city']=='shanghai'),['id','city','age','category','gender']]
id city age category gender
date
2020-01-03 1002 shanghai 44 100-B female
2020-01-06 1005 shanghai 32 210-A male
2. Rows where age is above 25 or city is shanghai,
keeping the 'id','city','age','category','gender','price' columns
df_inner.loc[(df_inner['age']>25) | (df_inner['city']=='shanghai'),['id','city','age','category','gender','price']]
id city age category gender price
date
2020-01-03 1002 shanghai 44 100-B female 2074.5
2020-01-04 1003 guangzhou 54 110-A male 2133.0
2020-01-05 1004 shenzhen 43 110-C female 533.0
2020-01-06 1005 shanghai 32 210-A male 2074.5
2020-01-07 1006 beijing 32 130-F female 4432.0
3. Rows where age is above 25 or city is shanghai,
same columns as above, sorted by age in ascending order
df_inner.loc[(df_inner['age']>25) | (df_inner['city']=='shanghai'),['id','city','age','category','gender','price']]\
.sort_values(by='age',axis=0,ascending=True)
id city age category gender price
date
2020-01-06 1005 shanghai 32 210-A male 2074.5
2020-01-07 1006 beijing 32 130-F female 4432.0
2020-01-05 1004 shenzhen 43 110-C female 533.0
2020-01-03 1002 shanghai 44 100-B female 2074.5
2020-01-04 1003 guangzhou 54 110-A male 2133.0
4. Every city except shanghai, same columns, sorted by price in descending order
df_inner.loc[(df_inner['city']!='shanghai'),['id','city','age','category','gender','price']]\
.sort_values(by="price",ascending=False)
id city age category gender price
date
2020-01-07 1006 beijing 32 130-F female 4432.0
2020-01-04 1003 guangzhou 54 110-A male 2133.0
2020-01-02 1001 beijing 23 100-A male 1200.0
2020-01-05 1004 shenzhen 43 110-C female 533.0
5. Excluding shanghai, what do the prices sum to?
df_inner.loc[(df_inner['city']!='shanghai'),['id','city','age','category','gender','price']].sort_values(by="price")\
.price.sum()
8298.0
Filtering with the query() method
# filter with query()
df_inner.query("city==['beijing','shanghai']")
id city age category price gender pay m-point group sign category-1 category-2
date
2020-01-02 1001 beijing 23 100-A 1200.0 male N 10 low NaN 100 A
2020-01-03 1002 shanghai 44 100-B 2074.5 female N 12 low NaN 100 B
2020-01-06 1005 shanghai 32 210-A 2074.5 male Y 40 low NaN 210 A
2020-01-07 1006 beijing 32 130-F 4432.0 female Y 30 high 1.0 130 F
df_inner.query("city==['beijing','shanghai']").price.sum()
9781.0
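query() can also reference Python variables with the @ prefix, which keeps thresholds out of the expression string. A sketch on made-up rows:

```python
import pandas as pd

df = pd.DataFrame({'city': ['beijing', 'shanghai', 'shenzhen'],
                   'price': [1200.0, 2074.5, 533.0]})

# @threshold pulls the local variable into the query expression
threshold = 1000
expensive = df.query('price > @threshold')
print(expensive)
```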
07
Grouping and aggregating
df_inner.groupby("city")['price'].sum()
city
guangzhou 2133.0
beijing 5632.0
shanghai 4149.0
shenzhen 533.0
Name: price, dtype: float64
# The above returns a Series; to get a DataFrame back, set as_index=False
df_inner.groupby("city",as_index=False)['price'].sum()
city price
0 guangzhou 2133.0
1 beijing 5632.0
2 shanghai 4149.0
3 shenzhen 533.0
Count, sum and mean of price for each city:
df_inner.groupby('city')['price'].agg([len,np.sum,np.mean])
len sum mean
city
guangzhou 1.0 2133.0 2133.0
beijing 2.0 5632.0 2816.0
shanghai 2.0 4149.0 2074.5
shenzhen 1.0 533.0 533.0
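Passing builtins like len or NumPy functions to agg() works, but string names give the same result with cleaner column labels, and newer pandas versions warn about passing np.sum/np.mean directly. A sketch on made-up rows:

```python
import pandas as pd

df = pd.DataFrame({'city': ['beijing', 'beijing', 'shanghai'],
                   'price': [1200.0, 4432.0, 2074.5]})

# string aggregation names: 'count', 'sum', 'mean'
stats = df.groupby('city')['price'].agg(['count', 'sum', 'mean'])
print(stats)
```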
Pivot tables in pandas:
pd.pivot_table(df_inner,index=['city'],values=['price'],columns=['pay'],aggfunc=[len,np.sum,np.mean],\
fill_value=np.nan,margins=True)
08
Descriptive statistics
df_inner.sample(n=3)  # draw 3 rows without replacement
df_inner.sample(n=3,replace=True)  # with replacement
df_inner.sample(frac=0.5)  # draw 50% of the rows, without replacement
help(pd.DataFrame.sample)  # see the other parameters; random_state is the random seed
# sample(self, n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)
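random_state is worth a closer look: fixing it makes a sample reproducible, which matters when you want colleagues to see the same rows. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'x': range(10)})

# the same random_state yields the same rows on every run
s1 = df.sample(n=3, random_state=42)
s2 = df.sample(n=3, random_state=42)
print(s1.index.tolist())
```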
View the summary statistics (count, mean, std, min, quartiles, max):
df_inner.describe().round(2).T
count mean std min 25% 50% 75% max
age 6.0 38.00 11.08 23.0 32.00 37.5 43.75 54.0
price 6.0 2074.50 1319.26 533.0 1418.62 2074.5 2118.38 4432.0
m-point 6.0 25.33 13.37 10.0 14.00 25.0 37.50 40.0
sign 1.0 1.00 NaN 1.0 1.00 1.0 1.00 1.0
# standard deviation of price
df_inner['price'].std()
1319.2574426547685
# covariance of m-point and price
df_inner['price'].cov(df_inner['m-point'])
298.0
# covariance matrix of all numeric columns
df_inner.cov().round(2)
age price m-point sign
age 122.8 -1559.8 4.40 NaN
price -1559.8 1740440.2 298.00 NaN
m-point 4.4 298.0 178.67 NaN
sign NaN NaN NaN NaN
# correlation between m-point and price
df_inner['price'].corr(df_inner['m-point'])
0.016899153221483634
# correlation matrix of all numeric columns
df_inner.corr().round(2)
age price m-point sign
age 1.00 -0.11 0.03 NaN
price -0.11 1.00 0.02 NaN
m-point 0.03 0.02 1.00 NaN
sign NaN NaN NaN NaN
09
Exporting data
# write to Excel
df_inner.to_excel('../from_elcel_to_python.xlsx',sheet_name="data_001")
df_inner.to_csv('../from_elcel_to_python.csv')
The files are saved under the parent directory; download them and open them to check the result.
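To close the loop, the exported file can be read straight back with pd.read_csv(); passing index=False on export keeps the row index out of the file, and dtype= preserves string columns like id. A sketch writing to a temporary directory (the path below is illustrative):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'id': ['1001', '1002'], 'price': [1200.0, 4432.0]})

# write without the index, then read back, keeping id as a string
path = os.path.join(tempfile.mkdtemp(), 'out.csv')
df.to_csv(path, index=False)
df2 = pd.read_csv(path, dtype={'id': str})
print(df2)
```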
That's it for this round of Python data-analysis basics. If you find yourself using these commands often, bookmark this page so you don't have to search again. ^_^