Pandas Basic Operations (Part 1)

1152：从入门到脱发

Article link: https://blog.csdn.net/spike1988zc/article/details/108089560

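Contents

I. Reading files with Pandas
- 1. Reading data with pandas
- 1. Reading plain-text files
  - 1.1 Reading a CSV with the default header row and comma separator
  - 1.2 Reading a txt file with a custom separator and column names
- 2. Reading Excel files
- 3. Reading from a SQL database
II. pandas data structures: DataFrame and Series
  - DataFrame: two-dimensional data, a whole table with multiple rows and columns
- 1. Series
- 2. DataFrame
  - 2.1 Creating a DataFrame from a dict of lists
- Selecting a Series from a DataFrame
III. Five ways to query data with Pandas
- Ways to query data in Pandas
- Querying data with df.loc
- Note
- 0. Load the data
- 1. Look up by a single label
- 2. Look up by a list of labels
- 3. Look up by a range of labels
- 4. Look up by a conditional expression
  - Combining conditions: finding the days with perfect weather
- 5. Look up by calling a function
IV. Adding new columns in Pandas
- 0. Load the CSV into a DataFrame
- 1. Direct assignment
- 2. The df.apply method
- 3. The df.assign method
- 4. Assigning values group by group based on a condition
V. Pandas statistics functions
- 0. Load the CSV data
- 1. Summary statistics
- 2. Unique values and value counts
  - 2.1 Unique values
  - 2.2 Value counts
- 3. Correlation and covariance
VI. Handling missing values in Pandas
- Example: reading, cleaning, and processing an irregular Excel file
- Step 1: skip the leading blank rows when reading the Excel file
- Step 2: detect missing values
- Step 3: drop columns that are entirely empty
- Step 4: drop rows that are entirely empty
- Step 5: fill missing scores with 0
- Step 6: fill in missing names
- Step 7: save the cleaned Excel file
VII. The Pandas SettingWithCopyWarning
- 0. Load the data
- 1. Reproducing the warning
- 2. Cause
- 4. Fix 2
  - Pandas does not allow filtering out a sub-DataFrame first and then writing modifications to it
VIII. Sorting data in Pandas
- 0. Load the data
- 1. Sorting a Series
- 2. Sorting a DataFrame
  - 2.1 Sorting by one column
  - 2.2 Sorting by multiple columns
IX. String handling in Pandas
- 0. Load the Beijing 2018 weather data
- 1. Get the Series str accessor and use its string functions
- 4. Working with regular expressions
  - Series.str enables regular-expression mode by default
X. How to understand the Pandas axis parameter
  - The axis you name is the one that moves (as if iterated by a for loop); the other axes stay fixed
- 1. Dropping a single column
- 3. Aggregating with mean along axis=0/index
  - The axis you name is the one that moves (as if iterated by a for loop); the other axes stay fixed
- 3. Aggregating with mean along axis=1/columns
  - The axis you name is the one that moves (as if iterated by a for loop); the other axes stay fixed
- 5. Another example to reinforce the idea
  - The axis you name is the one that moves (as if iterated by a for loop); the other axes stay fixed
XI. What the Pandas index is for
- 1. Querying data by index
- 2. Using the index speeds up queries
- Experiment 1: completely random lookup order
- Experiment 2: lookups after sorting the index
- 3. The index aligns data automatically
  - s1 and s2 both have labels b and c, while a and d each appear in only one of them and cannot be aligned, so their sums are NaN
- 4. The index supports more powerful data structures
XII. Merging DataFrames with Pandas merge
  - merge syntax:
- 1. A join example on the movie data set
  - The movie ratings data set
- 2. Understanding how row counts align in a merge
- 3. The differences between left join, right join, inner join, and outer join
- 4. What to do when non-key columns share a name
XIII. Combining data with Pandas concat
- I. Combining data with pandas.concat
- 1. The default concat: axis=0, join=outer, ignore_index=False
- 2. ignore_index=True discards the original index
- 3. join=inner filters out non-matching columns
- 4. axis=1 is equivalent to adding new columns
  - A: Adding one Series as a column
  - B: Adding several Series as columns
- II. Appending rows with DataFrame.append
XIV. Splitting and merging Excel files in bulk with Pandas
- 0. Load the source Excel file into Pandas
- 1. Split one large Excel file into several equal parts
- 2. Merge several small Excel files into one large file
XV. Grouped statistics with Pandas groupby
- 1. Grouping and computing statistics with aggregate functions
- 2. Iterating over groupby results to understand the execution flow
  - 2.1 Iterating over groups keyed by a single column
  - 2.2 Iterating over groups keyed by multiple columns
- 3. Example: exploring the weather data by group
  - 3.1 Highest temperature in each month
  - 3.2 Highest temperature, lowest temperature, and mean air quality index in each month
XVI. Hierarchical indexing with the Pandas MultiIndex
- 1. MultiIndex on a Series
- 2. How to filter a Series that has a MultiIndex
- 3. MultiIndex on a DataFrame
- 4. How to filter a DataFrame that has a MultiIndex
XVII. The Pandas transformation functions map, apply, and applymap
- 1. map transforms the values of a Series
  - Method 1: Series.map(dict)
  - Method 2: Series.map(function)
- 2. apply transforms a Series or a DataFrame
  - Series.apply(function)
  - DataFrame.apply(function)
- 3. applymap transforms every value of a DataFrame
XVIII. Applying a function to each group with apply
- Example 1: normalizing a numeric column within each group
  - Demo: normalizing users' movie ratings
- Example 2: taking the top N rows of each group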

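I. Reading files with Pandas

1. Reading data with pandas

pandas first loads tabular data, then analyzes it.

Data type	Description	pandas reader
csv, tsv, txt	Plain-text files delimited by commas or tabs	pd.read_csv
excel	Microsoft xls or xlsx files	pd.read_excel
mysql	Relational database tables	pd.read_sql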

In [1]:

import pandas as pd

1. Reading plain-text files

1.1 Reading a CSV with the default header row and comma separator

In [2]:

fpath = "./pandas-learn-code/datas/ml-latest-small/ratings.csv"

In [3]:

# Read the data with pd.read_csv
ratings = pd.read_csv(fpath)

In [4]:

# Show the first few rows
ratings.head()

Out[4]:

	userId	movieId	rating	timestamp
0	1	1	4.0	964982703
1	1	3	4.0	964981247
2	1	6	4.0	964982224
3	1	47	5.0	964983815
4	1	50	5.0	964982931

In [5]:

# Shape of the data, returned as (rows, columns)
ratings.shape

Out[5]:

(100836, 4)

In [6]:

# List of column names
ratings.columns

Out[6]:

Index(['userId', 'movieId', 'rating', 'timestamp'], dtype='object')

In [7]:

# The row index
ratings.index

Out[7]:

RangeIndex(start=0, stop=100836, step=1)

In [9]:

# Data type of each column
ratings.dtypes

Out[9]:

userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object

1.2 Reading a txt file with a custom separator and column names

In [10]:

fpath = "./pandas-learn-code/datas/crazyant/access_pvuv.txt"

In [11]:

pvuv = pd.read_csv(fpath, sep="\t", header=None, names=["pdate","pv","uv"])

sep is the field separator (a tab here)
header=None means the file has no header row
names supplies the column names to use
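To see how the three parameters work together without the original file, the sketch below reads a small tab-separated string from memory. The two sample rows are copied from the output shown in this section; the names `raw` and `sample` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for access_pvuv.txt: tab-separated, no header row
raw = "2019-09-10\t139\t92\n2019-09-09\t185\t153\n"

sample = pd.read_csv(
    io.StringIO(raw),
    sep="\t",                      # fields are separated by tabs
    header=None,                   # the first line is data, not column names
    names=["pdate", "pv", "uv"],   # so the column names are supplied explicitly
)

print(sample)
print(sample.dtypes)
```

Without header=None, pandas would treat the first data row as the header and silently lose one row of data.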

In [13]:

pvuv.head()

Out[13]:

	pdate	pv	uv
0	2019-09-10	139	92
1	2019-09-09	185	153
2	2019-09-08	123	59
3	2019-09-07	65	40
4	2019-09-06	157	98

2. Reading Excel files

In [18]:

fpath = "./pandas-learn-code/datas/crazyant/access_pvuv.xlsx"
pvuv = pd.read_excel(fpath)

In [19]:

pvuv

Out[19]:

	日期	PV	UV
0	2019-09-10	139	92
1	2019-09-09	185	153
2	2019-09-08	123	59
3	2019-09-07	65	40
4	2019-09-06	157	98
5	2019-09-05	205	151
6	2019-09-04	196	167
7	2019-09-03	216	176
8	2019-09-02	227	148
9	2019-09-01	105	61

3. Reading from a SQL database

In [36]:

import pymysql
conn = pymysql.connect(
    host="127.0.0.1",
    user="root",
    password="123456",
    database="test",
    charset="utf8"
)

In [41]:

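# fpath points at the SQL script (presumably the dump that creates crazyant_pvuv); pandas queries the database itself, not this file
fpath = "./pandas-learn-code/datas/crazyant/test_crazyant_pvuv.sql"
mysql_page = pd.read_sql("select * from crazyant_pvuv", con=conn)

In [42]:

# Note: this cell displays pvuv, the DataFrame loaded from Excel above; inspect mysql_page to see the query result
pvuv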

Out[42]:

	日期	PV	UV
0	2019-09-10	139	92
1	2019-09-09	185	153
2	2019-09-08	123	59
3	2019-09-07	65	40
4	2019-09-06	157	98
5	2019-09-05	205	151
6	2019-09-04	196	167
7	2019-09-03	216	176
8	2019-09-02	227	148
9	2019-09-01	105	61

II. pandas data structures: DataFrame and Series

DataFrame: two-dimensional data, a whole table with multiple rows and columns

In [1]:

import pandas as pd
import numpy as np

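1. Series

A Series is a one-dimensional, array-like object made up of a sequence of values (which may be of different types) and an associated set of labels, its index.

1.1 A plain list of values is enough to create the simplest Series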

In [3]:

s1 = pd.Series([1,'a',5.2,7])

In [5]:

# Index on the left, values on the right
s1.head()

Out[5]:

0      1
1      a
2    5.2
3      7
dtype: object

In [6]:

# Get the index
s1.index

Out[6]:

RangeIndex(start=0, stop=4, step=1)

In [7]:

# Get the values
s1.values

Out[7]:

array([1, 'a', 5.2, 7], dtype=object)

1.2 Creating a Series with a label index

In [8]:

s2 = pd.Series([1,'a',5.2,7], index=['a','b','c','d'])

In [9]:

s2

Out[9]:

a      1
b      a
c    5.2
d      7
dtype: object

In [10]:

s2.index

Out[10]:

Index(['a', 'b', 'c', 'd'], dtype='object')

1.3 Creating a Series from a Python dict

In [11]:

sdata = {'Ohio':35000, 'Texas':72000, 'Oregon':16000, 'Utah':5000}

In [13]:

s3 = pd.Series(sdata)

In [14]:

# The dict keys become the Series index
s3

Out[14]:

Ohio      35000
Texas     72000
Oregon    16000
Utah       5000
dtype: int64

1.4 Looking up data by index label

This works much like a Python dict.

In [15]:

s2

Out[15]:

a      1
b      a
c    5.2
d      7
dtype: object

In [20]:

s2['a']

Out[20]:

1

In [21]:

# Looking up a single value returns that value's own type
type(s2['a'])

Out[21]:

int

In [18]:

# Look up several values at once
s2[['a','b','c']]

Out[18]:

a      1
b      a
c    5.2
dtype: object

In [22]:

# Looking up several values still returns a Series
type(s2[['a','b','c']])

Out[22]:

pandas.core.series.Series
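The dict analogy goes a little further than bracket lookup. As a minimal sketch, reusing the same `s2` values as above, the block below shows that membership tests and `.get` also operate on the labels rather than on the values:

```python
import pandas as pd

# Same Series as s2 in this section: mixed values with string labels
s2 = pd.Series([1, 'a', 5.2, 7], index=['a', 'b', 'c', 'd'])

# `in` checks the index labels, just as it checks the keys of a dict
print('a' in s2)
print('z' in s2)

# .get returns a default instead of raising KeyError for a missing label
print(s2.get('z', 'missing'))

# A Series converts back to a plain dict keyed by its labels
print(s2.to_dict())
```

Note that `'a' in s2` is true because `'a'` is a label, even though `'a'` also happens to appear as a value under label `'b'`.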

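2. DataFrame

A DataFrame is a tabular data structure:

Each column can hold a different value type (numeric, string, boolean, etc.)
It has both a row index (index) and a column index (columns)
It can be thought of as a dict of Series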
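The "dict of Series" view can be made concrete. The sketch below, which uses illustrative names `year`, `pop`, and `frame` and a few values from this section's example, builds a DataFrame from two Series and then walks over its columns the way one would walk over dict items:

```python
import pandas as pd

# Two independent Series that share the default 0..2 index
year = pd.Series([2000, 2001, 2002])
pop = pd.Series([1.5, 1.7, 3.6])

# The dict keys become column names; the Series are aligned on their index
frame = pd.DataFrame({"year": year, "pop": pop})

# Iterating over items() yields (column name, column Series) pairs
for col_name, col in frame.items():
    print(col_name, type(col).__name__, col.dtype)
```

Each column keeps its own dtype, which is why a single DataFrame can mix integers, floats, and strings.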

2.1 Creating a DataFrame from a dict of lists

In [24]:

data = {
    'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
    'year':[2000,2001,2002,2003,2004],
    'pop':[1.5,1.7,3.6,2.4,2.9]
}
df = pd.DataFrame(data)

In [25]:

df

Out[25]:

	state	year	pop
0	Ohio	2000	1.5
1	Ohio	2001	1.7
2	Ohio	2002	3.6
3	Nevada	2003	2.4
4	Nevada	2004	2.9

In [26]:

df.dtypes

Out[26]:

state     object
year       int64
pop      float64
dtype: object

In [27]:

df.columns

Out[27]:

Index(['state', 'year', 'pop'], dtype='object')

In [28]:

df.index

Out[28]:

RangeIndex(start=0, stop=5, step=1)

Selecting a Series from a DataFrame

Selecting a single row or a single column returns a pd.Series
Selecting multiple rows or multiple columns returns a pd.DataFrame

In [29]:

df

Out[29]:

	state	year	pop
0	Ohio	2000	1.5
1	Ohio	2001	1.7
2	Ohio	2002	3.6
3	Nevada	2003	2.4
4	Nevada	2004	2.9

3.1 Selecting one column returns a pd.Series

In [30]:

df['year']

Out[30]:

0    2000
1    2001
2    2002
3    2003
4    2004
Name: year, dtype: int64

In [35]:

# The result is a Series
type(df['year'])

Out[35]:

pandas.core.series.Series

3.2 Selecting multiple columns returns a pd.DataFrame

In [33]:

df[['year', 'pop']]

Out[33]:

	year	pop
0	2000	1.5
1	2001	1.7
2	2002	3.6
3	2003	2.4
4	2004	2.9

In [34]:

# The result is a DataFrame
type(df[['year','pop']])

Out[34]:

pandas.core.frame.DataFrame

3.3 Selecting one row returns a pd.Series

In [39]:

df.loc[0]

Out[39]:

state    Ohio
year     2000
pop       1.5
Name: 0, dtype: object

In [40]:

type(df.loc[0])

Out[40]:

pandas.core.series.Series

3.4 Selecting multiple rows returns a pd.DataFrame

In [41]:

# Unlike Python list slicing, a .loc label slice includes the end label
df.loc[0:3]

Out[41]:

	state	year	pop
0	Ohio	2000	1.5
1	Ohio	2001	1.7
2	Ohio	2002	3.6
3	Nevada	2003	2.4

In [42]:

type(df.loc[0:3])

Out[42]:

pandas.core.frame.DataFrame
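The inclusive end of a `.loc` slice is easy to mix up with position-based slicing. As a small sketch on the same five-row table used above (the name `df_demo` is illustrative), compare the two:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2003, 2004],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9],
})

# Label-based slice: the end label 3 is included, so this is 4 rows
by_label = df_demo.loc[0:3]

# Position-based slice: follows Python semantics, end excluded, so 3 rows
by_position = df_demo.iloc[0:3]

print(len(by_label), len(by_position))
```

With the default RangeIndex the labels and positions coincide, which is exactly when this difference is most likely to go unnoticed.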

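III. Five ways to query data with Pandas

Ways to query data in Pandas

df.loc, which selects by row and column labels
df.iloc, which selects by integer row and column positions
df.where
df.query

.loc can both read and overwrite values, so it is the recommended method here.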
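The rest of this chapter concentrates on df.loc. For orientation, here is a hedged sketch of the other three methods on a tiny hand-made frame; the frame `weather` and its three rows are illustrative stand-ins, not the real data set.

```python
import pandas as pd

# Hypothetical three-day sample with a string date index
weather = pd.DataFrame(
    {"bWendu": [3, 2, 2], "yWendu": [-6, -5, -5], "aqi": [59, 49, 28]},
    index=["2018-01-01", "2018-01-02", "2018-01-03"],
)

# iloc: select by integer position (row 0, column 0)
first_cell = weather.iloc[0, 0]

# query: filter rows with an expression written as a string
low_aqi = weather.query("aqi < 50")

# where: keep the original shape and replace non-matching cells with NaN
masked = weather.where(weather > 0)

print(first_cell)
print(low_aqi)
print(masked)
```

query and loc with a boolean mask return only the matching rows, whereas where preserves the shape of the input, which is useful when the result must stay aligned with the original table.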

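Querying data with df.loc

Look up by a single label
Look up by a list of labels
Look up by a range (slice) of labels
Look up by a conditional expression
Look up by calling a function

Note

All of these query forms work on both rows and columns.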

In [3]:

import pandas as pd

0. Load the data

The data set is the daily Beijing weather record for all of 2018

In [4]:

df = pd.read_csv("./pandas-learn-code/datas/beijing_tianqi/beijing_tianqi_2018.csv")

In [5]:

df.head()

Out[5]:

	ymd	bWendu	yWendu	tianqi	fengxiang	fengli	aqi	aqiInfo	aqiLevel
0	2018-01-01	3℃	-6℃	晴~多云	东北风	1-2级	59	良	2
1	2018-01-02	2℃	-5℃	阴~多云	东北风	1-2级	49	优	1
2	2018-01-03	2℃	-5℃	多云	北风	1-2级	28	优	1
3	2018-01-04	0℃	-8℃	阴	东北风	1-2级	28	优	1
4	2018-01-05	3℃	-6℃	多云~晴	西北风	1-2级	50	优	1

In [6]:

# Use the date as the index so rows can be filtered by date
df.set_index('ymd', inplace=True)

In [7]:

df.head()

Out[7]:

	bWendu	yWendu	tianqi	fengxiang	fengli	aqi	aqiInfo	aqiLevel
ymd
2018-01-01	3℃	-6℃	晴~多云	东北风	1-2级	59	良	2
2018-01-02	2℃	-5℃	阴~多云	东北风	1-2级	49	优	1
2018-01-03	2℃	-5℃	多云	北风	1-2级	28	优	1
2018-01-04	0℃	-8℃	阴	东北风	1-2级	28	优	1
2018-01-05	3℃	-6℃	多云~晴	西北风	1-2级	50	优	1

In [8]:

# Time-series indexes are covered in a later lesson; here the dates are handled as strings
df.index

Out[8]:

Index(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05',
       '2018-01-06', '2018-01-07', '2018-01-08', '2018-01-09', '2018-01-10',
       ...
       '2018-12-22', '2018-12-23', '2018-12-24', '2018-12-25', '2018-12-26',
       '2018-12-27', '2018-12-28', '2018-12-29', '2018-12-30', '2018-12-31'],
      dtype='object', name='ymd', length=365)

In [9]:

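# Strip the ℃ suffix from the temperature columns and convert them to integers
# Plain column assignment replaces the column, so the new int32 dtype is kept
# (df.loc[:, col] = ... writes in place and can leave the column as object in pandas 2.x)
df["bWendu"] = df["bWendu"].str.replace("℃","").astype('int32')
df["yWendu"] = df["yWendu"].str.replace("℃","").astype('int32')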

In [10]:

# bWendu and yWendu are now integer columns
df.dtypes

Out[10]:

bWendu        int32
yWendu        int32
tianqi       object
fengxiang    object
fengli       object
aqi           int64
aqiInfo      object
aqiLevel      int64
dtype: object

1. Look up by a single label

Either the row or the column selector can be a single value, giving an exact match

In [11]:

# Returns a single value
df.loc['2018-01-03','bWendu']

Out[11]:

2

In [12]:

# Returns a Series
df.loc['2018-01-03',['bWendu', 'yWendu']]

Out[12]:

bWendu     2
yWendu    -5
Name: 2018-01-03, dtype: object

2. Look up by a list of labels

In [13]:

# Returns a Series
df.loc[['2018-01-03','2018-01-04','2018-01-05'], 'bWendu']

Out[13]:

ymd
2018-01-03    2
2018-01-04    0
2018-01-05    3
Name: bWendu, dtype: int32

In [14]:

# Returns a DataFrame
df.loc[['2018-01-03','2018-01-04','2018-01-05'], ['bWendu','yWendu']]

Out[14]:

	bWendu	yWendu
ymd
2018-01-03	2	-5
2018-01-04	0	-8
2018-01-05	3	-6

3. Look up by a range of labels

Note: the range includes both the start and the end

In [15]:

# Slice the row index
df.loc['2018-01-03':'2018-01-05', 'bWendu']

Out[15]:

ymd
2018-01-03    2
2018-01-04    0
2018-01-05    3
Name: bWendu, dtype: int32

In [16]:

# Slice the column index
df.loc['2018-01-03','bWendu':'fengxiang']

Out[16]:

bWendu        2
yWendu       -5
tianqi       多云
fengxiang    北风
Name: 2018-01-03, dtype: object

In [17]:

# Slice both rows and columns
df.loc['2018-01-03':'2018-01-05','bWendu':'fengxiang']

Out[17]:

	bWendu	yWendu	tianqi	fengxiang
ymd
2018-01-03	2	-5	多云	北风
2018-01-04	0	-8	阴	东北风
2018-01-05	3	-6	多云~晴	西北风

4. Look up by a conditional expression

The boolean list must be as long as the number of rows (or columns) it filters

In [23]:

df.loc[df["yWendu"]<-10,:]

Out[23]:

	bWendu	yWendu	tianqi	fengxiang	fengli	aqi	aqiInfo	aqiLevel
ymd
2018-01-23	-4	-12	晴	西北风	3-4级	31	优	1
2018-01-24	-4	-11	晴	西南风	1-2级	34	优	1
2018-01-25	-3	-11	多云	东北风	1-2级	27	优	1
2018-12-26	-2	-11	晴~多云	东北风	2级	26	优	1
2018-12-27	-5	-12	多云~晴	西北风	3级	48	优	1
2018-12-28	-3	-11	晴	西北风	3级	40	优	1
2018-12-29	-3	-12	晴	西北风	2级	29	优	1
2018-12-30	-2	-11	晴~多云	东北风	1级	31	优	1

In [24]:

df["yWendu"]<-10

Out[24]:

ymd
2018-01-01    False
2018-01-02    False
2018-01-03    False
2018-01-04    False
2018-01-05    False
2018-01-06    False
2018-01-07    False
2018-01-08    False
2018-01-09    False
2018-01-10    False
2018-01-11    False
2018-01-12    False
2018-01-13    False
2018-01-14    False
2018-01-15    False
2018-01-16    False
2018-01-17    False
2018-01-18    False
2018-01-19    False
2018-01-20    False
2018-01-21    False
2018-01-22    False
2018-01-23     True
2018-01-24     True
2018-01-25     True
2018-01-26    False
2018-01-27    False
2018-01-28    False
2018-01-29    False
2018-01-30    False
              ...  
2018-12-02    False
2018-12-03    False
2018-12-04    False
2018-12-05    False
2018-12-06    False
2018-12-07    False
2018-12-08    False
2018-12-09    False
2018-12-10    False
2018-12-11    False
2018-12-12    False
2018-12-13    False
2018-12-14    False
2018-12-15    False
2018-12-16    False
2018-12-17    False
2018-12-18    False
2018-12-19    False
2018-12-20    False
2018-12-21    False
2018-12-22    False
2018-12-23    False
2018-12-24    False
2018-12-25    False
2018-12-26     True
2018-12-27     True
2018-12-28     True
2018-12-29     True
2018-12-30     True
2018-12-31    False
Name: yWendu, Length: 365, dtype: bool

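Combining conditions: finding the days with perfect weather

Note: combine conditions with &, and wrap each individual condition in parentheses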

In [29]:

df.loc[(df["bWendu"]<=30) & (df["yWendu"]>=15) & (df["tianqi"]=="晴") & (df["aqiLevel"]==1),:]

Out[29]:

	bWendu	yWendu	tianqi	fengxiang	fengli	aqi	aqiInfo	aqiLevel
ymd
2018-08-24	30	20	晴	北风	1-2级	40	优	1
2018-09-07	27	16	晴	西北风	3-4级	22	优	1

In [30]:

(df["bWendu"]<=30) & (df["yWendu"]>=15) & (df["tianqi"]=="晴") & (df["aqiLevel"]==1)

Out[30]:

ymd
2018-01-01    False
2018-01-02    False
2018-01-03    False
2018-01-04    False
2018-01-05    False
2018-01-06    False
2018-01-07    False
2018-01-08    False
2018-01-09    False
2018-01-10    False
2018-01-11    False
2018-01-12    False
2018-01-13    False
2018-01-14    False
2018-01-15    False
2018-01-16    False
2018-01-17    False
2018-01-18    False
2018-01-19    False
2018-01-20    False
2018-01-21    False
2018-01-22    False
2018-01-23    False
2018-01-24    False
2018-01-25    False
2018-01-26    False
2018-01-27    False
2018-01-28    False
2018-01-29    False
2018-01-30    False
              ...  
2018-12-02    False
2018-12-03    False
2018-12-04    False
2018-12-05    False
2018-12-06    False
2018-12-07    False
2018-12-08    False
2018-12-09    False
2018-12-10    False
2018-12-11    False
2018-12-12    False
2018-12-13    False
2018-12-14    False
2018-12-15    False
2018-12-16    False
2018-12-17    False
2018-12-18    False
2018-12-19    False
2018-12-20    False
2018-12-21    False
2018-12-22    False
2018-12-23    False
2018-12-24    False
2018-12-25    False
2018-12-26    False
2018-12-27    False
2018-12-28    False
2018-12-29    False
2018-12-30    False
2018-12-31    False
Length: 365, dtype: bool
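The parentheses are needed because `&` binds more tightly than comparison operators in Python, so `df["bWendu"]<=30 & df["yWendu"]>=15` would be parsed around `30 & df["yWendu"]` and typically fails. The sketch below, on a small made-up frame named `temps`, shows the three element-wise operators `&` (and), `|` (or), and `~` (not):

```python
import pandas as pd

# Hypothetical four-day sample; the values are chosen only for illustration
temps = pd.DataFrame({
    "bWendu": [30, 27, 35, 20],
    "yWendu": [20, 16, 25, 5],
    "tianqi": ["晴", "晴", "多云", "晴"],
})

# AND: both conditions must hold for a row to be kept
both = temps[(temps["bWendu"] <= 30) & (temps["yWendu"] >= 15)]

# OR: either condition is enough
either = temps[(temps["bWendu"] > 33) | (temps["yWendu"] < 10)]

# NOT: invert a boolean mask
not_sunny = temps[~(temps["tianqi"] == "晴")]

print(both.index.tolist(), either.index.tolist(), not_sunny.index.tolist())
```

The Python keywords `and`, `or`, and `not` do not work here, because they try to reduce a whole Series to a single truth value.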

5. Look up by calling a function

In [31]:

# Pass a lambda directly
df.loc[lambda df: (df["bWendu"]<=30) & (df["yWendu"]>=15),:]

Out[31]:

	bWendu	yWendu	tianqi	fengxiang	fengli	aqi	aqiInfo	aqiLevel
ymd
2018-04-28	27	17	晴	西南风	3-4级	125	轻度污染	3
2018-04-29	30	16	多云	南风	3-4级	193	中度污染	4
2018-05-04	27	16	晴~多云	西南风	1-2级	86	良	2
2018-05-09	29	17	晴~多云	西南风	3-4级	79	良	2
2018-05-10	26	18	多云	南风	3-4级	118	轻度污染	3
2018-05-11	24	15	阴~多云	东风	1-2级	106	轻度污染	3
2018-05-12	28	16	小雨	东南风	3-4级	186	中度污染	4
2018-05-13	30	17	晴	南风	1-2级	68	良	2
2018-05-16	29	21	多云~小雨	东风	1-2级	142	轻度污染	3
2018-05-17	25	19	小雨~多云	北风	1-2级	70	良	2
2018-05-18	28	16	多云~晴	南风	1-2级	49	优	1
2018-05-19	27	16	多云~小雨	南风	1-2级	69	良	2
2018-05-20	21	16	阴~小雨	东风	1-2级	54	良	2
2018-05-23	29	15	晴	西南风	3-4级	153	中度污染	4
2018-05-26	30	17	小雨~多云	西南风	3-4级	143	轻度污染	3
2018-05-28	30	16	晴	西北风	4-5级	178	中度污染	4
2018-06-09	23	17	小雨	北风	1-2级	45	优	1
2018-06-10	27	17	多云	东南风	1-2级	51	良	2
2018-06-11	29	19	多云	西南风	3-4级	85	良	2
2018-06-13	28	19	雷阵雨~多云	东北风	1-2级	73	良	2
2018-06-18	30	21	雷阵雨	西南风	1-2级	112	轻度污染	3
2018-06-22	30	21	雷阵雨~多云	东南风	1-2级	83	良	2
2018-07-08	30	23	雷阵雨	南风	1-2级	73	良	2
2018-07-09	30	22	雷阵雨~多云	东南风	1-2级	106	轻度污染	3
2018-07-10	30	22	多云~雷阵雨	南风	1-2级	48	优	1
2018-07-11	25	22	雷阵雨~大雨	东北风	1-2级	44	优	1
2018-07-12	27	22	多云	南风	1-2级	46	优	1
2018-07-13	28	23	雷阵雨	东风	1-2级	60	良	2
2018-07-17	27	23	中雨~雷阵雨	西风	1-2级	28	优	1
2018-07-24	28	26	暴雨~雷阵雨	东北风	3-4级	29	优	1
…	…	…	…	…	…	…	…	…
2018-08-11	30	23	雷阵雨~中雨	东风	1-2级	60	良	2
2018-08-12	30	24	雷阵雨	东南风	1-2级	74	良	2
2018-08-14	29	24	中雨~小雨	东北风	1-2级	42	优	1
2018-08-16	30	21	晴~多云	东北风	1-2级	40	优	1
2018-08-17	30	22	多云~雷阵雨	东南风	1-2级	69	良	2
2018-08-18	28	23	小雨~中雨	北风	3-4级	40	优	1
2018-08-19	26	23	中雨~小雨	东北风	1-2级	37	优	1
2018-08-22	28	21	雷阵雨~多云	西南风	1-2级	48	优	1
2018-08-24	30	20	晴	北风	1-2级	40	优	1
2018-08-27	30	22	多云~雷阵雨	东南风	1-2级	89	良	2
2018-08-28	29	22	小雨~多云	南风	1-2级	58	良	2
2018-08-30	29	20	多云	南风	1-2级	47	优	1
2018-08-31	29	20	多云~阴	东南风	1-2级	48	优	1
2018-09-01	27	19	阴~小雨	南风	1-2级	50	优	1
2018-09-02	27	19	小雨~多云	南风	1-2级	55	良	2
2018-09-03	30	19	晴	北风	3-4级	70	良	2
2018-09-06	27	18	多云~晴	西北风	4-5级	37	优	1
2018-09-07	27	16	晴	西北风	3-4级	22	优	1
2018-09-08	27	15	多云~晴	北风	1-2级	28	优	1
2018-09-09	28	16	晴	西南风	1-2级	51	良	2
2018-09-10	28	19	多云	南风	1-2级	65	良	2
2018-09-11	26	19	多云	南风	1-2级	68	良	2
2018-09-12	29	19	多云	南风	1-2级	59	良	2
2018-09-13	29	20	多云~阴	南风	1-2级	107	轻度污染	3
2018-09-14	28	19	小雨~多云	南风	1-2级	128	轻度污染	3
2018-09-15	26	15	多云	北风	3-4级	42	优	1
2018-09-17	27	17	多云~阴	北风	1-2级	37	优	1
2018-09-18	25	17	阴~多云	西南风	1-2级	50	优	1
2018-09-19	26	17	多云	南风	1-2级	52	良	2
2018-09-20	27	16	多云	西南风	1-2级	63	良	2

64 rows × 8 columns

In [33]:

# Write a custom function: September days with good air quality
def query_my_data(df):
    return df.index.str.startswith("2018-09") & (df["aqiLevel"]==1)
df.loc[query_my_data,:]

Out[33]:

	bWendu	yWendu	tianqi	fengxiang	fengli	aqi	aqiInfo	aqiLevel
ymd
2018-09-01	27	19	阴~小雨	南风	1-2级	50	优	1
2018-09-04	31	18	晴	西南风	3-4级	24	优	1
2018-09-05	31	19	晴~多云	西南风	3-4级	34	优	1
2018-09-06	27	18	多云~晴	西北风	4-5级	37	优	1
2018-09-07	27	16	晴	西北风	3-4级	22	优	1
2018-09-08	27	15	多云~晴	北风	1-2级	28	优	1
2018-09-15	26	15	多云	北风	3-4级	42	优	1
2018-09-16	25	14	多云~晴	北风	1-2级	29	优	1
2018-09-17	27	17	多云~阴	北风	1-2级	37	优	1
2018-09-18	25	17	阴~多云	西南风	1-2级	50	优	1
2018-09-21	25	14	晴	西北风	3-4级	50	优	1
2018-09-22	24	13	晴	西北风	3-4级	28	优	1
2018-09-23	23	12	晴	西北风	4-5级	28	优	1
2018-09-24	23	11	晴	北风	1-2级	28	优	1
2018-09-25	24	12	晴~多云	南风	1-2级	44	优	1
2018-09-29	22	11	晴	北风	3-4级	21	优	1
2018-09-30	19	13	多云	西北风	4-5级	22	优	1

IV. Adding new columns in Pandas

In [1]:

import pandas as pd

0. Load the CSV into a DataFrame

In [15]:

df = pd.read_csv("./pandas-learn-code/datas/beijing_tianqi/beijing_tianqi_2018.csv")

In [16]:

df.head()

Out[16]:

	ymd	bWendu	yWendu	tianqi	fengxiang	fengli	aqi	aqiInfo	aqiLevel
0	2018-01-01	3℃	-6℃	晴~多云	东北风	1-2级	59	良	2
1	2018-01-02	2℃	-5℃	阴~多云	东北风	1-2级	49	优	1
2	2018-01-03	2℃	-5℃	多云	北风	1-2级	28	优	1
3	2018-01-04	0℃	-8℃	阴	东北风	1-2级	28	优	1
4	2018-01-05	3℃	-6℃	多云~晴	西北风	1-2级	50	优	1

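1. Direct assignment

Example: clean the temperature columns and convert them to numbers

In [31]:

# Plain column assignment replaces the column, so the int32 dtype is kept
# (df.loc[:, col] = ... writes in place and can leave the column as object in pandas 2.x)
df["bWendu"] = df["bWendu"].str.replace("℃","").astype('int32')
df["yWendu"] = df["yWendu"].str.replace("℃","").astype('int32')

Example: compute the daily temperature range

In [52]:

# Note: df["bWendu"] is a Series, so the subtraction also returns a Series
df.loc[:,"wencha"] = df["bWendu"] - df["yWendu"]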

In [53]:

df.head()

Out[53]:

	ymd	bWendu	yWendu	tianqi	fengxiang	fengli	aqi	aqiInfo	aqiLevel	wencha
0	2018-01-01	3	-6	晴~多云	东北风	1-2级	59	良	2	9
1	2018-01-02	2	-5	阴~多云	东北风	1-2级	49	优	1	7
2	2018-01-03	2	-5	多云	北风	1-2级	28	优	1	7
3	2018-01-04	0	-8	阴	东北风	1-2级	28	优	1	8
4	2018-01-05	3	-6	多云~晴	西北风	1-2级	50	优	1	9

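2. The df.apply method

Apply a function along an axis of the DataFrame. Objects passed to the function are Series objects whose index is either the DataFrame’s index (axis=0) or the DataFrame’s columns (axis=1).

Example: add a temperature-type column:

If the daily high is above 33 degrees, the day is "高温" (hot)
If the daily low is below -10 degrees, it is "低温" (cold)
Otherwise it is "常温" (normal)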

In [60]:

def get_wendu_type(x):
    if x["bWendu"] > 33:
        return "高温"
    if x["yWendu"] < -10:
        return "低温"
    return "常温"

# Note: axis=1 is required so the function receives one row at a time
df.loc[:,"wendu_type"] = df.apply(get_wendu_type, axis=1)

In [61]:

# Count the days in each temperature type
df["wendu_type"].value_counts()

Out[61]:

常温    328
高温     29
低温      8
Name: wendu_type, dtype: int64
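Row-wise apply calls a Python function once per row, which can become slow on large tables. One common alternative, shown here only as a sketch and not as the article's method, is numpy.select, which evaluates the conditions column-wise and picks the first one that matches. The frame `temps` below is a made-up four-row sample.

```python
import numpy as np
import pandas as pd

# Hypothetical sample covering every branch, including a row that is both hot and cold
temps = pd.DataFrame({"bWendu": [35, 3, -4, 34], "yWendu": [25, -6, -12, -11]})

def get_wendu_type(x):
    if x["bWendu"] > 33:
        return "高温"
    if x["yWendu"] < -10:
        return "低温"
    return "常温"

# Reference result from the row-wise approach used in this section
via_apply = temps.apply(get_wendu_type, axis=1)

# Vectorized version: conditions are checked in order, the first match wins
conditions = [temps["bWendu"] > 33, temps["yWendu"] < -10]
choices = ["高温", "低温"]
temps["wendu_type"] = np.select(conditions, choices, default="常温")

print(temps)
```

Because np.select honors the order of the condition list, it reproduces the early-return logic of the function: the last row is labeled "高温" even though its low is below -10.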

3. The df.assign method

Assign new columns to a DataFrame.

Returns a new object with all original columns in addition to new ones.

Example: convert temperatures from Celsius to Fahrenheit

In [63]:

# Several new columns can be added at once
df.assign(
    yWendu_huashi = lambda x: x["yWendu"]*9/5 + 32,
    bWendu_huashi = lambda x: x["bWendu"]*9/5 + 32
)

. . .
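Since assign returns a new object, the call above does not change df unless the result is stored. A minimal sketch on a made-up one-column frame (`temps` and `with_f` are illustrative names) makes that visible:

```python
import pandas as pd

# Hypothetical sample of daily lows in Celsius
temps = pd.DataFrame({"yWendu": [-6, 0, 20]})

# assign builds and returns a new DataFrame with the extra column
with_f = temps.assign(yWendu_huashi=lambda x: x["yWendu"] * 9 / 5 + 32)

# The original frame is untouched
print(list(temps.columns))
print(with_f)
```

To keep the new columns, bind the result to a name, for example by reassigning it to the original variable.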

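4. Assigning values group by group based on a condition

Select rows by condition, then write the new column's value for just those rows

Example: a day whose high-low difference exceeds 10 degrees counts as having a large temperature range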

In [65]:

df.loc[:,"wencha_type"] = ""
df.loc[df["bWendu"]-df["yWendu"]>10, "wencha_type"] = "温差大"
df.loc[df["bWendu"]-df["yWendu"]<=10, "wencha_type"]= "温度正常"

In [67]:

df["wencha_type"].value_counts()

Out[67]:

温度正常    187
温差大     178
Name: wencha_type, dtype: int64

V. Pandas statistics functions

Summary statistics
Unique values and value counts
Correlation and covariance

In [2]:

import pandas as pd

0. Load the CSV data

In [5]:

fpath = "./pandas-learn-code/datas/beijing_tianqi/beijing_tianqi_2018.csv"
df = pd.read_csv(fpath)

In [6]:

df.head(3)

Out[6]:

	ymd	bWendu	yWendu	tianqi	fengxiang	fengli	aqi	aqiInfo	aqiLevel
0	2018-01-01	3℃	-6℃	晴~多云	东北风	1-2级	59	良	2
1	2018-01-02	2℃	-5℃	阴~多云	东北风	1-2级	49	优	1
2	2018-01-03	2℃	-5℃	多云	北风	1-2级	28	优	1

In [12]:

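# Convert both temperature columns; the output below shows bWendu cleaned as well
df["bWendu"] = df["bWendu"].str.replace("℃","").astype("int32")
df["yWendu"] = df["yWendu"].str.replace("℃","").astype("int32")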

In [14]:

df.head(3)

Out[14]:

	ymd	bWendu	yWendu	tianqi	fengxiang	fengli	aqi	aqiInfo	aqiLevel
0	2018-01-01	3	-6	晴~多云	东北风	1-2级	59	良	2
1	2018-01-02	2	-5	阴~多云	东北风	1-2级	49	优	1
2	2018-01-03	2	-5	多云	北风	1-2级	28	优	1

1. Summary statistics

In [15]:

# Summary statistics for all numeric columns at once
df.describe()

Out[15]:

	bWendu	yWendu	aqi	aqiLevel
count	365.000000	365.000000	365.000000	365.000000
mean	18.665753	8.358904	82.183562	2.090411
std	11.858046	11.755053	51.936159	1.029798
min	-5.000000	-12.000000	21.000000	1.000000
25%	8.000000	-3.000000	46.000000	1.000000
50%	21.000000	8.000000	69.000000	2.000000
75%	29.000000	19.000000	104.000000	3.000000
max	38.000000	27.000000	387.000000	6.000000

In [16]:

# Statistics for a single Series
df["bWendu"].mean()

Out[16]:

18.665753424657535

In [17]:

# Highest daily high
df["bWendu"].max()

Out[17]:

38

In [18]:

# Lowest daily high
df["bWendu"].min()

Out[18]:

-5

2. Unique values and value counts

2.1 Unique values

Typically used on enumerated or categorical columns rather than numeric ones

In [19]:

df["fengxiang"].unique()

Out[19]:

array(['东北风', '北风', '西北风', '西南风', '南风', '东南风', '东风', '西风'], dtype=object)

In [20]:

df["tianqi"].unique()

Out[20]:

array(['晴~多云', '阴~多云', '多云', '阴', '多云~晴', '多云~阴', '晴', '阴~小雪', '小雪~多云',
       '小雨~阴', '小雨~雨夹雪', '多云~小雨', '小雨~多云', '大雨~小雨', '小雨', '阴~小雨',
       '多云~雷阵雨', '雷阵雨~多云', '阴~雷阵雨', '雷阵雨', '雷阵雨~大雨', '中雨~雷阵雨', '小雨~大雨',
       '暴雨~雷阵雨', '雷阵雨~中雨', '小雨~雷阵雨', '雷阵雨~阴', '中雨~小雨', '小雨~中雨', '雾~多云',
       '霾'], dtype=object)

In [22]:

df["fengli"].unique()

Out[22]:

array(['1-2级', '4-5级', '3-4级', '2级', '1级', '3级'], dtype=object)

2.2 Value counts

In [24]:

df["fengxiang"].value_counts()

Out[24]:

南风     92
西南风    64
北风     54
西北风    51
东南风    46
东北风    38
东风     14
西风      6
Name: fengxiang, dtype: int64

In [25]:

df["tianqi"].unique()

Out[25]:

array(['晴~多云', '阴~多云', '多云', '阴', '多云~晴', '多云~阴', '晴', '阴~小雪', '小雪~多云',
       '小雨~阴', '小雨~雨夹雪', '多云~小雨', '小雨~多云', '大雨~小雨', '小雨', '阴~小雨',
       '多云~雷阵雨', '雷阵雨~多云', '阴~雷阵雨', '雷阵雨', '雷阵雨~大雨', '中雨~雷阵雨', '小雨~大雨',
       '暴雨~雷阵雨', '雷阵雨~中雨', '小雨~雷阵雨', '雷阵雨~阴', '中雨~小雨', '小雨~中雨', '雾~多云',
       '霾'], dtype=object)

In [26]:

df["fengli"].value_counts()

Out[26]:

1-2级    236
3-4级     68
1级       21
4-5级     20
2级       13
3级        7
Name: fengli, dtype: int64

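3. Correlation and covariance

Uses:

Do two stocks rise and fall together? How strongly? Are they positively or negatively correlated?
Which factors are positively or negatively correlated with swings in product sales, and how strongly?

For two variables x and y:

Covariance: measures whether the variables move in the same or opposite directions. A positive covariance means x and y tend to change in the same direction, and the larger it is, the stronger that tendency; a negative covariance means they move in opposite directions, and the more negative it is, the stronger the opposition.
Correlation coefficient: measures how similarly the two variables move, on a scale from -1 to 1. A value of 1 means the strongest possible positive linear relationship, and -1 means the strongest possible negative one.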
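The two definitions can be checked by hand. The sketch below uses two short made-up Series, computes the sample covariance as the sum of products of deviations from the mean divided by n - 1, and then divides by both standard deviations to get the Pearson correlation coefficient. It compares the results with pandas' own `cov` and `corr`.

```python
import pandas as pd

# Hypothetical data: y tends to grow with x
x = pd.Series([1.0, 2.0, 3.0, 4.0])
y = pd.Series([2.0, 4.0, 5.0, 9.0])

# Sample covariance: average product of deviations, with n - 1 in the denominator
cov_manual = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

# Pearson correlation: covariance rescaled by both sample standard deviations
corr_manual = cov_manual / (x.std() * y.std())

print(cov_manual, x.cov(y))
print(corr_manual, x.corr(y))
```

The rescaling is what makes the correlation coefficient unit-free and bounded, so it can be compared across pairs of columns in a way raw covariance cannot.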

In [27]:

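# Covariance matrix (numeric_only=True skips the text columns, which pandas 2.x no longer drops automatically)
df.cov(numeric_only=True)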

Out[27]:

	bWendu	yWendu	aqi	aqiLevel
bWendu	140.613247	135.529633	47.462622	0.879204
yWendu	135.529633	138.181274	16.186685	0.264165
aqi	47.462622	16.186685	2697.364564	50.749842
aqiLevel	0.879204	0.264165	50.749842	1.060485

In [28]:

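# Correlation matrix (numeric_only=True skips the text columns, which pandas 2.x no longer drops automatically)
df.corr(numeric_only=True)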

Out[28]:

	bWendu	yWendu	aqi	aqiLevel
bWendu	1.000000	0.972292	0.077067	0.071999
yWendu	0.972292	1.000000	0.026513	0.021822
aqi	0.077067	0.026513	1.000000	0.948883
aqiLevel	0.071999	0.021822	0.948883	1.000000

In [29]:

# Correlation between the air quality index and the daily high on its own
df["aqi"].corr(df["bWendu"])

Out[29]:

0.07706705916811067

In [30]:

df["aqi"].corr(df["yWendu"])

Out[30]:

0.026513282672968895

In [31]:

# Correlation between the air quality index and the daily temperature range
df["aqi"].corr(df["bWendu"]-df["yWendu"])

Out[31]:

0.2165225757638205

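Taken individually, the daily high and the daily low are almost uncorrelated with the air quality index (about 0.08 and 0.03), while the temperature range shows a noticeably stronger correlation (about 0.22). Deriving features like this before the analysis can therefore matter a great deal for the results.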

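VI. Handling missing values in Pandas

Pandas provides these functions for missing values:

isnull and notnull: test whether values are missing; available on both DataFrame and Series
dropna: drop missing values
- axis: drop rows or columns, {0 or "index", 1 or "columns"}, default 0
- how: "any" drops a row or column if any value is missing; "all" drops it only if every value is missing
- inplace: if True, modify the current df; otherwise return a new df
fillna: fill missing values
- value: the fill value, either a single value or a dict (key is the column name, value is the fill value)
- method: "ffill" fills with the previous non-missing value (forward fill); "bfill" fills with the next non-missing value (backward fill)
- axis: fill along rows or columns, {0 or "index", 1 or "columns"}
- inplace: if True, modify the current df; otherwise return a new df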
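The functions listed above can be tried out on a small in-memory table before touching a real file. The sketch below is only an illustration: the frame `scores`, its column names, and its values are invented. It uses the `.ffill()` method for forward filling, since recent pandas versions deprecate passing `method=` to fillna.

```python
import numpy as np
import pandas as pd

# Hypothetical table with an all-empty column, an all-empty row, and scattered gaps
scores = pd.DataFrame({
    "name": ["Ann", None, None, "Bob"],
    "subject": ["math", "art", None, "math"],
    "score": [85.0, np.nan, np.nan, 90.0],
    "empty": [np.nan] * 4,
})

# Detect missing values cell by cell
print(scores.isnull())

# Drop columns, then rows, in which every value is missing
cleaned = scores.dropna(axis="columns", how="all").dropna(axis="index", how="all")

# Fill missing scores with 0, using a dict keyed by column name
cleaned = cleaned.fillna({"score": 0})

# Forward-fill missing names from the previous row
cleaned["name"] = cleaned["name"].ffill()

print(cleaned)
```

The order matters here: the all-empty column has to go first, otherwise no row would ever count as entirely empty.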

In [1]:

import pandas as pd

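Example: reading, cleaning, and processing an irregular Excel file

Step 1: skip the leading blank rows when reading the Excel file

In [5]:

# skiprows=2 skips the first two rows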
studf = pd.read_excel("./pandas-learn-code/datas/student_excel/student_excel.xlsx", skiprows=2)

In [6]:

studf

Out[6]:

	Unnamed: 0	姓名	科目	分数
0	NaN	小明	语文	85.0
