Advanced Python Data Analysis

grazieST

Last modified 2022-12-21 19:50:20

Category: Bioinformatics. Tags: python, data analysis, pandas

First published 2022-12-19 22:42:43

Article link: https://blog.csdn.net/m0_65592124/article/details/128378605

Python study notes

Preface

#100 days of bioinformatics check-in series

Getting started with pandas

[2/100] pandas: reading data

pd.read_csv(path, sep = "\t", header = None, names = ['pdate', 'pv', 'uv'])
pd.read_excel
pd.read_sql

import pandas as pd
import numpy as np

name = pd.read_csv("../name_id.xls", sep = "\t", names = ['sample', 'subpopulation'])
name
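The input file is not included in the post, so the following is a minimal sketch that reads the same kind of headerless, tab-separated data from an in-memory string. The three sample rows and the dtype=str choice are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for a headerless, tab-separated file such as name_id.xls
raw = "110\tCA\nMPSM10\tMP\nP14\tOG\n"

# header=None says the file has no header row; names supplies the column labels.
# dtype=str keeps IDs such as "110" as text instead of converting them to integers.
name = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    header=None,
    names=["sample", "subpopulation"],
    dtype=str,
)
print(name)
```

Reading the IDs as text matters later, because the sample column is used as a label index and looked up with strings such as '110'.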

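pandas data structures

DataFrame --> two-dimensional data: a whole table with multiple rows and columns
Series --> one-dimensional data: a single row or a single column (similar to a dict, and easy to work with)
- A Series consists of a sequence of values (which may have different data types) together with an associated index

Series

A plain list of values is enough to create a simple Series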

s1 = pd.Series([1, 'a', 5.2, 7])
s1

0      1
1      a
2    5.2
3      7
dtype: object

s1.index

RangeIndex(start=0, stop=4, step=1)

s1.values

array([1, 'a', 5.2, 7], dtype=object)

Specify the index when creating a Series

s2 = pd.Series([1, 'a', 5.2, 7], index = ['a', 'b', 'd', 'e'])
s2

a      1
b      a
d    5.2
e      7
dtype: object

Create a Series from a dict

The dict keys become the index and the dict values become the values

sdata = {'Ohio' : 35000, 'Texas' : 72000, 'Oregon' : 16000, 'Utah' : 5000}
s3 = pd.Series(sdata)
s3

Ohio      35000
Texas     72000
Oregon    16000
Utah       5000
dtype: int64

Query data by index label

Works like a dict
Querying a single label returns the value itself
Querying several labels at once returns a Series

s2

a      1
b      a
d    5.2
e      7
dtype: object

s2['a']

type(s2['a'])

int

s2[['b', 'a']]

b    a
a    1
dtype: object

type(s2[['b', 'a']])

pandas.core.series.Series
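A small sketch of the lookup rules above, reusing the s2 values from this section. The membership test and the .loc accessor are additions that the original notes do not show.

```python
import pandas as pd

s2 = pd.Series([1, 'a', 5.2, 7], index=['a', 'b', 'd', 'e'])

single = s2['a']            # one label -> the stored value itself
several = s2[['b', 'a']]    # list of labels -> a Series, in the requested order

print(single)
print(type(several).__name__)
print('d' in s2)            # like a dict, membership is tested against the index labels
print(s2.loc['e'])          # .loc is the explicit label-based accessor
```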

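DataFrame

Each column can have a different type
It has both a row index (index) and a column index (columns)
It can also be viewed as a dict of Series

# Create a DataFrame from a dict of equal-length lists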
data = {
    'state' : ['Ohio', 'Utah', 'Ohio', 'Texas', 'Ohio'],
    'year' : [2000, 2001, 2002, 2003, 2004],
    'pop' : [1.5, 1.7, 3.6, 2.4, 2.9]
}
df = pd.DataFrame(data)
df

	state	year	pop
0	Ohio	2000	1.5
1	Utah	2001	1.7
2	Ohio	2002	3.6
3	Texas	2003	2.4
4	Ohio	2004	2.9

Querying a DataFrame

Querying one row returns a Series
Querying several rows returns a DataFrame

# Query one row
df.loc[1]

state    Utah
year     2001
pop       1.7
Name: 1, dtype: object

# Query several rows (a .loc label slice includes both endpoints)
df.loc[1:3]

	state	year	pop
1	Utah	2001	1.7
2	Ohio	2002	3.6
3	Texas	2003	2.4

type(df.loc[1:3])

pandas.core.frame.DataFrame

[2/100] pandas: querying data

df.loc: query by row and column labels
df.iloc: query by integer row and column positions
df.where
df.query

.loc can be used both to query data and to overwrite it
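To make the difference between label-based and position-based access concrete, here is a minimal sketch on a three-row frame. The frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data, indexed by sample ID strings
df = pd.DataFrame(
    {"subpopulation": ["CA", "CA", "WA"]},
    index=pd.Index(["110", "111", "553"], name="samples"),
)

by_label = df.loc["110":"111", "subpopulation"]   # label slice: both endpoints included
by_position = df.iloc[0:1, 0]                     # position slice: the end is excluded

# .loc on the left-hand side overwrites the selected cell
df.loc["553", "subpopulation"] = "OG"

print(len(by_label), len(by_position))
print(df)
```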

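Querying data with df.loc

Query with a single label
Query with a list of labels
Query a range with a label slice
Query with a boolean condition
Query with a callable

Load the data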

na = pd.read_table("../name_id.xls", header = None, names = ['samples', 'subpopulation'])
na

	samples	subpopulation
0	110	CA
1	111	CA
2	113	CA
3	117	CA
119	MPSM10	MP
120	MPSM11	MP
121	MPSM12	MP
122	MPSM13	MP
155	553	WA
156	660	WA
157	855	WA
250	P14	OG
251	P28	OG

# Use the samples column as the index
na.set_index('samples', inplace = True)
na.index

Index(['110', '111', '113', '117', '120', '124', '128', '129', '130', '142',
       ...
       'P64', 'P65', 'P67', 'P70', 'P7', 'P81', 'P84', 'P8', 'P14', 'P28'],
      dtype='object', name='samples', length=252)

na

# Replace CA with test in the subpopulation column
na.loc[:, "subpopulation"] = na["subpopulation"].str.replace("CA", "test").astype('str')

na.head()

	subpopulation
samples
110	test
111	test
113	test
117	test
120	test

Query a single value

na
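The cell for this step shows only the whole frame. A single-value lookup would presumably pass one row label and one column label to .loc; the sketch below assumes that and uses a few made-up rows in place of the real file.

```python
import pandas as pd

# Hypothetical rows standing in for name_id.xls
na = pd.DataFrame(
    {"samples": ["110", "111", "P14"], "subpopulation": ["CA", "CA", "OG"]}
).set_index("samples")

# One row label plus one column label returns the cell value itself
value = na.loc["P14", "subpopulation"]
print(value, type(value).__name__)

# .at is a dedicated accessor for exactly this scalar case
print(na.at["110", "subpopulation"])
```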

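Query with a list of labels

The outputs in the rest of this section show the original CA labels rather than test.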

na.loc[['110', '117'], 'subpopulation']

samples
110    CA
117    CA
Name: subpopulation, dtype: object

Query a range with a label slice

Unlike an ordinary Python slice, a .loc label slice includes both endpoints.

na.loc['110':'117', 'subpopulation']

samples
110    CA
111    CA
113    CA
117    CA
Name: subpopulation, dtype: object

Query by integer position with iloc

na.iloc[1:20, 0]

samples
111    CA
113    CA
117    CA
120    CA
124    CA
128    CA
129    CA
130    CA
142    CA
143    CA
147    CA
153    CA
161    CA
162    CA
170    CA
171    CA
172    CA
188    CA
194    CA
Name: subpopulation, dtype: object

Query with a boolean condition

na[na.loc[:,'subpopulation'] == 'CA']

Query with a callable

na.loc[lambda na : (na['subpopulation'] == "CA") | (na['subpopulation'] == "OG"), :]

	subpopulation
samples
110	CA
111	CA
P14	OG
P28	OG
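df.where and df.query are listed at the top of this section but not demonstrated. The sketch below shows one plausible use of each on made-up rows; it is an illustration, not the author's code.

```python
import pandas as pd

na = pd.DataFrame(
    {"samples": ["110", "111", "553", "P14"],
     "subpopulation": ["CA", "CA", "WA", "OG"]}
).set_index("samples")

# query takes the condition as a string and returns only the matching rows
picked = na.query("subpopulation == 'CA' or subpopulation == 'OG'")

# isin expresses the same multi-value filter as a boolean mask
same = na[na["subpopulation"].isin(["CA", "OG"])]

# where keeps the original shape and replaces non-matching cells with NaN
masked = na.where(na == "CA")

print(picked)
print(masked)
```

query and boolean masks shrink the result to the matching rows, while where preserves the row count, which is useful when the result has to stay aligned with the original index.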

[3/100] pandas: adding new columns

Direct assignment
df.apply
df.assign
Select groups by condition and assign to each separately

import numpy as np
import pandas as pd

Load the data

df = pd.read_table('../name_id.xls', sep = '\t', header = None, names = ['samples', 'subpopulation'])
df.head()

	samples	subpopulation
0	110	CA
1	111	CA
2	113	CA
3	117	CA
4	120	CA

df.set_index('samples', inplace = True)
df.head()

	subpopulation
samples
110	CA
111	CA
113	CA
117	CA
120	CA

Direct assignment

# Number the 252 samples from 1 to 252
df.loc[:, 'order'] = list(range(1, 253))
df

	subpopulation	order
samples
110	CA	1
111	CA	2
113	CA	3
117	CA	4
120	CA	5
...	...	...
P81	WA	248
P84	WA	249
P8	WA	250
P14	OG	251
P28	OG	252

252 rows × 2 columns

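The df.apply method

Apply a function along an axis of the DataFrame.
Objects passed to the function are Series objects whose index is either the DataFrame's index (axis=0) or the DataFrame's columns (axis=1).
Example: add a column of Chinese annotations:
1. CA is annotated as 栽培薄皮 (cultivated thin-skinned)
2. WA is annotated as 野生薄皮 (wild thin-skinned)
3. MP is annotated as 马泡 (mapao)
4. OG is annotated as 外群 (outgroup)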

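def add_column(x):
    # x is one row of df (a Series); map its subpopulation code to a Chinese label
    if x['subpopulation'] == 'CA':
        return "栽培薄皮"
    elif x['subpopulation'] == 'WA':
        return "野生薄皮"
    elif x['subpopulation'] == 'MP':
        return "马泡"
    else:
        return "外群"

# axis = 1 passes each row to add_column
df.loc[:, 'un'] = df.apply(add_column, axis = 1)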

df

	subpopulation	order	un
samples
110	CA	1	栽培薄皮
111	CA	2	栽培薄皮
113	CA	3	栽培薄皮
117	CA	4	栽培薄皮
120	CA	5	栽培薄皮
...	...	...	...
P81	WA	248	野生薄皮
P84	WA	249	野生薄皮
P8	WA	250	野生薄皮
P14	OG	251	外群
P28	OG	252	外群

252 rows × 3 columns

Count the values in un

df['un'].value_counts()

栽培薄皮    119
野生薄皮     95
马泡       36
外群        2
Name: un, dtype: int64
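Row-wise apply calls the Python function once per row. For a plain code-to-label lookup like this one, Series.map with a dict is a common alternative. The sketch below swaps in that technique on made-up rows; the dict mirrors add_column, and codes missing from it fall back to 外群 as the else branch does.

```python
import pandas as pd

df = pd.DataFrame(
    {"subpopulation": ["CA", "WA", "MP", "OG"]},
    index=pd.Index(["110", "553", "MPSM10", "P14"], name="samples"),
)

# Same code-to-label mapping as add_column
labels = {"CA": "栽培薄皮", "WA": "野生薄皮", "MP": "马泡"}

# map looks each code up in the dict; codes missing from it become NaN,
# and fillna then applies the default label
df["un"] = df["subpopulation"].map(labels).fillna("外群")

print(df["un"].value_counts())
```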

The df.assign method

Assign new columns to a DataFrame.
Return a new object with all original columns in addition to new ones.
The DataFrame itself is not modified; a new object is returned

# Several columns can be added at once
df.assign(
    sort = lambda x : x['order'],
    sort1 = lambda x : x['order'] * 2
)

	subpopulation	order	un	sort	sort1
samples
110	CA	1	栽培薄皮	1	2
111	CA	2	栽培薄皮	2	4
113	CA	3	栽培薄皮	3	6
117	CA	4	栽培薄皮	4	8
120	CA	5	栽培薄皮	5	10
...	...	...	...	...	...
P81	WA	248	野生薄皮	248	496
P84	WA	249	野生薄皮	249	498
P8	WA	250	野生薄皮	250	500
P14	OG	251	外群	251	502
P28	OG	252	外群	252	504

252 rows × 5 columns
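Because assign returns a new DataFrame, the result has to be bound to a name to be kept. A minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"order": [1, 2, 3]})

# Each keyword becomes a new column; a callable receives the DataFrame being built
extended = df.assign(
    sort=lambda x: x["order"],
    sort1=lambda x: x["order"] * 2,
)

print(list(df.columns))        # the original is unchanged
print(list(extended.columns))  # the new columns exist only on the returned object
```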

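Select groups by condition and assign to each separately

First select the rows that match a condition, then assign the new column's value for those rows

# Create an empty column (one way to create a new column)
# The scalar '' is broadcast to every row
# 第一组 = group 1, 第二组 = group 2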
df['priority'] = ''
df.loc[df['order'] <= 30, 'priority'] = "第一组"
df.loc[df['order'] > 30, 'priority'] = "第二组"
df

	subpopulation	order	un	priority
samples
110	CA	1	栽培薄皮	第一组
111	CA	2	栽培薄皮	第一组
113	CA	3	栽培薄皮	第一组
117	CA	4	栽培薄皮	第一组
120	CA	5	栽培薄皮	第一组
...	...	...	...	...
P81	WA	248	野生薄皮	第二组
P84	WA	249	野生薄皮	第二组
P8	WA	250	野生薄皮	第二组
P14	OG	251	外群	第二组
P28	OG	252	外群	第二组

252 rows × 4 columns

df['priority'].value_counts()

第二组    222
第一组     30
Name: priority, dtype: int64
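numpy is imported at the top of this section but not used. For a two-way split like this one, np.where can build the column in a single step. This is a swapped-in alternative to the two .loc assignments, shown on made-up order values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"order": [1, 30, 31, 252]})

# Where the condition is True take the first label, otherwise the second
df["priority"] = np.where(df["order"] <= 30, "第一组", "第二组")

print(df)
```

The .loc form from the notes generalizes more naturally to three or more groups, since each group gets its own condition.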

[4/100] pandas: statistics functions

Summary statistics
Unique values and value counts
Correlation and covariance

import pandas as pd

sample = pd.read_table('../name_id.xls', header = None, sep = '\t', names = ['samples', 'subpopulation'])
sample.head()

	samples	subpopulation
0	110	CA
1	111	CA
2	113	CA
3	117	CA
4	120	CA

sample.set_index('samples', inplace = True)
sample.head(10)

	subpopulation
samples
110	CA
111	CA
113	CA
117	CA
120	CA
124	CA
128	CA
129	CA
130	CA
142	CA

sample['order'] = list(range(1, 253))
sample.head(10)

	subpopulation	order
samples
110	CA	1
111	CA	2
113	CA	3
117	CA	4
120	CA	5
124	CA	6
128	CA	7
129	CA	8
130	CA	9
142	CA	10

sample = sample.assign(
    sort = sample['order'],
    sort1 = sample['order'] * 2
)
sample

	subpopulation	order	sort	sort1
samples
110	CA	1	1	2
111	CA	2	2	4
113	CA	3	3	6
117	CA	4	4	8
120	CA	5	5	10
...	...	...	...	...
P81	WA	248	248	496
P84	WA	249	249	498
P8	WA	250	250	500
P14	OG	251	251	502
P28	OG	252	252	504

252 rows × 4 columns

sample['priority'] = ''
sample.loc[sample['order'] <= 30, 'priority'] = "first group"
sample.loc[sample['order'] > 30, 'priority'] = 'second group'
sample

	subpopulation	order	sort	sort1	priority
samples
110	CA	1	1	2	first group
111	CA	2	2	4	first group
113	CA	3	3	6	first group
117	CA	4	4	8	first group
120	CA	5	5	10	first group
...	...	...	...	...	...
P81	WA	248	248	496	second group
P84	WA	249	249	498	second group
P8	WA	250	250	500	second group
P14	OG	251	251	502	second group
P28	OG	252	252	504	second group

252 rows × 5 columns

Summary statistics

# Get summary statistics for all numeric columns at once
sample.describe()

	order	sort	sort1
count	252.000000	252.000000	252.000000
mean	126.500000	126.500000	253.000000
std	72.890329	72.890329	145.780657
min	1.000000	1.000000	2.000000
25%	63.750000	63.750000	127.500000
50%	126.500000	126.500000	253.000000
75%	189.250000	189.250000	378.500000
max	252.000000	252.000000	504.000000

# Inspect a single column (print also works outside a notebook, where display is not defined)
print(sample['order'].mean())
print(sample['order'].max())
print(sample['order'].min())

Unique values and value counts

Unique values

sample['priority'].unique()

array(['first group', 'second group'], dtype=object)

Count by value

sample['priority'].value_counts()

second group    222
first group      30
Name: priority, dtype: int64

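Correlation and covariance

Covariance matrix of the numeric columns:

	order	sort	sort1
order	5313.0	5313.0	10626.0
sort	5313.0	5313.0	10626.0
sort1	10626.0	10626.0	21252.0

Correlation matrix of the numeric columns:

	order	sort	sort1
order	1.0	1.0	1.0
sort	1.0	1.0	1.0
sort1	1.0	1.0	1.0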
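The two matrices above match what DataFrame.cov and DataFrame.corr return for these columns. A minimal sketch on a rebuilt version of the numeric columns follows. It assumes pandas 1.5 or later, where numeric_only is available on both methods; the argument is passed explicitly because recent versions no longer silently skip text columns.

```python
import pandas as pd

# Rebuild the numeric columns used in this section, plus one text column
sample = pd.DataFrame({"order": range(1, 253)})
sample["sort"] = sample["order"]
sample["sort1"] = sample["order"] * 2
sample["priority"] = "second group"

cov = sample.cov(numeric_only=True)    # pairwise covariance
corr = sample.corr(numeric_only=True)  # pairwise Pearson correlation

print(cov)
print(corr)
```

Every correlation is 1.0 because sort is a copy of order and sort1 is an exact multiple of it.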

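[4/100] pandas: handling missing values

pandas provides the following functions for handling missing values:
- isnull and notnull: test whether values are missing; available on both DataFrame and Series
- dropna: drop missing values
  - axis: whether to drop rows or columns; 0 for index, 1 for columns; default is 0
  - how: 'any' drops a row or column if any value is missing; 'all' drops it only if every value is missing
  - inplace: if True, modify the current DataFrame; if False, return a new DataFrame
- fillna: fill missing values
  - value: the fill value, either a single value or a dict (key is the column name, value is the fill value)
  - method: 'ffill' fills with the previous non-missing value; 'bfill' fills with the next non-missing value (deprecated since pandas 2.1 in favor of the ffill() and bfill() methods)
  - axis: whether to fill along rows or columns
  - inplace: if True, modify the object in place; if False, return a new object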
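The options above can be tried on a small frame. The data below is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["A", np.nan, np.nan],
    "score": [np.nan, 80.0, np.nan],
})

print(df.isnull().sum())                 # number of missing values per column

any_dropped = df.dropna(how="any")       # drops rows with at least one missing value
all_dropped = df.dropna(how="all")       # drops only rows that are entirely missing

# A dict fills each column with its own value
filled = df.fillna({"name": "unknown", "score": 0})

print(len(any_dropped), len(all_dropped))
print(filled)
```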

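Read the Excel file

The columns are 姓名 (name), 科目 (subject), and 分数 (score).

import pandas as pd  # reading .xlsx files requires the openpyxl package to be installed
df = pd.read_excel('../pd_test.xlsx', skiprows = 0)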
df

	姓名	科目	分数
0	小明	语文	NaN
1	NaN	数学	80.0
2	NaN	英语	90.0
3	NaN	NaN	NaN
4	小王	语文	85.0
5	NaN	数学	NaN
6	NaN	英语	90.0
7	NaN	NaN	NaN
8	小刚	语文	85.0
9	NaN	数学	80.0
10	NaN	英语	NaN

Check for missing values

df.isnull()

	姓名	科目	分数
0	False	False	True
1	True	False	False
2	True	False	False
3	True	True	True
4	False	False	False
5	True	False	True
6	True	False	False
7	True	True	True
8	False	False	False
9	True	False	False
10	True	False	True

df['分数'].isnull()

0      True
1     False
2     False
3      True
4     False
5      True
6     False
7      True
8     False
9     False
10     True
Name: 分数, dtype: bool

# Select all rows whose 分数 (score) is not missing
df.loc[df['分数'].notnull() , :]

	姓名	科目	分数
1	NaN	数学	80.0
2	NaN	英语	90.0
4	小王	语文	85.0
6	NaN	英语	90.0
8	小刚	语文	85.0
9	NaN	数学	80.0

Drop columns in which every value is missing

df.dropna(axis = 'columns', how = 'all', inplace = True)
df

	姓名	科目	分数
0	小明	语文	NaN
1	NaN	数学	80.0
2	NaN	英语	90.0
3	NaN	NaN	NaN
4	小王	语文	85.0
5	NaN	数学	NaN
6	NaN	英语	90.0
7	NaN	NaN	NaN
8	小刚	语文	85.0
9	NaN	数学	80.0
10	NaN	英语	NaN

Drop rows in which every value is missing

df.dropna(axis = 'index', how = 'all', inplace = True)
df

	姓名	科目	分数
0	小明	语文	NaN
1	NaN	数学	80.0
2	NaN	英语	90.0
4	小王	语文	85.0
5	NaN	数学	NaN
6	NaN	英语	90.0
8	小刚	语文	85.0
9	NaN	数学	80.0
10	NaN	英语	NaN

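Fill missing values in the 分数 column with 0

# Assign the result back; fillna(inplace = True) on a selected column may not update df in recent pandas versions
df['分数'] = df['分数'].fillna(0)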
df

	姓名	科目	分数
0	小明	语文	0.0
1	NaN	数学	80.0
2	NaN	英语	90.0
4	小王	语文	85.0
5	NaN	数学	0.0
6	NaN	英语	90.0
8	小刚	语文	85.0
9	NaN	数学	80.0
10	NaN	英语	0.0

# Alternative: pass a dict that maps column names to fill values
df.fillna({'分数' : 0}, inplace = True)
df

	姓名	科目	分数
0	小明	语文	0.0
1	NaN	数学	80.0
2	NaN	英语	90.0
4	小王	语文	85.0
5	NaN	数学	0.0
6	NaN	英语	90.0
8	小刚	语文	85.0
9	NaN	数学	80.0
10	NaN	英语	0.0

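Forward-fill missing values in the 姓名 column

# ffill() carries the last non-missing name forward; fillna(method = 'ffill') is deprecated since pandas 2.1
df['姓名'] = df['姓名'].ffill()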
df

	姓名	科目	分数
0	小明	语文	0.0
1	小明	数学	80.0
2	小明	英语	90.0
4	小王	语文	85.0
5	小王	数学	0.0
6	小王	英语	90.0
8	小刚	语文	85.0
9	小刚	数学	80.0
10	小刚	英语	0.0
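The cleaning steps of this section can also be chained without inplace. The sketch below rebuilds the first rows of the table by hand, since the Excel file itself is not available here, and uses ffill() rather than fillna(method='ffill').

```python
import numpy as np
import pandas as pd

# Hand-built copy of the first rows of the table in this section
df = pd.DataFrame({
    "姓名": ["小明", np.nan, np.nan, np.nan, "小王"],
    "科目": ["语文", "数学", "英语", np.nan, "语文"],
    "分数": [np.nan, 80.0, 90.0, np.nan, 85.0],
})

clean = (
    df.dropna(axis="index", how="all")   # drop fully empty rows
      .fillna({"分数": 0})                # missing scores become 0
)
clean["姓名"] = clean["姓名"].ffill()     # carry each name down to the rows below it

print(clean)
```

Dropping the empty separator rows before the forward fill mirrors the order used in the notes.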

Save the cleaned data to Excel

df.to_excel('../pd_test_clean.xlsx', index = False)
