Pandas Exercises

Latest recommended article published on 2024-01-12 22:10:24

莫莫先生

Article link: https://blog.csdn.net/weixin_44835732/article/details/105126059

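This article provides 100 hands-on Pandas exercises covering core operations such as data transformation, merging, statistical analysis, and time series handling, aiming to help readers understand and master the Pandas library in depth. Through examples including series creation, data loading, missing-value handling, data cleaning, reshaping, statistical computation, grouping, and sorting, it builds data analysis skills across the board.

Summary generated automatically by CSDN.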

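The exercises below come from "pandas数据分析100道练习题" (100 pandas data analysis exercises). They are just about enough to get familiar with the various pandas operations; I still do not fully understand some of the functions used in a few of the problems.
I have put all the exercises into a single ipynb file and uploaded it to CSDN (0 points to download): link.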

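I also want to share two sites with exercises that let you practice pandas through hands-on work (ten practical problems):
I did not go through both sets, only the one on Heywhale (和鲸 Kesci), but judging from the problem titles and the format of the exercises on GitHub, the two are essentially identical.
Heywhale (和鲸 Kesci): pandas data analysis exercises \ pandas data analysis exercises: GitHub link
Zhihu (知乎): 100 pandas exercises \ 100 pandas exercises: GitHub link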

How to import pandas and check the version

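import pandas as pd
print(pd.__version__)
# show_versions() prints its report itself and returns None,
# so wrapping it in print() also emits the trailing "None" seen below.
print(pd.show_versions(as_json=True))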

0.25.1
{'system': {'commit': None, 'python': '3.7.4.final.0', 'python-bits': 64, 'OS': 'Windows', 'OS-release': '10', 'machine': 'AMD64', 'processor': 'Intel64 Family 6 Model 142 Stepping 9, GenuineIntel', 'byteorder': 'little', 'LC_ALL': 'None', 'LANG': 'None', 'LOCALE': 'None.None'}, 'dependencies': {'pandas': '0.25.1', 'numpy': '1.16.5', 'pytz': '2019.3', 'dateutil': '2.8.0', 'pip': '19.2.3', 'setuptools': '41.4.0', 'Cython': '0.29.13', 'pytest': '5.2.1', 'hypothesis': None, 'sphinx': '2.2.0', 'blosc': None, 'feather': None, 'xlsxwriter': '1.2.1', 'lxml.etree': '4.4.1', 'html5lib': '1.0.1', 'pymysql': '0.9.3', 'psycopg2': None, 'jinja2': '2.10.3', 'IPython': '7.8.0', 'pandas_datareader': None, 'bs4': '4.8.0', 'bottleneck': '1.2.1', 'fastparquet': None, 'gcsfs': None, 'matplotlib': '3.1.1', 'numexpr': '2.7.0', 'odfpy': None, 'openpyxl': '3.0.0', 'pandas_gbq': None, 'pyarrow': None, 'pytables': None, 's3fs': None, 'scipy': '1.3.1', 'sqlalchemy': '1.3.9', 'tables': '3.5.2', 'xarray': None, 'xlrd': '1.2.0', 'xlwt': '1.3.0'}}
None

Convert a list, numpy array, or dict to a pd.Series

import numpy as np
mylist = list('abcdefghijklmnopqrstuvwxyz')
myarr = np.arange(26)
# zip() alone yields tuples; wrap it in dict() so the letters become the index
mydict = dict(zip(mylist, myarr))
ser1 = pd.Series(mylist)
ser2 = pd.Series(myarr)
ser3 = pd.Series(mydict)
print(ser3.head())

a    0
b    1
c    2
d    3
e    4
dtype: int64

Convert the index of a series into a dataframe column

df = ser3.to_frame().reset_index()
df.head()

	index	0
0	a	0
1	b	1
2	c	2
3	d	3
4	e	4
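If the default column names `index` and `0` are not descriptive enough, the index and the values can be named during the conversion. A minimal sketch, assuming the illustrative names `letter` and `value` (they are not part of the original exercise):

```python
import pandas as pd

ser = pd.Series({"a": 0, "b": 1, "c": 2})

# rename_axis names the index; reset_index(name=...) names the value column.
# "letter" and "value" are hypothetical names chosen for this sketch.
df = ser.rename_axis("letter").reset_index(name="value")
print(df)
```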

Combine multiple series into one dataframe

df = pd.DataFrame({'col1': ser1, 'col2': ser2})
df.head()

	col1	col2
0	a	0
1	b	1
2	c	2
3	d	3
4	e	4

Combine multiple series into a dataframe, aligned on the index

s1 = ser1[:16]
s2 = ser2[14:]
pd.concat([s1,s2], axis=1)

	0	1
0	a	NaN
1	b	NaN
2	c	NaN
3	d	NaN
4	e	NaN
5	f	NaN
6	g	NaN
7	h	NaN
8	i	NaN
9	j	NaN
10	k	NaN
11	l	NaN
12	m	NaN
13	n	NaN
14	o	14.0
15	p	15.0
16	NaN	16.0
17	NaN	17.0
18	NaN	18.0
19	NaN	19.0
20	NaN	20.0
21	NaN	21.0
22	NaN	22.0
23	NaN	23.0
24	NaN	24.0
25	NaN	25.0
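The columns above are labeled `0` and `1` because the series have no names. As a small sketch on made-up data, `pd.concat` accepts a `keys` argument that supplies the column labels when concatenating along `axis=1`; the names `letters` and `numbers` are assumptions for illustration:

```python
import pandas as pd

s1 = pd.Series(list("abc"), index=[0, 1, 2])
s2 = pd.Series([20, 30], index=[2, 3])

# Rows are aligned on the union of both indexes; missing cells become NaN.
# keys= provides the column labels for the resulting dataframe.
df = pd.concat([s1, s2], axis=1, keys=["letters", "numbers"])
print(df)
```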

Concatenate two series end to end

pd.concat([s1,s2],axis=0)

0      a
1      b
2      c
3      d
4      e
5      f
6      g
7      h
8      i
9      j
10     k
11     l
12     m
13     n
14     o
15     p
14    14
15    15
16    16
17    17
18    18
19    19
20    20
21    21
22    22
23    23
24    24
25    25
dtype: object
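Note that the result above keeps the original labels, so `14` and `15` each appear twice in the index. If a fresh 0..n-1 index is preferred, `ignore_index=True` can be passed. A minimal sketch on small made-up series:

```python
import pandas as pd

s1 = pd.Series(list("abc"))
s2 = pd.Series([10, 20], index=[1, 2])

# ignore_index=True discards the original labels and renumbers from 0
combined = pd.concat([s1, s2], axis=0, ignore_index=True)
print(combined)
```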

Find the elements that are in series A but not in series B

ser1 = pd.Series([1,2,3,4,5])
ser2 = pd.Series([4,5,6,7,8])
ser1[~ser1.isin(ser2)]

0    1
1    2
2    3
dtype: int64

Union of two series

np.union1d(ser1,ser2)

array([1, 2, 3, 4, 5, 6, 7, 8], dtype=int64)

Intersection of two series

np.intersect1d(ser1,ser2)

array([4, 5], dtype=int64)

Elements not common to both series

u = pd.Series(np.union1d(ser1,ser2))
i = pd.Series(np.intersect1d(ser1,ser2))
u[~u.isin(i)]

0    1
1    2
2    3
5    6
6    7
7    8
dtype: int64
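The union-minus-intersection result above is the symmetric difference of the two sets. As an alternative sketch (not the approach used in the original exercise), NumPy provides `np.setxor1d` for this directly:

```python
import numpy as np
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# setxor1d returns the sorted values that occur in exactly one of the inputs
not_shared = np.setxor1d(ser1.values, ser2.values)
print(not_shared)
```

Unlike the `isin` version, this returns a plain array and does not keep any index labels.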

How to get the minimum, 25th percentile, median, 75th percentile, and maximum of a series?

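# No seed is fixed here, so the numbers below change from run to run.
# (Creating np.random.RandomState(100) without using it has no effect, so that call is dropped.)
ser = pd.Series(np.random.normal(10, 5, 25))
np.percentile(ser, q=[0, 25, 50, 75, 100])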

array([-2.54523372,  7.86187042, 10.16123596, 15.60337005, 23.12409334])
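The same five-number summary can also be obtained from pandas itself. A minimal sketch on a small fixed series (chosen so the result is easy to check by hand), using `Series.quantile`, which takes fractions in [0, 1] rather than percentages:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 2, 3, 4, 5])

# quantile() expects fractions, np.percentile() expects percentages
qs = ser.quantile([0, 0.25, 0.5, 0.75, 1])
print(qs)

# Both use linear interpolation by default, so they agree
same = np.allclose(qs.values, np.percentile(ser, q=[0, 25, 50, 75, 100]))
```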

How to get frequency counts of the unique items in a series?

ser = pd.Series(np.take(list('abcdefgh'), np.random.randint(8, size=30)))
ser.value_counts()

h    8
b    5
c    4
g    4
f    4
a    3
d    2
dtype: int64

The two most frequent elements in a series

v_cnt = ser.value_counts()
print(v_cnt)

h    8
b    5
c    4
g    4
f    4
a    3
d    2
dtype: int64

# value_counts() sorts by count in descending order,
# so the first two index labels are the two most frequent elements.
v_cnt.index[:2]

Index(['h', 'b'], dtype='object')
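Another way to pick the most frequent elements, shown here as a sketch on a small made-up series, is `nlargest` applied to the counts. With the default `keep='first'`, ties are resolved in favor of the entries that appear first:

```python
import pandas as pd

ser = pd.Series(list("aaabbc"))

# Counts: a -> 3, b -> 2, c -> 1; keep the two largest counts
top2 = ser.value_counts().nlargest(2)
print(top2)
```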

How to split a numeric series into 10 equal-sized groups

ser = pd.Series(np.random.random(20))
ser.head()

0    0.588945
1    0.356710
2    0.798986
3    0.170943
4    0.076717
dtype: float64

groups = pd.qcut(ser, q=[0, .10, .20, .3, .4, .5, .6, .7, .8, .9, 1], 
        labels=['1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th'])
groups.head()

0    5th
1    3rd
2    9th
3    3rd
4    1st
dtype: category
Categories (10, object): [1st < 2nd < 3rd < 4th ... 7th < 8th < 9th < 10th]
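When the quantile edges are evenly spaced, `pd.qcut` also accepts an integer `q`, and `labels=False` returns integer bin codes instead of named categories. A minimal sketch, assuming a fixed input of 0..19 so that each decile holds exactly two values:

```python
import pandas as pd

ser = pd.Series(range(20))

# q=10 is shorthand for the ten evenly spaced quantile edges;
# labels=False yields bin numbers 0..9 instead of category labels
codes = pd.qcut(ser, q=10, labels=False)
print(codes.value_counts().sort_index())
```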

How to convert a numpy array into a dataframe of a given shape

ser = pd.Series(np.random.randint(1,10,35))
df = pd.DataFrame(ser.values.reshape(7,5))
df

	0	1	2	3	4
0	8	6	5	4	1
1	1	7	8	1	4
2	5	3	5	7	5
3	8	6	3	4	6
4	5	9	2	4	3
5	3	7	6	8	7
6	6	2	2	7	5
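If only the number of columns is known in advance, NumPy can infer the number of rows when one dimension is given as `-1`. A small sketch with a fixed array:

```python
import numpy as np
import pandas as pd

arr = np.arange(35)

# -1 lets NumPy work out the row count (35 / 5 = 7)
df = pd.DataFrame(arr.reshape(-1, 5))
print(df.shape)
```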

How to find the positions of the multiples of 2 in a series

ser = pd.Series(np.random.randint(1,10,7))
ser

0    1
1    8
2    7
3    9
4    9
5    2
6    6
dtype: int32

# ser[ser.map(lambda x: x%2 == 0)].index
np.argwhere(ser % 2 == 0)

array([[1],
       [5],
       [6]], dtype=int64)
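`np.argwhere` returns a 2-D array, as seen above, and its behavior when given a Series directly may differ between pandas versions. One alternative sketch is to work on the underlying array with `np.flatnonzero`, which returns flat positions; the values here mirror the sample output above:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 8, 7, 9, 9, 2, 6])

# Build the boolean mask on the raw array, then take the positions of True
positions = np.flatnonzero(ser.values % 2 == 0)
print(positions)
```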

How to extract items at given positions from a series

ser = pd.Series(list('abcdefghijklmnopqrstuvwxyz'))
pos = [0,4,8,14,20]
ser.take(pos)

0     a
4     e
8     i
14    o
20    u
dtype: object

Get the positions of elements

aims = list('adhz')
[pd.Index(ser).get_loc(i) for i in aims]

[0, 3, 7, 25]
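The list comprehension above calls `get_loc` once per item. As a sketch of a vectorized alternative, `Index.get_indexer` looks up all targets in one call and returns `-1` for values that are not found; it assumes the values in the series are unique:

```python
import pandas as pd

ser = pd.Series(list("abcdefghijklmnopqrstuvwxyz"))
aims = list("adhz")

# One lookup for all targets; requires the index values to be unique
positions = pd.Index(ser).get_indexer(aims)
print(positions)

# A value that does not exist is reported as -1 rather than raising
missing = pd.Index(ser).get_indexer(["a", "?"])
```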

How to compute the mean squared error between the truth and the predicted series

truth = pd.Series(range(10))
pred = pd.Series(range(10)) + np.random.random(10)
np.mean((truth - pred)**2)

0.32466394194250286
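Because the prediction above is random, the result cannot be checked by hand. A small worked sketch with fixed, made-up numbers makes the arithmetic visible: the squared errors are 0.25, 0, 0, and 1, so the mean is 1.25 / 4 = 0.3125.

```python
import pandas as pd

truth = pd.Series([0, 1, 2, 3])
pred = pd.Series([0.5, 1.0, 2.0, 4.0])

# Element-wise error, squared, then averaged
mse = ((truth - pred) ** 2).mean()
print(mse)
```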

How to convert the first character of each element in a series to uppercase

ser = pd.Series(['how','are','you'])
ser.map(lambda x: x.title())

0    How
1    Are
2    You
dtype: object
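For single lowercase words, `title()` does what is asked. With multi-word or mixed-case strings, the choice of method matters. A sketch using the vectorized `.str` accessor on made-up strings shows the difference between `title` and `capitalize`:

```python
import pandas as pd

ser = pd.Series(["how are you", "fINE thanks"])

titled = ser.str.title()            # first letter of every word uppercased
capitalized = ser.str.capitalize()  # only the first character; the rest lowercased
print(titled.tolist())
print(capitalized.tolist())
```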

How to count the number of characters in each word of a series

ser.map(len)

0    3
1    3
2    3
dtype: int64
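Mapping `len` over the series raises a `TypeError` as soon as a value is missing. As a sketch on made-up data containing a `None`, the `.str.len()` accessor propagates the missing value instead:

```python
import pandas as pd

ser = pd.Series(["how", None, "you"])

# Missing entries stay missing rather than raising an error
lengths = ser.str.len()
print(lengths)
```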

How to compute differences of time series data

ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])

# First-order difference
ser.diff()

0    NaN
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
6    6.0
7    8.0
dtype: float64

# Second-order difference
ser.diff().diff()

0    NaN
1    NaN
2    1.0
3    1.0
4    1.0
5    1.0
6    0.0
7    2.0
dtype: float64
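The second-order difference is easy to confuse with `diff(periods=2)`, which instead subtracts the value two steps back. A short sketch on a fixed series contrasts the two:

```python
import pandas as pd

ser = pd.Series([1, 3, 6, 10, 15])

# Difference of the differences: (2, 3, 4, 5) -> 1, 1, 1
second_order = ser.diff().diff()

# Lag-2 difference: ser[t] - ser[t-2] -> 5, 7, 9
lag_two = ser.diff(periods=2)

print(second_order.tolist())
print(lag_two.tolist())
```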

How to convert a series of date strings to datetimes

import pandas as pd
ser = pd.Series(
    ['01 Jan 2010', 
     '02-02-2011', 
     '20120303', 
     '2013/0
