pandas_Sample-Foundational

Xemacil

Category: The Data Analyst's Self-Cultivation. Tags: data analysis, python, data mining, pandas

Article link: https://blog.csdn.net/Xemacil/article/details/115803880

This post covers the basic part of the 50 pandas exercises, questions 1 to 22, which are the easier ones to follow. More will be added later.

1. Import the pandas library as pd and print its version

import pandas as pd
import numpy as np
pd.__version__

'1.1.3'

2. Create a Series from a list

arr = [0,1,2,3,4]
df = pd.Series(arr) # with no index specified, the default index starts at 0
df

0    0
1    1
2    2
3    3
4    4
dtype: int64
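The default index above is positional (0 to 4). As a small sketch of the alternative, an explicit index can be passed when the Series is built. The labels used here are hypothetical and chosen only for illustration.

```python
import pandas as pd

arr = [0, 1, 2, 3, 4]
# Hypothetical labels; any list with the same length as arr works.
labels = ['a', 'b', 'c', 'd', 'e']
s = pd.Series(arr, index=labels)

print(s['c'])      # label-based lookup
print(s.iloc[2])   # position-based lookup of the same element
```

With a custom index, `s['c']` and `s.iloc[2]` refer to the same element, one by label and one by position.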

3. Create a Series from a dict

d = {'a':1,'b':2,'c':3,'d':4,'e':5}
df = pd.Series(d)
df

a    1
b    2
c    3
d    4
e    5
dtype: int64

4. Create a DataFrame from a NumPy array

class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.

Parameters

data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
Dict can contain Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows insertion-order.

Changed in version 0.25.0: If data is a list of dicts, column order follows insertion-order.

index : Index or array-like
Index to use for resulting frame. Will default to RangeIndex if no indexing information part of input data and no index provided.

columns : Index or array-like
Column labels to use for resulting frame. Will default to RangeIndex (0, 1, 2, …, n) if no column labels are provided.

dtype : dtype, default None
Data type to force. Only a single dtype is allowed. If None, infer.

copy : bool, default False
Copy data from inputs. Only affects DataFrame / 2d ndarray input.

dates = pd.date_range('today',periods = 6) # a date range used as the index
num_arr = np.random.randn(6,4)  # a 6-row by 4-column random NumPy array
columns = ['A','B','C','D']   # a list used as the column names
df1 = pd.DataFrame(num_arr, index = dates, columns = columns)
df1

	A	B	C	D
2021-04-17 23:03:54.397660	-1.520920	0.092495	-0.487495	-0.466914
2021-04-18 23:03:54.397660	-0.289537	0.108166	0.192073	-0.013956
2021-04-19 23:03:54.397660	0.693032	-0.445103	-0.425715	0.944692
2021-04-20 23:03:54.397660	0.403142	0.062311	0.544867	0.554797
2021-04-21 23:03:54.397660	1.535514	-0.539361	-0.096690	0.197693
2021-04-22 23:03:54.397660	-0.676061	0.951746	0.312777	0.724948
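Because the example uses 'today' and unseeded random numbers, the table above changes on every run. A reproducible variant might look like the sketch below. The fixed start date and the seed value are arbitrary assumptions, not part of the original exercise.

```python
import numpy as np
import pandas as pd

# Arbitrary seed, chosen only so that reruns give the same numbers.
rng = np.random.default_rng(seed=0)
# Fixed start date instead of 'today'.
dates = pd.date_range('2021-04-17', periods=6)

df1 = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list('ABCD'))
print(df1.shape)
print(df1.index[0])
```

The frame still has six daily rows and four columns, but the index now starts at midnight on a known date.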

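5. Create a DataFrame from a CSV file, with ';' as the separator and gbk as the encoding

# test.csv is a placeholder file name, so the call is left commented out
# df = pd.read_csv('test.csv', encoding='gbk', sep=';')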
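Since there is no test.csv to read here, the same call can be tried on in-memory bytes. This is only a sketch: the file contents are made up, and they are encoded as GBK to mimic a file saved with that encoding.

```python
import io
import pandas as pd

# Hypothetical file contents: a header row and two records, separated by ';'.
raw = "名称;数量\n苹果;3\n香蕉;5\n".encode('gbk')

# read_csv accepts a binary file-like object together with an encoding.
csv_df = pd.read_csv(io.BytesIO(raw), encoding='gbk', sep=';')
print(csv_df)
```

Without `encoding='gbk'` the default UTF-8 decoding would normally fail on these bytes, and without `sep=';'` each line would be read as a single column.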

6. Create a DataFrame from the dict object data, using labels as the index

import numpy as np
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df = pd.DataFrame(data, index = labels)
df

	animal	age	visits	priority
a	cat	2.5	1	yes
b	cat	3.0	3	yes
c	snake	0.5	2	no
d	dog	NaN	3	yes
e	dog	5.0	2	no
f	cat	2.0	3	no
g	snake	4.5	1	no
h	cat	NaN	1	yes
i	dog	7.0	2	no
j	dog	3.0	1	no

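7. Show basic information about the DataFrame: the number of rows, the column names, and the count and dtype of the values in each column

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   animal    10 non-null     object 
 1   age       8 non-null      float64
 2   visits    10 non-null     int64  
 3   priority  10 non-null     object 
dtypes: float64(1), int64(1), object(2)
memory usage: 400.0+ bytes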

df.shape

(10, 4)

df.describe()

	age	visits
count	8.000000	10.000000
mean	3.437500	1.900000
std	2.007797	0.875595
min	0.500000	1.000000
25%	2.375000	1.000000
50%	3.000000	2.000000
75%	4.625000	2.750000
max	7.000000	3.000000
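The same facts (row count, column names, non-null counts, dtypes) can also be pulled out as ordinary Python objects, which is handy when they are needed in code rather than just printed. A minimal sketch on the same data:

```python
import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
df = pd.DataFrame(data, index=list('abcdefghij'))

n_rows = len(df)               # number of rows
col_names = list(df.columns)   # column names
non_null = df.count()          # non-null values per column
dtypes = df.dtypes             # dtype of each column

print(n_rows, col_names)
print(non_null)
print(dtypes)
```

Note that `count()` skips missing values, which is why age reports 8 while the other columns report 10.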

8. Show the first 3 rows of df

df.iloc[:3]

	animal	age	visits	priority
a	cat	2.5	1	yes
b	cat	3.0	3	yes
c	snake	0.5	2	no

9. Select the animal and age columns of df

Using loc: loc[rows, columns], where rows and columns are each a list of labels (a slice such as : selects everything)

df.loc[:,['animal','age']]

	animal	age
a	cat	2.5
b	cat	3.0
c	snake	0.5
d	dog	NaN
e	dog	5.0
f	cat	2.0
g	snake	4.5
h	cat	NaN
i	dog	7.0
j	dog	3.0
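One detail of loc that is easy to trip over: a label slice includes its end label, while a positional slice with iloc excludes the stop position. A small sketch on the same data, assuming nothing beyond the standard indexers:

```python
import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
df = pd.DataFrame(data, index=list('abcdefghij'))

# Label-based: the slice 'a':'c' includes row 'c'.
by_label = df.loc['a':'c', ['animal', 'age']]
# Position-based: 0:3 stops before position 3, columns 0 and 1 are animal and age.
by_position = df.iloc[0:3, [0, 1]]

print(by_label)
print(by_label.equals(by_position))
```

Both selections return rows a, b and c with the animal and age columns.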

10. Select the animal and age columns for the rows at positions [3, 4, 8]

df.loc[df.index[[3,4,8]],['animal','age']]

	animal	age
d	dog	NaN
e	dog	5.0
i	dog	7.0

11. Select the rows where age is greater than 3

df[df['age'] > 3]

	animal	age	visits	priority
e	dog	5.0	2	no
g	snake	4.5	1	no
i	dog	7.0	2	no

12. Select the rows where age is missing

df[df['age'].isnull()]

	animal	age	visits	priority
d	dog	NaN	3	yes
h	cat	NaN	1	yes

13. Select the rows where age is between 2 and 4 (endpoints excluded)

df[(df['age'] > 2)&(df['age'] < 4)]

	animal	age	visits	priority
a	cat	2.5	1	yes
b	cat	3.0	3	yes
j	dog	3.0	1	no

# between() includes both endpoints by default, so row f (age 2.0) is also returned here
df[df['age'].between(2, 4)]

	animal	age	visits	priority
a	cat	2.5	1	yes
b	cat	3.0	3	yes
f	cat	2.0	3	no
j	dog	3.0	1	no
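To get the exclusive range with between(), the endpoints have to be turned off explicitly. The sketch below assumes pandas 1.3 or later, where the inclusive parameter accepts the string 'neither'. Older releases such as the 1.1.3 used in this post take a boolean instead (inclusive=False).

```python
import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
df = pd.DataFrame(data, index=list('abcdefghij'))

# Explicit comparison: strictly greater than 2 and strictly less than 4.
by_mask = df[(df['age'] > 2) & (df['age'] < 4)]
# Same filter with between(); inclusive='neither' assumes pandas >= 1.3.
by_between = df[df['age'].between(2, 4, inclusive='neither')]

print(by_between)
print(by_mask.equals(by_between))
```

Both filters drop row f, whose age is exactly 2.0, and rows with a missing age never match either condition.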

14. Change the age in row f to 1.5

df.loc['f','age'] = 1.5
df

	animal	age	visits	priority
a	cat	2.5	1	yes
b	cat	3.0	3	yes
c	snake	0.5	2	no
d	dog	NaN	3	yes
e	dog	5.0	2	no
f	cat	1.5	3	no
g	snake	4.5	1	no
h	cat	NaN	1	yes
i	dog	7.0	2	no
j	dog	3.0	1	no

15. Compute the sum of visits

df['visits'].sum()

19

16. Compute the mean age for each kind of animal

df.groupby('animal')['age'].mean()

animal
cat      2.333333
dog      5.000000
snake    2.500000
Name: age, dtype: float64
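The group means above silently skip missing ages. One way to make that visible is to aggregate the mean together with the non-null count. This is a sketch on the same data, with row f already set to 1.5 as in question 14.

```python
import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
df = pd.DataFrame(data, index=list('abcdefghij'))
df.loc['f', 'age'] = 1.5   # same change as in question 14

# 'count' reports how many non-missing ages went into each mean.
summary = df.groupby('animal')['age'].agg(['mean', 'count'])
print(summary)
```

There are four cats and four dogs in the data, but each group has one missing age, so both means are computed from three values.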

17. Count the number of each kind of animal in df

df['animal'].value_counts()

dog      4
cat      4
snake    2
Name: animal, dtype: int64

18. Sort by age in descending order, then by visits in ascending order

df.sort_values(by = ['age','visits'],ascending=[False,True]) # sort on several columns, one direction per column

	animal	age	visits	priority
i	dog	7.0	2	no
e	dog	5.0	2	no
g	snake	4.5	1	no
j	dog	3.0	1	no
b	cat	3.0	3	yes
a	cat	2.5	1	yes
f	cat	1.5	3	no
c	snake	0.5	2	no
h	cat	NaN	1	yes
d	dog	NaN	3	yes
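In the sorted table the rows with a missing age end up at the bottom, even though age is sorted in descending order. sort_values places NaN last by default, and the na_position parameter can move those rows to the top instead. A short sketch on the same data:

```python
import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
df = pd.DataFrame(data, index=list('abcdefghij'))
df.loc['f', 'age'] = 1.5   # same change as in question 14

# Same sort keys as above, but rows with a missing age are placed first.
nan_first = df.sort_values(by=['age', 'visits'],
                           ascending=[False, True],
                           na_position='first')
print(nan_first)
```

Rows d and h now lead the result, followed by the rows with known ages from oldest to youngest.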

19. Replace yes and no in the priority column with the booleans True and False

df['priority'] = df['priority'].map({'yes':True,'no':False})
df

	animal	age	visits	priority
a	cat	2.5	1	True
b	cat	3.0	3	True
c	snake	0.5	2	False
d	dog	NaN	3	True
e	dog	5.0	2	False
f	cat	1.5	3	False
g	snake	4.5	1	False
h	cat	NaN	1	True
i	dog	7.0	2	False
j	dog	3.0	1	False
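One thing to keep in mind with map and a dict: any value that is not a key in the dict becomes NaN rather than being left alone. The Series below is a made-up example with an extra 'maybe' value to show the effect.

```python
import pandas as pd

# Hypothetical input containing a value that is missing from the mapping.
s = pd.Series(['yes', 'no', 'maybe'])
mapped = s.map({'yes': True, 'no': False})

print(mapped)
print(mapped.isna().sum())   # number of values that had no mapping
```

The priority column in this exercise only contains yes and no, so the result there has no missing values.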

20. Replace snake with python in the animal column

df['animal'] = df['animal'].replace('snake','python')
df

	animal	age	visits	priority
a	cat	2.5	1	True
b	cat	3.0	3	True
c	python	0.5	2	False
d	dog	NaN	3	True
e	dog	5.0	2	False
f	cat	1.5	3	False
g	python	4.5	1	False
h	cat	NaN	1	True
i	dog	7.0	2	False
j	dog	3.0	1	False

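21. For each kind of animal and each distinct number of visits, compute the mean age. In other words, return a table whose rows are the animal kinds, whose columns are the visit counts, and whose values are the mean age for that combination of animal and visits

# check the data types first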
df.dtypes

animal       object
age         float64
visits        int64
priority       bool
dtype: object

df.age = df.age.astype(float)

df.pivot_table(index = 'animal',columns = 'visits',values = 'age',aggfunc = 'mean')

visits	1	2	3
animal
cat	2.5	NaN	2.25
dog	3.0	6.0	NaN
python	4.5	0.5	NaN
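For this particular table, pivot_table gives the same result as grouping by both keys and moving the visits level into the columns. The sketch below rebuilds the data with the changes from questions 14 and 20 and compares the two results.

```python
import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
df = pd.DataFrame(data, index=list('abcdefghij'))
df.loc['f', 'age'] = 1.5                                # question 14
df['animal'] = df['animal'].replace('snake', 'python')  # question 20

pivot = df.pivot_table(index='animal', columns='visits', values='age', aggfunc='mean')
# Group on both keys, take the mean, then turn the visits level into columns.
via_groupby = df.groupby(['animal', 'visits'])['age'].mean().unstack()

print(via_groupby)
print(np.allclose(pivot.to_numpy(), via_groupby.to_numpy(), equal_nan=True))
```

Cells with no data, or with only missing ages, show up as NaN in both versions.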

22. Insert a new row k into df, then delete it

# the values must follow the column order: animal, age, visits, priority
df.loc['k'] = ['dog', 5.5, 2, False]
df

	animal	age	visits	priority
a	cat	2.5	1	True
b	cat	3.0	3	True
c	python	0.5	2	False
d	dog	NaN	3	True
e	dog	5.0	2	False
f	cat	1.5	3	False
g	python	4.5	1	False
h	cat	NaN	1	True
i	dog	7.0	2	False
j	dog	3.0	1	False
k	dog	5.5	2	False

df = df.drop('k')
df

	animal	age	visits	priority
a	cat	2.5	1	True
b	cat	3.0	3	True
c	python	0.5	2	False
d	dog	NaN	3	True
e	dog	5.0	2	False
f	cat	1.5	3	False
g	python	4.5	1	False
h	cat	NaN	1	True
i	dog	7.0	2	False
j	dog	3.0	1	False
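Assigning a list through loc depends on getting the value order right. An alternative sketch is to build the new row as a one-row DataFrame with named columns and append it with pd.concat, so the order of the values no longer matters. The values for row k are the ones used in this exercise, with priority written as a boolean.

```python
import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'python', 'dog', 'dog', 'cat', 'python', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 1.5, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': [True, True, False, True, False, False, False, True, False, False]}
df = pd.DataFrame(data, index=list('abcdefghij'))

# One-row frame: the columns are matched by name, not by position.
new_row = pd.DataFrame({'priority': [False], 'visits': [2], 'age': [5.5], 'animal': ['dog']},
                       index=['k'])

extended = pd.concat([df, new_row])   # df itself is left unchanged
restored = extended.drop('k')         # drop returns a new frame without row k

print(extended.tail(2))
print(restored.equals(df))
```

Because both frames have matching column types, the column dtypes are kept after the append and after the drop.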
