[Study Summary: numpy, pandas, matplotlib]

Latest recommended article published on 2024-02-19 17:02:49

吾仪

Tags: numpy, learning, pandas, matplotlib

Article link: https://blog.csdn.net/weixin_68746494/article/details/125695420

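I. NumPy

NumPy is a powerful Python library used mainly for computing on multidimensional arrays. The name combines two words, "Numerical" and "Python". NumPy provides a large set of functions and operations that make numerical computation straightforward.

(1) Creating arrays

1. From a list

Create a one-dimensional array

import numpy as np
a1 = np.array([1, 2, 3, 4])

Create a two-dimensional array

import numpy as np
a2 = np.array([[1, 2], [3, 4]])

2. From a function

np.zeros creates an array filled with 0
np.ones creates an array filled with 1
np.eye creates an identity matrix
np.arange creates an array of evenly spaced numbers (starting from 0 by default)
The np.random module creates arrays of random numbers
np.linspace creates a one-dimensional array that divides a given range into a specified number of evenly spaced points
reshape returns an array with the same data arranged in a new shape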
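
The creation functions listed above can be tried out in a few lines. This is a minimal sketch; the variable names and the shapes chosen are only for illustration.

```python
import numpy as np

zeros = np.zeros((2, 3))               # 2x3 array filled with 0.0
ones = np.ones((2, 3))                 # 2x3 array filled with 1.0
identity = np.eye(3)                   # 3x3 identity matrix
steps = np.arange(0, 10, 2)            # evenly spaced integers: 0, 2, 4, 6, 8
points = np.linspace(0, 1, 5)          # 5 evenly spaced points from 0 to 1, endpoints included
randoms = np.random.rand(2, 2)         # 2x2 array of uniform random numbers in [0, 1)
reshaped = np.arange(6).reshape(2, 3)  # the numbers 0..5 arranged as 2 rows by 3 columns

print(steps)
print(points)
print(reshaped)
```

Note that np.arange excludes the stop value, while np.linspace includes it by default.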

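(2) Array operations

1. Operations between an array and a scalar
import numpy as np
a = np.array([1, 2, 3])
print(a + 1)  # the scalar is applied to every element
2. Operations between arrays of the same shape
import numpy as np
a = np.arange(1, 10).reshape(3, 3)
b = np.arange(1, 10).reshape(3, 3)
print(a + b)
print(a - b)
print(a * b)  # element-wise product, not matrix multiplication
3. Operations between arrays of different dimensions (broadcasting)

import numpy as np
x = np.array([[1, 2, 3], [2, 2, 3], [4, 5, 6]])
y = np.array([2, 2, 1])
print(x + y)  # y is added to every row of x

Output:
[[3 4 4]
 [4 4 4]
 [6 7 7]]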

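(3) Array indexing

1. Indexing a one-dimensional array
arr = np.arange(10)
arr[5]      # the 6th element
arr[5:8]    # the 6th through 8th elements, as an array
arr[:] = 3  # set every element to 3
2. Indexing a two-dimensional array
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr[1]      # the 2nd row
arr[0, 2]   # the element in row 1, column 3 (same as arr[0][2])
arr[:2]     # the 1st and 2nd rows (the 3rd row is excluded)
arr[:, 1]   # the 2nd column

import numpy as np
a = np.array([[1, 2, 3], [2, 2, 3], [4, 5, 6]])
print(a[1])
print(a[1, 2])
print(a[:2])
print(a[:, 1])

Output:
[2 2 3]
3
[[1 2 3]
 [2 2 3]]
[2 2 5]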

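3. Boolean indexing

import numpy as np
a = np.array([[1, 2, 3], [2, 2, 3], [4, 5, 6]])
print(a < 5)

Output:
[[ True  True  True]
 [ True  True  True]
 [ True False False]]

A boolean mask can also be used for assignment: a[a < 5] = 1 sets every element below 5 to 1, and a[a >= 5] = 0 then sets every element of 5 or more to 0.
4. Ternary operator
np.where(condition, x, y) works like an element-wise ternary operator: where the condition is true the result takes x, otherwise it takes y.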
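
Mask assignment and np.where can be compared side by side. The sketch below reuses the array a from the boolean indexing example; the names mask, b, and c are introduced here only for illustration.

```python
import numpy as np

a = np.array([[1, 2, 3], [2, 2, 3], [4, 5, 6]])

# Mask assignment: compute the mask once, then write through it.
mask = a < 5
b = a.copy()      # work on a copy so the original array stays unchanged
b[mask] = 1       # elements below 5 become 1
b[~mask] = 0      # all remaining elements (5 or more) become 0

# np.where builds the same result in one expression without modifying a.
c = np.where(a < 5, 1, 0)

print(b)
print(np.array_equal(b, c))
```

Computing the mask before any assignment keeps the two writes independent of each other, which matters when the replacement values could themselves satisfy the second condition.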

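(4) Common functions

axis=0 stacks arrays vertically; the arrays must have the same number of columns
axis=1 stacks arrays horizontally; the arrays must have the same number of rows

1. np.concatenate joins arrays along an existing axis

2. np.vstack() stacks arrays vertically
np.hstack() stacks arrays horizontally

3. np.mean() computes the mean

4. np.sort() returns a sorted copy

5. np.unique() returns the sorted unique values

6. np.random.uniform() draws uniform random numbers; for example, np.random.uniform(0, 10, (3, 4)) generates 3 rows and 4 columns of values in [0, 10)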
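
The functions in this list can be exercised on two small arrays. This is a minimal sketch; the arrays a and b and the other variable names are made up for illustration.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

vertical = np.concatenate((a, b), axis=0)    # stack rows: shape (4, 2), same as np.vstack((a, b))
horizontal = np.concatenate((a, b), axis=1)  # stack columns: shape (2, 4), same as np.hstack((a, b))

col_mean = np.mean(a, axis=0)   # mean of each column
row_mean = np.mean(a, axis=1)   # mean of each row
total_mean = np.mean(a)         # mean of all elements

values = np.array([3, 1, 2, 3, 1])
ordered = np.sort(values)       # sorted copy; values itself is unchanged
distinct = np.unique(values)    # sorted unique values

samples = np.random.uniform(0, 10, size=(3, 4))  # 3x4 uniform samples from [0, 10)

print(vertical)
print(horizontal)
print(col_mean, row_mean, total_mean)
print(ordered, distinct)
```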

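II. pandas

pandas is built mainly around two data structures, Series and DataFrame.
(1) Series: a one-dimensional data structure made up of an index and values.
(2) DataFrame: a two-dimensional tabular data structure made up of an index, values, and columns.

(1) Data structures

1. Series

Creating a Series
(1) From a list
import pandas as pd
a = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
Output:
a    1
b    2
c    3
(2) From a scalar
import pandas as pd
a = pd.Series(25, index=['a', 'b', 'c'])
Output:
a    25
b    25
c    25
(3) From a Python dict
import pandas as pd
a = {'o': 3500, 't': 7100, 'u': 5000}
b = pd.Series(a)
Output:
o    3500
t    7100
u    5000
Series operations (indexing and slicing)
a.index        # the index labels
a.values       # the values
a.index[x:y]   # a slice of the index labels
a.values[x:y]  # a slice of the values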
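
As a small illustration of these operations, the sketch below slices the index and the values of the Series created from a list, and also shows access by label and by position. The variable names are illustrative.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

first_labels = list(a.index[0:2])      # labels at positions 0 and 1
first_values = a.values[0:2].tolist()  # values at positions 0 and 1

by_label = a['b']        # access a single element by its label
by_position = a.iloc[0]  # access a single element by its integer position
label_slice = a['a':'b'] # label slices include both endpoints

print(first_labels, first_values)
print(by_label, by_position)
print(label_slice)
```

Unlike position slices, a slice by label keeps the end label, so a['a':'b'] contains two elements.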

2. DataFrame

A DataFrame is a two-dimensional tabular data structure with both a row index and a column index.

Creating a DataFrame

(1) From a two-dimensional array
(2) From a dict
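
Both ways of creating a DataFrame can be sketched as follows. The row labels, the student names, and the ssex column name are assumptions made up for this example; only the name and sage column names appear elsewhere in these notes.

```python
import numpy as np
import pandas as pd

# (1) From a two-dimensional array, with explicit row and column labels.
df_from_array = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    index=['r1', 'r2'],
    columns=['A', 'B', 'C'],
)

# (2) From a dict: each key becomes a column, each list holds that column's values.
df_from_dict = pd.DataFrame({
    'name': ['Li', 'Wang', 'Zhang'],  # hypothetical student names
    'sage': [20, 21, 19],
    'ssex': ['F', 'M', 'M'],          # hypothetical column name for sex
})

print(df_from_array)
print(df_from_dict)
```

When no index is given, as in the dict case, pandas assigns the default integer index 0, 1, 2, and so on.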

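DataFrame attributes and operations

1 Basic attributes
(1) df.shape    # number of rows and columns
(2) df.dtypes   # data type of each column
(3) df.ndim     # number of dimensions
(4) df.index    # row index
(5) df.columns  # column index
(6) df.values   # the underlying values
2 Basic operations
(1) df.head(3)      # show the first 3 rows
(2) df.tail(3)      # show the last 3 rows
(3) df.info()       # summary: number of rows and columns, index, non-null count per column, column types, and so on
(4) df.describe()   # summary statistics: mean, maximum, minimum, standard deviation, and so on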

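Indexing

(1) The label indexer loc selects rows and columns by label.
(2) The position indexer iloc selects rows and columns by integer position.

(3) Boolean indexing selects the rows for which a condition is true.

Accessing elements
(1) Get the name of student 0.
student.loc[0, 'name'] or
student.iloc[0, 0]
(2) Change the age of student 1 to 23.
student.loc[1, 'sage'] = 23 or
student.iloc[1, 1] = 23
Dropping entries by index
drop() removes the specified rows or columns from a DataFrame.
(1) Delete the record of student 0.
student = student.drop([0])  # drop a row
(2) Delete the age column.
student = student.drop(['sage'], axis=1)  # drop a column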
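
The indexing and drop operations above can be run end to end on a small table. This is a sketch under assumptions: the student data and the ssex column name are invented for illustration, and name is placed first so that position 0 matches the iloc calls in the text.

```python
import pandas as pd

student = pd.DataFrame({
    'name': ['Li', 'Wang', 'Zhang'],  # hypothetical students
    'sage': [20, 21, 19],
    'ssex': ['F', 'M', 'M'],          # hypothetical column name for sex
})

name_by_label = student.loc[0, 'name']  # row label 0, column label 'name'
name_by_position = student.iloc[0, 0]   # first row, first column

student.loc[1, 'sage'] = 23             # change the age of student 1

older = student[student['sage'] > 20]   # boolean indexing: rows where age exceeds 20

without_row0 = student.drop([0])              # returns a new table without row 0
without_age = student.drop(['sage'], axis=1)  # returns a new table without the age column

print(name_by_label, name_by_position)
print(older)
print(without_row0)
print(without_age)
```

drop() returns a new DataFrame and leaves the original untouched, which is why the text reassigns the result to student.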
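Combining tables (concat, merge)

(1) concat
concat joins two tables side by side or stacks them. Its key parameter is axis, which specifies the axis to concatenate along: axis=0 stacks vertically, axis=1 joins horizontally, and the default is axis=0.

(2) merge
merge combines tables on specified columns according to a chosen join type.
Four common join types are available, set through the how parameter. In the calls below, "第二列" and "第三列" are column names (literally "second column" and "third column").
(1) inner (the default): the intersection of the keys
a1.merge(a2, left_on="第二列", right_on="第三列", how="inner")
(2) outer: the union of the keys, with missing values filled with NaN
a1.merge(a2, left_on="第二列", right_on="第三列", how="outer")
(3) left: keep every key from the left table, with missing values filled with NaN
a1.merge(a2, left_on="第二列", right_on="第三列", how="left")
(4) right: keep every key from the right table, with missing values filled with NaN
a1.merge(a2, left_on="第二列", right_on="第三列", how="right")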
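
To make the difference between the join types concrete, here is a minimal sketch with two small tables. The tables and the column names k1, k2, left_val, and right_val are invented for this example.

```python
import pandas as pd

a1 = pd.DataFrame({'k1': [1, 2, 3], 'left_val': ['a', 'b', 'c']})
a2 = pd.DataFrame({'k2': [2, 3, 4], 'right_val': ['x', 'y', 'z']})

stacked = pd.concat([a1, a1], axis=0, ignore_index=True)  # vertical: 6 rows, 2 columns
side_by_side = pd.concat([a1, a2], axis=1)                # horizontal: 3 rows, 4 columns

inner = a1.merge(a2, left_on='k1', right_on='k2', how='inner')  # keys 2 and 3
outer = a1.merge(a2, left_on='k1', right_on='k2', how='outer')  # keys 1, 2, 3, 4
left = a1.merge(a2, left_on='k1', right_on='k2', how='left')    # keys 1, 2, 3
right = a1.merge(a2, left_on='k1', right_on='k2', how='right')  # keys 2, 3, 4

print(inner)
print(outer)
print(left)
print(right)
```

In the outer, left, and right results, cells with no matching row on the other side are filled with NaN.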

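Statistics

Method	Description
.sum()	Sum of the values
.count()	Number of non-NaN (non-missing) values
.mean() .median()	Arithmetic mean, median
.var() .std()	Variance, standard deviation
.min() .max()	Minimum, maximum
.describe()	Summary statistics of the columns (numeric columns by default)
.info()	Check for missing values

Create a student table with name, age, and sex columns.
The statistics required are:
(1) The sum of all students' ages: student['sage'].sum()
(2) The maximum age among all students: student['sage'].max()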
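
The two requirements can be checked on a small student table. The data and the ssex column name below are assumptions for illustration only.

```python
import pandas as pd

student = pd.DataFrame({
    'name': ['Li', 'Wang', 'Zhang'],  # hypothetical students
    'sage': [20, 21, 19],
    'ssex': ['F', 'M', 'M'],          # hypothetical column name for sex
})

total_age = student['sage'].sum()   # sum of all ages
max_age = student['sage'].max()     # largest age
mean_age = student['sage'].mean()   # arithmetic mean of the ages
n_ages = student['sage'].count()    # number of non-missing ages

print(total_age, max_age, mean_age, n_ages)
print(student.describe())           # summary statistics of the numeric column
```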

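Grouped statistics
Grouping involves three steps:
(1) Split (Splitting): divide the data into groups
(2) Apply (Applying): apply a function to each group
(3) Combine (Combining): gather the per-group results into a single result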
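
The three steps map directly onto a groupby call. A minimal sketch, assuming a small student table whose data and ssex column name are invented for illustration:

```python
import pandas as pd

student = pd.DataFrame({
    'name': ['Li', 'Wang', 'Zhang'],  # hypothetical students
    'sage': [20, 23, 19],
    'ssex': ['F', 'M', 'M'],          # hypothetical column name for sex
})

# Split by sex, apply mean() to each group's ages, combine into one Series.
mean_age_by_sex = student.groupby('ssex')['sage'].mean()

# Several aggregations at once: the result is a DataFrame with one row per group.
summary = student.groupby('ssex')['sage'].agg(['count', 'max'])

print(mean_age_by_sex)
print(summary)
```

The grouping keys (here 'F' and 'M') become the index of the combined result.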
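
Sorting data

(1) sort_values
This method sorts by value. It can be used in two ways: sorting a column, and sorting by a column.

Sort a column
This approach picks a specific column out of the DataFrame and sorts that column. Only the selected column is affected, and the result is a Series.

dataframe.colname.sort_values()
dataframe.colname.sort_values(ascending=False)
dataframe["colname"].sort_values()
dataframe["colname"].sort_values(ascending=False)

Sort by a column
This approach sorts the DataFrame by the values of the given column or columns. Unlike the previous approach, it reorders the whole DataFrame.

Single column
dataframe.sort_values("colname")
Multiple columns
dataframe.sort_values(["col1", "col2", …, "coln"])

(2) sort_index
This method sorts by row labels or by column labels, depending on axis.

axis=0 (the default) sorts by row labels; ascending sets the sort direction

dataframe.sort_index()
dataframe.sort_index(ascending=False)

axis=1 sorts by column labels; ascending sets the sort direction

dataframe.sort_index(axis=1)
dataframe.sort_index(ascending=False, axis=1)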
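
The sketch below shows each sorting variant on a tiny table whose rows and columns are deliberately out of order. The table and its labels are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'b': [3, 1, 2], 'a': [9, 8, 7]}, index=[2, 0, 1])

column_sorted = df['b'].sort_values()              # a Series: only column b, sorted
by_b = df.sort_values('b')                         # whole table, rows ordered by b
by_b_desc = df.sort_values('b', ascending=False)   # same, in descending order
by_a_then_b = df.sort_values(['a', 'b'])           # order by a, ties broken by b

rows_by_label = df.sort_index()                    # axis=0: rows ordered by row label
cols_by_label = df.sort_index(axis=1)              # axis=1: columns ordered by column name

print(by_b)
print(rows_by_label)
print(cols_by_label)
```

All of these return new objects; df itself keeps its original order.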

File operations
Reading a file
pd.read_csv(filepath, sep=',', delimiter=None, header='infer', names=None, index_col=None, nrows=None, encoding=None, skiprows=None)
Note: the prefix parameter that appears in older versions of this signature was removed in pandas 2.0.

df = pd.read_csv('train.csv')

Saving a file
to_csv(path, sep, na_rep, columns, header, index)

df.to_csv('train_chinese.csv')

Exporting to another format
df.to_html("test.html")
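
A round trip through a CSV file shows how these calls fit together. This sketch writes into a temporary directory so it leaves no files behind; the table contents and the file names students.csv and students.html are invented for the example.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'name': ['Li', 'Wang'], 'sage': [20, 21]})  # hypothetical data

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'students.csv')
    df.to_csv(csv_path, index=False)   # index=False keeps the row index out of the file
    loaded = pd.read_csv(csv_path)     # read the same file back

    html_path = os.path.join(tmp, 'students.html')
    df.to_html(html_path)              # write the table as an HTML <table>
    html_written = os.path.exists(html_path)

print(loaded)
print(html_written)
```

Without index=False, to_csv also writes the row index, which read_csv would then load back as an extra unnamed column.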

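III. matplotlib

- Line chart

import matplotlib.pyplot as plt
x = range(2, 26, 2)     # x coordinates
y = [15, 13, 14.5, 17, 20, 25, 26, 26, 27, 22, 18, 15]    # y coordinates
plt.plot(x, y)          # draw the line
plt.xticks(x)           # set the x-axis ticks
plt.title("温度的变化", fontproperties='simhei', fontsize=20)   # chart title ("Temperature change"); the SimHei font is needed to render Chinese text
plt.xlabel('time')      # x-axis label
plt.ylabel('temperature')  # y-axis label
plt.savefig("test.png") # save the figure to a file (PNG by default); the dpi argument controls output resolution
plt.show()              # display the chart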

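Line color and style
When drawing a line, we can specify its color, style, width, transparency, and so on.
The color parameter sets the line color, using the codes below:

Symbol	Color
b	blue
g	green
r	red
c	cyan
m	magenta
y	yellow
k	black
The linestyle parameter sets the line style:
Line style	Description
'-'	solid
'--'	dashed
'-.'	dash_dot
':'	dotted
'None'	draw nothing
''	draw nothing
The marker parameter sets the point marker:
Marker	Description
"o"	circle
"v"	triangle_down
"s"	square
"p"	pentagon
"*"	star
"h"	hexagon1
"+"	plus
"D"	diamond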
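
The codes in these tables can be passed either as keyword arguments or packed into a short format string. A minimal sketch with made-up data; it selects the non-interactive Agg backend so it also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
fig, ax = plt.subplots()

# Keyword form: red dashed line with circle markers, thicker and slightly transparent.
(line1,) = ax.plot(x, [1, 4, 9, 16], color='r', linestyle='--', marker='o',
                   linewidth=2, alpha=0.8, label='red dashed, circles')

# Format-string form: 'g:s' means green, dotted, square markers.
(line2,) = ax.plot(x, [2, 3, 5, 7], 'g:s', label='green dotted, squares')

ax.legend()
print(line1.get_linestyle(), line1.get_marker())
print(line2.get_color(), line2.get_linestyle(), line2.get_marker())
```

To view or keep the figure, call plt.show() in an interactive session or fig.savefig with a file name of your choice.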

- Multiple subplots

import matplotlib.pyplot as plt
x = [2, 4, 6, 8, 10, 12]
y = [41, 43, 45, 50, 43, 41]
plt.subplot(2, 1, 1)     # 2 rows, 1 column, 1st panel
plt.plot(x, x, color='red')

plt.subplot(2, 1, 2)     # 2 rows, 1 column, 2nd panel
plt.plot(y, y, color='green')
plt.savefig("test3.png")
plt.show()

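Common chart types

matplotlib can also draw pie charts, scatter plots, bar charts, histograms, box plots, and more.

- Pie chart

A pie chart shows the size of each item as a proportion of the total. For example: frogs 15%, logs 10%, hogs 30%, dogs 45%.
The function plt.pie(x, explode, labels, autopct, shadow, startangle) draws a pie chart, where
(1) x is the data
(2) explode specifies how far each wedge is pulled out from the center, to emphasize it
(3) labels gives the label of each wedge
(4) autopct is the format of the percentage labels; '%0.1f%%' keeps one decimal place
(5) shadow is a bool that controls whether a shadow is drawn
(6) pctdistance is the distance of the percentage labels from the center, as a fraction of the radius (0 to 1)
(7) labeldistance is the distance of the wedge labels from the center, as a fraction of the radius
(8) startangle is the angle at which drawing starts
(9) radius is the radius of the pie, 1 by default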

import matplotlib.pyplot as plt
labels = 'Frogs' , 'Hogs', 'Dogs', 'Logs'
x = [15,30,45,10]
explode = (0,0.1,0,0)
plt.pie(x,explode=explode,labels=labels,autopct='%1.1f%%',
shadow=False, startangle=90)
plt.axis('equal')
plt.savefig("test8.png")
plt.show()

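- Scatter plot

A scatter plot builds a set of points from two series of data. The distribution of the points helps judge whether the two variables are related, what trend the relationship follows, and which points are outliers.
Suppose a crawler has collected the daily daytime high temperature in Beijing for March and October 2016 (lists a and b respectively). How can we find a pattern in how the temperature changes over time (days)?
a=[11,17,16,11,12,11,12,6,6,7,8,9,12,15,14,17,18,21,16,17,20,14,15,15,15,19,21,22,22,22,23]
b=[8,26,28,19,21,17,16,19,18,20,20,19,22,23,10,20,21,20,22,15,11,15,5,13,17,10,11,13,12,13,6]
The function plt.scatter(x, y, s, c, marker, alpha) draws a scatter plot, where
(1) x, y: the x and y coordinates of the data points to plot.
(2) s: a real number giving the marker size.
(3) c: the color.
(4) marker: the marker shape, 'o' by default.
(5) alpha: a real number giving the marker transparency.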
import matplotlib.pyplot as plt
import numpy as np
x= range(0,31,1)
y3=[11,17,16,11,12,11,12,6,6,7,8,9,12,15,14,17,18,21,16,17,20,14,15,15,15,19,21,22,22,22,23]
y10=[8,26,28,19,21,17,16,19,18,20,20,19,22,23,10,20,21,20,22,15,11,15,5,13,17,10,11,13,12,13,6]
plt.scatter(x,y3,marker='o')
plt.scatter(x,y10,marker='x')
plt.xlabel('day')
plt.ylabel('temperature')
plt.savefig("test4.png")
plt.show()

- Histogram

A histogram divides the range of numeric data into a series of intervals and counts how many values fall into each one. The intervals are consecutive and non-overlapping; they must be adjacent and are usually of equal size.
Suppose you have the running times of 250 films (list a, given in full in the code below) and want to show how they are distributed (for example, the number and frequency of films between 100 and 120 minutes long). How should the data be presented?
The function plt.hist(x, bins, color, density, histtype) draws a histogram, where:
(1) x is the data, of numeric type.
(2) bins is the number of bars.
(3) color is the color, such as "r", "g", "y", "c".
(4) density is a bool (True/False) that controls whether densities are shown instead of counts.
(5) histtype is the drawing style: "bar" for a traditional bar histogram, "barstacked" for stacked bars, "step" for an unfilled outline, "stepfilled" for a filled outline.
How many groups should the data be divided into? The number needs to be appropriate: too few groups gives a large statistical error, and too many makes the pattern hard to see. Number of groups: the total number of bins. Bin width: the distance between the two endpoints of each bin. Number of groups = (max - min) / bin width.
Given data a and a chosen bin_width, num_bins = (max(a) - min(a)) / bin_width.

import numpy as np
import matplotlib.pyplot as plt
a = np.array([131,  98, 125, 131, 124, 139, 131, 117, 128, 108, 135, 138, 131, 102, 107, 114, 119, 128, 121, 142, 127, 130, 124, 101, 110, 116, 117, 110, 128, 128, 115,  99, 136, 126, 134,  95, 138, 117, 111,78, 132, 124, 113, 150, 110, 117,  86,  95, 144, 105, 126, 130,126, 130, 126, 116, 123, 106, 112, 138, 123,  86, 101,  99, 136,123, 117, 119, 105, 137, 123, 128, 125, 104, 109, 134, 125, 127,105, 120, 107, 129, 116, 108, 132, 103, 136, 118, 102, 120, 114,105, 115, 132, 145, 119, 121, 112, 139, 125, 138, 109, 132, 134,156, 106, 117, 127, 144, 139, 139, 119, 140,  83, 110, 102,123,107, 143, 115, 136, 118, 139, 123, 112, 118, 125, 109, 119, 133,112, 114, 122, 109, 106, 123, 116, 131, 127, 115, 118, 112, 135,115, 146, 137, 116, 103, 144,  83, 123, 111, 110, 111, 100, 154,136, 100, 118, 119, 133, 134, 106, 129, 126, 110, 111, 109, 141,120, 117, 106, 149, 122, 122, 110, 118, 127, 121, 114, 125, 126,114, 140, 103, 130, 141, 117, 106, 114, 121, 114, 133, 137,  92,121, 112, 146,  97, 137, 105,  98, 117, 112,  81,  97, 139, 113,134, 106, 144, 110, 137, 137, 111, 104, 117, 100, 111, 101, 110,105, 129, 137, 112, 120, 113, 133, 112,  83,  94, 146, 133, 101,131, 116, 111,  84, 137, 115, 122, 106, 144, 109, 123, 116, 111,111, 133, 150])
a_min = np.min(a)   # named a_min/a_max to avoid shadowing the built-in min and max
a_max = np.max(a)
bin_width = 5
num_bins = int((a_max - a_min) // bin_width)
plt.hist(a, num_bins, density=False, histtype='stepfilled', color='b', alpha=0.7)
plt.title('Histogram')
plt.savefig("test6.png")
plt.show()

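- Bar chart

A bar chart uses the height of horizontal or vertical bars to show values for different categories. One axis lists the categories being compared, and the other carries the value scale. For example:
Suppose you have the top 20 films at the mainland China box office in 2017 (list a) and their box-office takings (list b, in units of 100 million). How can the data be shown more intuitively?
a = ["战狼2","速度与激情8","功夫瑜伽","西游伏妖篇","变形金刚5：最后的骑士","摔跤吧！爸爸","加勒比海盗5：死无对证","金刚：骷髅岛","极限特工：终极回归","生化危机6：终章","乘风破浪","神偷奶爸3","智取威虎山","大闹天竺","金刚狼3：殊死一战","蜘蛛侠：英雄归来","悟空传","银河护卫队2","情圣","新木乃伊"]
b = [56.01,26.94,17.53,16.49,15.45,12.96,11.8,11.61,11.28,11.12,10.49,10.3,8.75,7.55,7.32,6.99,6.88,6.86,6.58,6.23]
The function bar(x, height, width, color, align, yerr) draws a bar chart, where
(1) x: the sequence of bar positions on the x axis, usually generated with range;
(2) height: the sequence of values on the y axis, that is, the bar heights, usually the data to display;
(3) width: the bar width (0.8 by default);
(4) color: the fill color of the bars;
(5) yerr: the size of the vertical error bars drawn on the bars;
(6) alpha: the transparency of the bar fill color.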

import matplotlib.pyplot as plt
x_lable=["战狼2","速度与激情8","功夫瑜伽","西游伏妖篇","变形金刚5：最后的骑士","摔跤吧！爸爸","加勒比海盗5：死无对证","金刚：骷髅岛","极限特工：终极回归","生化危机6：终章","乘风破浪","神偷奶爸3","智取威虎山","大闹天竺","金刚狼3：殊死一战","蜘蛛侠：英雄归来","悟空传","银河护卫队2","情圣","新木乃伊"]
x = range(len(x_lable)) 
y=[56.01,26.94,17.53,16.49,15.45,12.96,11.8,11.61,11.28,11.12,10.49,10.3,8.75,7.55,7.32,6.99,6.88,6.86,6.58,6.23]
plt.bar(x,y,width=0.2)
plt.xticks(x,x_lable,fontproperties='simhei',rotation=90)  # set the tick labels and rotate the text by 90 degrees
plt.xlabel('片名',fontproperties='simhei')  # x-axis label: film title
plt.ylabel('票房',fontproperties='simhei')  # y-axis label: box office
plt.savefig("test5.png")
plt.show()

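- Box plot

A box plot is a statistical chart that shows how a set of data is spread out. It displays the median, the first quartile (Q1), the third quartile (Q3), and the outliers.
(1) Median: the value in the middle position.
(2) First quartile Q1 (lower quartile)
(3) Third quartile Q3 (upper quartile)
(4) Interquartile range IQR: IQR = Q3 - Q1;
(5) Upper whisker limit: Q3 + 1.5 IQR;
(6) Lower whisker limit: Q1 - 1.5 IQR;
(7) Outliers: points beyond the whisker limits.
A box plot can be drawn from a DataFrame with df.plot.box() or, as in the code below, with df.boxplot().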
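
The quantities in this list can be computed by hand with NumPy, which helps when reading a box plot. A minimal sketch on a made-up sample containing one obvious outlier:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])  # hypothetical sample

q1, median, q3 = np.percentile(data, [25, 50, 75])  # quartiles (linear interpolation)
iqr = q3 - q1                                       # interquartile range
lower_limit = q1 - 1.5 * iqr                        # lower whisker limit
upper_limit = q3 + 1.5 * iqr                        # upper whisker limit

# Points outside the whisker limits are the ones drawn as outliers.
outliers = data[(data < lower_limit) | (data > upper_limit)]

print(q1, median, q3, iqr)
print(lower_limit, upper_limit)
print(outliers)
```

Here the value 30 lies above Q3 + 1.5 IQR, so it is the only point that would be marked as an outlier.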

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.rand(10, 5), columns=['A', 'B', 'C', 'D', 'E'])

f = df.boxplot(sym='o',              # marker shape for outliers
               vert=True,            # draw the boxes vertically
               whis=1.5,             # whisker length as a multiple of the IQR
               patch_artist=True,    # fill the boxes
               meanline=False, showmeans=True,  # show the mean as a marker rather than a line
               showbox=True,         # show the box
               showfliers=True,      # show outliers
               notch=False,          # no notch in the box
               return_type='dict')   # return a dict of the drawn artists
plt.title('箱线图',fontproperties="simhei")
plt.savefig("test7.png")
plt.show()