Data Analysis_plt.xticks(x[::2]) - CSDN Blog

Article link: https://blog.csdn.net/weixin_42681868/article/details/83552467

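This article introduces the use of matplotlib, numpy, and pandas.
matplotlib:
Import: from matplotlib import pyplot as plt
Plotting steps:
1. Draw a line chart (plt.plot)
2. Set the figure size and resolution (plt.figure)
3. Save the figure (plt.savefig)
4. Set the tick positions and tick label strings on the x and y axes (xticks)
5. Fix ticks that are too sparse or too dense (xticks)
6. Set the title and the x/y axis labels (title, xlabel, ylabel)
7. Set the font (font_manager.FontProperties, matplotlib.rc)
8. Draw several plots on one figure (call plt.plot more than once)
9. Add a legend for the different plots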

Create the plotting area and set its size and resolution: plt.figure(figsize=(10,8),dpi=80)
Draw a line chart: plt.plot(x,y)
Set the x-axis ticks: plt.xticks(x[::2])
Font setup: from matplotlib import font_manager; my_font = font_manager.FontProperties(fname="C:\\windows\\fonts\\msyh.ttc"); plt.xticks(x[::5],_x_xticks[::5],rotation=90,fontproperties=my_font)
Axis label: plt.xlabel("时间",fontproperties=my_font) (the label text means "time")
Title: plt.title("10点至12点温度变化图",fontproperties=my_font) (the title text means "temperature change from 10:00 to 12:00")
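
The tick-thinning idea behind plt.xticks(x[::2]) can be sketched as follows. This is a minimal illustration, not the author's code: the data, the step of 5, the "t=..." labels, and the Agg backend are all assumptions chosen so that it runs anywhere without a Chinese font or a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
from matplotlib import pyplot as plt

x = list(range(1, 21))          # 20 made-up sample points
y = [v % 7 for v in x]          # made-up values

plt.figure(figsize=(10, 8), dpi=80)
plt.plot(x, y)

# Slice positions and labels with the same step so they stay paired one to one.
step = 5
tick_positions = x[::step]
tick_labels = ["t={}".format(v) for v in tick_positions]
plt.xticks(tick_positions, tick_labels, rotation=90)

locs, _ = plt.xticks()          # read back the tick positions now in effect
print(list(locs))
plt.close()
```

Slicing both sequences with the same step keeps their lengths equal, which is what xticks expects when it is given positions and strings together.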

To draw several lines on one figure, give each line its own label: plt.plot(x,a,color='r',label='这是a'); plt.plot(x,b,color='blue',label='这是b'); plt.legend(prop=my_font,loc='best') (the labels mean "this is a" and "this is b")
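
A portable sketch of the same multi-line pattern is given here. The two series and the English labels are made up, and the Agg backend is an assumption so that the code runs without a display or the msyh.ttc font.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
from matplotlib import pyplot as plt

x = range(1, 11)
a = [v * 2 for v in x]      # made-up series
b = [v * v for v in x]      # made-up series

plt.figure(figsize=(10, 8), dpi=80)
plt.plot(x, a, color='r', label='series a')
plt.plot(x, b, color='blue', label='series b')
legend = plt.legend(loc='best')   # builds the legend from the label= arguments

legend_texts = [t.get_text() for t in legend.get_texts()]
print(legend_texts)
plt.close()
```

With Chinese labels, passing prop=my_font to plt.legend, as in the line above, would be needed as well.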

Summary of the matplotlib workflow:
1: Define the question
2: Choose how to present the data
3: Prepare the data
4: Plot and refine the figure

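Summary of common matplotlib questions:
1: Which kind of chart should be used to present the data
2: plt.plot(x,y)
3: plt.bar(x,y)
4: plt.scatter(x,y)
5: plt.hist(data,bins,density) (older matplotlib releases called this parameter normed)
6: Setting xticks and yticks
7: Setting label, title, and grid
8: Figure size and saving the figure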
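
The chart types listed above can be tried side by side with a small sketch. All data here is invented for illustration, plt.subplot is used only to fit the three charts on one figure, and the Agg backend is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
from matplotlib import pyplot as plt

categories = [1, 2, 3, 4]                    # made-up x positions
heights = [5, 3, 8, 6]                       # made-up values
samples = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]     # made-up observations

plt.figure(figsize=(12, 4))

plt.subplot(1, 3, 1)
bars = plt.bar(categories, heights)          # bar chart: compare discrete groups

plt.subplot(1, 3, 2)
plt.scatter(categories, heights)             # scatter plot: relation between two variables

plt.subplot(1, 3, 3)
counts, bin_edges, _ = plt.hist(samples, bins=4)   # histogram: distribution of one variable

print(counts, bin_edges)
plt.close()
```

With bins=4 the range 1 to 9 is split into four equal-width bins, and counts holds how many samples fall in each.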

Example:

from matplotlib import pyplot as plt
from matplotlib import font_manager

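'''
# matplotlib is a Python plotting library modeled on MATLAB
temperature = [15,13,22,25,26,18,15,22,21,20,19,17] # assume these are the temperatures over one day, sampled every 2 h
x = range(2,26,2)
plt.figure(figsize=(18,8))  # set the figure size
plt.plot(x,temperature)   # draw the line chart
plt.xticks(range(2,25,1))               # set the x-axis ticks

# plt.savefig('temp.png')    # save the figure; a file name such as ./sig_size.svg saves a vector image (no jagged edges when zoomed in)
plt.show()                  # display the line chart
'''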


import random

# a holds the temperature for each minute from 10:00 to 12:00
a = [random.randint(20,35) for i in range(120)]
x = range(1,121)
plt.figure(figsize=(15,8))
plt.plot(x,a)
# font setup (needed to display Chinese text)
my_font = font_manager.FontProperties(fname="C:/Windows/Fonts/msyh.ttc")   # locate the font file on your own machine first
# adjust the x-axis ticks
_xtick_labels = ["10:{}".format(i) for i in range(60)]
_xtick_labels += ["11:{}".format(i) for i in range(60)]
_x = list(x)[::3]
_xtick_labels = _xtick_labels[::3]
plt.xticks(_x,_xtick_labels,rotation=270,fontproperties=my_font) # when positions and strings are both given they are paired one to one (the two sequences should have the same length)
# add descriptive text
plt.xlabel("时间",fontproperties=my_font)      # x-axis label ("time")
plt.ylabel("温度",fontproperties=my_font)      # y-axis label ("temperature")
plt.title("10点到12点每分钟温度变化的情况",fontproperties=my_font)  # chart title ("temperature per minute from 10:00 to 12:00")

plt.show()



'''
# Suppose a and b are yearly counts for two people and x is age; draw line charts to compare the trends
a = [1,0,1,1,2,4,3,2,3,4,4,5,6,5,4,3,3,1,1,1]
b = [1,0,3,1,2,2,3,3,2,1,2,1,1,1,1,1,1,1,1,1]
x = range(11,31)
plt.figure(figsize=(15,8))
plt.plot(x,a,label='我')                  # label means "me"
plt.plot(x,b,label='friend',color='red')
# font setup
my_font = font_manager.FontProperties(fname="C:/Windows/Fonts/msyh.ttc",size=14)
# adjust the x-axis ticks
_x = ['{}岁'.format(i) for i in range(11,31)]   # tick text means "<age> years old"
plt.xticks(x,_x,rotation=270,fontproperties=my_font)

# set the x/y axis labels and the title
plt.xlabel("年龄",fontproperties=my_font)       # "age"
plt.ylabel("count",fontproperties=my_font)
plt.title("Count by age",fontproperties=my_font)

plt.grid(alpha=0.4) # draw a grid; alpha is the transparency
# add the legend
plt.legend(prop=my_font,loc='upper left') # legend is the only call that takes the font through prop
# plt.savefig('count.png')
plt.show()
'''

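numpy
The NumPy array is a multi-dimensional array object (a matrix) called ndarray. It supports vectorized arithmetic and flexible broadcasting, and it is fast and memory-efficient.
Attributes of ndarray:
1: ndim: the number of dimensions
2: shape: the size of each dimension
3: dtype: the data type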

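Import: import numpy as np
Generate a 3×4 array of random floats in [0, 1): arr = np.random.rand(3, 4)
Generate a 3×4 array of random integers in [-1, 5): arr = np.random.randint(-1, 5, size = (3, 4))
Generate random multi-dimensional float data of a given shape (3 rows, 4 columns), drawn uniformly from [-1, 5): arr = np.random.uniform(-1, 5, size = (3, 4))
An all-zeros array of a given size: zeros_arr = np.zeros((3, 4)) (the first argument is a tuple giving the shape)
An all-ones array of a given size: ones_arr = np.ones((2, 3))
np.empty() allocates an array without initializing its values, so it does not always return all zeros
Create a one-dimensional array: arr = np.arange(15)
Reshape: print(arr.reshape(3, 5))
np.random.shuffle(arr) shuffles the array in place (like shuffling a deck of cards)
The dtype parameter (used when creating an array to specify its data type)
Convert the data type: zeros_float_arr.astype(np.int32)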
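
The creation calls above can be checked together in one short sketch. The variable names follow the text where possible; shuffled and zeros_int_arr are names introduced here for illustration.

```python
import numpy as np

rand_arr = np.random.rand(3, 4)                      # floats in [0, 1)
int_arr = np.random.randint(-1, 5, size=(3, 4))      # integers in [-1, 5)
uniform_arr = np.random.uniform(-1, 5, size=(3, 4))  # floats in [-1, 5)

zeros_arr = np.zeros((3, 4))                         # float64 by default
ones_arr = np.ones((2, 3), dtype=np.int32)           # dtype chosen at creation time

arr = np.arange(15)
reshaped = arr.reshape(3, 5)                         # same 15 values, viewed as 3 rows x 5 columns

shuffled = np.arange(15)
np.random.shuffle(shuffled)                          # shuffles in place and returns None

zeros_int_arr = zeros_arr.astype(np.int32)           # astype returns a converted copy

print(reshaped.ndim, reshaped.shape, reshaped.dtype)
print(zeros_arr.dtype, zeros_int_arr.dtype)
```

Note that astype leaves zeros_arr itself unchanged; only the returned array has the new dtype.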

ndarray matrix operations
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("Element-wise multiplication:")
print(arr * arr)
print("Matrix addition:")
print(arr + arr)
Result:
Element-wise multiplication:
[[ 1  4  9]
 [16 25 36]]
Matrix addition:
[[ 2  4  6]
 [ 8 10 12]]

Array and scalar operations
print(1. / arr)
print(2. * arr)
Result:
[[ 1. 0.5 0.33333333]
 [ 0.25 0.2 0.16666667]]
[[ 2. 4. 6.]
 [ 8. 10. 12.]]
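
The scalar case is the simplest form of the broadcasting mentioned in the numpy introduction. As a small sketch, the same arr can also be combined with a row or a column; row and col are made-up arrays introduced for illustration.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])             # shape (3,): stretched across both rows
col = np.array([[100], [200]])           # shape (2, 1): stretched across all columns

plus_row = arr + row
plus_col = arr + col

print(plus_row)
print(plus_col)
```

Broadcasting works here because each pair of trailing dimensions is either equal or has size 1.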

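Axis: in numpy an axis can be thought of as a direction, numbered 0, 1, 2, and so on. In np.arange(0,10).reshape((2,5)), the 2 passed to reshape means axis 0 has length 2 (the number of rows of data) and axis 1 has length 5.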
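
A quick way to see what an axis means is to reduce along it. The sketch below uses the same 2×5 array; the names t, along_axis0, and along_axis1 are introduced for illustration.

```python
import numpy as np

t = np.arange(0, 10).reshape((2, 5))
# t is:
# [[0 1 2 3 4]
#  [5 6 7 8 9]]

along_axis0 = t.sum(axis=0)   # collapses axis 0 (length 2): one sum per column
along_axis1 = t.sum(axis=1)   # collapses axis 1 (length 5): one sum per row

print(t.shape, along_axis0.shape, along_axis1.shape)
print(along_axis0)
print(along_axis1)
```

The axis named in the call is the one that disappears from the result's shape.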

Indexing and slicing
Indexing and slicing a one-dimensional array:
This works the same way as for a list
arr1 = np.arange(10)
print(arr1)
print(arr1[2:5])

Indexing and slicing a multi-dimensional array:

arr[r1:r2, c1:c2]
arr[1,1] is equivalent to arr[1][1]
[:] selects all the data along a dimension

For example:
arr2 = np.arange(12).reshape(3,4)
print(arr2)
print(arr2[1])
print(arr2[0:2, 2:])
print(arr2[:, 1:3])
Result:
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]

[4 5 6 7]

[[2 3]
 [6 7]]

[[ 1  2]
 [ 5  6]
 [ 9 10]]

Conditional (boolean) indexing:
data_arr = np.random.rand(3,3)
print(data_arr)
year_arr = np.array([[2000, 2001, 2000],
[2005, 2002, 2009],
[2001, 2003, 2010]])
is_year_after_2005 = year_arr >= 2005
print(is_year_after_2005, is_year_after_2005.dtype)

filtered_arr = data_arr[is_year_after_2005]
print(filtered_arr)

#filtered_arr = data_arr[year_arr >= 2005]
#print(filtered_arr)

# multiple conditions
filtered_arr = data_arr[(year_arr <= 2005) & (year_arr % 2 == 0)]
print(filtered_arr)
Result:
[[ 0.53514038 0.93893429 0.1087513 ]
 [ 0.32076215 0.39820313 0.89765765]
 [ 0.6572177 0.71284822 0.15108756]]

[[False False False]
 [ True False  True]
 [False False  True]] bool

[ 0.32076215 0.89765765 0.15108756]

#[ 0.32076215 0.89765765 0.15108756]

[ 0.53514038 0.1087513 0.39820313]

Transposing an ndarray: arr = np.random.rand(2,3) # 2x3 array; print(arr); print(arr.transpose()) # becomes a 3x2 array

Reading local data with numpy: np.loadtxt(fname,dtype=float,delimiter=None,skiprows=0,usecols=None,unpack=False)
numpy is generally not used to read data, because reading data with pandas is more convenient.
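
For completeness, here is a minimal sketch of np.loadtxt. It reads from an in-memory string instead of a real file so that it is self-contained; the column names and numbers are invented for illustration.

```python
import io
import numpy as np

# Made-up CSV content standing in for a local file.
csv_text = "id,views,likes\n1,100,10\n2,250,30\n3,80,5\n"

# skiprows=1 skips the header line; delimiter="," splits on commas.
data = np.loadtxt(io.StringIO(csv_text), dtype=int, delimiter=",", skiprows=1)

# usecols picks columns 1 and 2; unpack=True returns one array per column.
views, likes = np.loadtxt(io.StringIO(csv_text), dtype=int, delimiter=",",
                          skiprows=1, usecols=(1, 2), unpack=True)

print(data)
print(views, likes)
```

With a real file, the first argument would simply be the file path.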

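Element-wise functions:
ceil(): the nearest integer at or above the value; accepts a number or an array
floor(): the nearest integer at or below the value; accepts a number or an array
rint(): rounds to the nearest integer (exact halves go to the nearest even integer); accepts a number or an array
isnan(): tests whether each element is NaN (Not a Number); accepts a number or an array
multiply(): element-wise multiplication; accepts numbers or arrays
divide(): element-wise division; accepts numbers or arrays
abs(): the absolute value of each element; accepts a number or an array
where(condition, x, y): an element-wise ternary operator, x if condition else y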
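
A short sketch of these functions on one made-up array follows; the values were chosen to include negatives and exact halves.

```python
import numpy as np

arr = np.array([-1.5, -0.2, 0.5, 1.5, 2.7])   # made-up sample values

ceiled = np.ceil(arr)          # round up
floored = np.floor(arr)        # round down
rounded = np.rint(arr)         # nearest integer, halves go to the even neighbour
absolute = np.abs(arr)
signs = np.where(arr > 0, 1, -1)               # 1 where positive, otherwise -1
nan_mask = np.isnan(np.array([1.0, np.nan]))
quotient = np.divide(np.array([4.0, 9.0]), np.array([2.0, 3.0]))

print(ceiled, floored, rounded)
print(signs, nan_mask, quotient)
```

The rint results for -1.5, 0.5, and 1.5 show the round-half-to-even behaviour.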

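Statistical functions:

np.mean(), np.sum(): the mean and the sum of all elements; accept a number or an array
np.max(), np.min(): the maximum and the minimum of all elements; accept a number or an array
np.std(), np.var(): the standard deviation and the variance of all elements; accept a number or an array
np.argmax(), np.argmin(): the index of the maximum and the index of the minimum; accept a number or an array
np.cumsum(), np.cumprod(): return an array in which each element is the running sum or running product up to and including that position (a one-dimensional array when no axis is given); accept a number or an array
For multi-dimensional arrays the statistics cover all dimensions by default. The axis parameter computes them along one axis instead: 0 gives one result per column, 1 gives one result per row.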
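
The effect of the axis parameter on these functions can be seen on a small array. This is an illustrative sketch using a 3×4 array of the integers 0 to 11.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

total = np.sum(arr)                  # all elements
average = np.mean(arr)               # all elements
col_max = np.max(arr, axis=0)        # one maximum per column
row_sum = np.sum(arr, axis=1)        # one sum per row
flat_argmax = np.argmax(arr)         # index into the flattened array
running = np.cumsum(arr)             # flattened running sum, 12 elements
running_rows = np.cumsum(arr, axis=1)  # running sum within each row

print(total, average, flat_argmax)
print(col_max, row_sum)
print(running_rows)
```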

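Element test functions:
np.any(): returns True if at least one element satisfies the condition
np.all(): returns True if every element satisfies the condition
For example: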
arr = np.random.randn(2,3)
print(arr)
print(np.any(arr > 0))
print(np.all(arr > 0))
Result:
[[ 0.05075769 -1.31919688 -1.80636984]
 [-1.29317016 -1.3336612 -0.19316432]]
True
False

Deduplicating and sorting elements:
For example:
arr = np.array([[1, 2, 1], [2, 3, 4]])
print(arr)
print(np.unique(arr))
Result:
[[1 2 1]
 [2 3 4]]
[1 2 3 4]


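pandas
Import: import pandas as pd
Pandas has two main, and most important, data structures: Series and DataFrame

Series:
A Series is an object similar to a one-dimensional array. It consists of a set of data (of any NumPy data type) and a matching set of index values (data labels).
It behaves like a one-dimensional array
It is made up of data and an index
When printed, the index is on the left and the values are on the right
The index is created automatically when none is given

Build a Series from a list-like sequence: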
ser_obj = pd.Series(range(10, 20))
print(ser_obj)
print(ser_obj.head(3))

Get the data and the index:
print(ser_obj.values)
print(ser_obj.index)

Get data by index:
print(ser_obj[0])
print(ser_obj[8])

Build a Series from a dict:
year_data = {2001: 17.8, 2002: 20.1, 2003: 16.5}
ser_obj2 = pd.Series(year_data)
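
When a Series is built from a dict, the dict keys become the index. The sketch below repeats the year_data example and looks values up by label and by position; using .loc and .iloc here is a choice made for clarity, not something the original text prescribes.

```python
import pandas as pd

year_data = {2001: 17.8, 2002: 20.1, 2003: 16.5}
ser_obj2 = pd.Series(year_data)

index_labels = ser_obj2.index.tolist()   # the dict keys
by_label = ser_obj2.loc[2002]            # look up by index label
by_position = ser_obj2.iloc[0]           # look up by position

print(ser_obj2)
print(index_labels, by_label, by_position)
```

Spelling out .loc and .iloc avoids ambiguity when the index labels are themselves integers, as they are here.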

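DataFrame

A DataFrame is a tabular data structure. It holds an ordered set of columns, and each column can have a different value type. A DataFrame has both a row index and a column index. It can be viewed as a dict of Series that share the same index, with the data stored in a two-dimensional structure.

Build a DataFrame from an ndarray: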
array = np.random.randn(5,4)
df_obj = pd.DataFrame(array)

Build a DataFrame from a dict:
dict_data = {'A': 1,
             'B': pd.Timestamp('20170426'),
             'C': pd.Series(1, index=list(range(4)), dtype='float32'),
             'D': np.array([3] * 4, dtype='int32'),
             'E': ["Python", "Java", "C++", "C"],
             'F': 'ITCast'}
# print(dict_data)
df_obj2 = pd.DataFrame(dict_data)
print(df_obj2)
Result:
   A          B    C  D       E       F
0  1 2017-04-26  1.0  3  Python  ITCast
1  1 2017-04-26  1.0  3    Java  ITCast
2  1 2017-04-26  1.0  3     C++  ITCast
3  1 2017-04-26  1.0  3       C  ITCast

Get a column by its column label:
print(df_obj2['A'])
print(type(df_obj2['A']))
print(df_obj2.A)
Result:
0    1
1    1
2    1
3    1
Name: A, dtype: int64
<class 'pandas.core.series.Series'>
0    1
1    1
2    1
3    1
Name: A, dtype: int64

Add a column:
df_obj2['G'] = df_obj2['D'] + 4
print(df_obj2.head())
Result:
   A          B    C  D       E       F  G
0  1 2017-04-26  1.0  3  Python  ITCast  7
1  1 2017-04-26  1.0  3    Java  ITCast  7
2  1 2017-04-26  1.0  3     C++  ITCast  7
3  1 2017-04-26  1.0  3       C  ITCast  7

Delete a column:
del df_obj2['G']
print(df_obj2.head())
Result:
   A          B    C  D       E       F
0  1 2017-04-26  1.0  3  Python  ITCast
1  1 2017-04-26  1.0  3    Java  ITCast
2  1 2017-04-26  1.0  3     C++  ITCast
3  1 2017-04-26  1.0  3       C  ITCast
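
As a side note on deleting columns, del changes the DataFrame in place, while DataFrame.drop returns a new one. The sketch below uses a cut-down, made-up DataFrame with only the D and E columns from the example; df, without_g, and the other names are introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({'D': [3, 3, 3, 3],
                   'E': ["Python", "Java", "C++", "C"]})

df['G'] = df['D'] + 4                    # add a column computed from another one
cols_after_add = list(df.columns)

without_g = df.drop(columns=['G'])       # returns a new DataFrame; df keeps G
cols_before_del = list(df.columns)

del df['G']                              # removes G from df itself
cols_after_del = list(df.columns)

print(cols_after_add, list(without_g.columns), cols_after_del)
```

drop is handy when the original DataFrame should stay untouched.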

Handling missing and null values: