Data Analysis

weixin_30836759

Tags: python

Original link: http://www.cnblogs.com/Chinesehan/p/11557811.html

numpy

import numpy as np

Vector operations
    shop_price = [30, 20, 15, 40]
    shop_num = [2, 3, 1, 4]
    np_shop_price = np.array(shop_price)
    np_shop_num = np.array(shop_num)
    np.sum(np_shop_price * np_shop_num)  # 295: element-wise product, then sum

Arithmetic
    b1 = [1,2,3,4,]
    b1+2 # TypeError: a list cannot be added to an int
    b1*3 # [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4] (the list is repeated, not multiplied)

    b2 = np.array([1,2,3,4])
    b2+3 # array([4, 5, 6, 7])
    b2*3 # array([ 3,  6,  9, 12])

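Common attributes
    T      transpose of the array (only has an effect on arrays with 2 or more dimensions)
    dtype  data type of the array elements
    size   total number of elements
    ndim   number of dimensions (axes) of the array, not the number of rows
    shape  size of each dimension as a tuple; for a 2-D array this is (rows, columns)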
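
As a quick check of the attributes listed above, here is a minimal sketch on a small 2 x 3 array. The array `arr` is made up for illustration and is not part of the original notes.

```python
import numpy as np

# A small 2 x 3 array used only for illustration
arr = np.array([[1, 2, 3],
                [4, 5, 6]])

print(arr.T.shape)  # (3, 2): transposing swaps the two axes
print(arr.dtype)    # an integer dtype; the exact width depends on the platform
print(arr.size)     # 6 elements in total
print(arr.ndim)     # 2 axes
print(arr.shape)    # (2, 3): 2 rows, 3 columns
```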

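    ndarray: the multidimensional array object
    dtype:
        bool_, int(8,16,32,64), float(16,32,64)
    Type conversion: astype()
    Creating an ndarray:
    array()     converts a list to an array; dtype can be given explicitly
    linspace()  similar to arange(), but the third argument is the number of elements
    zeros()     creates an all-zero array with the given shape and dtype
    ones()      creates an all-one array with the given shape and dtype
    reshape()   gives the array a new shape without changing its data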
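
The creation and conversion functions above can be tried together. This is only a sketch; the variable names are illustrative and not from the original notes.

```python
import numpy as np

ints = np.array([1, 2, 3], dtype=np.int32)  # list -> array with an explicit dtype
floats = ints.astype(np.float64)            # astype returns a new, converted array
zeros = np.zeros((2, 3))                    # float64 unless a dtype is given
ones = np.ones((2, 3), dtype=np.int8)       # all-one array with a chosen dtype
grid = np.arange(6).reshape(2, 3)           # same six values, arranged as 2 x 3

print(floats.dtype, zeros.shape, ones.dtype)
print(grid)
```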

Evenly spaced values
    Both ends included, [start, stop]; num is the number of samples (default 50)
        np.linspace(start,stop,num=50,endpoint=True,retstep=False,dtype=None,axis=0,)
    Start included, stop excluded, [start, stop)
        np.linspace(1,8,num=3,endpoint=False,retstep=False,dtype=None,axis=0,)
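
To see what `endpoint` and `retstep` actually change, the following sketch compares the closed and half-open cases on the same range. The variable names are illustrative.

```python
import numpy as np

closed = np.linspace(1, 8, num=3)                        # endpoint=True by default
half_open = np.linspace(1, 8, num=3, endpoint=False)     # stop itself is left out
samples, step = np.linspace(0, 1, num=5, retstep=True)   # also return the spacing

print(closed)      # 1, 4.5 and 8
print(half_open)   # 1, then steps of 7/3, never reaching 8
print(step)        # 0.25
```

With `endpoint=False` the interval is split into `num` equal steps and only the left edge of each step is returned, which is why 8 never appears.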

Reshaping the elements into a matrix
    a6 = np.arange(10)
    a6.reshape(2,5) # (rows, columns); rows * columns must equal the number of elements

Indexing
    b1 = np.arange(10) # array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    b1[3] # 3

    b2 = np.array([[1,2,3,4,5],[10,20,30,40,50]])
    b2[0,3] # 4   [row index, column index]
    b2[1,2] # 30  [row index, column index]

Slicing
    [start:stop:step]
    # The syntax is the same as basic Python list slicing
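
A short sketch of the slice syntax on 1-D and 2-D arrays. It also shows one difference from lists that is easy to miss: a basic NumPy slice is a view onto the original data, while a list slice is a copy. All names here are illustrative.

```python
import numpy as np

a = np.arange(10)
first_three = a[:3]    # elements 0, 1, 2
evens = a[::2]         # every second element
rev = a[::-1]          # reversed order

m = np.arange(10).reshape(2, 5)
block = m[:, 1:4]      # all rows, columns 1 to 3

# A NumPy slice is a view: writing to it changes the source array
base = np.arange(5)
view = base[1:3]
view[0] = 99           # base[1] is now 99 as well

# A list slice is a copy: the source list is untouched
lst = [0, 1, 2, 3, 4]
part = lst[1:3]
part[0] = 99

print(base, lst)
```

Calling `.copy()` on the slice gives an independent array when the view behaviour is not wanted.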

Boolean indexing
    b1 = np.arange(10) # array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    b1>4 # array([False, False, False, False, False, True, True, True, True, True])
    b1[b1>4] # array([5, 6, 7, 8, 9])

Fancy indexing
    b1 = np.arange(10) # array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    b1[[2,3,5]] # array([2, 3, 5])

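Universal functions
Unary functions:
    abs: absolute value
    sqrt: square root
    exp: exponential function, base e
    log: natural logarithm (base e)
    ceil: round up to the nearest integer
    floor: round down to the nearest integer
    rint: round to the nearest integer (exact halves go to the nearest even integer)
    trunc: drop the fractional part
        >>> a = np.array([-1.7, -1.5, -0.2, 0.2, 1.5, 1.7, 2.0])
        >>> np.trunc(a)
        array([-1., -1., -0.,  0.,  1.,  1.,  2.])
    modf: split into fractional and integer parts
        >>> np.modf([2.8,3.6])
        (array([0.8, 0.6]), array([2., 3.]))
    isnan: tests whether a value is NaN (NaN: not a number)
    isinf: tests whether a value is infinite (inf: infinity)
    cos, sin, tan: trigonometric functions
Binary functions:
    add,
    subtract,
    multiply,
    divide,
    power,
    mod,
    maximum,
    minimum,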
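
The four rounding functions are easy to mix up, so here is a small side-by-side sketch on the same input, followed by a few of the binary functions. The sample values are made up for illustration.

```python
import numpy as np

a = np.array([-1.5, -0.2, 0.5, 1.5, 2.5])

up = np.ceil(a)        # toward +infinity:      -1, -0, 1, 2, 3
down = np.floor(a)     # toward -infinity:      -2, -1, 0, 1, 2
nearest = np.rint(a)   # halves to even:        -2, -0, 0, 2, 2
cut = np.trunc(a)      # toward zero:           -1, -0, 0, 1, 2

is_nan = np.isnan(np.array([1.0, np.nan]))   # True only for the NaN entry
bigger = np.maximum([1, 5, 3], [4, 2, 6])    # element-wise maximum
rem = np.mod([7, -7], 3)                     # result takes the sign of the divisor

print(up, down, nearest, cut)
print(is_nan, bigger, rem)
```

Note that `rint` sends both 1.5 and 2.5 to 2, so it is not the "round half up" rule taught in school.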

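Mathematical and statistical methods
Common functions
    sum     sum
    mean    arithmetic mean
    std     standard deviation
    var     variance
    min     minimum
    max     maximum
    argmin  index of the minimum
    argmax  index of the maximum
    sort    sorting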
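
A minimal sketch of these methods on a 2-D array, including the `axis` argument, which the list above does not mention. The `scores` array is made up for illustration.

```python
import numpy as np

scores = np.array([[3, 1, 2],
                   [9, 7, 8]])

total = scores.sum()                    # 30, over all elements
col_mean = scores.mean(axis=0)          # one mean per column: 6, 4, 5
row_max = scores.max(axis=1)            # one maximum per row: 3, 9
flat_argmax = scores.argmax()           # 3, position in the flattened array
variance = scores.var()                 # population variance (divides by N)
spread = scores.std()                   # square root of the variance above
sorted_rows = np.sort(scores, axis=1)   # sorted copy; scores is unchanged

print(total, col_mean, row_max, flat_argmax)
print(sorted_rows)
```

By default `var` and `std` divide by N; passing `ddof=1` gives the sample versions that divide by N - 1.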

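Random number generation
    Common np.random functions
    rand: random floats in [0, 1); pass the dimensions, e.g. (rows, columns), as arguments
    np.random.rand(5)
    array([0.431528 , 0.78256694, 0.57882107, 0.62501602, 0.5008957 ])
    np.random.rand(2,5)
    array([[0.53423693, 0.64314084, 0.09177743, 0.9410606 , 0.33872707],
           [0.72611849, 0.15508284, 0.00941139, 0.19386368, 0.50045678]])
    randint: a random integer, or an array of random integers, in a given range
    np.random.randint(5)              a single integer from 0 to 4
    np.random.randint(5, size=(4))    a 1-D array of four integers
    np.random.randint(5, size=(2,4))  a 2-D array of shape (2, 4), eight integers in total
    choice: random samples drawn from a given array, with the given output shape
    shuffle: shuffles a sequence in place, like random.shuffle
    uniform: random floats drawn uniformly from [low, high), with the given output shape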
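
The notes give no examples for choice, shuffle and uniform, so here is a hedged sketch. The seed value and all variable names are arbitrary choices for illustration; the printed values depend on the seed.

```python
import numpy as np

np.random.seed(0)  # fixed seed so repeated runs give the same numbers

picked = np.random.choice([10, 20, 30], size=(2, 2))  # sampled with replacement
deck = np.arange(5)
np.random.shuffle(deck)                               # reorders deck in place
u = np.random.uniform(1.0, 3.0, size=(2, 3))          # floats in [1.0, 3.0)
ints = np.random.randint(5, size=(2, 4))              # 2 x 4 integers from 0 to 4

print(picked)
print(deck)
print(u)
print(ints.shape, ints.ndim)
```

`shuffle` returns None and modifies its argument, so assigning its result to a variable is a common mistake.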

Pandas

import pandas as pd

Creating a Series
    Series with the default index
        s1 = pd.Series([1,2,3,4])
            Select one item   >>>s1[1] # 2
            Select several    >>>s1[[3,2,...]]

    Series with a custom index
        s2 = pd.Series([1,2,3,4], index=['a', 'b', 'c', 'd'])
        s2 = pd.Series({"a":1,"b":2,"c":3,"d":4})
            Select one item   >>>s2["b"] # 2
            Select several    >>>s2[["b","c",...]]

    Series with a custom index and the same value everywhere
        s4 = pd.Series(10,index=['a', 'b', 'c', 'd'])

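Arithmetic between Series
    Values are combined only where the index labels match
    Each label present in both Series produces one result (the values must have compatible types)
    A label that has no match in the other Series gives NaN in the result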
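
The alignment rule above can be seen on two Series that share only some labels. This is a sketch with made-up data; `fill_value` is shown as one way to avoid the NaN results.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

total = left + right                     # a: NaN, b: 12, c: 23, d: NaN
filled = left.add(right, fill_value=0)   # a missing side counts as 0

print(total)
print(filled)
```

The result index is the union of both indexes, and the values become floats because NaN cannot be stored in an integer Series.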

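Selecting the values of s1 greater than 2 (boolean indexing)
    s1>2      True where the value is greater than 2, False otherwise
    s1[s1>2]  the values that are greater than 2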

Checking whether a label is in the index
    "b" in s2
    Returns True if the label is present, False if it is not

Dropping NaN values from a Series
    s3 = pd.Series([1,2,3])
    sr = s1+s3  # index 3 exists only in s1, so sr[3] is NaN
    sr.dropna(inplace=True)

sr.isnull()
    Values of sr that are NaN (numpy.nan) are mapped to True.
    Everything else is mapped to False.
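
Putting the NaN handling together: the sketch below rebuilds `s1` and `s3` from the notes, then shows `isnull`, `dropna` without `inplace`, and `fillna` as an alternative to dropping. The names `mask`, `clean` and `filled` are illustrative.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4])
s3 = pd.Series([1, 2, 3])
sr = s1 + s3           # index 3 has no partner in s3, so it becomes NaN

mask = sr.isnull()     # True only where the value is NaN
clean = sr.dropna()    # returns a new Series; sr itself keeps the NaN
filled = sr.fillna(0)  # replace NaN with 0 instead of dropping it

print(mask.tolist())
print(clean.tolist())
print(filled.tolist())
```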

Creating a DataFrame
    s5 = pd.DataFrame({'one':[1,2,3,4],'two':[2,3,4,5,],'three':[3,4,5,6]})
    s5['one'][1]        # 2: the first key is the column label, the second is the row label
    s5[['one','two']]   # selects two columns
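
For reading a single cell, `loc` with a row label and a column label is a common alternative to the chained `s5['one'][1]` form. A minimal sketch on the same data (the name `df` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4],
                   'two': [2, 3, 4, 5],
                   'three': [3, 4, 5, 6]})

cell = df.loc[1, 'one']     # row label 1, column 'one'
col = df['one']             # a single column is a Series
cols = df[['one', 'two']]   # a list of columns gives a DataFrame

print(cell)
print(type(col).__name__, cols.shape)
```

Note the order: `df.loc[row, column]` puts the row first, the opposite of the chained form.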

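Example 1:

Loading data (local file):
    movies = pd.read_csv('./douban_movie.csv.csv') # if loading fails, either the path is wrong or the file is not in CSV format
Column names
    movies.columns

Row n
    movies.loc[n]    # n is an index label (like a key)

    movies.iloc[n]   # n is an integer position (the n-th row, counting from 0)

    movies.ix[n]     # mix of loc and iloc; deprecated and removed in pandas 1.0, use loc or iloc instead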
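
With the default RangeIndex, `loc[n]` and `iloc[n]` return the same row, which hides the difference. The sketch below uses a made-up frame with non-default labels to separate the two.

```python
import pandas as pd

# Hypothetical data; the index labels 10, 20, 30 are chosen to differ from positions
df = pd.DataFrame({'score': [9.1, 8.7, 7.5]}, index=[10, 20, 30])

by_label = df.loc[20]      # the row whose index label is 20
by_position = df.iloc[0]   # the first row, whatever its label is

print(by_label['score'])     # 8.7
print(by_position['score'])  # 9.1
print(by_position.name)      # 10, the label of that first row
```

Here `df.loc[0]` would raise a KeyError because no row carries the label 0.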

Total number of rows:
    movies.index #RangeIndex(start=0, stop=38735, step=1)

First n rows (5 if n is omitted)
    movies.head(n)

Data without the column names
    movies.values

Summary statistics
    movies.describe()

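Example 2:
Loading data (from the web): HTML tables
    res = pd.read_html('URL')   # returns a list of DataFrames, one per table found on the page
    res = pd.read_html('https://baike.baidu.com/item/NBA%E6%80%BB%E5%86%A0%E5%86%9B/2173192?fr=aladdin')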

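When the column names sit in the first row, assign that row to columns
    # nbachampions is presumably one of the tables in res (its definition is not shown in these notes)
    nbachampions.columns = nbachampions.loc[0]

Deleting row n
    nbachampions.drop([n], inplace=True)  # modifies the original DataFrame
    nbachampions.drop([n], inplace=False) # returns a new DataFrame; the original is unchanged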
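
The two steps above usually go together: promote the first row to the header, then drop it from the data. A sketch on a small made-up table (the `raw` frame and its values are hypothetical, not the real NBA data):

```python
import pandas as pd

# Hypothetical scraped table whose real header ended up in row 0
raw = pd.DataFrame([['year', 'champion'],
                    ['2001', 'A'],
                    ['2002', 'B']])

table = raw.copy()
table.columns = table.loc[0]                     # use the first row as column names
table = table.drop([0]).reset_index(drop=True)   # remove that row, renumber from 0

print(table)
```

`reset_index(drop=True)` is optional; without it the remaining rows keep their old labels 1 and 2.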

Grouping
    nbachampions.groupby('冠军').groups   # '冠军' is the champion column

Group sizes (number of rows per group, not a sum)
    nbachampions.groupby('冠军').size()

Sorting (ascending by default)
    nbachampions.groupby('冠军').size().sort_values()
    nbachampions.groupby('冠军').size().sort_values(ascending=True)
Descending
    nbachampions.groupby('冠军').size().sort_values(ascending=False)
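
The count-then-sort pattern can be tried without the scraped page. The sketch below uses a made-up `games` frame with an English column name; `value_counts` is shown as a shorter route to the same counts.

```python
import pandas as pd

# Hypothetical data standing in for the champions table
games = pd.DataFrame({'champion': ['A', 'B', 'A', 'C', 'A', 'B']})

counts = games.groupby('champion').size()      # rows per group
ranked = counts.sort_values(ascending=False)   # most frequent first
same = games['champion'].value_counts()        # counts, sorted descending by default

print(ranked)
```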

Matplotlib

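import pandas as pd
import matplotlib.pyplot as plt

Font settings for Chinese text (Windows)
    plt.rcParams['font.sans-serif'] = ['SimHei']   # a font that contains Chinese glyphs
    plt.rcParams['axes.unicode_minus'] = False     # render the minus sign correctly with this font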

Line chart (plot)
    Data
        x = [1,2,3,4,5]
        y = [9,6,8,5,7]
    Plotting
        plt.figure(figsize=(16,9)) # figure size
        plt.title('测试',size=20,color='blue') # figure title ("Test")
        plt.xlabel('X轴',size=20,color='red') # X-axis label ("X axis")
        plt.ylabel('Y轴',size=20,color='red') # Y-axis label ("Y axis")

        # plt.plot(x, y,color='red',linestyle='--',marker='o')
        # sets the line color, line style and marker shape
        plt.plot(x,y,'r--o') # shorthand for the call above
        plt.show()

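Bar chart (bar)
    Data (from the web or a local file)
        mov = pd.read_csv('./douban_movie.csv.csv')
        mov = pd.read_csv('URL')
    Process the data into the shape you need
        info = mov.groupby('产地').size().sort_values()   # '产地': country or region of origin
        x = info.index
        y = info.values
    Plotting
        plt.figure(figsize=(15,6))
        plt.title('每个国家或者地区的电影数量',size=20,color='blue')   # "Number of movies per country or region"
        plt.xlabel('产地',size=20,color='red')   # "Origin"
        plt.ylabel('数量',size=20,color='red')   # "Count"
        # Rotation, size and color of the tick labels on the X and Y axes
        plt.xticks(rotation=45,size=15,color='blue')
        plt.yticks(rotation=0,size=15,color='blue')

        for a,b in zip(x,y): # iterate over the (x, y) pairs, unpacking each into a, b
            plt.text(a,b+120,b,ha='center',size=15,color='red')
            # a: x position
            # b+120: y position (slightly above the bar)
            # b: the value to display
            # ha: horizontal alignment; 'center' centers the label over each bar

        # plt.plot(x, y,color='red',linestyle='--',marker='o')
        plt.bar(x, y)
        plt.show()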

Pie chart (pie)
    Data (from the web or a local file)
        mov = pd.read_csv('./douban_movie.csv.csv')
        mov = pd.read_csv('URL')
    Process the data
        mov_time = mov['时长']   # '时长': running time
        res = pd.cut(mov_time,[0,30,60,90,110,130,1000]).value_counts()
        x = res.index
        y = res.values
    Plotting
        plt.figure(figsize=(10,6))
        plt.title('饼图',size=20,color='red')   # "Pie chart"
        # plt.pie(y,labels=x,autopct='%.2f%%' )
        # autopct='%.2f%%': format string used to show each wedge's percentage

        patchs,n_text,p_text = plt.pie(y,labels=x,autopct='%.2f%%' )
        # unpack the three return values: wedges, label texts, percentage texts
        for i in n_text:
            i.set_size(13)
            i.set_color("red") # style of the wedge labels
        for i in p_text:
            i.set_size(16)
            i.set_color("white") # style of the percentage labels
        plt.show()
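
The `pd.cut` step that feeds the pie chart is worth trying on its own. The sketch below bins a few made-up durations with the same bin edges; the `durations` values are hypothetical and stand in for the running-time column.

```python
import pandas as pd

durations = pd.Series([25, 45, 95, 100, 120, 200])  # made-up running times in minutes
bins = [0, 30, 60, 90, 110, 130, 1000]

binned = pd.cut(durations, bins)   # each value is mapped to an interval such as (90, 110]
counts = binned.value_counts()     # how many values fall in each interval

print(binned.tolist())
print(counts)
```

By default the intervals are closed on the right, so a value of exactly 30 lands in (0, 30] rather than (30, 60].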