Data analysis and visualization

linendsound

Category: Dataquest notes

Article link: https://blog.csdn.net/linendsound/article/details/73463748

The material in this chapter is covered in detail in the book Python for Data Analysis.

The numpy package

import numpy as np
vector = np.array([5, 10, 15, 20])
matrix = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])

ndarray.shape returns the array's shape
numpy.genfromtxt() reads data from a text file

import numpy
nfl = numpy.genfromtxt("data.csv", delimiter=",")

Every value in a numpy array must have the same type

.astype() converts an array to another type
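A minimal sketch of the single-dtype rule and of .astype(). The sample values are made up for illustration and are not from the original notes.

```python
import numpy as np

# An array holds one dtype; mixing ints and floats upcasts everything to float.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

# astype() returns a new array with the requested dtype; the original is unchanged.
# Converting float to int truncates the fractional part toward zero.
as_int = mixed.astype(int)
print(as_int)

# Numeric strings can be converted as well.
as_float = np.array(["1.5", "2", "3.25"]).astype(float)
print(as_float)
```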

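numpy has built-in reduction functions such as sum(), mean(), and max(). They accept an axis argument: axis=0 computes one result per column, axis=1 one result per row.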
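As a quick illustration of the axis argument, here is a small sketch that reuses the 3x3 matrix defined earlier in these notes:

```python
import numpy as np

matrix = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])

col_sums = matrix.sum(axis=0)    # one result per column
row_sums = matrix.sum(axis=1)    # one result per row
total = matrix.sum()             # no axis: reduce over the whole array
row_means = matrix.mean(axis=1)  # mean of each row

print(col_sums, row_sums, total, row_means)
```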
np.meshgrid takes two 1-D arrays and produces two 2-D matrices that together cover every (x, y) pair of the two arrays

a = np.array([1,2,3,4])
b = np.array([4,5,6])
x,  y=np.meshgrid(a,b)
x
array([[1, 2, 3, 4],
       [1, 2, 3, 4],
       [1, 2, 3, 4]])
y
array([[4, 4, 4, 4],
       [5, 5, 5, 5],
       [6, 6, 6, 6]])

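np.where is the vectorized version of the ternary expression x if condition else y

np.where(cond,xarr,yarr)
xarr and yarr can themselves be np.where expressions, which lets you nest conditions

np.where(arr>0,2,-2)# set the positive entries of arr to 2 and all others to -2
np.where(arr>0,2,arr)# set the positive entries of arr to 2 and leave the rest unchanged
np.any() reports whether any element is True. Without axis it checks the whole array and returns a single value; axis=0 checks each column and returns one bool per column; axis=1 checks each row.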
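The nesting and the axis behavior can be sketched as follows. The array arr and the variable names are illustrative assumptions, not part of the original notes.

```python
import numpy as np

arr = np.array([[-3, 0, 4],
                [7, -1, 0]])

# Nested np.where: 1 for positive entries, -1 for negative ones, 0 otherwise.
signs = np.where(arr > 0, 1, np.where(arr < 0, -1, 0))

positive = arr > 0                   # boolean array with the same shape as arr
any_overall = positive.any()         # single bool for the whole array
any_per_col = positive.any(axis=0)   # one bool per column
any_per_row = positive.any(axis=1)   # one bool per row

print(signs)
print(any_overall, any_per_col, any_per_row)
```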

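Sorting

arr.sort()
# A multidimensional array can be sorted along any axis by passing the axis number: 1 sorts within each row, 0 within each column
# Note that np.sort() returns a sorted copy of the array, while the instance method arr.sort() modifies the array in place.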
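A short sketch of the copy versus in-place distinction, using a made-up 2x3 array:

```python
import numpy as np

arr = np.array([[9, 1, 8],
                [3, 7, 2]])

# np.sort returns a sorted copy; arr itself is left untouched.
rows_sorted = np.sort(arr, axis=1)
untouched = arr.copy()

# arr.sort sorts in place and returns None; axis=0 sorts within each column.
arr.sort(axis=0)

print(rows_sorted)
print(arr)
```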

Unique values

Similar to set() in Python,
numpy provides np.unique(),
which returns the unique values in sorted order
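For comparison, a small sketch of np.unique next to the pure-Python equivalent. The names array is invented for the example.

```python
import numpy as np

names = np.array(["Bob", "Joe", "Will", "Bob", "Will", "Joe", "Joe"])

unique_names = np.unique(names)             # unique values, already sorted
pure_python = sorted(set(names.tolist()))   # same result with set() plus sorted()

# return_counts=True also reports how often each unique value occurs.
values, counts = np.unique(names, return_counts=True)

print(unique_names, counts)
```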

For other set functions, see Table 4-6 in the book

Random numbers

np.random
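The notes only name the module, so the following is a guess at typical usage rather than what the author had in mind. It shows three commonly used np.random functions; the seed and sizes are arbitrary.

```python
import numpy as np

np.random.seed(0)  # fix the seed so repeated runs give the same numbers

normal = np.random.randn(2, 3)             # standard normal samples, shape (2, 3)
ints = np.random.randint(0, 10, size=5)    # integers drawn from [0, 10)
shuffled = np.random.permutation(5)        # a random ordering of 0..4

print(normal.shape, ints, shuffled)
```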

The pandas package

import pandas as pd
pd.read_csv()

df.head(n=5) selects the first 5 rows; n can be set to any number
df.columns returns the column labels (an Index)
df.shape[0] number of rows
df.shape[1] number of columns


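.loc   selects by label, including columns, e.g. [:,'cols']
.iloc  selects by integer position
.ix    mixed label/position lookup; deprecated and removed in pandas 1.0, so use .loc or .iloc instead (importantly, label slices include the last item)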
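The difference between label-based and position-based selection, including the inclusive end of a label slice, can be sketched like this. The DataFrame and its labels are invented for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {"team": ["A", "B", "C", "D"], "wins": [10, 7, 12, 3]},
    index=["w", "x", "y", "z"],
)

by_label = df.loc["w":"y"]        # label slice: the end label "y" is included
by_position = df.iloc[0:2]        # position slice: the end position 2 is excluded

wins_by_label = df.loc[:, "wins"]   # column chosen by name
wins_by_position = df.iloc[:, 1]    # same column chosen by position

print(len(by_label), len(by_position))
```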

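DataFrame.sort_values(col,inplace=True,ascending=False)
#inplace=True sorts the original df instead of returning a new one
pandas.isnull(series) returns a Series of True/False values
To select the non-null entries, compare the result with False
df.pivot_table(index="grouping column", values="target column; pass a list for several columns", aggfunc=np.mean)
DataFrame.dropna() drops every row that contains an NA value; axis=0 drops rows, axis=1 drops columns, subset=list of column names to check for NA
df.reset_index(drop=True) without drop=True the old index is kept as a new column
df.apply(func,axis=) axis=1 applies func to each row, axis=0 to each column
Series.unique()
To create a Series: Series(values,index=)
Series.reindex(list of index labels in the order we would like for that Series object)
Series.sort_index()
Series.sort_values()
DataFrame.set_index(column,inplace=,drop=) inplace=True updates this DataFrame; drop=False keeps the column that becomes the index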
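Several of the calls listed above can be tried together on a toy table. This is only a sketch: the column names pclass, age, and fare and all values are assumptions made for illustration, and aggfunc is passed as the string "mean", which is equivalent to np.mean here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3],
    "age": [38.0, np.nan, 27.0, 22.0, np.nan],
    "fare": [71.3, 53.1, 13.0, 7.3, 8.1],
})

# Boolean mask of missing ages; ~mask is the idiomatic form of mask == False.
age_missing = pd.isnull(df["age"])
known_age = df[~age_missing]

# Mean fare per passenger class.
mean_fare = df.pivot_table(index="pclass", values="fare", aggfunc="mean")

# Drop rows with a missing age, sort by fare, then rebuild a clean 0..n-1 index.
cleaned = df.dropna(subset=["age"])
by_fare = cleaned.sort_values("fare", ascending=False).reset_index(drop=True)

# axis=1 passes each row to the function.
fare_per_class = df.apply(lambda row: row["fare"] / row["pclass"], axis=1)

# Series with an explicit index, then reordered and sorted.
s = pd.Series([3, 1, 2], index=["c", "a", "b"])
reordered = s.reindex(["a", "b", "c"])
sorted_by_value = s.sort_values()

print(mean_fare)
print(by_fare)
```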

Matplotlib

import matplotlib.pyplot as plt

plt.plot(x_values, y_values)
plt.xticks()
plt.yticks()
plt.xlabel()
plt.ylabel()
plt.title()

When one figure shows several plots:

fig = plt.figure(figsize=(width, height))
axes_obj = fig.add_subplot(nrows, ncols, plot_number)

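#ax1 = fig.add_subplot(2,1,1)
#ax2 = fig.add_subplot(2,1,2)
#plt.subplots() creates the figure and its axes in one call (a single Axes by default; pass nrows and ncols for a grid)
fig, ax = plt.subplots()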

For an Axes instance:
ax.set_title()

Legend:
When several lines are drawn on one Axes, a legend tells them apart
Pass a label argument when plotting each line,
then call

plt.legend(loc='upper left')
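Putting the subplot and legend pieces together, a minimal sketch might look like this. The data and labels are invented, the Agg backend is chosen only so the script runs without a display, and ax.legend() is used in place of plt.legend() so the legend lands on a specific Axes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]

fig = plt.figure(figsize=(6, 8))
ax1 = fig.add_subplot(2, 1, 1)   # 2 rows, 1 column, first plot
ax2 = fig.add_subplot(2, 1, 2)   # 2 rows, 1 column, second plot

# Each line gets a label so the legend can name it.
ax1.plot(x, [1, 4, 9, 16], label="squares")
ax1.plot(x, [1, 8, 27, 64], label="cubes")
ax1.set_title("Two lines on one Axes")
ax1.legend(loc="upper left")

ax2.plot(x, [2, 4, 6, 8], label="doubles")
ax2.set_xlabel("x")
ax2.legend(loc="upper left")
```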

Bar chart

ax.bar(bar_positions, bar_heights,width=)
ax.set_xticks(tick_positions)
ax.set_xticklabels()
ax.set_xlabel()
ax.set_xlim()
ax.set_ylim()

Horizontal bar charts are also available

ax.barh(bar_positions, bar_widths, 0.5)
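A small sketch of the bar-chart calls above working together. The category names and heights are hypothetical, and the Agg backend is selected only so the example runs headless.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt

labels = ["A", "B", "C"]       # hypothetical category names
bar_heights = [3.2, 4.1, 2.7]  # hypothetical values
bar_positions = [0, 1, 2]

fig, ax = plt.subplots()
ax.bar(bar_positions, bar_heights, width=0.5)

# Put one tick under each bar and name it after its category.
ax.set_xticks(bar_positions)
ax.set_xticklabels(labels, rotation=90)
ax.set_xlabel("Category")
ax.set_ylim(0, 5)
```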

Scatter plot

Axes.scatter()

Histogram

Axes.hist(Series,bins=,range=)

Box plot

ax.boxplot(value)

Fine-tuning a plot

#hide the tick marks
Axes.tick_params(bottom=False,top=False,left=False,right=False)
#hide the border (spines)
for key in ax.spines:
    ax.spines[key].set_visible(False)
#the c parameter sets the line color and accepts RGB values; matplotlib expects each channel in the range 0-1, so divide the usual 0-255 values by 255
cb_dark_blue = (0/255,107/255,164/255)
#linewidth sets the line width
Axes.text(x,y,'text')
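The styling calls above can be combined on one Axes as in the sketch below. The plotted values and the annotation text are invented, and the Agg backend is used only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt

# RGB channels scaled from 0-255 down to the 0-1 range matplotlib expects.
cb_dark_blue = (0 / 255, 107 / 255, 164 / 255)

fig, ax = plt.subplots()
ax.plot([2000, 2005, 2010], [10, 20, 15], c=cb_dark_blue, linewidth=3)

# Hide tick marks on all four sides.
ax.tick_params(bottom=False, top=False, left=False, right=False)

# Hide the four border lines (spines).
for spine in ax.spines.values():
    spine.set_visible(False)

# Place a text annotation at data coordinates (2005, 20).
ax.text(2005, 20, "peak")
```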

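The seaborn library

import seaborn as sns
#histogram (distplot is deprecated since seaborn 0.11; histplot() or displot() replace it)
sns.distplot()
#kernel density plot
sns.kdeplot()
#plot style
sns.set_style()
#remove the border
sns.despine()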

# Condition on unique values of the "Survived" column.
# Recent seaborn releases use height= and fill=; older releases used size= and shade=.
g = sns.FacetGrid(titanic, col="Survived", height=6)
# For each subset of values, generate a kernel density plot of the "Age" column.
g.map(sns.kdeplot, "Age", fill=True)
# Several conditions can be combined
g = sns.FacetGrid(titanic, col="Survived", row="Pclass", hue="Sex", height=3)
g.map(sns.kdeplot, "Age", fill=True)
g.add_legend()

Basemap

import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
m = Basemap(projection='merc',llcrnrlat=-80,urcrnrlat=80,llcrnrlon=-180,urcrnrlon=180)

pd.concat([df1, df2], axis=0)  # axis=0 stacks the rows; axis=1 places the frames side by side
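A brief sketch of how the axis argument changes the result of pd.concat. The two small DataFrames are made up for the example.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"a": [5, 6], "b": [7, 8]})

stacked = pd.concat([df1, df2], axis=0)                       # rows appended; index repeats 0, 1, 0, 1
renumbered = pd.concat([df1, df2], axis=0, ignore_index=True)  # rows appended with a fresh 0..3 index
side_by_side = pd.concat([df1, df2], axis=1)                   # columns appended, aligned on the index

print(stacked.shape, side_by_side.shape)
```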

Correlation

pandas.DataFrame.corr()
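As a small illustration, DataFrame.corr() returns the pairwise correlation matrix of the numeric columns (Pearson by default). The columns below are invented so the expected correlations are easy to see.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],   # rises exactly with x
    "z": [4, 3, 2, 1],   # falls exactly as x rises
})

corr = df.corr()  # square matrix indexed by column name on both axes
print(corr)
```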
