Python
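- Use `.iloc` to index rows by position (the older `.ix` indexer has been removed from pandas).
- Random row selection, for example: rows = np.random.choice(diamonds.index.values,round(0.0001*len(diamonds)))
- The pandasql package lets you query data conditionally with SQL statements (search online for details).
- For numeric data, the `describe()` method in pandas gives the same kind of summary as `summary()` in R.
- In Python, `value_counts()` does the same job as the frequency table from `table()` in R.
- In pandas, the contingency table of two variables is produced by the `crosstab` function.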
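The points above can be tried on a small in-memory table. This is a minimal sketch, not the original author's code: the `df` frame is hypothetical toy data standing in for diamonds, the sample size of 3 is arbitrary, and `replace=False` is an added assumption so that no row is drawn twice (the one-liner above samples with replacement).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the diamonds dataset
df = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Ideal", "Good", "Premium", "Ideal"],
    "color": ["E", "E", "J", "J", "E", "J"],
    "price": [326, 334, 340, 351, 402, 410],
})

# Select rows by integer position
first_two = df.iloc[:2]

# Draw 3 distinct row labels at random, then pull those rows
rng = np.random.default_rng(0)
rows = rng.choice(df.index.values, size=3, replace=False)
sample = df.loc[rows]

# Numeric summary, similar in spirit to summary() in R
summary = df["price"].describe()

# Frequency table, similar to table() in R
counts = df["cut"].value_counts()

# Contingency table of two categorical variables
table = pd.crosstab(df["cut"], df["color"])

print(summary)
print(counts)
print(table)
```

Selecting by label with `.loc` after sampling the index keeps the example valid even when the index is not 0..n-1.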
### Linear regression
import pandas as pd
import statsmodels.formula.api as sm
iris = pd.read_csv("http://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv")
iris = iris.drop(columns = iris.columns[0])  # drop the row-number column that comes first in the CSV
iris.head()
# import numpy as np
# y = np.array([1,2,0,1])
# x = np.array([12,22,16,24])
# p = x/sum(x)
# Y_PPS = 0.25*sum(y/p)
# import mean_interval as mi
# mi.mean_interval(mean=8900, std=None, sig=500, n=35, confidence=0.95)
iris.columns = ['Sepal_Length','Sepal_Width','Petal_length','Petal_Width','Species']
# Build the model once, fit it once, then reuse the fitted result
model = sm.ols(formula = "Sepal_Length ~ Petal_length + Sepal_Width + Petal_Width + Species",
               data = iris)
result = model.fit()
result.summary()
result.params
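To see what the formula does with a categorical term such as Species, here is a rough sketch that builds the design matrix by hand and solves it with NumPy. It is not what statsmodels runs internally. The `toy` frame and its coefficients (1, 2, 3) are made up for illustration, and `drop_first=True` is used to mimic treating the first level as the baseline.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: y = 1 + 2*x, plus 3 when species is "b"
toy = pd.DataFrame({
    "x": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "species": ["a", "b", "a", "b", "a", "b"],
})
toy["y"] = 1.0 + 2.0 * toy["x"] + 3.0 * (toy["species"] == "b")

# Design matrix: intercept, x, and one dummy column (level "a" is the baseline)
dummies = pd.get_dummies(toy["species"], drop_first=True, dtype=float)
X = np.column_stack([np.ones(len(toy)), toy["x"].to_numpy(), dummies.to_numpy()])

# Ordinary least squares via a linear least-squares solve
coef, *_ = np.linalg.lstsq(X, toy["y"].to_numpy(), rcond=None)
intercept, slope, species_b = coef
print(intercept, slope, species_b)
```

Because the toy response is exactly linear, the solve recovers the three made-up coefficients up to floating-point error.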
Data visualization
import pandas as pd
import statsmodels.formula.api as sm
import numpy as np
import ggplot as gg
anscombe = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/anscombe.csv")
anscombe = anscombe.drop(columns = anscombe.columns[0])  # drop the row-number column that comes first in the CSV
# anscombe.head()
# np.mean(anscombe)
# np.std(anscombe)
### Fit a separate regression line for each x/y pair
result1 = sm.ols(formula = "y1 ~ x1",data = anscombe).fit()
# result1.summary()
# result1.params
result2 = sm.ols(formula = "y2 ~ x2",data = anscombe).fit()
result3 = sm.ols(formula = "y3 ~ x3",data = anscombe).fit()
result4 = sm.ols(formula = "y4 ~ x4",data = anscombe).fit()
print(result1.params)
print(result2.params)
print(result3.params)
print(result4.params)
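The point of fitting all four pairs is that Anscombe's quartet gives nearly the same line each time even though the scatter plots look very different. A dependency-free sketch of that check follows. The values are typed in by hand from the published quartet rather than downloaded, and `fit_line` is a helper written for this illustration, not a library function.

```python
# Anscombe's quartet, typed in by hand (x1, x2 and x3 share the same values)
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(x, y):
    """Return (intercept, slope) of the simple least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


fits = {name: fit_line(x, y) for name, (x, y) in quartet.items()}
for name, (intercept, slope) in fits.items():
    print(f"y{name} ~ x{name}: intercept={intercept:.2f}, slope={slope:.2f}")
```

All four fits come out close to an intercept of 3.0 and a slope of 0.5, which is why the plots further down are needed to tell the datasets apart.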
## Data visualization
# %matplotlib inline
p = gg.ggplot(gg.aes(x = 'x1',y = 'y1'),data = anscombe)
p + gg.geom_point()
p2 = gg.ggplot(gg.aes(x = 'x2',y = 'y2'),data = anscombe)
p2 + gg.geom_point()
p3 = gg.ggplot(gg.aes(x = 'x3',y = 'y3'),data = anscombe)
p3 + gg.geom_point()
p4 = gg.ggplot(gg.aes(x = 'x4',y = 'y4'),data = anscombe)
p4 + gg.geom_point()
ggplot runs into some problems in use, most likely because ggplot has not been updated while many other Python packages have moved on.
Problem 1:
AttributeError: module 'pandas' has no attribute 'tslib'
Fix:
#1. Change `pandas.tslib.Timestamp` to `pandas.Timestamp` (in utils.py and stats/smoothers.py)
#2. Change `from pandas.lib import Timestamp` to `from pandas import Timestamp`. (in smoothers.py)
Problem 2:
AttributeError: 'DataFrame' object has no attribute 'sort'
Fix:
#Change `smoothed_data.sort('x')` to `smoothed_data.sort_values('x')` (in stat_smooth.py)
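The two replacements named in the fixes above can be checked in isolation before patching the ggplot source files. A small sketch follows, using a hypothetical `smoothed_data` frame and an arbitrary date. It only exercises the current pandas calls and does not touch ggplot itself.

```python
import pandas as pd

# Hypothetical stand-in for the frame that stat_smooth.py sorts
smoothed_data = pd.DataFrame({"x": [3, 1, 2], "y": [30, 10, 20]})

# Replacement for the removed DataFrame.sort('x')
ordered = smoothed_data.sort_values("x")

# Replacement for pandas.tslib.Timestamp / pandas.lib.Timestamp
ts = pd.Timestamp("2019-09-18")

print(ordered)
print(ts)
```

`sort_values` returns a new frame and leaves `smoothed_data` unchanged, which matches how the old `sort` call was used in the patched line.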
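Original link: https://blog.csdn.net/yzhlinscau/article/details/100975562
swarmplot helps visualize a scatter plot across several categories (it draws a categorical scatter plot with non-overlapping points), while a pair plot helps plot a whole data frame (by drawing the pairwise relationships across the entire dataset).
Note: we can change the figure size with plt.figure(figsize = (A,B)) and the colors with the hue parameter,
to make the plot more coherent or easier to read.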
import pandas as pd
import seaborn as sns
# import statsmodels.formula.api as sm
import matplotlib.pyplot as plt
iris = pd.read_csv("http://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv")
iris = iris.drop(columns = iris.columns[0])  # drop the row-number column that comes first in the CSV
iris.head()
iris.info()
plt.scatter(x = "Sepal.Length",y = "Petal.Length",data = iris)
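sns.histplot(iris["Sepal.Length"], kde = True) # plot the distribution (distplot is deprecated in recent seaborn)
sns.regplot(x = "Sepal.Length",y = "Sepal.Width",data = iris) # add a regression line
# seaborn's swarmplot draws a scatter plot across multiple categories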
plt.figure(figsize = (8,6))
sns.swarmplot(x = "Sepal.Length",y = "Sepal.Width",hue = "Species",data = iris)
sns.pairplot(iris,hue = "Species")
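The pair plot result is shown below (a plot that is easy to like at first sight).
Bar plots: we can easily draw bar plots in Python with seaborn (and they really do look good).
The data can be downloaded here (http://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/diamonds.csv)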
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as sm
import matplotlib.pyplot as plt
diamonds = pd.read_csv("diamonds.csv")
sns.barplot(x = "color",y = "carat",data = diamonds)
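Factorplot: we use factorplot from the seaborn package (renamed catplot in seaborn 0.9) and find in the diamonds dataset that colors I and J have the highest prices compared with the other colors, and that the Premium cut has the highest price compared with the other cuts. Factorplot draws a categorical plot onto a FacetGrid.
By swapping the x-axis variable and the col (facet column) variable, and by changing the kind parameter from box plot to bar plot or point plot, we get the different plots below. Varying the chart type helps exploratory analysis. (Pretty impressive.)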
# factorplot was renamed catplot in seaborn 0.9, and its size parameter became height
sns.catplot(x = "cut",y = "price",col = "color",data = diamonds,kind = "bar",height = 4,
            aspect = .5)
sns.catplot(x = "color",y = "price",col = "cut",data = diamonds,kind = "box",height = 4,
            aspect = .5)
sns.catplot(x = "cut",y = "price",col = "color",data = diamonds,kind = "box",height = 4,
            aspect = .5)
sns.catplot(x = "cut",y = "price",col = "color",data = diamonds,kind = "point",height = 4,
            aspect = .5)
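Switching the kind parameter while keeping the same facets can be tried without downloading diamonds.csv. The sketch below uses `catplot`, the current seaborn name for factorplot. The `toy` frame is hypothetical data, and the Agg backend is selected only so the code also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed

import pandas as pd
import seaborn as sns

# Hypothetical toy data standing in for diamonds.csv
toy = pd.DataFrame({
    "cut": ["Good", "Ideal", "Good", "Ideal", "Good", "Ideal", "Good", "Ideal"],
    "color": ["E", "E", "J", "J", "E", "E", "J", "J"],
    "price": [300, 350, 400, 450, 320, 360, 420, 480],
})

# Same x, y and facet column; only the plot kind changes
grids = {}
for kind in ("bar", "box", "point"):
    grids[kind] = sns.catplot(x="cut", y="price", col="color", data=toy,
                              kind=kind, height=4, aspect=.5)

# Each result is a FacetGrid with one panel per level of the col variable
for kind, grid in grids.items():
    print(kind, grid.axes.shape)
```

Each call returns a FacetGrid, so a figure can be saved with `grids["bar"].savefig(...)` if a file is wanted.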
We can use jointplot to draw a joint distribution plot.
It can take the form of kde (for density), scatter (for points), or hex (hexagonal bins, for overplotted data).
sns.jointplot(x = "carat",y = "price",data = diamonds)
Using ggplot
import matplotlib as mt
from ggplot import *
import pandas as pd
# Layered display
diamonds = pd.read_csv("diamonds.csv")
p = ggplot(aes(x = 'price', y = 'carat'), data = diamonds)
p + geom_point()
p + geom_point() + facet_grid('cut')
p = ggplot(aes(x = 'price', y = 'carat', color = "cut"), data = diamonds)
p + geom_point()
p = ggplot(aes(x = 'price', y = 'carat', color = "clarity"),
data = diamonds)
p + geom_point()
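Packages for data science
Pandas: lets users work with the familiar data frame format, where rows are observations and columns are variables, along with many useful data analysis features.
Scikit-learn: a widely used machine learning package for data mining and modeling.
Statsmodels: brings statistical tests and models to Python.
Seaborn: brings statistical data visualization to Python.
Pandasql: allows SQL syntax to be used on data frames.
ggplot: an implementation of the grammar of graphics in Python.
SQLAlchemy: lets you connect to and query databases.
Simple machine learning
Preprocessing before analysis in Python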
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans
import sklearn.metrics as sm
import pandas as pd
import numpy as np
wine = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data",
header = None)
wine.head()
# wine.data has no header row, so the column names are assigned by hand
wine.columns=['winetype','Alcohol','Malic acid','Ash','Alcalinity of ash','Magnesium',
              'Total phenols','Flavanoids','Nonflavanoid phenols','Proanthocyanins',
              'Color intensity','Hue','OD280/OD315 of diluted wines','Proline']
wine.head()
wine.info()
wine.describe()
wine['winetype'].value_counts()  # the top-level pd.value_counts is deprecated in recent pandas
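The preprocessing steps above (read a headerless file, name the columns, count the classes) can be rehearsed offline. This is a minimal sketch assuming a made-up three-column excerpt held in memory. The real wine.data file has 14 columns, and the numbers here are invented.

```python
import io
import pandas as pd

# Hypothetical headerless CSV text; values are invented for illustration
raw = io.StringIO("1,13.0,1.5\n1,13.5,1.8\n2,12.4,0.9\n3,12.9,1.4\n")

# header=None tells pandas that the first line is data, not column names
toy = pd.read_csv(raw, header=None)
default_names = list(toy.columns)  # pandas falls back to integer labels

# Assign readable names, then count the rows in each wine type
toy.columns = ["winetype", "Alcohol", "Malic acid"]
counts = toy["winetype"].value_counts()

print(default_names)
print(counts)
```

Without `header=None`, the first data row would be consumed as column names, which is why the names are set explicitly afterwards.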