pandas Basics (Part 2)

Latest recommended article published 2021-12-02 14:08:16

ZLuby

Category: python. Tags: python, pandas

Article link: https://blog.csdn.net/weixin_38300566/article/details/85226821

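Part 2: Exploratory Data Analysis

Having learned how to ingest and inspect data, the next step is to explore it visually and quantitatively. This process is called exploratory data analysis (EDA) and is an important part of any data science project. pandas has powerful methods that help with both statistical and visual EDA. This part covers how and when to apply these techniques.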

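1. Visual exploratory data analysis

1.1 pandas line plots

By default, the .plot() method puts the Index values on the x-axis. This exercise practices creating line plots with specific columns on the x- and y-axes.

The dataset contains the 2015 monthly stock prices of AAPL, GOOG and IBM, taken from Yahoo Finance. Using a list of column names, plot the 'Month' column on the x-axis and the AAPL and IBM prices on the y-axis. All required modules have been imported, and the DataFrame is available in the workspace as df. Explore the column names with methods such as .head(), .info() and .describe().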

# Create a list of y-axis column names: y_columns
y_columns = ['AAPL','IBM']

# Generate a line plot
df.plot(x='Month', y=y_columns)

# Add the title
plt.title('Monthly stock prices')

# Add the y-axis label
plt.ylabel('Price ($US)')

# Display the plot
plt.show()

It looks like the monthly stock prices of both AAPL and IBM peaked early in the year and then fell.
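To make the difference between the default index x-axis and an explicit x column concrete, here is a minimal sketch. The DataFrame `prices` and its numbers are made up for illustration (they are not the Yahoo Finance data), and the Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up prices for illustration only (not the Yahoo Finance data)
prices = pd.DataFrame({
    "Month": ["Jan", "Feb", "Mar", "Apr"],
    "AAPL": [117.0, 128.0, 124.0, 125.0],
    "IBM": [153.0, 162.0, 160.0, 171.0],
})

# No x given: the integer index 0..3 ends up on the x-axis
ax_default = prices[["AAPL", "IBM"]].plot()

# x given: the Month column is used on the x-axis instead
ax_month = prices.plot(x="Month", y=["AAPL", "IBM"])
ax_month.set_title("Monthly stock prices")
ax_month.set_ylabel("Price ($US)")

# x positions of the first line in the default plot
default_x = list(ax_default.get_lines()[0].get_xdata())
print(default_x)
```

Each call to .plot() without an ax argument opens a new figure, so the two variants end up in separate figures.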

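1.2 pandas scatter plots

Use the kind='scatter' keyword argument to generate a pandas scatter plot. A scatter plot requires choosing the x and y columns by passing the x and y arguments to .plot(). The s keyword argument sets the size of each marker; matplotlib interprets it as the marker area in points squared, not as a radius in pixels. Here the marker sizes vary with each car's weight: they are supplied by a NumPy array, sizes, which holds the normalized 'weight' of every car in the dataset.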

array([ 51.12044694, 56.78387977, 49.15557238, 49.06977358,
49.52823321, 78.4595872 , 78.93021696, 77.41479205,
......
29.5706502 , 23.38638738, 36.23351603, 32.40968826,
18.88972581, 21.92965639, 28.68963762, 30.80379718])

# Generate a scatter plot
df.plot(kind='scatter', x='hp', y='mpg', s=sizes)

# Add the title
plt.title('Fuel efficiency vs Horse-power')

# Add the x-axis label
plt.xlabel('Horse-power')

# Add the y-axis label
plt.ylabel('Fuel efficiency (mpg)')

# Display the plot
plt.show()
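The post does not say how the sizes array was normalized. As a rough sketch, one plausible approach is to scale each weight by the maximum so the heaviest car gets a marker size of 100. Both the formula and the small `cars` table below are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up cars for illustration only
cars = pd.DataFrame({
    "hp": [130, 165, 95, 70],
    "mpg": [18.0, 15.0, 24.0, 31.0],
    "weight": [3504, 3693, 2372, 1985],
})

# Assumed normalization: the heaviest car maps to a marker size of 100
sizes = (cars["weight"] / cars["weight"].max() * 100).to_numpy()

# One marker per car, sized by its normalized weight
ax = cars.plot(kind="scatter", x="hp", y="mpg", s=sizes)
ax.set_title("Fuel efficiency vs Horse-power")
ax.set_xlabel("Horse-power")
ax.set_ylabel("Fuel efficiency (mpg)")

# The sizes matplotlib actually stored for the markers
drawn_sizes = ax.collections[0].get_sizes()
print(drawn_sizes)
```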

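1.3 pandas box plots

pandas can plot several columns in a single figure, sharing the same x- and y-axes, but sometimes two columns cannot sensibly be drawn together because their units do not match. In that case the .plot() method can generate a subplot for each column, and each subplot is scaled independently. Box plots are a good way to visualize the key summary statistics.

In this exercise, generate box plots of fuel efficiency (mpg) and weight from the automobile dataset. To do this in one figure, pass subplots=True to .plot() so that two separate plots are produced.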

# Make a list of the column names to be plotted: cols
cols = ['weight','mpg']

# Generate the box plots
df[cols].plot(kind='box',subplots=True)

# Display the plot
plt.show()
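A small sketch of what subplots=True produces, plus the quartiles that a box plot draws as the box edges and the middle line. The `cars` table is made up for illustration, and the Agg backend is used only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up cars for illustration only
cars = pd.DataFrame({
    "weight": [3504, 3693, 2372, 1985, 2130],
    "mpg": [18.0, 15.0, 24.0, 31.0, 27.0],
})

# One box plot per column, each on its own axes with its own y-scale
cars.plot(kind="box", subplots=True)
fig = plt.gcf()
n_axes = len(fig.axes)

# The 25th, 50th and 75th percentiles of each column
quartiles = cars.quantile([0.25, 0.5, 0.75])
print(n_axes)
print(quartiles)
```

Because weight is in the thousands and mpg in the tens, a shared y-axis would squash the mpg box almost flat; separate axes avoid that.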

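1.4 pandas hist, PDF and CDF

pandas relies on the .hist() method to generate not only histograms but also plots of the probability density function (PDF) and the cumulative distribution function (CDF).

The dataset consists of restaurant bills, including the amount each customer tipped. The original dataset is provided by the Seaborn package. Plot the PDF and CDF of the dataset's fraction column, which holds the tip as a fraction of the total bill. For the PDF, pass density=True when calling .hist() (older matplotlib releases named this argument normed, which has since been removed); for the CDF, pass cumulative=True in addition to density=True.

ax=axes[0] places the plot in the first row.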

# This formats the plots such that they appear on separate rows
fig, axes = plt.subplots(nrows=2, ncols=1)

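# Plot the PDF (density=True replaces normed=True, which newer matplotlib no longer accepts)
df.fraction.plot(ax=axes[0], kind='hist', bins=30, density=True, range=(0,.3))

# Plot the CDF
df.fraction.plot(ax=axes[1], kind='hist', bins=30, cumulative=True, density=True, range=(0,.3))

# Display the figure once, after both axes have been drawn
plt.show()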
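To see what the normalization means numerically, the sketch below uses np.histogram as a stand-in for the histogram the plotting call computes. The `fraction` values are randomly generated for illustration and are not the Seaborn tips data. With density=True the bar areas sum to 1, and accumulating those areas gives a curve that ends at 1.

```python
import numpy as np

# Made-up tip fractions for illustration only, all inside the plotted range
rng = np.random.default_rng(0)
fraction = rng.uniform(0.05, 0.25, size=200)

# Same binning as the plots: 30 bins over the range (0, 0.3)
heights, edges = np.histogram(fraction, bins=30, range=(0, 0.3), density=True)
widths = np.diff(edges)

# PDF: the total area of the bars is 1
pdf_area = float(np.sum(heights * widths))

# CDF: running total of the bar areas, ending at 1
cdf = np.cumsum(heights * widths)

print(pdf_area)
print(cdf[-1])
```

Values that fall outside range are dropped before normalizing, so the density describes only the observations inside (0, 0.3).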

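2. Statistical exploratory data analysis

The median is a very useful statistic, especially in the presence of outliers, where it is more robust than the mean.

Examine statistics on the percentage of bachelor's degrees awarded to women from 1970 to 2011. Data for 17 different fields is recorded for each year. Compute the minimum and maximum of the 'Engineering' column, then draw a line plot of the mean across all 17 fields for each year. Use the .mean() method with the keyword argument axis='columns', which computes the mean across all columns for each row.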

# Print the minimum value of the Engineering column
print(df.Engineering.min())

# Print the maximum value of the Engineering column
print(df.Engineering.max())

# Construct the mean percentage per year: mean
mean = df.mean(axis='columns')

# Plot the average percentage per year
mean.plot()

# Display the plot
plt.show()
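The direction of axis='columns' is easy to mix up, so here is a small sketch on a made-up two-field table (the numbers are invented, not the real degree data). axis='columns' collapses the columns and returns one value per row; axis='index' collapses the rows and returns one value per column.

```python
import pandas as pd

# Made-up percentages for illustration only
degrees = pd.DataFrame(
    {"Engineering": [0.8, 1.0, 1.2], "Biology": [29.0, 29.4, 29.8]},
    index=pd.Index([1970, 1971, 1972], name="Year"),
)

# One mean per year (row), averaged across the field columns
row_mean = degrees.mean(axis="columns")

# One mean per field (column), averaged across the years
col_mean = degrees.mean(axis="index")

print(row_mean)
print(col_mean)
```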

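2.2 Median vs. mean

In many datasets the mean and the median can differ considerably because of outliers. Examine the mean, median and maximum fare paid by passengers on the Titanic, and generate a box plot of the fares.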

# Print summary statistics of the fare column with .describe()
print(df.fare.describe())

# Generate a box plot of the fare column
df.fare.plot(kind='box')

# Show the plot
plt.show()

Here you can see why the median is the more informative statistic in the presence of outliers.
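A tiny worked example of that effect, using five invented fares (not the Titanic data): a single very large value pulls the mean far away from the typical fare, while the median does not move.

```python
import pandas as pd

# Made-up fares for illustration only; 500.0 plays the role of the outlier
fares = pd.Series([7.0, 8.0, 8.0, 9.0, 500.0])

mean_with_outlier = fares.mean()      # (7 + 8 + 8 + 9 + 500) / 5
median_with_outlier = fares.median()  # middle value after sorting

# Drop the outlier and compare again
typical = fares[fares < 100]
mean_without_outlier = typical.mean()
median_without_outlier = typical.median()

print(mean_with_outlier, median_with_outlier)
print(mean_without_outlier, median_without_outlier)
```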

2.3 Quantiles

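In this exercise you will examine the distribution of life expectancy across the world's countries. The dataset contains the life expectancy of people born in each year from 1800 to 2015. Because country names changed or results went unreported, not every country has a value for every year. The dataset comes from Gapminder.

First, determine how many countries reported a value in 2015; the full dataset contains 260 unique countries. Then compute the 5th and 95th percentiles of life expectancy over the whole dataset. Finally, make a box plot of life expectancy every 50 years from 1800 to 2000. Note the large shift in the distribution over this period.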
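The post stops before showing code for this exercise, so the following is only a sketch of how the three steps might look. The table `life`, its country labels, its year-string column names and all of its values are assumptions made up for illustration; the real Gapminder file may be laid out differently. The box plot here uses a 100-year step only because the toy table has few columns.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up life expectancies for illustration only; NaN marks a missing report
life = pd.DataFrame(
    {
        "1800": [30.0, 32.0, 34.0, 36.0, np.nan],
        "1900": [35.0, 40.0, 45.0, 50.0, np.nan],
        "2000": [60.0, 70.0, 75.0, 80.0, 82.0],
        "2015": [62.0, 72.0, 77.0, 81.0, 83.0],
    },
    index=["A", "B", "C", "D", "E"],
)

# Step 1: .count() ignores NaN, so it gives the number of reporting countries
reported_2015 = life["2015"].count()
reported_1800 = life["1800"].count()

# Step 2: 5th and 95th percentiles of every year column
pct = life.quantile([0.05, 0.95])

# Step 3: box plots for evenly spaced years
years = [str(y) for y in range(1800, 2001, 100)]
ax = life[years].plot(kind="box")

print(reported_2015, reported_1800)
print(pct)
```

With the real data, range(1800, 2001, 50) would select the 50-year steps the exercise asks for, provided the columns are named by year strings.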

(To be continued)
