Pandas Data Correlation Analysis: Exploratory Data Analysis with Pandas, SciPy, and Seaborn

This article describes how to use Python's Pandas, SciPy, and Seaborn libraries for exploratory data analysis. It covers reading data from HTML, data cleaning, counting missing values, summarizing categorical data, group-by aggregation, and data visualization with Pandas and Seaborn, in particular scatter plots and correlation analysis. A worked example shows how to parse Wikipedia data and analyze the relationship between heat and pod size across chili pepper species.

In this post we are going to learn to explore data using Python, Pandas, and Seaborn. The data we are going to explore comes from a Wikipedia article. We will learn how to parse data from a URL and then explore it through grouping and data visualization. More specifically, we will learn how to count missing values, group data to calculate the mean, and then visualize relationships between two variables, among other things.

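As a quick preview of those operations, here is a minimal sketch on a toy DataFrame; the column names Species, Heat, and Pod size are hypothetical placeholders, not the article's actual columns:

import pandas as pd
import seaborn as sns

# Toy data standing in for the chili pepper table (hypothetical columns)
df = pd.DataFrame({
    'Species': ['annuum', 'annuum', 'chinense'],
    'Heat': [5000, None, 100000],
    'Pod size': [7.5, 9.0, 4.0],
})

print(df.isna().sum())                       # count missing values per column
print(df.groupby('Species')['Heat'].mean())  # group and calculate the mean
sns.scatterplot(x='Heat', y='Pod size', data=df)  # relationship between two variables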

In previous posts we have used Pandas to import data from Excel and CSV files. Here we are going to use Pandas read_html because it supports reading data from HTML at URLs (https or http). To read HTML, Pandas uses one of the Python libraries lxml, html5lib, or BeautifulSoup4, which means that you have to make sure that at least one of these libraries is installed. In the specific Pandas read_html example here, we use BeautifulSoup4 to parse the HTML tables from the Wikipedia article.

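If you want to check which of these parsers is available in your environment before calling read_html, a small sketch like this can help (read_html itself raises an ImportError if none of them is installed):

import importlib.util

# Report which of the supported HTML parsers can be imported
for parser in ('lxml', 'html5lib', 'bs4'):
    status = 'installed' if importlib.util.find_spec(parser) else 'missing'
    print(f'{parser}: {status}')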

Installing the Libraries

Before proceeding to the Pandas read_html example, we are going to install the required libraries. In this post we are going to use Pandas, Seaborn, NumPy, SciPy, and BeautifulSoup4: Pandas for parsing HTML and plotting, Seaborn for data visualization, NumPy and SciPy for some calculations, and BeautifulSoup4 as the parser for the read_html method.


Installing Anaconda is by far the easiest way to install all the packages needed. If you have an Anaconda distribution, you can open up your terminal and type conda install <packagename>. That is, to install all the packages at once:


conda install numpy scipy pandas seaborn beautifulsoup4

It’s also possible to install using Pip:


pip install numpy scipy pandas seaborn beautifulsoup4

How to Use Pandas read_html

In this section we will use Pandas read_html to parse data from a Wikipedia article. The article we are going to parse has 6 tables, and 5 of them contain data we are going to explore. We are going to look at the Scoville heat units and pod size of different chili pepper species.


import pandas as pd

# URL of the Wikipedia article listing Capsicum cultivars
url = 'https://en.wikipedia.org/wiki/List_of_Capsicum_cultivars'

# Parse every HTML table on the page into a list of DataFrames,
# using BeautifulSoup4 as the parser and row 0 as the header row
data = pd.read_html(url, flavor='bs4', header=0, encoding='UTF8')

In the code above we start, as usual, by importing Pandas. After that, a string variable (url) points to the URL. We then use Pandas read_html to parse the HTML from that URL. As with the read_csv and read_excel methods, the parameter header tells read_html on which row the headers are; in this case, it is the first row. The parameter flavor is used here to select BeautifulSoup4 as the HTML parser; if we use lxml instead, some columns in the dataframe will be empty. What we get back is all the tables from the URL, and these tables are, in turn, stored in a list (data).
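Since read_html returns a list, a natural next step is to check how many tables were parsed and to peek at one of them. The sketch below assumes, as the article suggests, that the species data we want lives in five of the six tables; the slice used here is an illustrative guess, not the article's exact indices:

# data is a list of DataFrames, one per HTML table on the page
print(len(data))       # number of tables parsed from the article
print(data[0].head())  # preview the first table

# Hypothetical next step: stack five of the tables into one DataFrame
df = pd.concat(data[:5], ignore_index=True)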
