A Practical Guide to Pandas

Original article: https://medium.com/swlh/a-practical-guide-to-pandas-9cc126f77261

What is Pandas?

Pandas is an open-source data analysis and manipulation tool for Python.

The name? It comes from the econometrics term “panel data”, which is multi-dimensional data with measurements over time. It’s also pretty cute so that’s a bonus!

At its core, it allows us to easily work with spreadsheet-like data. From there, you can clean the data, perform any additional modifications, and analyse it to gain some insight.

Installation and Importing

If you have Python installed, you likely already have pip. If not, the pip documentation has installation instructions. Pip allows us to easily install packages from the Python Package Index from the command line. Pandas can be installed with pip, along with its dependency NumPy.

pip install numpy
pip install pandas

Whenever you need to use pandas for one of your projects, it can be imported like so:

import numpy as np
import pandas as pd

We import using the abbreviations np and pd to make it easier to call upon the functions and classes in the module. If you want to read more about how the import system works in Python or how to use NumPy, you can check out my blog posts on either subject.

The Basics

Reading files with Pandas

If you’ve got data you want to do something with, you probably don’t just have it memorized. Most likely, you’ve got it in some sort of file. Conveniently, Pandas can read data from a wide variety of storage formats. Reading a file loads the data into one of Pandas’ objects, which we will talk about later.

For example, the most common ways are:

import pandas as pd

# To read from a CSV (Comma-Separated Values) file
csv_data = pd.read_csv("path/to/file/data.csv")

# To read from a JSON (JavaScript Object Notation) file
json_data = pd.read_json("path/to/file/data.json")

# To read from an HTML (Hypertext Markup Language) file.
# read_html returns a list of DataFrames, one per <table> it finds,
# and needs an HTML parser such as lxml to be installed.
html_tables = pd.read_html("path/to/file/data.html")
html_data = html_tables[0]

As always, all of the code for this blog post can be found on my GitHub.

Seeing our data

If you want to get a basic overview of what your data looks like, you can use the head and tail methods. They allow you to see a snippet of the first few and last few (default 5) entries of your data respectively.
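
As a minimal sketch of head and tail, the example below builds a small stand-in DataFrame rather than loading a file. The column names and values are made up for illustration and are not from the Chopped data set.

```python
import pandas as pd

# Hypothetical stand-in data: ten rows, so the default of five is visible.
df = pd.DataFrame({
    "episode": range(1, 11),
    "rating": [8.1, 7.9, 8.4, 8.0, 7.5, 8.8, 8.2, 7.7, 8.3, 8.6],
})

first_rows = df.head()   # first 5 rows by default
last_rows = df.tail(3)   # pass a number to change how many rows come back

print(first_rows)
print(last_rows)
```

In a Jupyter Notebook, leaving `df.head()` as the last line of a cell displays the result as a formatted table without needing print.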

For the purposes of this guide, I’ll be using a tool known as Jupyter Notebook so that we can easily see the results of our code. I used a data set from the Food Network show Chopped, courtesy of Jeffrey Braun. All of the data is stored in a CSV file called “chopped.csv.”

Image: photo by Kevin Doran on Unsplash

The Series

Creating Series

According to the Pandas documentation, a Series is “a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.).” Importantly, the Series is built on top of the NumPy ndarray, so much of the ndarray's functionality carries over to it.

Essentially, it's like having a spreadsheet with two columns: labels, which we will refer to as the index, and values, which could be anything from countries to prices to MySpace “Top 8” rankings.

Conveniently, there are several ways to create them:
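
The original examples are not preserved here, so the following is a sketch of some common ways to build a Series. All values and labels are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# From a list: Pandas supplies a default integer index (0, 1, 2, ...)
from_list = pd.Series([10, 20, 30])

# From a list with explicit labels
from_labels = pd.Series([10, 20, 30], index=["a", "b", "c"])

# From a dictionary: the keys become the index labels
from_dict = pd.Series({"a": 10, "b": 20, "c": 30})

# From a NumPy array
from_array = pd.Series(np.array([1.5, 2.5, 3.5]), index=["x", "y", "z"])

# From a single scalar, repeated for every label
from_scalar = pd.Series(5, index=["a", "b", "c"])

print(from_dict)
```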

Series are ndarray-like

Series act very similarly to NumPy ndarrays, and as such a lot of similar functionality can be used on a series.
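
To make the comparison concrete, here is a short sketch of array-style operations on a Series. The data is made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0, 16.0], index=["a", "b", "c", "d"])

first_two = s.iloc[:2]         # positional slicing, as with an array
roots = np.sqrt(s)             # NumPy functions accept a Series and keep its index
above_mean = s[s > s.mean()]   # boolean masks work as they do on arrays
as_array = s.to_numpy()        # get a plain ndarray when you need one

print(roots)
```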

Additionally, because of this property, Pandas Series can also take advantage of “vectorized operations.” Essentially, this means that we don’t have to loop through each element in a Series to perform operations on it. Instead, our operation can be “scaled” to the size of our Series and be done just like a matrix operation. For example, if we wanted to add 3 to each element in our Series, all we have to do is add 3 to the Series itself:
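
A minimal sketch of that addition, using an illustrative Series of four integers, with the equivalent explicit loop shown only for comparison:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# Vectorized: one expression, no explicit loop
plus_three = s + 3

# The equivalent loop, shown only for comparison
looped = pd.Series([value + 3 for value in s])

print(plus_three.tolist())  # [4, 5, 6, 7]
```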

The scalar value 3 got “stretched” into a vector or array full of 3’s that was the same size as our original Series. Consequently, the computation was much easier and faster.

Series are also dictionary-like

Series also act very similarly to a Python dictionary, with the index label acting as a key.
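
The dictionary-style behaviour can be sketched as follows. The item names and prices are illustrative assumptions.

```python
import pandas as pd

prices = pd.Series({"apple": 0.5, "bread": 2.25, "milk": 1.2})

bread_price = prices["bread"]          # look up a value by its label
has_eggs = "eggs" in prices            # membership tests check the index labels
eggs_price = prices.get("eggs", 0.0)   # .get returns a default instead of raising KeyError
prices["eggs"] = 3.0                   # assigning to a new label adds an entry

print(prices)
```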

The DataFrame

The DataFrame is a two-dimensional data structure whose columns can each hold a different type of data. It is essentially a spreadsheet or table, just with a lot of added functionality. It is the central object in Pandas.

DataFrame creation

Conveniently, it accepts many different types of data as input:

A DataFrame from several Series:
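
A sketch of this case, with made-up names, ages, and row labels:

```python
import pandas as pd

names = pd.Series(["Ana", "Ben", "Cho"], index=["p1", "p2", "p3"])
ages = pd.Series([31, 24, 28], index=["p1", "p2", "p3"])

# Each Series becomes a column; rows are aligned on the shared index labels
people = pd.DataFrame({"name": names, "age": ages})

print(people)
```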

A DataFrame from a dictionary:
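
A sketch of this case, where each key is a column name and each value is a list of equal length (the data is illustrative):

```python
import pandas as pd

data = {"name": ["Ana", "Ben", "Cho"], "age": [31, 24, 28]}

# Keys become column names; rows get a default integer index
people = pd.DataFrame(data)

print(people)
```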

A DataFrame from a 2D array:
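
A sketch of this case using a NumPy array. The column and row labels are assumptions chosen for the example; if they are omitted, Pandas numbers both from 0.

```python
import numpy as np
import pandas as pd

grid = np.array([[1, 2, 3], [4, 5, 6]])

# Each inner array is a row; labels for columns and rows are optional
df = pd.DataFrame(grid, columns=["a", "b", "c"], index=["row1", "row2"])

print(df)
```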

DataFrame getting and setting

You can treat a DataFrame much like a Python dictionary, with the keys being the names of each column, and the corresponding values being Series. For example:
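
A minimal sketch of the dictionary-style access, using a small made-up table:

```python
import pandas as pd

people = pd.DataFrame({"name": ["Ana", "Ben", "Cho"], "age": [31, 24, 28]})

ages = people["age"]                          # getting a key returns that column as a Series
people["age_next_year"] = people["age"] + 1   # setting a new key adds a column
del people["name"]                            # deleting a key drops the column

print(people)
```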

If you want to get a row in the DataFrame as a Series, you can use the loc indexer. You can also use loc to get and set values for specific elements in the DataFrame.
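
A short sketch of loc in use. The row labels p1 to p3 are assumptions chosen so that label-based access is easy to see.

```python
import pandas as pd

people = pd.DataFrame(
    {"name": ["Ana", "Ben", "Cho"], "age": [31, 24, 28]},
    index=["p1", "p2", "p3"],
)

row = people.loc["p2"]          # one row, returned as a Series
age = people.loc["p2", "age"]   # a single element: row label, then column label
people.loc["p2", "age"] = 25    # the same syntax sets a value

print(people)
```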

If you need to add an entire row of new data to the DataFrame, older versions of Pandas provided an append method for this. That method was removed in Pandas 2.0; in current versions, use pd.concat instead.

However, this can be computationally expensive, so try your best to first gather all your data and then create a DataFrame.
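
The sketch below adds a row with pd.concat, which is what current Pandas releases offer for this (DataFrame.append was removed in Pandas 2.0), and then shows the cheaper gather-first pattern. The names and ages are illustrative.

```python
import pandas as pd

people = pd.DataFrame({"name": ["Ana", "Ben"], "age": [31, 24]})

# Adding one row: wrap it in a one-row DataFrame and concatenate
new_row = pd.DataFrame([{"name": "Cho", "age": 28}])
people = pd.concat([people, new_row], ignore_index=True)

# Cheaper when there are many rows: collect plain dictionaries first,
# then build the DataFrame once at the end
records = []
for name, age in [("Dev", 35), ("Eli", 22)]:
    records.append({"name": name, "age": age})
more_people = pd.DataFrame(records)

print(people)
print(more_people)
```

Each concatenation builds a new DataFrame, which is why repeating it inside a loop gets expensive as the table grows.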

One helpful feature of Pandas is the ability to do boolean indexing. Essentially, this allows us to get only the values which meet a certain condition. For example, if we only want to grab people who are over the age of 25:
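
A minimal sketch of that filter, with a made-up table of names and ages:

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Ana", "Ben", "Cho", "Dev"],
    "age": [31, 24, 28, 19],
})

is_over_25 = people["age"] > 25   # a Series of True/False values, one per row
over_25 = people[is_over_25]      # keep only the rows where the mask is True

# Conditions combine with & (and) and | (or); each one needs its own parentheses
late_twenties = people[(people["age"] > 25) & (people["age"] < 30)]

print(over_25)
```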

There are many applications of boolean indexing, and I encourage you to try out several different ideas. (What if you wanted to grab all the values above the mean age?)

DataFrame cleaning

Often, data is not very clean. There will be typos, missing information, extraneous rows, mismatched types, and just general weirdness. One of the most important jobs of a data scientist is getting a data set into a state that is “clean” and devoid of all of this weirdness. That way, any insights drawn from that data aren’t the result of bad data.
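
As a rough sketch of what a cleaning pass might look like, the example below handles a few of the problems listed above on a tiny made-up table. Real data sets usually need more judgment than this, and the steps shown are one possible choice rather than a recipe.

```python
import pandas as pd

raw = pd.DataFrame({
    "name": [" Ana", "Ben", "Ben", None],
    "age": ["31", "24", "24", "28"],   # numbers stored as text
})

clean = (
    raw.dropna(subset=["name"])        # drop rows with a missing name
       .drop_duplicates()              # drop exact repeat rows
       .assign(
           name=lambda d: d["name"].str.strip(),   # trim stray whitespace
           age=lambda d: d["age"].astype(int),     # convert text to integers
       )
       .reset_index(drop=True)
)

print(clean)
```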

A blog post by Malay Agarwal does an amazing job of explaining how to clean data, and goes into much more depth than I could in this single guide.

DataFrame manipulation

Finally, one of the most important things we can do with a DataFrame is manipulating its data to create new insights.

For example, imagine you work at Amazon and you have sales data for some technology items.

Let’s say you wanted to find out how much revenue each item generated. We could define a new column called “Total Sales”, which is simply the product of “Unit Price” and “Quantity” for each item. Similarly, let’s say you wanted to find out which items made up the largest fraction of sales. You could define a “Ratio of Sales” column, which takes the “Total Sales” for each item and divides it by the sum of “Total Sales” for all items.

This is straightforward to accomplish with DataFrames:
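
A minimal sketch of both columns. The items, prices, and quantities are made up for illustration; only the column names come from the description above.

```python
import pandas as pd

sales = pd.DataFrame({
    "Item": ["Laptop", "Phone", "Headphones"],
    "Unit Price": [1000.0, 500.0, 100.0],
    "Quantity": [2, 4, 10],
})

# Element-wise product of two columns, one result per row
sales["Total Sales"] = sales["Unit Price"] * sales["Quantity"]

# Each item's share of the grand total; the shares sum to 1
sales["Ratio of Sales"] = sales["Total Sales"] / sales["Total Sales"].sum()

print(sales)
```

Both new columns are computed with vectorized operations, so no loop over the rows is needed.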

And now you can generate some insight as to which items contribute the most to your overall sales! Look at you, a whole Data Scientist!

Conclusion

Image: photo by Matteo Grobberio on Unsplash

Pandas is a powerful tool for dealing with data. Whether you want to draw insights from Food Network shows, or drive a small business, Pandas is an integral part of the Python data science ecosystem.

I hope you were able to glean some information from this post! It was an interesting topic to look into! If there is any aspect of this post that you would like me to go into more depth with, please contact me.

Thanks for reading!
