The Ultimate Guide to the Pandas Library for Python Data Science

Pandas (which is a portmanteau of "panel data") is one of the most important packages to grasp when you’re starting to learn Python.

The package is known for a very useful data structure called the pandas DataFrame. Pandas also allows Python developers to easily deal with tabular data (like spreadsheets) within a Python script.

This tutorial will teach you the fundamentals of pandas that you can use to build data-driven Python applications today.

Table of Contents

You can skip to a specific section of this pandas tutorial using the table of contents below:

Introduction to Pandas

Pandas is a widely-used Python library built on top of NumPy. Much of the rest of this course will be dedicated to learning about pandas and how it is used in the world of finance.

What is Pandas?

Pandas is a Python library created by Wes McKinney, who built it to make it easier to work with datasets in Python for his job in finance.

According to the library’s website, pandas is “a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.”

Pandas stands for ‘panel data’. Note that pandas is typically stylized as an all-lowercase word, although it is considered a best practice to capitalize its first letter at the beginning of sentences.

Pandas is an open source library, which means that anyone can view its source code and make suggestions using pull requests. If you are curious about this, visit the pandas source code repository on GitHub.

The Main Benefit of Pandas

Pandas was designed to work with two-dimensional data (similar to Excel spreadsheets). Just as the NumPy library has a built-in data structure called an array with special attributes and methods, the pandas library has a built-in two-dimensional data structure called a DataFrame.

What We Will Learn About Pandas

As we mentioned earlier in this course, advanced Python practitioners will spend much more time working with pandas than they spend working with NumPy.

Over the next several sections, we will cover the following information about the pandas library:

  • Pandas Series
  • Pandas DataFrames
  • How To Deal With Missing Data in Pandas
  • How To Merge DataFrames in Pandas
  • How To Join DataFrames in Pandas
  • How To Concatenate DataFrames in Pandas
  • Common Operations in Pandas
  • Data Input and Output in Pandas
  • How To Save Pandas DataFrames as Excel Files for External Users

Pandas Series

In this section, we’ll be exploring pandas Series, which are a core component of the pandas library for Python programming.

What Are Pandas Series?

Series are a special type of data structure available in the pandas Python library. Pandas Series are similar to NumPy arrays, except that we can give them a named or datetime index instead of just a numerical index.
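As a rough illustration of a datetime index, the sketch below builds a Series keyed by dates. The prices and dates are made up for illustration and do not come from this tutorial.

```python
import pandas as pd

# Hypothetical daily values, invented purely for this example.
dates = pd.date_range("2024-01-01", periods=3, freq="D")
prices = pd.Series([101.5, 102.0, 99.8], index=dates)

# Look up a single value by its date label.
jan_2 = prices.loc[pd.Timestamp("2024-01-02")]
print(jan_2)
```

The lookup uses a date rather than a position, which is the kind of labeling a plain NumPy array does not offer.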

The Imports You’ll Require To Work With Pandas Series

To work with pandas Series, you’ll need to import both NumPy and pandas, as follows:

import numpy as np
import pandas as pd

For the rest of this section, I will assume that both of those imports have been executed before running any code blocks.

How To Create a Pandas Series

There are a number of different ways to create a pandas Series. We will explore several of them in this section.

First, let’s create a few starter variables - specifically, we’ll create two lists, a NumPy array, and a dictionary.

labels = ['a', 'b', 'c']
my_list = [10, 20, 30]
arr = np.array([10, 20, 30])
d = {'a':10, 'b':20, 'c':30}

The easiest way to create a pandas Series is by passing a vanilla Python list into the pd.Series() constructor. We do this with the my_list variable below:

pd.Series(my_list)

If you run this in your Jupyter Notebook, you will notice that the output is quite different from the output of a normal Python list:

0    10
1    20
2    30
dtype: int64

The output shown above is clearly designed to present as two columns. The second column is the data from my_list. What is the first column?

One of the key advantages of using pandas Series over NumPy arrays is that they allow for labeling. As you might have guessed, that first column is a column of labels.

We can add labels to a pandas Series using the index argument like this:

#Remember - we created the 'labels' list earlier in this section
pd.Series(my_list, index=labels)

The output of this code is below:

a    10
b    20
c    30
dtype: int64

Why would you want to use labels in a pandas Series? The main advantage is that it allows you to reference an element of the Series using its label instead of its numerical position. To be clear, once labels have been applied to a pandas Series, you can still reference an element by its position. Recent versions of pandas deprecate plain square brackets with an integer for this purpose, so use .iloc for positional lookups.

An example of this is below.

series = pd.Series(my_list, index=labels)
series.iloc[0]
#Returns 10
series['a']
#Also returns 10
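The difference between a label and a position matters most when the labels are themselves integers. The sketch below uses a hypothetical Series (not one from this tutorial) whose integer labels run in reverse order, and uses .loc and .iloc to make each kind of lookup explicit.

```python
import pandas as pd

# Hypothetical Series whose labels are integers in reverse order.
s = pd.Series([10, 20, 30], index=[2, 1, 0])

by_label = s.loc[0]      # label 0 is attached to the last element
by_position = s.iloc[0]  # position 0 is the first element

print(by_label, by_position)
```

Writing .loc for labels and .iloc for positions avoids any ambiguity about which one is meant.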

You might have noticed that the ability to reference an element of a Series using its label is similar to how we can reference the value of a key-value pair in a dictionary. Because of this similarity in how they function, you can also pass in a dictionary to create a pandas Series. We’ll use the d={'a': 10, 'b': 20, 'c': 30} that we created earlier as an example:

pd.Series(d)

This code’s output is:

a    10
b    20
c    30
dtype: int64
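The arr variable created at the start of this section can be used in the same way as the list. A minimal sketch, redefining labels and arr so the block runs on its own:

```python
import numpy as np
import pandas as pd

labels = ['a', 'b', 'c']
arr = np.array([10, 20, 30])

# A NumPy array is accepted just like a Python list,
# and the index argument attaches the labels.
from_array = pd.Series(arr, index=labels)
print(from_array)
```

The resulting Series holds the same labels and values as the list-based and dictionary-based versions above.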

It may not yet be clear why we have explored two new data structures (NumPy arrays and pandas Series) that are so similar. In the next part of this section, we’ll explore the main advantage of pandas Series over NumPy arrays.

The Main Advantage of Pandas Series Over NumPy Arrays

While we didn’t encounter it at the time, NumPy arrays are limited by one characteristic: every element of a NumPy array must share a single data type. Said differently, the elements of a NumPy array must be all strings, or all integers, or all booleans - you get the point.

Pandas Series do not suffer from this limitation in practice. A Series can hold arbitrary Python objects side by side, which makes it highly flexible.

As an example, you can pass three of Python’s built-in functions into a pandas Series without getting an error:

pd.Series([sum, print, len])

Here’s the output of that code:

0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object

To be clear, the example above is highly impractical and not something we would ever execute in practice. It is, however, an excellent example of the flexibility of the pandas Series data structure.
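A slightly more everyday comparison is a list that mixes an integer, a string, and a boolean. The sketch below, using a made-up list, shows how each library treats it: NumPy converts every element to one common type, while the Series keeps the original Python objects.

```python
import numpy as np
import pandas as pd

mixed = [1, 'a', True]  # hypothetical mixed-type data

# NumPy coerces everything to a single string type.
arr_mixed = np.array(mixed)
print(arr_mixed.dtype.kind)  # 'U' means a Unicode string dtype

# The Series falls back to the generic object dtype
# and leaves each element as it was.
ser_mixed = pd.Series(mixed)
print(ser_mixed.dtype)
print([type(x).__name__ for x in ser_mixed])
```

The object dtype is what lets the Series store the three built-in functions in the example above.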

Pandas DataFrames

NumPy allows developers to work with both one-dimensional NumPy arrays (sometimes called vectors) and two-dimensional NumPy arrays (sometimes called matrices). We explored pandas Series in the last section, which are similar to one-dimensional NumPy arrays.

In this section, we will dive into pandas DataFrames, which are similar to two-dimensional NumPy arrays - but with much more functionality. DataFrames are the most important data structure in the pandas library, so pay close attention throughout this section.
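As a first taste, here is a minimal sketch of a DataFrame built from a two-dimensional NumPy array. The row and column labels are invented for illustration.

```python
import numpy as np
import pandas as pd

# A 2 x 3 array holding the integers 0 through 5.
values = np.arange(6).reshape(2, 3)

# Wrap the array in a DataFrame with hypothetical row and column labels.
df = pd.DataFrame(values, index=['row1', 'row2'], columns=['a', 'b', 'c'])
print(df)

# Cells can be addressed by row label and column label.
cell = df.loc['row2', 'b']
print(cell)
```

Unlike the bare array, the DataFrame carries labels on both axes, so a cell can be addressed by name rather than by position.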
