Introducing Jupyter and Pandas

Image 1

Series Overview

This article is the first in a series that helps working developers get up to speed on data science tools and techniques. We’ll start with a brief introduction to the series, and explain everything we’re going to cover.

Developers and data scientists working on data analysis and machine learning (ML) projects spend the majority of their time finding, cleaning, and organizing datasets. In this introductory series, we will walk through some of the most common data cleaning scenarios, including:

  • Visualizing "messy" data
  • Reshaping datasets
  • Replacing missing values
  • Removing and fixing incomplete rows
  • Normalizing data types
  • Combining datasets

We'll do this by using Python, Pandas, and Seaborn in a Jupyter notebook to clean up a sample retail store's messy customer database. This seven-part series will take the initial round of messy data, clean it, and develop a set of visualizations that highlight our work. Here’s what the series will cover:

Before we start cleaning our dataset, let's take a quick look at two of the tools we’ll use: Pandas and Jupyter Notebooks.

What is Pandas?

Pandas is a flexible, high-performance, open-source Python library built specifically to provide data structures and analysis tools for data scientists.

As a developer, you’ll find that Pandas is like a programmatic, GUI-free Excel. When you import data into Pandas, you get a DataFrame object that represents your data as a series of columns and rows, much like you’d see in an Excel worksheet.

This makes it very easy to analyze and clean up datasets. Operations such as removing rows that don’t meet certain criteria, automatically removing columns that have too many missing values, or adding new columns calculated from existing columns can usually be done with a single function call.
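
To make those three operations concrete, here is a minimal sketch. The customer data, the column names, and the "at least half the values present" threshold are all invented for illustration and are not part of the series' dataset.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical customer data, invented for this sketch.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara", "Dev"],
    "age": [34, 17, 52, 41],
    "total_spent": [120.0, 35.5, 410.0, 99.9],
    "orders": [4, 1, 10, 3],
    "fax": [np.nan, np.nan, np.nan, "555-0100"],
})

# Remove rows that don't meet a criterion: keep adult customers only.
adults = df[df["age"] >= 18]

# Remove columns with too many missing values: a column survives only
# if at least half of its values are present.
min_present = math.ceil(len(adults) / 2)
cleaned = adults.dropna(axis="columns", thresh=min_present)

# Add a new column calculated from existing columns.
cleaned = cleaned.assign(avg_order=cleaned["total_spent"] / cleaned["orders"])

print(cleaned)
```

Each step is one call (boolean indexing, `dropna`, `assign`), and each returns a new DataFrame rather than modifying `df` in place.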

Working with tables of data this way, by cleaning and transforming the data using clean, easy-to-understand Python, is usually much quicker and more portable for developers than pointing and clicking your way through complex, built-in Excel functions or writing custom VBA code.

What is Jupyter?

Jupyter is a web application that acts like a container for data science projects. It allows you to put data, code, visualizations, documentation, and more into a single notebook.

I’ll be honest: if you’re an experienced software developer who is accustomed to an IDE like Eclipse or a text editor like Visual Studio Code, Jupyter is going to seem weird.

Jupyter is, essentially, a modern reincarnation of Donald Knuth’s Literate Programming. Literate programming aims to break down the barriers between code and natural language. In a typical literate programming file, programming code is interspersed with prose in a natural English-like language that describes what the code does.

This approach might sound repugnant to modern developers. After all, shouldn’t your code be so clear and self-explanatory that it doesn’t need comments?

That may be true for ordinary, mechanical code where it’s clear what is going on. But the situation is different when you’re writing code for data science and machine learning projects. In these scenarios, you’re often writing code that’s going to be consumed by a wider audience including data scientists, business analysts, and even managers.

In those cases, your code alone isn’t enough. Even if the reader understands the code, you must add prose to give context to your code — to help readers understand why you wrote the code you did, and understand how your code is transforming the data it imports.

Jupyter Notebooks take literate programming a step further. Not only is it easy to write a document that alternates between prose and code, the code is also live and executable. You can run the code and observe its output from inside the document. Even better, colleagues who have a copy of your notebook can edit your code, re-run it, and observe the new output — all without leaving the notebook. A notebook that contains code, prose, and output looks something like this:

Image 2

Don’t just take my word for it, though. You can find a live, interactive, editable online Jupyter Notebook to experiment with here.

As if this weren’t enough, Jupyter makes it easy to embed charts and other visualizations in the document. For example, if you import data from a database, transform it, and then want to share the results, you can feed your data into any of several visualization libraries, and the charts they generate will appear right in the notebook.
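
As a rough illustration of an embedded chart, the sketch below builds a small bar chart with Matplotlib. The monthly revenue figures are made up, and a hard-coded DataFrame stands in for data you would normally pull from a database. In a notebook with the usual inline display, the figure should appear directly below the cell.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly revenue figures, invented for this sketch.
sales = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr"],
    "revenue": [1200, 1500, 900, 1700],
})

# Draw one bar per month on a single set of axes.
fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(sales["month"], sales["revenue"])
ax.set_xlabel("Month")
ax.set_ylabel("Revenue")
ax.set_title("Revenue by month")
```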

You can even embed Markdown, videos, and mathematical equations written in LaTeX or MathML.
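
For instance, a Markdown cell could hold an equation like the one below. The formula (a sample mean) is an arbitrary example chosen for illustration, and the `$$ ... $$` delimiters are the usual way to mark a display equation in a Markdown cell.

```latex
$$
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
$$
```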

Last, but not least: Jupyter is language agnostic. Although Python is the most common use case, you can embed and run many programming languages in your notebooks. These include Julia, R, and even Java, C#, and F#. If you’re a .NET developer, Scott Hanselman has written a great introduction to Jupyter Notebooks for you.

Installing Jupyter and Pandas

Now that you’re (hopefully) excited about Jupyter and Pandas, I’m going to show you the easiest way to get started.

The best way to get Jupyter, Pandas, and other libraries we'll need for future data analysis tasks is to install Anaconda. It’s a Python distribution for data science that comes preloaded with the most popular libraries. This is one of the easiest ways to get up and running with data science using Python.

I know Anaconda is a big install at over 400MB. While you absolutely can install Python, Jupyter, and Pandas separately, I’m asking you to trust that this is the easiest way to install everything with minimal pain. In addition, if you decide to continue your data science journey after working through this series, you’ll find that Anaconda has already set up most of the tools you’ll need.

You can find an Anaconda package for your operating system on the Anaconda download page. Follow the download and install instructions. Once the install is complete, you’ll find that the installer has set up an application called Anaconda Navigator. On Windows, it’ll be in the Start menu. On macOS, it’ll be in your Applications folder. On Linux, you can run it by opening a terminal and running the following command: anaconda-navigator.

Next, we’ll start up our own Jupyter Notebook.

Using Anaconda Navigator, open Jupyter Notebook and create a new notebook. A notebook can be thought of as a simplified version of what other IDEs call a project — as we discovered above, it’s a collection of code, prose, and multimedia.

Image 3

When you first open a Jupyter Notebook, you’ll see a single line consisting of In [ ]. This is a code cell. Each cell can contain either code run by the notebook’s kernel or information to be displayed.

Each notebook has an associated kernel, which is the runtime that executes the code in the notebook. The default kernel is Python, and in most cases that’s the language you’ll use, but a number of other languages are available as well.
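
If you are curious which Python runtime sits behind the kernel, a small check like the following can be run in any code cell. It uses only the standard library, and the variable name `kernel_python` is just an illustrative choice.

```python
import platform
import sys

# Path of the interpreter that the current kernel is running on.
print(sys.executable)

# Version of the Python runtime behind the kernel.
kernel_python = platform.python_version()
print(kernel_python)
```

With an Anaconda install, the interpreter path would typically point somewhere inside the Anaconda directory, although that depends on how the environment was set up.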

Image 4

You can also change a code cell into a documentation cell by switching the cell type dropdown from Code to Markdown.

Setting Up and Importing Libraries

It’s important to note that items set and manipulated in one cell are available in the cells you run after it, because every cell in a notebook shares the same kernel session. This allows you to break up the code in your notebooks and keep projects more organized. Because of this behavior, it’s common to use the first cell in a notebook to hold all the generic setup and library imports that will be used across the notebook.
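
The sketch below imitates two notebook cells in a single listing, with comments marking where each cell would begin. The names `tax_rate` and `add_tax` are hypothetical and exist only to show state carrying over from one cell to the next.

```python
# Cell 1: define a value and a helper function.
tax_rate = 0.08


def add_tax(amount):
    """Return the amount with tax applied, rounded to cents."""
    return round(amount * (1 + tax_rate), 2)


# Cell 2: run after Cell 1, so both names are already defined.
price_with_tax = add_tax(50.0)
print(price_with_tax)
```

The order that matters is the order in which cells are executed, not their position on the page, so running Cell 2 in a fresh kernel before Cell 1 would raise a `NameError`.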

Since we’ll be loading and manipulating data with Python libraries that already ship with Anaconda, let’s import them in the first code cell of our notebook. You can add a code cell to your document by clicking the

Image 5 button.

Once you’ve added the code cell, put the following code in it:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")

If you hover your mouse over the square brackets beside the code cell, you’ll see a play button that lets you run the code in the cell. This won’t do much at the moment since we’re just importing some libraries, but don’t worry, we’ll have plenty of code to run soon.
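
If the import cell fails, or you simply want to confirm the environment first, a check along these lines can help. It is an optional sketch that uses only the standard library's `importlib`, and the `required` list and `status` dictionary are illustrative names.

```python
from importlib.util import find_spec

# The libraries imported in the setup cell.
required = ["numpy", "pandas", "matplotlib", "seaborn"]

# find_spec returns None when a package cannot be found.
status = {name: find_spec(name) is not None for name in required}

for name, available in status.items():
    print(f"{name}: {'installed' if available else 'MISSING'}")
```

Because it only looks the packages up without importing them, this cell runs cleanly even when something is missing.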

As you can see, we’re importing four libraries:

  • Pandas, the data analysis library
  • NumPy, a dependency of Pandas (we won't be using this directly)
  • Matplotlib, a data visualization library
  • Seaborn, which adds a number of visual improvements to Matplotlib

Additionally, the last line sets a default style for Seaborn. Let’s save our notebook by going to File, then Save As, and entering a file path. It’s worth noting that Jupyter treats your profile directory as the root of its file system. On Windows, this is C:\Users\<username>.
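
To see where relative paths will resolve from inside a notebook, a quick check like this can help. The file name `customers.csv` is a hypothetical placeholder, and the directories printed depend on how and where Jupyter was launched.

```python
from pathlib import Path

# Directory the kernel is currently working in.
cwd = Path.cwd()
print("Working directory:", cwd)

# The current user's profile (home) directory.
home = Path.home()
print("Profile directory:", home)

# Relative paths resolve against the working directory.
# "customers.csv" is a hypothetical file name used for illustration.
data_path = Path("customers.csv").resolve()
print("A relative path resolves to:", data_path)
```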

Image 7

Review

We’ve learned about what Pandas and Jupyter are, and why we might want to use them.

Then, we learned how to set up our own data-science-ready development environment using Anaconda.

We finished up by taking a quick look at using Jupyter Notebooks to set up our Python-based data analysis project, and we imported a few Python libraries, including Pandas for data structures and Seaborn for data visualization.

Next up, we’ll load our external data source into a data structure provided by Pandas and start analyzing and manipulating our base data set.

Jupyter image source: https://www.dataquest.io/blog/jupyter-notebook-tutorial/

Translated from: https://www.codeproject.com/Articles/5269215/Introducing-Jupyter-and-Pandas
