Dask – A better way to work with large CSV files in Python

In a recent post titled Working with Large CSV files in Python, I shared an approach I use when I have very large CSV files (and other file types) that are too large to load into memory. While the approach I previously highlighted works well, it can be tedious to first load the data into sqlite (or any other database) and then access that database to analyze it. I just found a better approach using Dask.

While looking around the web to learn about some parallel processing capabilities, I ran across a Python module named Dask, which describes itself as:

…is a flexible parallel computing library for analytic computing.

When I saw that, I was intrigued. There’s a lot that can be done with a library like that, and I’ve got plans to introduce Dask into my various tool sets for data analytics.

While reading the docs, I ran across the ‘dataframe’ concept and immediately knew I’d found a new tool for working with large CSV files. With Dask’s dataframe concept, you can do out-of-core analysis (e.g., analyze the data in the CSV without loading the entire CSV file into memory). Beyond out-of-core manipulation, dask’s dataframe uses the pandas API, which makes things extremely easy for those of us who use and love pandas.

With Dask and its dataframe construct, you set up the dataframe much like you would in pandas, but rather than loading the data into memory, this approach keeps the dataframe as a sort of ‘pointer’ to the data file and doesn’t load anything until you specifically tell it to do so.
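To make that concrete, here is a minimal sketch of the idea (the file name and column below are just placeholders, not part of the example that follows later):

import dask.dataframe as dd

# Setting up the dataframe only records what to do; no rows are loaded yet.
df = dd.read_csv('some_large_file.csv')

# Filtering is lazy too -- it simply adds a step to the task graph.
subset = df[df.SomeColumn == 'some_value']

# Nothing is actually read and computed until you explicitly ask for it.
result = subset.compute()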

One note (that I always have to share): if you are planning on working with your data set over time, it’s probably best to get the data into a database of some type.

An example using Dask and the Dataframe

First, let’s get everything installed. The documentation claims that you just need to install dask, but I had to install ‘toolz’ and ‘cloudpickle’ to get dask’s dataframe to import.  To install dask and its requirements, open a terminal and type (you need pip for this):

pip install dask toolz cloudpickle

Now, let’s write some code to load CSV data and start analyzing it. For this example, I’m using the 311 Service Requests dataset from NYC’s Open Data portal. You can download the dataset here: 311 Service Requests – 7GB+ CSV

Set up your dataframe so you can analyze the 311_Service_Requests.csv file. This file is assumed to be stored in the directory that you are working in.

 
import dask.dataframe as dd
filename = '311_Service_Requests.csv'
df = dd.read_csv(filename, dtype='str')

Unlike pandas, the data isn’t read into memory…we’ve just set up the dataframe so it’s ready to run computations on the data in the CSV file using familiar pandas functions. Note: I used dtype='str' in the read_csv to get around some strange formatting issues in this particular file.
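If you’d rather not force every column to a string, dask’s read_csv also accepts the usual pandas dtype mapping, so you can pin down just the troublesome columns instead; the column names below are only an illustration, not a prescription for this dataset:

import dask.dataframe as dd

# Give explicit dtypes for specific columns rather than reading everything as str.
df = dd.read_csv('311_Service_Requests.csv',
                 dtype={'Incident Zip': 'str', 'Unique Key': 'str'})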

Let’s take a look at the first few rows of the file using pandas’ head() call. When you run this, dask reads in the first X rows (however many you ask for with head(X)) and displays them.

df.head(2)

Note: a small subset of the columns is shown below for simplicity

Unique Key    Created Date              Closed Date               Agency
25513481      05/09/2013 12:00:00 AM    05/14/2013 12:00:00 AM    HPD
25513482      05/09/2013 12:00:00 AM    05/13/2013 12:00:00 AM    HPD
25513483      05/09/2013 12:00:00 AM    05/22/2013 12:00:00 AM    HPD
25513484      05/09/2013 12:00:00 AM    05/12/2013 12:00:00 AM    HPD
25513485      05/09/2013 12:00:00 AM    05/11/2013 12:00:00 AM    HPD

We see that there are some spaces in the column names. Let’s remove those spaces to make things easier to work with.

df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})

The cool thing about dask is that you can do things like renaming columns without loading all the data into memory.
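Because the column names come from the file’s header, you can sanity-check the rename right away without triggering any data loading; something like this should return instantly:

# The rename is applied lazily, but the cleaned-up names are visible immediately.
print(list(df.columns)[:4])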

There’s a column in this data called ‘Descriptor’ that holds the problem types, and ‘RADIATOR’ is one of those problem types. Let’s take a look at how many service requests were due to a problem with a radiator. To do this, you can filter the dataframe using standard pandas filtering (see below) to create a new dataframe.

 
# create a new dataframe with only 'RADIATOR' service calls
radiator_df = df[df.Descriptor == 'RADIATOR']

Let’s see how many rows we have using the ‘count’ command:

radiator_df.Descriptor.count()

You’ll notice that when you run the above command, you don’t actually get the count returned. Instead, you get back a description of the deferred computation, something like “dd.Scalar<series-…, dtype=int64>”.

To actually run the computation, you have to call “compute” to get dask to work through the dataframe and produce the result.

radiator_df.compute()

When you run this command, you should get something like the following

[52077 rows x 52 columns]
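That output is the entire filtered result materialized as a pandas dataframe. If all you want is the number of matching rows, you can instead call compute() on the count expression itself, which returns a single number:

# Run the filter and the count in one pass, returning just the row count.
radiator_count = radiator_df.Descriptor.count().compute()
print(radiator_count)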

The above are just some samples of using dask’s dataframe construct. Remember, we built a new dataframe using pandas-style filters without loading the entire original data set into memory. They may not seem like much, but when working with a 7GB+ file, dask can save you a great deal of time and effort compared to the approach I previously mentioned.
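The same lazy pattern extends to other pandas-style operations as well; for instance, a quick (purely illustrative) tally of service requests per agency might look like this:

# Group and count lazily, then compute the final, small result.
requests_per_agency = df.groupby('Agency').size().compute()
print(requests_per_agency)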

Dask seems to have a ton of other great features that I’ll be diving into at some point in the near future, but for now, the dataframe construct has been an awesome find.

Translated from: Dask – A better way to work with large CSV files in Python – PyBloggers

 
