熊猫数据集_在熊猫数据框中使用日期时间

熊猫数据集

Datetime is a common data type in data science projects. Often, you’ll work with it and run into problems. I found Pandas is an amazing library that contains extensive capabilities and features for working with date and time.

日期时间是数据科学项目中的常见数据类型。 通常,您将使用它并遇到问题。 我发现Pandas是一个了不起的库,其中包含用于处理日期和时间的广泛功能。

In this article, we will cover the following common datetime problems and should help you get started with data analysis.

在本文中,我们将介绍以下常见的日期时间问题,并应帮助您开始进行数据分析。

  1. Convert strings to datetime

    将字符串转换为日期时间
  2. Assemble a datetime from multiple columns

    从多个列组装日期时间
  3. Get year, month and day

    获取年,月和日
  4. Get the week of year, the day of week, and leap year

    获取一年中的星期几,星期几和leap年
  5. Get the age from the date of birth

    从出生日期获取年龄
  6. Improve performance by setting date column as the index

    通过将日期列设置为索引来提高性能
  7. Select data with a specific year and perform aggregation

    选择特定年份的数据并进行汇总
  8. Select data with a specific month and a specific day of the month

    选择具有特定月份和特定月份的数据
  9. Select data between two dates

    选择两个日期之间的数据
  10. Handle missing values

    处理缺失值

Please check out my Github repo for the source code.

请查看我的Github存储库以获取源代码。

1.将字符串转换为日期时间 (1. Convert strings to datetime)

Pandas has a built-in function called to_datetime() that can be used to convert strings to datetime. Let’s take a look at some examples

Pandas具有一个称为to_datetime()的内置函数,可用于将字符串转换为datetime。 让我们看一些例子

带有默认参数 (With default arguments)

Pandas to_datetime() is able to parse any valid date string to datetime without any additional arguments. For example:

熊猫to_datetime()可以将任何有效的日期字符串解析为datetime,而无需任何其他参数。 例如:

df = pd.DataFrame({'date': ['3/10/2000', '3/11/2000', '3/12/2000'],
'value': [2, 3, 4]})df['date'] = pd.to_datetime(df['date'])df
Image for post

日优先格式 (Day first format)

By default, to_datetime() will parse string with month first (MM/DD, MM DD, or MM-DD) format, and this arrangement is relatively unique in the United State.

默认情况下, to_datetime()将以月份优先( MM / DDMM DDMM-DD )格式解析字符串,并且这种安排在美国是相对独特的。

In most of the rest of the world, the day is written first (DD/MM, DD MM, or DD-MM). If you would like Pandas to consider day first instead of month, you can set the argument dayfirst to True.

在世界上大多数其他地方,日期是第一个写入的日期( DD / MMDD MMDD-MM )。 如果希望Pandas考虑第一天而不是月份,则可以将参数dayfirst设置为True

df = pd.DataFrame({'date': ['3/10/2000', '3/11/2000', '3/12/2000'],
'value': [2, 3, 4]})df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df
Image for post

Alternatively, you pass a custom format to the argument format .

或者,您可以将自定义格式传递给参数format

客户格式 (Custome format)

By default, strings are parsed using the Pandas built-in parser from dateutil.parser.parse. Sometimes, your strings might be in a custom format, for example, YYYY-DD-MM HH:MM:SS. Pandas to_datetime() has an argument called format that allows you to pass a custom format:

默认情况下,使用dateutil.parser.parse的Pandas内置解析器来解析字符串。 有时,您的字符串可能采用自定义格式,例如YYYY-DD-MM HH:MM:SS 。 熊猫to_datetime()有一个名为format的参数,它允许您传递自定义格式:

df = pd.DataFrame({'date': ['2016-6-10 20:30:0', 
'2016-7-1 19:45:30',
'2013-10-12 4:5:1']
,
'value': [2, 3, 4]})df['date'] = pd.to_datetime(df['date'], format="%Y-%d-%m %H:%M:%S")
df
Image for post

使用infer_datetime_format加快解析速度 (Speed up parsing with infer_datetime_format)

Passing infer_datetime_format=True can often speed up a parsing if its not an ISO8601 format exactly but in a regular format. According to [1], in some cases, this can increase the parsing speed by 5–10x.

如果传递的infer_datetime_format=True并非完全是ISO8601格式,而是常规格式,则通常可以加快解析速度。 根据[1],在某些情况下,这可以使解析速度提高5-10倍。

# Make up 3000 rows
df = pd.DataFrame({'date': ['3/11/2000', '3/12/2000', '3/13/2000'] * 1000 })%timeit pd.to_datetime(df['date'], infer_datetime_format=True)
100 loops, best of 3: 10.4 ms per loop%timeit pd.to_datetime(df['date'], infer_datetime_format=False)
1 loop, best of 3: 471 ms per loop

处理解析错误 (Handle parsing error)

You will end up with a TypeError if the date string does not meet the timestamp format.

如果日期字符串不符合时间戳格式,则将导致TypeError结束。

df = pd.DataFrame({'date': ['3/10/2000', 'a/11/2000', '3/12/2000'],
'value': [2, 3, 4]})
df['date'] = pd.to_datetime(df['date'])
Image for post

to_datetime() has an argument called errors that allows you to ignore the error or force an invalid value to NaT.

to_datetime()有一个称为errors的参数,该参数使您可以忽略该错误或将无效值强制给NaT

df['date'] = pd.to_datetime(df['date'], errors='ignore')
df
Image for post
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df
Image for post

In addition, if you would like to parse date columns when reading data from a CSV file, please check out the following article

另外,如果您想在从CSV文件读取数据时解析日期列,请查看以下文章

2.从多个列中组合一个日期时间 (2. Assemble a datetime from multiple columns)

to_datetime() can be used to assemble a datetime from multiple columns as well. The keys (columns label) can be common abbreviations like [‘year’, ‘month’, ‘day’, ‘minute’, ‘second’, ‘ms’, ‘us’, ‘ns’]) or plurals of the same.

to_datetime()也可以用于从多个列中组合一个日期时间。 键(列标签)可以是常见的缩写,例如['year','month','day','minute','second','ms','us','ns'])或相同的复数形式。

df = pd.DataFrame({'year': [2015, 2016],
'month': [2, 3],
'day': [4, 5]}
)df['date'] = pd.to_datetime(df)
df
Image for post

3.获取年,月和日 (3. Get year, month, and day)

dt.year, dt.month and dt.day are the inbuilt attributes to get year, month , and day from Pandas datetime object.

dt.yeardt.monthdt.day是从Pandas datetime对象获取年,月和日的内置属性。

First, let’s create a dummy DateFrame and parse DoB to datetime.

首先,让我们创建一个虚拟的DateFrame并将DoB解析为datetime。

df = pd.DataFrame({'name': ['Tom', 'Andy', 'Lucas'],
'DoB': ['08-05-1997', '04-28-1996', '12-16-1995']})
df['DoB'] = pd.to_datetime(df['DoB'])

And to get year, month, and day

并获得年,月和日

df['year']= df['DoB'].dt.year
df['month']= df['DoB'].dt.month
df['day']= df['DoB'].dt.day
df
Image for post

4.获取一年中的星期几,星期几和leap年 (4. Get the week of year, the day of week and leap year)

Similarly, dt.week, dt.dayofweek, and dt.is_leap_year are the inbuilt attributes to get the week of year, the day of week, and leap year.

同样, dt.weekdt.dayofweekdt.is_leap_year是内置属性,用于获取年份的星期几,星期几和leap年。

df['week_of_year'] = df['DoB'].dt.week
df['day_of_week'] = df['DoB'].dt.dayofweek
df['is_leap_year'] = df['DoB'].dt.is_leap_year
df
Image for post

Note that Pandas dt.dayofweek attribute returns the day of the week and it is assumed the week starts on Monday, which is denoted by 0 and ends on Sunday which is denoted by 6. To replace the number with full name, we can create a mapping and pass it to map() :

请注意,Pandas dt.dayofweek属性返回星期几,并假设星期从星期一开始,由0表示,到星期天由6表示。要用全名替换数字,我们可以创建一个映射并将其传递给map()

dw_mapping={
0: 'Monday',
1: 'Tuesday',
2: 'Wednesday',
3: 'Thursday',
4: 'Friday',
5: 'Saturday',
6: 'Sunday'
}

df['day_of_week_name']=df['DoB'].dt.weekday.map(dw_mapping)
df
Image for post

5.从出生日期算起年龄 (5. Get the age from the date of birth)

The simplest solution to get age is by subtracting year:

获得年龄的最简单方法是减去年份:

today = pd.to_datetime('today')
df['age'] = today.year - df['DoB'].dt.yeardf
Image for post

However, this is not accurate as people might haven't had their birthday this year. A more accurate solution would be to consider the birthday

但是,这并不准确,因为人们今年可能还没有过生日。 一个更准确的解决方案是考虑生日

# Year difference
today = pd.to_datetime('today')
diff_y = today.year - df['DoB'].dt.year
# Haven't had birthday
b_md = df['DoB'].apply(lambda x: (x.month,x.day) )no_birthday = b_md > (today.month,today.day)df['age'] = diff_y - no_birthday
df
Image for post

6.通过将日期列设置为索引来提高性能 (6. Improve performance by setting date column as the index)

A common solution to select data by date is using a boolean maks. For example

按日期选择数据的常见解决方案是使用布尔麦克斯。 例如

condition = 
df.loc[condition]

This solution normally requires start_date, end_date and date column to be datetime format. And in fact, this solution is slow when you are doing a lot of selections by date in a large dataset.

此解决方案通常要求start_dateend_datedate end_date datetime格式。 实际上,当您在大型数据集中按日期进行大量选择时,此解决方案很慢。

If you are going to do a lot of selections by date, it would be faster to set date column as the index first so you take advantage of the Pandas built-in optimization. Then, you can select data by date using df.loc[start_date:end_date] . Let take a look at an example dataset city_sales.csv, which has 1,795,144 rows data

如果您打算按日期进行很多选择,那么将date列首先设置为索引会更快,因此您可以利用Pandas内置的优化功能。 然后,您可以使用df.loc[start_date:end_date]按日期选择数据。 让我们看一个示例数据集city_sales.csv,它具有1,795,144行数据

df = pd.read_csv('data/city_sales.csv',parse_dates=['date'])
df.info()RangeIndex: 1795144 entries, 0 to 1795143
Data columns (total 3 columns):
# Column Dtype
--- ------ -----
0 date datetime64[ns]
1 num int64
2 city object
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 41.1+ MB

To set the date column as the index

将日期列设置为索引

df = df.set_index(['date'])
df
Image for post

7.选择具有特定年份的数据并进行汇总 (7. Select data with a specific year and perform aggregation)

Let’s say we would like to select all data in the year 2018

假设我们要选择2018年的所有数据

df.loc['2018']
Image for post

And to perform aggregation on the selection for example:

并例如对选择执行聚合:

Get the total num in 2018

获取2018年的总数

df.loc['2018','num'].sum()1231190

Get the total num for each city in 2018

获取2018年每个城市的总数

df['2018'].groupby('city').sum()
Image for post

8.选择具有特定月份和特定月份的数据 (8. Select data with a specific month and a specific day of the month)

To select data with a specific month, for example, May 2018

选择特定月份的数据,例如,2018年5月

df.loc['2018-5']
Image for post

Similarly, to select data with a specific day of the month, for example, 1st May 2018

同样,要选择一个月中特定日期的数据,例如,2018年5月1日

df.loc['2018-5-1']
Image for post

9在两个日期之间选择数据 (9 Select data between two dates)

To select data between two dates, you can usedf.loc[start_date:end_date] For example:

要选择两个日期之间的数据,可以使用df.loc[start_date:end_date]例如:

Select data between 2016 and 2018

选择2016年至2018年之间的数据

df.loc['2016' : '2018']

Select data between 10 and 11 o'clock on the 2nd May 2018

在2018年5月2日10点至11点之间选择数据

df.loc['2018-5-2 10' : '2018-5-2 11' ]

Select data between 10:30 and 10:45 on the 2nd May 2018

在2018年5月2日10:30到10:45之间选择数据

df.loc['2018-5-2 10:30' : '2018-5-2 10:45' ]

And to select data between time, we should use between_time(), for example, 10:30 and 10:45

要选择时间之间的数据,我们应该使用between_time() ,例如10:30和10:45

df.between_time('10:30','10:45')

10处理缺失值 (10 Handle missing values)

We often need to compute window statistics such as a rolling mean or a rolling sum.

我们经常需要计算窗口统计信息,例如滚动平均值滚动总和。

Let’s compute the rolling sum over a 3 window period and then have a look at the top 5 rows.

让我们计算3个窗口周期内的滚动总和,然后看一下前5行。

df['rolling_sum'] = df.rolling(3).sum()
df.head()
Image for post

We can see that it only starts having valid values when there are 3 periods over which to look back. One solution to handle this is by backfilling of data.

我们可以看到,只有在有3个要回顾的时间段时,它才开始具有有效值。 解决此问题的一种方法是回填数据

df['rolling_sum_backfilled'] = df['rolling_sum'].fillna(method='backfill')
df.head()
Image for post

For more details about backfilling, please check out the following article

有关回填的更多详细信息,请查看以下文章

而已 (That’s it)

Thanks for reading.

谢谢阅读。

Please checkout the notebook on my Github for the source code.

在我的Github上查看笔记本中的源代码。

Stay tuned if you are interested in the practical aspect of machine learning.

如果您对机器学习的实用方面感兴趣,请继续关注。

Here are some picked articles for you:

以下是为您挑选的一些文章:

翻译自: https://towardsdatascience.com/working-with-datetime-in-pandas-dataframe-663f7af6c587

熊猫数据集

  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值