熊猫数据集
Datetime is a common data type in data science projects. Often, you’ll work with it and run into problems. I found Pandas is an amazing library that contains extensive capabilities and features for working with date and time.
日期时间是数据科学项目中的常见数据类型。 通常,您将使用它并遇到问题。 我发现Pandas是一个了不起的库,其中包含用于处理日期和时间的广泛功能。
In this article, we will cover the following common datetime problems and should help you get started with data analysis.
在本文中,我们将介绍以下常见的日期时间问题,并应帮助您开始进行数据分析。
- Convert strings to datetime 将字符串转换为日期时间
- Assemble a datetime from multiple columns 从多个列组装日期时间
- Get year, month and day 获取年,月和日
- Get the week of year, the day of week, and leap year 获取一年中的星期几,星期几和leap年
- Get the age from the date of birth 从出生日期获取年龄
- Improve performance by setting date column as the index 通过将日期列设置为索引来提高性能
- Select data with a specific year and perform aggregation 选择特定年份的数据并进行汇总
- Select data with a specific month and a specific day of the month 选择具有特定月份和特定月份的数据
- Select data between two dates 选择两个日期之间的数据
- Handle missing values 处理缺失值
Please check out my Github repo for the source code.
请查看我的Github存储库以获取源代码。
1.将字符串转换为日期时间 (1. Convert strings to datetime)
Pandas has a built-in function called to_datetime()
that can be used to convert strings to datetime. Let’s take a look at some examples
Pandas具有一个称为to_datetime()
的内置函数,可用于将字符串转换为datetime。 让我们看一些例子
带有默认参数 (With default arguments)
Pandas to_datetime()
is able to parse any valid date string to datetime without any additional arguments. For example:
熊猫to_datetime()
可以将任何有效的日期字符串解析为datetime,而无需任何其他参数。 例如:
df = pd.DataFrame({'date': ['3/10/2000', '3/11/2000', '3/12/2000'],
'value': [2, 3, 4]})df['date'] = pd.to_datetime(df['date'])df
日优先格式 (Day first format)
By default, to_datetime()
will parse string with month first (MM/DD, MM DD, or MM-DD) format, and this arrangement is relatively unique in the United State.
默认情况下, to_datetime()
将以月份优先( MM / DD , MM DD或MM-DD )格式解析字符串,并且这种安排在美国是相对独特的。
In most of the rest of the world, the day is written first (DD/MM, DD MM, or DD-MM). If you would like Pandas to consider day first instead of month, you can set the argument dayfirst
to True
.
在世界上大多数其他地方,日期是第一个写入的日期( DD / MM , DD MM或DD-MM )。 如果希望Pandas考虑第一天而不是月份,则可以将参数dayfirst
设置为True
。
df = pd.DataFrame({'date': ['3/10/2000', '3/11/2000', '3/12/2000'],
'value': [2, 3, 4]})df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df
Alternatively, you pass a custom format to the argument format
.
或者,您可以将自定义格式传递给参数format
。
客户格式 (Custome format)
By default, strings are parsed using the Pandas built-in parser from dateutil.parser.parse
. Sometimes, your strings might be in a custom format, for example, YYYY-DD-MM HH:MM:SS. Pandas to_datetime()
has an argument called format
that allows you to pass a custom format:
默认情况下,使用dateutil.parser.parse
的Pandas内置解析器来解析字符串。 有时,您的字符串可能采用自定义格式,例如YYYY-DD-MM HH:MM:SS 。 熊猫to_datetime()
有一个名为format
的参数,它允许您传递自定义格式:
df = pd.DataFrame({'date': ['2016-6-10 20:30:0',
'2016-7-1 19:45:30',
'2013-10-12 4:5:1'],
'value': [2, 3, 4]})df['date'] = pd.to_datetime(df['date'], format="%Y-%d-%m %H:%M:%S")
df
使用infer_datetime_format
加快解析速度 (Speed up parsing with infer_datetime_format
)
Passing infer_datetime_format=True
can often speed up a parsing if its not an ISO8601 format exactly but in a regular format. According to [1], in some cases, this can increase the parsing speed by 5–10x.
如果传递的infer_datetime_format=True
并非完全是ISO8601格式,而是常规格式,则通常可以加快解析速度。 根据[1],在某些情况下,这可以使解析速度提高5-10倍。
# Make up 3000 rows
df = pd.DataFrame({'date': ['3/11/2000', '3/12/2000', '3/13/2000'] * 1000 })%timeit pd.to_datetime(df['date'], infer_datetime_format=True)
100 loops, best of 3: 10.4 ms per loop%timeit pd.to_datetime(df['date'], infer_datetime_format=False)
1 loop, best of 3: 471 ms per loop
处理解析错误 (Handle parsing error)
You will end up with a TypeError if the date string does not meet the timestamp format.
如果日期字符串不符合时间戳格式,则将导致TypeError结束。
df = pd.DataFrame({'date': ['3/10/2000', 'a/11/2000', '3/12/2000'],
'value': [2, 3, 4]})
df['date'] = pd.to_datetime(df['date'])
to_datetime()
has an argument called errors
that allows you to ignore the error or force an invalid value to NaT
.
to_datetime()
有一个称为errors
的参数,该参数使您可以忽略该错误或将无效值强制给NaT
。
df['date'] = pd.to_datetime(df['date'], errors='ignore')
df
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df
In addition, if you would like to parse date columns when reading data from a CSV file, please check out the following article
另外,如果您想在从CSV文件读取数据时解析日期列,请查看以下文章
2.从多个列中组合一个日期时间 (2. Assemble a datetime from multiple columns)
to_datetime()
can be used to assemble a datetime from multiple columns as well. The keys (columns label) can be common abbreviations like [‘year’, ‘month’, ‘day’, ‘minute’, ‘second’, ‘ms’, ‘us’, ‘ns’]) or plurals of the same.
to_datetime()
也可以用于从多个列中组合一个日期时间。 键(列标签)可以是常见的缩写,例如['year','month','day','minute','second','ms','us','ns'])或相同的复数形式。
df = pd.DataFrame({'year': [2015, 2016],
'month': [2, 3],
'day': [4, 5]})df['date'] = pd.to_datetime(df)
df
3.获取年,月和日 (3. Get year, month, and day)
dt.year
, dt.month
and dt.day
are the inbuilt attributes to get year, month , and day from Pandas datetime object.
dt.year
, dt.month
和dt.day
是从Pandas datetime对象获取年,月和日的内置属性。
First, let’s create a dummy DateFrame and parse DoB to datetime.
首先,让我们创建一个虚拟的DateFrame并将DoB解析为datetime。
df = pd.DataFrame({'name': ['Tom', 'Andy', 'Lucas'],
'DoB': ['08-05-1997', '04-28-1996', '12-16-1995']})
df['DoB'] = pd.to_datetime(df['DoB'])
And to get year, month, and day
并获得年,月和日
df['year']= df['DoB'].dt.year
df['month']= df['DoB'].dt.month
df['day']= df['DoB'].dt.day
df
4.获取一年中的星期几,星期几和leap年 (4. Get the week of year, the day of week and leap year)
Similarly, dt.week
, dt.dayofweek
, and dt.is_leap_year
are the inbuilt attributes to get the week of year, the day of week, and leap year.
同样, dt.week
, dt.dayofweek
和dt.is_leap_year
是内置属性,用于获取年份的星期几,星期几和leap年。
df['week_of_year'] = df['DoB'].dt.week
df['day_of_week'] = df['DoB'].dt.dayofweek
df['is_leap_year'] = df['DoB'].dt.is_leap_year
df
Note that Pandas dt.dayofweek
attribute returns the day of the week and it is assumed the week starts on Monday, which is denoted by 0 and ends on Sunday which is denoted by 6. To replace the number with full name, we can create a mapping and pass it to map()
:
请注意,Pandas dt.dayofweek
属性返回星期几,并假设星期从星期一开始,由0表示,到星期天由6表示。要用全名替换数字,我们可以创建一个映射并将其传递给map()
:
dw_mapping={
0: 'Monday',
1: 'Tuesday',
2: 'Wednesday',
3: 'Thursday',
4: 'Friday',
5: 'Saturday',
6: 'Sunday'
}
df['day_of_week_name']=df['DoB'].dt.weekday.map(dw_mapping)
df
5.从出生日期算起年龄 (5. Get the age from the date of birth)
The simplest solution to get age is by subtracting year:
获得年龄的最简单方法是减去年份:
today = pd.to_datetime('today')
df['age'] = today.year - df['DoB'].dt.yeardf
However, this is not accurate as people might haven't had their birthday this year. A more accurate solution would be to consider the birthday
但是,这并不准确,因为人们今年可能还没有过生日。 一个更准确的解决方案是考虑生日
# Year difference
today = pd.to_datetime('today')
diff_y = today.year - df['DoB'].dt.year
# Haven't had birthday
b_md = df['DoB'].apply(lambda x: (x.month,x.day) )no_birthday = b_md > (today.month,today.day)df['age'] = diff_y - no_birthday
df
6.通过将日期列设置为索引来提高性能 (6. Improve performance by setting date column as the index)
A common solution to select data by date is using a boolean maks. For example
按日期选择数据的常见解决方案是使用布尔麦克斯。 例如
condition =
df.loc[condition]
This solution normally requires start_date
, end_date
and date column to be datetime format. And in fact, this solution is slow when you are doing a lot of selections by date in a large dataset.
此解决方案通常要求start_date
, end_date
和date end_date
datetime格式。 实际上,当您在大型数据集中按日期进行大量选择时,此解决方案很慢。
If you are going to do a lot of selections by date, it would be faster to set date column as the index first so you take advantage of the Pandas built-in optimization. Then, you can select data by date using df.loc[start_date:end_date]
. Let take a look at an example dataset city_sales.csv, which has 1,795,144 rows data
如果您打算按日期进行很多选择,那么将date列首先设置为索引会更快,因此您可以利用Pandas内置的优化功能。 然后,您可以使用df.loc[start_date:end_date]
按日期选择数据。 让我们看一个示例数据集city_sales.csv,它具有1,795,144行数据
df = pd.read_csv('data/city_sales.csv',parse_dates=['date'])
df.info()RangeIndex: 1795144 entries, 0 to 1795143
Data columns (total 3 columns):
# Column Dtype
--- ------ -----
0 date datetime64[ns]
1 num int64
2 city object
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 41.1+ MB
To set the date column as the index
将日期列设置为索引
df = df.set_index(['date'])
df
7.选择具有特定年份的数据并进行汇总 (7. Select data with a specific year and perform aggregation)
Let’s say we would like to select all data in the year 2018
假设我们要选择2018年的所有数据
df.loc['2018']
And to perform aggregation on the selection for example:
并例如对选择执行聚合:
Get the total num in 2018
获取2018年的总数
df.loc['2018','num'].sum()1231190
Get the total num for each city in 2018
获取2018年每个城市的总数
df['2018'].groupby('city').sum()
8.选择具有特定月份和特定月份的数据 (8. Select data with a specific month and a specific day of the month)
To select data with a specific month, for example, May 2018
选择特定月份的数据,例如,2018年5月
df.loc['2018-5']
Similarly, to select data with a specific day of the month, for example, 1st May 2018
同样,要选择一个月中特定日期的数据,例如,2018年5月1日
df.loc['2018-5-1']
9在两个日期之间选择数据 (9 Select data between two dates)
To select data between two dates, you can usedf.loc[start_date:end_date]
For example:
要选择两个日期之间的数据,可以使用df.loc[start_date:end_date]
例如:
Select data between 2016 and 2018
选择2016年至2018年之间的数据
df.loc['2016' : '2018']
Select data between 10 and 11 o'clock on the 2nd May 2018
在2018年5月2日10点至11点之间选择数据
df.loc['2018-5-2 10' : '2018-5-2 11' ]
Select data between 10:30 and 10:45 on the 2nd May 2018
在2018年5月2日10:30到10:45之间选择数据
df.loc['2018-5-2 10:30' : '2018-5-2 10:45' ]
And to select data between time, we should use between_time()
, for example, 10:30 and 10:45
要选择时间之间的数据,我们应该使用between_time()
,例如10:30和10:45
df.between_time('10:30','10:45')
10处理缺失值 (10 Handle missing values)
We often need to compute window statistics such as a rolling mean or a rolling sum.
我们经常需要计算窗口统计信息,例如滚动平均值或滚动总和。
Let’s compute the rolling sum over a 3 window period and then have a look at the top 5 rows.
让我们计算3个窗口周期内的滚动总和,然后看一下前5行。
df['rolling_sum'] = df.rolling(3).sum()
df.head()
We can see that it only starts having valid values when there are 3 periods over which to look back. One solution to handle this is by backfilling of data.
我们可以看到,只有在有3个要回顾的时间段时,它才开始具有有效值。 解决此问题的一种方法是回填数据 。
df['rolling_sum_backfilled'] = df['rolling_sum'].fillna(method='backfill')
df.head()
For more details about backfilling, please check out the following article
有关回填的更多详细信息,请查看以下文章
而已 (That’s it)
Thanks for reading.
谢谢阅读。
Please checkout the notebook on my Github for the source code.
请在我的Github上查看笔记本中的源代码。
Stay tuned if you are interested in the practical aspect of machine learning.
如果您对机器学习的实用方面感兴趣,请继续关注。
Here are some picked articles for you:
以下是为您挑选的一些文章:
翻译自: https://towardsdatascience.com/working-with-datetime-in-pandas-dataframe-663f7af6c587
熊猫数据集