【cookbook-pandas】学习笔记 Time Series Analysis

最新推荐文章于 2023-11-27 14:07:04 发布

萝卜丝皮尔

最新推荐文章于 2023-11-27 14:07:04 发布

阅读量441

点赞数

分类专栏：数据分析Python基础篇文章标签： pandas 时间序列

本文链接：https://blog.csdn.net/qq_43448491/article/details/114804575

版权

数据分析Python基础篇专栏收录该内容

23 篇文章 0 订阅

订阅专栏

由于机器故障，我辛辛苦苦抄半下午的都木有了……然而，笔记还是要做的。

chapter7 Time Series Analysis

understanding the difference between Python and pandas date tools

关于参数error:

plate_time=pd.read_csv('eight_attri.csv',
                  usecols=['plateNumber','passCarTime'],
                 encoding='utf-8_sig',
                 #dtype={"jncCode":"category","deviceCode":"category"},
                 iterator=True
                #,delimiter="\t"
                  )
df=plate_time.get_chunk(2000)
df.dtypes
Out[7]: 
plateNumber    object
passCarTime    object
dtype: object
df.passCarTime=pd.to_datetime(df.passCarTime)
df.dtypes
Out[9]: 
plateNumber            object
passCarTime    datetime64[ns]
dtype: object
# set the 'passCarTime' column as the index to make intelligent Timestamp slicing possible
df=df.set_index('passCarTime')
df
Out[12]: 
                    plateNumber
passCarTime                    
2020-12-20 00:00:00     鄂A2J8C0
2020-12-20 00:00:00     鄂KX0175
2020-12-20 00:00:00     鄂A3K89F
2020-12-20 00:00:00     鄂H1B196
2020-12-20 00:00:00     鄂H1B196
                         ...
2020-12-20 00:05:12     鄂AV2G25
2020-12-20 00:05:12     鄂AV2G25
2020-12-20 00:05:12     鄂AV2G25
2020-12-20 00:05:13     鄂KX0621
2020-12-20 00:05:13     鄂A39B0Y
[2000 rows x 1 columns]
# select all the rows equals to a single inedx by passing that value to the .loc attribute
crime.loc['2020-12-20 00:01:00']
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-14-0d1fff716a6f>", line 1, in <module>
    crime.loc['2020-12-20 00:01:00']
NameError: name 'crime' is not defined
df.loc['2020-12-20 00:01:00']
Out[15]: 
                    plateNumber
passCarTime                    
2020-12-20 00:01:00     鄂A289BS
2020-12-20 00:01:00     鄂A289BS
2020-12-20 00:01:00     鄂KX0579
2020-12-20 00:01:00     鄂A754S2
# select all the rows that partially match an index value
# e.g. we want all the record from Dec 20,2020
df.loc['2020-12-20']
Out[18]: 
                    plateNumber
passCarTime                    
2020-12-20 00:00:00     鄂A2J8C0
2020-12-20 00:00:00     鄂KX0175
2020-12-20 00:00:00     鄂A3K89F
2020-12-20 00:00:00     鄂H1B196
2020-12-20 00:00:00     鄂H1B196
                         ...
2020-12-20 00:05:12     鄂AV2G25
2020-12-20 00:05:12     鄂AV2G25
2020-12-20 00:05:12     鄂AV2G25
2020-12-20 00:05:13     鄂KX0621
2020-12-20 00:05:13     鄂A39B0Y
[2000 rows x 1 columns]
# you also can do so for an entire month
df.loc['2020-12'].shape
Out[20]: (2000, 1)
# the selection strings may also contain the name of the month 
df.loc['Dec 2020'].sort_index()
Out[22]: 
                    plateNumber
passCarTime                    
2020-12-20 00:00:00     鄂A2J8C0
2020-12-20 00:00:00     鄂KX0175
2020-12-20 00:00:00     鄂A3K89F
2020-12-20 00:00:00     鄂H1B196
2020-12-20 00:00:00     鄂H1B196
                         ...
2020-12-20 00:05:12     鄂AV2G25
2020-12-20 00:05:12     鄂AV2G25
2020-12-20 00:05:12     鄂AV2G25
2020-12-20 00:05:13     鄂KX0621
2020-12-20 00:05:13     鄂A39B0Y
[2000 rows x 1 columns]
# many other string patterns with month name included also work
df.loc['2020 Dec, 20'].shape
Out[24]: (2000, 1)
# use the slice notation to select a precise range of datetime
df.loc['2020-12-20 00:00':'2020-12-20 00:01']
Out[26]: 
                    plateNumber
passCarTime                    
2020-12-20 00:00:00     鄂A2J8C0
2020-12-20 00:00:00     鄂KX0175
2020-12-20 00:00:00     鄂A3K89F
2020-12-20 00:00:00     鄂H1B196
2020-12-20 00:00:00     鄂H1B196
                         ...
2020-12-20 00:01:58     鄂KDR688
2020-12-20 00:01:58     鄂KB0858
2020-12-20 00:01:58     鄂K6J705
2020-12-20 00:01:58     鄂A66V8L
2020-12-20 00:01:59     鄂K8N198
[721 rows x 1 columns]
df.loc['2020-12-20 00:00':'2020-12-20 00:01'].shape
Out[27]: (721, 1)
# sorting the index will lead to large gains in performance of slicing, crime_sort=crime.sort_index()
# to select and slice rows by date using the indexing operator, the index must contain date valus
# not necessary to use the .loc attribute for any of the selections or slices, meaning that the index operator by itself will work in the same manner
<string>:1: FutureWarning: Indexing a DataFrame with a datetimelike index using a single string to slice the rows, like `frame[string]`, is deprecated and will be removed in a future version. Use `frame.loc[string]` instead.
df['2020-12-20'].shape
<ipython-input-31-1bfaf89fff3a>:1: FutureWarning: Indexing a DataFrame with a datetimelike index using a single string to slice the rows, like `frame[string]`, is deprecated and will be removed in a future version. Use `frame.loc[string]` instead.
  df['2020-12-20'].shape
Out[31]: (2000, 1)

Counting the number of weekly crimes

2021/3/16 [456(477/627)]
the .resample method provides an easy interface to grouping by any possible span of time
use both the .resample and .groupby methods to count the number of weekly crimes

# the .resample method, by default, works implicitly with a DatatimeIndex, here is 'PEPORTED_DATE'
# the first parameter to .resample is the rule determining how the Timestamps in the index will be grouped, e.g. the offset alias W to form groups one week in length ending on Sunday
# the default ending day is Sunday, but may be changed with an anchored offset by appending a dash and the first three letters of a day of the week
# form groups with .resample method, and chain a method to take action on each of them
#e.g. .size method to count the number of crimes per week

在这里插入图片描述

Aggregating weekly crime and traffic accidents separately

在这里插入图片描述

[472(493/627)]

# with the .groupby method using pd.Grouper to set our group length 
# All DateFrame columns containing Timestamps have access to numerous other attributes and methods with the .dt attribute,
# all of these methods and attributes available  from  the .dt attribute are also available on a Timestamp object
# .dt attribute only works on a Series
# rearrange the order of the index with the .reindex method when a list contains the desired order
# to ease the readability, we pivot the weekdays level into horizontal column names with .unstack

align v. 排整齐；校准；使排成一列
标红的地方，没大看懂
在这里插入图片描述

在这里插入图片描述
计算某种犯罪：

Grouping with anonymous functinons with a DatetimeIndex[477/(498/627)]

# show the versatility of using the .groupby method for DataFrames that have a DatetimeIndex

在这里插入图片描述

# in step3, we use these extra features of the DatetimeIndex to extract the day name.
# in step4, we take advantage of the .groupby method to accept a functinon that is passed the DatetimeIndex
# the 'idx' in the anonymous functinon is the DatetimeIndex, and use it to retrieve the day name 
# possible to pass .groupby a list of any number of custom functions 
# the .round this DatetimeIndex method to round each value to the nearest second hour 
# after the grouping and aggregationg, we .unstack the years as columns

在这里插入图片描述

Grouping by a Timestamp and another column [478(499/627)]

import pandas as pd
# I guess...'parse_dates' is set to make 'JOB_DATE' and 'HIRE_DATE' into 'Datetime' type,
employee=pd.read_csv('employee.csv',parse_dates=['JOB_DATE','HIRE_DATE'],index_col='HIRE_DATE')
employee.dtypes
Out[10]: 
UNIQUE_ID                     int64
POSITION_TITLE               object
DEPARTMENT                   object
BASE_SALARY                 float64
RACE                         object
EMPLOYMENT_TYPE              object
GENDER                       object
EMPLOYMENT_STATUS            object
JOB_DATE             datetime64[ns]
dtype: object
employee.index
Out[11]: 
DatetimeIndex(['2006-06-12', '2000-07-19', '2015-02-03', '1982-02-08',
               '1989-06-19', '1984-11-26', '2012-03-26', '2013-11-04',
               '1993-11-15', '2016-03-14',
               ...
               '1995-10-14', '2005-09-12', '2011-12-12', '1982-05-03',
               '1983-02-07', '2014-06-09', '2003-09-02', '2014-10-13',
               '2009-01-20', '2009-01-12'],
              dtype='datetime64[ns]', name='HIRE_DATE', length=2000, freq=None)
employee
Out[12]: 
            UNIQUE_ID               POSITION_TITLE  ... EMPLOYMENT_STATUS   JOB_DATE
HIRE_DATE                                           ...                             
2006-06-12          0  ASSISTANT DIRECTOR (EX LVL)  ...            Active 2012-10-13
2000-07-19          1            LIBRARY ASSISTANT  ...            Active 2010-09-18
2015-02-03          2               POLICE OFFICER  ...            Active 2015-02-03
1982-02-08          3            ENGINEER/OPERATOR  ...            Active 1991-05-25
1989-06-19          4                  ELECTRICIAN  ...            Active 1994-10-22
               ...                          ...  ...               ...        ...
2014-06-09       1995               POLICE OFFICER  ...            Active 2015-06-09
2003-09-02       1996       COMMUNICATIONS CAPTAIN  ...            Active 2013-10-06
2014-10-13       1997               POLICE OFFICER  ...            Active 2015-10-13
2009-01-20       1998               POLICE OFFICER  ...            Active 2011-07-02
2009-01-12       1999                 FIRE FIGHTER  ...            Active 2010-07-12
[2000 rows x 9 columns]
employee.BASE_SALARY
Out[13]: 
HIRE_DATE
2006-06-12    121862.0
2000-07-19     26125.0
2015-02-03     45279.0
1982-02-08     63166.0
1989-06-19     56347.0
                ...   
2014-06-09     43443.0
2003-09-02     66523.0
2014-10-13     43443.0
2009-01-20     55461.0
2009-01-12     51194.0
Name: BASE_SALARY, Length: 2000, dtype: float64
employee.BASE_SALARY.mean()
Out[14]: 55767.93160127253
employee.BASE_SALARY.mean().round(-2)
Out[15]: 55800.0
# find the average salary for each gender, just grouping by gender
employee.groupby('GENDER').BASE_SALARY.mean().round(-2)
Out[17]: 
GENDER
Female    52200.0
Male      57400.0
Name: BASE_SALARY, dtype: float64
employee.index.min()
Out[18]: Timestamp('1958-12-29 00:00:00')
# find the average salary based on hire date, and group everyone into 10-year buckets
employee.resample('10AS').BASE_SALARY.mean().round(-2)
Out[20]: 
HIRE_DATE
1958-01-01     81200.0
1968-01-01    106500.0
1978-01-01     69600.0
1988-01-01     62300.0
1998-01-01     58200.0
2008-01-01     47200.0
Freq: 10AS-JAN, Name: BASE_SALARY, dtype: float64
# group by both gender and a ten-year time span,
employee.groupby('GENDER').resample('10AS')['BASE_SALARY'].mean().round(-2)
Out[22]: 
GENDER  HIRE_DATE 
Female  1975-01-01     51600.0
        1985-01-01     57600.0
        1995-01-01     55500.0
        2005-01-01     51700.0
        2015-01-01     38600.0
Male    1958-01-01     81200.0
        1968-01-01    106500.0
        1978-01-01     72300.0
        1988-01-01     64600.0
        1998-01-01     59700.0
        2008-01-01     47200.0
Name: BASE_SALARY, dtype: float64
employee.groupby('GENDER').resample('10AS').mean().round(-2).unstack('GENDER')
Out[23]: 
           UNIQUE_ID         BASE_SALARY          
GENDER        Female    Male      Female      Male
HIRE_DATE                                         
1958-01-01       NaN   100.0         NaN   81200.0
1968-01-01       NaN  1200.0         NaN  106500.0
1975-01-01    1000.0     NaN     51600.0       NaN
1978-01-01       NaN  1000.0         NaN   72300.0
1985-01-01     900.0     NaN     57600.0       NaN
1988-01-01       NaN  1000.0         NaN   64600.0
1995-01-01    1000.0     NaN     55500.0       NaN
1998-01-01       NaN  1000.0         NaN   59700.0
2005-01-01    1100.0     NaN     51700.0       NaN
2008-01-01       NaN  1000.0         NaN   47200.0
2015-01-01    1000.0     NaN     38600.0       NaN
employee.groupby('GENDER').resample('10AS')['BASE_SALARY'].mean().round(-2).unstack('GENDER')
Out[24]: 
GENDER       Female      Male
HIRE_DATE                    
1958-01-01      NaN   81200.0
1968-01-01      NaN  106500.0
1975-01-01  51600.0       NaN
1978-01-01      NaN   72300.0
1985-01-01  57600.0       NaN
1988-01-01      NaN   64600.0
1995-01-01  55500.0       NaN
1998-01-01      NaN   59700.0
2005-01-01  51700.0       NaN
2008-01-01      NaN   47200.0
2015-01-01  38600.0       NaN
# verify the first hired date for each gender
employ[employee.GENDER=='male'].HIRE_DATE.min()
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-26-77cb3857a935>", line 1, in <module>
    employ[employee.GENDER=='male'].HIRE_DATE.min()
NameError: name 'employ' is not defined
employee[employee.GENDER=='male'].HIRE_DATE.min()
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-27-2b62d540464d>", line 1, in <module>
    employee[employee.GENDER=='male'].HIRE_DATE.min()
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\generic.py", line 5462, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'HIRE_DATE'
employee[employee['GENDER']=='male'].HIRE_DATE.min()
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-28-d1c13f9b7d3c>", line 1, in <module>
    employee[employee['GENDER']=='male'].HIRE_DATE.min()
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\generic.py", line 5462, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'HIRE_DATE'
employee[employee.GENDER=='male'].index.min()
Out[29]: NaT
employee[employee.GENDER=='Male'].index.min()
Out[30]: Timestamp('1958-12-29 00:00:00')
employee[employee.GENDER=='Male'].info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1397 entries, 2015-02-03 to 2009-01-12
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   UNIQUE_ID          1397 non-null   int64         
 1   POSITION_TITLE     1397 non-null   object        
 2   DEPARTMENT         1397 non-null   object        
 3   BASE_SALARY        1309 non-null   float64       
 4   RACE               1376 non-null   object        
 5   EMPLOYMENT_TYPE    1397 non-null   object        
 6   GENDER             1397 non-null   object        
 7   EMPLOYMENT_STATUS  1397 non-null   object        
 8   JOB_DATE           1395 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(1), object(6)
memory usage: 109.1+ KB
employee.HIRE_DATE
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-32-0db56d2a2f68>", line 1, in <module>
    employee.HIRE_DATE
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\generic.py", line 5462, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'HIRE_DATE'
employee.index
Out[33]: 
DatetimeIndex(['2006-06-12', '2000-07-19', '2015-02-03', '1982-02-08',
               '1989-06-19', '1984-11-26', '2012-03-26', '2013-11-04',
               '1993-11-15', '2016-03-14',
               ...
               '1995-10-14', '2005-09-12', '2011-12-12', '1982-05-03',
               '1983-02-07', '2014-06-09', '2003-09-02', '2014-10-13',
               '2009-01-20', '2009-01-12'],
              dtype='datetime64[ns]', name='HIRE_DATE', length=2000, freq=None)
employee[employee.GENDER=='Falemale'].index.min()
Out[34]: NaT
employee[employee.GENDER=='Female'].index.min()
Out[35]: Timestamp('1975-06-09 00:00:00')
# group the date together with the gender
employee.groupby(['GENDER',pd.Grouper(freq='10AS')])['BASE_SALARY'].mean().round(-2)
Out[37]: 
GENDER  HIRE_DATE 
Female  1968-01-01         NaN
        1978-01-01     57100.0
        1988-01-01     57100.0
        1998-01-01     54700.0
        2008-01-01     47300.0
Male    1958-01-01     81200.0
        1968-01-01    106500.0
        1978-01-01     72300.0
        1988-01-01     64600.0
        1998-01-01     59700.0
        2008-01-01     47200.0
Name: BASE_SALARY, dtype: float64
employee.groupby(['GENDER',pd.Grouper(freq='10AS')])['BASE_SALARY'].mean().round(-2).unstack('GENDER')
Out[38]: 
GENDER       Female      Male
HIRE_DATE                    
1958-01-01      NaN   81200.0
1968-01-01      NaN  106500.0
1978-01-01  57100.0   72300.0
1988-01-01  57100.0   64600.0
1998-01-01  54700.0   59700.0
2008-01-01  47300.0   47200.0

萝卜丝皮尔

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
【cookbook-pandas】学习笔记 Time Series Analysis

由于机器故障，我辛辛苦苦抄半下午的都木有了……然而，笔记还是要做的。chapter7 Time Series Analysisunderstanding the difference between Python and pandas date tools关于参数error:
复制链接

扫一扫

专栏目录