由于机器故障,我辛辛苦苦抄半下午的都木有了……然而,笔记还是要做的。
chapter7 Time Series Analysis
understanding the difference between Python and pandas date tools
- 关于参数
error
:
plate_time=pd.read_csv('eight_attri.csv',
usecols=['plateNumber','passCarTime'],
encoding='utf-8_sig',
#dtype={"jncCode":"category","deviceCode":"category"},
iterator=True
#,delimiter="\t"
)
df=plate_time.get_chunk(2000)
df.dtypes
Out[7]:
plateNumber object
passCarTime object
dtype: object
df.passCarTime=pd.to_datetime(df.passCarTime)
df.dtypes
Out[9]:
plateNumber object
passCarTime datetime64[ns]
dtype: object
# set the 'passCarTime' column as the index to make intelligent Timestamp slicing possible
df=df.set_index('passCarTime')
df
Out[12]:
plateNumber
passCarTime
2020-12-20 00:00:00 鄂A2J8C0
2020-12-20 00:00:00 鄂KX0175
2020-12-20 00:00:00 鄂A3K89F
2020-12-20 00:00:00 鄂H1B196
2020-12-20 00:00:00 鄂H1B196
...
2020-12-20 00:05:12 鄂AV2G25
2020-12-20 00:05:12 鄂AV2G25
2020-12-20 00:05:12 鄂AV2G25
2020-12-20 00:05:13 鄂KX0621
2020-12-20 00:05:13 鄂A39B0Y
[2000 rows x 1 columns]
# select all the rows equals to a single inedx by passing that value to the .loc attribute
crime.loc['2020-12-20 00:01:00']
Traceback (most recent call last):
File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-14-0d1fff716a6f>", line 1, in <module>
crime.loc['2020-12-20 00:01:00']
NameError: name 'crime' is not defined
df.loc['2020-12-20 00:01:00']
Out[15]:
plateNumber
passCarTime
2020-12-20 00:01:00 鄂A289BS
2020-12-20 00:01:00 鄂A289BS
2020-12-20 00:01:00 鄂KX0579
2020-12-20 00:01:00 鄂A754S2
# select all the rows that partially match an index value
# e.g. we want all the record from Dec 20,2020
df.loc['2020-12-20']
Out[18]:
plateNumber
passCarTime
2020-12-20 00:00:00 鄂A2J8C0
2020-12-20 00:00:00 鄂KX0175
2020-12-20 00:00:00 鄂A3K89F
2020-12-20 00:00:00 鄂H1B196
2020-12-20 00:00:00 鄂H1B196
...
2020-12-20 00:05:12 鄂AV2G25
2020-12-20 00:05:12 鄂AV2G25
2020-12-20 00:05:12 鄂AV2G25
2020-12-20 00:05:13 鄂KX0621
2020-12-20 00:05:13 鄂A39B0Y
[2000 rows x 1 columns]
# you also can do so for an entire month
df.loc['2020-12'].shape
Out[20]: (2000, 1)
# the selection strings may also contain the name of the month
df.loc['Dec 2020'].sort_index()
Out[22]:
plateNumber
passCarTime
2020-12-20 00:00:00 鄂A2J8C0
2020-12-20 00:00:00 鄂KX0175
2020-12-20 00:00:00 鄂A3K89F
2020-12-20 00:00:00 鄂H1B196
2020-12-20 00:00:00 鄂H1B196
...
2020-12-20 00:05:12 鄂AV2G25
2020-12-20 00:05:12 鄂AV2G25
2020-12-20 00:05:12 鄂AV2G25
2020-12-20 00:05:13 鄂KX0621
2020-12-20 00:05:13 鄂A39B0Y
[2000 rows x 1 columns]
# many other string patterns with month name included also work
df.loc['2020 Dec, 20'].shape
Out[24]: (2000, 1)
# use the slice notation to select a precise range of datetime
df.loc['2020-12-20 00:00':'2020-12-20 00:01']
Out[26]:
plateNumber
passCarTime
2020-12-20 00:00:00 鄂A2J8C0
2020-12-20 00:00:00 鄂KX0175
2020-12-20 00:00:00 鄂A3K89F
2020-12-20 00:00:00 鄂H1B196
2020-12-20 00:00:00 鄂H1B196
...
2020-12-20 00:01:58 鄂KDR688
2020-12-20 00:01:58 鄂KB0858
2020-12-20 00:01:58 鄂K6J705
2020-12-20 00:01:58 鄂A66V8L
2020-12-20 00:01:59 鄂K8N198
[721 rows x 1 columns]
df.loc['2020-12-20 00:00':'2020-12-20 00:01'].shape
Out[27]: (721, 1)
# sorting the index will lead to large gains in performance of slicing, crime_sort=crime.sort_index()
# to select and slice rows by date using the indexing operator, the index must contain date valus
# not necessary to use the .loc attribute for any of the selections or slices, meaning that the index operator by itself will work in the same manner
<string>:1: FutureWarning: Indexing a DataFrame with a datetimelike index using a single string to slice the rows, like `frame[string]`, is deprecated and will be removed in a future version. Use `frame.loc[string]` instead.
df['2020-12-20'].shape
<ipython-input-31-1bfaf89fff3a>:1: FutureWarning: Indexing a DataFrame with a datetimelike index using a single string to slice the rows, like `frame[string]`, is deprecated and will be removed in a future version. Use `frame.loc[string]` instead.
df['2020-12-20'].shape
Out[31]: (2000, 1)
Counting the number of weekly crimes
2021/3/16 [456(477/627)]
the .resample
method provides an easy interface to grouping by any possible span of time
use both the .resample
and .groupby
methods to count the number of weekly crimes
# the .resample method, by default, works implicitly with a DatatimeIndex, here is 'PEPORTED_DATE'
# the first parameter to .resample is the rule determining how the Timestamps in the index will be grouped, e.g. the offset alias W to form groups one week in length ending on Sunday
# the default ending day is Sunday, but may be changed with an anchored offset by appending a dash and the first three letters of a day of the week
# form groups with .resample method, and chain a method to take action on each of them
#e.g. .size method to count the number of crimes per week
Aggregating weekly crime and traffic accidents separately
[472(493/627)]
# with the .groupby method using pd.Grouper to set our group length
# All DateFrame columns containing Timestamps have access to numerous other attributes and methods with the .dt attribute,
# all of these methods and attributes available from the .dt attribute are also available on a Timestamp object
# .dt attribute only works on a Series
# rearrange the order of the index with the .reindex method when a list contains the desired order
# to ease the readability, we pivot the weekdays level into horizontal column names with .unstack
align v. 排整齐;校准;使排成一列
标红的地方,没大看懂
计算某种犯罪:
Grouping with anonymous functinons with a DatetimeIndex[477/(498/627)]
# show the versatility of using the .groupby method for DataFrames that have a DatetimeIndex
# in step3, we use these extra features of the DatetimeIndex to extract the day name.
# in step4, we take advantage of the .groupby method to accept a functinon that is passed the DatetimeIndex
# the 'idx' in the anonymous functinon is the DatetimeIndex, and use it to retrieve the day name
# possible to pass .groupby a list of any number of custom functions
# the .round this DatetimeIndex method to round each value to the nearest second hour
# after the grouping and aggregationg, we .unstack the years as columns
Grouping by a Timestamp and another column [478(499/627)]
import pandas as pd
# I guess...'parse_dates' is set to make 'JOB_DATE' and 'HIRE_DATE' into 'Datetime' type,
employee=pd.read_csv('employee.csv',parse_dates=['JOB_DATE','HIRE_DATE'],index_col='HIRE_DATE')
employee.dtypes
Out[10]:
UNIQUE_ID int64
POSITION_TITLE object
DEPARTMENT object
BASE_SALARY float64
RACE object
EMPLOYMENT_TYPE object
GENDER object
EMPLOYMENT_STATUS object
JOB_DATE datetime64[ns]
dtype: object
employee.index
Out[11]:
DatetimeIndex(['2006-06-12', '2000-07-19', '2015-02-03', '1982-02-08',
'1989-06-19', '1984-11-26', '2012-03-26', '2013-11-04',
'1993-11-15', '2016-03-14',
...
'1995-10-14', '2005-09-12', '2011-12-12', '1982-05-03',
'1983-02-07', '2014-06-09', '2003-09-02', '2014-10-13',
'2009-01-20', '2009-01-12'],
dtype='datetime64[ns]', name='HIRE_DATE', length=2000, freq=None)
employee
Out[12]:
UNIQUE_ID POSITION_TITLE ... EMPLOYMENT_STATUS JOB_DATE
HIRE_DATE ...
2006-06-12 0 ASSISTANT DIRECTOR (EX LVL) ... Active 2012-10-13
2000-07-19 1 LIBRARY ASSISTANT ... Active 2010-09-18
2015-02-03 2 POLICE OFFICER ... Active 2015-02-03
1982-02-08 3 ENGINEER/OPERATOR ... Active 1991-05-25
1989-06-19 4 ELECTRICIAN ... Active 1994-10-22
... ... ... ... ...
2014-06-09 1995 POLICE OFFICER ... Active 2015-06-09
2003-09-02 1996 COMMUNICATIONS CAPTAIN ... Active 2013-10-06
2014-10-13 1997 POLICE OFFICER ... Active 2015-10-13
2009-01-20 1998 POLICE OFFICER ... Active 2011-07-02
2009-01-12 1999 FIRE FIGHTER ... Active 2010-07-12
[2000 rows x 9 columns]
employee.BASE_SALARY
Out[13]:
HIRE_DATE
2006-06-12 121862.0
2000-07-19 26125.0
2015-02-03 45279.0
1982-02-08 63166.0
1989-06-19 56347.0
...
2014-06-09 43443.0
2003-09-02 66523.0
2014-10-13 43443.0
2009-01-20 55461.0
2009-01-12 51194.0
Name: BASE_SALARY, Length: 2000, dtype: float64
employee.BASE_SALARY.mean()
Out[14]: 55767.93160127253
employee.BASE_SALARY.mean().round(-2)
Out[15]: 55800.0
# find the average salary for each gender, just grouping by gender
employee.groupby('GENDER').BASE_SALARY.mean().round(-2)
Out[17]:
GENDER
Female 52200.0
Male 57400.0
Name: BASE_SALARY, dtype: float64
employee.index.min()
Out[18]: Timestamp('1958-12-29 00:00:00')
# find the average salary based on hire date, and group everyone into 10-year buckets
employee.resample('10AS').BASE_SALARY.mean().round(-2)
Out[20]:
HIRE_DATE
1958-01-01 81200.0
1968-01-01 106500.0
1978-01-01 69600.0
1988-01-01 62300.0
1998-01-01 58200.0
2008-01-01 47200.0
Freq: 10AS-JAN, Name: BASE_SALARY, dtype: float64
# group by both gender and a ten-year time span,
employee.groupby('GENDER').resample('10AS')['BASE_SALARY'].mean().round(-2)
Out[22]:
GENDER HIRE_DATE
Female 1975-01-01 51600.0
1985-01-01 57600.0
1995-01-01 55500.0
2005-01-01 51700.0
2015-01-01 38600.0
Male 1958-01-01 81200.0
1968-01-01 106500.0
1978-01-01 72300.0
1988-01-01 64600.0
1998-01-01 59700.0
2008-01-01 47200.0
Name: BASE_SALARY, dtype: float64
employee.groupby('GENDER').resample('10AS').mean().round(-2).unstack('GENDER')
Out[23]:
UNIQUE_ID BASE_SALARY
GENDER Female Male Female Male
HIRE_DATE
1958-01-01 NaN 100.0 NaN 81200.0
1968-01-01 NaN 1200.0 NaN 106500.0
1975-01-01 1000.0 NaN 51600.0 NaN
1978-01-01 NaN 1000.0 NaN 72300.0
1985-01-01 900.0 NaN 57600.0 NaN
1988-01-01 NaN 1000.0 NaN 64600.0
1995-01-01 1000.0 NaN 55500.0 NaN
1998-01-01 NaN 1000.0 NaN 59700.0
2005-01-01 1100.0 NaN 51700.0 NaN
2008-01-01 NaN 1000.0 NaN 47200.0
2015-01-01 1000.0 NaN 38600.0 NaN
employee.groupby('GENDER').resample('10AS')['BASE_SALARY'].mean().round(-2).unstack('GENDER')
Out[24]:
GENDER Female Male
HIRE_DATE
1958-01-01 NaN 81200.0
1968-01-01 NaN 106500.0
1975-01-01 51600.0 NaN
1978-01-01 NaN 72300.0
1985-01-01 57600.0 NaN
1988-01-01 NaN 64600.0
1995-01-01 55500.0 NaN
1998-01-01 NaN 59700.0
2005-01-01 51700.0 NaN
2008-01-01 NaN 47200.0
2015-01-01 38600.0 NaN
# verify the first hired date for each gender
employ[employee.GENDER=='male'].HIRE_DATE.min()
Traceback (most recent call last):
File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-26-77cb3857a935>", line 1, in <module>
employ[employee.GENDER=='male'].HIRE_DATE.min()
NameError: name 'employ' is not defined
employee[employee.GENDER=='male'].HIRE_DATE.min()
Traceback (most recent call last):
File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-27-2b62d540464d>", line 1, in <module>
employee[employee.GENDER=='male'].HIRE_DATE.min()
File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\generic.py", line 5462, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'HIRE_DATE'
employee[employee['GENDER']=='male'].HIRE_DATE.min()
Traceback (most recent call last):
File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-28-d1c13f9b7d3c>", line 1, in <module>
employee[employee['GENDER']=='male'].HIRE_DATE.min()
File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\generic.py", line 5462, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'HIRE_DATE'
employee[employee.GENDER=='male'].index.min()
Out[29]: NaT
employee[employee.GENDER=='Male'].index.min()
Out[30]: Timestamp('1958-12-29 00:00:00')
employee[employee.GENDER=='Male'].info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1397 entries, 2015-02-03 to 2009-01-12
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 UNIQUE_ID 1397 non-null int64
1 POSITION_TITLE 1397 non-null object
2 DEPARTMENT 1397 non-null object
3 BASE_SALARY 1309 non-null float64
4 RACE 1376 non-null object
5 EMPLOYMENT_TYPE 1397 non-null object
6 GENDER 1397 non-null object
7 EMPLOYMENT_STATUS 1397 non-null object
8 JOB_DATE 1395 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(1), object(6)
memory usage: 109.1+ KB
employee.HIRE_DATE
Traceback (most recent call last):
File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-32-0db56d2a2f68>", line 1, in <module>
employee.HIRE_DATE
File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\generic.py", line 5462, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'HIRE_DATE'
employee.index
Out[33]:
DatetimeIndex(['2006-06-12', '2000-07-19', '2015-02-03', '1982-02-08',
'1989-06-19', '1984-11-26', '2012-03-26', '2013-11-04',
'1993-11-15', '2016-03-14',
...
'1995-10-14', '2005-09-12', '2011-12-12', '1982-05-03',
'1983-02-07', '2014-06-09', '2003-09-02', '2014-10-13',
'2009-01-20', '2009-01-12'],
dtype='datetime64[ns]', name='HIRE_DATE', length=2000, freq=None)
employee[employee.GENDER=='Falemale'].index.min()
Out[34]: NaT
employee[employee.GENDER=='Female'].index.min()
Out[35]: Timestamp('1975-06-09 00:00:00')
# group the date together with the gender
employee.groupby(['GENDER',pd.Grouper(freq='10AS')])['BASE_SALARY'].mean().round(-2)
Out[37]:
GENDER HIRE_DATE
Female 1968-01-01 NaN
1978-01-01 57100.0
1988-01-01 57100.0
1998-01-01 54700.0
2008-01-01 47300.0
Male 1958-01-01 81200.0
1968-01-01 106500.0
1978-01-01 72300.0
1988-01-01 64600.0
1998-01-01 59700.0
2008-01-01 47200.0
Name: BASE_SALARY, dtype: float64
employee.groupby(['GENDER',pd.Grouper(freq='10AS')])['BASE_SALARY'].mean().round(-2).unstack('GENDER')
Out[38]:
GENDER Female Male
HIRE_DATE
1958-01-01 NaN 81200.0
1968-01-01 NaN 106500.0
1978-01-01 57100.0 72300.0
1988-01-01 57100.0 64600.0
1998-01-01 54700.0 59700.0
2008-01-01 47300.0 47200.0