Python pandas basics and practice exercises

Most recently recommended article published on 2024-05-12 08:39:39

奋豆儿小米粒

Article tags: python之panda

Article link: https://blog.csdn.net/qq_27008327/article/details/89044617

####pandas: a library for data analysis and processing

import pandas as pd

df=pd.read_csv(r'E:\pyhon\pandas\Pandas%E4%BB%A3%E7%A0%81\data\titanic.csv')#raw string, so that \t in the path is not read as a tab

PassengerId 	Survived 	Pclass 	Name 	Sex 	Age 	SibSp 	Parch 	Ticket 	Fare 	Cabin 	Embarked

0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S
16 17 0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 NaN Q
17 18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande… female 31.0 1 0 345763 18.0000 NaN S
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
20 21 0 2 Fynney, Mr. Joseph J male 35.0 0 0 239865 26.0000 NaN S
21 22 1 2 Beesley, Mr. Lawrence male 34.0 0 0 248698 13.0000 D56 S
22 23 1 3 McGowan, Miss. Anna “Annie” female 15.0 0 0 330923 8.0292 NaN Q
23 24 1 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5000 A6 S
24 25 0 3 Palsson, Miss. Torborg Danira female 8.0 3 1 349909 21.0750 NaN S
25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia… female 38.0 1 5 347077 31.3875 NaN S
26 27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C
27 28 0 1 Fortune, Mr. Charles Alexander male 19.0 3 2 19950 263.0000 C23 C25 C27 S
28 29 1 3 O’Dwyer, Miss. Ellen “Nellie” female NaN 0 0 330959 7.8792 NaN Q
29 30 0 3 Todoroff, Mr. Lalio male NaN 0 0 349216 7.8958 NaN S
… … … … … … … … … … … … …
861 862 0 2 Giles, Mr. Frederick Edward male 21.0 1 0 28134 11.5000 NaN S
862 863 1 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba… female 48.0 0 0 17466 25.9292 D17 S
863 864 0 3 Sage, Miss. Dorothy Edith “Dolly” female NaN 8 2 CA. 2343 69.5500 NaN S
864 865 0 2 Gill, Mr. John William male 24.0 0 0 233866 13.0000 NaN S
865 866 1 2 Bystrom, Mrs. (Karolina) female 42.0 0 0 236852 13.0000 NaN S
866 867 1 2 Duran y More, Miss. Asuncion female 27.0 1 0 SC/PARIS 2149 13.8583 NaN C
867 868 0 1 Roebling, Mr. Washington Augustus II male 31.0 0 0 PC 17590 50.4958 A24 S
868 869 0 3 van Melkebeke, Mr. Philemon male NaN 0 0 345777 9.5000 NaN S
869 870 1 3 Johnson, Master. Harold Theodor male 4.0 1 1 347742 11.1333 NaN S
870 871 0 3 Balkic, Mr. Cerin male 26.0 0 0 349248 7.8958 NaN S
871 872 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 1 11751 52.5542 D35 S
872 873 0 1 Carlsson, Mr. Frans Olof male 33.0 0 0 695 5.0000 B51 B53 B55 S
873 874 0 3 Vander Cruyssen, Mr. Victor male 47.0 0 0 345765 9.0000 NaN S
874 875 1 2 Abelson, Mrs. Samuel (Hannah Wizosky) female 28.0 1 0 P/PP 3381 24.0000 NaN C
875 876 1 3 Najib, Miss. Adele Kiamie “Jane” female 15.0 0 0 2667 7.2250 NaN C
876 877 0 3 Gustafsson, Mr. Alfred Ossian male 20.0 0 0 7534 9.8458 NaN S
877 878 0 3 Petroff, Mr. Nedelio male 19.0 0 0 349212 7.8958 NaN S
878 879 0 3 Laleff, Mr. Kristo male NaN 0 0 349217 7.8958 NaN S
879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C
880 881 1 2 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0 1 230433 26.0000 NaN S
881 882 0 3 Markun, Mr. Johann male 33.0 0 0 349257 7.8958 NaN S
882 883 0 3 Dahlberg, Miss. Gerda Ulrika female 22.0 0 0 7552 10.5167 NaN S
883 884 0 2 Banfield, Mr. Frederick James male 28.0 0 0 C.A./SOTON 34068 10.5000 NaN S
884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen “Carrie” female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns
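The path given to read_csv uses Windows backslashes, and in an ordinary Python string literal `\t` means a tab, which is why a raw string (or forward slashes) is the safer way to write it. The sketch below shows the difference and then loads a tiny CSV from memory. The three-row sample text is a hypothetical stand-in, not the real titanic.csv.

```python
import io
import pandas as pd

# In a normal string literal "\t" is one tab character; a raw string keeps the backslash.
plain = 'data\titanic.csv'
raw = r'data\titanic.csv'
print(len(plain), len(raw))  # the raw string is one character longer

# Hypothetical sample standing in for titanic.csv (the last Age is left empty).
csv_text = "PassengerId,Survived,Age\n1,0,22\n2,1,38\n3,1,\n"
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)                # (rows, columns)
print(sample['Age'].isna().sum())  # empty fields are read as NaN
```

read_csv accepts a file path or any file-like object, so the same call works once the StringIO object is swapped for a real path.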

#head() returns the first few rows; pass the number of rows you want (the default is 5)

df.head(6)

PassengerId 	Survived 	Pclass 	Name 	Sex 	Age 	SibSp 	Parch 	Ticket 	Fare 	Cabin 	Embarked

df.info()#info() prints a summary of the DataFrame: index, columns, non-null counts, dtypes and memory usage

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

df.index

RangeIndex(start=0, stop=891, step=1)

df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')

df.dtypes

PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object

df.values

array([[1, 0, 3, ..., 7.25, nan, 'S'],
[2, 1, 1, ..., 71.2833, 'C85', 'C'],
[3, 1, 3, ..., 7.925, nan, 'S'],
...,
[889, 0, 3, ..., 23.45, nan, 'S'],
[890, 1, 1, ..., 30.0, 'C148', 'C'],
[891, 0, 3, ..., 7.75, nan, 'Q']], dtype=object)

#Create a DataFrame

#Build the input data as a dict

data={'country':['aaa','bbb','ccc'],#the dict keys become the column names

   'population':[10,12,24]}

data

{'country': ['aaa', 'bbb', 'ccc'], 'population': [10, 12, 24]}

df_data=pd.DataFrame(data)

df_data

country 	population

0 aaa 10
1 bbb 12
2 ccc 24

df_data.info()#info is a method; without the parentheses it is not called

#Select specific data

age=df['Age']

age[:5]

0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
Name: Age, dtype: float64

#Series: a single row or column of a DataFrame

age.index

RangeIndex(start=0, stop=891, step=1)

age.values[:5]

array([ 22., 38., 26., 35., 35.])

df.head()

PassengerId 	Survived 	Pclass 	Name 	Sex 	Age 	SibSp 	Parch 	Ticket 	Fare 	Cabin 	Embarked

df['Age'][:5]

0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
Name: Age, dtype: float64

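#The output below is indexed by Name, so df=df.set_index('Name') is assumed to have been run before this step

age=df['Age']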

age[:5]

Name
Braund, Mr. Owen Harris 22.0
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 38.0
Heikkinen, Miss. Laina 26.0
Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0
Allen, Mr. William Henry 35.0
Name: Age, dtype: float64

#Arithmetic on a Series is applied to every element

age[:5]+10

Name
Braund, Mr. Owen Harris 32.0
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 48.0
Heikkinen, Miss. Laina 36.0
Futrelle, Mrs. Jacques Heath (Lily May Peel) 45.0
Allen, Mr. William Henry 45.0
Name: Age, dtype: float64

age*10

Name
Braund, Mr. Owen Harris 220.0
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 380.0
Heikkinen, Miss. Laina 260.0
Futrelle, Mrs. Jacques Heath (Lily May Peel) 350.0
Allen, Mr. William Henry 350.0
Moran, Mr. James NaN
McCarthy, Mr. Timothy J 540.0
Palsson, Master. Gosta Leonard 20.0
Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) 270.0
Nasser, Mrs. Nicholas (Adele Achem) 140.0
Sandstrom, Miss. Marguerite Rut 40.0
Bonnell, Miss. Elizabeth 580.0
Saundercock, Mr. William Henry 200.0
Andersson, Mr. Anders Johan 390.0
Vestrom, Miss. Hulda Amanda Adolfina 140.0
Hewlett, Mrs. (Mary D Kingcome) 550.0
Rice, Master. Eugene 20.0
Williams, Mr. Charles Eugene NaN
Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele) 310.0
Masselmani, Mrs. Fatima NaN
Fynney, Mr. Joseph J 350.0
Beesley, Mr. Lawrence 340.0
McGowan, Miss. Anna “Annie” 150.0
Sloper, Mr. William Thompson 280.0
Palsson, Miss. Torborg Danira 80.0
Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson) 380.0
Emir, Mr. Farred Chehab NaN
Fortune, Mr. Charles Alexander 190.0
O’Dwyer, Miss. Ellen “Nellie” NaN
Todoroff, Mr. Lalio NaN
…
Giles, Mr. Frederick Edward 210.0
Swift, Mrs. Frederick Joel (Margaret Welles Barron) 480.0
Sage, Miss. Dorothy Edith “Dolly” NaN
Gill, Mr. John William 240.0
Bystrom, Mrs. (Karolina) 420.0
Duran y More, Miss. Asuncion 270.0
Roebling, Mr. Washington Augustus II 310.0
van Melkebeke, Mr. Philemon NaN
Johnson, Master. Harold Theodor 40.0
Balkic, Mr. Cerin 260.0
Beckwith, Mrs. Richard Leonard (Sallie Monypeny) 470.0
Carlsson, Mr. Frans Olof 330.0
Vander Cruyssen, Mr. Victor 470.0
Abelson, Mrs. Samuel (Hannah Wizosky) 280.0
Najib, Miss. Adele Kiamie “Jane” 150.0
Gustafsson, Mr. Alfred Ossian 200.0
Petroff, Mr. Nedelio 190.0
Laleff, Mr. Kristo NaN
Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) 560.0
Shelley, Mrs. William (Imanita Parrish Hall) 250.0
Markun, Mr. Johann 330.0
Dahlberg, Miss. Gerda Ulrika 220.0
Banfield, Mr. Frederick James 280.0
Sutehall, Mr. Henry Jr 250.0
Rice, Mrs. William (Margaret Norton) 390.0
Montvila, Rev. Juozas 270.0
Graham, Miss. Margaret Edith 190.0
Johnston, Miss. Catherine Helen “Carrie” NaN
Behr, Mr. Karl Howell 260.0
Dooley, Mr. Patrick 320.0
Name: Age, Length: 891, dtype: float64

age[:5]*10

age.mean()

29.69911764705882

age.max()

80.0

age.min()

0.41999999999999998

#describe() gives simple, easy-to-read summary statistics for the numeric columns

df.describe()

PassengerId 	Survived 	Pclass 	Age 	SibSp 	Parch 	Fare

count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
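The Age column has 714 non-null values out of 891 rows, yet mean(), max() and describe() still return numbers. That is because pandas skips NaN in these reductions by default. A minimal sketch on a made-up four-element Series (the values are illustrative only):

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, 38.0, np.nan, 35.0], name='Age')

print(ages.count())              # counts non-missing values only
print(ages.mean())               # NaN is skipped by default
print(ages.mean(skipna=False))   # with skipna=False the NaN propagates

summary = ages.describe()        # count/mean/std/min/quartiles/max in one call
print(summary['count'])
```

This is also why the count row of describe() can differ from column to column.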

####pandas indexing

import pandas as pd

df=pd.read_csv(r'E:\pyhon\pandas\Pandas%E4%BB%A3%E7%A0%81\data\titanic.csv')

PassengerId 	Survived 	Pclass 	Name 	Sex 	Age 	SibSp 	Parch 	Ticket 	Fare 	Cabin 	Embarked

891 rows × 12 columns

df['Age'][:5]

0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
Name: Age, dtype: float64

df[['Age','Fare']][:5]

Age 	Fare

0 22.0 7.2500
1 38.0 71.2833
2 26.0 7.9250
3 35.0 53.1000
4 35.0 8.0500

loc: locate by label

iloc: locate by integer position

df.iloc[0]

PassengerId 1
Survived 0
Pclass 3
Name Braund, Mr. Owen Harris
Sex male
Age 22
SibSp 1
Parch 0
Ticket A/5 21171
Fare 7.25
Cabin NaN
Embarked S
Name: 0, dtype: object

df.iloc[0:5]

PassengerId 	Survived 	Pclass 	Name 	Sex 	Age 	SibSp 	Parch 	Ticket 	Fare 	Cabin 	Embarked

#Select a subset of rows and columns by position

df.iloc[0:5,1:3]

Survived 	Pclass

0 0 3
1 1 1
2 1 3
3 1 1
4 0 3
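One difference between the two indexers is easy to miss: an iloc slice excludes its end position, while a loc slice includes its end label. A small sketch, assuming a made-up three-row frame with string labels:

```python
import pandas as pd

people = pd.DataFrame({'Age': [22, 38, 26], 'Fare': [7.25, 71.28, 7.93]},
                      index=['x', 'y', 'z'])

by_position = people.iloc[0:1]     # position 0 only; the end position is excluded
by_label = people.loc['x':'y']     # labels 'x' through 'y'; the end label is included

cell_by_label = people.loc['y', 'Fare']   # row label, column label
cell_by_position = people.iloc[1, 1]      # row position, column position
print(len(by_position), len(by_label), cell_by_label, cell_by_position)
```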

df=df.set_index('Name')

#Running the line above a second time raises a KeyError, because after the first call Name is the index and no longer a column. The traceback recorded at this point came from such a re-run and ended with:

KeyError: 'Name'

df.loc['Heikkinen, Miss. Laina','Fare']=1000

df.head()

PassengerId 	Survived 	Pclass 	Sex 	Age 	SibSp 	Parch 	Ticket 	Fare 	Cabin 	Embarked

Name
Braund, Mr. Owen Harris 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN S
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
Heikkinen, Miss. Laina 3 1 3 female 26.0 0 0 STON/O2. 3101282 1000.0000 NaN S
Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
Allen, Mr. William Henry 5 0 3 male 35.0 0 0 373450 8.0500 NaN S
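Because set_index('Name') removes Name from the columns, calling it twice in a notebook fails. One way to make the cell safe to re-run is to check the columns first. The helper name `index_by_name` and the two-row frame are hypothetical, used only for illustration:

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Ann', 'Bob'], 'Fare': [7.25, 8.05]})

def index_by_name(frame):
    """Return the frame indexed by Name, whether or not it already is."""
    if 'Name' in frame.columns:
        return frame.set_index('Name')
    return frame  # Name is already the index; nothing to do

df_demo = index_by_name(df_demo)
df_demo = index_by_name(df_demo)   # the second call is a no-op, not a KeyError

# With Name as the index, loc can address a single cell by row and column label.
df_demo.loc['Bob', 'Fare'] = 1000
print(df_demo)
```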

####Boolean indexing

df['Fare']>40

df.head()

PassengerId 	Survived 	Pclass 	Sex 	Age 	SibSp 	Parch 	Ticket 	Fare 	Cabin 	Embarked

df[df['Fare']>40]

PassengerId 	Survived 	Pclass 	Sex 	Age 	SibSp 	Parch 	Ticket 	Fare 	Cabin 	Embarked

Name
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
Heikkinen, Miss. Laina 3 1 3 female 26.0 0 0 STON/O2. 3101282 1000.0000 NaN S
Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
McCarthy, Mr. Timothy J 7 0 1 male 54.0 0 0 17463 51.8625 E46 S
Fortune, Mr. Charles Alexander 28 0 1 male 19.0 3 2 19950 263.0000 C23 C25 C27 S
Spencer, Mrs. William Augustus (Marie Eugenie) 32 1 1 female NaN 1 0 PC 17569 146.5208 B78 C
Meyer, Mr. Edgar Joseph 35 0 1 male 28.0 1 0 PC 17604 82.1708 NaN C
Holverson, Mr. Alexander Oskar 36 0 1 male 42.0 1 0 113789 52.0000 NaN S
Laroche, Miss. Simonne Marie Anne Andree 44 1 2 female 3.0 1 2 SC/Paris 2123 41.5792 NaN C
Harper, Mrs. Henry Sleeper (Myna Haxtun) 53 1 1 female 49.0 1 0 PC 17572 76.7292 D33 C
Ostby, Mr. Engelhart Cornelius 55 0 1 male 65.0 0 1 113509 61.9792 B30 C
Goodwin, Master. William Frederick 60 0 3 male 11.0 5 2 CA 2144 46.9000 NaN S
Icard, Miss. Amelie 62 1 1 female 38.0 0 0 113572 80.0000 B28 NaN
Harris, Mr. Henry Birkhardt 63 0 1 male 45.0 1 0 36973 83.4750 C83 S
Goodwin, Miss. Lillian Amy 72 0 3 female 16.0 5 2 CA 2144 46.9000 NaN S
Hood, Mr. Ambrose Jr 73 0 2 male 21.0 0 0 S.O.C. 14879 73.5000 NaN S
Bing, Mr. Lee 75 1 3 male 32.0 0 0 1601 56.4958 NaN S
Carrau, Mr. Francisco M 84 0 1 male 28.0 0 0 113059 47.1000 NaN S
Fortune, Miss. Mabel Helen 89 1 1 female 23.0 3 2 19950 263.0000 C23 C25 C27 S
Chaffee, Mr. Herbert Fuller 93 0 1 male 46.0 1 0 W.E.P. 5734 61.1750 E31 S
Greenfield, Mr. William Bertram 98 1 1 male 23.0 0 1 PC 17759 63.3583 D10 D12 C
White, Mr. Richard Frasar 103 0 1 male 21.0 0 1 35281 77.2875 D26 S
Porter, Mr. Walter Chamberlain 111 0 1 male 47.0 0 0 110465 52.0000 C110 S
Baxter, Mr. Quigg Edmond 119 0 1 male 24.0 0 1 PC 17558 247.5208 B58 B60 C
Hickman, Mr. Stanley George 121 0 2 male 21.0 2 0 S.O.C. 14879 73.5000 NaN S
White, Mr. Percival Wayland 125 0 1 male 54.0 0 1 35281 77.2875 D26 S
Futrelle, Mr. Jacques Heath 138 0 1 male 37.0 1 0 113803 53.1000 C123 S
Giglio, Mr. Victor 140 0 1 male 24.0 0 0 PC 17593 79.2000 B86 C
Pears, Mrs. Thomas (Edith Wearne) 152 1 1 female 22.0 1 0 113776 66.6000 C2 S
Williams, Mr. Charles Duane 156 0 1 male 51.0 0 1 PC 17597 61.3792 NaN C
… … … … … … … … … … … …
Endres, Miss. Caroline Louise 717 1 1 female 38.0 0 0 PC 17757 227.5250 C45 C
Chambers, Mr. Norman Campbell 725 1 1 male 27.0 1 0 113806 53.1000 E8 S
Allen, Miss. Elisabeth Walton 731 1 1 female 29.0 0 0 24160 211.3375 B5 S
Lesurer, Mr. Gustave J 738 1 1 male 35.0 0 0 PC 17755 512.3292 B101 C
Cavendish, Mr. Tyrell William 742 0 1 male 36.0 1 0 19877 78.8500 C46 S
Ryerson, Miss. Susan Parker “Suzette” 743 1 1 female 21.0 2 2 PC 17608 262.3750 B57 B59 B63 B66 C
Crosby, Capt. Edward Gifford 746 0 1 male 70.0 1 1 WE/P 5735 71.0000 B22 S
Marvin, Mr. Daniel Warner 749 0 1 male 19.0 1 0 113773 53.1000 D30 S
Herman, Mrs. Samuel (Jane Laver) 755 1 2 female 48.0 1 2 220845 65.0000 NaN S
Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards) 760 1 1 female 33.0 0 0 110152 86.5000 B77 S
Carter, Mrs. William Ernest (Lucile Polk) 764 1 1 female 36.0 1 2 113760 120.0000 B96 B98 S
Hogeboom, Mrs. John C (Anna Andrews) 766 1 1 female 51.0 1 0 13502 77.9583 D11 S
Robert, Mrs. Edward Scott (Elisabeth Walton McMillan) 780 1 1 female 43.0 0 1 24160 211.3375 B3 S
Dick, Mrs. Albert Adrian (Vera Gillespie) 782 1 1 female 17.0 1 0 17474 57.0000 B20 S
Guggenheim, Mr. Benjamin 790 0 1 male 46.0 0 0 PC 17593 79.2000 B82 B84 C
Sage, Miss. Stella Anna 793 0 3 female NaN 8 2 CA. 2343 69.5500 NaN S
Carter, Master. William Thornton II 803 1 1 male 11.0 1 2 113760 120.0000 B96 B98 S
Chambers, Mrs. Norman Campbell (Bertha Griggs) 810 1 1 female 33.0 1 0 113806 53.1000 E8 S
Hays, Mrs. Charles Melville (Clara Jennings Gregg) 821 1 1 female 52.0 1 1 12749 93.5000 B69 S
Lam, Mr. Len 827 0 3 male NaN 0 0 1601 56.4958 NaN S
Stone, Mrs. George Nelson (Martha Evelyn) 830 1 1 female 62.0 0 0 113572 80.0000 B28 NaN
Compton, Miss. Sara Rebecca 836 1 1 female 39.0 1 1 PC 17756 83.1583 E49 C
Chip, Mr. Chang 839 1 3 male 32.0 0 0 1601 56.4958 NaN S
Sage, Mr. Douglas Bullen 847 0 3 male NaN 8 2 CA. 2343 69.5500 NaN S
Goldenberg, Mrs. Samuel L (Edwiga Grabowska) 850 1 1 female NaN 1 0 17453 89.1042 C92 C
Wick, Mrs. George Dennick (Mary Hitchcock) 857 1 1 female 45.0 1 1 36928 164.8667 NaN S
Sage, Miss. Dorothy Edith “Dolly” 864 0 3 female NaN 8 2 CA. 2343 69.5500 NaN S
Roebling, Mr. Washington Augustus II 868 0 1 male 31.0 0 0 PC 17590 50.4958 A24 S
Beckwith, Mrs. Richard Leonard (Sallie Monypeny) 872 1 1 female 47.0 1 1 11751 52.5542 D35 S
Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) 880 1 1 female 56.0 0 1 11767 83.1583 C50 C

177 rows × 11 columns

#Filter rows with a boolean condition

df[df['Fare'] > 40][:5]

PassengerId 	Survived 	Pclass 	Sex 	Age 	SibSp 	Parch 	Ticket 	Fare 	Cabin 	Embarked

df[df['Sex'] == 'male'][:5]

PassengerId 	Survived 	Pclass 	Sex 	Age 	SibSp 	Parch 	Ticket 	Fare 	Cabin 	Embarked

Name
Braund, Mr. Owen Harris 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN S
Allen, Mr. William Henry 5 0 3 male 35.0 0 0 373450 8.0500 NaN S
Moran, Mr. James 6 0 3 male NaN 0 0 330877 8.4583 NaN Q
McCarthy, Mr. Timothy J 7 0 1 male 54.0 0 0 17463 51.8625 E46 S
Palsson, Master. Gosta Leonard 8 0 3 male 2.0 3 1 349909 21.0750 NaN S

df.loc[df['Sex']=='male','Age'].mean()

30.72664459161148

(df['Age'] > 70).sum()
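Boolean masks can be combined with `&` (and), `|` (or) and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A sketch on a made-up four-row frame (the values are illustrative, not taken from the Titanic data):

```python
import pandas as pd

p = pd.DataFrame({'Sex': ['male', 'female', 'male', 'female'],
                  'Age': [22.0, 38.0, 71.0, 4.0],
                  'Fare': [7.25, 71.28, 34.65, 16.70]})

# Both conditions must hold.
mask = (p['Sex'] == 'female') & (p['Fare'] > 40)
rich_women = p[mask]

# Either condition may hold.
young_or_old = p[(p['Age'] < 18) | (p['Age'] > 70)]

# Summing a boolean Series counts the True values.
n_over_70 = (p['Age'] > 70).sum()
print(len(rich_women), len(young_or_old), n_over_70)
```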

###groupby operations

df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],

              'data':[0,5,10,5,10,15,10,15,20]})

data 	key

0 0 A
1 5 B
2 10 C
3 5 A
4 10 B
5 15 C
6 10 A
7 15 B
8 20 C

#Manual version: filter the rows for each key, then sum them

for key in ['A','B','C']:

    print(key,df[df['key'] == key].sum())

A data 15
key AAA
dtype: object
B data 30
key BBB
dtype: object
C data 45
key CCC
dtype: object

df.groupby('key').sum()#the same result in a single call

data

key
A 15
B 30
C 45

import numpy as np

df.groupby('key').aggregate(np.sum)#same as .sum(); recent pandas versions prefer the string form aggregate('sum')

data

key
A 15
B 30
C 45

df=pd.read_csv(r'E:\pyhon\pandas\Pandas%E4%BB%A3%E7%A0%81\data\titanic.csv')

df.groupby('Sex')['Age'].mean()

Sex
female 27.915709
male 30.726645
Name: Age, dtype: float64

df.groupby('Sex')['Survived'].mean()

Sex
female 0.742038
male 0.188908
Name: Survived, dtype: float64
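The two groupby calls above each compute one statistic. Several statistics can be collected in one pass with named aggregation, where each keyword is an output column and each value is a (column, function) pair. A sketch on a made-up four-row frame. The output column names `mean_age`, `survival_rate` and `n` are my own choices:

```python
import pandas as pd

g = pd.DataFrame({'Sex': ['male', 'female', 'male', 'female'],
                  'Survived': [0, 1, 0, 1],
                  'Age': [22.0, 38.0, 35.0, 26.0]})

stats = g.groupby('Sex').agg(
    mean_age=('Age', 'mean'),             # average age per group
    survival_rate=('Survived', 'mean'),   # mean of a 0/1 column is a rate
    n=('Survived', 'size'),               # number of rows per group
)
print(stats)
```

Since Survived is coded 0/1, its group mean is the share of survivors, which is how the female and male figures above should be read.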

#Numeric operations

import pandas as pd

df = pd.DataFrame([[1,2,3],[4,5,6]],index = ['a','b'],columns = ['A','B','C'])

A 	B 	C

a 1 2 3
b 4 5 6

df.sum()#by default sums down each column (axis=0)

df.sum(axis=0)#sum of each column

A 5
B 7
C 9
dtype: int64

df.sum(axis=1)#sum of each row

a 6
b 15
dtype: int64

df.sum(axis='columns')#same as axis=1

a 6
b 15
dtype: int64

df.mean()

A 2.5
B 3.5
C 4.5
dtype: float64

df.std()

A 2.12132
B 2.12132
C 2.12132
dtype: float64

df.var()

A 4.5
B 4.5
C 4.5
dtype: float64

df.max()

A 4
B 5
C 6
dtype: int64

df.min()

A 1
B 2
C 3
dtype: int64

df.max(axis=0)

A 4
B 5
C 6
dtype: int64

df.min(axis=1)

a 1
b 4
dtype: int64

df.median()

A 2.5
B 3.5
C 4.5
dtype: float64

###Bivariate statistics

df=pd.read_csv(r'E:\pyhon\pandas\Pandas%E4%BB%A3%E7%A0%81\data\titanic.csv')

df.head()

PassengerId 	Survived 	Pclass 	Name 	Sex 	Age 	SibSp 	Parch 	Ticket 	Fare 	Cabin 	Embarked

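df.cov()#covariance matrix of the numeric columns; on pandas 2.0 and later pass numeric_only=True because df also has text columns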

PassengerId 	Survived 	Pclass 	Age 	SibSp 	Parch 	Fare

PassengerId 66231.000000 -0.626966 -7.561798 138.696504 -16.325843 -0.342697 161.883369
Survived -0.626966 0.236772 -0.137703 -0.551296 -0.018954 0.032017 6.221787
Pclass -7.561798 -0.137703 0.699015 -4.496004 0.076599 0.012429 -22.830196
Age 138.696504 -0.551296 -4.496004 211.019125 -4.163334 -2.344191 73.849030
SibSp -16.325843 -0.018954 0.076599 -4.163334 1.216043 0.368739 8.748734
Parch -0.342697 0.032017 0.012429 -2.344191 0.368739 0.649728 8.661052
Fare 161.883369 6.221787 -22.830196 73.849030 8.748734 8.661052 2469.436846

df.corr()#correlation matrix of the numeric columns; on pandas 2.0 and later pass numeric_only=True

PassengerId 	Survived 	Pclass 	Age 	SibSp 	Parch 	Fare

PassengerId 1.000000 -0.005007 -0.035144 0.036847 -0.057527 -0.001652 0.012658
Survived -0.005007 1.000000 -0.338481 -0.077221 -0.035322 0.081629 0.257307
Pclass -0.035144 -0.338481 1.000000 -0.369226 0.083081 0.018443 -0.549500
Age 0.036847 -0.077221 -0.369226 1.000000 -0.308247 -0.189119 0.096067
SibSp -0.057527 -0.035322 0.083081 -0.308247 1.000000 0.414838 0.159651
Parch -0.001652 0.081629 0.018443 -0.189119 0.414838 1.000000 0.216225
Fare 0.012658 0.257307 -0.549500 0.096067 0.159651 0.216225 1.000000

df['Age'].value_counts()#how many times each distinct value occurs, most frequent first

24.00 30
22.00 27
18.00 26
19.00 25
30.00 25
28.00 25
21.00 24
25.00 23
36.00 22
29.00 20
32.00 18
27.00 18
35.00 18
26.00 18
16.00 17
31.00 17
20.00 15
33.00 15
23.00 15
34.00 15
39.00 14
17.00 13
42.00 13
40.00 13
45.00 12
38.00 11
50.00 10
2.00 10
4.00 10
47.00 9
…
71.00 2
59.00 2
63.00 2
0.83 2
30.50 2
70.00 2
57.00 2
0.75 2
13.00 2
10.00 2
64.00 2
40.50 2
32.50 2
45.50 2
20.50 1
24.50 1
0.67 1
14.50 1
0.92 1
74.00 1
34.50 1
80.00 1
12.00 1
36.50 1
53.00 1
55.50 1
70.50 1
66.00 1
23.50 1
0.42 1
Name: Age, Length: 88, dtype: int64

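df['Age'].value_counts(ascending=True,bins=5)#bins=5 groups the ages into five equal-width intervals; ascending=True sorts the counts from smallest to largest, False from largest to smallest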

(64.084, 80.0] 11
(48.168, 64.084] 69
(0.339, 16.336] 100
(32.252, 48.168] 188
(16.336, 32.252] 346
Name: Age, dtype: int64
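The help text for value_counts describes bins as a convenience for pd.cut. The sketch below checks that claim on a made-up list of ages: it bins them once through value_counts(bins=...) and once through pd.cut followed by value_counts, then compares the counts. The interval labels of the first bin can differ slightly between the two routes, so only the counts are compared.

```python
import pandas as pd

ages = pd.Series([1, 5, 20, 22, 25, 40, 50, 80])

# Route 1: let value_counts do the binning.
binned = ages.value_counts(bins=4, sort=False)

# Route 2: bin explicitly with pd.cut, then count each interval.
manual = pd.cut(ages, bins=4).value_counts(sort=False)

print(binned)
print(manual)
print(sorted(binned.tolist()) == sorted(manual.tolist()))
```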

print(help(pd.value_counts))#help() prints the text itself and returns None, which is why None appears at the end of the output

Help on function value_counts in module pandas.core.algorithms:

value_counts(values, sort=True, ascending=False, normalize=False, bins=None, dropna=True)
Compute a histogram of the counts of non-null values.

Parameters
----------
values : ndarray (1-d)
sort : boolean, default True
    Sort by values
ascending : boolean, default False
    Sort in ascending order
normalize: boolean, default False
    If True then compute a relative histogram
bins : integer, optional
    Rather than count values, group them into half-open bins,
    convenience for pd.cut, only works with numeric data
dropna : boolean, default True
    Don't include counts of NaN

Returns
-------
value_counts : Series

None

df['Age'].count()#number of non-null values

714

*Series: adding, deleting, modifying and querying

import pandas as pd

data=[10,11,12]

index=['a','b','c']

s=pd.Series(data=data,index=index)

a 10
b 11
c 12
dtype: int64

s[0]#lookup by position; recent pandas versions prefer s.iloc[0] when the index holds labels

s[0:2]

a 10
b 11
dtype: int64

mask=[True,False,True]

s[mask]

a 10
c 12
dtype: int64

s.loc['b']

s.iloc[1]

**Modify

s1=s.copy()

a 10
b 11
c 12
dtype: int64

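s1['a']

s1.replace(to_replace=100,value=101,inplace=False)#inplace=False is the default and returns a new Series; pass inplace=True to modify s1 itself (s1 holds no 100, so nothing changes here)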

a 10
b 11
c 12
dtype: int64

s1.replace(to_replace=100,value=101,inplace=True)

a 10
b 11
c 12
dtype: int64

s1.index

Index(['a', 'b', 'c'], dtype='object')

s1.index=['a','b','c']

a 10
b 11
c 12
dtype: int64

s1.rename(index={'a':'A'},inplace=True)#renames the index label 'a' to 'A'

A 10
b 11
c 12
dtype: int64

s2=pd.Series([100,500],index=['g','h'])

g 100
h 500
dtype: int64

**Add

data=[100,101]

index=['h','k']

s2=pd.Series(data=data,index=index)

h 100
k 101
dtype: int64

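s1.append(s2)#returns a new Series and leaves s1 unchanged; Series.append was removed in pandas 2.0, where pd.concat([s1,s2]) does the same job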

j 500
h 100
k 101
dtype: int64

s1['j']=500#assigning to a new label adds an element in place

j 500
dtype: int64

s1.append(s2,ignore_index=False)

A 10
b 11
c 12
j 500
h 100
k 101
dtype: int64

s1.append(s2,ignore_index=True)

0 10
1 11
2 12
3 500
4 100
5 101
dtype: int64

**Delete

s1.append(s2)

j 500
h 100
k 101
dtype: int64

del s1['j']

Series([], dtype: int64)

s1.drop(['j'],inplace=True)

Series([], dtype: int64)
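The add and delete steps above rely on Series.append, which recent pandas releases no longer provide. A sketch of the same operations written with pd.concat and drop, using values similar to the ones in this section:

```python
import pandas as pd

s1 = pd.Series([10, 11, 12], index=['A', 'b', 'c'])
s2 = pd.Series([100, 101], index=['h', 'k'])

kept = pd.concat([s1, s2])                            # original labels are kept
renumbered = pd.concat([s1, s2], ignore_index=True)   # labels replaced by 0..n-1

trimmed = kept.drop(['h'])    # returns a new Series without label 'h'
print(kept.index.tolist(), renumbered.index.tolist(), trimmed.index.tolist())
```

Neither concat nor drop (without inplace=True) changes its inputs, so s1 still has three elements afterwards.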

*DataFrame: adding, deleting, modifying and querying

data=[[1,2,3],[4,5,6]]

index=['a','b']

columns=['A','B','C']

df=pd.DataFrame(data=data,index=index,columns=columns)

A 	B 	C

a 1 2 3
b 4 5 6

df['A']#query a column

a 1
b 4
Name: A, dtype: int64

import pandas as pd

df.iloc[0]

A 1
B 2
C 3
Name: a, dtype: int64

df.loc['a']

A 1
B 2
C 3
Name: a, dtype: int64

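**Modify

df.loc['a','A']

df.loc['a','A']=150#one .loc call with row and column; the chained form df.loc['a']['A']=150 may assign to a temporary copy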

A 	B 	C

a 150 2 3
b 4 5 6

df.index=['f','g']

A 	B 	C

f 150 2 3
g 4 5 6
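When changing a single cell, addressing the row and the column in one indexing call makes sure the assignment reaches the original frame. A sketch with the same 2x3 frame as above. `.at` is shown as the scalar-only counterpart of `.loc`:

```python
import pandas as pd

frame = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                     index=['a', 'b'], columns=['A', 'B', 'C'])

frame.loc['a', 'A'] = 150   # row label and column label in one call
frame.at['b', 'C'] = 60     # .at reads or writes exactly one cell

# Relabelling the rows replaces the whole index in place.
frame.index = ['f', 'g']
print(frame)
```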

**Add

df.loc['c']=[1,2,3]#assigning to a new row label appends a row

A 	B 	C

f 150 2 3
g 4 5 6
lc 1 2 3
c 1 2 3

data=[[1,2,3],[4,5,6]]

index=['j','k']

columns=['A','B','C']

df2=pd.DataFrame(data=data,index=index,columns=columns)

df2

A 	B 	C

j 1 2 3
k 4 5 6

df3=pd.concat([df,df2],axis=1)

df3

A 	B 	C 	A 	B 	C

c 1.0 2.0 3.0 NaN NaN NaN
f 150.0 2.0 3.0 NaN NaN NaN
g 4.0 5.0 6.0 NaN NaN NaN
j NaN NaN NaN 1.0 2.0 3.0
k NaN NaN NaN 4.0 5.0 6.0
lc 1.0 2.0 3.0 NaN NaN NaN

df2['cui']=[10,11]#assigning to a new column name adds a column

df2

A 	B 	C 	cui

j 1 2 3 10
k 4 5 6 11

df4=pd.DataFrame([[10,11],[12,13]],index=['j','k'],columns=['D','E'])

df4

D 	E

j 10 11
k 12 13

df5=pd.concat([df2,df4],axis=1)

df5

A 	B 	C 	cui 	D 	E

j 1 2 3 10 10 11
k 4 5 6 11 12 13
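The two concat results above look very different: df3 is full of NaN while df5 is not. The reason is that axis=1 aligns the frames on their row labels. A sketch with three made-up frames that shows stacking rows, adding columns with matching labels, and what happens when the labels do not match:

```python
import pandas as pd

top = pd.DataFrame([[1, 2], [3, 4]], index=['j', 'k'], columns=['A', 'B'])
bottom = pd.DataFrame([[5, 6]], index=['m'], columns=['A', 'B'])
side = pd.DataFrame([[10], [12]], index=['j', 'k'], columns=['D'])

stacked = pd.concat([top, bottom], axis=0)      # more rows, same columns
widened = pd.concat([top, side], axis=1)        # same row labels, more columns
misaligned = pd.concat([top, bottom], axis=1)   # row labels differ, gaps become NaN

print(stacked.shape, widened.shape, misaligned.shape)
print(misaligned.isna().sum().sum())            # number of cells filled with NaN
```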

**Delete

df5.drop(['j'],axis=0,inplace=True)#axis=0 drops a row by label

df5

A 	B 	C 	cui 	D 	E

k 4 5 6 11 12 13

del df5['cui']#del removes a column

df5

A 	B 	C 	D 	E

k 4 5 6 12 13

**merge

import pandas as pd

left=pd.DataFrame({'key':['k0','k1','k2','k3'],

               'A':['A0','A1','A2','A3'],

               'B':['B0','B1','B2','B3'],

})

right=pd.DataFrame({'key':['k0','k1','k2','k3'],

                  'C':['C0','C1','C2','C3'],

               'D':['D0','D1','D2','D3']

               })

left

A 	B 	key

0 A0 B0 k0
1 A1 B1 k1
2 A2 B2 k2
3 A3 B3 k3

right

C 	D 	key

0 C0 D0 k0
1 C1 D1 k1
2 C2 D2 k2
3 C3 D3 k3

pd.merge(left,right)

A 	B 	key 	C 	D

0 A0 B0 k0 C0 D0
1 A1 B1 k1 C1 D1
2 A2 B2 k2 C2 D2
3 A3 B3 k3 C3 D3

pd.merge(left,right,on='key')

A 	B 	key 	C 	D

0 A0 B0 k0 C0 D0
1 A1 B1 k1 C1 D1
2 A2 B2 k2 C2 D2
3 A3 B3 k3 C3 D3

left=pd.DataFrame({'key1':['k10','k11','k12','k13'],

                'key2':['k20','k21','k22','k23'],

               'A':['A0','A1','A2','A3'],

               'B':['B0','B1','B2','B3'],

})

right=pd.DataFrame({'key1':['k10','k11','k12','k13'],

                'key2':['k20','k21','k22','k23'],

                  'C':['C0','C1','C2','C3'],

               'D':['D0','D1','D2','D3'],

               })

left

A 	B 	key1 	key2

0 A0 B0 k10 k20
1 A1 B1 k11 k21
2 A2 B2 k12 k22
3 A3 B3 k13 k23

right

C 	D 	key1 	key2

0 C0 D0 k10 k20
1 C1 D1 k11 k21
2 C2 D2 k12 k22
3 C3 D3 k13 k23

pd.merge(left,right,on=['key1','key2'],how='outer')

A 	B 	key1 	key2 	C 	D

0 A0 B0 k10 k20 C0 D0
1 A1 B1 k11 k21 C1 D1
2 A2 B2 k12 k22 C2 D2
3 A3 B3 k13 k23 C3 D3

#Pass both keys to on; indicator=True adds a _merge column that records which table each row came from

pd.merge(left,right,on=['key1','key2'],how='outer',indicator=True)

A 	B 	key1 	key2 	C 	D 	_merge

0 A0 B0 k10 k20 C0 D0 both
1 A1 B1 k11 k21 C1 D1 both
2 A2 B2 k12 k22 C2 D2 both
3 A3 B3 k13 k23 C3 D3 both
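In the examples above every key exists in both tables, so inner and outer merges give the same rows and _merge is always "both". The sketch below uses two made-up tables whose keys only partly overlap, which is where `how` and `indicator` start to matter:

```python
import pandas as pd

left = pd.DataFrame({'key': ['k0', 'k1', 'k2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['k1', 'k2', 'k3'], 'C': ['C1', 'C2', 'C3']})

inner = pd.merge(left, right, on='key')   # default how='inner': shared keys only
outer = pd.merge(left, right, on='key', how='outer', indicator=True)

# Map each key to the origin recorded in the _merge column.
origin = outer.set_index('key')['_merge'].astype(str).to_dict()
print(len(inner), len(outer))
print(origin)
```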

#Display settings

import pandas as pd

pd.get_option('display.max_rows')

pd.Series(index=range(0,100))

0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
19 NaN
20 NaN
21 NaN
22 NaN
23 NaN
24 NaN
25 NaN
26 NaN
27 NaN
28 NaN
29 NaN
…
70 NaN
71 NaN
72 NaN
73 NaN
74 NaN
75 NaN
76 NaN
77 NaN
78 NaN
79 NaN
80 NaN
81 NaN
82 NaN
83 NaN
84 NaN
85 NaN
86 NaN
87 NaN
88 NaN
89 NaN
90 NaN
91 NaN
92 NaN
93 NaN
94 NaN
95 NaN
96 NaN
97 NaN
98 NaN
99 NaN
Length: 100, dtype: float64

pd.get_option('display.max_columns')

pd.DataFrame(columns=range(0,30))

0 	1 	2 	3 	4 	5 	6 	7 	8 	9 	... 	20 	21 	22 	23 	24 	25 	26 	27 	28 	29

0 rows × 30 columns

pd.set_option('display.max_columns',30)

pd.Series(index=['A'],data=['t'*70])

A tttttttttttttttttttttttttttttttttttttttttttttt…
dtype: object

pd.get_option('display.precision')

pd.set_option('display.precision',5)

pd.Series(data=[1.23456789123456])

0 1.23457
dtype: float64
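set_option changes a display setting for the rest of the session. When a setting is needed only for one printout, pd.option_context applies it inside a `with` block and restores the previous value afterwards. A minimal sketch:

```python
import pandas as pd

before = pd.get_option('display.precision')

with pd.option_context('display.precision', 2):
    inside = pd.get_option('display.precision')
    text = repr(pd.Series([1.23456789]))   # rendered with two decimals

after = pd.get_option('display.precision')  # back to the earlier value
print(before, inside, after)
print(text)
```

Only the printed representation changes. The stored value keeps its full precision.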

*pivot

#Pivot tables

import pandas as pd

example=pd.DataFrame({'Month': ["January", "January", "January", "January",

                              "February", "February", "February", "February", 

                              "March", "March", "March", "March"],

                     'Category': ["Transportation", "Grocery", "Household", "Entertainment",

                            "Transportation", "Grocery", "Household", "Entertainment",

                            "Transportation", "Grocery", "Household", "Entertainment"],

               'Amount': [74., 235., 175., 100., 115., 240., 225., 125., 90., 260., 200., 120.]

                    })

example

Amount 	Category 	Month

0 74.0 Transportation January
1 235.0 Grocery January
2 175.0 Household January
3 100.0 Entertainment January
4 115.0 Transportation February
5 240.0 Grocery February
6 225.0 Household February
7 125.0 Entertainment February
8 90.0 Transportation March
9 260.0 Grocery March
10 200.0 Household March
11 120.0 Entertainment March

example_pivot=example.pivot(index='Category',columns='Month',values='Amount')

example_pivot

Month February January March
Category
Entertainment 125.0 100.0 120.0
Grocery 240.0 235.0 260.0
Household 225.0 175.0 200.0
Transportation 115.0 74.0 90.0

example_pivot.sum(axis=1)

Category
Entertainment 345.0
Grocery 735.0
Household 600.0
Transportation 279.0
dtype: float64

example_pivot.sum(axis=0)

Month
February 705.0
January 584.0
March 670.0
dtype: float64

df=pd.read_csv(r'E:\pyhon\pandas\Pandas%E4%BB%A3%E7%A0%81\data\titanic.csv')

#The default aggregation of pivot_table is the mean

df.pivot_table(index='Sex',columns='Pclass',values='Fare')

Pclass 1 2 3
Sex
female 106.12580 21.97012 16.11881
male 67.22613 19.74178 12.66163

df.pivot_table(index='Sex',columns='Pclass',values='Fare',aggfunc='max')

Pclass 1 2 3
Sex
female 512.3292 65.0 69.55
male 512.3292 73.5 69.55

df.pivot_table(index='Sex',columns='Pclass',values='Fare',aggfunc='count')

Pclass 1 2 3
Sex
female 94 76 144
male 122 108 347

pd.crosstab(index=df['Sex'],columns=df['Pclass'])#counts the rows for each Sex/Pclass combination

Pclass 1 2 3
Sex
female 94 76 144
male 122 108 347
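pivot worked on the expenses table because every Category/Month pair occurs exactly once. The Titanic data has many passengers per Sex/Pclass pair, which is why pivot_table (which aggregates) is used there. A sketch on a made-up five-row frame showing the difference, with crosstab for the counts:

```python
import pandas as pd

t = pd.DataFrame({'Sex': ['male', 'male', 'female', 'female', 'male'],
                  'Pclass': [1, 1, 1, 3, 3],
                  'Fare': [50.0, 70.0, 100.0, 10.0, 8.0]})

# pivot_table aggregates duplicates; the default aggregation is the mean.
mean_fare = t.pivot_table(index='Sex', columns='Pclass', values='Fare')

# crosstab counts how many rows fall into each combination.
counts = pd.crosstab(index=t['Sex'], columns=t['Pclass'])

# pivot only reshapes, so the two (male, 1) rows make it fail.
try:
    t.pivot(index='Sex', columns='Pclass', values='Fare')
    pivot_failed = False
except ValueError:
    pivot_failed = True

print(mean_fare)
print(counts)
print(pivot_failed)
```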

df['underaged']=df['Age']<=18#new boolean column: True for passengers aged 18 or under

#Working with dates and times

import datetime

dt=datetime.datetime(year=2019,month=4,day=3,hour=15,minute=30)

datetime.datetime(2019, 4, 3, 15, 30)

print(dt)

2019-04-03 15:30:00

import pandas as pd

ts=pd.Timestamp('2019-04-03')

Timestamp('2019-04-03 00:00:00')

ts.month

ts.day

#Date arithmetic

ts+pd.Timedelta('5days')

Timestamp('2019-04-08 00:00:00')

pd.to_datetime('2019-4-3')

Timestamp('2019-04-03 00:00:00')

pd.to_datetime('3/4/2019')#read as month/day/year, so this is March 4

Timestamp('2019-03-04 00:00:00')
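A string such as 3/4/2019 is ambiguous: pandas reads it month first by default, giving March 4 rather than April 3. If the data is day first, that can be stated with `dayfirst=True` or with an explicit `format`. A short sketch:

```python
import pandas as pd

month_first = pd.to_datetime('3/4/2019')                    # default: month/day/year
day_first = pd.to_datetime('3/4/2019', dayfirst=True)       # day/month/year
explicit = pd.to_datetime('3/4/2019', format='%d/%m/%Y')    # format spelled out

print(month_first.month, month_first.day)
print(day_first.month, day_first.day)
print(explicit == day_first)
```

Passing an explicit format is the least surprising choice when the layout of the input is known.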

s=pd.Series(['2019-04-03 00:00:00','2019-04-03 00:00:00','2019-04-03 00:00:00'])

0 2019-04-03 00:00:00
1 2019-04-03 00:00:00
2 2019-04-03 00:00:00
dtype: object

ts=pd.to_datetime(s)

0 2019-04-03
1 2019-04-03
2 2019-04-03
dtype: datetime64[ns]

ts.dt.hour

0 0
1 0
2 0
dtype: int64

ts.dt.weekday

0 2
1 2
2 2
dtype: int64

pd.Series(pd.date_range(start='2019-4-3',periods=10,freq='12H'))

0 2019-04-03 00:00:00
1 2019-04-03 12:00:00
2 2019-04-04 00:00:00
3 2019-04-04 12:00:00
4 2019-04-05 00:00:00
5 2019-04-05 12:00:00
6 2019-04-06 00:00:00
7 2019-04-06 12:00:00
8 2019-04-07 00:00:00
9 2019-04-07 12:00:00
dtype: datetime64[ns]

# Raw string, so that \f in the Windows path is not read as a form feed
data=pd.read_csv(r'E:\pyhon\pandas\Pandas%E4%BB%A3%E7%A0%81\data\flowdata.csv')

data

Time 	L06_347 	LS06_347 	LS06_348

0 2009-01-01 00:00:00 0.13742 0.09750 0.01683
1 2009-01-01 03:00:00 0.13125 0.08883 0.01642
2 2009-01-01 06:00:00 0.11350 0.09125 0.01675
3 2009-01-01 09:00:00 0.13575 0.09150 0.01625
4 2009-01-01 12:00:00 0.14092 0.09617 0.01700
5 2009-01-01 15:00:00 0.09917 0.09167 0.01758
6 2009-01-01 18:00:00 0.13267 0.09017 0.01625
7 2009-01-01 21:00:00 0.10942 0.09117 0.01600
8 2009-01-02 00:00:00 0.13383 0.09042 0.01608
9 2009-01-02 03:00:00 0.09208 0.08867 0.01600
10 2009-01-02 06:00:00 0.11292 0.09142 0.01633
11 2009-01-02 09:00:00 0.14192 0.09708 0.01642
12 2009-01-02 12:00:00 0.14783 0.10192 0.01642
13 2009-01-02 15:00:00 0.10792 0.10025 0.01642
14 2009-01-02 18:00:00 0.14358 0.09842 0.01675
15 2009-01-02 21:00:00 0.11308 0.09808 0.01683
16 2009-01-03 00:00:00 0.13583 0.09217 0.01683
17 2009-01-03 03:00:00 0.08325 0.08000 0.01608
18 2009-01-03 06:00:00 0.11942 0.08025 0.01542
19 2009-01-03 09:00:00 0.12458 0.08442 0.01583
20 2009-01-03 12:00:00 0.09167 0.08825 0.01625
21 2009-01-03 15:00:00 0.12500 0.08467 0.01650
22 2009-01-03 18:00:00 0.12158 0.08208 0.01583
23 2009-01-03 21:00:00 0.10717 0.09250 0.01600
24 2009-01-04 00:00:00 0.13525 0.09117 0.01633
25 2009-01-04 03:00:00 0.13558 0.09158 0.01608
26 2009-01-04 06:00:00 0.11717 0.09517 0.01600
27 2009-01-04 09:00:00 0.10900 0.10517 0.01800
28 2009-01-04 12:00:00 0.15742 0.11075 0.01842
29 2009-01-04 15:00:00 0.16042 0.11375 0.01842
… … … … …
11667 2012-12-29 09:00:00 0.78683 0.78683 0.07700
11668 2012-12-29 12:00:00 0.72375 0.72375 0.07267
11669 2012-12-29 15:00:00 0.69067 0.69067 0.06967
11670 2012-12-29 18:00:00 0.66342 0.66342 0.06967
11671 2012-12-29 21:00:00 0.73592 0.73592 0.07283
11672 2012-12-30 00:00:00 0.75367 0.75367 0.06183
11673 2012-12-30 03:00:00 0.66333 0.66333 0.07367
11674 2012-12-30 06:00:00 0.79683 0.79683 0.09517
11675 2012-12-30 09:00:00 0.91600 0.91600 0.10158
11676 2012-12-30 12:00:00 1.46500 1.46500 0.08683
11677 2012-12-30 15:00:00 1.31417 1.31417 0.08542
11678 2012-12-30 18:00:00 1.23917 1.23917 0.09808
11679 2012-12-30 21:00:00 1.06975 1.06975 0.10142
11680 2012-12-31 00:00:00 0.97333 0.97333 0.08500
11681 2012-12-31 03:00:00 0.85083 0.85083 0.07392
11682 2012-12-31 06:00:00 0.73592 0.73592 0.06942
11683 2012-12-31 09:00:00 0.68275 0.68275 0.06658
11684 2012-12-31 12:00:00 0.65125 0.65125 0.06383
11685 2012-12-31 15:00:00 0.62900 0.62900 0.06183
11686 2012-12-31 18:00:00 0.61733 0.61733 0.06058
11687 2012-12-31 21:00:00 0.84650 0.84650 0.17017
11688 2013-01-01 00:00:00 1.68833 1.68833 0.20733
11689 2013-01-01 03:00:00 2.69333 2.69333 0.20150
11690 2013-01-01 06:00:00 2.22083 2.22083 0.16692
11691 2013-01-01 09:00:00 2.05500 2.05500 0.17567
11692 2013-01-01 12:00:00 1.71000 1.71000 0.12958
11693 2013-01-01 15:00:00 1.42000 1.42000 0.09633
11694 2013-01-01 18:00:00 1.17858 1.17858 0.08308
11695 2013-01-01 21:00:00 0.89825 0.89825 0.07717
11696 2013-01-02 00:00:00 0.86000 0.86000 0.07500

11697 rows × 4 columns

data.head()

Time 	L06_347 	LS06_347 	LS06_348

data['Time']=pd.to_datetime(data['Time'])

data=data.set_index('Time')

data

L06_347 	LS06_347 	LS06_348

Time
2009-01-01 00:00:00 0.13742 0.09750 0.01683
2009-01-01 03:00:00 0.13125 0.08883 0.01642
2009-01-01 06:00:00 0.11350 0.09125 0.01675
2009-01-01 09:00:00 0.13575 0.09150 0.01625
2009-01-01 12:00:00 0.14092 0.09617 0.01700
2009-01-01 15:00:00 0.09917 0.09167 0.01758
2009-01-01 18:00:00 0.13267 0.09017 0.01625
2009-01-01 21:00:00 0.10942 0.09117 0.01600
2009-01-02 00:00:00 0.13383 0.09042 0.01608
2009-01-02 03:00:00 0.09208 0.08867 0.01600
2009-01-02 06:00:00 0.11292 0.09142 0.01633
2009-01-02 09:00:00 0.14192 0.09708 0.01642
2009-01-02 12:00:00 0.14783 0.10192 0.01642
2009-01-02 15:00:00 0.10792 0.10025 0.01642
2009-01-02 18:00:00 0.14358 0.09842 0.01675
2009-01-02 21:00:00 0.11308 0.09808 0.01683
2009-01-03 00:00:00 0.13583 0.09217 0.01683
2009-01-03 03:00:00 0.08325 0.08000 0.01608
2009-01-03 06:00:00 0.11942 0.08025 0.01542
2009-01-03 09:00:00 0.12458 0.08442 0.01583
2009-01-03 12:00:00 0.09167 0.08825 0.01625
2009-01-03 15:00:00 0.12500 0.08467 0.01650
2009-01-03 18:00:00 0.12158 0.08208 0.01583
2009-01-03 21:00:00 0.10717 0.09250 0.01600
2009-01-04 00:00:00 0.13525 0.09117 0.01633
2009-01-04 03:00:00 0.13558 0.09158 0.01608
2009-01-04 06:00:00 0.11717 0.09517 0.01600
2009-01-04 09:00:00 0.10900 0.10517 0.01800
2009-01-04 12:00:00 0.15742 0.11075 0.01842
2009-01-04 15:00:00 0.16042 0.11375 0.01842
… … … …
2012-12-29 09:00:00 0.78683 0.78683 0.07700
2012-12-29 12:00:00 0.72375 0.72375 0.07267
2012-12-29 15:00:00 0.69067 0.69067 0.06967
2012-12-29 18:00:00 0.66342 0.66342 0.06967
2012-12-29 21:00:00 0.73592 0.73592 0.07283
2012-12-30 00:00:00 0.75367 0.75367 0.06183
2012-12-30 03:00:00 0.66333 0.66333 0.07367
2012-12-30 06:00:00 0.79683 0.79683 0.09517
2012-12-30 09:00:00 0.91600 0.91600 0.10158
2012-12-30 12:00:00 1.46500 1.46500 0.08683
2012-12-30 15:00:00 1.31417 1.31417 0.08542
2012-12-30 18:00:00 1.23917 1.23917 0.09808
2012-12-30 21:00:00 1.06975 1.06975 0.10142
2012-12-31 00:00:00 0.97333 0.97333 0.08500
2012-12-31 03:00:00 0.85083 0.85083 0.07392
2012-12-31 06:00:00 0.73592 0.73592 0.06942
2012-12-31 09:00:00 0.68275 0.68275 0.06658
2012-12-31 12:00:00 0.65125 0.65125 0.06383
2012-12-31 15:00:00 0.62900 0.62900 0.06183
2012-12-31 18:00:00 0.61733 0.61733 0.06058
2012-12-31 21:00:00 0.84650 0.84650 0.17017
2013-01-01 00:00:00 1.68833 1.68833 0.20733
2013-01-01 03:00:00 2.69333 2.69333 0.20150
2013-01-01 06:00:00 2.22083 2.22083 0.16692
2013-01-01 09:00:00 2.05500 2.05500 0.17567
2013-01-01 12:00:00 1.71000 1.71000 0.12958
2013-01-01 15:00:00 1.42000 1.42000 0.09633
2013-01-01 18:00:00 1.17858 1.17858 0.08308
2013-01-01 21:00:00 0.89825 0.89825 0.07717
2013-01-02 00:00:00 0.86000 0.86000 0.07500

11697 rows × 3 columns

data.index

DatetimeIndex(['2009-01-01 00:00:00', '2009-01-01 03:00:00',
'2009-01-01 06:00:00', '2009-01-01 09:00:00',
'2009-01-01 12:00:00', '2009-01-01 15:00:00',
'2009-01-01 18:00:00', '2009-01-01 21:00:00',
'2009-01-02 00:00:00', '2009-01-02 03:00:00',
…
'2012-12-31 21:00:00', '2013-01-01 00:00:00',
'2013-01-01 03:00:00', '2013-01-01 06:00:00',
'2013-01-01 09:00:00', '2013-01-01 12:00:00',
'2013-01-01 15:00:00', '2013-01-01 18:00:00',
'2013-01-01 21:00:00', '2013-01-02 00:00:00'],
dtype='datetime64[ns]', name='Time', length=11697, freq=None)

data=pd.read_csv(r'E:\pyhon\pandas\Pandas%E4%BB%A3%E7%A0%81\data\flowdata.csv',index_col=0,parse_dates=True)

data.head()

L06_347 	LS06_347 	LS06_348
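Passing `index_col=0,parse_dates=True` does the `to_datetime` and `set_index` steps at load time. A self-contained sketch, assuming a CSV laid out like flowdata.csv; the three rows are copied from the first rows printed earlier, and `csv_text` and `sample` are names made up for this illustration.

```python
import io
import pandas as pd

# Three rows in the same layout as flowdata.csv, held in memory instead of on disk
csv_text = """Time,L06_347,LS06_347,LS06_348
2009-01-01 00:00:00,0.13742,0.09750,0.01683
2009-01-01 03:00:00,0.13125,0.08883,0.01642
2009-01-01 06:00:00,0.11350,0.09125,0.01675
"""

# index_col=0 makes the first column the index; parse_dates=True parses that index as dates
sample = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(type(sample.index).__name__)

# With a DatetimeIndex, rows can be sliced by date strings
print(sample.loc['2009-01-01 03:00':'2009-01-01 06:00'])
```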

# Using time as the index

data[pd.Timestamp('2019-04-03 09:00'):pd.Timestamp('2019-04-03 19:00')]

L06_347 	LS06_347 	LS06_348

Time

data.tail(10)

L06_347 	LS06_347 	LS06_348

Time
2012-12-31 21:00:00 0.84650 0.84650 0.17017
2013-01-01 00:00:00 1.68833 1.68833 0.20733
2013-01-01 03:00:00 2.69333 2.69333 0.20150
2013-01-01 06:00:00 2.22083 2.22083 0.16692
2013-01-01 09:00:00 2.05500 2.05500 0.17567
2013-01-01 12:00:00 1.71000 1.71000 0.12958
2013-01-01 15:00:00 1.42000 1.42000 0.09633
2013-01-01 18:00:00 1.17858 1.17858 0.08308
2013-01-01 21:00:00 0.89825 0.89825 0.07717
2013-01-02 00:00:00 0.86000 0.86000 0.07500

# Select a whole year by partial string; pandas 2.x requires .loc for row selection
data.loc['2013']

L06_347 	LS06_347 	LS06_348

Time
2013-01-01 00:00:00 1.68833 1.68833 0.20733
2013-01-01 03:00:00 2.69333 2.69333 0.20150
2013-01-01 06:00:00 2.22083 2.22083 0.16692
2013-01-01 09:00:00 2.05500 2.05500 0.17567
2013-01-01 12:00:00 1.71000 1.71000 0.12958
2013-01-01 15:00:00 1.42000 1.42000 0.09633
2013-01-01 18:00:00 1.17858 1.17858 0.08308
2013-01-01 21:00:00 0.89825 0.89825 0.07717
2013-01-02 00:00:00 0.86000 0.86000 0.07500

data['2019-04-03':'2019-05']

L06_347 	LS06_347 	LS06_348

Time

data[data.index.month==1]

L06_347 	LS06_347 	LS06_348

Time
2009-01-01 00:00:00 0.13742 0.09750 0.01683
2009-01-01 03:00:00 0.13125 0.08883 0.01642
2009-01-01 06:00:00 0.11350 0.09125 0.01675
2009-01-01 09:00:00 0.13575 0.09150 0.01625
2009-01-01 12:00:00 0.14092 0.09617 0.01700
2009-01-01 15:00:00 0.09917 0.09167 0.01758
2009-01-01 18:00:00 0.13267 0.09017 0.01625
2009-01-01 21:00:00 0.10942 0.09117 0.01600
2009-01-02 00:00:00 0.13383 0.09042 0.01608
2009-01-02 03:00:00 0.09208 0.08867 0.01600
2009-01-02 06:00:00 0.11292 0.09142 0.01633
2009-01-02 09:00:00 0.14192 0.09708 0.01642
2009-01-02 12:00:00 0.14783 0.10192 0.01642
2009-01-02 15:00:00 0.10792 0.10025 0.01642
2009-01-02 18:00:00 0.14358 0.09842 0.01675
2009-01-02 21:00:00 0.11308 0.09808 0.01683
2009-01-03 00:00:00 0.13583 0.09217 0.01683
2009-01-03 03:00:00 0.08325 0.08000 0.01608
2009-01-03 06:00:00 0.11942 0.08025 0.01542
2009-01-03 09:00:00 0.12458 0.08442 0.01583
2009-01-03 12:00:00 0.09167 0.08825 0.01625
2009-01-03 15:00:00 0.12500 0.08467 0.01650
2009-01-03 18:00:00 0.12158 0.08208 0.01583
2009-01-03 21:00:00 0.10717 0.09250 0.01600
2009-01-04 00:00:00 0.13525 0.09117 0.01633
2009-01-04 03:00:00 0.13558 0.09158 0.01608
2009-01-04 06:00:00 0.11717 0.09517 0.01600
2009-01-04 09:00:00 0.10900 0.10517 0.01800
2009-01-04 12:00:00 0.15742 0.11075 0.01842
2009-01-04 15:00:00 0.16042 0.11375 0.01842
… … … …
2012-01-29 09:00:00 0.29683 0.31583 0.03475
2012-01-29 12:00:00 0.29400 0.31192 0.03433
2012-01-29 15:00:00 0.26950 0.30800 0.03300
2012-01-29 18:00:00 0.25942 0.30442 0.03183
2012-01-29 21:00:00 0.25458 0.29625 0.03133
2012-01-30 00:00:00 0.24350 0.28733 0.03092
2012-01-30 03:00:00 0.23625 0.28167 0.03025
2012-01-30 06:00:00 0.23033 0.27217 0.02942
2012-01-30 09:00:00 0.22183 0.26325 0.02783
2012-01-30 12:00:00 0.22425 0.26258 0.02925
2012-01-30 15:00:00 0.20600 0.25675 0.02892
2012-01-30 18:00:00 0.20042 0.25842 0.02825
2012-01-30 21:00:00 0.19275 0.25108 0.02725
2012-01-31 00:00:00 0.19125 0.24742 0.02592
2012-01-31 03:00:00 0.18108 0.24158 0.02583
2012-01-31 06:00:00 0.18875 0.23675 0.02600
2012-01-31 09:00:00 0.19100 0.23125 0.02558
2012-01-31 12:00:00 0.18333 0.22717 0.02592
2012-01-31 15:00:00 0.16342 0.22100 0.02375
2012-01-31 18:00:00 0.15708 0.22067 0.02317
2012-01-31 21:00:00 0.16008 0.21475 0.02333
2013-01-01 00:00:00 1.68833 1.68833 0.20733
2013-01-01 03:00:00 2.69333 2.69333 0.20150
2013-01-01 06:00:00 2.22083 2.22083 0.16692
2013-01-01 09:00:00 2.05500 2.05500 0.17567
2013-01-01 12:00:00 1.71000 1.71000 0.12958
2013-01-01 15:00:00 1.42000 1.42000 0.09633
2013-01-01 18:00:00 1.17858 1.17858 0.08308
2013-01-01 21:00:00 0.89825 0.89825 0.07717
2013-01-02 00:00:00 0.86000 0.86000 0.07500

1001 rows × 3 columns

data[(data.index.hour>8)&(data.index.hour<12)]

L06_347 	LS06_347 	LS06_348

Time
2009-01-01 09:00:00 0.13575 0.09150 0.01625
2009-01-02 09:00:00 0.14192 0.09708 0.01642
2009-01-03 09:00:00 0.12458 0.08442 0.01583
2009-01-04 09:00:00 0.10900 0.10517 0.01800
2009-01-05 09:00:00 0.16150 0.11458 0.02158
2009-01-06 09:00:00 0.10008 0.06558 0.01550
2009-01-07 09:00:00 0.13850 0.09392 0.01500
2009-01-08 09:00:00 0.10133 0.06642 0.01683
2009-01-09 09:00:00 0.06175 0.05942 0.01517
2009-01-10 09:00:00 0.19350 0.14700 0.01300
2009-01-11 09:00:00 0.08025 0.07742 0.01358
2009-01-12 09:00:00 0.13250 0.08917 0.01683
2009-01-13 09:00:00 0.19650 0.19267 0.04533
2009-01-14 09:00:00 0.32292 0.29925 0.02933
2009-01-15 09:00:00 0.21075 0.16750 0.02500
2009-01-16 09:00:00 0.15783 0.15392 0.02300
2009-01-17 09:00:00 0.21867 0.17333 0.02292
2009-01-18 09:00:00 0.63300 0.74567 0.07700
2009-01-19 09:00:00 1.04217 1.39850 0.13367
2009-01-20 09:00:00 0.75300 0.77300 0.06558
2009-01-21 09:00:00 0.39850 0.39850 0.04250
2009-01-22 09:00:00 0.36242 0.35125 0.03667
2009-01-23 09:00:00 8.23750 8.56000 0.38375
2009-01-24 09:00:00 1.85750 2.35667 0.09975
2009-01-25 09:00:00 0.57558 0.65775 0.05900
2009-01-26 09:00:00 0.30542 0.27992 0.04417
2009-01-27 09:00:00 0.27992 0.27492 0.03250
2009-01-28 09:00:00 0.28708 0.25383 0.03108
2009-01-29 09:00:00 0.26075 0.22183 0.02817
2009-01-30 09:00:00 0.24200 0.20017 0.02475
… … … …
2012-12-03 09:00:00 0.14450 0.14450 0.07467
2012-12-04 09:00:00 0.29208 0.29208 0.04108
2012-12-05 09:00:00 0.77525 0.77525 0.07567
2012-12-06 09:00:00 0.46792 0.46792 0.06075
2012-12-07 09:00:00 0.50983 0.50983 0.09658
2012-12-08 09:00:00 0.45758 0.45758 0.06467
2012-12-09 09:00:00 0.28875 0.28875 0.05317
2012-12-10 09:00:00 0.28925 0.28925 0.06008
2012-12-11 09:00:00 0.22608 0.22608 0.03783
2012-12-12 09:00:00 0.20133 0.20133 0.03517
2012-12-13 09:00:00 0.17575 0.17575 0.03450
2012-12-14 09:00:00 0.16583 0.16583 0.03542
2012-12-15 09:00:00 0.57683 0.57683 0.06508
2012-12-16 09:00:00 0.38175 0.38175 0.04642
2012-12-17 09:00:00 0.30583 0.30583 0.05092
2012-12-18 09:00:00 0.30217 0.30217 0.07067
2012-12-19 09:00:00 0.28292 0.28292 0.04133
2012-12-20 09:00:00 0.30608 0.30608 0.06825
2012-12-21 09:00:00 0.55033 0.55033 0.05925
2012-12-22 09:00:00 0.37883 0.37883 0.06967
2012-12-23 09:00:00 5.91750 5.91750 0.28658
2012-12-24 09:00:00 1.63833 1.63833 0.15133
2012-12-25 09:00:00 1.71917 1.71917 0.14625
2012-12-26 09:00:00 1.35417 1.35417 0.12758
2012-12-27 09:00:00 1.07667 1.07667 0.10300
2012-12-28 09:00:00 0.96150 0.96150 0.09242
2012-12-29 09:00:00 0.78683 0.78683 0.07700
2012-12-30 09:00:00 0.91600 0.91600 0.10158
2012-12-31 09:00:00 0.68275 0.68275 0.06658
2013-01-01 09:00:00 2.05500 2.05500 0.17567

1462 rows × 3 columns

data.between_time('05:00','12:00')

L06_347 	LS06_347 	LS06_348

Time
2009-01-01 06:00:00 0.11350 0.09125 0.01675
2009-01-01 09:00:00 0.13575 0.09150 0.01625
2009-01-01 12:00:00 0.14092 0.09617 0.01700
2009-01-02 06:00:00 0.11292 0.09142 0.01633
2009-01-02 09:00:00 0.14192 0.09708 0.01642
2009-01-02 12:00:00 0.14783 0.10192 0.01642
2009-01-03 06:00:00 0.11942 0.08025 0.01542
2009-01-03 09:00:00 0.12458 0.08442 0.01583
2009-01-03 12:00:00 0.09167 0.08825 0.01625
2009-01-04 06:00:00 0.11717 0.09517 0.01600
2009-01-04 09:00:00 0.10900 0.10517 0.01800
2009-01-04 12:00:00 0.15742 0.11075 0.01842
2009-01-05 06:00:00 0.14650 0.11517 0.01767
2009-01-05 09:00:00 0.16150 0.11458 0.02158
2009-01-05 12:00:00 0.11567 0.11175 0.02017
2009-01-06 06:00:00 0.09175 0.06825 0.01425
2009-01-06 09:00:00 0.10008 0.06558 0.01550
2009-01-06 12:00:00 0.12267 0.08292 0.01733
2009-01-07 06:00:00 0.12242 0.09333 0.01475
2009-01-07 09:00:00 0.13850 0.09392 0.01500
2009-01-07 12:00:00 0.13925 0.09467 0.01642
2009-01-08 06:00:00 0.10433 0.06875 0.01525
2009-01-08 09:00:00 0.10133 0.06642 0.01683
2009-01-08 12:00:00 0.11517 0.07700 0.01492
2009-01-09 06:00:00 0.06983 0.05192 0.01358
2009-01-09 09:00:00 0.06175 0.05942 0.01517
2009-01-09 12:00:00 0.10467 0.06925 0.01667
2009-01-10 06:00:00 0.13658 0.11342 0.01167
2009-01-10 09:00:00 0.19350 0.14700 0.01300
2009-01-10 12:00:00 0.14708 0.10208 0.01875
… … … …
2012-12-23 06:00:00 6.07917 6.07917 0.41633
2012-12-23 09:00:00 5.91750 5.91750 0.28658
2012-12-23 12:00:00 4.28333 4.28333 0.27575
2012-12-24 06:00:00 2.45167 2.45167 0.18958
2012-12-24 09:00:00 1.63833 1.63833 0.15133
2012-12-24 12:00:00 1.39583 1.39583 0.13075
2012-12-25 06:00:00 1.81083 1.81083 0.24717
2012-12-25 09:00:00 1.71917 1.71917 0.14625
2012-12-25 12:00:00 1.46417 1.46417 0.11942
2012-12-26 06:00:00 1.30583 1.30583 0.16708
2012-12-26 09:00:00 1.35417 1.35417 0.12758
2012-12-26 12:00:00 1.45917 1.45917 0.10833
2012-12-27 06:00:00 1.44333 1.44333 0.10450
2012-12-27 09:00:00 1.07667 1.07667 0.10300
2012-12-27 12:00:00 1.24417 1.24417 0.19542
2012-12-28 06:00:00 1.39417 1.39417 0.09958
2012-12-28 09:00:00 0.96150 0.96150 0.09242
2012-12-28 12:00:00 0.88842 0.88842 0.11592
2012-12-29 06:00:00 0.84583 0.84583 0.08058
2012-12-29 09:00:00 0.78683 0.78683 0.07700
2012-12-29 12:00:00 0.72375 0.72375 0.07267
2012-12-30 06:00:00 0.79683 0.79683 0.09517
2012-12-30 09:00:00 0.91600 0.91600 0.10158
2012-12-30 12:00:00 1.46500 1.46500 0.08683
2012-12-31 06:00:00 0.73592 0.73592 0.06942
2012-12-31 09:00:00 0.68275 0.68275 0.06658
2012-12-31 12:00:00 0.65125 0.65125 0.06383
2013-01-01 06:00:00 2.22083 2.22083 0.16692
2013-01-01 09:00:00 2.05500 2.05500 0.17567
2013-01-01 12:00:00 1.71000 1.71000 0.12958

4386 rows × 3 columns
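The two time-of-day filters above behave differently at the boundaries. A small sketch on synthetic data, assuming a 3-hourly index like the one in flowdata.csv; the values and the names `flow`, `by_mask` and `by_time` are invented for the example.

```python
import numpy as np
import pandas as pd

# Synthetic 3-hourly series covering two days (values are arbitrary)
idx = pd.date_range('2009-01-01', periods=16, freq=pd.Timedelta(hours=3))
flow = pd.Series(np.arange(16.0), index=idx)

# Boolean mask on the hour attribute: strictly between 8 and 12, so only 09:00 qualifies
by_mask = flow[(flow.index.hour > 8) & (flow.index.hour < 12)]

# between_time includes both endpoints by default: 06:00, 09:00 and 12:00
by_time = flow.between_time('05:00', '12:00')

print(by_mask)
print(by_time)
```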

data.head(6)

L06_347 	LS06_347 	LS06_348

# Resampling

data.resample('D').mean()

L06_347 	LS06_347 	LS06_348

Time
2009-01-01 0.12501 0.09228 0.01664
2009-01-02 0.12415 0.09578 0.01641
2009-01-03 0.11356 0.08554 0.01609
2009-01-04 0.14020 0.10271 0.01732
2009-01-05 0.12881 0.10449 0.01817
2009-01-06 0.09577 0.06793 0.01452
2009-01-07 0.11865 0.08364 0.01434
2009-01-08 0.09432 0.07015 0.01507
2009-01-09 0.07816 0.05844 0.01402
2009-01-10 0.11992 0.09384 0.01357
2009-01-11 0.09965 0.07112 0.01383
2009-01-12 0.12826 0.09405 0.01623
2009-01-13 0.50280 0.59831 0.05027
2009-01-14 0.32390 0.30935 0.02859
2009-01-15 0.21419 0.18598 0.02373
2009-01-16 0.18621 0.14951 0.02148
2009-01-17 0.23133 0.20364 0.02932
2009-01-18 0.75227 0.87436 0.08229
2009-01-19 1.00881 1.26615 0.09624
2009-01-20 0.66869 0.78560 0.06565
2009-01-21 0.37927 0.37766 0.04078
2009-01-22 0.39792 0.40364 0.04447
2009-01-23 5.93353 6.19993 0.40471
2009-01-24 1.89376 2.19266 0.10548
2009-01-25 0.57097 0.62696 0.05811
2009-01-26 0.39536 0.39737 0.04149
2009-01-27 0.28993 0.26887 0.03194
2009-01-28 0.26766 0.23915 0.02860
2009-01-29 0.23087 0.18735 0.02529
2009-01-30 0.22431 0.17994 0.02451
… … … …
2012-12-04 0.42692 0.42692 0.06622
2012-12-05 0.85898 0.85898 0.10526
2012-12-06 0.50203 0.50203 0.06621
2012-12-07 0.75386 0.75386 0.10757
2012-12-08 0.45820 0.45820 0.06511
2012-12-09 0.29474 0.29474 0.05185
2012-12-10 0.29318 0.29318 0.04817
2012-12-11 0.22973 0.22973 0.03755
2012-12-12 0.19996 0.19996 0.03577
2012-12-13 0.17421 0.17421 0.03508
2012-12-14 0.48484 0.48484 0.08501
2012-12-15 0.77540 0.77540 0.06838
2012-12-16 0.35960 0.35960 0.04601
2012-12-17 0.32079 0.32079 0.04277
2012-12-18 0.33107 0.33107 0.04792
2012-12-19 0.28434 0.28434 0.03737
2012-12-20 0.61780 0.61780 0.07503
2012-12-21 0.65828 0.65828 0.05967
2012-12-22 1.71293 1.71293 0.20872
2012-12-23 4.76792 4.76792 0.33498
2012-12-24 1.72472 1.72472 0.14780
2012-12-25 1.81454 1.81454 0.19384
2012-12-26 1.34976 1.34976 0.14795
2012-12-27 1.74635 1.74635 0.15516
2012-12-28 1.25864 1.25864 0.11720
2012-12-29 0.80760 0.80760 0.07803
2012-12-30 1.02724 1.02724 0.08800
2012-12-31 0.74836 0.74836 0.08142
2013-01-01 1.73304 1.73304 0.14220
2013-01-02 0.86000 0.86000 0.07500

1463 rows × 3 columns

# The older how='mean' argument is deprecated; call the aggregation method on the resampler instead

data.resample('D').mean().head()

L06_347 	LS06_347 	LS06_348

Time
2009-01-01 0.12501 0.09228 0.01664
2009-01-02 0.12415 0.09578 0.01641
2009-01-03 0.11356 0.08554 0.01609
2009-01-04 0.14020 0.10271 0.01732
2009-01-05 0.12881 0.10449 0.01817
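A resampler accepts more than one aggregation at a time. A minimal sketch on synthetic data, assuming a 3-hourly index as in the flow data; the column values and the name `daily` are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Two days of synthetic 3-hourly readings: 0..7 on day one, 8..15 on day two
idx = pd.date_range('2009-01-01', periods=16, freq=pd.Timedelta(hours=3))
flow = pd.DataFrame({'L06_347': np.arange(16.0)}, index=idx)

# One row per day, with a mean and a max column for each original column
daily = flow.resample('D').agg(['mean', 'max'])
print(daily)
```

The result has two-level column labels, so a single statistic is addressed as `daily[('L06_347', 'mean')]`.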

%matplotlib notebook

data.resample('M').mean().plot()

<matplotlib.axes._subplots.AxesSubplot at 0x186d47f8ba8>

# Common operations

import pandas as pd

data = pd.DataFrame({'group':['a','a','a','b','b','b','c','c','c'],
                     'data':[4,3,2,1,12,3,4,5,7]})

data

data 	group

0 4 a
1 3 a
2 2 a
3 1 b
4 12 b
5 3 b
6 4 c
7 5 c
8 7 c

data.sort_values(by=['group','data'],ascending=[False,True],inplace=True)

data

data 	group

6 4 c
7 5 c
8 7 c
3 1 b
5 3 b
4 12 b
2 2 a
1 3 a
0 4 a

data=pd.DataFrame({'k1':['one']*3+['two']*4,'k2':[3,2,1,3,3,4,4]})

data

k1 	k2

0 one 3
1 one 2
2 one 1
3 two 3
4 two 3
5 two 4
6 two 4

data.sort_values(by='k2')

k1 	k2

2 one 1
1 one 2
0 one 3
3 two 3
4 two 3
5 two 4
6 two 4

data.drop_duplicates()  # drop rows that are complete duplicates

k1 	k2

0 one 3
1 one 2
2 one 1
3 two 3
5 two 4

data.drop_duplicates(subset='k1')

k1 	k2

0 one 3
3 two 3
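By default the first row of each duplicate group is the one kept. A short sketch of the `keep` parameter and of `duplicated()`, reusing the same `k1`/`k2` frame; `last_per_key` and `dup_mask` are names chosen for illustration.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [3, 2, 1, 3, 3, 4, 4]})

# keep='last' retains the final row of each duplicate group instead of the first
last_per_key = data.drop_duplicates(subset='k1', keep='last')

# duplicated() returns the boolean mask that drop_duplicates() acts on
dup_mask = data.duplicated()

print(last_per_key)
print(dup_mask)
```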

data=pd.DataFrame({'food':['A1','A2','B1','B2','B3','C1','C2'],'data':[1,2,3,4,5,6,7]})

data

data 	food

0 1 A1
1 2 A2
2 3 B1
3 4 B2
4 5 B3
5 6 C1
6 7 C2

def food_map(series):
    # Map each food code in a row to its group letter
    if series['food'] == 'A1':
        return 'A'
    elif series['food'] == 'A2':
        return 'A'
    elif series['food'] == 'B1':
        return 'B'
    elif series['food'] == 'B2':
        return 'B'
    elif series['food'] == 'B3':
        return 'B'
    elif series['food'] == 'C1':
        return 'C'
    elif series['food'] == 'C2':
        return 'C'

# axis='columns' passes one row at a time to food_map
data['food_map'] = data.apply(food_map,axis = 'columns')

data

data 	food 	food_map

0 1 A1 A
1 2 A2 A
2 3 B1 B
3 4 B2 B
4 5 B3 B
5 6 C1 C
6 7 C2 C

# Mapping with a dictionary

food2Upper={

'A1':'A',

'A2':'A',

'B1':'B',

'B2':'B',

'B3':'B',

'C1':'C',

'C2':'C'}

data['upper']=data['food'].map(food2Upper)

data

data 	food 	food_map 	upper

0 1 A1 A A
1 2 A2 A A
2 3 B1 B B
3 4 B2 B B
4 5 B3 B B
5 6 C1 C C
6 7 C2 C C
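One thing worth knowing about the dictionary approach is what happens to values that are not keys. A minimal sketch with a shortened, made-up food list; `'D9'` is deliberately left out of the dictionary.

```python
import pandas as pd

food = pd.Series(['A1', 'B2', 'C1', 'D9'])
food2Upper = {'A1': 'A', 'B2': 'B', 'C1': 'C'}

# 'D9' is not a key in the dictionary, so map() yields NaN for it
mapped = food.map(food2Upper)

# Taking the first character works here only because the group is the first letter
first_letter = food.str[0]

print(mapped)
print(first_letter)
```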

# Common operations

import numpy as np

df = pd.DataFrame({'data1':np.random.randn(5),
                   'data2':np.random.randn(5)})

df2 = df.assign(ration = df['data1']/df['data2'])

data1 	data2

0 1.80403 0.75122
1 -0.49803 1.43483
2 0.57212 -0.69617
3 0.88256 -0.57113
4 1.31386 -0.10652

df2

data1 	data2 	ration

0 1.80403 0.75122 2.40145
1 -0.49803 1.43483 -0.34710
2 0.57212 -0.69617 -0.82181
3 0.88256 -0.57113 -1.54529
4 1.31386 -0.10652 -12.33390

# Drop the column again

df2.drop('ration',axis='columns',inplace=True)

data=pd.Series([1,2,3,4,5,6,7,8,9])

data

0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
dtype: int64

data.replace(9,np.nan,inplace=True)

data

0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
7 8.0
8 NaN
dtype: float64

ages=[15,18,20,21,22,345,41,52,63,79]

bins=[10,20,30,40,50,60,70,80]

bin_res=pd.cut(ages,bins)

bin_res

[(10, 20], (10, 20], (10, 20], (20, 30], (20, 30], NaN, (40, 50], (50, 60], (60, 70], (70, 80]]
Categories (7, interval[int64]): [(10, 20] < (20, 30] < (30, 40] < (40, 50] < (50, 60] < (60, 70] < (70, 80]]

# .labels is deprecated; .codes gives the bin number of each value (-1 means out of range)

bin_res.codes

array([ 0, 0, 0, 1, 1, -1, 3, 4, 5, 6], dtype=int8)

pd.value_counts(bin_res)

(10, 20] 3
(20, 30] 2
(70, 80] 1
(60, 70] 1
(50, 60] 1
(40, 50] 1
(30, 40] 0
dtype: int64

pd.cut(ages,[10,30,50,80])

[(10, 30], (10, 30], (10, 30], (10, 30], (10, 30], NaN, (30, 50], (50, 80], (50, 80], (50, 80]]
Categories (3, interval[int64]): [(10, 30] < (30, 50] < (50, 80]]

group_names = ['Youth','Middle','Old']

#pd.cut(ages,[10,20,50,80],labels=group_names)

pd.value_counts(pd.cut(ages,[10,20,50,80],labels=group_names))

Old 3
Middle 3
Youth 3
dtype: int64
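The bins above are closed on the right, which decides where a boundary value such as 20 lands. A short sketch comparing the default with `right=False`, using the same `ages` list; `right_closed` and `left_closed` are illustrative names.

```python
import pandas as pd

ages = [15, 18, 20, 21, 22, 345, 41, 52, 63, 79]

# Default: intervals are closed on the right, so 20 falls in (10, 20]
right_closed = pd.cut(ages, [10, 20, 50, 80])

# right=False closes them on the left instead, so 20 falls in [20, 50)
left_closed = pd.cut(ages, [10, 20, 50, 80], right=False)

# 345 lies outside every bin in both cases and gets the code -1
print(right_closed.codes)
print(left_closed.codes)
```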

df=pd.DataFrame([range(3),[0,np.nan,0],[0,0,np.nan],range(3)])

0 	1 	2

0 0 1.0 2.0
1 0 NaN 0.0
2 0 0.0 NaN
3 0 1.0 2.0

df.isnull()

0 	1 	2

0 False False False
1 False True False
2 False False True
3 False False False

df.isnull().any()

0 False
1 True
2 True
dtype: bool

df.isnull().any(axis=1)

0 False
1 True
2 True
3 False
dtype: bool

df.fillna(5)

0 	1 	2

0 0 1.0 2.0
1 0 5.0 0.0
2 0 0.0 5.0
3 0 1.0 2.0

df[df.isnull().any(axis=1)]  # use the mask as an index to pull out the rows with missing values

0 	1 	2

1 0 NaN 0.0
2 0 0.0 NaN
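Besides a constant, missing values are often dropped or filled per column. A minimal sketch on the same four-row frame; `complete_rows` and `filled` are names introduced for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([range(3), [0, np.nan, 0], [0, 0, np.nan], range(3)])

# Drop every row that has at least one NaN
complete_rows = df.dropna()

# Fill each column's gaps with that column's own mean
filled = df.fillna(df.mean())

print(complete_rows)
print(filled)
```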

# More on groupby operations

import pandas as pd

import numpy as np

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

A 	B 	C 	D

0 foo one 0.71168 0.86983
1 bar one 0.91770 0.69098
2 foo two 0.48605 0.40056
3 bar three 0.20739 0.26912
4 foo two 0.60928 2.15210
5 bar two -0.87134 -0.37828
6 foo one 0.20450 -0.49510
7 foo three 0.33635 1.16671

grouped = df.groupby('A')

grouped

<pandas.core.groupby.DataFrameGroupBy object at 0x00000186D63E05F8>

grouped.count()

B 	C 	D

A
bar 3 3 3
foo 5 5 5

grouped=df.groupby(['A','B'])

grouped.count()

	C 	D

A B
bar one 1 1
three 1 1
two 1 1
foo one 2 2
three 1 1
two 2 2

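def get_letter_type(letter):
    # Label vowels 'a' and every other letter 'b'
    if letter.lower() in 'aeiou':
        return 'a'
    else:
        return 'b'

# Group the columns (axis=1) by applying the function to each column name
grouped=df.groupby(get_letter_type,axis=1)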

grouped.count().iloc[0]

a 1
b 3
Name: 0, dtype: int64

s = pd.Series([1,2,3,1,2,3],[8,7,5,8,7,5])

8 1
7 2
5 3
8 1
7 2
5 3
dtype: int64

grouped = s.groupby(level = 0)

grouped

<pandas.core.groupby.SeriesGroupBy object at 0x00000186D640F588>

grouped.first()

5 3
7 2
8 1
dtype: int64

grouped.last()

5 3
7 2
8 1
dtype: int64

grouped.sum()

5 6
7 4
8 2
dtype: int64

grouped=s.groupby(level=0,sort=False)

grouped.first()

8 1
7 2
5 3
dtype: int64

df2=pd.DataFrame({'x':['A','B','A','B'],'Y':[1,2,3,4]})

df2

Y 	x

0 1 A
1 2 B
2 3 A
3 4 B

df2.groupby(['x']).get_group('A')

Y 	x

0 1 A
2 3 A

df2.groupby(['x']).get_group('B')

Y 	x

1 2 B
3 4 B

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

arrays

[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

index=pd.MultiIndex.from_arrays(arrays,names=['first','second'])

index

MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
names=['first', 'second'])

s=pd.Series(np.random.randn(8),index=index)

first second
bar one 0.94576
two 0.23519
baz one -1.26270
two -0.60026
foo one -1.08169
two 1.86771
qux one -0.57534
two -0.31680
dtype: float64

grouped=s.groupby(level=0)

grouped

<pandas.core.groupby.SeriesGroupBy object at 0x00000186D640FA20>

grouped=s.groupby(level='first')

grouped.sum()

first
bar 1.18096
baz -1.86296
foo 0.78603
qux -0.89214
dtype: float64

grouped=df.groupby(['A','B'])

df.aggregate(np.sum)

A foobarfoobarfoobarfoofoo
B oneonetwothreetwotwoonethree
C 2.6016
D 4.6759
dtype: object

grouped = df.groupby(['A','B'],as_index = False)

grouped.aggregate(np.sum)

A 	B 	C 	D

0 bar one 0.91770 0.69098
1 bar three 0.20739 0.26912
2 bar two -0.87134 -0.37828
3 foo one 0.91618 0.37473
4 foo three 0.33635 1.16671
5 foo two 1.09532 2.55266

df.groupby(['A','B']).sum().reset_index()

A 	B 	C 	D

0 bar one 0.91770 0.69098
1 bar three 0.20739 0.26912
2 bar two -0.87134 -0.37828
3 foo one 0.91618 0.37473
4 foo three 0.33635 1.16671
5 foo two 1.09532 2.55266

grouped.size()

A B
bar one 1
three 1
two 1
foo one 2
three 1
two 2
dtype: int64

grouped.describe().head()

C 	D
count 	mean 	std 	min 	25% 	50% 	75% 	max 	count 	mean 	std 	min 	25% 	50% 	75% 	max

0 1.0 0.91770 NaN 0.91770 0.91770 0.91770 0.91770 0.91770 1.0 0.69098 NaN 0.69098 0.69098 0.69098 0.69098 0.69098
1 1.0 0.20739 NaN 0.20739 0.20739 0.20739 0.20739 0.20739 1.0 0.26912 NaN 0.26912 0.26912 0.26912 0.26912 0.26912
2 1.0 -0.87134 NaN -0.87134 -0.87134 -0.87134 -0.87134 -0.87134 1.0 -0.37828 NaN -0.37828 -0.37828 -0.37828 -0.37828 -0.37828
3 2.0 0.45809 0.35863 0.20450 0.33129 0.45809 0.58489 0.71168 2.0 0.18737 0.96515 -0.49510 -0.15387 0.18737 0.52860 0.86983
4 1.0 0.33635 NaN 0.33635 0.33635 0.33635 0.33635 0.33635 1.0 1.16671 NaN 1.16671 1.16671 1.16671 1.16671 1.16671

grouped = df.groupby('A')

grouped['C'].agg([np.sum,np.mean,np.std])

sum 	mean 	std

A
bar 0.25375 0.08458 0.90082
foo 2.34785 0.46957 0.20397
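The same multi-statistic summary can be written with named aggregation, which lets each output column be named explicitly. A minimal sketch on a small hand-made frame; the column names `total` and `average` are arbitrary choices for the example. Passing the aggregation as a string such as `'sum'` instead of `np.sum` also avoids the warnings that recent pandas releases emit for NumPy callables.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

# Each keyword becomes an output column: name=(source column, aggregation)
summary = df.groupby('A').agg(total=('C', 'sum'), average=('C', 'mean'))
print(summary)
```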

# String operations

import numpy as np

import pandas as pd

s=pd.Series(['A','a','b','B','gaer','AGER',np.nan])

0 A
1 a
2 b
3 B
4 gaer
5 AGER
6 NaN
dtype: object

s.str.lower()

0 a
1 a
2 b
3 b
4 gaer
5 ager
6 NaN
dtype: object

s.str.upper()

0 A
1 A
2 B
3 B
4 GAER
5 AGER
6 NaN
dtype: object

s.str.len()

0 1.0
1 1.0
2 1.0
3 1.0
4 4.0
5 4.0
6 NaN
dtype: float64

index=pd.Index(['cui',' li','jun'])

index

Index(['cui', ' li', 'jun'], dtype='object')

index.str.strip()

Index(['cui', 'li', 'jun'], dtype='object')

index.str.lstrip()

Index(['cui', 'li', 'jun'], dtype='object')

index.str.strip()

Index(['cui', 'li', 'jun'], dtype='object')

df=pd.DataFrame(np.random.randn(3,2),columns=['A','B'],index=range(3))

A 	B

0 -0.172169 1.626435
1 -0.604493 0.374151
2 0.716009 2.219520

df.columns=df.columns.str.replace('','_')

_A_ 	_B_

0 -0.172169 1.626435
1 -0.604493 0.374151
2 0.716009 2.219520

s=pd.Series(['a_b-C','c_d_e','f_g_h'])

0 a_b-C
1 c_d_e
2 f_g_h
dtype: object

s.str.split('_')

0 [a, b-C]
1 [c, d, e]
2 [f, g, h]
dtype: object

s.str.split('_',expand=True,n=1)

0 	1

0 a b-C
1 c d_e
2 f g_h

s=pd.Series(['A','Aas','Afgew','Ager','Agre','Aw'])

0 A
1 Aas
2 Afgew
3 Ager
4 Agre
5 Aw
dtype: object

s.str.contains('Aa')

0 False
1 True
2 False
3 False
4 False
5 False
dtype: bool
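When the Series contains missing values, the result of `str.contains` cannot be used directly as a filter. A short sketch, assuming an object-dtype Series with one `NaN`; the names `raw`, `mask` and `matches` are illustrative.

```python
import numpy as np
import pandas as pd

# dtype=object keeps the classic behaviour where NaN propagates through .str methods
s = pd.Series(['A', 'Aas', 'Afgew', np.nan], dtype=object)

# The missing entry stays missing in the result
raw = s.str.contains('Aa')

# na=False treats missing entries as non-matches, giving a clean boolean mask
mask = s.str.contains('Aa', na=False)
matches = s[mask]

print(raw)
print(matches)
```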

s=pd.Series(['a','a|b','a|c'])

0 a
1 a|b
2 a|c
dtype: object

s.str.get_dummies(sep='|')

a 	b 	c

0 1 0 0
1 1 1 0
2 1 0 1

# Indexing

s = pd.Series(np.arange(5),index = np.arange(5)[::-1],dtype='int64')

4 0
3 1
2 2
1 3
0 4
dtype: int64

s.isin([1,23,4])

4 False
3 True
2 False
1 False
0 True
dtype: bool

s[s.isin([1,3,4])]

3 1
1 3
0 4
dtype: int64

s2 = pd.Series(np.arange(6),index = pd.MultiIndex.from_product([[0,1],['a','b','c']]))

0 a 0
b 1
c 2
1 a 3
b 4
c 5
dtype: int32

s2.iloc[s2.index.isin([(1,'a'),(2,'b')])]

1 a 3
dtype: int32

s[s>2]

1 3
0 4
dtype: int64

dates=pd.date_range('2019-04-05',periods=8)

df=pd.DataFrame(np.random.randn(8,4),index=dates,columns=['A','B','C','D'])

A 	B 	C 	D

2019-04-05 0.145422 0.342281 0.971241 -0.041731
2019-04-06 -2.102217 0.778930 -1.972598 -0.694885
2019-04-07 -0.158922 1.619844 -0.689797 -0.934461
2019-04-08 0.636213 -0.681186 0.089263 0.550155
2019-04-09 -0.094493 0.721435 1.333688 -0.069475
2019-04-10 1.197129 -0.697439 -0.884878 1.433160
2019-04-11 -0.968315 0.430566 -0.930414 -0.153921
2019-04-12 -0.129315 -0.056980 0.572650 -0.016057

# DataFrame.select was removed in pandas 1.0; a callable passed to .loc selects the same column

df.loc[:,lambda d:d.columns=='A']

A

2019-04-05 0.145422
2019-04-06 -2.102217
2019-04-07 -0.158922
2019-04-08 0.636213
2019-04-09 -0.094493
2019-04-10 1.197129
2019-04-11 -0.968315
2019-04-12 -0.129315

df.where(df<0)

A 	B 	C 	D

df.where(df<0,-df)

A 	B 	C 	D

df=pd.DataFrame(np.random.randn(10,3),columns=list('abc'))

a 	b 	c

0 0.233600 -0.118476 1.910718
1 0.453123 0.328837 1.967945
2 -0.719929 -1.564187 0.457447
3 1.464841 1.631935 0.351648
4 -0.977479 -1.000130 -0.275709
5 -0.253827 0.032827 -1.997572
6 -0.322984 0.226921 0.465433
7 0.018869 -1.393526 1.270390
8 -1.213045 -0.418379 0.584319
9 0.662430 0.761807 -0.990689

df.query('(a<b)')

a 	b 	c

df.query('(a<b)&(b<c)')

a 	b 	c
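A `query` string is shorthand for a boolean mask. The sketch below checks that the two forms select the same rows; the seed, `threshold` and the result names are made up for the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)   # fixed seed so the run is repeatable
df = pd.DataFrame(rng.randn(10, 3), columns=list('abc'))

# The query string and the explicit boolean mask select the same rows
by_query = df.query('(a < b) & (b < c)')
by_mask = df[(df['a'] < df['b']) & (df['b'] < df['c'])]

# Inside a query string, @ refers to a local Python variable
threshold = 0.0
positive_a = df.query('a > @threshold')

print(by_query.equals(by_mask))
print(positive_a)
```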

# Plotting with pandas

%matplotlib inline

import pandas as pd

import numpy as np

s = pd.Series(np.random.randn(10),index = np.arange(0,100,10))

s.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x1cdf7bc7908>

df=pd.DataFrame(np.random.randn(10,4).cumsum(0),index=np.arange(0,100,10),columns=['A','B','C','D'])

df.head()

A 	B 	C 	D

0 -0.600440 0.862748 -0.902197 -0.372323
10 -0.543945 0.229546 -0.963724 0.452196
20 -1.744248 0.161023 0.073936 -0.225950
30 -3.167785 0.514504 1.225721 0.756929
40 -1.606017 0.472679 1.758449 -0.160899

df.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x1cdf7bc7f98>

import matplotlib.pyplot as plt

fig,axes = plt.subplots(2,1)

data = pd.Series(np.random.rand(16),index=list('abcdefghijklmnop'))

data.plot(ax = axes[0],kind='bar')

data.plot(ax = axes[1],kind='barh')

<matplotlib.axes._subplots.AxesSubplot at 0x1cdf82e2fd0>

df = pd.DataFrame(np.random.rand(6, 4),

           index = ['one', 'two', 'three', 'four', 'five', 'six'], 

           columns = pd.Index(['A', 'B', 'C', 'D'], name = 'Genus'))

df.head()

Genus A B C D
one 0.350508 0.225946 0.141177 0.353882
two 0.390222 0.989578 0.332295 0.474077
three 0.837848 0.944297 0.442973 0.730698
four 0.604615 0.099858 0.390346 0.336698
five 0.736496 0.055303 0.108531 0.251296

df.plot(kind='bar')

<matplotlib.axes._subplots.AxesSubplot at 0x1cdf828b6d8>

# Raw string, so that \t in the Windows path is not read as a tab character
tips=pd.read_csv(r'E:\pyhon\pandas\Pandas%E4%BB%A3%E7%A0%81\tips.csv')

tips.head()

total_bill 	tip 	sex 	smoker 	day 	time 	size

0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

tips.total_bill.plot(kind='hist',bins=20)

<matplotlib.axes._subplots.AxesSubplot at 0x1cdf9702198>

macro = pd.read_csv(r'E:\pyhon\pandas\Pandas%E4%BB%A3%E7%A0%81\macrodata.csv')

macro.head()

year 	quarter 	realgdp 	realcons 	realinv 	realgovt 	realdpi 	cpi 	m1 	tbilrate 	unemp 	pop 	infl 	realint

0 1959.0 1.0 2710.349 1707.4 286.898 470.045 1886.9 28.98 139.7 2.82 5.8 177.146 0.00 0.00
1 1959.0 2.0 2778.801 1733.7 310.859 481.301 1919.7 29.15 141.7 3.08 5.1 177.830 2.34 0.74
2 1959.0 3.0 2775.488 1751.8 289.226 491.260 1916.4 29.35 140.5 3.82 5.3 178.657 2.74 1.09
3 1959.0 4.0 2785.204 1753.7 299.356 484.052 1931.3 29.37 140.0 4.33 5.6 179.386 0.27 4.06
4 1960.0 1.0 2847.699 1770.5 331.722 462.199 1955.5 29.54 139.6 3.50 5.2 180.007 2.31 1.19

data = macro[['quarter','realgdp','realcons']]

data.plot.scatter('quarter','realgdp')

<matplotlib.axes._subplots.AxesSubplot at 0x1cdfea2c5f8>

# pandas.scatter_matrix is deprecated; use pandas.plotting.scatter_matrix instead
pd.plotting.scatter_matrix(macro,color='g',alpha=0.3)

[Output: a 14 x 14 array of matplotlib AxesSubplot objects, one panel for each pair of the 14 columns in macro; the figure itself is not reproduced here.]

#Processing large datasets

import pandas as pd

gl = pd.read_csv(r'E:\pyhon\pandas\Pandas%E4%BB%A3%E7%A0%81\game_logs.csv')

gl.head()

D:\program\Anaconda\lib\site-packages\IPython\core\interactiveshell.py:2698: DtypeWarning: Columns (12,13,14,15,19,20,81,83,85,87,93,94,95,96,97,98,99,100,105,106,108,109,111,112,114,115,117,118,120,121,123,124,126,127,129,130,132,133,135,136,138,139,141,142,144,145,147,148,150,151,153,154,156,157,160) have mixed types. Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)

date 	number_of_game 	day_of_week 	v_name 	v_league 	v_game_number 	h_name 	h_league 	h_game_number 	v_score 	... 	h_player_7_name 	h_player_7_def_pos 	h_player_8_id 	h_player_8_name 	h_player_8_def_pos 	h_player_9_id 	h_player_9_name 	h_player_9_def_pos 	additional_info 	acquisition_info

0 18710504 0 Thu CL1 na 1 FW1 na 1 0 … Ed Mincher 7.0 mcdej101 James McDermott 8.0 kellb105 Bill Kelly 9.0 NaN Y
1 18710505 0 Fri BS1 na 1 WS3 na 1 20 … Asa Brainard 1.0 burrh101 Henry Burroughs 9.0 berth101 Henry Berthrong 8.0 HTBF Y
2 18710506 0 Sat CL1 na 2 RC1 na 1 12 … Pony Sager 6.0 birdg101 George Bird 7.0 stirg101 Gat Stires 9.0 NaN Y
3 18710508 0 Mon CL1 na 3 CH1 na 1 12 … Ed Duffy 6.0 pinke101 Ed Pinkham 5.0 zettg101 George Zettlein 1.0 NaN Y
4 18710509 0 Tue BS1 na 2 TRO na 1 9 … Steve Bellan 5.0 pikel101 Lip Pike 3.0 cravb101 Bill Craver 6.0 HTBF Y

5 rows × 161 columns

gl.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171907 entries, 0 to 171906
Columns: 161 entries, date to acquisition_info
dtypes: float64(77), int64(6), object(78)
memory usage: 860.5 MB

for dtype in ['float64','object','int64']:
    selected_dtype = gl.select_dtypes(include=[dtype])
    mean_usage_b = selected_dtype.memory_usage(deep=True).mean()
    mean_usage_mb = mean_usage_b / 1024 ** 2
    print("Average memory usage for {} columns: {:03.2f} MB".format(dtype,mean_usage_mb))

Average memory usage for float64 columns: 1.29 MB
Average memory usage for object columns: 9.51 MB
Average memory usage for int64 columns: 1.12 MB

import numpy as np

int_types = ["uint8", "int8", "int16","int32","int64"]

for it in int_types:
    print(np.iinfo(it))

Machine parameters for uint8

min = 0
max = 255

Machine parameters for int8

min = -128
max = 127

Machine parameters for int16

min = -32768
max = 32767

Machine parameters for int32

min = -2147483648
max = 2147483647

Machine parameters for int64

min = -9223372036854775808
max = 9223372036854775807

def mem_usage(pandas_obj):
    if isinstance(pandas_obj,pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else: # we assume if not a df it's a series
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b / 1024 ** 2 # convert bytes to megabytes
    return "{:03.2f} MB".format(usage_mb)

gl_int = gl.select_dtypes(include=['int64'])

# pd.to_numeric reference: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_numeric.html

converted_int = gl_int.apply(pd.to_numeric,downcast='unsigned')

print(mem_usage(gl_int))

print(mem_usage(converted_int))

7.87 MB
1.48 MB
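The saving above comes from `pd.to_numeric` choosing the smallest unsigned integer type that can hold every value in a column. A minimal sketch of that behaviour, using two made-up series (`small` and `wide` are illustration names, not columns of game_logs.csv):

```python
import pandas as pd

# Hypothetical columns: one fits in 8 bits, the other needs 16
small = pd.Series([0, 1, 20, 255], dtype='int64')
wide = pd.Series([0, 1, 20, 256], dtype='int64')

small_u = pd.to_numeric(small, downcast='unsigned')
wide_u = pd.to_numeric(wide, downcast='unsigned')

print(small_u.dtype)  # uint8
print(wide_u.dtype)   # uint16
```

A single value of 256 is enough to push the column to the next wider type, which matches the uint8 range of 0 to 255 printed by `np.iinfo` earlier.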

gl_float = gl.select_dtypes(include=['float64'])

converted_float = gl_float.apply(pd.to_numeric,downcast='float')

print(mem_usage(gl_float))

print(mem_usage(converted_float))

100.99 MB
50.49 MB

optimized_gl = gl.copy()

optimized_gl[converted_int.columns] = converted_int

optimized_gl[converted_float.columns] = converted_float

print(mem_usage(gl))

print(mem_usage(optimized_gl))

860.50 MB
803.61 MB

for dtype in ['float64','int64','object']:
    selected_dtype = gl.select_dtypes(include = [dtype])
    mean_usage_b = selected_dtype.memory_usage(deep=True).mean()
    mean_usage_mb = mean_usage_b/1024**2
    print('Average memory usage',dtype,mean_usage_mb)

Average memory usage float64 1.2947326073279748
Average memory usage int64 1.1241934640066964
Average memory usage object 9.514454069016855

import numpy as np

int_types = ['uint8','int8','int16','int32','int64']

for it in int_types:
    print(np.iinfo(it))

Machine parameters for uint8

min = 0
max = 255

Machine parameters for int8

min = -128
max = 127

Machine parameters for int16

min = -32768
max = 32767

Machine parameters for int32

min = -2147483648
max = 2147483647

Machine parameters for int64

min = -9223372036854775808
max = 9223372036854775807

def mem_usage(pandas_obj):
    if isinstance(pandas_obj,pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else:
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b/1024**2
    return '{:03.2f} MB'.format(usage_mb)

gl_int = gl.select_dtypes(include = ['int64'])

converted_int = gl_int.apply(pd.to_numeric,downcast='unsigned')

print (mem_usage(gl_int))

print (mem_usage(converted_int))

7.87 MB
1.48 MB

gl_float = gl.select_dtypes(include=['float64'])

converted_float = gl_float.apply(pd.to_numeric,downcast='float')

print(mem_usage(gl_float))

print(mem_usage(converted_float))

100.99 MB
50.49 MB

optimized_gl = gl.copy()

optimized_gl[converted_int.columns] = converted_int

optimized_gl[converted_float.columns] = converted_float

print(mem_usage(gl))

print(mem_usage(optimized_gl))

860.50 MB
803.61 MB

gl_obj = gl.select_dtypes(include = ['object']).copy()

gl_obj.describe()

day_of_week 	v_name 	v_league 	h_name 	h_league 	day_night 	completion 	forefeit 	protest 	park_id 	... 	h_player_6_id 	h_player_6_name 	h_player_7_id 	h_player_7_name 	h_player_8_id 	h_player_8_name 	h_player_9_id 	h_player_9_name 	additional_info 	acquisition_info

count 171907 171907 171907 171907 171907 140150 116 145 180 171907 … 140838 140838 140838 140838 140838 140838 140838 140838 1456 140841
unique 7 148 7 148 7 2 116 3 5 245 … 4774 4720 5253 5197 4760 4710 5193 5142 332 1
top Sat CHN NL CHN NL D 19810610,CHI11,1,2,45 H V STL07 … grimc101 Charlie Grimm grimc101 Charlie Grimm lopea102 Al Lopez spahw101 Warren Spahn HTBF Y
freq 28891 8870 88866 9024 88867 82724 1 69 90 7022 … 427 427 491 491 676 676 339 339 1112 140841

4 rows × 78 columns

dow = gl_obj.day_of_week

dow.head()

0 Thu
1 Fri
2 Sat
3 Mon
4 Tue
Name: day_of_week, dtype: object

dow_cat = dow.astype('category')

dow_cat.head()

0 Thu
1 Fri
2 Sat
3 Mon
4 Tue
Name: day_of_week, dtype: category
Categories (7, object): [Fri, Mon, Sat, Sun, Thu, Tue, Wed]

dow_cat.head(10).cat.codes

0 4
1 0
2 2
3 1
4 5
5 4
6 2
7 2
8 1
9 5
dtype: int8

print (mem_usage(dow))

print (mem_usage(dow_cat))

9.84 MB
0.16 MB
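The drop from 9.84 MB to 0.16 MB happens because a category column stores each distinct string once and keeps only a small integer code per row. A minimal sketch with a made-up series (`days` is an illustration, not the real day_of_week column):

```python
import pandas as pd

# Hypothetical sample of weekday labels
days = pd.Series(['Thu', 'Fri', 'Sat', 'Fri', 'Thu', 'Thu'])
days_cat = days.astype('category')

# Distinct labels are stored once, in sorted order
print(list(days_cat.cat.categories))  # ['Fri', 'Sat', 'Thu']

# Each row holds only the position of its label in that list
print(list(days_cat.cat.codes))       # [2, 0, 1, 0, 2, 2]
print(days_cat.cat.codes.dtype)       # int8
```

With only seven distinct weekdays, one byte per row is enough, which is why the codes above are int8.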

converted_obj = pd.DataFrame()

for col in gl_obj.columns:
    num_unique_values = len(gl_obj[col].unique())
    num_total_values = len(gl_obj[col])
    # convert to category only when fewer than half of the values are unique
    if num_unique_values / num_total_values < 0.5:
        converted_obj.loc[:,col] = gl_obj[col].astype('category')
    else:
        converted_obj.loc[:,col] = gl_obj[col]

print(mem_usage(gl_obj))

print(mem_usage(converted_obj))

751.64 MB
51.67 MB

date = optimized_gl.date

date[:5]

0 18710504
1 18710505
2 18710506
3 18710508
4 18710509
Name: date, dtype: uint32

print (mem_usage(date))

0.66 MB

optimized_gl['date'] = pd.to_datetime(date,format='%Y%m%d')

print (mem_usage(optimized_gl['date']))

1.31 MB

optimized_gl['date'][:5]

0 1871-05-04
1 1871-05-05
2 1871-05-06
3 1871-05-08
4 1871-05-09
Name: date, dtype: datetime64[ns]
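The conversion above roughly doubles the memory of the column (0.66 MB to 1.31 MB), which is consistent with 4-byte uint32 values becoming 8-byte datetime values; in exchange the column gains date accessors. A small sketch of the same parsing on made-up input (converting to strings first is an assumption made here for clarity, not what the notebook does):

```python
import pandas as pd

# Hypothetical dates stored as integers in YYYYMMDD form
raw = pd.Series([18710504, 18710505, 18710506])

# Convert to text first so the format string applies unambiguously
parsed = pd.to_datetime(raw.astype(str), format='%Y%m%d')

print(parsed.dt.year.tolist())  # [1871, 1871, 1871]
print(parsed.dt.day.tolist())   # [4, 5, 6]
```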

#apply operations

import pandas as pd

import numpy as np

# raw string, so that "\t" in "\titanic_train.csv" is not read as a tab
titanic = pd.read_csv(r'E:\pyhon\pandas\Pandas%E4%BB%A3%E7%A0%81\titanic_train.csv')

titanic.head()

PassengerId 	Survived 	Pclass 	Name 	Sex 	Age 	SibSp 	Parch 	Ticket 	Fare 	Cabin 	Embarked

def hundredth_row(columns):
    # return the value at position 99 (the 100th row) of each column
    item = columns.iloc[99]
    return item

hundredth_row = titanic.apply(hundredth_row)

hundredth_row

PassengerId 100
Survived 0
Pclass 2
Name Kantor, Mr. Sinai
Sex male
Age 34
SibSp 1
Parch 0
Ticket 244367
Fare 26
Cabin NaN
Embarked S
dtype: object

def null_count(column):
    # count the missing (null) values in one column
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)

columns_null_count = titanic.apply(null_count)

columns_null_count

PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
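The per-column count above can also be obtained without a custom function. The sketch below checks, on a small made-up frame (`demo` and `count_nulls` are illustration names), that `apply` with a counting function agrees with `isnull().sum()`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing values
demo = pd.DataFrame({'Age': [22.0, np.nan, 26.0, np.nan],
                     'Cabin': [np.nan, 'C85', np.nan, np.nan],
                     'Sex': ['male', 'female', 'female', 'male']})

def count_nulls(column):
    # same idea as the function above: keep the null entries, count them
    return len(column[pd.isnull(column)])

via_apply = demo.apply(count_nulls)
via_builtin = demo.isnull().sum()

print(via_builtin.to_dict())  # {'Age': 2, 'Cabin': 3, 'Sex': 0}
```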

def which_class(row):
    # map the numeric Pclass value of one row to a readable label
    pclass = row['Pclass']
    if pd.isnull(pclass):
        return 'Unknown'
    elif pclass == 1:
        return 'First class'
    elif pclass == 2:
        return 'Second class'
    elif pclass == 3:
        return 'Third class'

classes = titanic.apply(which_class,axis = 1)

classes

0 Third class
1 First class
2 Third class
3 First class
4 Third class
5 Third class
6 First class
7 Third class
8 Third class
9 Second class
10 Third class
11 First class
12 Third class
13 Third class
14 Third class
15 Second class
16 Third class
17 Second class
18 Third class
19 Third class
20 Second class
21 Second class
22 Third class
23 First class
24 Third class
25 Third class
26 Third class
27 First class
28 Third class
29 Third class
…
861 Second class
862 First class
863 Third class
864 Second class
865 Second class
866 Second class
867 First class
868 Third class
869 Third class
870 Third class
871 First class
872 First class
873 Third class
874 Second class
875 Third class
876 Third class
877 Third class
878 Third class
879 First class
880 Second class
881 Third class
882 Third class
883 Second class
884 Third class
885 Third class
886 Second class
887 First class
888 Third class
889 First class
890 Third class
Length: 891, dtype: object
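Row-wise `apply` calls a Python function once per row. For a plain lookup like this one, `Series.map` with a dictionary is a common alternative. A sketch on made-up data (`demo` and `names` are illustration names); note one difference from the function above: values outside 1 to 3 become 'Unknown' here, whereas the function would return None for them.

```python
import pandas as pd

# Hypothetical passenger classes
demo = pd.DataFrame({'Pclass': [3, 1, 2, 3]})

names = {1: 'First class', 2: 'Second class', 3: 'Third class'}

# Unmapped or missing values come back as NaN, then get the fallback label
classes = demo['Pclass'].map(names).fillna('Unknown')

print(classes.tolist())
# ['Third class', 'First class', 'Second class', 'Third class']
```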

def is_minor(row):
    if row['Age'] < 18:
        return True
    else:
        return False

minors = titanic.apply(is_minor,axis = 1)

minors

0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 True
8 False
9 True
10 True
11 False
12 False
13 False
14 True
15 False
16 True
17 False
18 False
19 False
20 False
21 False
22 True
23 False
24 True
25 False
26 False
27 False
28 False
29 False
…
861 False
862 False
863 False
864 False
865 False
866 False
867 False
868 False
869 True
870 False
871 False
872 False
873 False
874 False
875 True
876 False
877 False
878 False
879 False
880 False
881 False
882 False
883 False
884 False
885 False
886 False
887 False
888 False
889 False
890 False
Length: 891, dtype: bool
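The same flag can be computed by comparing the whole column at once. In the sketch below (made-up ages, not the Titanic file) a missing age compares as False, which is also what the row-wise function above returns for NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value
demo = pd.DataFrame({'Age': [22.0, 2.0, np.nan, 14.0, 18.0]})

# Element-wise comparison; NaN < 18 evaluates to False
minors = demo['Age'] < 18

print(minors.tolist())  # [False, True, False, True, False]
```

Whether a passenger of unknown age should count as "not a minor" is a modelling decision; both versions make that choice silently.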

****pandas exercises

import pandas as pd

#Show version information

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.3
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.3
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

import numpy as np

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

#Create a DataFrame

df = pd.DataFrame(data,index = labels)

df.head()

age 	animal 	priority 	visits

a 2.5 cat yes 1
b 3.0 cat yes 3
c 0.5 snake no 2
d NaN dog yes 3
e 5.0 dog no 2

#Show detailed information

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
age 8 non-null float64
animal 10 non-null object
priority 10 non-null object
visits 10 non-null int64
dtypes: float64(1), int64(1), object(2)
memory usage: 400.0+ bytes

#Indexing

df.iloc[:3]

age 	animal 	priority 	visits

a 2.5 cat yes 1
b 3.0 cat yes 3
c 0.5 snake no 2

#Select rows that satisfy a condition

df[df['visits'] > 2]

age 	animal 	priority 	visits

b 3.0 cat yes 3
d NaN dog yes 3
f 2.0 cat no 3

#View rows with missing values

df[df['age'].isnull()]

age 	animal 	priority 	visits

d NaN dog yes 3
h NaN cat yes 1

#Filter on an attribute value combined with a range condition

df[(df['animal'] =='cat') & (df['age'] < 3)]

age 	animal 	priority 	visits

a 2.5 cat yes 1
f 2.0 cat no 3

#Change a value

df.loc['f','age'] = 1.5

df[(df['animal'] =='cat') & (df['age'] < 3)]

age 	animal 	priority 	visits

a 2.5 cat yes 1
f 1.5 cat no 3

#Compute the mean with groupby

df.groupby('animal')['age'].mean()

animal
cat 2.333333
dog 5.000000
snake 2.500000
Name: age, dtype: float64
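The group means above can be checked by hand: `mean()` skips missing ages, so the cat average is (2.5 + 3 + 1.5) / 3. The sketch below rebuilds only the two relevant columns from the exercise data, with the age at 'f' already changed to 1.5 (`pets` is an illustration name):

```python
import numpy as np
import pandas as pd

animals = ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog']
ages = [2.5, 3, 0.5, np.nan, 5, 1.5, 4.5, np.nan, 7, 3]  # 'f' already set to 1.5
pets = pd.DataFrame({'animal': animals, 'age': ages}, index=list('abcdefghij'))

means = pets.groupby('animal')['age'].mean()

# cat: (2.5 + 3 + 1.5) / 3, the NaN at 'h' is skipped
print(round(means['cat'], 6))  # 2.333333
print(means['dog'])            # 5.0
print(means['snake'])          # 2.5
```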

#Count the occurrences of each value

df['animal'].value_counts()

cat 4
dog 4
snake 2
Name: animal, dtype: int64

#Map attribute values

df['priority'] = df['priority'].map({'yes':True,'no':False})

df.head()

age 	animal 	priority 	visits

a 2.5 cat True 1
b 3.0 cat True 3
c 0.5 snake False 2
d NaN dog True 3
e 5.0 dog False 2

#Replace attribute values

df['animal'] = df['animal'].replace('snake','tangyudi')

df.head()

age 	animal 	priority 	visits

a 2.5 cat True 1
b 3.0 cat True 3
c 0.5 tangyudi False 2
d NaN dog True 3
e 5.0 dog False 2

#Pivot table

df.pivot_table(index = 'animal',columns = 'visits',values='age',aggfunc = 'mean')

visits 1 2 3
animal
cat 2.5 NaN 2.25
dog 3.0 6.0 NaN
tangyudi 4.5 0.5 NaN

#Subtract each row's mean from the values in that row

df = pd.DataFrame(np.random.random(size = (5,3)))

df.head()

0 	1 	2

0 0.787464 0.544326 0.763849
1 0.574682 0.880216 0.688106
2 0.947957 0.526658 0.704592
3 0.073148 0.601730 0.721848
4 0.592968 0.835612 0.710174

df.sub(df.mean(axis = 1),axis = 0)

0 	1 	2

0 0.088918 -0.154221 0.065303
1 -0.139652 0.165881 -0.026229
2 0.221554 -0.199744 -0.021810
3 -0.392427 0.136155 0.256273
4 -0.119950 0.122694 -0.002744

#Count the rows that appear exactly once (rows with any duplicate are dropped entirely)

len(df.drop_duplicates(keep=False))
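The `keep=False` argument matters here: the default keeps one copy of each duplicated row, while `keep=False` discards every row that has a duplicate. A sketch on a made-up frame (`demo` is an illustration name):

```python
import pandas as pd

# Hypothetical rows: (1,a) twice, (2,b) once, (3,c) twice, (3,d) once
demo = pd.DataFrame({'x': [1, 1, 2, 3, 3, 3],
                     'y': ['a', 'a', 'b', 'c', 'c', 'd']})

# Default: one representative per distinct row
distinct_rows = len(demo.drop_duplicates())

# keep=False: only rows that occur exactly once survive
appear_once = len(demo.drop_duplicates(keep=False))

print(distinct_rows)  # 4
print(appear_once)    # 2
```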

#Given the data, compute the rolling-window mean within each group (missing values are filled with 0)

import numpy as np

df = pd.DataFrame({'group': list('aabbabbbabab'),
                   'value': [1, 2, 3, np.nan, 2, 3,
                             np.nan, 1, 7, 3, np.nan, 8]})

df.head(12)

group 	value

0 a 1.0
1 a 2.0
2 b 3.0
3 b NaN
4 a 2.0
5 b 3.0
6 b NaN
7 b 1.0
8 a 7.0
9 b 3.0
10 a NaN
11 b 8.0

g1 = df.groupby(['group'])['value']

g2 = df.fillna(0).groupby(['group'])['value']

s = g2.rolling(3,min_periods=1).sum()/g2.rolling(3,min_periods=1).count()

s.reset_index(level = 0,drop=True).sort_index()

0 1.000000
1 1.500000
2 3.000000
3 1.500000
4 1.666667
5 2.000000
6 1.000000
7 1.333333
8 3.666667
9 1.333333
10 3.000000
11 4.000000
Name: value, dtype: float64
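Because both the sum and the count above are taken on the zero-filled data, the ratio is simply the rolling mean with each NaN treated as 0. The sketch below rebuilds the exercise frame under a different name (`df_roll`, chosen here to avoid clashing with `df`) and checks that equivalence:

```python
import numpy as np
import pandas as pd

df_roll = pd.DataFrame({'group': list('aabbabbbabab'),
                        'value': [1, 2, 3, np.nan, 2, 3,
                                  np.nan, 1, 7, 3, np.nan, 8]})

filled = df_roll.fillna(0).groupby('group')['value']

# sum / count, as in the cell above
ratio = filled.rolling(3, min_periods=1).sum() / filled.rolling(3, min_periods=1).count()
# direct rolling mean on the same zero-filled groups
mean = filled.rolling(3, min_periods=1).mean()

# drop the group level and restore the original row order
ratio = ratio.reset_index(level=0, drop=True).sort_index()
mean = mean.reset_index(level=0, drop=True).sort_index()

print(np.allclose(ratio, mean))
```

If the intent were to ignore missing values rather than count them as zeros, the count would have to come from the unfilled groups (`g1`) instead; the cell above defines `g1` but does not use it.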

#Fill in missing values automatically

import pandas as pd

import numpy as np

df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm',
                               'Budapest_PaRis', 'Brussels_londOn'],
                   'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
                   'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                   'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )',
                               '12. Air France', '"Swiss Air"']})

df.head()

Airline 	FlightNumber 	From_To 	RecentDelays

0 KLM(!) 10045.0 LoNDon_paris [23, 47]
1 <Air France> (12) NaN MAdrid_miLAN []
2 (British Airways. ) 10065.0 londON_StockhOlm [24, 43, 87]
3 12. Air France NaN Budapest_PaRis [13]
4 "Swiss Air" 10085.0 Brussels_londOn [67, 32]

df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int)

df.head()

Airline 	FlightNumber 	From_To 	RecentDelays

0 KLM(!) 10045 LoNDon_paris [23, 47]
1 <Air France> (12) 10055 MAdrid_miLAN []
2 (British Airways. ) 10065 londON_StockhOlm [24, 43, 87]
3 12. Air France 10075 Budapest_PaRis [13]
4 "Swiss Air" 10085 Brussels_londOn [67, 32]
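`interpolate()` uses linear interpolation by default, so each gap is filled with the value halfway between its neighbours. A minimal check on the same five flight numbers (`flights` is an illustration name):

```python
import numpy as np
import pandas as pd

flights = pd.Series([10045, np.nan, 10065, np.nan, 10085])

# Linear fill between neighbours, then back to integers
filled = flights.interpolate().astype(int)

print(filled.tolist())  # [10045, 10055, 10065, 10075, 10085]
```

The cast to `int` is only safe once no NaN remains; a gap at the very start of the series would not be filled by the default settings.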

#Split the From_To column into two features

temp = df.From_To.str.split('_',expand = True)

temp.columns = ['From','To']

#Capitalize the first letter and lowercase the rest

temp['From'] = temp['From'].str.capitalize()

temp['To'] = temp['To'].str.capitalize()

#Join the temp columns, then drop the From_To column

df = df.join(temp)

df.head()

Airline 	FlightNumber 	From_To 	RecentDelays 	From 	To

0 KLM(!) 10045 LoNDon_paris [23, 47] London Paris
1 <Air France> (12) 10055 MAdrid_miLAN [] Madrid Milan
2 (British Airways. ) 10065 londON_StockhOlm [24, 43, 87] London Stockholm
3 12. Air France 10075 Budapest_PaRis [13] Budapest Paris
4 "Swiss Air" 10085 Brussels_londOn [67, 32] Brussels London

df = df.drop('From_To',axis = 1)

df.head()

#Remove the extra characters from Airline

df['Airline'] = df['Airline'].str.extract(r'([a-zA-Z\s]+)',expand = False).str.strip()

df.head()

Airline 	FlightNumber 	RecentDelays 	From 	To

0 KLM 10045 [23, 47] London Paris
1 Air France 10055 [] Madrid Milan
2 British Airways 10065 [24, 43, 87] London Stockholm
3 Air France 10075 [13] Budapest Paris
4 Swiss Air 10085 [67, 32] Brussels London
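The pattern `([a-zA-Z\s]+)` captures the first run of letters and whitespace in each string, and `strip()` then removes any leading or trailing spaces that run picked up. A sketch on the five raw airline strings from the exercise (`airline` is an illustration name):

```python
import pandas as pd

airline = pd.Series(['KLM(!)', '<Air France> (12)', '(British Airways. )',
                     '12. Air France', '"Swiss Air"'])

# First run of letters/whitespace per string, trimmed afterwards
cleaned = airline.str.extract(r'([a-zA-Z\s]+)', expand=False).str.strip()

print(cleaned.tolist())
# ['KLM', 'Air France', 'British Airways', 'Air France', 'Swiss Air']
```

For '12. Air France' the first match starts at the space after the dot, which is why the `strip()` step is needed.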

#Expand the lists in RecentDelays into separate columns

delays = df['RecentDelays'].apply(pd.Series)

delays.columns = ['delay_{}'.format(n) for n in range(1,len(delays.columns)+1)]

delays

delay_1 	delay_2 	delay_3

0 23.0 47.0 NaN
1 NaN NaN NaN
2 24.0 43.0 87.0
3 13.0 NaN NaN
4 67.0 32.0 NaN
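An alternative to `apply(pd.Series)` is to build the frame directly from the list of lists; shorter lists are padded with NaN. This is a swapped-in technique rather than what the notebook uses, shown here on the same delay lists (`recent` is an illustration name):

```python
import pandas as pd

recent = pd.Series([[23, 47], [], [24, 43, 87], [13], [67, 32]])

# One row per list; missing positions become NaN
delays = pd.DataFrame(recent.tolist(), index=recent.index)
delays.columns = ['delay_{}'.format(n) for n in range(1, len(delays.columns) + 1)]

print(delays.shape)  # (5, 3)
```

Building from `tolist()` avoids constructing one Series per row, which tends to matter on longer columns.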

#MultiIndex

letters = ['A','B','C']

numbers = list(range(10))

mi = pd.MultiIndex.from_product([letters,numbers])

s = pd.Series(np.random.rand(30),index=mi)

s

A 0 0.935360
1 0.197775
2 0.095093
3 0.465841
4 0.907051
5 0.260017
6 0.439027
7 0.051335
8 0.825270
9 0.554543
B 0 0.335595
1 0.913604
2 0.894998
3 0.489151
4 0.322718
5 0.475781
6 0.727297
7 0.065137
8 0.488248
9 0.386090
C 0 0.264502
1 0.826158
2 0.479834
3 0.893296
4 0.058635
5 0.499101
6 0.873221
7 0.877330
8 0.524506
9 0.256802
dtype: float64

#Locate data

s.loc[pd.IndexSlice[:'B',5:]]

A 5 0.677665
6 0.533658
7 0.326082
8 0.071546
9 0.434138
B 5 0.339513
6 0.901485
7 0.529628
8 0.409966
9 0.650863
dtype: float64
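Since the series above holds random numbers, the slice is easier to follow with predictable values. The sketch below uses 0 to 29 instead (an assumption for illustration only; `s_demo` is not the notebook's `s`), so the selected entries can be read off directly:

```python
import pandas as pd

letters = ['A', 'B', 'C']
numbers = list(range(10))
mi = pd.MultiIndex.from_product([letters, numbers])

# Values 0..29: A holds 0-9, B holds 10-19, C holds 20-29
s_demo = pd.Series(range(30), index=mi)

# First level up to and including 'B', second level from 5 onwards
picked = s_demo.loc[pd.IndexSlice[:'B', 5:]]

print(picked.tolist())  # [5, 6, 7, 8, 9, 15, 16, 17, 18, 19]
```

Label slices on a MultiIndex include both endpoints and expect the index to be sorted, which `from_product` on sorted inputs provides.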

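#Aggregate by index level

# s.sum(level = 1) was removed in later pandas versions; group by the level instead
s.groupby(level = 1).sum()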

0 2.062543
1 1.803863
2 1.762901
3 1.795556
4 2.305935
5 1.446287
6 1.854993
7 1.723217
8 0.964694
9 1.742379
dtype: float64
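Summing by the second index level adds up the A, B and C entries that share the same number. With predictable values 0 to 29 (an assumption for illustration; `s_demo` is not the notebook's random `s`), the result for number k is k + (10 + k) + (20 + k):

```python
import pandas as pd

mi = pd.MultiIndex.from_product([['A', 'B', 'C'], list(range(10))])
s_demo = pd.Series(range(30), index=mi)

# One total per value of the second index level
by_number = s_demo.groupby(level=1).sum()

print(by_number[0])  # 30
print(by_number[9])  # 57
```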

#Swap index levels

new = s.swaplevel(0,1)

new

0 A 0.660232
1 A 0.749437
2 A 0.907719
3 A 0.617251
4 A 0.837966
5 A 0.677665
6 A 0.533658
7 A 0.326082
8 A 0.071546
9 A 0.434138
0 B 0.779293
1 B 0.924057
2 B 0.425088
3 B 0.652168
4 B 0.811879
5 B 0.339513
6 B 0.901485
7 B 0.529628
8 B 0.409966
9 B 0.650863
0 C 0.623017
1 C 0.130369
2 C 0.430094
3 C 0.526137
4 C 0.656089
5 C 0.429108
6 C 0.419850
7 C 0.867507
8 C 0.483183
9 C 0.657378
dtype: float64
