1.1
The dataset used in this tutorial is the Titanic dataset, a classic in machine learning.
import numpy as np
import pandas as pd
# Load the training set using a path relative to the current working directory
df = pd.read_csv('train.csv')
df
     PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
  0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
  1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
  2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
  3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000   C123         S
  4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500    NaN         S
...          ...       ...     ...                                                ...     ...   ...    ...    ...               ...      ...    ...       ...
886          887         0       2                              Montvila, Rev. Juozas    male  27.0      0      0            211536  13.0000    NaN         S
887          888         1       1                       Graham, Miss. Margaret Edith  female  19.0      0      0            112053  30.0000    B42         S
888          889         0       3           Johnston, Miss. Catherine Helen "Carrie"  female   NaN      1      2        W./C. 6607  23.4500    NaN         S
889          890         1       1                              Behr, Mr. Karl Howell    male  26.0      0      0            111369  30.0000   C148         C
890          891         0       3                                Dooley, Mr. Patrick    male  32.0      0      0            370376   7.7500    NaN         Q
891 rows × 12 columns
df.head(5)
     PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
  0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
  1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
  2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
  3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000   C123         S
  4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500    NaN         S
# Load the same file again, this time using an absolute path
df = pd.read_csv('/Users/apple/Documents/Hands_on_DA/chapter_1/train.csv')
df.head(5)
     PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
  0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
  1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
  2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
  3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000   C123         S
  4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500    NaN         S
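As we can see, the relative path and the absolute path give the same result.
# With chunksize set, read_csv returns an iterator of DataFrames instead of a single DataFrame
chunk = pd.read_csv('train.csv', chunksize=1000)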
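With chunksize set, the object returned by read_csv is consumed in a loop, one DataFrame of at most chunksize rows at a time, which helps when a file does not fit in memory. The following is a minimal sketch that assumes a small in-memory CSV in place of train.csv; the id and value columns are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical 10-row CSV held in memory so the example needs no file on disk.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

# chunksize=4 yields DataFrames of up to 4 rows each.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)

chunk_sizes = []
value_total = 0
for piece in reader:
    chunk_sizes.append(len(piece))
    value_total += int(piece["value"].sum())

print(chunk_sizes)   # [4, 4, 2]
print(value_total)   # 90
```

Since train.csv has only 891 rows, chunksize=1000 would yield a single chunk containing the whole file.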
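Next, rename the dataset's columns to Chinese, or to any other format you prefer. Besides the method shown in the tutorial, here is another one that I often use.
# rename() with inplace=False (the default) returns a new DataFrame and leaves df untouched
df_ = df.rename(columns={
    'PassengerId': '乘客ID',              # passenger ID
    'Survived': '是否幸存',                # survived or not
    'Pclass': '乘客等级(1/2/3等舱位)',      # passenger class (1st/2nd/3rd)
    'Name': '乘客姓名',                    # passenger name
    'Sex': '性别',                        # sex
    'Age': '年龄',                        # age
    'SibSp': '堂兄弟/妹个数',              # label reads "number of cousins"; SibSp actually counts siblings and spouses aboard
    'Parch': '父母与小孩个数',             # number of parents and children aboard
    'Ticket': '船票信息',                  # ticket information
    'Fare': '票价',                       # fare
    'Cabin': '客舱',                      # cabin
    'Embarked': '登船港口'                 # port of embarkation
}, inplace=False)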
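Column names can also be replaced while the file is being read, by passing names together with header=0 to read_csv so that the file's own header row is discarded. This is only a sketch on a two-column in-memory CSV: the lowercase names are placeholders, and it is not necessarily the method the tutorial itself uses.

```python
import io

import pandas as pd

# Hypothetical two-column CSV standing in for train.csv.
csv_text = "PassengerId,Survived\n1,0\n2,1\n"

renamed = pd.read_csv(
    io.StringIO(csv_text),
    names=["passenger_id", "survived"],  # placeholder replacement names
    header=0,                            # row 0 is the old header; discard it
)

print(list(renamed.columns))  # ['passenger_id', 'survived']
print(len(renamed))           # 2
```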
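The steps above load the data and rename its columns. Before doing any real preprocessing, we should first look at the basic structure of the data; only once we understand it well can we preprocess it sensibly.
df_.info()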
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 乘客ID 891 non-null int64
1 是否幸存 891 non-null int64
2 乘客等级(1/2/3等舱位) 891 non-null int64
3 乘客姓名 891 non-null object
4 性别 891 non-null object
5 年龄 714 non-null float64
6 堂兄弟/妹个数 891 non-null int64
7 父母与小孩个数 891 non-null int64
8 船票信息 891 non-null object
9 票价 891 non-null float64
10 客舱 204 non-null object
11 登船港口 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
df_.tail(15)
     乘客ID  是否幸存  乘客等级(1/2/3等舱位)                                       乘客姓名    性别  年龄  堂兄弟/妹个数  父母与小孩个数          船票信息     票价  客舱  登船港口
876     877         0                      3                  Gustafsson, Mr. Alfred Ossian    male  20.0              0               0              7534   9.8458   NaN         S
877     878         0                      3                           Petroff, Mr. Nedelio    male  19.0              0               0            349212   7.8958   NaN         S
878     879         0                      3                             Laleff, Mr. Kristo    male   NaN              0               0            349217   7.8958   NaN         S
879     880         1                      1  Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)  female  56.0              0               1             11767  83.1583   C50         C
880     881         1                      2   Shelley, Mrs. William (Imanita Parrish Hall)  female  25.0              0               1            230433  26.0000   NaN         S
881     882         0                      3                             Markun, Mr. Johann    male  33.0              0               0            349257   7.8958   NaN         S
882     883         0                      3                   Dahlberg, Miss. Gerda Ulrika  female  22.0              0               0              7552  10.5167   NaN         S
883     884         0                      2                  Banfield, Mr. Frederick James    male  28.0              0               0  C.A./SOTON 34068  10.5000   NaN         S
884     885         0                      3                         Sutehall, Mr. Henry Jr    male  25.0              0               0   SOTON/OQ 392076   7.0500   NaN         S
885     886         0                      3           Rice, Mrs. William (Margaret Norton)  female  39.0              0               5            382652  29.1250   NaN         Q
886     887         0                      2                          Montvila, Rev. Juozas    male  27.0              0               0            211536  13.0000   NaN         S
887     888         1                      1                   Graham, Miss. Margaret Edith  female  19.0              0               0            112053  30.0000   B42         S
888     889         0                      3       Johnston, Miss. Catherine Helen "Carrie"  female   NaN              1               2        W./C. 6607  23.4500   NaN         S
889     890         1                      1                          Behr, Mr. Karl Howell    male  26.0              0               0            111369  30.0000  C148         C
890     891         0                      3                            Dooley, Mr. Patrick    male  32.0              0               0            370376   7.7500   NaN         Q
Missing values (nulls) are a nuisance in any dataset: they complicate both data processing and the search for patterns in the data. In machine learning, for example, a NaN in the input propagates through arithmetic, so the loss and its gradients can become NaN as well.
df_.isnull().head()
   乘客ID  是否幸存  乘客等级(1/2/3等舱位)  乘客姓名   性别   年龄  堂兄弟/妹个数  父母与小孩个数  船票信息   票价   客舱  登船港口
0   False     False                  False     False  False  False          False           False     False  False   True     False
1   False     False                  False     False  False  False          False           False     False  False  False     False
2   False     False                  False     False  False  False          False           False     False  False   True     False
3   False     False                  False     False  False  False          False           False     False  False  False     False
4   False     False                  False     False  False  False          False           False     False  False   True     False
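A frame of booleans is hard to scan, so a common follow-up is to count the missing values per column. The sketch below uses a three-row frame with invented values; on the Titanic data the same two calls would be applied to df_.

```python
import numpy as np
import pandas as pd

# Invented sample with the same kind of gaps as the Age and Cabin columns.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
})

missing_counts = sample.isnull().sum()   # True counts as 1, so sum() counts NaNs
missing_share = sample.isnull().mean()   # fraction of rows that are missing

print(missing_counts["Age"], missing_counts["Cabin"])  # 1 2
```

The info() output earlier carries the same information in another form: a non-null count below 891 means the column has missing values.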
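Finally, save the data to a new file. Note that the call below writes df, which still has the original English column names, and also writes the row index; this is why the file shows English headers and an extra Unnamed: 0 column when it is read back in section 1.3. To save the renamed frame without the index, use df_.to_csv('train_chinese.csv', index=False) instead.
df.to_csv('train_chinese.csv')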
1.2
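pandas has two main data structures: DataFrame and Series.
1. DataFrame
A DataFrame is the main pandas data structure for tabular data. It resembles an Excel sheet, a SQL table, or a two-dimensional array in Python, but is more flexible. A DataFrame can hold several kinds of data, and each column can have its own data type. Because it carries row and column labels, data can easily be accessed and modified by label.
Properties:
1. It is a two-dimensional, size-mutable tabular structure that can hold heterogeneous data.
2. It has row labels (index) and column labels (columns).
3. Data can be accessed and modified by label.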
import numpy as np
import pandas as pd
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
df_example = pd.DataFrame(data)
df_example
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
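2. Series
The word means a sequence. A Series is a one-dimensional labeled array whose values all share one dtype. That dtype can be anything (integers, strings, floats, Python objects); mixed values are stored under the object dtype. The labels of a Series are called its index.
Properties:
1. It is a one-dimensional array of values with a single dtype.
2. It has an associated array of labels, the index, which is used to access the data.
# A dict becomes a Series whose index is the dict's keys
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
series_example = pd.Series(sdata)
series_example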
Ohio 35000
Texas 71000
Oregon 16000
Utah 5000
dtype: int64
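The two structures are closely related: selecting one column of a DataFrame gives a Series, and a Series is read by index label. A short sketch reusing the example values from this section (the two-row DataFrame is a shortened stand-in for df_example):

```python
import pandas as pd

series_example = pd.Series({"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000})
df_small = pd.DataFrame({"state": ["Ohio", "Nevada"], "year": [2000, 2001]})

texas_value = series_example["Texas"]       # single label gives a scalar
subset = series_example[["Ohio", "Utah"]]   # list of labels gives a Series
state_column = df_small["state"]            # one DataFrame column is a Series

print(texas_value)                          # 71000
print(isinstance(state_column, pd.Series))  # True
```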
df = pd.read_csv('train.csv')
df.head(5)
     PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
  0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
  1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
  2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
  3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000   C123         S
  4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500    NaN         S
df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
There are usually several ways to view the values of a single column.
# Bracket indexing by column name
df['Cabin'].head(5)
0 NaN
1 C85
2 NaN
3 C123
4 NaN
Name: Cabin, dtype: object
# Attribute access; works only when the column name is a valid Python identifier
df.Cabin.head(5)
0 NaN
1 C85
2 NaN
3 C123
4 NaN
Name: Cabin, dtype: object
# take() selects the first five rows by position; the column is picked afterwards
first_five_cabins = df.take(np.arange(5), axis=0)['Cabin']
first_five_cabins
0 NaN
1 C85
2 NaN
3 C123
4 NaN
Name: Cabin, dtype: object
# Plain Python iteration over the column, stopping after five values
for i, value in enumerate(df['Cabin']):
    if i >= 5:
        break
    print(value)
nan
C85
nan
C123
nan
# Look up the column's integer position, then slice by position with iloc
cabin_column_index = df.columns.get_loc('Cabin')
first_five_cabins = df.iloc[:5, cabin_column_index]
first_five_cabins
0 NaN
1 C85
2 NaN
3 C123
4 NaN
Name: Cabin, dtype: object
Sometimes we also need to discard or delete parts of the original data. The most common tools are the del keyword and the pandas drop method.
test_1 = pd.read_csv('test_1.csv')
test_1.head(3)
     Unnamed: 0  PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked    a
  0           0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S  100
  1           1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C  100
  2           2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S  100
# Use the del keyword; this modifies test_1 in place
del test_1['a']
test_1.head(3)
     Unnamed: 0  PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
  0           0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
  1           1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
  2           2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
# Use the drop method
test_1 = test_1.drop('a', axis=1)
test_1
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[14], line 2
      1 # Use the drop method
----> 2 test_1 = test_1.drop('a', axis=1)
      3 test_1
(... frames inside pandas/core/frame.py, generic.py and indexes/base.py ...)
KeyError: "['a'] not found in axis"
This KeyError is expected here: column 'a' was already removed by the del statement in the previous cell, so drop finds nothing to remove. Reload test_1.csv before calling drop, or pass errors='ignore' so that missing labels are skipped.
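# Alternative: rebuild the frame from the columns on either side of 'a'.
# Like drop, this needs 'a' to still exist, so reload test_1.csv first.
col_index = test_1.columns.get_loc('a')
test_1 = test_1.iloc[:, :col_index].join(test_1.iloc[:, col_index + 1:])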
# With inplace=False, drop returns a new DataFrame and df keeps all 12 columns
df.drop(['PassengerId', 'Name', 'Age', 'Ticket'], axis=1, inplace=False).head(5)
   Survived  Pclass     Sex  SibSp  Parch     Fare  Cabin  Embarked
0         0       3    male      1      0   7.2500    NaN         S
1         1       1  female      1      0  71.2833    C85         C
2         1       3  female      0      0   7.9250    NaN         S
3         1       1  female      1      0  53.1000   C123         S
4         0       3    male      0      0   8.0500    NaN         S
df
     PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
  0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
  1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
  2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
  3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000   C123         S
  4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500    NaN         S
...          ...       ...     ...                                                ...     ...   ...    ...    ...               ...      ...    ...       ...
886          887         0       2                              Montvila, Rev. Juozas    male  27.0      0      0            211536  13.0000    NaN         S
887          888         1       1                       Graham, Miss. Margaret Edith  female  19.0      0      0            112053  30.0000    B42         S
888          889         0       3           Johnston, Miss. Catherine Helen "Carrie"  female   NaN      1      2        W./C. 6607  23.4500    NaN         S
889          890         1       1                              Behr, Mr. Karl Howell    male  26.0      0      0            111369  30.0000   C148         C
890          891         0       3                                Dooley, Mr. Patrick    male  32.0      0      0            370376   7.7500    NaN         Q
891 rows × 12 columns
pandas has many built-in ways to filter data with boolean conditions, which makes it easy to select the information we need and discard the rest.
# Passengers younger than 10
df[df["Age"] < 10].head(10)
     PassengerId  Survived  Pclass                                      Name     Sex   Age  SibSp  Parch         Ticket     Fare  Cabin  Embarked
  7            8         0       3            Palsson, Master. Gosta Leonard    male  2.00      3      1         349909  21.0750    NaN         S
 10           11         1       3           Sandstrom, Miss. Marguerite Rut  female  4.00      1      1        PP 9549  16.7000     G6         S
 16           17         0       3                      Rice, Master. Eugene    male  2.00      4      1         382652  29.1250    NaN         Q
 24           25         0       3             Palsson, Miss. Torborg Danira  female  8.00      3      1         349909  21.0750    NaN         S
 43           44         1       2  Laroche, Miss. Simonne Marie Anne Andree  female  3.00      1      2  SC/Paris 2123  41.5792    NaN         C
 50           51         0       3                Panula, Master. Juha Niilo    male  7.00      4      1        3101295  39.6875    NaN         S
 58           59         1       2              West, Miss. Constance Mirium  female  5.00      1      2     C.A. 34651  27.7500    NaN         S
 63           64         0       3                     Skoog, Master. Harald    male  4.00      3      2         347088  27.9000    NaN         S
 78           79         1       2             Caldwell, Master. Alden Gates    male  0.83      0      2         248738  29.0000    NaN         S
119          120         0       3         Andersson, Miss. Ellis Anna Maria  female  2.00      4      2         347082  31.2750    NaN         S
mid_age = df[(df['Age'] > 10) & (df['Age'] < 50)]
mid_age
     PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
  0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
  1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
  2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
  3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000   C123         S
  4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500    NaN         S
...          ...       ...     ...                                                ...     ...   ...    ...    ...               ...      ...    ...       ...
885          886         0       3               Rice, Mrs. William (Margaret Norton)  female  39.0      0      5            382652  29.1250    NaN         Q
886          887         0       2                              Montvila, Rev. Juozas    male  27.0      0      0            211536  13.0000    NaN         S
887          888         1       1                       Graham, Miss. Margaret Edith  female  19.0      0      0            112053  30.0000    B42         S
889          890         1       1                              Behr, Mr. Karl Howell    male  26.0      0      0            111369  30.0000   C148         C
890          891         0       3                                Dooley, Mr. Patrick    male  32.0      0      0            370376   7.7500    NaN         Q
576 rows × 12 columns
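Two details of this filter are easy to miss: comparisons against NaN evaluate to False, so rows with a missing Age never pass, and the strict inequalities also leave out ages of exactly 10 and 50. A minimal sketch on six invented ages shows both effects:

```python
import numpy as np
import pandas as pd

# Invented ages, including the two boundary values and one missing value.
ages = pd.DataFrame({"Age": [5.0, 10.0, 30.0, np.nan, 50.0, 49.0]})

# Each condition needs its own parentheses because & binds tighter than > and <.
mask = (ages["Age"] > 10) & (ages["Age"] < 50)
selected = ages[mask]

print(list(selected.index))  # [2, 5]
```

Note that the filtered frame keeps the original row labels, which is why the index is reset before label-based lookups.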
# Renumber the rows 0..575 so that label 100 refers to the 101st remaining row
mid_age = mid_age.reset_index(drop=True)
mid_age.loc[[100], ['Pclass', 'Sex']]
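The difference between loc and iloc
loc: loc indexes by label, using row and column labels to access data. It is the more widely applicable of the two and accepts label names, label slices, and a mix of both. Slices in loc include the end point: a range returns every element in it, the end label included.
iloc: iloc indexes by integer position. It accepts only integer positions, not label names, and its slices exclude the end point, like ordinary Python slicing.
mid_age.iloc[[100, 105, 108], [2, 3, 4]]
     Pclass                               Name   Sex
100       2  Byles, Rev. Thomas Roussel Davids  male
105       3           Cribb, Mr. John Hatfield  male
108       3                    Calic, Mr. Jovo  male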
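The end-point rule is the part that most often causes off-by-one mistakes, so here is a small sketch of it. The frame and its letter labels are invented for illustration:

```python
import pandas as pd

# Invented frame with string row labels so labels and positions cannot be confused.
letters = pd.DataFrame({"x": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

by_label = letters.loc["a":"c", "x"]   # label slice: the end label "c" is included
by_position = letters.iloc[0:2, 0]     # position slice: position 2 is excluded

print(list(by_label))     # [10, 20, 30]
print(list(by_position))  # [10, 20]
```

When the index is the default 0, 1, 2, ... (as after reset_index), loc[0:2] and iloc[0:2] therefore return three rows and two rows respectively.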
1.3
import numpy as np
import pandas as pd
text = pd.read_csv('train_chinese.csv')
text.head()
     Unnamed: 0  PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
  0           0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
  1           1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
  2           2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
  3           3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000   C123         S
  4           4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500    NaN         S
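During preprocessing I generally do not work directly on the original data, because a wrong operation could contaminate it. Instead I create a separate data object to operate on. Plain assignment (frame = text) would only create a second name for the same DataFrame, so an explicit copy is needed.
frame = text.copy()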
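Plain assignment only binds a second name to the same DataFrame, so protecting the original requires an explicit copy(). A small sketch with placeholder fares makes the difference visible:

```python
import pandas as pd

original = pd.DataFrame({"Fare": [7.25, 71.28]})

alias = original               # same object under a second name
independent = original.copy()  # separate object with its own data

alias.loc[0, "Fare"] = 0.0          # also changes original
independent.loc[1, "Fare"] = -1.0   # leaves original alone

print(alias is original)         # True
print(original.loc[0, "Fare"])   # 0.0
print(original.loc[1, "Fare"])   # 71.28
```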
frame.sort_values(by='Sex', ascending=True, inplace=False)
     Unnamed: 0  PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket      Fare  Cabin  Embarked
383         383          384         1       1  Holverson, Mrs. Alexander Oskar (Mary Aline To...  female  35.0      1      0            113789   52.0000    NaN         S
218         218          219         1       1                              Bazzani, Miss. Albina  female  32.0      0      0             11813   76.2917    D15         C
609         609          610         1       1                          Shutes, Miss. Elizabeth W  female  40.0      0      0          PC 17582  153.4625   C125         S
216         216          217         1       3                             Honkanen, Miss. Eliina  female  27.0      0      0  STON/O2. 3101283    7.9250    NaN         S
215         215          216         1       1                            Newell, Miss. Madeleine  female  31.0      1      0             35273  113.2750    D36         C
...         ...          ...       ...     ...                                                ...     ...   ...    ...    ...               ...       ...    ...       ...
371         371          372         0       3                          Wiklund, Mr. Jakob Alfred    male  18.0      1      0           3101267    6.4958    NaN         S
372         372          373         0       3                         Beavan, Mr. William Thomas    male  19.0      0      0            323951    8.0500    NaN         S
373         373          374         0       1                                Ringhini, Mr. Sante    male  22.0      0      0          PC 17760  135.6333    NaN         C
360         360          361         0       3                                 Skoog, Mr. Wilhelm    male  40.0      1      4            347088   27.9000    NaN         S
890         890          891         0       3                                Dooley, Mr. Patrick    male  32.0      0      0            370376    7.7500    NaN         Q
891 rows × 13 columns
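Let's build a small DataFrame to look at more ways of sorting, by index label as well as by value.
test = pd.DataFrame(np.arange(8).reshape(2, 4),
                    index=['2', '1'],
                    columns=['d', 'a', 'b', 'c'])
test
# Sort the rows by their index labels
test.sort_index()
# Sort the columns by their column labels
test.sort_index(axis=1)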
To sort in descending order, do the same as before but pass ascending=False.
# Sort by column 'a' first, then by column 'c', both descending
test.sort_values(by=['a', 'c'], ascending=False)
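Because the cells above appear without their outputs, the sketch below rebuilds the same small frame and prints what each sort returns. The last call, which passes a list to ascending to mix directions per column, is an addition that goes beyond the original cells.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame(np.arange(8).reshape(2, 4),
                    index=["2", "1"],
                    columns=["d", "a", "b", "c"])

rows_sorted = test.sort_index()          # order rows by index label
cols_sorted = test.sort_index(axis=1)    # order columns by column label
by_values = test.sort_values(by=["a", "c"], ascending=False)
# One flag per key: 'a' ascending, ties broken by 'c' descending.
mixed = test.sort_values(by=["a", "c"], ascending=[True, False])

print(list(rows_sorted.index))    # ['1', '2']
print(list(cols_sorted.columns))  # ['a', 'b', 'c', 'd']
print(list(by_values.index))      # ['1', '2']
print(list(mixed.index))          # ['2', '1']
```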
Now let's apply the techniques above to the Titanic dataset.
text.sort_values(by=['Ticket', 'Age'], ascending=False).head(3)
     Unnamed: 0  PassengerId  Survived  Pclass                          Name     Sex   Age  SibSp  Parch     Ticket  Fare  Cabin  Embarked
745         745          746         0       1  Crosby, Capt. Edward Gifford    male  70.0      1      1  WE/P 5735  71.0    B22         S
540         540          541         1       1       Crosby, Miss. Harriet R  female  36.0      0      2  WE/P 5735  71.0    B22         S
219         219          220         0       2            Harris, Mr. Walter    male  30.0      0      0  W/C 14208  10.5    NaN         S
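Note that the call above sorts by Ticket, the ticket number stored as a string, rather than by Fare, so it does not actually rank passengers by what they paid. The observation this exercise is meant to lead to is that passengers who paid higher fares tended to survive at a higher rate; checking it requires sorting or grouping by Fare and looking at far more than three rows.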
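One way to examine the relationship between fare and survival is to sort by Fare and then compare survival rates across fare bands. The sketch below shows only the mechanics: the six rows, the bin edges, and the band names are all invented and say nothing about the real dataset.

```python
import pandas as pd

# Invented rows for illustration only; not taken from train.csv.
toy = pd.DataFrame({
    "Fare": [7.25, 8.05, 13.0, 30.0, 53.1, 71.28],
    "Survived": [0, 1, 0, 1, 1, 1],
})

# Rank passengers by what they paid, most expensive first.
by_fare = toy.sort_values(by="Fare", ascending=False)

# Hypothetical bands: (0, 10], (10, 40], (40, 100].
toy["fare_band"] = pd.cut(toy["Fare"], bins=[0, 10, 40, 100],
                          labels=["low", "mid", "high"])

# Mean of a 0/1 column is the share of survivors in each band.
survival_rate = toy.groupby("fare_band", observed=True)["Survived"].mean()
print(survival_rate)
```

On the real data the same steps would be run on text, with bin edges chosen after looking at the output of text['Fare'].describe().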
frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),
                        columns=['a', 'b', 'c'],
                        index=['one', 'two', 'three'])
frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),
                        columns=['a', 'e', 'c'],
                        index=['first', 'one', 'two', 'second'])
frame1_a
         a    b    c
one    0.0  1.0  2.0
two    3.0  4.0  5.0
three  6.0  7.0  8.0
frame1_b
          a     e     c
first   0.0   1.0   2.0
one     3.0   4.0   5.0
two     6.0   7.0   8.0
second  9.0  10.0  11.0
frame1_a + frame1_b
          a    b     c    e
first   NaN  NaN   NaN  NaN
one     3.0  NaN   7.0  NaN
second  NaN  NaN   NaN  NaN
three   NaN  NaN   NaN  NaN
two     9.0  NaN  13.0  NaN
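Addition aligns the two frames on both row and column labels, so any cell whose row or column label is absent from either frame becomes NaN. If a value missing on one side should instead count as zero, the add method accepts a fill_value argument. A sketch using the same two frames:

```python
import numpy as np
import pandas as pd

frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),
                        columns=["a", "b", "c"],
                        index=["one", "two", "three"])
frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),
                        columns=["a", "e", "c"],
                        index=["first", "one", "two", "second"])

# A cell present in only one frame is paired with 0 instead of producing NaN.
filled = frame1_a.add(frame1_b, fill_value=0)

print(filled.loc["one", "a"])    # 3.0  (0.0 + 3.0, present in both)
print(filled.loc["one", "b"])    # 1.0  (only in frame1_a)
print(filled.loc["first", "a"])  # 0.0  (only in frame1_b)
print(filled.loc["first", "b"])  # nan  (missing from both frames)
```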
# Largest family aboard: siblings/spouses plus parents/children, added row by row
(text['SibSp'] + text['Parch']).max()
10
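One very useful pandas method is describe, which generates descriptive statistics for a DataFrame or a Series. It gives a quick overview of every column (for a DataFrame) or of the whole sequence (for a Series), helping you understand the distribution and characteristics of the data.
When you call describe() on a pandas DataFrame or Series, it returns the following statistics for numeric data:
count: the number of non-null elements.
mean: the average.
std: the standard deviation, which describes how spread out the data is.
min: the minimum.
25%: the first quartile, i.e. the lower quartile.
50%: the median, i.e. the second quartile.
75%: the third quartile, i.e. the upper quartile.
max: the maximum.
These statistics give an analyst a first understanding of the dataset. They help to identify outliers, to understand the distribution and range of the data, and to decide which preprocessing steps may be needed.
# Ticket is a string (object) column, so describe() reports count, unique, top and freq instead
text['Ticket'].describe()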
count 891
unique 681
top CA. 2343
freq 7
Name: Ticket, dtype: object
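The output above lists count, unique, top, and freq rather than the numeric statistics, because describe chooses its summary from the column's dtype. A sketch contrasting the two cases on a four-row frame; the rows are made up, with one ticket string repeated on purpose so that top and freq are unambiguous:

```python
import pandas as pd

# Made-up rows; "A/5 21171" appears twice so it is the most frequent value.
sample = pd.DataFrame({
    "Fare": [7.25, 71.2833, 7.925, 53.1],
    "Ticket": ["A/5 21171", "PC 17599", "A/5 21171", "113803"],
})

numeric_summary = sample["Fare"].describe()
object_summary = sample["Ticket"].describe()

print(list(numeric_summary.index))
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(list(object_summary.index))
# ['count', 'unique', 'top', 'freq']
print(object_summary["top"], object_summary["freq"])  # A/5 21171 2
```

Here unique is the number of distinct values, top is the most frequent value, and freq is how often it occurs.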