1.1
The dataset used in this tutorial is the Titanic dataset, a classic in machine learning.

    # Import the required libraries
    import numpy as np
    import pandas as pd

    # By convention, df is shorthand for a pandas DataFrame
    # Read the CSV data with pandas' built-in reader
    df = pd.read_csv('train.csv')
    df

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns

    # Use head() to choose how many rows to display
    df.head(5)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |

    # The import above used a relative path. Relative paths are simpler to write,
    # but in some situations an absolute path is less error-prone.
    # PyCharm can copy a file's absolute path for us, for example:
    # /Users/apple/Documents/Hands_on_DA/chapter_1/train.csv
    df = pd.read_csv('/Users/apple/Documents/Hands_on_DA/chapter_1/train.csv')
    df.head(5)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
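
Both paths give the same result.

    # Passing chunksize makes read_csv return an iterator that yields the data
    # in blocks (DataFrames of up to 1000 rows each) instead of one DataFrame
    chunk = pd.read_csv('train.csv', chunksize=1000)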
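
A minimal sketch of how the chunked reader is consumed. It uses a small in-memory CSV (a hypothetical stand-in for `train.csv`, with made-up rows) so that it runs without any file on disk; the names `csv_text`, `reader`, `chunk_sizes` and `age_total` are illustrative only.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for train.csv (assumption: 5 data rows)
csv_text = "PassengerId,Age\n1,22\n2,38\n3,26\n4,35\n5,35\n"

# With chunksize set, read_csv returns an iterator of DataFrames
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)

chunk_sizes = []
age_total = 0
for part in reader:
    # Each part is an ordinary DataFrame holding at most `chunksize` rows
    chunk_sizes.append(len(part))
    age_total += part["Age"].sum()

print(chunk_sizes)  # [2, 2, 1]
print(age_total)    # 156
```

Processing one block at a time like this keeps memory use bounded, which is the usual reason to reach for `chunksize` on files too large to load at once.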
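Next, rename the dataset's columns to Chinese, or to any other format you prefer. Besides the method mentioned in the tutorial, here is another one I use often.

    # Once the data is loaded, column names can be changed with a dictionary mapping
    df_ = df.rename(columns={
        'PassengerId': '乘客ID',           # passenger ID
        'Survived': '是否幸存',             # survived or not
        'Pclass': '乘客等级(1/2/3等舱位)',  # ticket class (1st/2nd/3rd)
        'Name': '乘客姓名',                # passenger name
        'Sex': '性别',                     # sex
        'Age': '年龄',                     # age
        'SibSp': '堂兄弟/妹个数',           # label reads "number of cousins"; SibSp actually counts siblings and spouses aboard
        'Parch': '父母与小孩个数',          # number of parents and children aboard
        'Ticket': '船票信息',              # ticket information
        'Fare': '票价',                    # fare
        'Cabin': '客舱',                   # cabin
        'Embarked': '登船港口'             # port of embarkation
    }, inplace=False)
    # inplace controls whether df itself is modified. With inplace=False (the default),
    # rename returns a new DataFrame and leaves df unchanged. With inplace=True it
    # modifies df and returns None, so the result should not be assigned.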
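
The effect of `inplace` can be checked on a tiny frame. This is only a sketch: the DataFrame `small` and its two columns are hypothetical and exist purely for illustration.

```python
import pandas as pd

# Hypothetical two-column frame used only for illustration
small = pd.DataFrame({"Age": [22.0, 38.0], "Sex": ["male", "female"]})

# inplace=False (the default): a new DataFrame comes back, `small` is untouched
renamed = small.rename(columns={"Age": "年龄"}, inplace=False)

# inplace=True: `small` itself is modified and the call returns None
result = small.rename(columns={"Sex": "性别"}, inplace=True)

print(list(renamed.columns))  # ['年龄', 'Sex']
print(list(small.columns))    # ['Age', '性别']
print(result)                 # None
```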
With the steps above we have done some initial handling of the data. Normally, though, before processing data we first need to understand its basic structure; only with that understanding can we preprocess it sensibly.

    # Use info() to inspect the basic structure of the data
    df_.info()
    # The output lists each column's dtype and its count of non-null values.
    # float64 and int64 should be familiar. Columns of type object (here, strings)
    # are, in my experience, awkward to work with, so we may choose to convert
    # them during preprocessing.

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 891 entries, 0 to 890
    Data columns (total 12 columns):
     #   Column              Non-Null Count  Dtype
    ---  ------              --------------  -----
     0   乘客ID               891 non-null    int64
     1   是否幸存              891 non-null    int64
     2   乘客等级(1/2/3等舱位)  891 non-null    int64
     3   乘客姓名              891 non-null    object
     4   性别                 891 non-null    object
     5   年龄                 714 non-null    float64
     6   堂兄弟/妹个数          891 non-null    int64
     7   父母与小孩个数         891 non-null    int64
     8   船票信息              891 non-null    object
     9   票价                 891 non-null    float64
     10  客舱                 204 non-null    object
     11  登船港口              889 non-null    object
    dtypes: float64(2), int64(5), object(5)
    memory usage: 83.7+ KB

    # Earlier we used head() to view the first rows; tail() shows the last rows
    df_.tail(15)

| | 乘客ID | 是否幸存 | 乘客等级(1/2/3等舱位) | 乘客姓名 | 性别 | 年龄 | 堂兄弟/妹个数 | 父母与小孩个数 | 船票信息 | 票价 | 客舱 | 登船港口 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
876 | 877 | 0 | 3 | Gustafsson, Mr. Alfred Ossian | male | 20.0 | 0 | 0 | 7534 | 9.8458 | NaN | S |
877 | 878 | 0 | 3 | Petroff, Mr. Nedelio | male | 19.0 | 0 | 0 | 349212 | 7.8958 | NaN | S |
878 | 879 | 0 | 3 | Laleff, Mr. Kristo | male | NaN | 0 | 0 | 349217 | 7.8958 | NaN | S |
879 | 880 | 1 | 1 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 0 | 1 | 11767 | 83.1583 | C50 | C |
880 | 881 | 1 | 2 | Shelley, Mrs. William (Imanita Parrish Hall) | female | 25.0 | 0 | 1 | 230433 | 26.0000 | NaN | S |
881 | 882 | 0 | 3 | Markun, Mr. Johann | male | 33.0 | 0 | 0 | 349257 | 7.8958 | NaN | S |
882 | 883 | 0 | 3 | Dahlberg, Miss. Gerda Ulrika | female | 22.0 | 0 | 0 | 7552 | 10.5167 | NaN | S |
883 | 884 | 0 | 2 | Banfield, Mr. Frederick James | male | 28.0 | 0 | 0 | C.A./SOTON 34068 | 10.5000 | NaN | S |
884 | 885 | 0 | 3 | Sutehall, Mr. Henry Jr | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.0500 | NaN | S |
885 | 886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.1250 | NaN | Q |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
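
Missing values (null/NaN) are a nuisance in any dataset: they complicate both processing the data and spotting patterns in it. In machine learning, for example, a single NaN in the input propagates through arithmetic, so the loss and its gradients can turn into NaN as well.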
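
A short sketch of how a missing value propagates. The Series `ages` is made up for illustration; the point is only the contrast between plain NumPy arithmetic and pandas reductions, which skip NaN by default.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing entry
ages = pd.Series([22.0, np.nan, 26.0])

# NumPy arithmetic propagates NaN: one missing value poisons the result
numpy_total = np.sum(ages.to_numpy())

# pandas reductions skip NaN by default (skipna=True)
pandas_total = ages.sum()

# Count the missing entries
n_missing = int(ages.isnull().sum())

print(numpy_total)   # nan
print(pandas_total)  # 48.0
print(n_missing)     # 1
```

Calling `isnull().sum()` on a whole DataFrame gives the same count per column, which is often a quicker overview than reading the boolean table.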

    # First check whether any values are missing (True marks a missing cell)
    df_.isnull().head()

| | 乘客ID | 是否幸存 | 乘客等级(1/2/3等舱位) | 乘客姓名 | 性别 | 年龄 | 堂兄弟/妹个数 | 父母与小孩个数 | 船票信息 | 票价 | 客舱 | 登船港口 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False | False | False | False | True | False |
1 | False | False | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False | False | True | False |
3 | False | False | False | False | False | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False | False | False | False | True | False |
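
    # Save the data with pandas' built-in to_csv method.
    # Note: this writes df (the original English column names), not the renamed df_,
    # and it also writes the row index as an extra column. To save the renamed
    # columns without the index, use df_.to_csv('train_chinese.csv', index=False).
    df.to_csv('train_chinese.csv')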
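
The effect of the `index` argument can be seen in a round trip through an in-memory buffer. This is a sketch under assumptions: `small` is a hypothetical two-row frame, and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical frame standing in for the Titanic data
small = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})

# Default: the row index is written as an unnamed first column
with_index = io.StringIO()
small.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written
no_index = io.StringIO()
small.to_csv(no_index, index=False)
no_index.seek(0)
cols_no_index = list(pd.read_csv(no_index).columns)

print(cols_with_index)  # ['Unnamed: 0', 'PassengerId', 'Survived']
print(cols_no_index)    # ['PassengerId', 'Survived']
```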
1.2
pandas has two main data structures: DataFrame and Series.
1.DataFrame
A DataFrame is pandas' main structure for tabular data. It resembles an Excel sheet, a SQL table, or a two-dimensional array, but is more flexible: it can hold several kinds of data, and each column can have its own dtype. Because a DataFrame carries row and column labels, data is easy to access and modify by label.

Properties:

1. It is a two-dimensional, size-mutable table that can hold heterogeneous data.
2. It has row labels (index) and column labels (columns).
3. Data can be accessed and modified by label.

    # A DataFrame can be created from a dict
    import numpy as np
    import pandas as pd

    data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
            'year': [2000, 2001, 2002, 2001, 2002, 2003],
            'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
    df_example = pd.DataFrame(data)
    df_example

| | state | year | pop |
---|---|---|---|
0 | Ohio | 2000 | 1.5 |
1 | Ohio | 2001 | 1.7 |
2 | Ohio | 2002 | 3.6 |
3 | Nevada | 2001 | 2.4 |
4 | Nevada | 2002 | 2.9 |
5 | Nevada | 2003 | 3.2 |
2.Series
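As its name suggests, a Series is a sequence: a one-dimensional labeled array. All of its values share one dtype, which can be any type (integers, strings, floats, Python objects, and so on). The labels of a Series are called its index.

Properties:

1. It is a one-dimensional, size-mutable array of homogeneous data.
2. It has an associated array of labels, the index, used to access the data.

    # A Series can be created from a dict (the keys become the index)
    l = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
    series_example = pd.Series(l)
    series_example

    Ohio      35000
    Texas     71000
    Oregon    16000
    Utah       5000
    dtype: int64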
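
To show what the index buys us, here is a small sketch that accesses the same values by label, by position, and through a boolean filter. The variable names (`population`, `by_label`, `by_position`, `large`) are illustrative only.

```python
import pandas as pd

population = pd.Series({"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000})

by_label = population["Texas"]          # access by index label
by_position = population.iloc[0]        # access by integer position
large = population[population > 20000]  # boolean filtering keeps the labels

print(by_label)           # 71000
print(by_position)        # 35000
print(list(large.index))  # ['Ohio', 'Texas']
```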

    # Load the dataset we need
    df = pd.read_csv('train.csv')
    df.head(5)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |

    # Show the name of every column
    df.columns

    Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
           'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
          dtype='object')
There are usually several ways to look at the values of a single column.

    # Method 1: bracket access by column name
    df['Cabin'].head(5)

    0     NaN
    1     C85
    2     NaN
    3    C123
    4     NaN
    Name: Cabin, dtype: object

    # Method 2: attribute access
    df.Cabin.head(5)

    0     NaN
    1     C85
    2     NaN
    3    C123
    4     NaN
    Name: Cabin, dtype: object

    # Method 3: use take, which selects rows by integer position
    first_five_cabins = df.take(np.arange(5), axis=0)['Cabin']
    first_five_cabins

    0     NaN
    1     C85
    2     NaN
    3    C123
    4     NaN
    Name: Cabin, dtype: object

    # Method 4: if we only want to look at the values rather than keep them,
    # we can iterate over the column
    for i, value in enumerate(df['Cabin']):
        if i >= 5:
            break
        print(value)

    nan
    C85
    nan
    C123
    nan

    # Method 5, also the most commonly used: iloc, which indexes by integer position
    cabin_column_index = df.columns.get_loc('Cabin')
    first_five_cabins = df.iloc[:5, cabin_column_index]
    first_five_cabins

    0     NaN
    1     C85
    2     NaN
    3    C123
    4     NaN
    Name: Cabin, dtype: object
Sometimes we also need to discard or delete parts of the original data. The most common tools are the del keyword and pandas' drop method.

    # First read test_1.csv
    test_1 = pd.read_csv('test_1.csv')
    test_1.head(3)

| | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | a |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 100 |
1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 100 |
2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 100 |

    # Use the del keyword
    del test_1['a']
    test_1.head(3)

| | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |

    # Use the drop method
    test_1 = test_1.drop('a', axis=1)
    test_1

    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    Cell In[14], line 2
          1 # Use the drop method
    ----> 2 test_1 = test_1.drop('a', axis=1)
          3 test_1

    File ~/opt/anaconda3/envs/deeplearning/lib/python3.8/site-packages/pandas/core/frame.py:5258, in DataFrame.drop(self, labels, axis, index, columns, level, inplace, errors)
       5110 def drop(
       5111     self,
       5112     labels: IndexLabel = None,
       (...)
       5119     errors: IgnoreRaise = "raise",
       5120 ) -> DataFrame | None:
       5121     """
       5122     Drop specified labels from rows or columns.
       5123
       (...)
       5256     weight  1.0     0.8
       5257     """
    -> 5258     return super().drop(
       5259         labels=labels,
       5260         axis=axis,
       5261         index=index,
       5262         columns=columns,
       5263         level=level,
       5264         inplace=inplace,
       5265         errors=errors,
       5266     )

    File ~/opt/anaconda3/envs/deeplearning/lib/python3.8/site-packages/pandas/core/generic.py:4549, in NDFrame.drop(self, labels, axis, index, columns, level, inplace, errors)
       4547 for axis, labels in axes.items():
       4548     if labels is not None:
    -> 4549         obj = obj._drop_axis(labels, axis, level=level, errors=errors)
       4551 if inplace:
       4552     self._update_inplace(obj)

    File ~/opt/anaconda3/envs/deeplearning/lib/python3.8/site-packages/pandas/core/generic.py:4591, in NDFrame._drop_axis(self, labels, axis, level, errors, only_slice)
       4589         new_axis = axis.drop(labels, level=level, errors=errors)
       4590     else:
    -> 4591         new_axis = axis.drop(labels, errors=errors)
       4592     indexer = axis.get_indexer(new_axis)
       4594 # Case for non-unique axis
       4595 else:

    File ~/opt/anaconda3/envs/deeplearning/lib/python3.8/site-packages/pandas/core/indexes/base.py:6699, in Index.drop(self, labels, errors)
       6697 if mask.any():
       6698     if errors != "ignore":
    -> 6699         raise KeyError(f"{list(labels[mask])} not found in axis")
       6700     indexer = indexer[~mask]
       6701 return self.delete(indexer)

    KeyError: "['a'] not found in axis"

The call fails because `del test_1['a']` in the previous cell already removed column `a`, so there is nothing left for drop to find.

    # Delete by column position (if the integer position of the column is known).
    # Like drop above, this needs column 'a' to still be present; otherwise
    # get_loc raises KeyError.
    col_index = test_1.columns.get_loc('a')
    # Use join to connect the columns on either side of the removed one
    test_1 = test_1.iloc[:, :col_index].join(test_1.iloc[:, col_index+1:])
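
One way to make repeated deletion safe is the `errors='ignore'` option of `drop`. The sketch below uses a hypothetical two-column frame (`frame`) rather than `test_1.csv`, so it runs on its own.

```python
import pandas as pd

# Hypothetical frame with an extra column 'a', mimicking test_1
frame = pd.DataFrame({"Survived": [0, 1], "a": [100, 100]})

# First removal succeeds and returns a new DataFrame
frame = frame.drop(columns="a")

# A second plain drop raises KeyError, since 'a' is already gone
try:
    frame.drop(columns="a")
    raised = False
except KeyError:
    raised = True

# errors='ignore' makes the call a no-op when the label is missing
frame = frame.drop(columns="a", errors="ignore")

print(raised)               # True
print(list(frame.columns))  # ['Survived']
```

This is handy in notebooks, where a cell may be re-run after the column has already been removed.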

    # To look at only some of the columns, drop the others for display.
    # With inplace=False, drop returns a new DataFrame.
    df.drop(['PassengerId', 'Name', 'Age', 'Ticket'], axis=1, inplace=False).head(5)

| | Survived | Pclass | Sex | SibSp | Parch | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 1 | 0 | 7.2500 | NaN | S |
1 | 1 | 1 | female | 1 | 0 | 71.2833 | C85 | C |
2 | 1 | 3 | female | 0 | 0 | 7.9250 | NaN | S |
3 | 1 | 1 | female | 1 | 0 | 53.1000 | C123 | S |
4 | 0 | 3 | male | 0 | 0 | 8.0500 | NaN | S |

    df
    # Because drop was called with inplace=False, the original data is unchanged

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns

pandas has many built-in ways to filter data by logical conditions, which makes it easy to keep the information we need and discard the rest.
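
A minimal sketch of boolean filtering on a made-up frame (`people` and its values are hypothetical). It shows the three combinators and how rows with a missing value behave.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers, one with a missing age
people = pd.DataFrame({
    "Age": [2.0, 22.0, np.nan, 54.0],
    "Sex": ["male", "female", "male", "female"],
})

# Each comparison yields a boolean Series; combine them with & (and), | (or), ~ (not).
# The parentheses are required because & and | bind tighter than < and >.
young_or_old = people[(people["Age"] < 10) | (people["Age"] > 50)]
adult_women = people[(people["Age"] >= 18) & (people["Sex"] == "female")]
not_male = people[~(people["Sex"] == "male")]

# Comparisons against NaN are False, so the row with a missing age is never selected by Age
print(list(young_or_old.index))  # [0, 3]
print(list(adult_women.index))   # [1, 3]
print(list(not_male.index))      # [1, 3]
```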

    # Show passengers younger than 10
    df[df["Age"] < 10].head(10)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.00 | 3 | 1 | 349909 | 21.0750 | NaN | S |
10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.00 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.00 | 4 | 1 | 382652 | 29.1250 | NaN | Q |
24 | 25 | 0 | 3 | Palsson, Miss. Torborg Danira | female | 8.00 | 3 | 1 | 349909 | 21.0750 | NaN | S |
43 | 44 | 1 | 2 | Laroche, Miss. Simonne Marie Anne Andree | female | 3.00 | 1 | 2 | SC/Paris 2123 | 41.5792 | NaN | C |
50 | 51 | 0 | 3 | Panula, Master. Juha Niilo | male | 7.00 | 4 | 1 | 3101295 | 39.6875 | NaN | S |
58 | 59 | 1 | 2 | West, Miss. Constance Mirium | female | 5.00 | 1 | 2 | C.A. 34651 | 27.7500 | NaN | S |
63 | 64 | 0 | 3 | Skoog, Master. Harald | male | 4.00 | 3 | 2 | 347088 | 27.9000 | NaN | S |
78 | 79 | 1 | 2 | Caldwell, Master. Alden Gates | male | 0.83 | 0 | 2 | 248738 | 29.0000 | NaN | S |
119 | 120 | 0 | 3 | Andersson, Miss. Ellis Anna Maria | female | 2.00 | 4 | 2 | 347082 | 31.2750 | NaN | S |

    # Narrow the query further (age strictly between 10 and 50)
    # and give the result a name
    mid_age = df[(df['Age'] > 10) & (df['Age'] < 50)]
    mid_age

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
885 | 886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.1250 | NaN | Q |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
576 rows × 12 columns

    # The filtered rows keep their original, non-contiguous index. Before
    # processing mid_age any further, the first task is to rebuild the index.
    # drop=True discards the old index instead of keeping it as a new column.
    mid_age = mid_age.reset_index(drop=True)
    mid_age.loc[[100], ['Pclass', 'Sex']]

| | Pclass | Sex |
---|---|---|
100 | 2 | male |
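
The difference between loc and iloc

loc: label-based indexing. It uses row and column labels to access data and is the more general of the two: it accepts single labels, lists of labels, label slices, and combinations of these. A loc slice includes its end point, so a range returns every element up to and including the end label.

iloc: integer-position indexing. It accepts only integer positions, not label names, and its slices exclude the end point, just like ordinary Python slicing.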
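
The end-point rule is easiest to see on a frame whose index is not contiguous, which is exactly what a filter produces. The sketch below uses made-up ages; `adults`, `by_label`, `by_position` and `reindexed` are illustrative names.

```python
import pandas as pd

# Hypothetical ages; rows 1 and 4 will be filtered out
ages = pd.DataFrame({"Age": [22, 4, 38, 35, 8, 27]})

# Filtering leaves the index labels 0, 2, 3, 5
adults = ages[ages["Age"] > 10]

by_label = adults.loc[2:3]      # labels 2 through 3, end point included
by_position = adults.iloc[2:3]  # position 2 only, end point excluded

# After reset_index(drop=True) labels and positions coincide (0, 1, 2, 3),
# but the end-point rule still differs
reindexed = adults.reset_index(drop=True)

print(list(by_label.index))                 # [2, 3]
print(list(by_position.index))              # [3]
print(list(reindexed.loc[2:3].index))       # [2, 3]
print(list(reindexed.iloc[2:3].index))      # [2]
```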

    # Several rows and several columns can be selected at once
    mid_age.iloc[[100, 105, 108], [2, 3, 4]]

| | Pclass | Name | Sex |
---|---|---|---|
100 | 2 | Byles, Rev. Thomas Roussel Davids | male |
105 | 3 | Cribb, Mr. John Hatfield | male |
108 | 3 | Calic, Mr. Jovo | male |
1.3
    import numpy as np
    import pandas as pd

    # Load the file saved in section 1.1. That cell wrote df rather than the
    # renamed df_, and kept the row index, so the columns here are still in
    # English and the saved index comes back as 'Unnamed: 0'.
    text = pd.read_csv('train_chinese.csv')
    text.head()

| | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
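
During preprocessing I generally avoid working directly on the original data, because a mistaken operation could contaminate it. Instead I define a separate data object and work on that.

    # copy() creates an independent object; a plain assignment (frame = text)
    # would only give the same DataFrame a second name
    frame = text.copy()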

    # Sort the data in ascending order.
    # ascending sets the sort direction; here we simply sort by sex to see the effect.
    frame.sort_values(by='Sex', ascending=True, inplace=False)

| | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
383 | 383 | 384 | 1 | 1 | Holverson, Mrs. Alexander Oskar (Mary Aline To... | female | 35.0 | 1 | 0 | 113789 | 52.0000 | NaN | S |
218 | 218 | 219 | 1 | 1 | Bazzani, Miss. Albina | female | 32.0 | 0 | 0 | 11813 | 76.2917 | D15 | C |
609 | 609 | 610 | 1 | 1 | Shutes, Miss. Elizabeth W | female | 40.0 | 0 | 0 | PC 17582 | 153.4625 | C125 | S |
216 | 216 | 217 | 1 | 3 | Honkanen, Miss. Eliina | female | 27.0 | 0 | 0 | STON/O2. 3101283 | 7.9250 | NaN | S |
215 | 215 | 216 | 1 | 1 | Newell, Miss. Madeleine | female | 31.0 | 1 | 0 | 35273 | 113.2750 | D36 | C |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
371 | 371 | 372 | 0 | 3 | Wiklund, Mr. Jakob Alfred | male | 18.0 | 1 | 0 | 3101267 | 6.4958 | NaN | S |
372 | 372 | 373 | 0 | 3 | Beavan, Mr. William Thomas | male | 19.0 | 0 | 0 | 323951 | 8.0500 | NaN | S |
373 | 373 | 374 | 0 | 1 | Ringhini, Mr. Sante | male | 22.0 | 0 | 0 | PC 17760 | 135.6333 | NaN | C |
360 | 360 | 361 | 0 | 3 | Skoog, Mr. Wilhelm | male | 40.0 | 1 | 4 | 347088 | 27.9000 | NaN | S |
890 | 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 13 columns

Let's build a new DataFrame to look at more ways of sorting.

    test = pd.DataFrame(np.arange(8).reshape(2, 4),
                        index=['2', '1'],
                        columns=['d', 'a', 'b', 'c'])
    test

| | d | a | b | c |
---|---|---|---|---|
2 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |

    # Sort the row index in ascending order
    test.sort_index()

| | d | a | b | c |
---|---|---|---|---|
1 | 4 | 5 | 6 | 7 |
2 | 0 | 1 | 2 | 3 |

    # Sort the column index in ascending order
    test.sort_index(axis=1)

| | a | b | c | d |
---|---|---|---|---|
2 | 1 | 2 | 3 | 0 |
1 | 5 | 6 | 7 | 4 |

To sort in descending order, pass ascending=False to the same calls.
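
As a sketch of the options, `ascending` accepts either a single flag for all sort keys or one flag per key. The frame `fares` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical class and fare values
fares = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

# One flag for all sort keys: everything descending
all_desc = fares.sort_values(by=["Pclass", "Fare"], ascending=False)

# One flag per sort key: Pclass ascending, Fare descending within each class
mixed = fares.sort_values(by=["Pclass", "Fare"], ascending=[True, False])

print(list(all_desc.index))  # [2, 0, 1, 3]
print(list(mixed.index))     # [1, 3, 2, 0]
```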

    # Sorting by two columns also works: rows are ordered by 'a' first,
    # and ties are broken by 'c'
    test.sort_values(by=['a', 'c'], ascending=False)

| | d | a | b | c |
---|---|---|---|---|
1 | 4 | 5 | 6 | 7 |
2 | 0 | 1 | 2 | 3 |

Now let's apply these techniques to the Titanic dataset.

    text.sort_values(by=['Ticket', 'Age'], ascending=False).head(3)

| | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
745 | 745 | 746 | 0 | 1 | Crosby, Capt. Edward Gifford | male | 70.0 | 1 | 1 | WE/P 5735 | 71.0 | B22 | S |
540 | 540 | 541 | 1 | 1 | Crosby, Miss. Harriet R | female | 36.0 | 0 | 2 | WE/P 5735 | 71.0 | B22 | S |
219 | 219 | 220 | 0 | 2 | Harris, Mr. Walter | male | 30.0 | 0 | 0 | W/C 14208 | 10.5 | NaN | S |
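
Here we looked only at the ticket and age columns. Note that `Ticket` holds the ticket number as a string, so this sort is lexicographic and says nothing about price, and the three rows shown are too few to draw conclusions about survival. To check the idea that passengers who paid higher fares were more likely to survive, sort or group by `Fare` instead.

    # pandas can also do arithmetic
    frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),
                            columns=['a', 'b', 'c'],
                            index=['one', 'two', 'three'])
    frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),
                            columns=['a', 'e', 'c'],
                            index=['first', 'one', 'two', 'second'])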
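
Arithmetic between two DataFrames aligns on both row and column labels. The sketch below uses two small hypothetical frames (`left` and `right`, not the frames defined above) to show where NaN appears and how `add` with `fill_value` changes that.

```python
import numpy as np
import pandas as pd

# Two hypothetical frames that share only row 'two' and column 'a'
left = pd.DataFrame(np.arange(4.).reshape(2, 2), columns=["a", "b"], index=["one", "two"])
right = pd.DataFrame(np.arange(4.).reshape(2, 2), columns=["a", "c"], index=["two", "three"])

# + aligns on row and column labels; a cell missing from either side becomes NaN
plain_sum = left + right

# add(..., fill_value=0) treats a cell missing from one side as 0;
# cells missing from both sides are still NaN
filled_sum = left.add(right, fill_value=0)

print(plain_sum.loc["two", "a"])          # 2.0
print(int(plain_sum.notna().sum().sum()))   # 1
print(int(filled_sum.notna().sum().sum()))  # 7
```

The result always covers the union of the two label sets, so its shape can be larger than either input.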

    frame1_a

| | a | b | c |
---|---|---|---|
one | 0.0 | 1.0 | 2.0 |
two | 3.0 | 4.0 | 5.0 |
three | 6.0 | 7.0 | 8.0 |

    frame1_b

| | a | e | c |
---|---|---|---|
first | 0.0 | 1.0 | 2.0 |
one | 3.0 | 4.0 | 5.0 |
two | 6.0 | 7.0 | 8.0 |
second | 9.0 | 10.0 | 11.0 |

    frame1_a + frame1_b

| | a | b | c | e |
---|---|---|---|---|
first | NaN | NaN | NaN | NaN |
one | 3.0 | NaN | 7.0 | NaN |
second | NaN | NaN | NaN | NaN |
three | NaN | NaN | NaN | NaN |
two | 9.0 | NaN | 13.0 | NaN |
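
    # Compute the size of the largest family aboard the Titanic
    # (SibSp + Parch counts a passenger's relatives aboard, not the passenger)
    max(text['SibSp'] + text['Parch'])

    10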
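One very useful pandas method is describe, which generates descriptive statistics for a DataFrame or a Series. It gives a quick overview of every column (for a DataFrame) or of the whole sequence (for a Series), helping you understand the distribution and characteristics of the data.

When you call describe() on a pandas DataFrame or Series, it returns the following statistics (for numeric data):

- count: the number of non-null elements.
- mean: the average.
- std: the standard deviation, which describes how spread out the data is.
- min: the minimum.
- 25%: the first quartile, i.e. the lower quartile.
- 50%: the median, i.e. the second quartile.
- 75%: the third quartile, i.e. the upper quartile.
- max: the maximum.

These statistics give a data analyst or scientist a first impression of the dataset. They help to spot outliers, to understand the distribution and range of the data, and to decide which preprocessing steps may be needed.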
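
A small sketch contrasting the numeric summary with what describe returns for a non-numeric column. Both Series (`fares` and `tickets`) are made up for illustration.

```python
import pandas as pd

# Hypothetical numeric and string columns
fares = pd.Series([10.0, 20.0, 30.0, 40.0])
tickets = pd.Series(["A1", "A1", "B2", None])

numeric_summary = fares.describe()
object_summary = tickets.describe()

print(numeric_summary["count"])    # 4.0
print(numeric_summary["mean"])     # 25.0
print(numeric_summary["50%"])      # 25.0

# For string data the summary has different entries, and missing values are not counted
print(list(object_summary.index))  # ['count', 'unique', 'top', 'freq']
print(object_summary["count"])     # 3
print(object_summary["top"])       # A1
```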
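
    # Ticket is an object (string) column, so describe() reports
    # count, unique, top and freq instead of the numeric statistics
    text['Ticket'].describe()

    count          891
    unique         681
    top       CA. 2343
    freq             7
    Name: Ticket, dtype: object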