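Data Analysis, Chapter 1, task1: pandas basics

1 Chapter 1: Loading Data and Initial Exploration

1.4 Know what your data is called

We are learning basic pandas operations. So what data type does the data have once it has been loaded with pandas, as in the previous section?

Before starting, import numpy and pandas.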

import numpy as np
import pandas as pd
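1.4.1 Task 1: pandas has two main data types, DataFrame and Series. Look them up to get a basic understanding, then write a small example of your own for each [open-ended]
# Write your code here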
data1={'tom':2500,'jack':4500,'mary':5600,'me':15000}
ep1= pd.Series(data1)
print(ep1)
print(ep1.values)
print(ep1.index)
ep1[1:3]
print(ep1['tom'])
tom      2500
jack     4500
mary     5600
me      15000
dtype: int64
[ 2500  4500  5600 15000]
Index(['tom', 'jack', 'mary', 'me'], dtype='object')
2500
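The Series example above mixes several ways of reaching into a Series (`ep1[1:3]`, `ep1['tom']`). As a rough sketch of how these differ, the block below rebuilds the same salary data under the hypothetical name `salaries` and uses the explicit `iloc` and `loc` accessors, which make it clear whether a position or a label is meant.

```python
import pandas as pd

# Same salary data as in the Series example above (variable name is illustrative).
salaries = pd.Series({'tom': 2500, 'jack': 4500, 'mary': 5600, 'me': 15000})

# Positional slice: rows at positions 1 and 2 (the end position is excluded).
by_position = salaries.iloc[1:3]

# Label slice: from 'jack' through 'mary' (the end label is included).
by_label = salaries.loc['jack':'mary']

# Boolean filter: keep only the entries above 5000.
high = salaries[salaries > 5000]

print(by_position)
print(by_label)
print(high)
```

Note that the label slice includes its end label while the positional slice excludes its end position, which is why both selections return the same two rows here.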
# DataFrame example
data2={'tom':2500,'jack':4500,'mary':5600,'me':15000,'dsfs':4800,'eng':4890}
money_data=pd.DataFrame({'money':data2})
print(money_data)
money_data.index
      money
dsfs   4800
eng    4890
jack   4500
mary   5600
me    15000
tom    2500

Index(['dsfs', 'eng', 'jack', 'mary', 'me', 'tom'], dtype='object')
'''
# Our example
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
example_1 = pd.Series(sdata)
example_1
'''
'''
# Our example
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
example_2 = pd.DataFrame(data)
example_2
'''
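To connect the two data types, here is a minimal sketch of how a Series and a DataFrame relate: each column of a DataFrame is a Series, and all columns share one row index. The names `salaries` and `staff` and the age values are made up for illustration.

```python
import pandas as pd

# Hypothetical salary data, reusing names from the Series example.
salaries = pd.Series({'tom': 2500, 'jack': 4500, 'mary': 5600})

# A DataFrame built from two Series that share the same index.
staff = pd.DataFrame({
    'salary': salaries,
    'age': pd.Series({'tom': 31, 'jack': 28, 'mary': 40}),  # made-up ages
})

# Selecting one column with a string gives back a Series ...
salary_col = staff['salary']
print(type(salary_col).__name__)

# ... while selecting with a list of column names keeps a DataFrame.
salary_frame = staff[['salary']]
print(type(salary_frame).__name__)

# The column shares the row index of the frame it came from.
print(staff.index.equals(salary_col.index))
```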


1.4.2 Task 2: Load the "train.csv" file using the method from the previous lesson
# Write your code here
df = pd.read_csv('train.csv')
df.head(3)
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S

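You can also load the "train_chinese.csv" file saved in the previous lesson. Once the translated train_chinese.csv has made you familiar with the dataset, we carry out the operations on train.csv.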
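After loading a file it usually helps to check its size and column types before doing anything else. The sketch below assumes nothing about your local files: it reads a three-row, in-memory stand-in (the hypothetical `csv_text`, with only a few of the train.csv columns) through `io.StringIO`, which `pd.read_csv` accepts just like a file path.

```python
import io
import pandas as pd

# In-memory stand-in for train.csv: three rows, a subset of its columns.
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Cabin
1,0,3,male,22.0,
2,1,1,female,38.0,C85
3,1,3,female,26.0,
"""

# pd.read_csv accepts a path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)         # (number of rows, number of columns)
print(sample.dtypes)        # the type pandas inferred for each column
print(sample.isna().sum())  # how many missing values each column has
```

Empty fields in the CSV are read as NaN, which is why the Cabin column reports missing values.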

1.4.3 Task 3: View the column names of the DataFrame
# Write your code here
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
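`df.columns` returns an Index object. A few related ways of inspecting column names are sketched below on a small stand-in frame (`demo` is hypothetical; its column names are borrowed from train.csv).

```python
import pandas as pd

# Small stand-in frame; only the column names come from train.csv.
demo = pd.DataFrame({'PassengerId': [1, 2], 'Survived': [0, 1], 'Cabin': [None, 'C85']})

print(demo.columns.tolist())    # column names as a plain Python list
print(list(demo))               # iterating a DataFrame yields its column names
print('Cabin' in demo.columns)  # membership test on the Index
print(demo.dtypes)              # each column name together with its dtype
```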
1.4.4 Task 4: View all items in the "Cabin" column [there are several ways]
# Write your code here
df['Cabin'].head()
0     NaN
1     C85
2     NaN
3    C123
4     NaN
Name: Cabin, dtype: object
# Write your code here
df.Cabin.head()
0     NaN
1     C85
2     NaN
3    C123
4     NaN
Name: Cabin, dtype: object
# To see only the distinct values, use dataframe['xxx'].unique()
df['Cabin'].unique()
array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
       'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
       'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
       'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
       'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
       'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
       'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
       'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
       'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
       'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
       'C62 C64', 'E24', 'C90', 'C45', 'E8', 'B101', 'D45', 'C46', 'D30',
       'E121', 'D11', 'E77', 'F38', 'B3', 'D6', 'B82 B84', 'D17', 'A36',
       'B102', 'B69', 'E49', 'C47', 'D28', 'E17', 'A24', 'C50', 'B42',
       'C148'], dtype=object)
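Besides `unique()`, pandas offers a couple of companions that are often more convenient for a column with many missing values such as Cabin. The following sketch uses a short illustrative Series named `cabin` rather than the real column.

```python
import numpy as np
import pandas as pd

# Illustrative values only; the real Cabin column is much longer.
cabin = pd.Series(['C85', np.nan, 'C123', np.nan, 'C85'], name='Cabin')

print(cabin.unique())                    # distinct values, NaN included
print(cabin.nunique())                   # number of distinct values, NaN excluded by default
print(cabin.value_counts(dropna=False))  # frequency of each value, NaN included
print(cabin.dropna().unique())           # distinct non-missing values only
```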
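1.4.5 Task 5: Load the file "test_1.csv", compare it with "train.csv" to see which columns are extra, then delete the extra columns

On inspection, the test set test_1.csv has a superfluous column, a, which we need to delete.

# Write your code here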
test_1=pd.read_csv('test_1.csv')
test_1.head(3)
   Unnamed: 0  PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  a
0  0           1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S         100
1  1           2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C         100
2  2           3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         100
# Write your code here
del test_1['a']
test_1.head(3)
   Unnamed: 0  PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  0           1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  1           2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  2           3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S

[Think] Are there other ways to delete the extra column?

# Answer
# Method 1
# https://www.cnblogs.com/datasnail/p/9767158.html
test_1=pd.read_csv('test_1.csv')
test_1.head(3)
test_1.drop(['a'],axis=1,inplace=True)
test_1.head(3)

   Unnamed: 0  PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  0           1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  1           2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  2           3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
# Method 2
test_1=pd.read_csv('test_1.csv')
test_1.head(3)
test_1.drop(columns=['a'],inplace=True)
test_1.head(3)
   Unnamed: 0  PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  0           1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  1           2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  2           3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
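The task asks to compare the two files to find the extra columns, which can also be done in code rather than by eye. The sketch below uses two tiny stand-in frames (`train` and `test` are hypothetical and hold made-up rows) and shows one way to compute the difference plus three ways to act on it. An "Unnamed: 0" column typically appears when a CSV was saved together with its index, so it is treated here as extra as well.

```python
import pandas as pd

# Stand-ins for train.csv and test_1.csv (illustrative values only).
train = pd.DataFrame({'PassengerId': [1, 2], 'Survived': [0, 1]})
test = pd.DataFrame({'Unnamed: 0': [0, 1], 'PassengerId': [1, 2],
                     'Survived': [0, 1], 'a': [100, 100]})

# Columns present in test but not in train.
extra = test.columns.difference(train.columns)
print(list(extra))

# Option 1: drop them by name; drop() returns a new DataFrame.
cleaned = test.drop(columns=extra)

# Option 2: keep only the columns that train has, in train's order.
cleaned_2 = test[train.columns]

# Option 3: pop() removes a single column in place and returns it as a Series.
popped = test.pop('a')
print(list(test.columns))
```

Options 1 and 2 leave `test` untouched, while `pop()` (like `del test['a']`) modifies it directly.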
1.4.6 Task 6: Hide the columns ['PassengerId', 'Name', 'Age', 'Ticket'] and look only at the remaining columns
# Write your code here
df.drop(['PassengerId','Name','Age','Ticket'],axis=1).head(3)
   Survived  Pclass  Sex     SibSp  Parch  Fare     Cabin  Embarked
0  0         3       male    1      0      7.2500   NaN    S
1  1         1       female  1      0      71.2833  C85    C
2  1         3       female  0      0      7.9250   NaN    S

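[Think] Compare task 5 and task 6: did they use different methods (functions)? If the same function were used, how could it meet these two different requirements?

[Answer]

Both tasks can be done with drop(). To remove columns from the DataFrame permanently, pass inplace=True, which overwrites the original data. Task 6 only hides the columns for viewing, so inplace is not used there and df itself stays unchanged.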
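The difference between the two behaviours of `drop()` can be checked on a small frame. This is a minimal sketch with made-up data (`people` is a hypothetical name), not the Titanic dataset.

```python
import pandas as pd

# Hypothetical data for illustration.
people = pd.DataFrame({'Name': ['A', 'B'], 'Age': [22, 38], 'Sex': ['male', 'female']})

# Without inplace, drop() returns a new frame and leaves the original untouched.
view_only = people.drop(columns=['Name', 'Age'])
print(list(view_only.columns))
print(list(people.columns))

# With inplace=True, the original is modified and the call returns None.
result = people.drop(columns=['Name'], inplace=True)
print(result)
print(list(people.columns))

# Reassigning the returned frame is an equivalent way to make the change permanent.
people = people.drop(columns=['Age'])
print(list(people.columns))
```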

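1.5 The logic of filtering

One of the most important capabilities for tabular data is filtering: selecting the information we need and discarding what is useless.

Below we again learn this pandas feature through practice.

1.5.1 Task 1: Using "Age" as the filter condition, display the information of passengers under 10 years old.
# Write your code here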
df[df["Age"]<10].head()
    PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
7   8            0         3       Palsson, Master. Gosta Leonard                     male    2.0   3      1      349909            21.0750  NaN    S
10  11           1         3       Sandstrom, Miss. Marguerite Rut                    female  4.0   1      1      PP 9549           16.7000  G6     S
16  17           0         3       Rice, Master. Eugene                               male    2.0   4      1      382652            29.1250  NaN    Q
24  25           0         3       Palsson, Miss. Torborg Danira                      female  8.0   3      1      349909            21.0750  NaN    S
43  44           1         2       Laroche, Miss. Simonne Marie Anne Andree           female  3.0   1      2      SC/Paris 2123     41.5792  NaN    C
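1.5.2 Task 2: Using "Age" as the condition, display the passengers older than 10 and younger than 50, and name this data midage
# Write your code here
# Option 1:
# midage=df[(df["Age"]>10) &(df["Age"]<50)]
# midage.head()
# Option 2: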
midage=df[(df.Age>10) &(df.Age<50)]
midage.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

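[Hint] Learn how conditional filtering works in pandas and how to use the intersection (&) and union (|) operations

dage=df[df.Age<10]
dage.head()
# With multiple conditions, each condition must be wrapped in (), because & and | bind more tightly than comparison operators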
    PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
7   8            0         3       Palsson, Master. Gosta Leonard                     male    2.0   3      1      349909            21.0750  NaN    S
10  11           1         3       Sandstrom, Miss. Marguerite Rut                    female  4.0   1      1      PP 9549           16.7000  G6     S
16  17           0         3       Rice, Master. Eugene                               male    2.0   4      1      382652            29.1250  NaN    Q
24  25           0         3       Palsson, Miss. Torborg Danira                      female  8.0   3      1      349909            21.0750  NaN    S
43  44           1         2       Laroche, Miss. Simonne Marie Anne Andree           female  3.0   1      2      SC/Paris 2123     41.5792  NaN    C
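As a sketch of the intersection and union operations the hint refers to, the block below filters a small made-up `ages` frame (one value is missing on purpose) with `&`, `|` and `~`, and repeats the AND filter with `DataFrame.query` as an alternative spelling.

```python
import numpy as np
import pandas as pd

# Illustrative ages, including one missing value.
ages = pd.DataFrame({'Age': [2.0, 22.0, 38.0, np.nan, 54.0, 10.0]})

# Intersection (AND): older than 10 and younger than 50.
both = ages[(ages['Age'] > 10) & (ages['Age'] < 50)]

# Union (OR): younger than 10 or older than 50.
either = ages[(ages['Age'] < 10) | (ages['Age'] > 50)]

# Negation (NOT): rows that are not under 10.
# A comparison with NaN is False, so the negated mask keeps the NaN row.
not_child = ages[~(ages['Age'] < 10)]

# The same AND filter written as a query string.
queried = ages.query('Age > 10 and Age < 50')

print(both.index.tolist())
print(either.index.tolist())
print(not_child.index.tolist())
```

Rows with a missing Age never satisfy a comparison, so they silently drop out of `both` and `either`; that is worth keeping in mind when counting filtered passengers.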
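1.5.3 Task 3: Display the "Pclass" and "Sex" data of row 100 of midage
# Write your code here
# https://www.cnblogs.com/nxf-rabbit75/p/10105271.html
# Method 1:
midage.ix[100,'Pclass','Sex']
# This no longer works: .ix is deprecated (it was removed in pandas 1.0), and the two column names would have to be passed as one list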
C:\Users\13153\Anaconda3\lib\site-packages\ipykernel_launcher.py:4: FutureWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#ix-indexer-is-deprecated
  after removing the cwd from sys.path.



---------------------------------------------------------------------------

IndexError                                Traceback (most recent call last)

<ipython-input-47-3cc60dd8e5a7> in <module>
      2 # https://www.cnblogs.com/nxf-rabbit75/p/10105271.html
      3 # Method 1:
----> 4 midage.ix[100,'Pclass','Sex']


~\Anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
    123             key = tuple(com.apply_if_callable(x, self.obj) for x in key)
    124             try:
--> 125                 values = self.obj._get_value(*key)
    126             except (KeyError, TypeError, InvalidIndexError, AttributeError):
    127                 # TypeError occurs here if the key has non-hashable entries,


~\Anaconda3\lib\site-packages\pandas\core\frame.py in _get_value(self, index, col, takeable)
   2823 
   2824         if takeable:
-> 2825             series = self._iget_item_cache(col)
   2826             return com.maybe_box_datetimelike(series._values[index])
   2827 


~\Anaconda3\lib\site-packages\pandas\core\generic.py in _iget_item_cache(self, item)
   3292         ax = self._info_axis
   3293         if ax.is_unique:
-> 3294             lower = self._get_item_cache(ax[item])
   3295         else:
   3296             lower = self.take(item, axis=self._info_axis_number)


~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in __getitem__(self, key)
   4278         if is_scalar(key):
   4279             key = com.cast_scalar_indexer(key)
-> 4280             return getitem(key)
   4281 
   4282         if isinstance(key, slice):


IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
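# Without reset_index() this is wrong:
# loc looks up the index label 100, which is the 101st row of the original df, not the row at position 100 of midage
midage.loc[[100],['Pclass','Sex']]
     Pclass  Sex
100  3       female
# Correct approach:
midage = midage.reset_index(drop=True)
midage.head(3)
midage.loc[[100],['Pclass','Sex']]
     Pclass  Sex
100  2       male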

[Think] What does the reset_index() function do? If it were not used, what would happen in the tasks below?
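One way to explore this question is on a frame small enough to inspect by eye. The sketch below uses made-up data (`df_demo` and `adults` are hypothetical names) to show that filtering keeps the original index labels, so label-based `loc` and position-based `iloc` disagree until the index is reset.

```python
import pandas as pd

# Hypothetical frame with the default 0..5 index.
df_demo = pd.DataFrame({'Age': [22, 5, 38, 8, 35, 27],
                        'Sex': ['male', 'female', 'female', 'male', 'male', 'female']})

# Filtering keeps the labels of the surviving rows: 0, 2, 4, 5.
adults = df_demo[df_demo['Age'] > 10]
print(adults.index.tolist())

# loc looks up by label: label 2 is the second remaining row.
print(adults.loc[2, 'Sex'])

# iloc looks up by position: position 2 is the third remaining row.
print(adults.iloc[2]['Sex'])

# After reset_index(drop=True), labels equal positions again.
adults_reset = adults.reset_index(drop=True)
print(adults_reset.index.tolist())
print(adults_reset.loc[2, 'Sex'])
```

Without the reset, asking `loc` for a label that was filtered out (label 1 in this sketch) raises a KeyError, and asking for a surviving label returns a different row than the one at that position.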

1.5.4 Task 4: Display the "Pclass", "Name" and "Sex" data of rows 100, 105 and 108 of midage
# Write your code here
midage.loc[[100,105,108],['Pclass','Name','Sex']] 
     Pclass  Name                               Sex
100  2       Byles, Rev. Thomas Roussel Davids  male
105  3       Cribb, Mr. John Hatfield           male
108  3       Calic, Mr. Jovo                    male

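[Hint] Use the simple approach pandas provides; take a look at the loc method

Compare against the positions in the full data: do you notice a problem? How would you solve it?

# In Jupyter Notebook, evaluating a DataFrame directly renders it as a bordered HTML table
# but that way a single cell cannot show two tables; only the last one is displayed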
from IPython.display import display
display(midage)
display(midage.loc[[100,105,108],['Pclass','Name','Sex']] )
     PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0    1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1    2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2    3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3    4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4    5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
...  ...          ...       ...     ...                                                ...     ...   ...    ...    ...               ...      ...    ...
571  886          0         3       Rice, Mrs. William (Margaret Norton)               female  39.0  0      5      382652            29.1250  NaN    Q
572  887          0         2       Montvila, Rev. Juozas                              male    27.0  0      0      211536            13.0000  NaN    S
573  888          1         1       Graham, Miss. Margaret Edith                       female  19.0  0      0      112053            30.0000  B42    S
574  890          1         1       Behr, Mr. Karl Howell                              male    26.0  0      0      111369            30.0000  C148   C
575  891          0         3       Dooley, Mr. Patrick                                male    32.0  0      0      370376            7.7500   NaN    Q

576 rows × 12 columns

     Pclass  Name                               Sex
100  2       Byles, Rev. Thomas Roussel Davids  male
105  3       Cribb, Mr. John Hatfield           male
108  3       Calic, Mr. Jovo                    male
1.5.5 Task 5: Use the iloc method to display the "Pclass", "Name" and "Sex" data of rows 100, 105 and 108 of midage
# Write your code here
# Method 1:
midage.iloc[[100,105,108],[2,3,4]]
     Pclass  Name                               Sex
100  2       Byles, Rev. Thomas Roussel Davids  male
105  3       Cribb, Mr. John Hatfield           male
108  3       Calic, Mr. Jovo                    male
# Method 2:
midage.loc[[100,105,108],["Pclass","Name","Sex"]]
     Pclass  Name                               Sex
100  2       Byles, Rev. Thomas Roussel Davids  male
105  3       Cribb, Mr. John Hatfield           male
108  3       Calic, Mr. Jovo                    male
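Hard-coding the column positions `[2,3,4]` for `iloc` breaks as soon as the column order changes. One possible alternative, sketched below on a hypothetical `demo` frame with made-up rows, is to look the positions up from the column names with `Index.get_indexer` and pass the result to `iloc`.

```python
import pandas as pd

# Hypothetical frame; only the column names come from train.csv.
demo = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Survived': [0, 1, 1, 1],
    'Pclass': [3, 1, 3, 1],
    'Name': ['A', 'B', 'C', 'D'],
    'Sex': ['male', 'female', 'female', 'female'],
})

# Translate column names into integer positions.
col_pos = demo.columns.get_indexer(['Pclass', 'Name', 'Sex'])
print(col_pos)

# Position-based selection using the looked-up positions.
by_position = demo.iloc[[0, 2, 3], col_pos]

# Label-based selection of the same cells.
by_label = demo.loc[[0, 2, 3], ['Pclass', 'Name', 'Sex']]

# With a default 0..n-1 index, both selections return the same data.
print(by_position.equals(by_label))
```

The two selections agree only because row labels equal row positions here, which is exactly the state that `reset_index(drop=True)` restores after filtering.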
