Datawhale Team Study: Hands-On Data Analysis, Chapter 1

1.1 Loading data

Task 1: Import numpy and pandas

import numpy as np
import pandas as pd
import os

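Task 2: Load the data

(1) Load using a relative path
cwd = os.getcwd()  # remember the original working directory
os.chdir(r"D:\datasets\Titanic")  # raw string, so the backslashes are not read as escape sequences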
df = pd.read_csv('train.csv')
df.head()
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000   C123         S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500    NaN         S
(2) Load the data using an absolute path
df = pd.read_csv('D:\\datasets\\Titanic\\train.csv')
df.head()
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000   C123         S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500    NaN         S
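
The two loading styles differ only in how the path string is written, and Windows paths are easy to get wrong in Python string literals. The sketch below is only an illustration: it reuses the directory from this post, adds a hypothetical `D:\train.csv` to show the pitfall, and never touches the disk.

```python
from pathlib import PureWindowsPath

# In a normal string literal "\t" is a tab character, so this path is silently corrupted.
broken = "D:\train.csv"

# A raw string, or doubled backslashes, keeps every backslash literal.
raw = r"D:\datasets\Titanic\train.csv"
escaped = "D:\\datasets\\Titanic\\train.csv"

# pathlib joins the parts without hand-written separators.
joined = str(PureWindowsPath(r"D:\datasets\Titanic") / "train.csv")

print("\t" in broken)             # True: the file name now starts with a tab
print(raw == escaped == joined)   # True: all three spell the same path
```

Any of the three correct spellings can be passed to pd.read_csv.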

Task 3: Read the file block by block, 1000 rows per block

chunker = pd.read_csv('train.csv', chunksize=1000)
chunker
<pandas.io.parsers.TextFileReader at 0x1f6383329a0>
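
The reader object shown above holds no data by itself; the blocks only appear when it is iterated. A minimal sketch, assuming a made-up in-memory CSV (`csv_text`, with hypothetical columns `id` and `value`) in place of train.csv:

```python
import io
import pandas as pd

# Hypothetical stand-in for a large CSV file: 2500 data rows, two columns.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(2500))

chunk_sizes = []
total = 0
# With chunksize set, read_csv yields one DataFrame per block of rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=1000):
    chunk_sizes.append(len(chunk))    # number of rows in this block
    total += chunk["value"].sum()     # aggregate without loading the whole file

print(chunk_sizes)  # [1000, 1000, 500]
print(total)        # 6247500
```

Since train.csv has only 891 rows, a chunksize of 1000 would yield a single block there.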

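Task 4: Change the header to Chinese and set the index to the passenger ID

Column glossary: 乘客ID = PassengerId, 是否幸存 = Survived, 仓位等级 = Pclass, 姓名 = Name, 性别 = Sex, 年龄 = Age, 兄弟姐妹个数 = SibSp, 父母子女个数 = Parch, 船票信息 = Ticket, 票价 = Fare, 客舱 = Cabin, 登船港口 = Embarked.

# header=0 discards the original English header row; names supplies the replacements
df = pd.read_csv('train.csv',
    names=['乘客ID','是否幸存','仓位等级','姓名','性别','年龄','兄弟姐妹个数','父母子女个数','船票信息','票价','客舱','登船港口'],
    index_col='乘客ID', header=0)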
df.head()
      是否幸存  仓位等级                                                 姓名      性别    年龄  兄弟姐妹个数  父母子女个数              船票信息       票价    客舱 登船港口
乘客ID
1        0     3                            Braund, Mr. Owen Harris    male  22.0       1       0         A/5 21171   7.2500   NaN    S
2        1     1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0       1       0          PC 17599  71.2833   C85    C
3        1     3                             Heikkinen, Miss. Laina  female  26.0       0       0  STON/O2. 3101282   7.9250   NaN    S
4        1     1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0       1       0            113803  53.1000  C123    S
5        0     3                           Allen, Mr. William Henry    male  35.0       0       0            373450   8.0500   NaN    S

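1.2 First look

Task 1: View the basic information about the data

df.info  # note: without parentheses this only displays the bound method, whose repr prints the whole frame; call df.info() to get the per-column dtypes and non-null counts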
<bound method DataFrame.info of       是否幸存  仓位等级                                                 姓名      性别  \
乘客ID                                                                          
1        0     3                            Braund, Mr. Owen Harris    male   
2        1     1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female   
3        1     3                             Heikkinen, Miss. Laina  female   
4        1     1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female   
5        0     3                           Allen, Mr. William Henry    male   
...    ...   ...                                                ...     ...   
887      0     2                              Montvila, Rev. Juozas    male   
888      1     1                       Graham, Miss. Margaret Edith  female   
889      0     3           Johnston, Miss. Catherine Helen "Carrie"  female   
890      1     1                              Behr, Mr. Karl Howell    male   
891      0     3                                Dooley, Mr. Patrick    male   

        年龄  兄弟姐妹个数  父母子女个数              船票信息       票价    客舱 登船港口  
乘客ID                                                              
1     22.0       1       0         A/5 21171   7.2500   NaN    S  
2     38.0       1       0          PC 17599  71.2833   C85    C  
3     26.0       0       0  STON/O2. 3101282   7.9250   NaN    S  
4     35.0       1       0            113803  53.1000  C123    S  
5     35.0       0       0            373450   8.0500   NaN    S  
...    ...     ...     ...               ...      ...   ...  ...  
887   27.0       0       0            211536  13.0000   NaN    S  
888   19.0       0       0            112053  30.0000   B42    S  
889    NaN       1       2        W./C. 6607  23.4500   NaN    S  
890   26.0       0       0            111369  30.0000  C148    C  
891   32.0       0       0            370376   7.7500   NaN    Q  

[891 rows x 11 columns]>

Task 2: Look at the first 10 rows and the last 15 rows

df.head(10)
      是否幸存  仓位等级                                                 姓名      性别    年龄  兄弟姐妹个数  父母子女个数              船票信息       票价    客舱 登船港口
乘客ID
1        0     3                            Braund, Mr. Owen Harris    male  22.0       1       0         A/5 21171   7.2500   NaN    S
2        1     1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0       1       0          PC 17599  71.2833   C85    C
3        1     3                             Heikkinen, Miss. Laina  female  26.0       0       0  STON/O2. 3101282   7.9250   NaN    S
4        1     1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0       1       0            113803  53.1000  C123    S
5        0     3                           Allen, Mr. William Henry    male  35.0       0       0            373450   8.0500   NaN    S
6        0     3                                   Moran, Mr. James    male   NaN       0       0            330877   8.4583   NaN    Q
7        0     1                            McCarthy, Mr. Timothy J    male  54.0       0       0             17463  51.8625   E46    S
8        0     3                     Palsson, Master. Gosta Leonard    male   2.0       3       1            349909  21.0750   NaN    S
9        1     3  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0       0       2            347742  11.1333   NaN    S
10       1     2                Nasser, Mrs. Nicholas (Adele Achem)  female  14.0       1       0            237736  30.0708   NaN    C
df.tail(15)
      是否幸存  仓位等级                                                 姓名      性别    年龄  兄弟姐妹个数  父母子女个数              船票信息       票价    客舱 登船港口
乘客ID
877      0     3                      Gustafsson, Mr. Alfred Ossian    male  20.0       0       0              7534   9.8458   NaN    S
878      0     3                               Petroff, Mr. Nedelio    male  19.0       0       0            349212   7.8958   NaN    S
879      0     3                                 Laleff, Mr. Kristo    male   NaN       0       0            349217   7.8958   NaN    S
880      1     1      Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)  female  56.0       0       1             11767  83.1583   C50    C
881      1     2       Shelley, Mrs. William (Imanita Parrish Hall)  female  25.0       0       1            230433  26.0000   NaN    S
882      0     3                                 Markun, Mr. Johann    male  33.0       0       0            349257   7.8958   NaN    S
883      0     3                       Dahlberg, Miss. Gerda Ulrika  female  22.0       0       0              7552  10.5167   NaN    S
884      0     2                      Banfield, Mr. Frederick James    male  28.0       0       0  C.A./SOTON 34068  10.5000   NaN    S
885      0     3                             Sutehall, Mr. Henry Jr    male  25.0       0       0   SOTON/OQ 392076   7.0500   NaN    S
886      0     3               Rice, Mrs. William (Margaret Norton)  female  39.0       0       5            382652  29.1250   NaN    Q
887      0     2                              Montvila, Rev. Juozas    male  27.0       0       0            211536  13.0000   NaN    S
888      1     1                       Graham, Miss. Margaret Edith  female  19.0       0       0            112053  30.0000   B42    S
889      0     3           Johnston, Miss. Catherine Helen "Carrie"  female   NaN       1       2        W./C. 6607  23.4500   NaN    S
890      1     1                              Behr, Mr. Karl Howell    male  26.0       0       0            111369  30.0000  C148    C
891      0     3                                Dooley, Mr. Patrick    male  32.0       0       0            370376   7.7500   NaN    Q

Task 3: Check whether the data is missing: return True where a value is missing and False everywhere else

df.isnull()
       是否幸存   仓位等级     姓名     性别     年龄  兄弟姐妹个数  父母子女个数   船票信息     票价     客舱   登船港口
乘客ID
1      False  False  False  False  False   False   False  False  False   True  False
2      False  False  False  False  False   False   False  False  False  False  False
3      False  False  False  False  False   False   False  False  False   True  False
4      False  False  False  False  False   False   False  False  False  False  False
5      False  False  False  False  False   False   False  False  False   True  False
...      ...    ...    ...    ...    ...     ...     ...    ...    ...    ...    ...
887    False  False  False  False  False   False   False  False  False   True  False
888    False  False  False  False  False   False   False  False  False  False  False
889    False  False  False  False   True   False   False  False  False   True  False
890    False  False  False  False  False   False   False  False  False  False  False
891    False  False  False  False  False   False   False  False  False   True  False

891 rows × 11 columns
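
A frame full of True/False values is hard to read at a glance, so the mask is usually summed. The sketch below assumes a small invented frame (`toy`) rather than the Titanic data; only `isnull`, `sum`, and `any` are used.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame with some missing entries.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 8.05],
})

# True counts as 1, so summing the mask counts missing values per column.
missing_per_column = toy.isnull().sum()
# Number of rows that contain at least one missing value.
rows_with_missing = toy.isnull().any(axis=1).sum()

print(missing_per_column.to_dict())  # {'Age': 2, 'Cabin': 3, 'Fare': 0}
print(rows_with_missing)             # 4
```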

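1.3 Saving data

Task 1: Save the data you loaded and modified as a new file in the working directory. The task names the file train_chinese.csv; the code below writes train.chinese.csv instead, and that is the name the later sections read back.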

df.to_csv('train.chinese.csv')
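
to_csv writes the index as the first column by default, which is why 乘客ID comes back as an ordinary column when the file is reloaded later. A small round-trip sketch, assuming an invented two-row frame (`toy`) and an in-memory buffer instead of a real file:

```python
import io
import pandas as pd

# Hypothetical frame indexed by passenger ID, mirroring the renamed Titanic frame.
toy = pd.DataFrame(
    {"是否幸存": [0, 1], "姓名": ["A", "B"]},
    index=pd.Index([1, 2], name="乘客ID"),
)

buf = io.StringIO()
toy.to_csv(buf)  # the index is written as the first column, headed 乘客ID

# Reading back without index_col turns the old index into a regular column.
plain = pd.read_csv(io.StringIO(buf.getvalue()))
# Passing index_col restores it as the index.
restored = pd.read_csv(io.StringIO(buf.getvalue()), index_col="乘客ID")

print(list(plain.columns))     # ['乘客ID', '是否幸存', '姓名']
print(list(restored.columns))  # ['是否幸存', '姓名']
```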

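2.1 Know what your data is called

Task 1: pandas has two main data types, DataFrame and Series. Look them up to get a basic understanding, then write a small example of each yourself.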

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
example_1 = pd.Series(sdata)
example_1
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002, 2003],'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
example_2 = pd.DataFrame(data)
example_2
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2

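Task 2: Load the "train.csv" file using the method from the previous section (the Chinese-header copy saved in 1.3 is loaded here)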

df=pd.read_csv('train.chinese.csv')
df.head()
   乘客ID  是否幸存  仓位等级                                                 姓名      性别    年龄  兄弟姐妹个数  父母子女个数              船票信息       票价    客舱 登船港口
0     1     0     3                            Braund, Mr. Owen Harris    male  22.0       1       0         A/5 21171   7.2500   NaN    S
1     2     1     1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0       1       0          PC 17599  71.2833   C85    C
2     3     1     3                             Heikkinen, Miss. Laina  female  26.0       0       0  STON/O2. 3101282   7.9250   NaN    S
3     4     1     1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0       1       0            113803  53.1000  C123    S
4     5     0     3                           Allen, Mr. William Henry    male  35.0       0       0            373450   8.0500   NaN    S

Task 3: View the column names of the DataFrame

df.columns
Index(['乘客ID', '是否幸存', '仓位等级', '姓名', '性别', '年龄', '兄弟姐妹个数', '父母子女个数', '船票信息',
       '票价', '客舱', '登船港口'],
      dtype='object')

Task 4: View all the values in the "cabin" column

dir(df['客舱'])  # note: dir() returns the attribute and method names of the Series object, not the values stored in the column
['T',
 '_AXIS_ALIASES',
 ...
 'where',
 'xs']
(The full listing of several hundred attribute names, from 'T' through 'xs', is omitted here. Indexing the column directly shows its values.)
df['客舱'].head()
0     NaN
1     C85
2     NaN
3    C123
4     NaN
Name: 客舱, dtype: object
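
Once a column has been selected as a Series, its contents can be summarised in a few standard ways. The sketch below assumes an invented five-row `Cabin` column; the Titanic values are not reproduced.

```python
import numpy as np
import pandas as pd

# Hypothetical cabin column with repeats and missing entries.
toy = pd.DataFrame({"Cabin": [np.nan, "C85", np.nan, "C123", "C85"]})

cabin = toy["Cabin"]                          # select the column as a Series
distinct = cabin.dropna().unique().tolist()   # distinct non-missing values, in order of appearance
counts = cabin.value_counts()                 # frequency of each value; NaN is excluded by default

print(len(cabin))     # 5
print(distinct)       # ['C85', 'C123']
print(counts["C85"])  # 2
```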

Task 5: Load the file "test_1.csv", compare it with "train.csv" to see which columns are extra, then delete the extra columns

test_1 = pd.read_csv("C:\\Users\\Administrator\\Documents\\DataScience\\hands-on-data-analysis\\第一单元项目集合\\test_1.csv")
test_1
     Unnamed: 0  PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked    a
0             0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S  100
1             1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C  100
2             2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S  100
3             3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000   C123         S  100
4             4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500    NaN         S  100
..          ...          ...       ...     ...                                                ...     ...   ...    ...    ...               ...      ...    ...       ...  ...
886         886          887         0       2                              Montvila, Rev. Juozas    male  27.0      0      0            211536  13.0000    NaN         S  100
887         887          888         1       1                       Graham, Miss. Margaret Edith  female  19.0      0      0            112053  30.0000    B42         S  100
888         888          889         0       3           Johnston, Miss. Catherine Helen "Carrie"  female   NaN      1      2        W./C. 6607  23.4500    NaN         S  100
889         889          890         1       1                              Behr, Mr. Karl Howell    male  26.0      0      0            111369  30.0000   C148         C  100
890         890          891         0       3                                Dooley, Mr. Patrick    male  32.0      0      0            370376   7.7500    NaN         Q  100

891 rows × 14 columns

test_1.drop(['a'],axis=1)  # returns a new DataFrame and leaves test_1 unchanged; 'Unnamed: 0' is also absent from train.csv and can be dropped the same way
     Unnamed: 0  PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
0             0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
1             1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
2             2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
3             3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000   C123         S
4             4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500    NaN         S
..          ...          ...       ...     ...                                                ...     ...   ...    ...    ...               ...      ...    ...       ...
886         886          887         0       2                              Montvila, Rev. Juozas    male  27.0      0      0            211536  13.0000    NaN         S
887         887          888         1       1                       Graham, Miss. Margaret Edith  female  19.0      0      0            112053  30.0000    B42         S
888         888          889         0       3           Johnston, Miss. Catherine Helen "Carrie"  female   NaN      1      2        W./C. 6607  23.4500    NaN         S
889         889          890         1       1                              Behr, Mr. Karl Howell    male  26.0      0      0            111369  30.0000   C148         C
890         890          891         0       3                                Dooley, Mr. Patrick    male  32.0      0      0            370376   7.7500    NaN         Q

891 rows × 13 columns
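
Rather than spotting the extra columns by eye, they can be computed from the two sets of column names. The following is a sketch under the assumption of two tiny invented frames (`train`, `test_1`) that mimic the situation above; it is not the author's solution.

```python
import pandas as pd

# Hypothetical miniature versions of train.csv and test_1.csv.
train = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})
test_1 = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "PassengerId": [1, 2],
    "Survived": [0, 1],
    "a": [100, 100],
})

# Columns present in test_1 but not in train (returned as a sorted Index).
extra = test_1.columns.difference(train.columns)
# drop returns a new frame; assign the result to keep it.
cleaned = test_1.drop(columns=extra)

print(list(extra))            # ['Unnamed: 0', 'a']
print(list(cleaned.columns))  # ['PassengerId', 'Survived']
print("a" in test_1.columns)  # True: the original frame is untouched
```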

Task 6: Hide the columns ['PassengerId', 'Name', 'Age', 'Ticket'] and look only at the remaining columns


df=pd.read_csv('train.csv')
df.drop(['PassengerId','Name','Age','Ticket'],axis=1)
     Survived  Pclass     Sex  SibSp  Parch     Fare  Cabin  Embarked
0           0       3    male      1      0   7.2500    NaN         S
1           1       1  female      1      0  71.2833    C85         C
2           1       3  female      0      0   7.9250    NaN         S
3           1       1  female      1      0  53.1000   C123         S
4           0       3    male      0      0   8.0500    NaN         S
..        ...     ...     ...    ...    ...      ...    ...       ...
886         0       2    male      0      0  13.0000    NaN         S
887         1       1  female      0      0  30.0000    B42         S
888         0       3  female      1      2  23.4500    NaN         S
889         1       1    male      0      0  30.0000   C148         C
890         0       3    male      0      0   7.7500    NaN         Q

891 rows × 8 columns

2.2 Filtering logic

Task 1: Using "Age" as the filter condition, show the passengers younger than 10.

df[df['Age']<10]
     PassengerId  Survived  Pclass                                      Name     Sex   Age  SibSp  Parch           Ticket     Fare  Cabin  Embarked
7              8         0       3            Palsson, Master. Gosta Leonard    male  2.00      3      1           349909  21.0750    NaN         S
10            11         1       3           Sandstrom, Miss. Marguerite Rut  female  4.00      1      1          PP 9549  16.7000     G6         S
16            17         0       3                      Rice, Master. Eugene    male  2.00      4      1           382652  29.1250    NaN         Q
24            25         0       3             Palsson, Miss. Torborg Danira  female  8.00      3      1           349909  21.0750    NaN         S
43            44         1       2  Laroche, Miss. Simonne Marie Anne Andree  female  3.00      1      2    SC/Paris 2123  41.5792    NaN         C
..           ...       ...     ...                                       ...     ...   ...    ...    ...              ...      ...    ...       ...
827          828         1       2                     Mallet, Master. Andre    male  1.00      0      2  S.C./PARIS 2079  37.0042    NaN         C
831          832         1       2           Richards, Master. George Sibley    male  0.83      1      1            29106  18.7500    NaN         S
850          851         0       3   Andersson, Master. Sigvard Harald Elias    male  4.00      4      2           347082  31.2750    NaN         S
852          853         0       3                   Boulos, Miss. Nourelain  female  9.00      1      1             2678  15.2458    NaN         C
869          870         1       3           Johnson, Master. Harold Theodor    male  4.00      1      1           347742  11.1333    NaN         S

62 rows × 12 columns

Task 2: Using "Age" as the condition, show the passengers older than 10 and younger than 50, and name the result midage

midage = df[(df['Age']>10)&(df['Age']<50)]
midage.head()
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000   C123         S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500    NaN         S

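When two conditions are combined with &, each one must be wrapped in parentheses, because & binds more tightly than comparison operators such as > and <.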
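
The precedence rule can be checked on a small Series. This is a sketch with invented ages (`ages`), not the Titanic column; the unparenthesised form is wrapped in try/except because it fails rather than filtering.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including the boundary values and one missing entry.
ages = pd.Series([2.0, 10.0, 22.0, 49.0, 50.0, np.nan], name="Age")

# Each comparison is evaluated first, then the two boolean Series are combined.
mask = (ages > 10) & (ages < 50)
selected = ages[mask].tolist()

# Without parentheses Python parses this as ages > (10 & ages) < 50,
# which raises instead of producing a mask.
try:
    ages > 10 & ages < 50
    failed = False
except (TypeError, ValueError):
    failed = True

print(mask.tolist())  # [False, False, True, True, False, False]
print(selected)       # [22.0, 49.0]
print(failed)         # True
```

Comparisons with NaN evaluate to False, so rows with a missing age never pass either condition.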

Task 3: Show the "Pclass" and "Sex" values of row 100 of midage

print(midage.iloc[100]['Pclass'])
print(midage.iloc[100]['Sex'])
2
male

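Note that midage.loc[[100], ['Pclass', 'Sex']] is not an equivalent spelling: loc selects the row whose index label is 100, whereas iloc[100] selects the row at position 100. Because midage keeps the original index of df after filtering, the two pick different rows; they coincide only after midage = midage.reset_index(drop=True).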

Task 4: Use loc to show the "Pclass", "Name" and "Sex" values of rows 100, 105 and 108 of midage

midage.loc[[100,105,108],['Pclass','Name','Sex']]
     Pclass                     Name     Sex
100       3  Petranec, Miss. Matilda  female
105       3    Mionoff, Mr. Stoytcho    male
108       3          Rekic, Mr. Tido    male

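Task 5: Use iloc to show the "Pclass", "Name" and "Sex" values of rows 100, 105 and 108 of midage

midage.iloc[[100,105,108],[2,3,4]]  # iloc is purely positional, so column names cannot be used directly; midage.columns.get_indexer(['Pclass','Name','Sex']) converts names to positions
     Pclass                               Name   Sex
149       2  Byles, Rev. Thomas Roussel Davids  male
160       3           Cribb, Mr. John Hatfield  male
163       3                    Calic, Mr. Jovo  male

The rows differ from Task 4 because iloc counts positions within midage, while loc matches the original index labels.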
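
The difference between label-based and position-based selection is easiest to see on a frame whose index has gaps, as midage does after filtering. A minimal sketch with an invented four-row frame (`toy`):

```python
import pandas as pd

# Hypothetical frame whose index has gaps, like a frame left over after filtering.
toy = pd.DataFrame(
    {
        "Pclass": [3, 1, 3, 2],
        "Name": ["w", "x", "y", "z"],
        "Sex": ["male", "female", "female", "male"],
    },
    index=[0, 2, 5, 9],
)

by_label = toy.loc[[2], ["Pclass", "Sex"]]   # the row whose index label is 2
by_position = toy.iloc[[2], [0, 2]]          # the third row, whatever its label

# get_indexer turns column names into integer positions for use with iloc.
cols = toy.columns.get_indexer(["Pclass", "Name", "Sex"])
via_names = toy.iloc[[1, 3], cols]

print(by_label.index.tolist())     # [2]
print(by_position.index.tolist())  # [5]
print(cols.tolist())               # [0, 1, 2]
print(via_names.index.tolist())    # [2, 9]
```

Calling reset_index(drop=True) on the filtered frame makes labels and positions coincide, at the cost of losing the original row numbers.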

3.1 Before starting: import numpy, pandas, and the data

text = pd.read_csv('train.chinese.csv')
text.head()
   乘客ID  是否幸存  仓位等级                                                 姓名      性别    年龄  兄弟姐妹个数  父母子女个数              船票信息       票价    客舱 登船港口
0     1     0     3                            Braund, Mr. Owen Harris    male  22.0       1       0         A/5 21171   7.2500   NaN    S
1     2     1     1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0       1       0          PC 17599  71.2833   C85    C
2     3     1     3                             Heikkinen, Miss. Laina  female  26.0       0       0  STON/O2. 3101282   7.9250   NaN    S
3     4     1     1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0       1       0            113803  53.1000  C123    S
4     5     0     3                           Allen, Mr. William Henry    male  35.0       0       0            373450   8.0500   NaN    S

Task 1: Use pandas to sort the sample data in ascending order

frame = pd.DataFrame(np.arange(8).reshape((2, 4)), 
    index=['2', '1'], 
    columns=['d', 'a', 'b', 'c'])
frame.sort_index()
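   d  a  b  c
1  4  5  6  7
2  0  1  2  3
frame.sort_index(axis=1)  # sort by column label in ascending order: a, b, c, d
   a  b  c  d
2  1  2  3  0
1  5  6  7  4
frame.sort_index(axis=1,ascending=False)  # sort by column label in descending order
   d  c  b  a
2  0  3  2  1
1  4  7  6  5
frame.sort_values(by=['a','c'])  # sort rows by the values in column a, then column c
   d  a  b  c
2  0  1  2  3
1  4  5  6  7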

Task 2: Sort the Titanic data (train.csv) by the fare and age columns together, in descending order. What can you find in the data?

df.sort_values(['Age','Fare'],ascending=False)
     PassengerId  Survived  Pclass                                  Name   Sex   Age  SibSp  Parch    Ticket     Fare  Cabin  Embarked
630          631         1       1  Barkworth, Mr. Algernon Henry Wilson  male  80.0      0      0     27042  30.0000    A23         S
851          852         0       3                   Svensson, Mr. Johan  male  74.0      0      0    347060   7.7750    NaN         S
493          494         0       1               Artagaveytia, Mr. Ramon  male  71.0      0      0  PC 17609  49.5042    NaN         C
96            97         0       1             Goldschmidt, Mr. George B  male  71.0      0      0  PC 17754  34.6542     A5         C
116          117         0       3                  Connors, Mr. Patrick  male  70.5      0      0    370369   7.7500    NaN         Q
..           ...       ...     ...                                   ...   ...   ...    ...    ...       ...      ...    ...       ...
481          482         0       2      Frost, Mr. Anthony Wood "Archie"  male   NaN      0      0    239854   0.0000    NaN         S
633          634         0       1         Parr, Mr. William Henry Marsh  male   NaN      0      0    112052   0.0000    NaN         S
674          675         0       2            Watson, Mr. Ennis Hastings  male   NaN      0      0    239856   0.0000    NaN         S
732          733         0       2                  Knight, Mr. Robert J  male   NaN      0      0    239855   0.0000    NaN         S
815          816         0       1                      Fry, Mr. Richard  male   NaN      0      0    112058   0.0000   B102         S

891 rows × 12 columns

df.sort_values(['Fare','Age'],ascending=False)
     PassengerId  Survived  Pclass                                Name     Sex   Age  SibSp  Parch    Ticket      Fare        Cabin  Embarked
679          680         1       1  Cardeza, Mr. Thomas Drake Martinez    male  36.0      0      1  PC 17755  512.3292  B51 B53 B55         C
258          259         1       1                    Ward, Miss. Anna  female  35.0      0      0  PC 17755  512.3292          NaN         C
737          738         1       1              Lesurer, Mr. Gustave J    male  35.0      0      0  PC 17755  512.3292         B101         C
438          439         0       1                   Fortune, Mr. Mark    male  64.0      1      4     19950  263.0000  C23 C25 C27         S
341          342         1       1      Fortune, Miss. Alice Elizabeth  female  24.0      3      2     19950  263.0000  C23 C25 C27         S
..           ...       ...     ...                                 ...     ...   ...    ...    ...       ...       ...          ...       ...
481          482         0       2    Frost, Mr. Anthony Wood "Archie"    male   NaN      0      0    239854    0.0000          NaN         S
633          634         0       1       Parr, Mr. William Henry Marsh    male   NaN      0      0    112052    0.0000          NaN         S
674          675         0       2          Watson, Mr. Ennis Hastings    male   NaN      0      0    239856    0.0000          NaN         S
732          733         0       2                Knight, Mr. Robert J    male   NaN      0      0    239855    0.0000          NaN         S
815          816         0       1                    Fry, Mr. Richard    male   NaN      0      0    112058    0.0000         B102         S

891 rows × 12 columns
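
Both outputs end with rows whose age is NaN, because sort_values places missing values last by default regardless of direction. The sketch below uses an invented four-row frame (`toy`) to show per-column directions and the na_position parameter:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a tie in Age and one missing Age.
toy = pd.DataFrame({
    "Age": [30.0, np.nan, 80.0, 30.0],
    "Fare": [10.0, 0.0, 30.0, 50.0],
})

# Age descending; ties in Age are broken by Fare ascending. NaN ages go last.
mixed = toy.sort_values(["Age", "Fare"], ascending=[False, True])
# na_position="first" moves the rows with a missing Age to the top instead.
nan_first = toy.sort_values(["Age", "Fare"], ascending=[False, True], na_position="first")

print(mixed.index.tolist())      # [2, 0, 3, 1]
print(nan_first.index.tolist())  # [1, 2, 0, 3]
```

The order of the column list matters: the first column is the primary key and later ones only break ties, which is why the two Titanic outputs above start with different passengers.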

Task 3: Use pandas for arithmetic: compute the result of adding two DataFrames

frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),
    columns=['a', 'b', 'c'],
    index=['one', 'two', 'three'])
frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),
    columns=['a', 'e', 'c'],
    index=['first', 'one', 'two', 'second'])
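frame1_a
         a    b    c
one    0.0  1.0  2.0
two    3.0  4.0  5.0
three  6.0  7.0  8.0
frame1_b
          a     e     c
first   0.0   1.0   2.0
one     3.0   4.0   5.0
two     6.0   7.0   8.0
second  9.0  10.0  11.0
frame1_a + frame1_b  # rows and columns are aligned by label; the result covers the union of both
          a   b     c   e
first   NaN NaN   NaN NaN
one     3.0 NaN   7.0 NaN
second  NaN NaN   NaN NaN
three   NaN NaN   NaN NaN
two     9.0 NaN  13.0 NaN
Adding two DataFrames returns a new DataFrame. Values whose row and column labels appear in both frames are summed, and every position that exists in only one of them becomes NaN.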
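
When NaN for the unmatched positions is not wanted, the add method accepts a fill_value. A minimal sketch, assuming two invented 2×2 frames (`a`, `b`) that overlap in one row and one column:

```python
import numpy as np
import pandas as pd

# Hypothetical frames sharing only row "one" and column "a".
a = pd.DataFrame(np.arange(4.).reshape(2, 2), columns=["a", "b"], index=["one", "two"])
b = pd.DataFrame(np.arange(4.).reshape(2, 2), columns=["a", "c"], index=["one", "three"])

plain = a + b                     # NaN wherever either frame lacks the label
filled = a.add(b, fill_value=0)   # a value missing from one frame is treated as 0

print(int(plain.notna().sum().sum()))  # 1: only ("one", "a") exists in both
print(filled.loc["one", "b"])          # 1.0: taken from a, with 0 added
print(int(filled.isna().sum().sum()))  # 2: positions missing from both frames stay NaN
```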

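Task 4: Using the Titanic data, how can we work out how many people were in the largest family on board?

max(text['兄弟姐妹个数']+text['父母子女个数'])  # largest number of relatives any one passenger has on board; add 1 to include the passenger
10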
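
The sum of siblings/spouses and parents/children counts a passenger's relatives, not the passenger. A small sketch with invented numbers (`toy`), using English column names for readability:

```python
import pandas as pd

# Hypothetical passengers: relatives aboard, split the way the Titanic columns are.
toy = pd.DataFrame({"SibSp": [1, 8, 0], "Parch": [0, 2, 0]})

relatives = toy["SibSp"] + toy["Parch"]   # element-wise sum of the two columns
family_size = relatives + 1               # include the passenger themselves

print(relatives.max())       # 10
print(family_size.max())     # 11
print(family_size.idxmax())  # 1: index label of the passenger with the largest family
```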

Task 5: Learn to use the pandas describe() function to view basic summary statistics

frame2 = pd.DataFrame([[1.4, np.nan], 
    [7.1, -4.5],
    [np.nan, np.nan], 
    [0.75, -1.3]
    ], index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
frame2
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
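frame.describe()  # note: this describes frame from Task 1, not the frame2 just built; frame2.describe() would summarise the new data
              d         a         b         c
count  2.000000  2.000000  2.000000  2.000000
mean   2.000000  3.000000  4.000000  5.000000
std    2.828427  2.828427  2.828427  2.828427
min    0.000000  1.000000  2.000000  3.000000
25%    1.000000  2.000000  3.000000  4.000000
50%    2.000000  3.000000  4.000000  5.000000
75%    3.000000  4.000000  5.000000  6.000000
max    4.000000  5.000000  6.000000  7.000000
count : number of non-missing values
mean : mean of the values
std : sample standard deviation
min : smallest value
25% : 25th percentile (first quartile)
50% : 50th percentile (median)
75% : 75th percentile (third quartile)
max : largest value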
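
Each entry of the describe() output can be reproduced with a direct method call, which helps when checking what a row means. A sketch on an invented five-value Series (`s`):

```python
import pandas as pd

# Hypothetical sample: the values 1 to 5.
s = pd.Series([1, 2, 3, 4, 5], dtype="float64")

summary = s.describe()

# Direct equivalents of individual rows of the summary.
q25 = s.quantile(0.25)   # same as summary["25%"]
median = s.median()      # same as summary["50%"]
std = s.std()            # sample standard deviation (divides by n - 1)

print(summary["count"], summary["mean"])  # 5.0 3.0
print(q25, median)                        # 2.0 3.0
print(round(std, 4))                      # 1.5811
```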

Task 6: Look at the basic statistics of the fare (票价) and parents/children (父母子女个数) columns of the Titanic dataset. What can you find?

text['票价'].describe()
count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: 票价, dtype: float64
text['父母子女个数'].describe()
count    891.000000
mean       0.381594
std        0.806057
min        0.000000
25%        0.000000
50%        0.000000
75%        0.000000
max        6.000000
Name: 父母子女个数, dtype: float64
