Study check-in: Hands-on Data Analysis, Task 01

1 Loading and inspecting the data

1.1 Import numpy and pandas

import numpy as np
import pandas as pd

1.2 Load the data

(1) Load the data using a relative path

data=pd.read_csv('train.csv')   # relative path
data.head()         # head() returns the first 5 rows by default
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

(2) Load the data using an absolute path

data=pd.read_csv("D:/Jupyter NoteBook/组队学习/hands-on-data-analysis-master/第一单元项目集合/train.csv")
data.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

[Hint] If loading with a relative path raises an error, call os.getcwd() to check the current working directory.
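To make the hint concrete, here is a minimal sketch of how the working directory determines where a relative path points. The helper name resolve_data_path is hypothetical and introduced only for illustration; it assumes the file is expected directly inside the working directory.

```python
import os
from pathlib import Path


def resolve_data_path(filename, base_dir=None):
    """Return the absolute path that `filename` refers to.

    Hypothetical helper: `base_dir` defaults to the current working
    directory, which is what pandas uses for relative paths.
    """
    base = Path(base_dir) if base_dir is not None else Path(os.getcwd())
    return (base / filename).resolve()


# Show where Python is currently running and where it will look for the file
print("Current working directory:", os.getcwd())
csv_path = resolve_data_path("train.csv")
print("Relative path resolves to:", csv_path)
print("File exists:", csv_path.exists())
```

If the last line prints False, either move train.csv into the printed directory or pass the printed absolute location of the file to pd.read_csv instead.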

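[Think] Now that you know how to load data, compare pd.read_csv() and pd.read_table(). What would you need to do to make them behave the same? Look into the difference between '.tsv' and '.csv' files: how would you load each kind of dataset?

1) read_table uses the tab character \t as its default delimiter, while read_csv uses a comma ','; both treat each line of the file as one record. Passing sep=',' to read_table (or sep='\t' to read_csv) makes the two behave the same.

2) TSV stands for "tab-separated values"; CSV stands for "comma-separated values".

Reading the data with pd.read_table()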

data=pd.read_table('train.csv',sep=',')
data.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
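The comparison between read_csv and read_table can be checked without touching any file. The sketch below uses two small in-memory strings (made-up stand-ins for a .csv and a .tsv file, not the real dataset) and shows that the two functions differ only in their default delimiter.

```python
import io

import pandas as pd

# Illustrative in-memory data: the same two rows, once comma-separated
# (names quoted because they contain commas) and once tab-separated.
csv_text = (
    "PassengerId,Name,Fare\n"
    '1,"Braund, Mr. Owen Harris",7.25\n'
    '2,"Heikkinen, Miss. Laina",7.925\n'
)
tsv_text = (
    "PassengerId\tName\tFare\n"
    "1\tBraund, Mr. Owen Harris\t7.25\n"
    "2\tHeikkinen, Miss. Laina\t7.925\n"
)

# Comma-separated input: read_csv by default, read_table with sep=","
from_csv = pd.read_csv(io.StringIO(csv_text))
from_table = pd.read_table(io.StringIO(csv_text), sep=",")

# Tab-separated input: read_table by default, read_csv with sep="\t"
from_tsv = pd.read_table(io.StringIO(tsv_text))
from_tsv_csv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(from_csv.equals(from_table))    # same delimiter, same result
print(from_tsv.equals(from_tsv_csv))  # same delimiter, same result
```

For a real .tsv file the same calls apply with a file path in place of the io.StringIO object.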

1.3 Read the data in chunks of 1,000 rows

chunks=pd.read_csv('train.csv',chunksize=1000)
chunks
<pandas.io.parsers.TextFileReader at 0x20b2ef42490>

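[Think] What is chunked reading, and why read in chunks?

[Hint] What type is the chunker (the data chunk)? What does each chunk look like when you print it in a for loop?

read_csv normally reads the entire file into a single DataFrame, which consumes a lot of memory when the file is large. Setting the chunksize parameter tells read_csv to read the file in blocks of that many rows. It then returns an iterable TextFileReader object instead of a DataFrame, and a for loop pulls out the blocks one at a time, each as a DataFrame.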

chunks=pd.read_csv('train.csv',chunksize=500)      # the dataset has fewer than 1,000 rows, so use 500 here
for chunk in chunks:
    print(chunk)
     PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
..           ...       ...     ...   
495          496         0       3   
496          497         1       1   
497          498         0       3   
498          499         0       1   
499          500         0       3   

                                                  Name     Sex   Age  SibSp  \
0                              Braund, Mr. Owen Harris    male  22.0      1   
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                               Heikkinen, Miss. Laina  female  26.0      0   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                             Allen, Mr. William Henry    male  35.0      0   
..                                                 ...     ...   ...    ...   
495                              Yousseff, Mr. Gerious    male   NaN      0   
496                     Eustis, Miss. Elizabeth Mussey  female  54.0      1   
497                    Shellard, Mr. Frederick William    male   NaN      0   
498    Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  female  25.0      1   
499                                 Svensson, Mr. Olof    male  24.0      0   

     Parch            Ticket      Fare    Cabin Embarked  
0        0         A/5 21171    7.2500      NaN        S  
1        0          PC 17599   71.2833      C85        C  
2        0  STON/O2. 3101282    7.9250      NaN        S  
3        0            113803   53.1000     C123        S  
4        0            373450    8.0500      NaN        S  
..     ...               ...       ...      ...      ...  
495      0              2627   14.4583      NaN        C  
496      0             36947   78.2667      D20        C  
497      0         C.A. 6212   15.1000      NaN        S  
498      2            113781  151.5500  C22 C26        S  
499      0            350035    7.7958      NaN        S  

[500 rows x 12 columns]
     PassengerId  Survived  Pclass                                      Name  \
500          501         0       3                          Calic, Mr. Petar   
501          502         0       3                       Canavan, Miss. Mary   
502          503         0       3            O'Sullivan, Miss. Bridget Mary   
503          504         0       3            Laitinen, Miss. Kristina Sofia   
504          505         1       1                     Maioni, Miss. Roberta   
..           ...       ...     ...                                       ...   
886          887         0       2                     Montvila, Rev. Juozas   
887          888         1       1              Graham, Miss. Margaret Edith   
888          889         0       3  Johnston, Miss. Catherine Helen "Carrie"   
889          890         1       1                     Behr, Mr. Karl Howell   
890          891         0       3                       Dooley, Mr. Patrick   

        Sex   Age  SibSp  Parch      Ticket     Fare Cabin Embarked  
500    male  17.0      0      0      315086   8.6625   NaN        S  
501  female  21.0      0      0      364846   7.7500   NaN        Q  
502  female   NaN      0      0      330909   7.6292   NaN        Q  
503  female  37.0      0      0        4135   9.5875   NaN        S  
504  female  16.0      0      0      110152  86.5000   B79        S  
..      ...   ...    ...    ...         ...      ...   ...      ...  
886    male  27.0      0      0      211536  13.0000   NaN        S  
887  female  19.0      0      0      112053  30.0000   B42        S  
888  female   NaN      1      2  W./C. 6607  23.4500   NaN        S  
889    male  26.0      0      0      111369  30.0000  C148        C  
890    male  32.0      0      0      370376   7.7500   NaN        Q  

[391 rows x 12 columns]
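Printing every chunk is mainly useful for seeing what a chunk is. In practice each chunk is usually reduced to a small result so the whole file never sits in memory at once. The following is a minimal sketch of that pattern on synthetic data (a 7-row stand-in, not train.csv), assuming the goal is simply to count rows and survivors.

```python
import io

import pandas as pd

# Synthetic stand-in for a large file: 7 data rows with alternating Survived values
csv_text = "PassengerId,Survived\n" + "".join(
    f"{i},{i % 2}\n" for i in range(1, 8)
)

chunk_sizes = []     # number of rows seen in each chunk
chunk_types = set()  # type of the objects the reader yields
survived_total = 0   # running total accumulated chunk by chunk

reader = pd.read_csv(io.StringIO(csv_text), chunksize=3)
for chunk in reader:
    chunk_types.add(type(chunk).__name__)
    chunk_sizes.append(len(chunk))
    survived_total += int(chunk["Survived"].sum())

print(chunk_types)
print(chunk_sizes)
print(survived_total)

# When the pieces do fit in memory, they can be stitched back together
rebuilt = pd.concat(
    pd.read_csv(io.StringIO(csv_text), chunksize=3), ignore_index=True
)
whole = pd.read_csv(io.StringIO(csv_text))
print(rebuilt.equals(whole))
```

The last chunk is simply shorter than the others, which matches the 500 + 391 split in the printed output above.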

1.4 Rename the column headers

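PassengerId => 乘客ID (passenger ID)
Survived => 是否幸存 (survived or not)
Pclass => 乘客等级 (passenger class: 1st/2nd/3rd)
Name => 乘客姓名 (passenger name)
Sex => 性别 (sex)
Age => 年龄 (age)
SibSp => 堂兄弟/妹个数 (number of siblings/spouses aboard; the Chinese label literally reads "number of cousins")
Parch => 父母与小孩个数 (number of parents/children aboard)
Ticket => 船票信息 (ticket information)
Fare => 票价 (fare)
Cabin => 客舱 (cabin)
Embarked => 登船港口 (port of embarkation)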

# Inspect the data

data=pd.read_csv('train.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
data.columns=['乘客ID','是否幸存', '乘客等级', '乘客姓名','性别','年龄',
              '堂兄弟/妹个数','父母与小孩个数','船票信息','票价','客舱','登船港口']
data_=data.set_index("乘客ID")
data_.head(2)
是否幸存 乘客等级 乘客姓名 性别 年龄 堂兄弟/妹个数 父母与小孩个数 船票信息 票价 客舱 登船港口
乘客ID
1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
data_1= pd
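Assigning a full list to data.columns works only if the list has exactly as many names as the DataFrame has columns, in the same order. As an alternative sketch (not the approach used in the post), DataFrame.rename with a mapping names each column explicitly and returns a new DataFrame. The small frame below is made-up illustrative data covering three of the twelve columns.

```python
import pandas as pd

# Made-up two-row frame with three of the original column names
df = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Pclass": [3, 1]})

# Map old name -> new name; columns not listed would keep their names
column_map = {
    "PassengerId": "乘客ID",
    "Survived": "是否幸存",
    "Pclass": "乘客等级",
}

renamed = df.rename(columns=column_map)  # returns a copy, df is unchanged
indexed = renamed.set_index("乘客ID")     # use the passenger ID as the index

print(list(df.columns))
print(list(renamed.columns))
print(indexed.index.name)
```

Because the mapping is keyed by the old names, reordering or adding columns in the source file does not silently attach a label to the wrong column.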