1 Loading and inspecting the data
1.1 Import numpy and pandas
import numpy as np
import pandas as pd
1.2 Load the data
(1) Load the data using a relative path
data = pd.read_csv('train.csv')  # relative path
data.head()  # head() returns the first 5 rows by default
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
(2) Load the data using an absolute path
data=pd.read_csv("D:/Jupyter NoteBook/组队学习/hands-on-data-analysis-master/第一单元项目集合/train.csv")
data.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
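[Hint] If loading with a relative path raises an error, run import os and then os.getcwd() to check the current working directory.
[Question] Now that you know how to load data, compare pd.read_csv() and pd.read_table(). What would you have to do to make them behave the same? Look into the difference between '.tsv' and '.csv' files. How would you load each kind?
1) Both functions parse the file line by line, one record per line. The main difference is the default delimiter: read_table splits fields on the tab character '\t', while read_csv splits on the comma ','. Passing sep=',' to read_table makes it behave like read_csv.
2) TSV stands for tab-separated values; CSV stands for comma-separated values.
Reading the data with pd.read_table()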
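The hint above can be turned into a small check before loading. The following is a minimal sketch, not part of the original notebook: the helper name resolve_data_path is hypothetical, and it only shows where a relative file name would be looked up.

```python
import os
from pathlib import Path


def resolve_data_path(filename, base_dir=None):
    """Return the absolute path of `filename` relative to `base_dir`.

    `resolve_data_path` is a hypothetical helper for illustration.
    When `base_dir` is None, the current working directory is used,
    which is where pandas looks for a relative path as well.
    """
    base = Path(base_dir) if base_dir is not None else Path(os.getcwd())
    return (base / filename).resolve()


csv_path = resolve_data_path("train.csv")
print("Working directory:", os.getcwd())
print("Looking for:", csv_path, "| exists:", csv_path.exists())
```

If exists is False, either move train.csv into the working directory or pass the folder that contains it as base_dir.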
data=pd.read_table('train.csv',sep=',')
data.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
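The relationship between read_csv and read_table can also be checked without the real file. The sketch below uses a tiny in-memory sample (made-up rows standing in for train.csv, an assumption for illustration) and compares the resulting DataFrames.

```python
import io

import pandas as pd

# A tiny made-up sample standing in for train.csv (illustration only).
csv_text = "PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n"
# The same sample with tabs as the delimiter, as a .tsv file would have.
tsv_text = csv_text.replace(",", "\t")

from_csv = pd.read_csv(io.StringIO(csv_text))                 # default sep=','
from_table = pd.read_table(io.StringIO(csv_text), sep=",")    # override the default '\t'
from_tsv = pd.read_table(io.StringIO(tsv_text))               # default sep='\t'
from_tsv_csv = pd.read_csv(io.StringIO(tsv_text), sep="\t")   # override the default ','

print(from_csv.equals(from_table))
print(from_csv.equals(from_tsv))
print(from_csv.equals(from_tsv_csv))
```

In other words, either function can load either kind of file as long as sep matches the delimiter actually used in the file.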
1.3 Read the data in chunks of 1000 rows
chunks=pd.read_csv('train.csv',chunksize=1000)
chunks
<pandas.io.parsers.TextFileReader at 0x20b2ef42490>
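[Question] What is chunked reading, and why would you read a file in chunks?
[Hint] What type is the chunker (the object that holds the data chunks)? What does each chunk look like when you print it in a for loop?
By default, read_csv loads the whole file into a single DataFrame, which can use a lot of memory when the file is large. Setting the chunksize parameter makes read_csv return an iterable TextFileReader object instead of a DataFrame. Each iteration of a for loop over it yields a DataFrame of at most chunksize rows.
chunks = pd.read_csv('train.csv', chunksize=500)  # the dataset has fewer than 1000 rows, so use 500 here
for chunk in chunks:
    print(chunk)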
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
.. ... ... ...
495 496 0 3
496 497 1 1
497 498 0 3
498 499 0 1
499 500 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
.. ... ... ... ...
495 Yousseff, Mr. Gerious male NaN 0
496 Eustis, Miss. Elizabeth Mussey female 54.0 1
497 Shellard, Mr. Frederick William male NaN 0
498 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0 1
499 Svensson, Mr. Olof male 24.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
.. ... ... ... ... ...
495 0 2627 14.4583 NaN C
496 0 36947 78.2667 D20 C
497 0 C.A. 6212 15.1000 NaN S
498 2 113781 151.5500 C22 C26 S
499 0 350035 7.7958 NaN S
[500 rows x 12 columns]
PassengerId Survived Pclass Name \
500 501 0 3 Calic, Mr. Petar
501 502 0 3 Canavan, Miss. Mary
502 503 0 3 O'Sullivan, Miss. Bridget Mary
503 504 0 3 Laitinen, Miss. Kristina Sofia
504 505 1 1 Maioni, Miss. Roberta
.. ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas
887 888 1 1 Graham, Miss. Margaret Edith
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie"
889 890 1 1 Behr, Mr. Karl Howell
890 891 0 3 Dooley, Mr. Patrick
Sex Age SibSp Parch Ticket Fare Cabin Embarked
500 male 17.0 0 0 315086 8.6625 NaN S
501 female 21.0 0 0 364846 7.7500 NaN Q
502 female NaN 0 0 330909 7.6292 NaN Q
503 female 37.0 0 0 4135 9.5875 NaN S
504 female 16.0 0 0 110152 86.5000 B79 S
.. ... ... ... ... ... ... ... ...
886 male 27.0 0 0 211536 13.0000 NaN S
887 female 19.0 0 0 112053 30.0000 B42 S
888 female NaN 1 2 W./C. 6607 23.4500 NaN S
889 male 26.0 0 0 111369 30.0000 C148 C
890 male 32.0 0 0 370376 7.7500 NaN Q
[391 rows x 12 columns]
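Printing every chunk mainly shows what the iterator yields. In practice, chunked reading is usually combined with some per-chunk work, so that only one chunk is in memory at a time. The sketch below is an illustration under stated assumptions: the in-memory sample and the chunk size of 4 are made up, and the running totals are just one possible per-chunk computation.

```python
import io

import pandas as pd

# Made-up stand-in for train.csv: 10 rows, Survived alternates 1, 0, 1, 0, ...
rows = ["PassengerId,Survived"] + [f"{i},{i % 2}" for i in range(1, 11)]
sample = io.StringIO("\n".join(rows))

chunk_sizes = []
total_rows = 0
total_survived = 0

# Each `chunk` is an ordinary DataFrame with at most `chunksize` rows.
for chunk in pd.read_csv(sample, chunksize=4):
    chunk_sizes.append(len(chunk))
    total_rows += len(chunk)
    total_survived += int(chunk["Survived"].sum())

print(chunk_sizes, total_rows, total_survived)
```

The last chunk is simply shorter when the row count is not a multiple of chunksize, which matches the 500-row and 391-row chunks printed above.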
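1.4 Rename the column headers
PassengerId => 乘客ID (passenger ID)
Survived => 是否幸存 (survived or not)
Pclass => 乘客等级 (ticket class: 1st, 2nd, or 3rd)
Name => 乘客姓名 (passenger name)
Sex => 性别 (sex)
Age => 年龄 (age)
SibSp => 堂兄弟/妹个数 (number of siblings and spouses aboard)
Parch => 父母与小孩个数 (number of parents and children aboard)
Ticket => 船票信息 (ticket number)
Fare => 票价 (fare)
Cabin => 客舱 (cabin)
Embarked => 登船港口 (port of embarkation)
Note: the label 堂兄弟/妹个数 literally reads "number of cousins", but SibSp counts siblings and spouses aboard.
# Inspect the data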
data=pd.read_csv('train.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
data.columns=['乘客ID','是否幸存', '乘客等级', '乘客姓名','性别','年龄',
'堂兄弟/妹个数','父母与小孩个数','船票信息','票价','客舱','登船港口']
data_=data.set_index("乘客ID")
data_.head(2)
乘客ID | 是否幸存 | 乘客等级 | 乘客姓名 | 性别 | 年龄 | 堂兄弟/妹个数 | 父母与小孩个数 | 船票信息 | 票价 | 客舱 | 登船港口 |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
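Assigning a full list to data.columns depends on the list having exactly the right length and order. As an alternative sketch (not the approach used in the notebook), DataFrame.rename with an explicit mapping ties each new label to its original column name. The three-column frame below is a made-up sample for illustration.

```python
import pandas as pd

# Made-up sample with a few of the original Titanic column names.
df = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Pclass": [3, 1]})

# Map old names to the new labels; columns not listed keep their names.
column_map = {
    "PassengerId": "乘客ID",
    "Survived": "是否幸存",
    "Pclass": "乘客等级",
}

# rename returns a new DataFrame; df itself is left unchanged.
renamed = df.rename(columns=column_map).set_index("乘客ID")
print(renamed)
```

Because the mapping is keyed by name, it still works if the column order of the input file changes.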
data_1= pd