DadaWhale之task01-数据初步探索

import numpy as np
import pandas as pd
#相对路径
train = pd.read_csv("train.csv") # ../上级文件夹
test = pd.read_csv("test.csv") 

#绝对路径
train = pd.read_csv(r‘C:\Users\63509\Desktop\titanic’)

#查看当前工作目录
import os
os.getcwd()
'C:\\Users\\63509\\DataWhale'
  • read_csv: 从文件、URL、文件型对下中加载带分隔符的数据。默认分隔符为逗号

  • read_table: 从文件、URL、文件型对下中加载带分隔符的数据。默认分隔符为制表符

TSV 是Tab-separated values的缩写,即制表符分隔值。
相对来说CSV,Comma-separated values(逗号分隔值)更常见一些。

1.逐块读取

1.只想读取几行,避免读取整个文件,通过 nrows进行指定:
pd.read_csv(path, nrows=5)
2.逐块读取文件,需要设置chunksize(行数)
pd.read_csv(path, chunksize=1000)

train_01 = pd.read_csv("train.csv", nrows=5)
train_01
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale2210A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female3810PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale2600STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female351011380353.1000C123S
4503Allen, Mr. William Henrymale35003734508.0500NaNS
train_02 = pd.read_csv("train.csv", chunksize=1000)
train_02
<pandas.io.parsers.TextFileReader at 0x21f3ad9f9d0>
train.rename(columns={"PassengerId":"乘客ID","Survived":"是否幸存","Pclass":"乘客等级(1/2/3等舱位)","Name":"乘客姓名","Sex":"性别","Age":"年龄","SibSp":"堂兄弟/妹个数","Parch":"父母与小孩个数","Ticket":"船票信息","Fare":"票价","Cabin":"客舱","Embarked":"登船港口"}, inplace=True)
train.head()

'''
df = pd.read_csv('train.csv', names=['乘客ID','是否幸存','仓位等级','姓名','性别','年龄','兄弟姐妹个数','父母子女个数','船票信息','票价','客舱','登船港口'],index_col='乘客ID',header=0)
'''
乘客ID是否幸存乘客等级(1/2/3等舱位)乘客姓名性别年龄堂兄弟/妹个数父母与小孩个数船票信息票价客舱登船港口
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
train.set_index("乘客ID", inplace=True)
train.head()
是否幸存乘客等级(1/2/3等舱位)乘客姓名性别年龄堂兄弟/妹个数父母与小孩个数船票信息票价客舱登船港口
乘客ID
103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
503Allen, Mr. William Henrymale35.0003734508.0500NaNS

2.初步观察

train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   是否幸存            891 non-null    int64  
 1   乘客等级(1/2/3等舱位)  891 non-null    int64  
 2   乘客姓名            891 non-null    object 
 3   性别              891 non-null    object 
 4   年龄              714 non-null    float64
 5   堂兄弟/妹个数         891 non-null    int64  
 6   父母与小孩个数         891 non-null    int64  
 7   船票信息            891 non-null    object 
 8   票价              891 non-null    float64
 9   客舱              204 non-null    object 
 10  登船港口            889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB
train.shape
(891, 11)
train.describe()
是否幸存乘客等级(1/2/3等舱位)年龄堂兄弟/妹个数父母与小孩个数票价
count891.000000891.000000714.000000891.000000891.000000891.000000
mean0.3838382.30864229.6991180.5230080.38159432.204208
std0.4865920.83607114.5264971.1027430.80605749.693429
min0.0000001.0000000.4200000.0000000.0000000.000000
25%0.0000002.00000020.1250000.0000000.0000007.910400
50%0.0000003.00000028.0000000.0000000.00000014.454200
75%1.0000003.00000038.0000001.0000000.00000031.000000
max1.0000003.00000080.0000008.0000006.000000512.329200
train.head(10)
是否幸存乘客等级(1/2/3等舱位)乘客姓名性别年龄堂兄弟/妹个数父母与小孩个数船票信息票价客舱登船港口
乘客ID
103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
503Allen, Mr. William Henrymale35.0003734508.0500NaNS
603Moran, Mr. JamesmaleNaN003308778.4583NaNQ
701McCarthy, Mr. Timothy Jmale54.0001746351.8625E46S
803Palsson, Master. Gosta Leonardmale2.03134990921.0750NaNS
913Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)female27.00234774211.1333NaNS
1012Nasser, Mrs. Nicholas (Adele Achem)female14.01023773630.0708NaNC
train.tail(15)
是否幸存乘客等级(1/2/3等舱位)乘客姓名性别年龄堂兄弟/妹个数父母与小孩个数船票信息票价客舱登船港口
乘客ID
87703Gustafsson, Mr. Alfred Ossianmale20.00075349.8458NaNS
87803Petroff, Mr. Nedeliomale19.0003492127.8958NaNS
87903Laleff, Mr. KristomaleNaN003492177.8958NaNS
88011Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)female56.0011176783.1583C50C
88112Shelley, Mrs. William (Imanita Parrish Hall)female25.00123043326.0000NaNS
88203Markun, Mr. Johannmale33.0003492577.8958NaNS
88303Dahlberg, Miss. Gerda Ulrikafemale22.000755210.5167NaNS
88402Banfield, Mr. Frederick Jamesmale28.000C.A./SOTON 3406810.5000NaNS
88503Sutehall, Mr. Henry Jrmale25.000SOTON/OQ 3920767.0500NaNS
88603Rice, Mrs. William (Margaret Norton)female39.00538265229.1250NaNQ
88702Montvila, Rev. Juozasmale27.00021153613.0000NaNS
88811Graham, Miss. Margaret Edithfemale19.00011205330.0000B42S
88903Johnston, Miss. Catherine Helen "Carrie"femaleNaN12W./C. 660723.4500NaNS
89011Behr, Mr. Karl Howellmale26.00011136930.0000C148C
89103Dooley, Mr. Patrickmale32.0003703767.7500NaNQ
train.isnull()
是否幸存乘客等级(1/2/3等舱位)乘客姓名性别年龄堂兄弟/妹个数父母与小孩个数船票信息票价客舱登船港口
乘客ID
1FalseFalseFalseFalseFalseFalseFalseFalseFalseTrueFalse
2FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
3FalseFalseFalseFalseFalseFalseFalseFalseFalseTrueFalse
4FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
5FalseFalseFalseFalseFalseFalseFalseFalseFalseTrueFalse
....................................
887FalseFalseFalseFalseFalseFalseFalseFalseFalseTrueFalse
888FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
889FalseFalseFalseFalseTrueFalseFalseFalseFalseTrueFalse
890FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
891FalseFalseFalseFalseFalseFalseFalseFalseFalseTrueFalse

891 rows × 11 columns

#哪些列有缺失值
train.isnull().any()
是否幸存              False
乘客等级(1/2/3等舱位)    False
乘客姓名              False
性别                False
年龄                 True
堂兄弟/妹个数           False
父母与小孩个数           False
船票信息              False
票价                False
客舱                 True
登船港口               True
dtype: bool

3.保存数据

train.to_csv("train_chinese.csv")
```## 标题
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值