DW_Data Analysis Task 1

import numpy as np 
import pandas as pd
import os
df = pd.read_csv('train.csv') 
df.head()
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000   C123         S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500    NaN         S
# Check the current working directory
os.getcwd()
'C:\\Users\\royryanwang\\Desktop\\DW数分'
pd.read_table('train.csv')
     PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0    1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/...
1    2,1,1,"Cumings, Mrs. John Bradley (Florence Br...
2    3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,S...
3    4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May ...
4    5,0,3,"Allen, Mr. William Henry",male,35,0,0,3...
..   ...
886  887,0,2,"Montvila, Rev. Juozas",male,27,0,0,21...
887  888,1,1,"Graham, Miss. Margaret Edith",female,...
888  889,0,3,"Johnston, Miss. Catherine Helen ""Car...
889  890,1,1,"Behr, Mr. Karl Howell",male,26,0,0,11...
890  891,0,3,"Dooley, Mr. Patrick",male,32,0,0,3703...

891 rows × 1 columns
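
The single-column result above comes from the default separator: pd.read_table splits on tabs, so each comma-separated line is read as one field. A minimal sketch of the difference follows. It assumes a small in-memory sample (two rows copied from the output above, with io.StringIO standing in for train.csv) so that it runs on its own.

```python
import io
import pandas as pd

# Two-row sample standing in for train.csv (assumption: same comma-separated layout)
sample = (
    "PassengerId,Survived,Pclass,Name,Sex\n"
    '1,0,3,"Braund, Mr. Owen Harris",male\n'
    '3,1,3,"Heikkinen, Miss. Laina",female\n'
)

# Default separator is a tab, so every line lands in a single column
one_col = pd.read_table(io.StringIO(sample))

# Passing sep=',' makes read_table split the fields the way read_csv does
split = pd.read_table(io.StringIO(sample), sep=',')

print(one_col.shape)
print(split.shape)
```

With sep=',' the quoted name, which itself contains a comma, still ends up in one Name field.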

chunker = pd.read_csv('train.csv', chunksize=1000)
chunker
<pandas.io.parsers.TextFileReader at 0x22317586f48>
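
The object returned when chunksize is set is an iterator rather than a DataFrame: looping over it yields one DataFrame per chunk. The sketch below illustrates this under stated assumptions: a made-up five-row sample replaces train.csv, and chunksize=2 is chosen only to make the chunking visible.

```python
import io
import pandas as pd

# Hypothetical 5-row sample standing in for train.csv
sample = "PassengerId,Survived\n1,0\n2,1\n3,1\n4,1\n5,0\n"

# chunksize=2 yields DataFrames of at most 2 rows each
chunker = pd.read_csv(io.StringIO(sample), chunksize=2)

chunk_sizes = []
survived_total = 0
for chunk in chunker:
    chunk_sizes.append(len(chunk))               # rows in this chunk
    survived_total += chunk["Survived"].sum()    # accumulate across chunks

print(chunk_sizes)
print(survived_total)
```

Processing a file chunk by chunk keeps only one chunk in memory at a time, which matters for files much larger than this 891-row dataset.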
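Change the column headers to Chinese and use the passenger ID (乘客ID) as the index
# names supplies Chinese replacements, in order, for PassengerId, Survived, Pclass, Name, Sex, Age,
# SibSp, Parch, Ticket, Fare, Cabin, Embarked; header=0 discards the original English header row
df = pd.read_csv(
    'train.csv',
    names=['乘客ID','是否幸存','仓位等级','姓名','性别','年龄','兄弟姐 妹个数','父母子女个数','船票信息','票价','客舱','登船港口'],
    index_col='乘客ID',
    header=0,
)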
df.head()
        是否幸存  仓位等级                                               姓名    性别  年龄  兄弟姐 妹个数  父母子女个数          船票信息     票价  客舱  登船港口
乘客ID
1              0         3                            Braund, Mr. Owen Harris    male  22.0              1             0         A/5 21171   7.2500   NaN         S
2              1         1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0              1             0          PC 17599  71.2833   C85         C
3              1         3                             Heikkinen, Miss. Laina  female  26.0              0             0  STON/O2. 3101282   7.9250   NaN         S
4              1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0              1             0            113803  53.1000  C123         S
5              0         3                           Allen, Mr. William Henry    male  35.0              0             0            373450   8.0500   NaN         S
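Other ways to rename DataFrame rows and columns
from pandas import DataFrame, Series
data = DataFrame({"a":[1, 2, 3, 4], "b":[4, 5, 6, 7]})
data.columns = ["c", "d"]  # assign new column names directly; this modifies data in place
from pandas import DataFrame, Series
data = DataFrame({"a":[1, 2, 3, 4], "b":[4, 5, 6, 7]})
data.rename(columns={"a":"c", "b":"d"})  # returns a renamed copy; data itself is unchanged unless inplace=True
   c  d
0  1  4
1  2  5
2  3  6
3  4  7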
from pandas import DataFrame, Series
data = DataFrame({"a":[1, 2, 3, 4], "b":[4, 5, 6, 7]})
data.insert(0, 'c', data.pop('a'))  # pop the column, then reinsert it at the same position under a new name
data.insert(1, 'd', data.pop('b'))
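# Rename the row labels of the first and second rows to aa and bb
# (a separate name, df_demo, keeps the Titanic DataFrame df intact for the steps that follow)
d={'one':{'a':1,'b':2,'c':3,'d':4},'two':{'a':5,'b':6,'c':7,'d':8},'three':{'a':9,'b':10,'c':11,'d':12}}
df_demo=pd.DataFrame(d)
print(df_demo)

df_demo.rename(index={'a':'aa','b':'bb'},inplace=True)
print(df_demo)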
   one  two  three
a    1    5      9
b    2    6     10
c    3    7     11
d    4    8     12
    one  two  three
aa    1    5      9
bb    2    6     10
c     3    7     11
d     4    8     12
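
The renaming approaches above differ in whether they change the existing object. The following sketch contrasts them on a made-up two-row frame (demo and the other variable names are illustrative, not from the original notebook).

```python
import pandas as pd

demo = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# rename() without inplace returns a new DataFrame and leaves demo untouched
renamed = demo.rename(columns={"a": "c"})
cols_after_rename = list(demo.columns)

# Assigning to .columns changes demo itself
demo.columns = ["x", "y"]
cols_after_assign = list(demo.columns)

# inplace=True also changes demo itself (here on the row labels)
demo.rename(index={0: "first"}, inplace=True)
index_after_inplace = list(demo.index)

print(list(renamed.columns), cols_after_rename, cols_after_assign, index_after_inplace)
```

If the result of rename() is neither assigned to a variable nor called with inplace=True, the renamed frame is simply discarded.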
View basic information about the data
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   是否幸存     891 non-null    int64  
 1   仓位等级     891 non-null    int64  
 2   姓名       891 non-null    object 
 3   性别       891 non-null    object 
 4   年龄       714 non-null    float64
 5   兄弟姐 妹个数  891 non-null    int64  
 6   父母子女个数   891 non-null    int64  
 7   船票信息     891 non-null    object 
 8   票价       891 non-null    float64
 9   客舱       204 non-null    object 
 10  登船港口     889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB
Inspect the first 10 rows and the last 15 rows of the table
df.head(10)
        是否幸存  仓位等级                                               姓名    性别  年龄  兄弟姐 妹个数  父母子女个数          船票信息     票价  客舱  登船港口
乘客ID
1              0         3                            Braund, Mr. Owen Harris    male  22.0              1             0         A/5 21171   7.2500   NaN         S
2              1         1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0              1             0          PC 17599  71.2833   C85         C
3              1         3                             Heikkinen, Miss. Laina  female  26.0              0             0  STON/O2. 3101282   7.9250   NaN         S
4              1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0              1             0            113803  53.1000  C123         S
5              0         3                           Allen, Mr. William Henry    male  35.0              0             0            373450   8.0500   NaN         S
6              0         3                                   Moran, Mr. James    male   NaN              0             0            330877   8.4583   NaN         Q
7              0         1                            McCarthy, Mr. Timothy J    male  54.0              0             0             17463  51.8625   E46         S
8              0         3                     Palsson, Master. Gosta Leonard    male   2.0              3             1            349909  21.0750   NaN         S
9              1         3  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0              0             2            347742  11.1333   NaN         S
10             1         2                Nasser, Mrs. Nicholas (Adele Achem)  female  14.0              1             0            237736  30.0708   NaN         C
df.tail(15)
        是否幸存  仓位等级                                           姓名    性别  年龄  兄弟姐 妹个数  父母子女个数          船票信息     票价  客舱  登船港口
乘客ID
877            0         3                  Gustafsson, Mr. Alfred Ossian    male  20.0              0             0              7534   9.8458   NaN         S
878            0         3                           Petroff, Mr. Nedelio    male  19.0              0             0            349212   7.8958   NaN         S
879            0         3                             Laleff, Mr. Kristo    male   NaN              0             0            349217   7.8958   NaN         S
880            1         1  Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)  female  56.0              0             1             11767  83.1583   C50         C
881            1         2   Shelley, Mrs. William (Imanita Parrish Hall)  female  25.0              0             1            230433  26.0000   NaN         S
882            0         3                             Markun, Mr. Johann    male  33.0              0             0            349257   7.8958   NaN         S
883            0         3                   Dahlberg, Miss. Gerda Ulrika  female  22.0              0             0              7552  10.5167   NaN         S
884            0         2                  Banfield, Mr. Frederick James    male  28.0              0             0  C.A./SOTON 34068  10.5000   NaN         S
885            0         3                         Sutehall, Mr. Henry Jr    male  25.0              0             0   SOTON/OQ 392076   7.0500   NaN         S
886            0         3           Rice, Mrs. William (Margaret Norton)  female  39.0              0             5            382652  29.1250   NaN         Q
887            0         2                          Montvila, Rev. Juozas    male  27.0              0             0            211536  13.0000   NaN         S
888            1         1                   Graham, Miss. Margaret Edith  female  19.0              0             0            112053  30.0000   B42         S
889            0         3       Johnston, Miss. Catherine Helen "Carrie"  female   NaN              1             2        W./C. 6607  23.4500   NaN         S
890            1         1                          Behr, Mr. Karl Howell    male  26.0              0             0            111369  30.0000  C148         C
891            0         3                            Dooley, Mr. Patrick    male  32.0              0             0            370376   7.7500   NaN         Q
Check whether values are missing: isnull() returns True where a value is missing and False everywhere else
df.isnull().head()
        是否幸存  仓位等级   姓名   性别   年龄  兄弟姐 妹个数  父母子女个数  船票信息   票价   客舱  登船港口
乘客ID
1          False     False  False  False  False          False         False     False  False   True     False
2          False     False  False  False  False          False         False     False  False  False     False
3          False     False  False  False  False          False         False     False  False   True     False
4          False     False  False  False  False          False         False     False  False  False     False
5          False     False  False  False  False          False         False     False  False   True     False
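
A Boolean mask like the one above is usually summarised rather than read cell by cell. As a small sketch, summing the mask counts missing values per column and averaging it gives the missing fraction. The frame toy is made up for illustration and only mimics the kind of gaps seen in the 年龄 and 客舱 columns.

```python
import numpy as np
import pandas as pd

# Small made-up frame with the same kind of gaps as the Titanic data
toy = pd.DataFrame({
    "年龄": [22.0, 38.0, np.nan, 35.0],
    "客舱": [np.nan, "C85", np.nan, "C123"],
})

missing_counts = toy.isnull().sum()    # number of missing values per column
missing_share = toy.isnull().mean()    # fraction of missing values per column

print(missing_counts)
print(missing_share)
```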
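From what other angles can a dataset be examined?

We can look at the data itself. For this dataset, for example, we can count how many passengers are in each cabin class and how many are male or female, and compute the average age, the average numbers of siblings/spouses and parents/children, and the average fare.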
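
The statistics listed above can be sketched with value_counts() and mean(). The frame toy below is an assumption made so the snippet is self-contained: it holds three columns of the first five passengers shown earlier, not the full dataset.

```python
import pandas as pd

# Three columns of the first five passengers from the table above
toy = pd.DataFrame({
    "仓位等级": [3, 1, 3, 1, 3],
    "性别": ["male", "female", "female", "female", "male"],
    "年龄": [22.0, 38.0, 26.0, 35.0, 35.0],
})

class_counts = toy["仓位等级"].value_counts()   # passengers per cabin class
sex_counts = toy["性别"].value_counts()         # passengers per sex
mean_age = toy["年龄"].mean()                   # average age; NaN values would be skipped

print(class_counts)
print(sex_counts)
print(mean_age)
```

On the full DataFrame the same calls apply unchanged, and describe() gives count, mean, and quartiles for every numeric column in one step.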

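1. Grouping with groupby
groupby is the most commonly used and effective grouping function in pandas.

1) Grouping by column

Note that an object returned by groupby(), such as group1, is an intermediate grouping object of type GroupBy; no statistics are computed until an aggregation is applied to it.

You can group by a single column name, 'key1', or by several column names, ['key1', 'key2'].

2) Computing statistics per group
Applying statistical functions such as size(), sum(), and count() to the groups group1 and group2 gives, respectively, the number of rows in each group, the per-group sum of each column, and the per-group count of non-null values in each column.

For details, see this CSDN blog post: https://blog.csdn.net/elecjack/article/details/50760736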
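
The two grouping styles and the three statistics described above can be sketched as follows. The frame toy and its columns key1, key2, and value are made up for illustration; group1 and group2 follow the naming used in the description.

```python
import pandas as pd

toy = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["x", "y", "x", "y", "x"],
    "value": [1, 2, 3, 4, 5],
})

group1 = toy.groupby("key1")              # group by a single column
group2 = toy.groupby(["key1", "key2"])    # group by several columns

sizes = group1.size()                 # rows per group
sums = group1["value"].sum()          # per-group sum of one column
counts = group2["value"].count()      # per-group count of non-null values

print(sizes)
print(sums)
print(counts)
```

Grouping by several columns produces a result indexed by a MultiIndex, so individual groups are looked up with a tuple such as ("a", "x").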

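# First group by cabin class (仓位等级) and look at the number of survivors
group_level = df.groupby('仓位等级')
group_level.sum(numeric_only=True)  # numeric_only=True restricts the sum to numeric columns, matching the output shown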
          是否幸存     年龄  兄弟姐 妹个数  父母子女个数        票价
仓位等级
1              136  7111.42             90            77  18177.4125
2               87  5168.83             74            70   3801.8417
3              119  8924.92            302           193   6714.6951
# Group by sex (性别) and look at the number of survivors
group_gender = df.groupby('性别').sum(numeric_only=True)
group_gender
        是否幸存  仓位等级      年龄  兄弟姐 妹个数  父母子女个数        票价
性别
female       233       678   7286.00            218           204  13966.6628
male         109      1379  13919.17            248           136  14727.2865
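
Because 是否幸存 is coded as 0/1, its per-group sum counts survivors, while its per-group mean gives the survival rate, which is easier to compare across groups of different sizes. A minimal sketch on the first five passengers shown earlier (the frame toy and the variable names are illustrative, not from the original notebook):

```python
import pandas as pd

# Survival flag and sex of the first five passengers from the table above
toy = pd.DataFrame({
    "是否幸存": [0, 1, 1, 1, 0],
    "性别": ["male", "female", "female", "female", "male"],
})

survivors = toy.groupby("性别")["是否幸存"].sum()        # number of survivors per sex
survival_rate = toy.groupby("性别")["是否幸存"].mean()   # share of survivors per sex

print(survivors)
print(survival_rate)
```

Selecting the single column before aggregating also avoids summing columns such as 仓位等级, whose totals have no natural meaning.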
# Compute the mean age (missing values are skipped)
print(df["年龄"].mean())
29.69911764705882
# Compute the mean number of parents/children aboard
print(df["父母子女个数"].mean())
0.38159371492704824
Save the loaded and modified data as a new file, train_chinese.csv, in the working directory
df.to_csv('train_chinese.csv')
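
to_csv writes the index as the first column, so the saved file can be read back with the same index by naming that column in index_col. The sketch below checks this round trip under stated assumptions: a made-up two-row frame replaces the real data, and an in-memory io.StringIO buffer replaces train_chinese.csv.

```python
import io
import pandas as pd

# Made-up two-row frame with the same index name as the real data
toy = pd.DataFrame(
    {"是否幸存": [0, 1], "姓名": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"]},
    index=pd.Index([1, 3], name="乘客ID"),
)

buffer = io.StringIO()     # stands in for the file on disk
toy.to_csv(buffer)         # the index is written as the first column
buffer.seek(0)

restored = pd.read_csv(buffer, index_col="乘客ID")
print(restored)
```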
