Data Analysis (1)

Data Loading, First Look, and Exploratory Data Analysis

Download the data set, create a new Jupyter notebook, and put both in the same folder.

1. Import numpy and pandas

import pandas as pd
import numpy as np

2. Load the data

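# Relative path: resolved against the current working directory (here, the folder that holds the notebook). It is flexible: as long as the layout between the notebook and the data file stays the same, renaming the folders above them does not break the lookup.
# Absolute path: the full location on disk, starting from the drive or root. It must be spelled out exactly, and it usually stops working once you move to another computer.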
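As a rough illustration of the difference, the sketch below uses the standard-library pathlib module to turn a relative path into an absolute one. The file name train.csv is the one used in this post; the file does not have to exist for the snippet to run.

```python
from pathlib import Path

# Relative path: only meaningful together with a starting folder.
relative = Path("train.csv")

# Joining it to the current working directory gives the absolute path
# that would actually be opened.
resolved = Path.cwd() / relative

print(relative.is_absolute())  # False
print(resolved.is_absolute())  # True
print(resolved)                # differs from machine to machine
```

The last line is the reason an absolute path written into a notebook tends to break on someone else's computer.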

# Relative path
df = pd.read_csv('train.csv')
df.head(3)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S

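# df is a DataFrame: a two-dimensional table with an ordered set of columns, where each column can hold a different type (numbers, strings, booleans, and so on).
# pd.read_csv() treats the first row of the file as the column names by default. df.head() then shows the first five data rows; df.head(3) shows three.
# df = pd.read_csv(): reads a CSV file.
# df_txt = pd.read_table('.txt'): reads a delimited text file (tab-separated by default).
# df_excel = pd.read_excel('data/table.xlsx'): reads an Excel file.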
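The readers listed above can be tried without any files on disk. The sketch below feeds small in-memory strings (made-up data, not the Titanic set) to pd.read_csv and pd.read_table through io.StringIO, assuming only that pandas is installed.

```python
import io

import pandas as pd

# Made-up sample data: the same table, once comma-separated, once tab-separated.
csv_text = "name,age\nAnn,22\nBob,38\n"
tsv_text = "name\tage\nAnn\t22\nBob\t38\n"

df_csv = pd.read_csv(io.StringIO(csv_text))    # default separator: comma
df_tsv = pd.read_table(io.StringIO(tsv_text))  # default separator: tab

print(list(df_csv.columns))   # the first row became the column names
print(df_csv.head(1))         # only the first data row
print(df_csv.equals(df_tsv))  # both readers produced the same table
```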

df = pd.read_excel('test11.xlsx')
df.head(3)
# The variable name is up to you: df, df_excel, or anything else works.
周一 周二 周一.1 周二.1 周一.2 周二.2 周一.3
0 1 2 3 4 5 6 7
# Absolute path
df = pd.read_csv(r'C:\Users\小静\Desktop\train.csv')
df.head(3)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S

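# r marks a raw string literal: backslashes are kept as-is instead of starting an escape sequence. In a Windows path like the one above, the r prefix keeps the \t in \train.csv from being read as a tab.
# Escape characters: escape sequences are part of the grammar of many programming languages, data formats, and protocols. Any ASCII character can be written as "\" followed by digits (usually octal). C also defines a backslash followed by a letter for common non-printable ASCII characters, such as \0, \t, and \n. These are called escape characters because the character after the backslash no longer has its literal meaning.
# Common escape sequences:
\ (at the end of a line): line continuation
\\: backslash
\': single quote
\": double quote
\a: bell
\b: backspace
\e: escape (ESC) in some languages and shells; Python string literals do not recognize it, use \x1b instead
\000: null
\n: line feed (new line)
\v: vertical tab
\t: horizontal tab
\r: carriage return
\f: form feed
\ooo: the character with octal value ooo, for example \012 is a line feed
\xhh: the character with hexadecimal value hh, for example \x0a is a line feed
\other: unrecognized sequences are left in the string unchanged, backslash included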
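To see what the r prefix changes in practice, here is a minimal sketch comparing a normal string and a raw string. The path C:\temp\new is a made-up example chosen because it contains both \t and \n.

```python
# Hypothetical path, chosen because it contains \t and \n.
plain = "C:\temp\new"   # \t becomes a tab, \n becomes a line feed
raw = r"C:\temp\new"    # backslashes are kept literally

print(len(plain))  # 9
print(len(raw))    # 11

# Octal and hexadecimal escapes for the same character (line feed).
print("\012" == "\x0a" == "\n")  # True
```

The raw version is two characters longer because each backslash survives as its own character.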

3. Read in chunks

# info() prints a concise summary of a DataFrame: the index dtype, each column's dtype, the non-null counts, and the memory usage.

df = pd.read_csv('train.csv')
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
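The non-null counts in the summary above can also be obtained as values you can compute with. The sketch below uses a tiny made-up frame (df_demo is a hypothetical name, and the values are not taken from train.csv) and DataFrame.count(), which counts non-missing entries per column.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few missing values.
df_demo = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 26.0],
        "Cabin": [None, "C85", None],
    }
)

non_null = df_demo.count()               # non-missing entries per column
missing = len(df_demo) - non_null        # missing entries per column

print(non_null["Age"], missing["Age"])      # 2 1
print(non_null["Cabin"], missing["Cabin"])  # 1 2
```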
chunker = pd.read_csv('train.csv', chunksize=500)
for chunk in chunker:
    display(chunk)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
495 496 0 3 Yousseff, Mr. Gerious male NaN 0 0 2627 14.4583 NaN C
496 497 1 1 Eustis, Miss. Elizabeth Mussey female 54.0 1 0 36947 78.2667 D20 C
497 498 0 3 Shellard, Mr. Frederick William male NaN 0 0 C.A. 6212 15.1000 NaN S
498 499 0 1 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0 1 2 113781 151.5500 C22 C26 S
499 500 0 3 Svensson, Mr. Olof male 24.0 0 0 350035 7.7958 NaN S

500 rows × 12 columns

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
500 501 0 3 Calic, Mr. Petar male 17.0 0 0 315086 8.6625 NaN S
501 502 0 3 Canavan, Miss. Mary female 21.0 0 0 364846 7.7500 NaN Q
502 503 0 3 O'Sullivan, Miss. Bridget Mary female NaN 0 0 330909 7.6292 NaN Q
503 504 0 3 Laitinen, Miss. Kristina Sofia female 37.0 0 0 4135 9.5875 NaN S
504 505 1 1 Maioni, Miss. Roberta female 16.0 0 0 110152 86.5000 B79 S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

391 rows × 12 columns

chunker = pd.read_csv('train.csv', chunksize=500)
for chunk in chunker:
    print(type(chunk))
    display(chunk)
<class 'pandas.core.frame.DataFrame'>
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
495 496 0 3 Yousseff, Mr. Gerious male NaN 0 0 2627 14.4583 NaN C
496 497 1 1 Eustis, Miss. Elizabeth Mussey female 54.0 1 0 36947 78.2667 D20 C
497 498 0 3 Shellard, Mr. Frederick William male NaN 0 0 C.A. 6212 15.1000 NaN S
498 499 0 1 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0 1 2 113781 151.5500 C22 C26 S
499 500 0 3 Svensson, Mr. Olof male 24.0 0 0 350035 7.7958 NaN S

500 rows × 12 columns

<class 'pandas.core.frame.DataFrame'>
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
500 501 0 3 Calic, Mr. Petar male 17.0 0 0 315086 8.6625 NaN S
501 502 0 3 Canavan, Miss. Mary female 21.0 0 0 364846 7.7500 NaN Q
502 503 0 3 O'Sullivan, Miss. Bridget Mary female NaN 0 0 330909 7.6292 NaN Q
503 504 0 3 Laitinen, Miss. Kristina Sofia female 37.0 0 0 4135 9.5875 NaN S
504 505 1 1 Maioni, Miss. Roberta female 16.0 0 0 110152 86.5000 B79 S
... ... ... ... ... ... ... ... ... ... ... ... ...
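Chunked reading is mostly useful when each chunk is processed and then discarded, so that the whole file never sits in memory at once. The sketch below is one possible pattern, using a small in-memory stand-in for train.csv (the values are made up) and a chunk size of 4 so the result is easy to check by hand.

```python
import io

import pandas as pd

# Made-up stand-in for train.csv: 10 rows, Survived alternates 1, 0, 1, 0, ...
csv_text = "PassengerId,Survived\n" + "".join(
    f"{i},{i % 2}\n" for i in range(1, 11)
)

sizes = []
survived_total = 0

# With chunksize set, read_csv returns an iterator of DataFrames.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    sizes.append(len(chunk))
    survived_total += int(chunk["Survived"].sum())

print(sizes)           # [4, 4, 2]
print(survived_total)  # 5
```

As in the 500 + 391 split shown earlier, the last chunk simply holds whatever rows remain.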