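Data loading, first look, and exploratory data analysis
Download the data, create a new Jupyter notebook, and put both in the same folder.
1. Import numpy and pandas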
import pandas as pd
import numpy as np
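2. Load the data
#Relative path: the location is given relative to the current working directory (for a notebook, normally the folder the notebook is in). It is the more flexible option: as long as the data file keeps the same position relative to the notebook, the enclosing folders can be renamed or moved and the path still resolves.
#Absolute path: the full location on disk, starting from the drive or filesystem root. Every level must be spelled out exactly, so the path usually stops working as soon as you switch to another computer.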
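The difference between the two kinds of path can be checked directly. The following is a minimal sketch using the standard library `pathlib` module; the file name `train.csv` is taken from this tutorial, and the file does not need to exist for the check to run.

```python
import os
from pathlib import Path

# A relative path is interpreted against the current working directory,
# which for a notebook is normally the folder the notebook lives in.
relative = Path('train.csv')

# Joining it onto the working directory spells the same location out in full.
absolute = Path.cwd() / relative

print(os.getcwd())             # the current working directory
print(relative.is_absolute())  # False
print(absolute.is_absolute())  # True
```

If `pd.read_csv('train.csv')` cannot find the file, printing `os.getcwd()` is usually the quickest way to see which folder the relative path is being resolved against.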
#Relative path
df = pd.read_csv('train.csv')
df.head(3)
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
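#df is a DataFrame: a two-dimensional table made of an ordered collection of columns, where each column can hold a different value type (numeric, string, boolean, etc.).
#By default pd.read_csv() treats the first row of the file as the column names. df.head() then returns the first five data rows; df.head(3) returns the first three.
#df = pd.read_csv(): reads a CSV file.
#df_txt = pd.read_table('.txt'): reads a delimited text file (tab-separated by default).
#df_excel = pd.read_excel('data/table.xlsx'): reads an Excel file.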
df = pd.read_excel('test11.xlsx')
df.head(3)
#The variable that holds the result can be named freely: df, df_excel, or anything else.
 | 周一 | 周二 | 周一.1 | 周二.1 | 周一.2 | 周二.2 | 周一.3 |
---|---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
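The readers listed above can be tried without any files on disk. The sketch below feeds `pd.read_csv` and `pd.read_table` from in-memory text through `io.StringIO`; the two small data sets and the names `csv_text`, `tsv_text`, `df_csv`, and `df_txt` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical in-memory data standing in for files on disk.
csv_text = "name,age\nAnn,31\nBob,27\n"
tsv_text = "name\tage\nAnn\t31\nBob\t27\n"

# read_csv splits on commas by default.
df_csv = pd.read_csv(io.StringIO(csv_text))

# read_table splits on tabs by default; other delimiters go through sep=.
df_txt = pd.read_table(io.StringIO(tsv_text))

# Both calls take the first line as column names, so the frames match.
print(df_csv.equals(df_txt))  # True
print(df_csv.head(1))         # only the first data row
```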
#Absolute path
df = pd.read_csv(r'C:\Users\小静\Desktop\train.csv')
df.head(3)
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
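#r: marks a raw string literal, in which backslashes are not treated as escape characters. Windows paths such as the one above need it: without r, Python would try to interpret sequences such as \U and \t inside the path instead of reading them as folder separators.
#Escape characters: escape sequences are part of the formal grammar of many programming languages, data formats, and communication protocols. Any ASCII character can be written as "\" followed by its numeric code (usually in octal). C also defines a set of letters prefixed with "\" for common non-printable ASCII characters, such as \0, \t, and \n. These are called escape characters because the character after the backslash no longer has its literal ASCII meaning.
#Escape sequences include:
\ (at the end of a line): line continuation
\\: backslash
\': single quote
\": double quote
\a: bell
\b: backspace
\e: escape (ESC) in some languages and shells; Python string literals do not recognize it (use \x1b instead)
\000: null character
\n: newline (line feed)
\v: vertical tab
\t: horizontal tab
\r: carriage return
\f: form feed
\ooo: the character with octal value ooo, for example \012 is a newline
\xhh: the character with hexadecimal value hh, for example \x0a is a newline
\other: any other character after a backslash is not an escape sequence, and the backslash stays in the string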
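The effect of the r prefix and of the escape sequences listed above can be seen by comparing string lengths. This is a small sketch; the folder name in `path` is hypothetical and only there to show a backslash followed by t.

```python
# In a normal string literal, backslash-t is a single tab character.
normal = 'a\tb'
# With the r prefix, the backslash and the t stay as two separate characters.
raw = r'a\tb'

print(len(normal))  # 3
print(len(raw))     # 4

# The same newline character written with three different escapes.
print('\n' == '\012' == '\x0a')  # True

# A Windows-style path (hypothetical folders) survives intact as a raw string.
path = r'C:\data\train.csv'
print(path)  # C:\data\train.csv
```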
3. Read in chunks
#info() prints a concise summary of a DataFrame: the index dtype, each column's dtype, the number of non-null values per column, and memory usage.
df = pd.read_csv('train.csv')
df.head()
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
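In the summary above, Age has 714 non-null values out of 891 entries (177 missing) and Cabin has only 204 (687 missing). The individual pieces of `info()` are also available as values you can work with. The sketch below uses a made-up three-row frame named `small`, not the Titanic data, so the numbers are easy to check by eye.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with missing values in two columns.
small = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0],
    'Cabin': [np.nan, 'C85', np.nan],
    'Pclass': [3, 1, 3],
})

small.info()                # full summary, printed rather than returned

print(small.count())        # non-null count per column
print(small.isna().sum())   # missing count per column
print(small.dtypes)         # dtype per column
```

Because `count()` and `isna().sum()` return Series, their results can be sorted, filtered, or plotted, which the printed output of `info()` does not allow.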
#chunksize=500 makes read_csv return an iterator that yields DataFrames of at most 500 rows each
chunker = pd.read_csv('train.csv', chunksize=500)
for chunk in chunker:
    display(chunk)
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
495 | 496 | 0 | 3 | Yousseff, Mr. Gerious | male | NaN | 0 | 0 | 2627 | 14.4583 | NaN | C |
496 | 497 | 1 | 1 | Eustis, Miss. Elizabeth Mussey | female | 54.0 | 1 | 0 | 36947 | 78.2667 | D20 | C |
497 | 498 | 0 | 3 | Shellard, Mr. Frederick William | male | NaN | 0 | 0 | C.A. 6212 | 15.1000 | NaN | S |
498 | 499 | 0 | 1 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S |
499 | 500 | 0 | 3 | Svensson, Mr. Olof | male | 24.0 | 0 | 0 | 350035 | 7.7958 | NaN | S |
500 rows × 12 columns
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
500 | 501 | 0 | 3 | Calic, Mr. Petar | male | 17.0 | 0 | 0 | 315086 | 8.6625 | NaN | S |
501 | 502 | 0 | 3 | Canavan, Miss. Mary | female | 21.0 | 0 | 0 | 364846 | 7.7500 | NaN | Q |
502 | 503 | 0 | 3 | O'Sullivan, Miss. Bridget Mary | female | NaN | 0 | 0 | 330909 | 7.6292 | NaN | Q |
503 | 504 | 0 | 3 | Laitinen, Miss. Kristina Sofia | female | 37.0 | 0 | 0 | 4135 | 9.5875 | NaN | S |
504 | 505 | 1 | 1 | Maioni, Miss. Roberta | female | 16.0 | 0 | 0 | 110152 | 86.5000 | B79 | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
391 rows × 12 columns
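The two blocks above (500 rows, then the remaining 391) show that the 891 rows arrive piece by piece. Chunked reading is mostly useful when each piece is reduced to a small result before the next one is loaded. The following sketch assumes a made-up five-row CSV held in memory and a chunk size of 2; `csv_text`, `total_rows`, and `survived` are illustrative names.

```python
import io
import pandas as pd

# Hypothetical 5-row CSV held in memory so the example needs no file.
csv_text = "PassengerId,Survived\n1,0\n2,1\n3,1\n4,1\n5,0\n"

# chunksize makes read_csv return an iterator instead of a DataFrame.
chunker = pd.read_csv(io.StringIO(csv_text), chunksize=2)

total_rows = 0
survived = 0
for chunk in chunker:
    # Each chunk is an ordinary DataFrame holding at most 2 rows.
    total_rows += len(chunk)
    survived += int(chunk['Survived'].sum())

print(total_rows, survived)  # 5 3
```

Only the running totals stay in memory, so the same pattern works for files far larger than `train.csv`. The iterator is exhausted after one pass, which is why the tutorial calls `pd.read_csv` again before each loop.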
#print the type of each chunk: every chunk is itself a DataFrame
chunker = pd.read_csv('train.csv', chunksize=500)
for chunk in chunker:
    print(type(chunk))
    display(chunk)
<class 'pandas.core.frame.DataFrame'>
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
495 | 496 | 0 | 3 | Yousseff, Mr. Gerious | male | NaN | 0 | 0 | 2627 | 14.4583 | NaN | C |
496 | 497 | 1 | 1 | Eustis, Miss. Elizabeth Mussey | female | 54.0 | 1 | 0 | 36947 | 78.2667 | D20 | C |
497 | 498 | 0 | 3 | Shellard, Mr. Frederick William | male | NaN | 0 | 0 | C.A. 6212 | 15.1000 | NaN | S |
498 | 499 | 0 | 1 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S |
499 | 500 | 0 | 3 | Svensson, Mr. Olof | male | 24.0 | 0 | 0 | 350035 | 7.7958 | NaN | S |
500 rows × 12 columns
<class 'pandas.core.frame.DataFrame'>
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
500 | 501 | 0 | 3 | Calic, Mr. Petar | male | 17.0 | 0 | 0 | 315086 | 8.6625 | NaN | S |
501 | 502 | 0 | 3 | Canavan, Miss. Mary | female | 21.0 | 0 | 0 | 364846 | 7.7500 | NaN | Q |
502 | 503 | 0 | 3 | O'Sullivan, Miss. Bridget Mary | female | NaN | 0 | 0 | 330909 | 7.6292 | NaN | Q |
503 | 504 | 0 | 3 | Laitinen, Miss. Kristina Sofia | female | 37.0 | 0 | 0 | 4135 | 9.5875 | NaN | S |
504 | 505 | 1 | 1 | Maioni, Miss. Roberta | female | 16.0 | 0 | 0 | 110152 | 86.5000 | B79 | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |