Using the pandas library in Python
Data analysis with pandas, and a first try at Jupyter Notebook
import pandas as pd
from pandas import DataFrame, Series
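pandas has two basic data structures, Series and DataFrame. A Series is a sequence with an index. A DataFrame is a two-dimensional data structure whose row labels (index) and column labels (columns) can both be defined, somewhat like an Excel spreadsheet. An Excel sheet can also be saved in CSV format and then read into a pandas DataFrame.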
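As a minimal sketch of the CSV round trip described above, the snippet below builds a small DataFrame, writes it out as CSV text, and reads it back. The variable names and the use of io.StringIO as an in-memory stand-in for a file on disk are assumptions made for illustration.

```python
import io
import pandas as pd

# A Series is a one-dimensional labelled sequence.
elements = pd.Series(["jin", "mu"], index=["autumn", "spring"])

# A DataFrame is a two-dimensional table with row and column labels.
scores = pd.DataFrame(
    {"jin": [0.1, 0.8], "mu": [0.5, 0.7]},
    index=["person1", "person2"],
)

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = scores.to_csv()

# Read the text back; index_col=0 turns the first CSV column into the row labels.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(csv_text)
print(restored)
```

With a real file, the same pattern would use a path in place of the StringIO object.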
The Series data structure
First, look at the Series data structure:
s = Series(['jin','mu','shui','huo','tu'])
s
0 jin
1 mu
2 shui
3 huo
4 tu
dtype: object
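A Series has two attributes, values and index, which return the values of the sequence and their labels respectively. The index can also be user-defined.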
s.values
array(['jin', 'mu', 'shui', 'huo', 'tu'], dtype=object)
s.index
RangeIndex(start=0, stop=5, step=1)
s = Series(['jin','mu','shui','huo','tu'],index = ['autumn','spring','winter','summer','long summer'])
s
autumn jin
spring mu
winter shui
summer huo
long summer tu
dtype: object
s = Series(s,index = ['autumn','spring','winter','summer','long summer','others'])
s
autumn jin
spring mu
winter shui
summer huo
long summer tu
others NaN
dtype: object
As shown above, adding an index label that has no corresponding value creates an entry whose value is NaN.
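Passing an existing Series together with a new index, as done above, amounts to reindexing. A minimal sketch of the same effect using the explicit reindex method follows; the shortened two-element Series and the "unknown" fill value are chosen only for illustration.

```python
import pandas as pd

s = pd.Series(["jin", "mu"], index=["autumn", "spring"])

# reindex returns a new Series; labels missing from the original get NaN.
extended = s.reindex(["autumn", "spring", "others"])
print(extended)

# fill_value replaces the default NaN for labels that did not exist before.
filled = s.reindex(["autumn", "spring", "others"], fill_value="unknown")
print(filled)
```

The original Series is left unchanged in both calls.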
s.isnull()
autumn False
spring False
winter False
summer False
long summer False
others True
dtype: bool
pd.isnull(s)
autumn False
spring False
winter False
summer False
long summer False
others True
dtype: bool
pd.notnull(s)
autumn True
spring True
winter True
summer True
long summer True
others False
dtype: bool
s['winter']
'shui'
s = Series({'winter':1,'summer':2,'spring':3})
s
spring 3
summer 2
winter 1
dtype: int64
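A Python dict can also be used to create a Series. Series can then be added together, and individual entries can be assigned by label. (The output above lists the keys in alphabetical order, which is how older pandas versions behaved; recent versions keep the dict's insertion order.)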
b = Series([67,78,89],index=['winter','summer','spring'])
b
winter 67
summer 78
spring 89
dtype: int64
s + b
spring 92
summer 80
winter 68
dtype: int64
s['spring'] = 89
s
spring 89
summer 2
winter 1
dtype: int64
s[s>1]
spring 89
summer 2
dtype: int64
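To summarize: a Series can be created from a list or a dict and holds an index together with the corresponding values. In the constructor the values come first and the index second; the index can be omitted or user-defined. Elements of a Series can be selected by a condition, and different Series can be added to or subtracted from each other.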
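One detail worth seeing in code is that arithmetic between two Series matches entries by label, not by position. The sketch below uses two small Series with only partly overlapping labels (values chosen for illustration) to show what happens to labels that exist on one side only.

```python
import pandas as pd

a = pd.Series({"winter": 1, "summer": 2, "spring": 3})
b = pd.Series([10, 20], index=["winter", "autumn"])

# Addition aligns on labels; a label missing from either side yields NaN.
total = a + b
print(total)

# add() with fill_value treats a missing entry as 0 instead of NaN.
total_filled = a.add(b, fill_value=0)
print(total_filled)

# Boolean indexing keeps only the entries where the condition holds.
print(a[a > 1])
```

In the example from the text, s and b share exactly the same three labels, so no NaN appears in s + b.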
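The DataFrame data structure
A DataFrame is a two-dimensional data structure, stored as a matrix. Rows are labelled by the index, just as in a Series, and the column labels are called columns. One way to create such an object is from a dict, as below: each dict key becomes a column, much like a column heading in an Excel sheet, and each dict value is a list holding the values of that column.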
df = DataFrame({'jin':[0.1,0.8,0.4],'mu':[0.5,0.7,0.6]})
df
   jin   mu
0  0.1  0.5
1  0.8  0.7
2  0.4  0.6
df = DataFrame({'jin':[0.1,0.8,0.4],'mu':[0.5,0.7,0.6]},index=['person1','person2','person3'])
df
         jin   mu
person1  0.1  0.5
person2  0.8  0.7
person3  0.4  0.6
df = DataFrame(df,columns=['jin','mu','shui','huo','tu'],index=['person1','person2','person3','person4'])
df
         jin   mu  shui  huo   tu
person1  0.1  0.5   NaN  NaN  NaN
person2  0.8  0.7   NaN  NaN  NaN
person3  0.4  0.6   NaN  NaN  NaN
person4  NaN  NaN   NaN  NaN  NaN
Extract a single column, that is, one attribute:
df['jin']
person1 0.1
person2 0.8
person3 0.4
person4 NaN
Name: jin, dtype: float64
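Besides selecting one column with square brackets, rows and single cells can be selected by label. The following is a small sketch of the common selection forms on the three-person frame from earlier; it is meant as an overview, not an exhaustive list.

```python
import pandas as pd

df = pd.DataFrame(
    {"jin": [0.1, 0.8, 0.4], "mu": [0.5, 0.7, 0.6]},
    index=["person1", "person2", "person3"],
)

col = df["jin"]                 # one column, returned as a Series
sub = df[["jin", "mu"]]         # a list of columns, returned as a DataFrame
row = df.loc["person2"]         # one row by label, returned as a Series
cell = df.loc["person3", "mu"]  # a single value by row and column label

print(type(col).__name__, type(sub).__name__)
print(row)
print(cell)
```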
from numpy import arange
somecolumn = arange(0.1,0.8,0.3)
scln = Series(somecolumn,index = ['person1','person3','person4'])
df['shui'] = scln
df
         jin   mu  shui  huo   tu
person1  0.1  0.5   0.1  NaN  NaN
person2  0.8  0.7   NaN  NaN  NaN
person3  0.4  0.6   0.4  NaN  NaN
person4  NaN  NaN   0.7  NaN  NaN
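A DataFrame can be used to work with CSV files and extract the information they contain:
CSV files: comma-separated values (CSV), sometimes also called character-separated values because the delimiter need not be a comma, store tabular data (numbers and text) as plain text. Plain text means the file is a sequence of characters and contains no data that has to be interpreted as binary numbers. A CSV file consists of any number of records separated by some kind of line break; each record consists of fields separated by another character or string, most commonly a comma or a tab. Usually all records have exactly the same sequence of fields. There is no single universal standard for the CSV format, but RFC 4180 gives a basic description. The character encoding is likewise not specified, although 7-bit ASCII is the most basic common encoding. CSV is a general-purpose, relatively simple file format that is widely used in consumer, business, and scientific applications. Its most common use is moving tabular data between programs that natively work with incompatible (often proprietary and/or unspecified) formats, since a large number of programs support some CSV variant at least as an optional input/output format.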
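Because the delimiter need not be a comma, pandas lets the caller name it. The sketch below parses the same two records once with commas and once with tabs; the sample records are hypothetical, and io.StringIO stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical data: the same two records, comma-separated and tab-separated.
comma_text = "name,age\nAnn,30\nBob,25\n"
tab_text = "name\tage\nAnn\t30\nBob\t25\n"

# The default separator of read_csv is a comma.
from_commas = pd.read_csv(io.StringIO(comma_text))

# For any other delimiter, pass it through the sep parameter.
from_tabs = pd.read_csv(io.StringIO(tab_text), sep="\t")

print(from_commas)
print(from_tabs)
```

Both calls should yield the same two-row table with columns name and age.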
trainpath = './titanic/train.csv'
testpath = './titanic/test.csv'
trainset = pd.read_csv(trainpath)
trainset.head()  # look at the first few rows
   PassengerId  Survived  Pclass  Name                                             Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                          male    22.0  1      0      A/5 21171         7.2500   NaN    ...
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th…  female  38.0  1      0      PC 17599          71.2833  C85    ...
2  3            1         3       Heikkinen, Miss. Laina                           female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    ...
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)     female  35.0  1      0      113803            53.1000  C123   ...
4  5            0         3       Allen, Mr. William Henry                         male    35.0  0      0      373450            8.0500   NaN    ...
type(trainset)
pandas.core.frame.DataFrame
trainset.info()
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
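The non-null counts reported by info() show where data is missing: Age has 714 non-null entries out of 891, so 891 - 714 = 177 ages are absent. The same counts can be computed directly, sketched here on a hypothetical four-row frame so that the snippet runs without train.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; it only mimics the Age and Cabin columns.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
})

missing = toy.isnull().sum()    # number of missing values per column
non_null = toy.notnull().sum()  # the "non-null" count that info() reports

print(missing)
print(non_null)
```

Applied to the real data, the same expression would be trainset.isnull().sum().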
trainset.shape
(891, 12)
trainset.describe()
       PassengerId  Survived    Pclass      Age         SibSp       Parch       Fare
count  891.000000   891.000000  891.000000  714.000000  891.000000  891.000000  ...
mean   446.000000   0.383838    2.308642    29.699118   0.523008    0.381594    ...
std    257.353842   0.486592    0.836071    14.526497   1.102743    0.806057    ...
min    1.000000     0.000000    1.000000    0.420000    0.000000    0.000000    ...
25%    223.500000   0.000000    2.000000    20.125000   0.000000    0.000000    ...
50%    446.000000   0.000000    3.000000    28.000000   0.000000    0.000000    ...
75%    668.500000   1.000000    3.000000    38.000000   1.000000    0.000000    ...
max    891.000000   1.000000    3.000000    80.000000   8.000000    6.000000    ...
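With the basic shape of the data understood, the next step is to use pandas together with libraries such as seaborn and matplotlib for EDA (exploratory data analysis), to understand the data in more depth and prepare for later processing.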
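As a hedged sketch of what a first EDA step might look like in pandas alone, the snippet below groups and counts on three columns. The rows are the Sex, Pclass and Survived values of the five passengers shown by head() earlier; five rows say nothing about the full dataset and serve only to make the code runnable.

```python
import pandas as pd

# The five head() rows, reduced to three columns; far too few to draw conclusions.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Pclass": [3, 1, 3, 1, 3],
    "Survived": [0, 1, 1, 1, 0],
})

# The mean of a 0/1 column is the fraction of ones: the survival rate per group.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()

# value_counts tallies how often each distinct value occurs.
counts_by_class = toy["Pclass"].value_counts()

print(rate_by_sex)
print(counts_by_class)
```

On the full data the same two lines would run against trainset, and the resulting Series could then be passed to a plotting library.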