********************************
一、信息表信息查看
********************************
查看前2行
df.head(2)id date city category age price
010012013-01-02 Beijing 100-A 231200.9110022013-01-03 SH 100-B 44 NaN
===============================
查看后2行
df.tail(2)id date city category age price
410052013-01-06 shanghai 210-A 34 NaN
510062013-01-07 BEIJING 130-F 454432.0===============================
维度查看
(6,6)===============================
数据表基本信息(维度、列名称、数据格式、所占空间等)
<class'pandas.core.frame.DataFrame'>
RangeIndex:6 entries,0 to 5
Data columns (total 6 columns):id6 non-null int64
date 6 non-null datetime64[ns]
city 6 non-null object
category 6 non-null object
age 6 non-null int64
price 4 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(2),object(2)
memory usage:276.0+bytesNone===============================
每一列数据的格式
id int64
date datetime64[ns]
city object
category object
age int64
price float64
dtype:object===============================
category列格式
object===============================
空值
id date city category age price
0FalseFalseFalseFalseFalseFalse1FalseFalseFalseFalseFalseTrue2FalseFalseFalseFalseFalseFalse3FalseFalseFalseFalseFalseFalse4FalseFalseFalseFalseFalseTrue5FalseFalseFalseFalseFalseFalse===============================
查看price列空值
0False1True2False3False4True5False
Name: price, dtype:bool===============================
查看age列的唯一值
[234454323445]===============================
查看数据表的值
[[1001L Timestamp('2013-01-02 00:00:00')'Beijing ''100-A' 23L 1200.9][1002L Timestamp('2013-01-03 00:00:00')'SH''100-B' 44L nan][1003L Timestamp('2013-01-04 00:00:00')' guangzhou ''110-A' 54L 2133.0][1004L Timestamp('2013-01-05 00:00:00')'Shenzhen''110-C' 32L 5433.2][1005L Timestamp('2013-01-06 00:00:00')'shanghai''210-A' 34L nan][1006L Timestamp('2013-01-07 00:00:00')'BEIJING ''130-F' 45L 4432.0]]===============================
查看列名称
Index([u'id', u'date', u'city', u'category', u'age', u'price'], dtype='object')===============================********************************
二、数据表清洗
********************************
使用列prince的均值对NA进行填充
id date city category age price
010012013-01-02 Beijing 100-A 231200.9110022013-01-03 SH 100-B 44 NaN
210032013-01-04 guangzhou 110-A 542133.0310042013-01-05 Shenzhen 110-C 325433.2410052013-01-06 shanghai 210-A 34 NaN
510062013-01-07 BEIJING 130-F 454432.0id date city category age price
010012013-01-02 Beijing 100-A 231200.9110022013-01-03 SH 100-B 44100.0210032013-01-04 guangzhou 110-A 542133.0310042013-01-05 Shenzhen 110-C 325433.2410052013-01-06 shanghai 210-A 34100.0510062013-01-07 BEIJING 130-F 454432.001200.90013299.77522133.00035433.20043299.77554432.000
Name: price, dtype: float64
===============================
清除city字段的字符前后空格,字符的大小写转换
id date city category age price
010012013-01-02 BEIJING 100-A 231200.9110022013-01-03 SH 100-B 44 NaN
210032013-01-04 GUANGZHOU 110-A 542133.0310042013-01-05 SHENZHEN 110-C 325433.2410052013-01-06 SHANGHAI 210-A 34 NaN
510062013-01-07 BEIJING 130-F 454432.0===============================
更改数据格式
0120011002213335433410054432
Name: price, dtype: int32
===============================
更改列名称
id date city category-size age price
010012013-01-02 BEIJING 100-A 231200.9110022013-01-03 SH 100-B 44 NaN
210032013-01-04 GUANGZHOU 110-A 542133.0310042013-01-05 SHENZHEN 110-C 325433.2410052013-01-06 SHANGHAI 210-A 34 NaN
510062013-01-07 BEIJING 130-F 454432.0===============================
删除后出现的重复值
0 BEIJING
1 SH
2 GUANGZHOU
3 SHENZHEN
4 SHANGHAI
Name: city, dtype:object===============================
数据替换
id date city category age price
010012013-01-02 BEIJING 100-A 231200.9110022013-01-03 shanghai 100-B 44 NaN
210032013-01-04 GUANGZHOU 110-A 542133.0310042013-01-05 SHENZHEN 110-C 325433.2410052013-01-06 SHANGHAI 210-A 34 NaN
510062013-01-07 BEIJING 130-F 454432.0===============================********************************
三、数据预处理
********************************
内连接
id date city category age price gender m-point pay
010012013-01-02 BEIJING 100-A 231200.9 male 10 Y
110022013-01-03 shanghai 100-B 44 NaN female 12 N
210032013-01-04 GUANGZHOU 110-A 542133.0 male 20 Y
310042013-01-05 SHENZHEN 110-C 325433.2 female 40 Y
410052013-01-06 SHANGHAI 210-A 34 NaN male 40 N
510062013-01-07 BEIJING 130-F 454432.0 female 40 Y
===============================
左连接
id date city category age price gender m-point pay
010012013-01-02 BEIJING 100-A 231200.9 male 10 Y
110022013-01-03 shanghai 100-B 44 NaN female 12 N
210032013-01-04 GUANGZHOU 110-A 542133.0 male 20 Y
310042013-01-05 SHENZHEN 110-C 325433.2 female 40 Y
410052013-01-06 SHANGHAI 210-A 34 NaN male 40 N
510062013-01-07 BEIJING 130-F 454432.0 female 40 Y
===============================
右连接
id date city category age price gender m-point pay
010012013-01-02 BEIJING 100-A 23.01200.9 male 10 Y
110022013-01-03 shanghai 100-B 44.0 NaN female 12 N
210032013-01-04 GUANGZHOU 110-A 54.02133.0 male 20 Y
310042013-01-05 SHENZHEN 110-C 32.05433.2 female 40 Y
410052013-01-06 SHANGHAI 210-A 34.0 NaN male 40 N
510062013-01-07 BEIJING 130-F 45.04432.0 female 40 Y
61007 NaT NaN NaN NaN NaN male 30 N
7108 NaT NaN NaN NaN NaN female 20 Y
===============================
外连接
id date city category age price gender m-point pay
010012013-01-02 BEIJING 100-A 23.01200.9 male 10 Y
110022013-01-03 shanghai 100-B 44.0 NaN female 12 N
210032013-01-04 GUANGZHOU 110-A 54.02133.0 male 20 Y
310042013-01-05 SHENZHEN 110-C 32.05433.2 female 40 Y
410052013-01-06 SHANGHAI 210-A 34.0 NaN male 40 N
510062013-01-07 BEIJING 130-F 45.04432.0 female 40 Y
61007 NaT NaN NaN NaN NaN male 30 N
7108 NaT NaN NaN NaN NaN female 20 Y
===============================
设置索引列
id date city category age price gender m-point pay
010012013-01-02 BEIJING 100-A 23.01200.9 male 10 Y
310042013-01-05 SHENZHEN 110-C 32.05433.2 female 40 Y
410052013-01-06 SHANGHAI 210-A 34.0 NaN male 40 N
110022013-01-03 shanghai 100-B 44.0 NaN female 12 N
510062013-01-07 BEIJING 130-F 45.04432.0 female 40 Y
210032013-01-04 GUANGZHOU 110-A 54.02133.0 male 20 Y
61007 NaT NaN NaN NaN NaN male 30 N
7108 NaT NaN NaN NaN NaN female 20 Y
===============================
如果price列的值>3000,group列显示high,否则显示low
id date city category ... gender m-point pay group
010012013-01-02 BEIJING 100-A ... male 10 Y low
310042013-01-05 SHENZHEN 110-C ... female 40 Y high
410052013-01-06 SHANGHAI 210-A ... male 40 N low
110022013-01-03 shanghai 100-B ... female 12 N low
510062013-01-07 BEIJING 130-F ... female 40 Y high
210032013-01-04 GUANGZHOU 110-A ... male 20 Y low
61007 NaT NaN NaN ... male 30 N low
7108 NaT NaN NaN ... female 20 Y low
[8 rows x 10 columns]===============================
对复合多个条件的数据进行分组标记
id date city category ... gender m-point pay sign
010012013-01-02 BEIJING 100-A ... male 10 Y NaN
110022013-01-03 shanghai 100-B ... female 12 N NaN
210032013-01-04 GUANGZHOU 110-A ... male 20 Y NaN
310042013-01-05 SHENZHEN 110-C ... female 40 Y NaN
410052013-01-06 SHANGHAI 210-A ... male 40 N NaN
510062013-01-07 BEIJING 130-F ... female 40 Y 1.0[6 rows x 10 columns]===============================
对category字段的值依次进行分列,并创建数据表,索引值为df_inner的索引列,列名称为category和size
split
category size
0100 A
1100 B
2110 A
3110 C
4210 A
5130 F
===============================
将完成分裂后的数据表和原df_inner数据表进行匹配
id date city category_x ... pay sign category_y size
010012013-01-02 BEIJING 100-A ... Y NaN 100 A
110022013-01-03 shanghai 100-B ... N NaN 100 B
210032013-01-04 GUANGZHOU 110-A ... Y NaN 110 A
310042013-01-05 SHENZHEN 110-C ... Y NaN 110 C
410052013-01-06 SHANGHAI 210-A ... N NaN 210 A
510062013-01-07 BEIJING 130-F ... Y 1.0130 F
[6 rows x 12 columns]===============================********************************
四、数据提取
********************************
按索引提取单行的数值
id1001
date 2013-01-0200:00:00
city BEIJING
category_x 100-A
age 23
price 1200.9
gender male
m-point 10
pay Y
sign NaN
category_y 100
size A
Name:0, dtype:object===============================
按索引提取区域行数值
id date city category_x ... pay sign category_y size
010012013-01-02 BEIJING 100-A ... Y NaN 100 A
110022013-01-03 shanghai 100-B ... N NaN 100 B
[2 rows x 12 columns]===============================
重设索引
===============================
设置日期为索引
===============================
提取4日之前的所有数据
id city category_x age ... pay sign category_y size
date ...2013-01-021001 BEIJING 100-A 23... Y NaN 100 A
2013-01-031002 shanghai 100-B 44... N NaN 100 B
2013-01-041003 GUANGZHOU 110-A 54... Y NaN 110 A
[3 rows x 11 columns]===============================
使用iloc按位置区域提取数据
id city
date
2013-01-021001 BEIJING
2013-01-031002 shanghai
===============================
适应iloc按位置单独提起数据
price gender
date
2013-01-021200.9 male
2013-01-042133.0 male
2013-01-074432.0 female
===============================
使用ix按索引标签和位置混合提取数据
id city category_x age
date
2013-01-021001 BEIJING 100-A 232013-01-031002 shanghai 100-B 44===============================
判断city列的值是否为北京
date
2013-01-02True2013-01-03False2013-01-04False2013-01-05False2013-01-06False2013-01-07True
Name: city, dtype:bool===============================
判断city列里是否包含beijing和shanghai,然后将符合条件的数据提取出来
id city category_x age ... pay sign category_y size
date ...2013-01-021001 BEIJING 100-A 23... Y NaN 100 A
2013-01-031002 shanghai 100-B 44... N NaN 100 B
2013-01-071006 BEIJING 130-F 45... Y 1.0130 F
[3 rows x 11 columns]===============================
提取前三个字符,并生成数据表
category
010011002110311042105130===============================********************************
五、数据筛选
********************************
找出年龄大于25并且在beijing的记录
id city age category price
51006 BEIJING 45130-F 4432.0===============================
找出年龄大于35或者在beijing的记录
===============================id city age category price
01001 BEIJING 23100-A 1200.911002 shanghai 44100-B NaN
21003 GUANGZHOU 54110-A 2133.051006 BEIJING 45130-F 4432.0===============================
找出不在beijing的记录,对筛选后的数据按city列进行计数
===============================id city age category price
11002 shanghai 44100-B NaN
21003 GUANGZHOU 54110-A 2133.031004 SHENZHEN 32110-C 5433.241005 SHANGHAI 34210-A NaN
总数:4===============================
使用query函数进行筛选
===============================id date city category age price
010012013-01-02 BEIJING 100-A 231200.9110022013-01-03 shanghai 100-B 44 NaN
510062013-01-07 BEIJING 130-F 454432.0===============================
对筛选后的结果按prince进行求和
===============================5632.9===============================********************************
六、数据汇总
********************************
对所有的列进行计数汇总
===============================id date category age price
city
BEIJING 22222
GUANGZHOU 11111
SHANGHAI 11110
SHENZHEN 11111
shanghai 11110===============================
按城市对id字段进行计数
===============================
city
BEIJING 2
GUANGZHOU 1
SHANGHAI 1
SHENZHEN 1
shanghai 1
Name:id, dtype: int64
===============================
对两个字段进行汇总计数
===============================
city age
BEIJING 231451
GUANGZHOU 541
SHANGHAI 341
SHENZHEN 321
shanghai 441
Name:id, dtype: int64
===============================
对city字段进行汇总,并分别计算prince的合计和均值
===============================lensum mean
city
BEIJING 2.05632.92816.45
GUANGZHOU 1.02133.02133.00
SHANGHAI 1.00.0 NaN
SHENZHEN 1.05433.25433.20
shanghai 1.00.0 NaN
===============================********************************
七、数据统计
********************************
简单的数据采样
===============================id city category_x age ... pay sign category_y size
date ...2013-01-071006 BEIJING 130-F 45... Y 1.0130 F
2013-01-021001 BEIJING 100-A 23... Y NaN 100 A
2013-01-041003 GUANGZHOU 110-A 54... Y NaN 110 A
[3 rows x 11 columns]===============================
手动设置采样权重
===============================id date city category age price
210032013-01-04 GUANGZHOU 110-A 542133.0510062013-01-07 BEIJING 130-F 454432.0===============================
采样后不放回
===============================id date city category age price
010012013-01-02 BEIJING 100-A 231200.9410052013-01-06 SHANGHAI 210-A 34 NaN
310042013-01-05 SHENZHEN 110-C 325433.2210032013-01-04 GUANGZHOU 110-A 542133.0110022013-01-03 shanghai 100-B 44 NaN
510062013-01-07 BEIJING 130-F 454432.0===============================
采样后放回
===============================id date city category age price
110022013-01-03 shanghai 100-B 44 NaN
210032013-01-04 GUANGZHOU 110-A 542133.0310042013-01-05 SHENZHEN 110-C 325433.2210032013-01-04 GUANGZHOU 110-A 542133.0410052013-01-06 SHANGHAI 210-A 34 NaN
010012013-01-02 BEIJING 100-A 231200.9===============================
数据表描述性统计
===============================
count mean std min25%50%75%maxid6.01003.501.871001.01002.251003.51004.751006.0
age 6.038.6711.0923.032.5039.044.7554.0
price 4.03299.781966.391200.91899.983282.54682.305433.2===============================
计算列的标准差
===============================1966.39058917===============================
计算两个字段间的协方差
===============================
nan
===============================
数据表中所有字段间的协方差
===============================id age price
id3.5000005.8000003.242617e+03
age 5.800000123.0666672.646583e+03
price 3242.6166672646.5833333.866692e+06===============================
两个字段的相关性分析
===============================
nan
===============================
数据表的相关性分析
===============================id age price
id1.0000000.2794630.792163
age 0.2794631.0000000.098074
price 0.7921630.0980741.000000===============================********************************
八、数据输出
********************************
数据导出:写入Excel
写入成功
===============================
数据导出:写入到CSV
写入成功
===============================