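Pandas Concatenation and Hands-On Data Analysis

Combining data in pandas

pandas offers two ways to combine data:

  • Concatenation: pd.concat, DataFrame.append (append was removed in pandas 2.0)
  • Merging: pd.merge, DataFrame.join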
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

0. Review: concatenation in NumPy

nd = np.random.randint(0, 150, size=(5, 4))
nd
  • Output

    array([[ 47,  48, 144,  83],
           [ 69, 112,  87, 107],
           [ 67,  47,  74, 145],
           [115, 141, 125,  85],
           [ 65,  12,  87, 118]])

np.concatenate([nd, nd], axis=1)
  • Output

    array([[ 47,  48, 144,  83,  47,  48, 144,  83],
           [ 69, 112,  87, 107,  69, 112,  87, 107],
           [ 67,  47,  74, 145,  67,  47,  74, 145],
           [115, 141, 125,  85, 115, 141, 125,  85],
           [ 65,  12,  87, 118,  65,  12,  87, 118]])
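As a quick check of what the axis argument does, here is a minimal sketch on a small fixed array (the 2x3 array is an illustrative choice, not part of the original example). It only compares the resulting shapes.

```python
import numpy as np

# A small fixed array makes the shape change easy to follow.
nd = np.arange(6).reshape(2, 3)

rows = np.concatenate([nd, nd], axis=0)  # stack vertically: more rows
cols = np.concatenate([nd, nd], axis=1)  # stack horizontally: more columns

print(rows.shape)
print(cols.shape)
```

With axis=0 the row count doubles, and with axis=1 the column count doubles.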

To make the examples easier to follow, we first define a helper function that builds a DataFrame:

def make_df(cols, index):
    # Each cell is the column name followed by the index label, e.g. 'a1'
    data = {col: [str(col) + str(ind) for ind in index] for col in cols}
    df = DataFrame(data=data, columns=cols, index=index)
    return df

df1 = make_df(['a', 'b', 'c'], [1, 2, 3])
df1

        a   b   c
    1  a1  b1  c1
    2  a2  b2  c2
    3  a3  b3  c3

df2 = make_df(['a', 'b', 'c'], [4, 5, 6])
df2

        a   b   c
    4  a4  b4  c4
    5  a5  b5  c5
    6  a6  b6  c6

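1. Concatenating with pd.concat()

pd.concat works much like np.concatenate but takes additional parameters. The signature below comes from an older pandas release; join_axes was removed in pandas 1.0.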

pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
          keys=None, levels=None, names=None, verify_integrity=False,
          copy=True)

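1) Simple concatenation

Like np.concatenate, pd.concat adds rows by default (axis=0).

# With axis=1, np.concatenate joins arrays side by side. NumPy arrays have no index
# or column labels, so the arrays only need matching sizes along the other axis.
# pandas aligns on labels instead: frames of the same shape but with different labels
# are combined into a larger DataFrame, and cells with no matching label are filled with NaN.
pd.concat([df1, df2], axis=1)

         a    b    c    a    b    c
    1   a1   b1   c1  NaN  NaN  NaN
    2   a2   b2   c2  NaN  NaN  NaN
    3   a3   b3   c3  NaN  NaN  NaN
    4  NaN  NaN  NaN   a4   b4   c4
    5  NaN  NaN  NaN   a5   b5   c5
    6  NaN  NaN  NaN   a6   b6   c6

Setting axis changes the direction of the concatenation.

Note that index labels may repeat in the concatenated result.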

df3 = make_df(['a', 'b', 'c'], [2, 3, 4])
df3

        a   b   c
    2  a2  b2  c2
    3  a3  b3  c3
    4  a4  b4  c4

pd.concat([df1, df3])

        a   b   c
    1  a1  b1  c1
    2  a2  b2  c2
    3  a3  b3  c3
    2  a2  b2  c2
    3  a3  b3  c3
    4  a4  b4  c4

pd.concat([df1, df3], axis=1)

         a    b    c    a    b    c
    1   a1   b1   c1  NaN  NaN  NaN
    2   a2   b2   c2   a2   b2   c2
    3   a3   b3   c3   a3   b3   c3
    4  NaN  NaN  NaN   a4   b4   c4
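Because repeated index labels are allowed, it can help to detect them explicitly. The sketch below redefines make_df so it runs on its own. It uses Index.duplicated() to spot repeats and the verify_integrity=True option of pd.concat, which raises a ValueError when labels overlap. The variable names has_dupes and raised are illustrative only.

```python
import pandas as pd

def make_df(cols, index):
    # Same helper as in the text: each cell is column name + index label
    data = {col: [str(col) + str(ind) for ind in index] for col in cols}
    return pd.DataFrame(data, columns=cols, index=index)

df1 = make_df(['a', 'b', 'c'], [1, 2, 3])
df3 = make_df(['a', 'b', 'c'], [2, 3, 4])

stacked = pd.concat([df1, df3])
# Labels 2 and 3 appear in both frames, so the result has repeated labels
has_dupes = stacked.index.duplicated().any()

# verify_integrity=True refuses to build an index with overlapping labels
try:
    pd.concat([df1, df3], verify_integrity=True)
    raised = False
except ValueError:
    raised = True

print(has_dupes, raised)
```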

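You can also pass ignore_index=True to discard the original index labels and renumber the rows.

# Aside: a .gitignore file lists paths that git should skip, so those files are never uploaded to the remote
# ignore_index=True replaces the original index with a fresh 0..n-1 range
pd.concat([df1, df3], axis=0, ignore_index=True)
# At work, most data for analysis comes from MySQL, where ids are unique and large tables are sharded
# Sharding: a rule of thumb is at most about 1,000,000 indexed rows per MySQL table, about 800,000 in practice

        a   b   c
    0  a1  b1  c1
    1  a2  b2  c2
    2  a3  b3  c3
    3  a2  b2  c2
    4  a3  b3  c3
    5  a4  b4  c4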

Alternatively, use keys to build a multi-level index:

concat([x, y], keys=['x', 'y'])

df4 = pd.concat([df1, df3], keys=['midterm', 'final'])
df4

                a   b   c
    midterm 1  a1  b1  c1
            2  a2  b2  c2
            3  a3  b3  c3
    final   2  a2  b2  c2
            3  a3  b3  c3
            4  a4  b4  c4
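Once keys has produced a two-level index, each block can be pulled back out by its key. The following sketch assumes English key names ('midterm', 'final') and adds hypothetical level names ('exam', 'row') through the names parameter; neither is required by the original example.

```python
import pandas as pd

def make_df(cols, index):
    data = {col: [str(col) + str(ind) for ind in index] for col in cols}
    return pd.DataFrame(data, columns=cols, index=index)

df1 = make_df(['a', 'b', 'c'], [1, 2, 3])
df3 = make_df(['a', 'b', 'c'], [2, 3, 4])

# names labels the two index levels (hypothetical names for illustration)
df4 = pd.concat([df1, df3], keys=['midterm', 'final'], names=['exam', 'row'])

midterm = df4.loc['midterm']          # all rows under the outer key
one_row = df4.loc[('final', 2)]       # a single row, addressed by both levels
row2_everywhere = df4.xs(2, level='row')  # row label 2 from every exam

print(midterm)
print(one_row)
print(row2_everywhere)
```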

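2) Concatenating mismatched frames

"Mismatched" means that the labels on the other axis differ: the column labels differ when concatenating vertically, or the row labels differ when concatenating horizontally.

There are three ways to handle this:

  • Outer join: fill the gaps with NaN (the default)
df1

        a   b   c
    1  a1  b1  c1
    2  a2  b2  c2
    3  a3  b3  c3

df5 = make_df(['c', 'd', 'e'], [3, 4, 5])
df5

        c   d   e
    3  c3  d3  e3
    4  c4  d4  e4
    5  c5  d5  e5

df6 = pd.concat([df1, df5], axis=0)
df6

         a    b   c    d    e
    1   a1   b1  c1  NaN  NaN
    2   a2   b2  c2  NaN  NaN
    3   a3   b3  c3  NaN  NaN
    3  NaN  NaN  c3   d3   e3
    4  NaN  NaN  c4   d4   e4
    5  NaN  NaN  c5   d5   e5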
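  • Inner join: keep only the labels that match
# Recall the difference between outer and inner joins in MySQL:
# a left outer join keeps every row of the left table; where the right table has no match, the gaps are filled with NULL
# an inner join shows only the rows that match on both sides
# the default value of join is 'outer'
df6 = pd.concat([df1, df5], axis=1, join='inner')
df6
# Same behavior as in MySQL

        a   b   c   c   d   e
    3  a3  b3  c3  c3  d3  e3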
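  • Joining on a specified axis: join_axes
df1

        a   b   c
    1  a1  b1  c1
    2  a2  b2  c2
    3  a3  b3  c3

# join_axes took a list of Index objects, for example [df1.columns] or [df1.index]:
#     pd.concat([df1, df5], join_axes=[df1.columns])
# It was removed in pandas 1.0; reindexing the result gives the same columns.
df7 = pd.concat([df1, df5]).reindex(columns=df1.columns)
df7
# A loosely comparable SQL query:
# select df1.a, df1.b, df1.c from df1 left join df5 using(c);
# using(c) is equivalent to: on df1.c = df5.c

         a    b   c
    1   a1   b1  c1
    2   a2   b2  c2
    3   a3   b3  c3
    3  NaN  NaN  c3
    4  NaN  NaN  c4
    5  NaN  NaN  c5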

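3) Adding rows with append()

Appending at the end is so common that DataFrame has a dedicated append method for it.

append behaves like concat along axis=0.

# DataFrame.append requires pandas < 2.0; it was deprecated in 1.4 and removed in 2.0
df1.append(df2)

        a   b   c
    1  a1  b1  c1
    2  a2  b2  c2
    3  a3  b3  c3
    4  a4  b4  c4
    5  a5  b5  c5
    6  a6  b6  c6

df1.append(df5)

         a    b   c    d    e
    1   a1   b1  c1  NaN  NaN
    2   a2   b2  c2  NaN  NaN
    3   a3   b3  c3  NaN  NaN
    3  NaN  NaN  c3   d3   e3
    4  NaN  NaN  c4   d4   e4
    5  NaN  NaN  c5   d5   e5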
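On pandas 2.0 and later, where DataFrame.append is no longer available, the same results can be obtained with pd.concat. This is a minimal sketch that reuses the make_df helper; same_cols and diff_cols are illustrative names.

```python
import pandas as pd

def make_df(cols, index):
    data = {col: [str(col) + str(ind) for ind in index] for col in cols}
    return pd.DataFrame(data, columns=cols, index=index)

df1 = make_df(['a', 'b', 'c'], [1, 2, 3])
df2 = make_df(['a', 'b', 'c'], [4, 5, 6])
df5 = make_df(['c', 'd', 'e'], [3, 4, 5])

# Equivalent of df1.append(df2): same columns, rows stacked
same_cols = pd.concat([df1, df2])

# Equivalent of df1.append(df5): union of columns, gaps filled with NaN
diff_cols = pd.concat([df1, df5])

print(same_cols)
print(diff_cols)
```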

============================================

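Exercise: create a final-exam score sheet ddd3 containing only Zhang San, Li Si and Wang Laowu, then use append() to concatenate it with the midterm score sheet ddd.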

============================================

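2. Merging with pd.merge()

Unlike concat, merge combines rows by matching the values in a shared key column (or index).

By default, pd.merge() uses the columns that have the same name in both frames as the key.

The key values do not need to appear in the same order in both frames.

1) One-to-one merge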

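df1

        a   b   c
    1  a1  b1  c1
    2  a2  b2  c2
    3  a3  b3  c3

df2

        a   b   c
    4  a4  b4  c4
    5  a5  b5  c5
    6  a6  b6  c6

# The default is an inner join. a, b and c are all shared column names, so all three act as
# keys; no row matches on both sides, so the result is empty.
pd.merge(df1, df2)

    a  b  c
    (no rows)

# how: {'left', 'right', 'outer', 'inner'}, default 'inner'
pd.merge(df1, df2, how='right')

        a   b   c
    0  a4  b4  c4
    1  a5  b5  c5
    2  a6  b6  c6

pd.merge(df1, df5)
# how defaults to 'inner': only rows whose key appears on both sides are shown

        a   b   c   d   e
    0  a3  b3  c3  d3  e3

pd.merge(df1, df5, how='right')

         a    b   c   d   e
    0   a3   b3  c3  d3  e3
    1  NaN  NaN  c4  d4  e4
    2  NaN  NaN  c5  d5  e5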

2) Many-to-one and one-to-many merges

df8 = make_df(['c', 'd', 'e'], [1, 1, 1, 4])
df8

        c   d   e
    1  c1  d1  e1
    1  c1  d1  e1
    1  c1  d1  e1
    4  c4  d4  e4

pd.merge(df1, df8)
# select * from df1 join df8 on df1.c = df8.c

        a   b   c   d   e
    0  a1  b1  c1  d1  e1
    1  a1  b1  c1  d1  e1
    2  a1  b1  c1  d1  e1

pd.merge(df1, df8, how='left')

        a   b   c    d    e
    0  a1  b1  c1   d1   e1
    1  a1  b1  c1   d1   e1
    2  a1  b1  c1   d1   e1
    3  a2  b2  c2  NaN  NaN
    4  a3  b3  c3  NaN  NaN

pd.merge(df1, df8, how='outer')
# In practice, prefer how='outer': unmatched rows from both sides are kept and the gaps are filled automatically

         a    b   c    d    e
    0   a1   b1  c1   d1   e1
    1   a1   b1  c1   d1   e1
    2   a1   b1  c1   d1   e1
    3   a2   b2  c2  NaN  NaN
    4   a3   b3  c3  NaN  NaN
    5  NaN  NaN  c4   d4   e4
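When the expected relationship between the keys is known in advance, pd.merge can check it through its validate parameter. The sketch below uses two small frames shaped like df1 and df8 (the names left and right are illustrative): a one-to-many check passes, while a one-to-one check raises pandas.errors.MergeError because c1 repeats on the right.

```python
import pandas as pd

left = pd.DataFrame({'c': ['c1', 'c2', 'c3'], 'a': ['a1', 'a2', 'a3']})
right = pd.DataFrame({'c': ['c1', 'c1', 'c1', 'c4'],
                      'd': ['d1', 'd1', 'd1', 'd4']})

# Keys are unique on the left and repeated on the right: one-to-many is fine
ok = pd.merge(left, right, on='c', validate='one_to_many')

# Asking for one-to-one fails because 'c1' appears three times on the right
try:
    pd.merge(left, right, on='c', validate='one_to_one')
    one_to_one_ok = True
except pd.errors.MergeError:
    one_to_one_ok = False

print(len(ok), one_to_one_ok)
```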

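3) Many-to-many merge

# Change two cells first. The label 1 is repeated in the index, so the first row is
# addressed by position; .loc avoids chained assignment for the second cell.
df8.iloc[0, df8.columns.get_loc('d')] = 'qwe'
df8.loc[4, 'e'] = 'asd'
df8

        c    d    e
    1  c1  qwe   e1
    1  c1   d1   e1
    1  c1   d1   e1
    4  c4   d4  asd

df9 = make_df(list('abc'), [1, 1, 4, 4])
df9

        a   b   c
    1  a1  b1  c1
    1  a1  b1  c1
    4  a4  b4  c4
    4  a4  b4  c4

pd.merge(df9, df8, how='outer')

        a   b   c    d    e
    0  a1  b1  c1  qwe   e1
    1  a1  b1  c1   d1   e1
    2  a1  b1  c1   d1   e1
    3  a1  b1  c1  qwe   e1
    4  a1  b1  c1   d1   e1
    5  a1  b1  c1   d1   e1
    6  a4  b4  c4   d4  asd
    7  a4  b4  c4   d4  asd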

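4) Specifying the key explicitly

  • Use on= to name the key column explicitly; this is needed when the frames share more than one column name
df1

        a   b   c
    1  a1  b1  c1
    2  a2  b2  c2
    3  a3  b3  c3

df10 = make_df(list('cde'), [1, 2, 3])
# The DataFrame already exists; rename its columns afterwards
df10.columns = list('wde')
df10

        w   d   e
    1  c1  d1  e1
    2  c2  d2  e2
    3  c3  d3  e3

df11 = make_df(list('bcd'), [1, 2, 3])
df11

        b   c   d
    1  b1  c1  d1
    2  b2  c2  d2
    3  b3  c3  d3

pd.merge(df1, df11, on='b')
# In MySQL you often meet two tables that share a field name while the fields differ (for example in type).
# Join on the shared key (a.u = b.u) and alias the clashing field:
# select a.o as ao, b.o as bo

        a   b  c_x  c_y   d
    0  a1  b1   c1   c1  d1
    1  a2  b2   c2   c2  d2
    2  a3  b3   c3   c3  d3

pd.merge(df1, df11, on='c')
  • Use left_on and right_on to name the key column on each side; this is needed when the key columns have different names
# on= joins two tables on a field that has the same name, data type and meaning in both
pd.merge(df1, df10, left_on='c', right_on='w')
# Comparable to joining on df1.c = df10.w in MySQL

        a   b   c   w   d   e
    0  a1  b1  c1  c1  d1  e1
    1  a2  b2  c2  c2  d2  e2
    2  a3  b3  c3  c3  d3  e3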
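A merge with left_on and right_on keeps both key columns, which hold identical values. One possible cleanup, sketched below, is to drop the redundant column afterwards or to rename it before merging. The names tidy and renamed are illustrative.

```python
import pandas as pd

def make_df(cols, index):
    data = {col: [str(col) + str(ind) for ind in index] for col in cols}
    return pd.DataFrame(data, columns=cols, index=index)

df1 = make_df(list('abc'), [1, 2, 3])
df10 = make_df(list('cde'), [1, 2, 3])
df10.columns = list('wde')

merged = pd.merge(df1, df10, left_on='c', right_on='w')

# Option 1: drop the duplicate key column after merging
tidy = merged.drop(columns='w')

# Option 2: rename the right-hand key first, then merge with on=
renamed = pd.merge(df1, df10.rename(columns={'w': 'c'}), on='c')

print(list(merged.columns))
print(list(tidy.columns))
```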

============================================

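  1. Suppose there are two score sheets: besides ddd (Zhang San, Li Si and Wang Laowu) there is ddd4, with the scores of Zhang San and Zhao Xiaoliu. How would you merge them?

  2. What if Zhang San's name was mistyped in ddd4 as Zhang Shisan?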

============================================

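5) Inner and outer merges

  • Inner merge: keeps only the keys present in both frames (the default)

  • Outer merge, how='outer': keeps all keys and fills the gaps with NaN

  • Left and right merges: how='left', how='right'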
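To see what each mode keeps, the sketch below merges two made-up frames (left and right, with a hypothetical key column named key) under all four values of how. It also passes indicator=True, which adds a _merge column recording whether each row came from the left frame only, the right frame only, or both.

```python
import pandas as pd

left = pd.DataFrame({'key': ['k1', 'k2', 'k3'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['k3', 'k4'], 'y': [30, 40]})

# Row counts for each merge mode
sizes = {}
for how in ['inner', 'outer', 'left', 'right']:
    merged = pd.merge(left, right, on='key', how=how, indicator=True)
    sizes[how] = len(merged)

# In the outer merge, _merge shows where each row came from
outer = pd.merge(left, right, on='key', how='outer', indicator=True)
counts = outer['_merge'].value_counts()

print(sizes)
print(outer)
```

Only k3 exists on both sides, so the inner merge has one row, while the outer merge keeps all four keys.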

============================================

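  1. If you only have the scores of Zhang San and Zhao Xiaoliu in three subjects (Chinese, math and English), how would you merge?

  2. Considering the use case, merge ddd and ddd4 in several different ways.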

============================================

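6) Resolving column conflicts

Columns conflict when the two frames share more than one column name. Use on= to choose which column is the key; the remaining same-named columns are told apart by suffixes (_x and _y by default).

Use suffixes= to choose the suffixes yourself.

pd.merge(df1, df11, on='c', suffixes=('_up', '_down'))

        a  b_up   c  b_down   d
    0  a1    b1  c1      b1  d1
    1  a2    b2  c2      b2  d2
    2  a3    b3  c3      b3  d3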

============================================

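Exercise: suppose two students are both named Li Si, and ddd5 and ddd6 are both score tables for Zhang San and Li Si. How would you merge them?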

============================================

3. Case study: analyzing US state population data

First load the files and look at a sample of the data.

# Areas
areas = pd.read_csv('./data/state-areas.csv')
# Abbreviations
abbr = pd.read_csv('./data/state-abbrevs.csv')
# Population
pop = pd.read_csv('./data/state-population.csv')
areas.shape
  • Output

    (52, 2)

areas.head()

            state  area (sq. mi)
    0     Alabama          52423
    1      Alaska         656425
    2     Arizona         114006
    3    Arkansas          53182
    4  California         163707

abbr.shape
  • Output

    (51, 2)

abbr.head()

            state abbreviation
    0     Alabama           AL
    1      Alaska           AK
    2     Arizona           AZ
    3    Arkansas           AR
    4  California           CA

pop.shape
  • Output

    (2544, 4)

pop.head()

      state/region     ages  year  population
    0           AL  under18  2012   1117489.0
    1           AL    total  2012   4817528.0
    2           AL  under18  2010   1130966.0
    3           AL    total  2010   4785570.0
    4           AL  under18  2011   1125763.0

Merge the pop and abbr DataFrames, matching the state/region column of pop against the abbreviation column of abbr.

Use an outer merge so that no information is lost.

abbrToPop = pd.merge(abbr, pop, left_on='abbreviation', right_on='state/region', how='outer')
abbrToPop.head()

         state abbreviation state/region     ages  year  population
    0  Alabama           AL           AL  under18  2012   1117489.0
    1  Alabama           AL           AL    total  2012   4817528.0
    2  Alabama           AL           AL  under18  2010   1130966.0
    3  Alabama           AL           AL    total  2010   4785570.0
    4  Alabama           AL           AL  under18  2011   1125763.0

Drop the abbreviation column (axis=1).

# drop()
# Methods that display a result as soon as they run generally leave the original data unchanged;
# inplace=True makes drop() modify abbrToPop itself
abbrToPop.drop(labels='abbreviation', axis=1, inplace=True)
abbrToPop.head()
abbrToPop.shape
  • Output

    (2544, 5)

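Find the columns that contain missing data.

.isnull().any() returns True for a column as soon as it contains at least one missing value.

# NaN
abbrToPop.isnull().any()
  • Output

    state False
    state/region False
    ages False
    year False
    population True
    dtype: bool

Note: the output above appears to have been captured after the state column had already been filled in (see below). On a fresh run state should also show True, as the count of 96 further down indicates.

Inspect the missing data

isnull() marks each value as True when it is missing; summing that mask counts the missing values.

# How to count the missing values
abbrToPop['state'].isnull().sum()
  • Output

    96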
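The three checks used here (any missing value per column, a count per column, and the affected rows) can be tried on a tiny frame. The values below are made up for illustration and are not taken from the population data.

```python
import numpy as np
import pandas as pd

# Toy data: two rows have no state name, one has no population
df = pd.DataFrame({
    'state': ['Alabama', None, None],
    'state/region': ['AL', 'PR', 'USA'],
    'population': [100.0, np.nan, 300.0],
})

has_nan = df.isnull().any()        # per column: is anything missing?
n_missing = df.isnull().sum()      # per column: how many are missing?
rows_with_nan = df[df.isnull().any(axis=1)]  # rows with at least one NaN

print(has_nan)
print(n_missing)
print(rows_with_nan)
```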

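Find which state/region values leave state as NaN, using unique() to list the distinct values.

# The goal is to fill in the empty values
cond = abbrToPop['state'].isnull()
# state holds the state name; to decide how to fill it, first see which codes are affected
# unique()
abbrToPop['state/region'][cond].unique()
  • Output

    array(['PR', 'USA'], dtype=object)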

# Looking it up, PR stands for Puerto Rico.
# Assign through .loc: chained indexing such as abbrToPop['state'][cond_pr] = 'Puerto Rico'
# triggers SettingWithCopyWarning and may not modify the original frame.
cond_pr = abbrToPop['state/region'] == 'PR'
abbrToPop.loc[cond_pr, 'state'] = 'Puerto Rico'
cond_usa = abbrToPop['state/region'] == 'USA'
abbrToPop.loc[cond_usa, 'state'] = 'United States'
# Rows with a missing population value are simply dropped for now.
# dropna(): inplace=True modifies the original table
abbrToPop.dropna(inplace=True)
abbrToPop.isnull().sum()
  • Output

    state 0
    state/region 0
    ages 0
    year 0
    population 0
    dtype: int64

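Filling in the correct state value for each state/region code found this way removes every NaN from the state column.

Remember this approach to cleaning up missing data (NaN).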
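The fill-in step can also be written as a lookup table, which scales better when several codes need names. This is a sketch on made-up rows, assuming a hypothetical dictionary called full_names; it is an alternative to the .loc assignment, not the approach used in this walkthrough.

```python
import pandas as pd

df = pd.DataFrame({
    'state': ['Alabama', None, None],
    'state/region': ['AL', 'PR', 'USA'],
})

# Hypothetical lookup from region code to full name
full_names = {'PR': 'Puerto Rico', 'USA': 'United States'}

# map() translates the codes; fillna() only touches rows where state is missing
df['state'] = df['state'].fillna(df['state/region'].map(full_names))

print(df)
```

Rows that already have a state name are left untouched, because fillna only replaces missing values.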

Next, merge in the state area data, areas. The code below uses an outer merge.

Think about why an outer merge is used here.

areas.head()

            state  area (sq. mi)
    0     Alabama          52423
    1      Alaska         656425
    2     Arizona         114006
    3    Arkansas          53182
    4  California         163707

abbrToPopToAreas = pd.merge(abbrToPop, areas, how='outer')
abbrToPopToAreas.head()

         state state/region     ages  year  population  area (sq. mi)
    0  Alabama           AL  under18  2012   1117489.0        52423.0
    1  Alabama           AL    total  2012   4817528.0        52423.0
    2  Alabama           AL  under18  2010   1130966.0        52423.0
    3  Alabama           AL    total  2010   4785570.0        52423.0
    4  Alabama           AL  under18  2011   1125763.0        52423.0

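Keep looking for columns with missing data.

The area (sq. mi) column has missing values. To find the affected rows, we need to find which state has no area data.

Note: the output below appears to have been captured after the missing areas had been filled in further down, which is why every column shows False.

abbrToPopToAreas.isnull().any()
  • Output

    state False
    state/region False
    ages False
    year False
    population False
    area (sq. mi) False
    dtype: bool

cond_area = abbrToPopToAreas['area (sq. mi)'].isnull()
abbrToPopToAreas['state/region'][cond_area]
  • Output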

    2476 USA
    2477 USA
    2478 USA
    2479 USA
    2480 USA
    2481 USA
    2482 USA
    2483 USA
    2484 USA
    2485 USA
    2486 USA
    2487 USA
    2488 USA
    2489 USA
    2490 USA
    2491 USA
    2492 USA
    2493 USA
    2494 USA
    2495 USA
    2496 USA
    2497 USA
    2498 USA
    2499 USA
    2500 USA
    2501 USA
    2502 USA
    2503 USA
    2504 USA
    2505 USA
    2506 USA
    2507 USA
    2508 USA
    2509 USA
    2510 USA
    2511 USA
    2512 USA
    2513 USA
    2514 USA
    2515 USA
    2516 USA
    2517 USA
    2518 USA
    2519 USA
    2520 USA
    2521 USA
    2522 USA
    2523 USA
    Name: state/region, dtype: object

total_area = areas['area (sq. mi)'].sum()
total_area
  • Output

    3790399

# Assign through .loc; the chained form abbrToPopToAreas['area (sq. mi)'][cond_ab] = total_area
# triggers SettingWithCopyWarning
cond_ab = abbrToPopToAreas['state/region'] == 'USA'
abbrToPopToAreas.loc[cond_ab, 'area (sq. mi)'] = total_area

Check whether any data is still missing

abbrToPopToAreas.isnull().sum()
  • Output

    state 0
    state/region 0
    ages 0
    year 0
    population 0
    area (sq. mi) 0
    dtype: int64

Select the 2010 total-population rows with df.query(expression)

abbrToPopToAreas

          state          state/region  ages     year  population   area (sq. mi)
    0     Alabama        AL            under18  2012  1117489.0    52423.0
    1     Alabama        AL            total    2012  4817528.0    52423.0
    2     Alabama        AL            under18  2010  1130966.0    52423.0
    3     Alabama        AL            total    2010  4785570.0    52423.0
    4     Alabama        AL            under18  2011  1125763.0    52423.0
    5     Alabama        AL            total    2011  4801627.0    52423.0
    6     Alabama        AL            total    2009  4757938.0    52423.0
    7     Alabama        AL            under18  2009  1134192.0    52423.0
    8     Alabama        AL            under18  2013  1111481.0    52423.0
    9     Alabama        AL            total    2013  4833722.0    52423.0
    10    Alabama        AL            total    2007  4672840.0    52423.0
    11    Alabama        AL            under18  2007  1132296.0    52423.0
    12    Alabama        AL            total    2008  4718206.0    52423.0
    13    Alabama        AL            under18  2008  1134927.0    52423.0
    14    Alabama        AL            total    2005  4569805.0    52423.0
    15    Alabama        AL            under18  2005  1117229.0    52423.0
    16    Alabama        AL            total    2006  4628981.0    52423.0
    17    Alabama        AL            under18  2006  1126798.0    52423.0
    18    Alabama        AL            total    2004  4530729.0    52423.0
    19    Alabama        AL            under18  2004  1113662.0    52423.0
    20    Alabama        AL            total    2003  4503491.0    52423.0
    21    Alabama        AL            under18  2003  1113083.0    52423.0
    22    Alabama        AL            total    2001  4467634.0    52423.0
    23    Alabama        AL            under18  2001  1120409.0    52423.0
    24    Alabama        AL            total    2002  4480089.0    52423.0
    25    Alabama        AL            under18  2002  1116590.0    52423.0
    26    Alabama        AL            under18  1999  1121287.0    52423.0
    27    Alabama        AL            total    1999  4430141.0    52423.0
    28    Alabama        AL            total    2000  4452173.0    52423.0
    29    Alabama        AL            under18  2000  1122273.0    52423.0
    ...   ...            ...           ...      ...   ...          ...
    2494  United States  USA           under18  1999  71946051.0   3790399.0
    2495  United States  USA           total    2000  282162411.0  3790399.0
    2496  United States  USA           under18  2000  72376189.0   3790399.0
    2497  United States  USA           total    1999  279040181.0  3790399.0
    2498  United States  USA           total    2001  284968955.0  3790399.0
    2499  United States  USA           under18  2001  72671175.0   3790399.0
    2500  United States  USA           total    2002  287625193.0  3790399.0
    2501  United States  USA           under18  2002  72936457.0   3790399.0
    2502  United States  USA           total    2003  290107933.0  3790399.0
    2503  United States  USA           under18  2003  73100758.0   3790399.0
    2504  United States  USA           total    2004  292805298.0  3790399.0
    2505  United States  USA           under18  2004  73297735.0   3790399.0
    2506  United States  USA           total    2005  295516599.0  3790399.0
    2507  United States  USA           under18  2005  73523669.0   3790399.0
    2508  United States  USA           total    2006  298379912.0  3790399.0
    2509  United States  USA           under18  2006  73757714.0   3790399.0
    2510  United States  USA           total    2007  301231207.0  3790399.0
    2511  United States  USA           under18  2007  74019405.0   3790399.0
    2512  United States  USA           total    2008  304093966.0  3790399.0
    2513  United States  USA           under18  2008  74104602.0   3790399.0
    2514  United States  USA           under18  2013  73585872.0   3790399.0
    2515  United States  USA           total    2013  316128839.0  3790399.0
    2516  United States  USA           total    2009  306771529.0  3790399.0
    2517  United States  USA           under18  2009  74134167.0   3790399.0
    2518  United States  USA           under18  2010  74119556.0   3790399.0
    2519  United States  USA           total    2010  309326295.0  3790399.0
    2520  United States  USA           under18  2011  73902222.0   3790399.0
    2521  United States  USA           total    2011  311582564.0  3790399.0
    2522  United States  USA           under18  2012  73708179.0   3790399.0
    2523  United States  USA           total    2012  313873685.0  3790399.0

    2524 rows × 6 columns

abbrToPopToAreas_2010 = abbrToPopToAreas.query('year == 2010 & ages == "total"')
abbrToPopToAreas_2010

          state                 state/region  ages   year  population   area (sq. mi)
    3     Alabama               AL            total  2010  4785570.0    52423.0
    91    Alaska                AK            total  2010  713868.0     656425.0
    101   Arizona               AZ            total  2010  6408790.0    114006.0
    189   Arkansas              AR            total  2010  2922280.0    53182.0
    197   California            CA            total  2010  37333601.0   163707.0
    283   Colorado              CO            total  2010  5048196.0    104100.0
    293   Connecticut           CT            total  2010  3579210.0    5544.0
    379   Delaware              DE            total  2010  899711.0     1954.0
    389   District of Columbia  DC            total  2010  605125.0     68.0
    475   Florida               FL            total  2010  18846054.0   65758.0
    485   Georgia               GA            total  2010  9713248.0    59441.0
    570   Hawaii                HI            total  2010  1363731.0    10932.0
    581   Idaho                 ID            total  2010  1570718.0    83574.0
    666   Illinois              IL            total  2010  12839695.0   57918.0
    677   Indiana               IN            total  2010  6489965.0    36420.0
    762   Iowa                  IA            total  2010  3050314.0    56276.0
    773   Kansas                KS            total  2010  2858910.0    82282.0
    858   Kentucky              KY            total  2010  4347698.0    40411.0
    869   Louisiana             LA            total  2010  4545392.0    51843.0
    954   Maine                 ME            total  2010  1327366.0    35387.0
    965   Montana               MT            total  2010  990527.0     147046.0
    1050  Nebraska              NE            total  2010  1829838.0    77358.0
    1061  Nevada                NV            total  2010  2703230.0    110567.0
    1146  New Hampshire         NH            total  2010  1316614.0    9351.0
    1157  New Jersey            NJ            total  2010  8802707.0    8722.0
    1242  New Mexico            NM            total  2010  2064982.0    121593.0
    1253  New York              NY            total  2010  19398228.0   54475.0
    1338  North Carolina        NC            total  2010  9559533.0    53821.0
    1349  North Dakota          ND            total  2010  674344.0     70704.0
    1434  Ohio                  OH            total  2010  11545435.0   44828.0
    1445  Oklahoma              OK            total  2010  3759263.0    69903.0
    1530  Oregon                OR            total  2010  3837208.0    98386.0
    1541  Maryland              MD            total  2010  5787193.0    12407.0
    1626  Massachusetts         MA            total  2010  6563263.0    10555.0
    1637  Michigan              MI            total  2010  9876149.0    96810.0
    1722  Minnesota             MN            total  2010  5310337.0    86943.0
    1733  Mississippi           MS            total  2010  2970047.0    48434.0
    1818  Missouri              MO            total  2010  5996063.0    69709.0
    1829  Pennsylvania          PA            total  2010  12710472.0   46058.0
    1914  Rhode Island          RI            total  2010  1052669.0    1545.0
    1925  South Carolina        SC            total  2010  4636361.0    32007.0
    2010  South Dakota          SD            total  2010  816211.0     77121.0
    2021  Tennessee             TN            total  2010  6356683.0    42146.0
    2106  Texas                 TX            total  2010  25245178.0   268601.0
    2117  Utah                  UT            total  2010  2774424.0    84904.0
    2202  Vermont               VT            total  2010  625793.0     9615.0
    2213  Virginia              VA            total  2010  8024417.0    42769.0
    2298  Washington            WA            total  2010  6742256.0    71303.0
    2309  West Virginia         WV            total  2010  1854146.0    24231.0
    2394  Wisconsin             WI            total  2010  5689060.0    65503.0
    2405  Wyoming               WY            total  2010  564222.0     97818.0
    2470  Puerto Rico           PR            total  2010  3721208.0    3515.0
    2519  United States         USA           total  2010  309326295.0  3790399.0
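For readers less familiar with query(), the same filter can be written as a boolean mask, and query() can also refer to a Python variable with the @ prefix. The sketch below uses a tiny made-up frame; target_year is an illustrative variable name.

```python
import pandas as pd

# Toy rows shaped like the population table
df = pd.DataFrame({
    'state': ['Alabama', 'Alabama', 'Alaska', 'Alaska'],
    'ages': ['under18', 'total', 'total', 'total'],
    'year': [2010, 2010, 2010, 2011],
    'population': [1.0, 2.0, 3.0, 4.0],
})

by_query = df.query('year == 2010 & ages == "total"')

# Equivalent boolean mask; each comparison needs its own parentheses
by_mask = df[(df['year'] == 2010) & (df['ages'] == 'total')]

# query() can reference a local variable with @
target_year = 2010
by_variable = df.query('year == @target_year and ages == "total"')

print(by_query)
```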

Process the query result by making the state column the new row index with set_index.

# At work, an id column is commonly used as the index
# set_index() returns a new frame, so assign the result back
abbrToPopToAreas_2010 = abbrToPopToAreas_2010.set_index('state')
abbrToPopToAreas_2010.head()

                state/region  ages   year  population  area (sq. mi)
    state
    Alabama     AL            total  2010  4785570.0   52423.0
    Alaska      AK            total  2010  713868.0    656425.0
    Arizona     AZ            total  2010  6408790.0   114006.0
    Arkansas    AR            total  2010  2922280.0   53182.0
    California  CA            total  2010  37333601.0  163707.0

Compute the population density, population / area. Dividing one Series by another yields a Series.

density_2010 = abbrToPopToAreas_2010['population'] / abbrToPopToAreas_2010['area (sq. mi)']
density_2010.head()
  • Output

    state
    Alabama        91.287603
    Alaska          1.087509
    Arizona        56.214497
    Arkansas       54.948667
    California    228.051342
    dtype: float64

Add the 2010 population density to the table.

# assign() returns a new frame with the extra column. Setting the column directly on the
# query result, abbrToPopToAreas_2010['density_2010'] = density_2010, triggers SettingWithCopyWarning.
abbrToPopToAreas_2010 = abbrToPopToAreas_2010.assign(density_2010=density_2010)

Sort with sort_values() (sorting by value) and find the five states with the highest population density.

abbrToPopToAreas_2010.sort_values(by='density_2010').tail()

                          state/region  ages   year  population  area (sq. mi)  density_2010
    state
    Connecticut           CT            total  2010  3579210.0   5544.0         645.600649
    Rhode Island          RI            total  2010  1052669.0   1545.0         681.339159
    New Jersey            NJ            total  2010  8802707.0   8722.0         1009.253268
    Puerto Rico           PR            total  2010  3721208.0   3515.0         1058.665149
    District of Columbia  DC            total  2010  605125.0    68.0           8898.897059

Find the five states with the lowest population density.

abbrToPopToAreas_2010.sort_values(by='density_2010').head()
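Sorting and then taking head() or tail() works, and pandas also provides nlargest and nsmallest for the same purpose. The sketch below uses a made-up density Series with placeholder labels A to F, not the real state values.

```python
import pandas as pd

# Made-up densities keyed by placeholder labels
density = pd.Series(
    {'A': 91.3, 'B': 1.1, 'C': 228.1, 'D': 8898.9, 'E': 56.2, 'F': 1009.3},
    name='density',
)

top3 = density.nlargest(3)      # highest first
bottom3 = density.nsmallest(3)  # lowest first

# Same selection via sorting
top3_sorted = density.sort_values(ascending=False).head(3)

print(top3)
print(bottom3)
```

Note that nlargest returns the highest value first, whereas sort_values(...).tail() lists the highest value last.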

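Key takeaways:

  • Use .loc[] consistently for indexing and assignment
  • Use .isnull().any() to find the columns that contain NaN
  • Use .unique() to see which keys in a column need attention
  • Prefer outer or left merges: it is better to leave NaN in one column than to drop the information in the other columns

Review: how Series/DataFrame arithmetic differs from ndarray arithmetic

  • Series and DataFrame operations align on index labels instead of broadcasting by position; a label with no counterpart yields NaN, unless you use add() with fill_value to supply the missing value
  • ndarray operations broadcast: existing values are repeated so that the shapes become compatible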
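The difference can be seen side by side in a short sketch. The Series, labels and arrays below are made up for illustration.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20], index=['b', 'c'])

# Label alignment: 'a' has no partner in s2, so the result there is NaN
aligned = s1 + s2

# fill_value=0 treats the missing partner as 0 instead
filled = s1.add(s2, fill_value=0)

# NumPy broadcasting: the 1-D row is repeated for each row of the 2-D array
arr = np.array([[1, 2, 3], [4, 5, 6]])
row = np.array([10, 20, 30])
broadcast = arr + row

print(aligned)
print(filled)
print(broadcast)
```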

Plotting Apple's stock price

import matplotlib.pyplot as plt
import numpy as np

Read the data

apple = pd.read_csv('./data/AAPL.csv')
apple.head()
# Adj Close: the adjusted closing price

             Date      Open      High       Low     Close  Adj Close       Volume
    0  1980-12-12  0.513393  0.515625  0.513393  0.513393   0.023268  117258400.0
    1  1980-12-15  0.488839  0.488839  0.486607  0.486607   0.022054   43971200.0
    2  1980-12-16  0.453125  0.453125  0.450893  0.450893   0.020435   26432000.0
    3  1980-12-17  0.462054  0.464286  0.462054  0.462054   0.020941   21610400.0
    4  1980-12-18  0.475446  0.477679  0.475446  0.475446   0.021548   18362400.0

apple.tail()

                Date        Open        High         Low       Close   Adj Close      Volume
    9453  2018-06-08  191.169998  192.000000  189.770004  191.699997  191.699997  26656800.0
    9454  2018-06-11  191.350006  191.970001  190.210007  191.229996  191.229996  18308500.0
    9455  2018-06-12  191.389999  192.610001  191.149994  192.279999  192.279999  16911100.0
    9456  2018-06-13  192.419998  192.880005  190.440002  190.699997  190.699997  21638400.0
    9457  2018-06-14  191.550003  191.570007  190.220001  190.800003  190.800003  21491500.0

apple.shape
  • Output

    (9458, 7)

Convert the Date column to a datetime type.
This corresponds to the DATETIME type in MySQL; in pandas, use pd.to_datetime().

apple['Date'] = pd.to_datetime(apple['Date'])
apple.dtypes
  • Output

    Date datetime64[ns]
    Open float64
    High float64
    Low float64
    Close float64
    Adj Close float64
    Volume float64
    dtype: object

apple.set_index('Date', inplace=True)
apple.tail()

                      Open        High         Low       Close   Adj Close      Volume
    Date
    2018-06-08  191.169998  192.000000  189.770004  191.699997  191.699997  26656800.0
    2018-06-11  191.350006  191.970001  190.210007  191.229996  191.229996  18308500.0
    2018-06-12  191.389999  192.610001  191.149994  192.279999  192.279999  16911100.0
    2018-06-13  192.419998  192.880005  190.440002  190.699997  190.699997  21638400.0
    2018-06-14  191.550003  191.570007  190.220001  190.800003  190.800003  21491500.0
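The conversion and set_index steps can also be folded into the read itself with the parse_dates and index_col parameters of read_csv. The sketch below reads a few rows from an in-memory string so that it runs without the AAPL.csv file; the two-column layout is a simplification for illustration.

```python
import io
import pandas as pd

# A few rows in the same shape as the file, reduced to two price columns
csv_text = """Date,Open,Close
1980-12-12,0.513393,0.513393
1980-12-15,0.488839,0.486607
1980-12-16,0.453125,0.450893
1980-12-17,0.462054,0.462054
"""

# parse_dates converts the column while reading; index_col makes it the index
apple = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'], index_col='Date')

# A DatetimeIndex supports slicing by date strings (both ends inclusive)
window = apple.loc['1980-12-15':'1980-12-16']

print(apple.index.dtype)
print(window)
```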

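Plotting

adj_plot = apple['Adj Close'].plot()
fig = adj_plot.get_figure()
# set_size_inches() sets the figure size in inches
fig.set_size_inches(12, 6)

[Figure: Adj Close over time]

apple.drop('Volume', axis=1, inplace=True)
app = apple.plot()
# Get the figure that the axes belong to
fig1 = app.get_figure()
# set_size_inches() returns None, so its result is not assigned back to fig1
fig1.set_size_inches(12, 6)

[Figure: Open, High, Low, Close and Adj Close over time]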
