Python--Pandas入门

数据结构Series

import numpy as np
import pandas as pd
构造和初始化series
s = pd.Series([7, 'Beijing', 3.14, -12345, 'Hanxiao'])
s[1]
'Beijing'
s
0          7
1    Beijing
2       3.14
3     -12345
4    Hanxiao
dtype: object

pandas默认用0到n作为Series的index,也可以自定义index(索引)。

s = pd.Series([7, 'Beijing', 3.14, -12345, 'Hanxiao'], index=['A', 'B', 'C', 'D', 'E'])
s
A          7
B    Beijing
C       3.14
D     -12345
E    Hanxiao
dtype: object
s['A']
7
s[['A', 'D', 'B']]
A          7
D     -12345
B    Beijing
dtype: object

可以用list构建Series,同时可以指定index。也可以用dict来初始化Series。

cities = {'Beijing':55000, 'Shanghai':60000, 'Shenzhen':50000, 'Hangzhou':30000, 'Guangzhou':40000, 'Suzhou':None}
cities
{'Beijing': 55000,
 'Shanghai': 60000,
 'Shenzhen': 50000,
 'Hangzhou': 30000,
 'Guangzhou': 40000,
 'Suzhou': None}
apt = pd.Series(cities, name='income')
apt
Beijing      55000.0
Shanghai     60000.0
Shenzhen     50000.0
Hangzhou     30000.0
Guangzhou    40000.0
Suzhou           NaN
Name: income, dtype: float64
# 索引
apt['Guangzhou']
40000.0
apt[4]
40000.0
apt[1:]
Shanghai     60000.0
Shenzhen     50000.0
Hangzhou     30000.0
Guangzhou    40000.0
Suzhou           NaN
Name: income, dtype: float64
apt[:-1]
Beijing      55000.0
Shanghai     60000.0
Shenzhen     50000.0
Hangzhou     30000.0
Guangzhou    40000.0
Name: income, dtype: float64
apt[[3,4,1]]
Hangzhou     30000.0
Guangzhou    40000.0
Shanghai     60000.0
Name: income, dtype: float64
apt[['Shanghai', 'Shenzhen']]
Shanghai    60000.0
Shenzhen    50000.0
Name: income, dtype: float64
# 简单的计算
3*apt
Beijing      165000.0
Shanghai     180000.0
Shenzhen     150000.0
Hangzhou      90000.0
Guangzhou    120000.0
Suzhou            NaN
Name: income, dtype: float64
apt/2.5
Beijing      22000.0
Shanghai     24000.0
Shenzhen     20000.0
Hangzhou     12000.0
Guangzhou    16000.0
Suzhou           NaN
Name: income, dtype: float64
# 基于索引做计算
apt[1:] + apt[:-1]
Beijing           NaN
Guangzhou     80000.0
Hangzhou      60000.0
Shanghai     120000.0
Shenzhen     100000.0
Suzhou            NaN
Name: income, dtype: float64
# in判断index是否存在
'Hangzhou' in apt
True
'Chongqing' in apt
False
# apt['Chongqing'] 报错
print(apt.get('Chongqing'))
None
print(apt.get('Guangzhou'))
40000.0
booling indexing/条件判断索引
apt>=40000
Beijing       True
Shanghai      True
Shenzhen      True
Hangzhou     False
Guangzhou     True
Suzhou       False
Name: income, dtype: bool
# 条件索引
apt[apt>=40000]
Beijing      55000.0
Shanghai     60000.0
Shenzhen     50000.0
Guangzhou    40000.0
Name: income, dtype: float64
# 统计计算
apt.mean()
47000.0
apt.median()
50000.0
apt.max()
60000.0
apt.min()
30000.0
Series赋值
apt
Beijing      55000.0
Shanghai     60000.0
Shenzhen     50000.0
Hangzhou     30000.0
Guangzhou    40000.0
Suzhou           NaN
Name: income, dtype: float64
apt['Shenzhen'] = 70000
apt
Beijing      55000.0
Shanghai     60000.0
Shenzhen     70000.0
Hangzhou     30000.0
Guangzhou    40000.0
Suzhou           NaN
Name: income, dtype: float64
# 条件赋值
apt[apt<=40000] = 45000
apt
Beijing      55000.0
Shanghai     60000.0
Shenzhen     70000.0
Hangzhou     45000.0
Guangzhou    45000.0
Suzhou           NaN
Name: income, dtype: float64
np.log(apt)
Beijing      10.915088
Shanghai     11.002100
Shenzhen     11.156251
Hangzhou     10.714418
Guangzhou    10.714418
Suzhou             NaN
Name: income, dtype: float64
cars = pd.Series({'Beijing':350000, 'Shanghai':400000, 'Shenzhen':300000, 'Tianjing':200000, 'Guangzhou':250000, 'Chongqing':150000})
cars
Beijing      350000
Shanghai     400000
Shenzhen     300000
Tianjing     200000
Guangzhou    250000
Chongqing    150000
dtype: int64
expense = cars + 10*apt
expense
Beijing       900000.0
Chongqing          NaN
Guangzhou     700000.0
Hangzhou           NaN
Shanghai     1000000.0
Shenzhen     1000000.0
Suzhou             NaN
Tianjing           NaN
dtype: float64
缺失数据处理
apt
Beijing      55000.0
Shanghai     60000.0
Shenzhen     70000.0
Hangzhou     45000.0
Guangzhou    45000.0
Suzhou           NaN
Name: income, dtype: float64
apt.notnull()
Beijing       True
Shanghai      True
Shenzhen      True
Hangzhou      True
Guangzhou     True
Suzhou       False
Name: income, dtype: bool
apt.isnull()
Beijing      False
Shanghai     False
Shenzhen     False
Hangzhou     False
Guangzhou    False
Suzhou        True
Name: income, dtype: bool
expense
Beijing       900000.0
Chongqing     900000.0
Guangzhou     700000.0
Hangzhou      900000.0
Shanghai     1000000.0
Shenzhen     1000000.0
Suzhou        900000.0
Tianjing      900000.0
dtype: float64
expense[expense.isnull()] = expense.mean()
expense
Beijing       900000.0
Chongqing     900000.0
Guangzhou     700000.0
Hangzhou      900000.0
Shanghai     1000000.0
Shenzhen     1000000.0
Suzhou        900000.0
Tianjing      900000.0
dtype: float64

数据结构DataFrame

二位数组

data = {'City':['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen', 'Hanghzhou', 'Chongqing'],
        'year':[2017, 2018, 2017, 2018, 2017, 2017],
        'popu':[2100, 2300, 1000, 700, 500, 500]
       }
pd.DataFrame(data)
Cityyearpopu
0Beijing20172100
1Shanghai20182300
2Guangzhou20171000
3Shenzhen2018700
4Hanghzhou2017500
5Chongqing2017500
pd.DataFrame(data, columns=['year','City','popu'])
yearCitypopu
02017Beijing2100
12018Shanghai2300
22017Guangzhou1000
32018Shenzhen700
42017Hanghzhou500
52017Chongqing500
pd.DataFrame(data, columns=['year','City','popu'], index=['one','two','three','four','five','six'])
yearCitypopu
one2017Beijing2100
two2018Shanghai2300
three2017Guangzhou1000
four2018Shenzhen700
five2017Hanghzhou500
six2017Chongqing500
# DataFrame可以视作Series的集合
apt
Beijing      55000.0
Shanghai     60000.0
Shenzhen     70000.0
Hangzhou     45000.0
Guangzhou    45000.0
Suzhou           NaN
Name: income, dtype: float64
cars
Beijing      350000
Shanghai     400000
Shenzhen     300000
Tianjing     200000
Guangzhou    250000
Chongqing    150000
dtype: int64
df = pd.DataFrame({'apt':apt,'cars':cars})
df
aptcars
Beijing55000.0350000.0
ChongqingNaN150000.0
Guangzhou45000.0250000.0
Hangzhou45000.0NaN
Shanghai60000.0400000.0
Shenzhen70000.0300000.0
SuzhouNaNNaN
TianjingNaN200000.0
# 取出一列
df['apt']
Beijing      55000.0
Chongqing        NaN
Guangzhou    45000.0
Hangzhou     45000.0
Shanghai     60000.0
Shenzhen     70000.0
Suzhou           NaN
Tianjing         NaN
Name: apt, dtype: float64
type(df['apt'])
pandas.core.series.Series
df[['apt']]
apt
Beijing55000.0
ChongqingNaN
Guangzhou45000.0
Hangzhou45000.0
Shanghai60000.0
Shenzhen70000.0
SuzhouNaN
TianjingNaN
type(df[['apt']])
pandas.core.frame.DataFrame
# 赋值
df
aptcars
Beijing55000.0350000.0
ChongqingNaN150000.0
Guangzhou45000.0250000.0
Hangzhou45000.0NaN
Shanghai60000.0400000.0
Shenzhen70000.0300000.0
SuzhouNaNNaN
TianjingNaN200000.0
df['bonus'] = 40000
df
aptcarsbonus
Beijing55000.0350000.040000
ChongqingNaN150000.040000
Guangzhou45000.0250000.040000
Hangzhou45000.0NaN40000
Shanghai60000.0400000.040000
Shenzhen70000.0300000.040000
SuzhouNaNNaN40000
TianjingNaN200000.040000
# 对两列做计算
df['expense'] = df['apt'] + df['bonus']
df
aptcarsbonusexpense
Beijing55000.0350000.04000095000.0
ChongqingNaN150000.040000NaN
Guangzhou45000.0250000.04000085000.0
Hangzhou45000.0NaN4000085000.0
Shanghai60000.0400000.040000100000.0
Shenzhen70000.0300000.040000110000.0
SuzhouNaNNaN40000NaN
TianjingNaN200000.040000NaN
df.index
Index(['Beijing', 'Chongqing', 'Guangzhou', 'Hangzhou', 'Shanghai', 'Shenzhen',
       'Suzhou', 'Tianjing'],
      dtype='object')
df.loc['Beijing']
apt         55000.0
cars       350000.0
bonus       40000.0
expense     95000.0
Name: Beijing, dtype: float64
type(df.loc['Beijing'])
pandas.core.series.Series
df.loc[['Beijing','Shanghai','Guangzhou']]
aptcarsbonusexpense
Beijing55000.0350000.04000095000.0
Shanghai60000.0400000.040000100000.0
Guangzhou45000.0250000.04000085000.0
df
aptcarsbonusexpense
Beijing55000.0350000.04000095000.0
ChongqingNaN150000.040000NaN
Guangzhou45000.0250000.04000085000.0
Hangzhou45000.0NaN4000085000.0
Shanghai60000.0400000.040000100000.0
Shenzhen70000.0300000.040000110000.0
SuzhouNaNNaN40000NaN
TianjingNaN200000.040000NaN
# loc高级用法
df.loc['Beijing':'Suzhou',['apt','bonus']]
aptbonus
Beijing55000.040000
ChongqingNaN40000
Guangzhou45000.040000
Hangzhou45000.040000
Shanghai60000.040000
Shenzhen70000.040000
SuzhouNaN40000
# 类似切片的用法
df.loc['Beijing':'Suzhou','apt':'bonus']
aptcarsbonus
Beijing55000.0350000.040000
ChongqingNaN150000.040000
Guangzhou45000.0250000.040000
Hangzhou45000.0NaN40000
Shanghai60000.0400000.040000
Shenzhen70000.0300000.040000
SuzhouNaNNaN40000
# 传入list的用法
df.loc[['Beijing','Suzhou'],['apt','bonus']]
aptbonus
Beijing55000.040000
SuzhouNaN40000
df.loc['Beijing','bonus'] = 50000
df
aptcarsbonusexpense
Beijing55000.0350000.05000095000.0
ChongqingNaN150000.040000NaN
Guangzhou45000.0250000.04000085000.0
Hangzhou45000.0NaN4000085000.0
Shanghai60000.0400000.040000100000.0
Shenzhen70000.0300000.040000110000.0
SuzhouNaNNaN40000NaN
TianjingNaN200000.040000NaN
# 对一列赋值
df.loc[:,'expense'] = 100000
df
aptcarsbonusexpense
Beijing55000.0350000.050000100000
ChongqingNaN150000.040000100000
Guangzhou45000.0250000.040000100000
Hangzhou45000.0NaN40000100000
Shanghai60000.0400000.040000100000
Shenzhen70000.0300000.040000100000
SuzhouNaNNaN40000100000
TianjingNaN200000.040000100000
df.shape
(8, 4)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 8 entries, Beijing to Tianjing
Data columns (total 4 columns):
apt        5 non-null float64
cars       6 non-null float64
bonus      8 non-null int64
expense    8 non-null int64
dtypes: float64(2), int64(2)
memory usage: 640.0+ bytes
df.T
BeijingChongqingGuangzhouHangzhouShanghaiShenzhenSuzhouTianjing
apt55000.0NaN45000.045000.060000.070000.0NaNNaN
cars350000.0150000.0250000.0NaN400000.0300000.0NaN200000.0
bonus50000.040000.040000.040000.040000.040000.040000.040000.0
expense100000.0100000.0100000.0100000.0100000.0100000.0100000.0100000.0
df
aptcarsbonusexpense
Beijing55000.0350000.050000100000
ChongqingNaN150000.040000100000
Guangzhou45000.0250000.040000100000
Hangzhou45000.0NaN40000100000
Shanghai60000.0400000.040000100000
Shenzhen70000.0300000.040000100000
SuzhouNaNNaN40000100000
TianjingNaN200000.040000100000
df.describe()
aptcarsbonusexpense
count5.0000006.0000008.0000008.0
mean55000.000000275000.00000041250.000000100000.0
std10606.60171893541.4346693535.5339060.0
min45000.000000150000.00000040000.000000100000.0
25%45000.000000212500.00000040000.000000100000.0
50%55000.000000275000.00000040000.000000100000.0
75%60000.000000337500.00000040000.000000100000.0
max70000.000000400000.00000050000.000000100000.0
df['cars']
Beijing      350000.0
Chongqing    150000.0
Guangzhou    250000.0
Hangzhou          NaN
Shanghai     400000.0
Shenzhen     300000.0
Suzhou            NaN
Tianjing     200000.0
Name: cars, dtype: float64
df['cars'] < 310000
Beijing      False
Chongqing     True
Guangzhou     True
Hangzhou     False
Shanghai     False
Shenzhen      True
Suzhou       False
Tianjing      True
Name: cars, dtype: bool
df.loc[:,'color'] = ['红','黄','紫','蓝','红','绿','棕','红']
df
aptcarsbonusexpensecolor
Beijing55000.0350000.050000100000
ChongqingNaN150000.040000100000
Guangzhou45000.0250000.040000100000
Hangzhou45000.0NaN40000100000
Shanghai60000.0400000.040000100000
Shenzhen70000.0300000.040000100000绿
SuzhouNaNNaN40000100000
TianjingNaN200000.040000100000
df['color'].isin(['红','绿'])
Beijing       True
Chongqing    False
Guangzhou    False
Hangzhou     False
Shanghai      True
Shenzhen      True
Suzhou       False
Tianjing      True
Name: color, dtype: bool
df[df['color'].isin(['红','绿'])]
aptcarsbonusexpensecolor
Beijing55000.0350000.050000100000
Shanghai60000.0400000.040000100000
Shenzhen70000.0300000.040000100000绿
TianjingNaN200000.040000100000
# 填充缺失值
df.fillna(value=50000)
#df.fillna(value=50000, inplace=True) 就地填充,改变原始值
aptcarsbonusexpensecolor
Beijing55000.0350000.050000100000
Chongqing50000.0150000.040000100000
Guangzhou45000.0250000.040000100000
Hangzhou45000.050000.040000100000
Shanghai60000.0400000.040000100000
Shenzhen70000.0300000.040000100000绿
Suzhou50000.050000.040000100000
Tianjing50000.0200000.040000100000
df
aptcarsbonusexpensecolor
Beijing55000.0350000.050000100000
ChongqingNaN150000.040000100000
Guangzhou45000.0250000.040000100000
Hangzhou45000.0NaN40000100000
Shanghai60000.0400000.040000100000
Shenzhen70000.0300000.040000100000绿
SuzhouNaNNaN40000100000
TianjingNaN200000.040000100000
df.fillna(method='ffill') #缺失值使用前一行的数据填充
aptcarsbonusexpensecolor
Beijing55000.0350000.050000100000
Chongqing55000.0150000.040000100000
Guangzhou45000.0250000.040000100000
Hangzhou45000.0250000.040000100000
Shanghai60000.0400000.040000100000
Shenzhen70000.0300000.040000100000绿
Suzhou70000.0300000.040000100000
Tianjing70000.0200000.040000100000
# 向后填充
df.fillna(method='bfill')
aptcarsbonusexpensecolor
Beijing55000.0350000.050000100000
Chongqing45000.0150000.040000100000
Guangzhou45000.0250000.040000100000
Hangzhou45000.0400000.040000100000
Shanghai60000.0400000.040000100000
Shenzhen70000.0300000.040000100000绿
SuzhouNaN200000.040000100000
TianjingNaN200000.040000100000

从外部读入文件

stock = pd.read_csv('stock_px.csv')
stock.head() #显示前5行
DATEAAAAPLGEIBMJNJMSFTPEPSPXXOM
01990-2-1 0:004.987.862.8716.794.270.516.04328.796.12
11990-2-2 0:005.048.002.8716.894.370.516.09330.926.24
21990-2-5 0:005.078.182.8717.324.340.516.05331.856.25
31990-2-6 0:005.018.122.8817.564.320.516.15329.666.23
41990-2-7 0:005.047.772.9117.934.380.516.17333.756.33
#将第一列(时间)作为索引
stock = pd.read_csv('stock_px.csv', index_col=0, parse_dates=['DATE']) #解析成日期型,作用见后续
stock.head(6)
AAAAPLGEIBMJNJMSFTPEPSPXXOM
DATE
1990-02-014.987.862.8716.794.270.516.04328.796.12
1990-02-025.048.002.8716.894.370.516.09330.926.24
1990-02-055.078.182.8717.324.340.516.05331.856.25
1990-02-065.018.122.8817.564.320.516.15329.666.23
1990-02-075.047.772.9117.934.380.516.17333.756.33
1990-02-085.047.712.9217.864.460.516.22332.966.35
stock.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5472 entries, 1990-02-01 to 2011-10-14
Data columns (total 9 columns):
AA      5472 non-null float64
AAPL    5472 non-null float64
GE      5472 non-null float64
IBM     5472 non-null float64
JNJ     5472 non-null float64
MSFT    5472 non-null float64
PEP     5471 non-null float64
SPX     5472 non-null float64
XOM     5472 non-null float64
dtypes: float64(9)
memory usage: 427.5 KB
stock.describe()
AAAAPLGEIBMJNJMSFTPEPSPXXOM
count5472.0000005472.0000005472.0000005472.0000005472.0000005472.0000005471.0000005472.0000005472.000000
mean17.44028557.11931317.93319466.63773034.22534017.04634534.284204945.03521635.225919
std9.64799988.67042310.64763541.68948119.72666611.00098818.383894369.49467223.967647
min4.2000003.2300002.4000008.4000004.2000000.5100005.870000295.4600005.940000
25%8.0775008.7600006.28000020.57750012.1600004.24000015.480000547.21750011.910000
50%14.88500011.99000018.15000074.11500037.57000020.91000033.7500001058.30500030.280000
75%26.34000068.01750027.30000095.65750052.67500025.31000049.5300001253.39500055.300000
max43.620000422.00000042.780000190.53000067.32000046.81000071.2500001565.15000087.480000
stock.tail()  #默认打印后5行
AAAAPLGEIBMJNJMSFTPEPSPXXOM
DATE
2011-10-1010.09388.8116.14186.6264.4326.9461.871194.8976.28
2011-10-1110.30400.2916.14185.0063.9627.0060.951195.5476.27
2011-10-1210.05402.1916.40186.1264.3326.9662.701207.2577.16
2011-10-1310.10408.4316.22186.8264.2327.1862.361203.6676.37
2011-10-1410.26422.0016.60190.5364.7227.2762.241224.5878.11
stock.index
DatetimeIndex(['1990-02-01', '1990-02-02', '1990-02-05', '1990-02-06',
               '1990-02-07', '1990-02-08', '1990-02-09', '1990-02-12',
               '1990-02-13', '1990-02-14',
               ...
               '2011-10-03', '2011-10-04', '2011-10-05', '2011-10-06',
               '2011-10-07', '2011-10-10', '2011-10-11', '2011-10-12',
               '2011-10-13', '2011-10-14'],
              dtype='datetime64[ns]', name='DATE', length=5472, freq=None)
  • 解析为日期型的作用
stock.loc[:,'dow'] = stock.index.dayofweek  #添加一个名为'dow'的新列,其内容为该日为星期几
stock.head()
AAAAPLGEIBMJNJMSFTPEPSPXXOMdow
DATE
1990-02-014.987.862.8716.794.270.516.04328.796.123
1990-02-025.048.002.8716.894.370.516.09330.926.244
1990-02-055.078.182.8717.324.340.516.05331.856.250
1990-02-065.018.122.8817.564.320.516.15329.666.231
1990-02-075.047.772.9117.934.380.516.17333.756.332
stock.loc[:,'doy'] = stock.index.dayofyear  #添加一个名为'doy'的新列,其内容为一年当中的第几天
stock.head()
AAAAPLGEIBMJNJMSFTPEPSPXXOMdowdoy
DATE
1990-02-014.987.862.8716.794.270.516.04328.796.12332
1990-02-025.048.002.8716.894.370.516.09330.926.24433
1990-02-055.078.182.8717.324.340.516.05331.856.25036
1990-02-065.018.122.8817.564.320.516.15329.666.23137
1990-02-075.047.772.9117.934.380.516.17333.756.33238
  • 作图
%matplotlib inline
stock['JNJ'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x19bfb184788>

在这里插入图片描述

stock['JNJ'].plot(grid=True)  #网格线
<matplotlib.axes._subplots.AxesSubplot at 0x19bfc4b9188>

在这里插入图片描述

  • 条件与或非

    | 或

    & 且

    ! 非
sto = stock[(stock.index >= '2009-01-01') & (stock.index <= '2009-1-31')] #2009年1月份的数据
sto['JNJ'].plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x19bfc7eaa48>

在这里插入图片描述

stock[(stock.index >= '2009-01-01') & (stock.index < '2009-1-31')].describe()
AAAAPLGEIBMJNJMSFTPEPSPXXOMdowdoy
count20.00000020.00000020.00000020.0000020.0000020.00000020.00000020.00000020.00000020.0000020.000000
mean9.18700088.77500012.94750083.6595052.8930017.76800047.889500865.57550073.1705002.2000016.650000
std1.3642084.3778311.5323593.344271.160831.1352611.69464938.8742331.5908111.436378.851256
min7.45000078.20000010.86000077.8700050.9400016.06000046.150000805.22000070.2300000.000002.000000
25%8.05750087.11500011.74750080.9600052.1500016.58250046.490000839.32250071.9300001.000008.750000
50%8.80500089.88500012.52500083.1900052.8900018.14000047.345000847.91500073.1000002.0000015.500000
75%10.36000091.43250014.47500086.8725053.7175018.54500048.457500894.42500073.9975003.2500023.750000
max11.60000094.58000015.42000090.0600055.2000019.50000051.420000934.70000076.3500004.0000030.000000
  • 写入文件
sto.to_csv('sto.csv')
sto.to_csv('n_sto.csv', index=False) #不加索引
  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值