Pandas for Machine Learning: A Tutorial (Part 2)

深城肥肠

Article link: https://blog.csdn.net/weixin_41863685/article/details/79795830

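Part 2 of this pandas tutorial picks up where the previous part left off and covers more of the pandas operations that come up most often in machine learning work. Follow along step by step and practice, and you will have a working grasp of these everyday operations.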

1. GroupBy and Aggregate

# Import the required libraries
import pandas as pd
import numpy as np
# %matplotlib inline   # IPython/Jupyter magic; only valid inside a notebook

# Build the data
# As an example, suppose we have a record of each employee's pay at a company.
salaries = pd.DataFrame({
    'Name': ['xiaoming', 'xiaohong', 'lisi', 'zhaowu', 'xiaohong', 'lisi', 'zhaowu', 'lisi'],
    'Year': [2016,2016,2016,2016,2017,2017,2017,2017],
    'Salary': [10000,2000,4000,5000,18000,25000,3000,4000],
    'Bonus': [3000,1000,1000,1200,4000,2300,500,1000]
})
print(salaries)

Output (older pandas versions sort the columns alphabetically; newer versions keep the order given in the dict):
   Bonus      Name  Salary  Year
0   3000  xiaoming   10000  2016
1   1000  xiaohong    2000  2016
2   1000      lisi    4000  2016
3   1200    zhaowu    5000  2016
4   4000  xiaohong   18000  2017
5   2300      lisi   25000  2017
6    500    zhaowu    3000  2017
7   1000      lisi    4000  2017

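1.1 The GroupBy operation

group_by_name = salaries.groupby('Name')
group_by_name

# Output
# The result shows that groupby returns a GroupBy object, not a DataFrame.

1.2 Working with the GroupBy object

groupby builds a GroupBy object, and we can run all sorts of operations on it, such as taking a sum. The operations available after grouping are covered in more detail below.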

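1.2.1 Aggregating after groupby

(1). sum

# sum
group_by_name.sum()
# or, equivalently, name the aggregation as a string
group_by_name.aggregate("sum")
# Both forms give the same result

Output:
          Bonus  Salary  Year
Name                         
lisi       4300   33000  6050
xiaohong   5000   20000  4033
xiaoming   3000   10000  2016
zhaowu     1700    8000  4033

groupby on its own only splits the rows into groups and computes nothing; calling sum() then adds up each column within every group.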
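To make the split-then-aggregate distinction concrete, here is a minimal sketch. The `pay` frame and its values are made up for illustration (a cut-down version of the salary table), and it only shows that the GroupBy object computes nothing until an aggregation is called.

```python
import pandas as pd

# Hypothetical miniature of the salary table (illustrative values only)
pay = pd.DataFrame({
    "Name": ["lisi", "zhaowu", "lisi"],
    "Salary": [4000, 5000, 25000],
})

grouped = pay.groupby("Name")      # only records how the rows are split
totals = grouped["Salary"].sum()   # the aggregation actually runs here

print(type(grouped).__name__)      # DataFrameGroupBy
print(totals)
```

Selecting a column before aggregating, as in `grouped["Salary"]`, keeps the work limited to the columns you actually need.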

(2). Grouping with a custom function

# Test data
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8),
                   'E' : np.random.randn(8)})
print(df)

Output:
   A      B         C         D         E
0  foo    one  1.204367  2.293298  1.040369
1  bar    one  0.929188 -1.177558  0.758965
2  foo    two -0.072676  1.638526  1.084714
3  bar  three -0.051892  0.889456 -0.076172
4  foo    two  0.699553  0.764503 -0.144055
5  bar    two  0.353203  2.640532  1.756924
6  foo    one -0.569656 -2.312519  0.461082
7  foo  three -0.686681 -1.653378 -0.973734

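# Define the grouping function
def get_letter_type(letter):
    if letter.lower() in "aeiou":   # is the column label a vowel?
        return "vowel"
    else:
        return "consonant"

# Pass the function as the grouping key; it is called on every column label.
# The original call was df.groupby(get_letter_type, axis=1), which is deprecated
# since pandas 2.1. Transposing the numeric columns and grouping the rows
# gives the same grouping.
grouped = df[['C', 'D', 'E']].T.groupby(get_letter_type)

# Aggregate within the groups defined by the function, then transpose back
grouped.sum().T

Output: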
    consonant     vowel
0   3.497665  1.040369
1  -0.248370  0.758965
2   1.565849  1.084714
3   0.837564 -0.076172
4   1.464057 -0.144055
5   2.993735  1.756924
6  -2.882175  0.461082
7  -2.340059 -0.973734

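(3). Sorting the group keys

By default groupby sorts the group keys; pass sort=False to keep them in order of first appearance.

salaries.groupby('Name', sort=False).sum()

Output:
          Bonus  Salary  Year
Name                         
xiaoming   3000   10000  2016
xiaohong   5000   20000  4033
lisi       4300   33000  6050
zhaowu     1700    8000  4033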
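The effect of sort is easiest to see on the resulting index. A small sketch, assuming a made-up frame `df_sort` whose keys first appear in the order b, a, c:

```python
import pandas as pd

# Hypothetical data: keys first appear in the order b, a, c
df_sort = pd.DataFrame({"key": ["b", "a", "b", "c"], "val": [1, 2, 3, 4]})

sorted_keys = df_sort.groupby("key").sum().index.tolist()
first_seen_keys = df_sort.groupby("key", sort=False).sum().index.tolist()

print(sorted_keys)      # ['a', 'b', 'c']
print(first_seen_keys)  # ['b', 'a', 'c']
```

Only the order of the result rows changes; the aggregated values are the same either way.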

1.3 GroupBy attributes

# Show which row labels belong to each group, then count the groups
print(group_by_name.groups)
print(len(group_by_name))
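As a rough illustration of what these attributes hold, the sketch below uses a made-up three-row frame (`df_g` is a hypothetical name) and converts the `groups` mapping to plain lists so it is easy to read.

```python
import pandas as pd

# Hypothetical three-row table
df_g = pd.DataFrame({"Name": ["lisi", "zhaowu", "lisi"],
                     "Salary": [4000, 5000, 25000]})
g = df_g.groupby("Name")

# .groups maps each group key to the row labels that fall in that group
row_labels = {key: list(labels) for key, labels in g.groups.items()}

print(row_labels)
print(g.ngroups)    # 2, the same number len(g) returns
```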

1.4 Grouping by several columns

# Group by more than one column
group_by_name_year = salaries.groupby(['Name', 'Year'])
group_by_name_year.sum()

Output:
               Bonus  Salary
Name     Year               
lisi     2016   1000    4000
         2017   3300   29000
xiaohong 2016   1000    2000
         2017   4000   18000
xiaoming 2016   3000   10000
zhaowu   2016   1200    5000
         2017    500    3000

1.5 Common operations: size, mean, median

group_by_name_year.size()

Output:
Name      Year
lisi      2016    1
          2017    2
xiaohong  2016    1
          2017    1
xiaoming  2016    1
zhaowu    2016    1
          2017    1
dtype: int64

group_by_name.mean()

Output:
                Bonus   Salary         Year
Name                                       
lisi      1433.333333  11000.0  2016.666667
xiaohong  2500.000000  10000.0  2016.500000
xiaoming  3000.000000  10000.0  2016.000000
zhaowu     850.000000   4000.0  2016.500000

group_by_name.median()

Output:
           Bonus   Salary    Year
Name                             
lisi      1000.0   4000.0  2017.0
xiaohong  2500.0  10000.0  2016.5
xiaoming  3000.0  10000.0  2016.0
zhaowu     850.0   4000.0  2016.5

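1.6 Summary statistics with describe()

group_by_name.describe()

# Prints summary statistics for each group: count, mean, standard deviation, min, quartiles, and max.

1.7 Excluding a column from the aggregation

# Drop the columns that should not be aggregated before grouping. For example, a date column such as Year has no useful median.
salaries.loc[:,salaries.columns != "Year"].groupby('Name').median()

Output:
          Bonus  Salary
Name                   
lisi       1000    4000
xiaohong   2500   10000
xiaoming   3000   10000
zhaowu      850    4000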

1.8 Iterating over a GroupBy object

# Pull out each group in turn
for name, group in group_by_name:
    print(name)
    print(group)

Output:
lisi
   Bonus  Name  Salary  Year
2   1000  lisi    4000  2016
5   2300  lisi   25000  2017
7   1000  lisi    4000  2017
xiaohong
   Bonus      Name  Salary  Year
1   1000  xiaohong    2000  2016
4   4000  xiaohong   18000  2017
xiaoming
   Bonus      Name  Salary  Year
0   3000  xiaoming   10000  2016
zhaowu
   Bonus    Name  Salary  Year
3   1200  zhaowu    5000  2016
6    500  zhaowu    3000  2017

1.9 Selecting a single group

print(group_by_name.get_group("lisi"))
type(group_by_name.get_group("lisi"))

Output:
   Bonus  Salary  Year
2   1000    4000  2016
5   2300   25000  2017
7   1000    4000  2017

pandas.core.frame.DataFrame

# agg can also apply several aggregations at once
group_by_name.agg(["sum", "mean", "std"])

# Each group gets sum, mean, and std, so every input column produces three result columns

1.10 Aggregating specific columns

# Map each column to the aggregation it should receive
group_by_name.agg({"Bonus": "sum", "Salary": "sum"})

Output:
          Bonus  Salary
Name                   
lisi       4300   33000
xiaohong   5000   20000
xiaoming   3000   10000
zhaowu     1700    8000
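When different columns need different aggregations, pandas (0.25 and later) also supports keyword-style "named aggregation", which lets you choose the output column names. A minimal sketch; the `pay` frame and the names `total_bonus` and `top_salary` are invented for this example:

```python
import pandas as pd

# Hypothetical data, loosely modeled on the salary table
pay = pd.DataFrame({
    "Name": ["lisi", "zhaowu", "lisi"],
    "Salary": [4000, 5000, 25000],
    "Bonus": [1000, 1200, 2300],
})

# Each keyword is an output column: (input column, aggregation)
summary = pay.groupby("Name").agg(
    total_bonus=("Bonus", "sum"),
    top_salary=("Salary", "max"),
)
print(summary)
```

This form avoids the nested column index that a list of aggregations produces.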

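2. transform

transform converts every record in a group by the same rule and returns one value per original row.

# Load the data
# Use the first column as the index
nvda = pd.read_csv("data/NVDA.csv", index_col=0)
nvda.head()

# Define a function that takes the year from each index label, and group by it
key = lambda x: x[:4]

nvda.groupby(key).mean()

zscore = lambda x: (x - x.mean()) / x.std()

# transform applies the given function within each group.
# groupby splits the rows by year, but transform does not collapse them: each day is standardized against its own year.
transformed = nvda.groupby(key).transform(zscore)
transformed.head()

price_range = lambda x: x.max() - x.min()
nvda.groupby(key).transform(price_range).head()

# Broadcast each year's maximum of every column to all rows of that year
nvda.groupby(key).transform("max").head()

# The same yearly price range as above, built from the "max" and "min" shortcuts
(nvda.groupby(key).transform("max") - nvda.groupby(key).transform("min")).head()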
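Since NVDA.csv is not included here, the following sketch uses a tiny made-up price table (`prices`, with invented values and date-string labels) to show the key difference: an aggregation returns one row per group, while transform returns one row per original record.

```python
import pandas as pd

# Hypothetical prices indexed by date strings, standing in for NVDA.csv
prices = pd.DataFrame(
    {"Close": [10.0, 12.0, 14.0, 20.0, 30.0]},
    index=["2016-01-04", "2016-01-05", "2016-01-06", "2017-01-03", "2017-01-04"],
)
year = lambda label: label[:4]   # same idea as `key` in the text

yearly_mean = prices.groupby(year).mean()                    # one row per year
demeaned = prices - prices.groupby(year).transform("mean")   # one row per day

print(yearly_mean.shape)   # (2, 1)
print(demeaned.shape)      # (5, 1)
```

Because the transformed result keeps the original index, it can be subtracted from or assigned back to the source table directly.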

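3. filter

filter keeps or drops whole groups according to a group-level statistic.

# After groupby, the function passed to filter is called once per group and must return True or False; groups that return False are dropped.
# Unlike transform, which computes a new value for every row, filter returns the original rows unchanged.
s = pd.Series([1,1,2,2,2,3,4,4,5])
s.groupby(s).filter(lambda x: x.sum() > 4)

# Output (groups whose sum is greater than 4 are kept)

2    2
3    2
4    2
6    4
7    4
8    5
dtype: int64

# Keep the groups that have more than two rows
df = pd.DataFrame({"A": np.arange(8), "B":list("aaabbbcc")})
df.groupby("B").filter(lambda x: len(x) > 2)

key_month = lambda x: x[0:7]

# Keep every day that falls in a month whose mean Adj Close is above 50
nvda.groupby(key_month).filter(lambda x: x["Adj Close"].mean() > 50).head()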
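The monthly filter can be tried without the NVDA file. This sketch assumes a small invented table `daily` with date-string labels, and also shows an equivalent formulation with transform and a boolean mask, which is one common alternative rather than the author's method.

```python
import pandas as pd

# Hypothetical daily prices: January averages 42.5, February averages 60
daily = pd.DataFrame(
    {"Adj Close": [40.0, 45.0, 55.0, 65.0]},
    index=["2017-01-03", "2017-01-04", "2017-02-01", "2017-02-02"],
)
month = lambda label: label[:7]

# filter: the function returns one True/False per group
kept = daily.groupby(month).filter(lambda g: g["Adj Close"].mean() > 50)

# Same rows via transform: broadcast the monthly mean, then mask
mask = daily.groupby(month)["Adj Close"].transform("mean") > 50
same_rows = daily[mask]

print(kept.index.tolist())   # ['2017-02-01', '2017-02-02']
```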

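4. apply

Any of the aggregations above can also be done by defining a function and passing it to apply.

# First group by month
nvda_month = nvda.groupby(key_month)
nvda_month

# apply calls f once per group. Here f receives one month of Adj Close values
# and returns a DataFrame with one row per day.
def f(group):
    return pd.DataFrame({"original": group,
                         "demeaned": group - group.mean()})
nvda_month["Adj Close"].apply(f).tail()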
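To illustrate the claim that an aggregation can be expressed through apply, the sketch below computes a per-month price range both ways on made-up data. The `daily` frame and the helper `price_range` are hypothetical names introduced for this example.

```python
import pandas as pd

# Hypothetical daily prices with an explicit month column
daily = pd.DataFrame({
    "month": ["2017-01", "2017-01", "2017-02", "2017-02"],
    "Adj Close": [40.0, 45.0, 55.0, 65.0],
})

def price_range(values):
    """Spread between the highest and lowest value in one group."""
    return values.max() - values.min()

by_apply = daily.groupby("month")["Adj Close"].apply(price_range)
by_agg = daily.groupby("month")["Adj Close"].agg(price_range)

print(by_apply.tolist())                      # [5.0, 10.0]
print(by_apply.tolist() == by_agg.tolist())   # True
```

apply is the more general tool, since the function may return a scalar, a Series, or a DataFrame; agg is the clearer choice when each group reduces to a single value.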

5. Matching and combining tables

Tables can be combined with concat, append, merge, and join.

5.1 concat

df1 = pd.DataFrame({'apts': [55000, 60000],
                   'cars': [200000, 300000],},
                  index = ['Shanghai', 'Beijing'])
df1

df2 = pd.DataFrame({'cars': [150000, 120000],
                    'apts': [25000, 20000],
                   },
                  index = ['Hangzhou', 'Nanjing'])
df2

df3 = pd.DataFrame({'apts': [30000, 10000],
                   'cars': [180000, 100000],},
                  index = ['Guangzhou', 'Chongqing'])
df3

(1). Concatenating along rows

# A tuple works here as well as a list
# Concatenate with the default settings
frames = [df1, df2, df3]
result = pd.concat(frames)
result

(2). Adding keys when concatenating

# Passing keys to concat labels each piece with its own key, which builds a hierarchical index
result2 = pd.concat(frames, keys=['x', 'y', 'z'])
result2

# Select the piece labeled 'y' (the rows that came from df2)
result2.loc["y"]

df4 = pd.DataFrame({'salaries': [10000, 30000, 30000, 20000, 15000]},
                  index = ['Suzhou', 'Beijing', 'Shanghai', 'Guangzhou', 'Tianjin'])
df4

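(3). Concatenating along columns

# Concatenate along columns; cells with no matching row label become NaN
result3 = pd.concat([result, df4], axis=1)
result3

(4). Concatenating when there are NaNs

outer: matched rows, left-only rows, and right-only rows (the missing side is filled with NaN)
inner: matched rows only (a row missing from either side is dropped)
left: matched rows plus left-only rows
right: matched rows plus right-only rows

pd.concat accepts only 'outer' and 'inner' for join; 'left' and 'right' are available in merge and join.

# Drop any row that is missing from either side
result3 = pd.concat([result, df4], axis=1, join='inner')
result3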
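The outer and inner behaviors can be compared on two small frames. This is a minimal sketch with invented values; `left` and `right` are hypothetical names and share only the label 'Beijing'.

```python
import pandas as pd

# Hypothetical frames that share only the row label 'Beijing'
left = pd.DataFrame({"apts": [55000, 60000]}, index=["Shanghai", "Beijing"])
right = pd.DataFrame({"salaries": [30000, 10000]}, index=["Beijing", "Suzhou"])

outer = pd.concat([left, right], axis=1)                 # default join='outer'
inner = pd.concat([left, right], axis=1, join="inner")   # shared labels only

print(sorted(outer.index))             # ['Beijing', 'Shanghai', 'Suzhou']
print(list(inner.index))               # ['Beijing']
print(int(outer.isna().sum().sum()))   # 2
```

The two NaN cells in the outer result are the salary for Shanghai and the apartment price for Suzhou, the values that exist on only one side.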

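5.2 append

Stacks the rows of one table under another.

# Stack the second table below the first; this works along rows only.
# DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so pd.concat is used here.
# On older versions, df1.append(df2) gives the same result.
pd.concat([df1, df2])

pd.concat([df1, df4])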

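5.3 Concatenating a Series with a DataFrame

The Series is first converted to a DataFrame and then joined, since a Series behaves like a DataFrame with a single column.

(1). Using concat

s1 = pd.Series([60, 50], index=['Shanghai', 'Beijing'], name='meal')
s1

Output:

    Shanghai    60
    Beijing     50
    Name: meal, dtype: int64

# Concatenate along columns
print(pd.concat([df1, s1], axis=1))

Output:

               apts    cars  meal
    Shanghai  55000  200000    60
    Beijing   60000  300000    50

(2). Appending a row to a DataFrame

# The name is required here because it becomes the index label of the new row.

s2 = pd.Series([18000, 12000], index=['apts', 'cars'], name='Xiamen') 
print(s2)

Output:

    apts    18000
    cars    12000
    Name: Xiamen, dtype: int64

# The original df1.append(s2) returned a new DataFrame and left df1 unchanged,
# and append no longer exists in pandas 2.0. Turn the Series into a one-row
# DataFrame, concatenate, and keep the result.
df1 = pd.concat([df1, s2.to_frame().T])
df1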
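Appending a row does not modify the table in place, which is easy to overlook. The sketch below, using invented frames named `cities` and `new_row`, adds a row with pd.concat (the approach that works on current pandas versions) and confirms that the original is untouched.

```python
import pandas as pd

# Hypothetical table and a new row whose name becomes the index label
cities = pd.DataFrame({"apts": [55000, 60000], "cars": [200000, 300000]},
                      index=["Shanghai", "Beijing"])
new_row = pd.Series({"apts": 18000, "cars": 12000}, name="Xiamen")

# to_frame().T turns the Series into a one-row DataFrame
extended = pd.concat([cities, new_row.to_frame().T])

print(list(extended.index))   # ['Shanghai', 'Beijing', 'Xiamen']
print(len(cities))            # 2, the original is unchanged
```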

5.4 merge

(1). Default join

df1 = pd.DataFrame({'apts': [55000, 60000, 58000],
                   'cars': [200000, 300000,250000],
                  'city': ['Shanghai', 'Beijing','Shenzhen']})
df1

df4 = pd.DataFrame({'salaries': [10000, 30000, 30000, 20000, 15000],
                  'city': ['Suzhou', 'Beijing', 'Shanghai', 'Guangzhou', 'Tianjin']})
df4

# Merge on the city column. Only rows that match on both sides are returned (an inner join).
result = pd.merge(df1, df4, on='city')
result

# on may be omitted; merge then uses the columns the two tables have in common
result = pd.merge(df1, df4)
result

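(2). Choosing the join type

how accepts 'outer', 'right', and 'left', in addition to the default 'inner'.

# how='outer': keep the cities from both tables
result = pd.merge(df1, df4, on='city', how='outer')
result

# how='right': keep every city in df4
result = pd.merge(df1, df4, on='city', how='right')
result

# how='left': keep every city in df1
result = pd.merge(df1, df4, on='city', how='left')
result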
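One way to see what each join type keeps is to count the result rows and to ask merge where each row came from. This sketch uses two invented tables, `homes` and `wages`, and the `indicator=True` option of pd.merge, which adds a `_merge` column.

```python
import pandas as pd

# Hypothetical tables sharing the cities Shanghai and Beijing
homes = pd.DataFrame({"city": ["Shanghai", "Beijing", "Shenzhen"],
                      "apts": [55000, 60000, 58000]})
wages = pd.DataFrame({"city": ["Suzhou", "Beijing", "Shanghai"],
                      "salaries": [10000, 30000, 30000]})

# Row count for every join type
sizes = {how: len(pd.merge(homes, wages, on="city", how=how))
         for how in ["inner", "left", "right", "outer"]}
print(sizes)   # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}

# indicator=True records the origin of each row in a `_merge` column
merged = pd.merge(homes, wages, on="city", how="outer", indicator=True)
origin = dict(zip(merged["city"], merged["_merge"].astype(str)))
print(origin)
```

In `origin`, cities found in both tables are marked 'both', while Shenzhen is 'left_only' and Suzhou is 'right_only'.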

(3). Merging on several columns

# DataFrame must be referenced through pd, since only pandas itself was imported
left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})

right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})
print(left)
print(right)
pd.merge(left, right, on=['key1', 'key2'], how='outer') # merge on both key columns

(4). Merging with overlapping column names and setting suffixes

left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})

right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})
print(left)
print(right)
# key2 exists in both tables but is not a merge key, so each copy gets a suffix
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))
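A short sketch of how suffixes affect the resulting column names, using trimmed-down, made-up versions of the two tables (`small_left` and `small_right` are hypothetical names):

```python
import pandas as pd

# Both hypothetical tables have a key2 column, but only key1 is used as the merge key
small_left = pd.DataFrame({"key1": ["foo", "bar"], "key2": ["one", "one"], "lval": [1, 3]})
small_right = pd.DataFrame({"key1": ["foo", "bar"], "key2": ["one", "two"], "rval": [4, 7]})

merged_cols = pd.merge(small_left, small_right, on="key1",
                       suffixes=("_left", "_right"))

print(list(merged_cols.columns))
# ['key1', 'key2_left', 'lval', 'key2_right', 'rval']
```

Only the overlapping non-key column is renamed; `lval` and `rval` keep their names because they appear in one table each.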

5.5 join on index

Combines tables by matching their index labels.

left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                     index=['a', 'c', 'e'],
                     columns=['Ohio', 'Nevada'])
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
                      index=['b', 'c', 'd', 'e'],
                      columns=['Missouri', 'Alabama'])

print(left2)
print(right2)
# join matches on the index by default
left2.join(right2, how='outer')

another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]],
                       index=['a', 'c', 'e', 'f'],
                       columns=['New York', 'Oregon'])
# right2 is the same frame defined above
print(another)
print(right2)
# Join several DataFrames at once. The default is a left join on left2's index;
# pass how='outer' for a full join.
left2.join([right2, another])
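The default join type matters when joining a list of frames. The sketch below uses three small invented frames (`a`, `b`, `c`) to contrast the default left join with how='outer'.

```python
import pandas as pd

# Hypothetical frames with partly overlapping index labels
a = pd.DataFrame({"Ohio": [1.0, 3.0]}, index=["a", "c"])
b = pd.DataFrame({"Missouri": [7.0, 9.0]}, index=["b", "c"])
c = pd.DataFrame({"Oregon": [8.0, 17.0]}, index=["a", "f"])

left_join = a.join([b, c])                # default how='left': keeps a's labels
full_join = a.join([b, c], how="outer")   # union of all labels

print(list(left_join.index))     # ['a', 'c']
print(sorted(full_join.index))   # ['a', 'b', 'c', 'f']
```

Labels that exist only in `b` or `c` appear in the result only when how='outer' is requested; otherwise they are dropped and missing cells for the kept labels are NaN.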