Learning Python: Data Analysis Libraries, Part 2 (Pandas)

step-forward

Article link: https://blog.csdn.net/weixin_45734982/article/details/106208191

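I. What is Pandas
pandas is a tool built on NumPy, created for data analysis tasks.
It incorporates a number of libraries and standard data models and provides the tools needed to work efficiently with large data sets.
It also offers a large collection of functions and methods for processing data quickly and conveniently.
It is one of the key reasons Python is a powerful and efficient environment for data analysis.
More information about Pandas: https://pandas.pydata.org/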

II. Installing Pandas
Fast install from the Douban mirror: pip install -i https://pypi.douban.com/simple pandas

Standard install: pip install pandas

Import Pandas. NumPy is almost always used alongside it, so import both:

import numpy as np
import pandas as pd

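III. Basic Pandas usage
1. Creating a Series with pandas
Pandas is a powerful toolkit for analyzing structured data. It is built on NumPy (which provides high-performance array operations) and is used for data mining and data analysis; it also offers data-cleaning features.
Core tool 1: Series
A one-dimensional, array-like object made up of a set of data (of any NumPy dtype) and an associated set of data labels (the index). A simple Series can also be created from the data alone.
Core tool 2: DataFrame
A tabular data structure in Pandas that holds an ordered collection of columns, each of which may have a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index and can be viewed as a dict of Series.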
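
The "dict of Series" view can be illustrated with a minimal sketch. The column names and values below are hypothetical sample data, not taken from the article:

```python
import pandas as pd

# Two Series that share the same index labels (hypothetical sample data).
views = pd.Series([10, 20, 30], index=["a", "b", "c"])
loves = pd.Series([1.5, 2.5, 3.5], index=["a", "b", "c"])

# Building a DataFrame from a dict of Series: keys become column names.
df = pd.DataFrame({"views": views, "loves": loves})

print(df)
print(type(df["views"]))  # selecting one column gives back a Series
print(df.dtypes)          # each column keeps its own dtype
```

Selecting a column with df["views"] returns a Series again, which is why the two structures are usually introduced together.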

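Common data structures:
- One-dimensional: Series
- Two-dimensional: DataFrame
- Three-dimensional: Panel … (deprecated in pandas 0.20 and removed in 0.25)
- Four-dimensional: Panel4D … (no longer available in current pandas)
- N-dimensional: PanelND … (no longer available in current pandas)
A Series is the one-dimensional data structure in Pandas. It resembles a Python list or a NumPy ndarray, with these differences: a Series is always one-dimensional, it can hold values of different types, and every element is paired with an index label.
There are three ways to create a Series:

From a list
From a NumPy ndarray
From a dict; the keys become the index and the values become the Series values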

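import pandas as pd
import numpy as np
import string

# Check the installed pandas version
print(pd.__version__)

# ******************** Creating Series objects

# 1) Create a Series from a list
array = ["粉条", "粉丝", "粉带"]
# If no index is given, it defaults to 0, 1, 2, ...
s1 = pd.Series(data=array)
print(s1)
# Pass index= to supply custom labels
ss1 = pd.Series(data=array, index=['A', 'B', 'C'])
print(ss1)

# 2) Create a Series from a NumPy ndarray
n = np.random.randn(5)   # five samples from the standard normal distribution
s2 = pd.Series(data=n)
print(s2)
# Change the element dtype; the np.int alias was removed in NumPy 1.24, so use the builtin int
ss2 = s2.astype(int)
print(ss2)

# 3) Create a Series from a dict: keys become the index, values become the data
# (named letter_codes rather than dict, to avoid shadowing the builtin)
letter_codes = {string.ascii_lowercase[i]: i for i in range(10)}
# print(letter_codes)
s3 = pd.Series(letter_codes)
print(s3)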

Output:

0.23.4
0    粉条
1    粉丝
2    粉带
dtype: object
A    粉条
B    粉丝
C    粉带
dtype: object
0    0.024406
1   -1.819926
2   -0.763840
3   -0.945519
4   -0.763354
dtype: float64
0    0
1   -1
2    0
3    0
4    0
dtype: int32
a    0
b    1
c    2
d    3
e    4
f    5
g    6
h    7
i    8
j    9
dtype: int64

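2. Basic Series operations

Common Series attributes and methods:
No.   Attribute or method   Description
1     axes                  Returns a list of the row axis labels.
2     dtype                 Returns the data type (dtype) of the object.
3     empty                 Returns True if the Series is empty.
4     ndim                  Returns the number of dimensions of the underlying data; always 1 for a Series.
5     size                  Returns the number of elements in the underlying data.
6     values                Returns the Series as an ndarray.
7     head()                Returns the first n rows (5 by default).
8     tail()                Returns the last n rows (5 by default).

For example: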

import pandas as pd
import numpy as np

array = ["粉条", "粉丝", "粉带"]
s1 = pd.Series(data=array)
print(s1)
print(s1.axes)
print(s1.dtype)
print(s1.empty)
print(s1.ndim)
print(s1.size)
print(s1.values)            # all values as an ndarray (without the index)

# 1) Replace the Series index
print(s1.index)
s1.index = ['A', 'B', 'C']
print(s1)

# 2) Concatenate two Series end to end
array = ["粉条", "粉丝", "westos"]
# If no index is given, it defaults to 0, 1, 2, ...
s2 = pd.Series(data=array)
# Series.append was removed in pandas 2.0; pd.concat gives the same result
s3 = pd.concat([s1, s2])
print(s3)

# 3) Drop the element with a given index label
s3 = s3.drop('C')  # remove the value labelled 'C'
print(s3)

# 4) Look up and assign by index label
print(s3['B'])
s3['B'] = np.nan  # np.nan marks a missing value in pandas (None is treated as missing too)
print(s3)

# 5) Slicing works like list slicing
print(s3[:2])
print(s3[::-1])
print(s3[-2:])  # the last two elements

Output:

0    粉条
1    粉丝
2    粉带
dtype: object
[RangeIndex(start=0, stop=3, step=1)]
object
False
1
3
['粉条' '粉丝' '粉带']
RangeIndex(start=0, stop=3, step=1)
A    粉条
B    粉丝
C    粉带
dtype: object
A        粉条
B        粉丝
C        粉带
0        粉条
1        粉丝
2    westos
dtype: object
A        粉条
B        粉丝
0        粉条
1        粉丝
2    westos
dtype: object
粉丝
A        粉条
B       NaN
0        粉条
1        粉丝
2    westos
dtype: object
A     粉条
B    NaN
dtype: object
2    westos
1        粉丝
0        粉条
B       NaN
A        粉条
dtype: object
1        粉丝
2    westos
dtype: object

3. Series arithmetic (+ - * /)

import pandas as pd
import numpy as np

s1 = pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])
s2 = pd.Series(np.arange(2, 8), index=['c', 'd', 'e', 'f', 'g', 'h'])

print(s1)
print(s2)

# Arithmetic aligns on index labels; a label present in only one Series yields NaN
# Addition: missing value + real value = missing value
# print(s1 + s2)
print(s1.add(s2))

# Subtraction
# print(s1 - s2)
print(s1.sub(s2))

# Multiplication
# print(s1 * s2)
print(s1.mul(s2))

# Division
# print(s1 / s2)
print(s1.div(s2))

# Median
print(s1)
print(s1.median())

# Sum
print(s1.sum())

# Maximum
print(s1.max())

# Minimum
print(s1.min())

Output:

a    0
b    1
c    2
d    3
e    4
dtype: int64
c    2
d    3
e    4
f    5
g    6
h    7
dtype: int64
a    NaN
b    NaN
c    4.0
d    6.0
e    8.0
f    NaN
g    NaN
h    NaN
dtype: float64
a    NaN
b    NaN
c    0.0
d    0.0
e    0.0
f    NaN
g    NaN
h    NaN
dtype: float64
a     NaN
b     NaN
c     4.0
d     9.0
e    16.0
f     NaN
g     NaN
h     NaN
dtype: float64
a    NaN
b    NaN
c    1.0
d    1.0
e    1.0
f    NaN
g    NaN
h    NaN
dtype: float64
a    0
b    1
c    2
d    3
e    4
dtype: int64
2.0
10
4
0
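
The NaN results above come from labels that exist in only one of the two Series. As a small sketch of one way to avoid them, the arithmetic methods accept a fill_value argument that stands in for the missing side before the operation. The choice of 0 here is an assumption that suits addition; other operations may call for a different default:

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.arange(5), index=["a", "b", "c", "d", "e"])
s2 = pd.Series(np.arange(2, 8), index=["c", "d", "e", "f", "g", "h"])

# Plain addition leaves NaN wherever a label is missing on one side.
plain = s1 + s2

# With fill_value=0 the missing side is treated as 0 before adding.
total = s1.add(s2, fill_value=0)

print(plain.isna().sum())  # 5 labels (a, b, f, g, h) have no partner
print(total)               # a 0.0, b 1.0, c 4.0, d 6.0, e 8.0, f 5.0, g 6.0, h 7.0
```

The result is still float64, because the aligned intermediate values contained missing entries.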

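4. The where method
where works like a ternary operator: elements that satisfy the condition are kept unchanged, and the rest are replaced with another value (NaN by default).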


import pandas as pd
import numpy as np

# Note: Series.where does not behave like numpy.where
s1 = pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])
# Keep values greater than 3; everything else becomes a missing value (NaN)
print(s1.where(s1 > 3))

# Keep values greater than 3; everything else is set to 10
print(s1.where(s1 > 3, 10))

# mask is the inverse of where: values greater than 3 become NaN ...
print(s1.mask(s1 > 3))
# ... or are set to 10 when a replacement value is given
print(s1.mask(s1 > 3, 10))
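
To make the behaviour of these four calls concrete, here is a small sketch on the same five-element Series, with the resulting values noted in comments. The variable names are illustrative only, and the numpy.where call is added purely for comparison:

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.arange(5), index=list("abcde"))

kept = s1.where(s1 > 3)               # a..d -> NaN, e -> 4.0 (dtype becomes float64)
filled = s1.where(s1 > 3, 10)         # 10, 10, 10, 10, 4
masked = s1.mask(s1 > 3)              # 0.0, 1.0, 2.0, 3.0, NaN
masked_filled = s1.mask(s1 > 3, 10)   # 0, 1, 2, 3, 10

# numpy.where takes (condition, value_if_true, value_if_false) and returns an ndarray.
np_result = np.where(s1 > 3, s1, 10)  # array([10, 10, 10, 10, 4])

print(filled.tolist(), masked_filled.tolist(), np_result.tolist())
```

Series.where keeps the index and needs only the replacement for the failing elements, while numpy.where requires both branches and drops the index.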

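5. Creating a DataFrame
A Series has only a row index, whereas a DataFrame has both a row index and a column index.
The row index labels the rows (axis 0) and is called index.
The column index labels the columns (axis 1) and is called columns.

There are three ways to create one:

From a list
From a NumPy object
From a dict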

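import pandas as pd
import numpy as np

# Method 1: from a nested list
li = [
    [1, 2, 3, 4],
    [2, 3, 4, 5]
]

# A DataFrame has two indexes: the row index (axis=0) and the column index (axis=1)
d1 = pd.DataFrame(data=li, index=['A', 'B'], columns=['views', 'loves', 'comments', 'tranfers'])
print(d1)

# Method 2: from a NumPy array
# [0 1 2 3 4 5 6 7]  ====> [[0 1 2 3], [4 5 6 7]]
narr = np.arange(8).reshape(2, 4)
d2 = pd.DataFrame(data=narr, index=['A', 'B'], columns=['views', 'loves', 'comments', 'tranfers'])
print(d2)

# Method 3: from a dict; the keys become column names
# The original listing is cut off after the first key. The remaining entries are
# reconstructed from the printed output and may differ from the author's code.
# The date_range examples that produced the rest of the output are not shown here.
data = {
    'views': [1, 2],
    'loves': [2, 3],
    'comments': [3, 4],
}
d3 = pd.DataFrame(data=data, index=['粉条', '粉丝'])
print(d3)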

Output:

   views  loves  comments  tranfers
A      1      2         3         4
B      2      3         4         5
   views  loves  comments  tranfers
A      0      1         2         3
B      4      5         6         7
    views  loves  comments
粉条      1      2         3
粉丝      2      3         4
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
               '2019-01-05', '2019-01-06', '2019-01-07', '2019-01-08'],
              dtype='datetime64[ns]', freq='D')
DatetimeIndex(['2020-05-10 15:28:16.591580', '2020-05-12 15:28:16.591580',
               '2020-05-14 15:28:16.591580', '2020-05-16 15:28:16.591580',
               '2020-05-18 15:28:16.591580', '2020-05-20 15:28:16.591580'],
              dtype='datetime64[ns]', freq='2D')
                                   A         B         C         D
2020-05-10 15:28:16.591580 -0.836054 -0.067327  1.875740  1.833607
2020-05-12 15:28:16.591580 -0.177480 -1.372123  0.458569 -0.741190
2020-05-14 15:28:16.591580 -0.040522 -0.819632 -0.292013 -1.619735
2020-05-16 15:28:16.591580 -2.423660  1.670951 -0.101030 -0.550243
2020-05-18 15:28:16.591580 -1.045623 -2.250482  0.418338 -0.785946
2020-05-20 15:28:16.591580 -0.199168  1.555307 -0.330309 -0.059888
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03'], dtype='datetime64[ns]', freq='D')
2021-01-01    1
2021-01-02    2
2021-01-03    3
Freq: D, dtype: int64
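
The DatetimeIndex lines in the output above look like the result of pd.date_range calls whose source is missing from the listing. The following is only a sketch of calls that would give output of the same shape; the start dates, the fixed start of the second range (the original appears to start from the current time), and the variable names are all assumptions:

```python
import numpy as np
import pandas as pd

# Eight consecutive days starting on 2019-01-01 (freq="D" is the default).
days = pd.date_range(start="2019-01-01", periods=8, freq="D")

# Six timestamps spaced two days apart.
every_other = pd.date_range(start="2020-05-10", periods=6, freq="2D")

# A DataFrame of random values indexed by those timestamps.
df = pd.DataFrame(np.random.randn(6, 4), index=every_other, columns=list("ABCD"))

# A Series indexed by dates, i.e. a simple time series.
ts = pd.Series([1, 2, 3], index=pd.date_range("2021-01-01", periods=3))

print(days)
print(df)
print(ts)
```

Passing a DatetimeIndex as index= is what turns an ordinary Series or DataFrame into a time series.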

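6. Basic DataFrame attributes and overall inspection
1) Basic attributes
df.shape    # number of rows and columns
df.dtypes   # data type of each column
df.ndim     # number of dimensions
df.index    # row index
df.columns  # column index
df.values   # the data as a two-dimensional ndarray
2) Overall inspection
df.head(3)     # first rows; 5 by default
df.tail(3)     # last rows; 5 by default
df.info()      # overview: row count, column count, index, non-null count per column, column dtypes, memory usage
df.describe()  # summary statistics: count, mean, standard deviation, min, quartiles, max
Note:
Selecting rows: df['A'] looks up a column, so a row label cannot be used directly inside []; use a slice instead:

df[:1] returns the first row
df[:2] returns the first two rows
Alternatively:
df.iloc[0] returns the first row by position
df.loc['A'] returns the row labelled A
Selecting columns:

df['column_label']
df.column_label

For example: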

import pandas as pd
import numpy as np

narr = np.arange(8).reshape(2, 4)
# A DataFrame has two indexes: the row index (axis=0) and the column index (axis=1)
d2 = pd.DataFrame(data=narr, index=['A', 'B'], columns=['views', 'loves', 'comments', 'tranfers'])
print(d2)

# ********************** 1) Basic attributes ***********************
print(d2.shape)  # number of rows and columns
print(d2.dtypes)  # data type of each column
print(d2.ndim)  # number of dimensions
print(d2.index)  # row index
print(d2.columns)  # column index
print(d2.values, type(d2.values))  # the data as a two-dimensional ndarray

# ****************************** 2) Overall inspection *************
print(d2.head(1))  # first rows; 5 by default
print(d2.tail(1))  # last rows; 5 by default

print("*" * 10)
# Overview: row count, column count, column dtypes, memory usage
# info() prints its report itself and returns None, so this line also prints "info: None"
print("info:", d2.info())

print("统计".center(50, '*'))
# Summary statistics: count, mean, standard deviation, min, 25%, median, 75%, max
print(d2.describe())

# 3) Transpose
print("d2: \n", d2)
print("d2 T: \n", d2.T)
# transpose() is equivalent to .T; the original used d2.swapaxes(1, 0),
# which is deprecated for DataFrame since pandas 2.1
print("d2 T: \n", d2.transpose())

# 4) Sort by column
print(d2)
# Sorting is ascending by default; pass ascending=False for descending order
print(d2.sort_values(by=["views", 'tranfers'], ascending=False))

# 5) Slicing and selection
print(d2)
print(d2[:2])  # row slicing works, but a single row label cannot be used inside []
print('1:\n', d2['views'])  # select one column by label
print('2:\n', d2.views)  # equivalent to the line above
print(d2[['views', 'comments']])  # select several columns by label

# 6) Index-style row access
#       - iloc selects rows by position
#       - loc selects rows by label
# print(d2[0])      # raises KeyError: 0 is not a column label
print(d2.iloc[0])
print(d2.iloc[-1])

# print(d2['A'])    # raises KeyError: 'A' is a row label, not a column label
print(d2)
print(d2.loc['A'])

# 7) Modify values in a DataFrame
d2.loc['A'] = np.nan
print(d2)
print(d2.info())

Output:

   views  loves  comments  tranfers
A      0      1         2         3
B      4      5         6         7
(2, 4)
views       int32
loves       int32
comments    int32
tranfers    int32
dtype: object
2
Index(['A', 'B'], dtype='object')
Index(['views', 'loves', 'comments', 'tranfers'], dtype='object')
[[0 1 2 3]
 [4 5 6 7]] <class 'numpy.ndarray'>
   views  loves  comments  tranfers
A      0      1         2         3
   views  loves  comments  tranfers
B      4      5         6         7
**********
<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, A to B
Data columns (total 4 columns):
views       2 non-null int32
loves       2 non-null int32
comments    2 non-null int32
tranfers    2 non-null int32
dtypes: int32(4)
memory usage: 48.0+ bytes
info: None
************************统计************************
          views     loves  comments  tranfers
count  2.000000  2.000000  2.000000  2.000000
mean   2.000000  3.000000  4.000000  5.000000
std    2.828427  2.828427  2.828427  2.828427
min    0.000000  1.000000  2.000000  3.000000
25%    1.000000  2.000000  3.000000  4.000000
50%    2.000000  3.000000  4.000000  5.000000
75%    3.000000  4.000000  5.000000  6.000000
max    4.000000  5.000000  6.000000  7.000000
d2: 
    views  loves  comments  tranfers
A      0      1         2         3
B      4      5         6         7
d2 T: 
           A  B
views     0  4
loves     1  5
comments  2  6
tranfers  3  7
d2 T: 
           A  B
views     0  4
loves     1  5
comments  2  6
tranfers  3  7
   views  loves  comments  tranfers
A      0      1         2         3
B      4      5         6         7
   views  loves  comments  tranfers
B      4      5         6         7
A      0      1         2         3
   views  loves  comments  tranfers
A      0      1         2         3
B      4      5         6         7
   views  loves  comments  tranfers
A      0      1         2         3
B      4      5         6         7
1:
 A    0
B    4
Name: views, dtype: int32
2:
 A    0
B    4
Name: views, dtype: int32
   views  comments
A      0         2
B      4         6
views       0
loves       1
comments    2
tranfers    3
Name: A, dtype: int32
views       4
loves       5
comments    6
tranfers    7
Name: B, dtype: int32
   views  loves  comments  tranfers
A      0      1         2         3
B      4      5         6         7
views       0
loves       1
comments    2
tranfers    3
Name: A, dtype: int32
   views  loves  comments  tranfers
A    NaN    NaN       NaN       NaN
B    4.0    5.0       6.0       7.0
<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, A to B
Data columns (total 4 columns):
views       1 non-null float64
loves       1 non-null float64
comments    1 non-null float64
tranfers    1 non-null float64
dtypes: float64(4)
memory usage: 160.0+ bytes
None
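
Beyond selecting whole rows, loc and iloc also accept a row and a column together. The sketch below rebuilds the same 2x4 frame and shows a few common combinations; the variable names are illustrative:

```python
import numpy as np
import pandas as pd

d2 = pd.DataFrame(
    np.arange(8).reshape(2, 4),
    index=["A", "B"],
    columns=["views", "loves", "comments", "tranfers"],
)

cell_by_label = d2.loc["A", "loves"]        # row label + column label -> 1
cell_by_pos = d2.iloc[1, 2]                 # row position + column position -> 6
subset = d2.loc[:, ["views", "comments"]]   # all rows, two columns
popular = d2[d2["views"] > 2]               # boolean filter keeps only row B

print(cell_by_label, cell_by_pos)
print(subset)
print(popular)
```

A boolean Series inside [] filters rows, which is the one case where [] selects by row rather than by column label.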

7. Reading and writing files
Read a CSV file: pd.read_csv()
Write a CSV file: df.to_csv()
Write an Excel file: df.to_excel()

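import os
import pandas as pd

# pandas can read and write csv, excel, json, and other formats

# 1) Writing a CSV file
df = pd.DataFrame(
    {'province': ['陕西', '陕西', '四川', '四川', '陕西'],
     'city': ['咸阳', '宝鸡', '成都', '成都', '宝鸡'],
     'count1': [1, 2, 3, 4, 5],
     'count2': [1, 2, 33, 4, 5]
     }
)

print(df)

os.makedirs('doc', exist_ok=True)  # make sure the output directory exists
filename = os.path.join('doc', 'csvFile.csv')
"""
index=True/False   whether to write the row index; usually not needed
mode='w'           write mode; the default 'w' truncates the file first, 'a' appends
header=True/False  whether to write the header row (column names); usually wanted
"""
# Here the rows are appended without index or header, separated by spaces
df.to_csv(filename, index=False, mode='a', header=False, sep=' ')
print("CSV file saved")

# 2) Reading the CSV file back; sep and header must match how the file was written
# df2 = pd.read_csv(filename, sep=' ', header=None)
# print(df2)

# 3) Writing an Excel file (needs an Excel engine such as openpyxl)
# os.path.join avoids the invalid "\e" escape in the literal "doc\excelFile.xlsx"
df.to_excel(os.path.join('doc', 'excelFile.xlsx'), sheet_name="省份统计", index=False)
print("Excel file saved")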
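
As a quick way to check that what is written can be read back, here is a minimal round-trip sketch. It writes to a temporary directory instead of the doc folder and uses a shortened copy of the data; both are choices made only for this illustration:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {"province": ["陕西", "四川"], "city": ["咸阳", "成都"], "count1": [1, 3]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "csvFile.csv")
    # Write with a header row and without the row index.
    df.to_csv(path, index=False, encoding="utf-8")
    # Read it back with the default comma separator and header handling.
    restored = pd.read_csv(path, encoding="utf-8")

print(restored)
```

Because the header is written and the defaults are used on both sides, the column names and values survive the round trip unchanged.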


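8. Grouping and aggregation with groupby
pandas provides a flexible and efficient groupby facility:
1) It lets you slice, dice, and summarize a data set in a natural way.
2) It splits a pandas object by one or more keys (functions, arrays, or DataFrame column names).
3) It computes group summary statistics such as count, mean, and standard deviation, or applies user-defined functions.
For example: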

import pandas as pd

df = pd.DataFrame(
    {'province': ['陕西', '陕西', '四川', '四川', '陕西'],
     'city': ['咸阳', '宝鸡', '成都', '成都', '宝鸡'],
     'count1': [1, 2, 3, 4, 5],
     'count2': [1, 2, 33, 4, 5]
     }
)

print(df)
# Group one column by the keys of another column and summarize it
grouped = df['count1'].groupby(df['province'])
print(grouped.describe())
print(grouped.median())

# Summarize count1 per city
grouped = df['count1'].groupby(df['city'])
print(grouped.max())

# Group by several keys at once
grouped = df['count1'].groupby([df['province'], df['city']])
print(grouped.max())
print(grouped.sum())
print(grouped.count())

# unstack pivots the inner level of the hierarchical index (city) into columns
print(grouped.max().unstack())

Output:

 province city  count1  count2
0       陕西   咸阳       1       1
1       陕西   宝鸡       2       2
2       四川   成都       3      33
3       四川   成都       4       4
4       陕西   宝鸡       5       5
          count      mean       std  min   25%  50%   75%  max
province                                                      
四川          2.0  3.500000  0.707107  3.0  3.25  3.5  3.75  4.0
陕西          3.0  2.666667  2.081666  1.0  1.50  2.0  3.50  5.0
province
四川    3.5
陕西    2.0
Name: count1, dtype: float64
city
咸阳    1
宝鸡    5
成都    4
Name: count1, dtype: int64
province  city
四川        成都      4
陕西        咸阳      1
          宝鸡      5
Name: count1, dtype: int64
province  city
四川        成都      7
陕西        咸阳      1
          宝鸡      7
Name: count1, dtype: int64
province  city
四川        成都      2
陕西        咸阳      1
          宝鸡      2
Name: count1, dtype: int64
city       咸阳   宝鸡   成都
province               
四川        NaN  NaN  4.0
陕西        1.0  5.0  NaN
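
The same grouping can also be written by calling groupby on the DataFrame with column names and then applying several aggregations at once with agg. This is a sketch of that alternative spelling, not the approach used in the listing above:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "province": ["陕西", "陕西", "四川", "四川", "陕西"],
        "city": ["咸阳", "宝鸡", "成都", "成都", "宝鸡"],
        "count1": [1, 2, 3, 4, 5],
        "count2": [1, 2, 33, 4, 5],
    }
)

# Several statistics for one column, one row per province.
summary = df.groupby("province")["count1"].agg(["sum", "mean", "max"])

# Sums of two columns, grouped by province and city.
by_city = df.groupby(["province", "city"])[["count1", "count2"]].sum()

print(summary)
print(by_city)
```

Passing column names keeps the keys and the values in one object, which tends to read more clearly than passing separate Series as keys.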

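IV. Case study 1 (product data analysis)
File description: the columns are, in order: order ID, order quantity, item name, item choice details, and item total price.
Requirement 1:
1) Read all the data from the file. How do you read a CSV file? pd.read_csv
2) Get all the item names. How do you get one column of a DataFrame? df['column_name'] or df.column_name
3) Sort by item price in descending order. How do you sort a DataFrame? df.sort_values(by=["sort_column"], ascending=False)
Write the 20 most expensive items to mosthighPrice.xlsx. How do you take the first 20 rows and write them to a file? df.head(20), then df.to_excel(...)

Requirement 2:
1) Count how often each item appears in column [item_name] and draw a bar chart
(ranking of the most frequently purchased items; plot the top 5).
2) Group by column [order_id] and compute the total amount spent per order.
3) Draw a scatter plot of each order's total amount against its total item quantity.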

import pandas as pd
from matplotlib import pyplot as plt

# Requirement 1:
#     1) Read all the data from the file
#     2) Get all the item names
goodsInfo = pd.read_csv('doc/chipo.csv')
# print(goodsInfo.head())
# print(goodsInfo.tail())
# print(goodsInfo.info())
# print(goodsInfo.describe())
print("Item names: \n", goodsInfo['item_name'].head())
print("Item names: \n", goodsInfo.item_name.head())

# Requirement 1:
#     3) Sort by item price in descending order and
#        write the 20 most expensive items to an Excel file
# Strip the leading '$' and convert to float (np.float was removed in NumPy 1.24)
goodsInfo.item_price = goodsInfo.item_price.str.strip('$').astype(float)
highPriceData = goodsInfo.sort_values('item_price', ascending=False).head(20)
# print(highPriceData.head(5))
filename = 'doc/mostHighPrice.xlsx'  # a forward slash avoids the invalid "\m" escape
highPriceData.to_excel(filename)
print("Saved.......")

# Requirement 2:
#     1) Count how often each item appears in column [item_name] and draw a bar chart
#        (ranking of the most frequently purchased items; plot the top 5)
goodsInfo = pd.read_csv('doc/chipo.csv')
# newInfo counts the rows per item name; its 'Unnamed: 0' column holds the frequency we need
newInfo = goodsInfo.groupby('item_name').count()
mostRaiseGoods = newInfo.sort_values('Unnamed: 0', ascending=False)['Unnamed: 0'].head(5)
print(mostRaiseGoods)       # a Series

# Item names
x = mostRaiseGoods.index
# Number of times each item appears
y = mostRaiseGoods.values

# from pyecharts import Bar
#
# bar = Bar("Most frequently purchased items")
# bar.add("", x, y)
# bar.render()

# Requirement 2:
#     2) Group by column [order_id] and compute the total per order (quantity and item_price)
#     3) Draw a scatter plot of each order's total amount against its total item quantity


goodsInfo = pd.read_csv('doc/chipo.csv')
# Order quantity
quantity = goodsInfo.quantity
# Convert item_price to float and store it back in the DataFrame; otherwise the
# per-order sum below would concatenate strings instead of adding numbers
goodsInfo['item_price'] = goodsInfo.item_price.str.strip('$').astype(float)
item_price = goodsInfo.item_price
print(item_price)

# Group by column [order_id]
order_group = goodsInfo.groupby("order_id")
# Total amount per order
x = order_group.item_price.sum()
# Total item quantity per order
y = order_group.quantity.sum()
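
Since doc/chipo.csv is not included with the article, the steps can be tried on a small in-memory sample. The rows below are made up for illustration and only mimic the column layout described above; value_counts is used here as a shorter alternative to groupby(...).count():

```python
import io

import pandas as pd

# Hypothetical sample rows with the same column names as the case study.
csv_text = """order_id,quantity,item_name,item_price
1,1,Chips,$2.39
1,2,Chicken Bowl,$16.98
2,1,Chicken Bowl,$8.49
3,1,Chips,$2.39
3,1,Chicken Bowl,$8.49
"""

orders = pd.read_csv(io.StringIO(csv_text))

# Convert "$2.39" style strings to floats before doing any arithmetic.
orders["item_price"] = orders["item_price"].str.strip("$").astype(float)

# Frequency of each item, most common first (top 5).
top_items = orders["item_name"].value_counts().head(5)

# Total amount and total quantity per order.
per_order = orders.groupby("order_id")[["item_price", "quantity"]].sum()

print(top_items)
print(per_order)
```

The two columns of per_order are the values a scatter plot would take, for example plt.scatter(per_order["item_price"], per_order["quantity"]) with matplotlib.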
