1 Pandas介绍
- 2008年WesMcKinney开发出的库
- 专门用于数据挖掘的开源python库
- 以Numpy为基础,借力Numpy模块在计算方面性能高的优势
- 基于matplotlib,能够简便的画图
- 独特的数据结构
2 为什么使用Pandas
Numpy已经能够帮助我们处理数据,能够结合matplotlib解决部分数据展示等问题,那么pandas学习的目的在什么地方呢?
- 增强图表可读性
回忆我们在numpy当中创建学生成绩表样式,返回结果:
array([[92, 55, 78, 50, 50],
[71, 76, 50, 48, 96],
[45, 84, 78, 51, 68],
[81, 91, 56, 54, 76],
[86, 66, 77, 67, 95],
[46, 86, 56, 61, 99],
[46, 95, 44, 46, 56],
[80, 50, 45, 65, 57],
[41, 93, 90, 41, 97],
[65, 83, 57, 57, 40]])
如果数据展示为这样,可读性就会更友好:
3 小结
- pandas的优势【了解】
- 增强图表可读性
- 便捷的数据处理能力
- 读取文件方便
- 封装了Matplotlib、Numpy的画图和计算
5.2 Pandas数据结构
Pandas中一共有三种常用的数据结构,分别为:Series、DataFrame和MultiIndex(老版本中还有三维的Panel,现已被弃用)。
其中Series是一维数据结构,DataFrame是二维的表格型数据结构,MultiIndex是多级(层次化)索引,可以在二维结构上表达更高维度的数据。
1.Series
Series是一个类似于一维数组的数据结构,它能够保存任何类型的数据,比如整数、字符串、浮点数等,主要由一组数据和与之相关的索引两部分构成。
1.1 Series的创建
# 导入pandas和numpy
import pandas as pd
import numpy as np
pd.Series(data=None, index=None, dtype=None)
- 参数:
- data:传入的数据,可以是ndarray、list等
- index:索引,长度必须与数据相等(一般保持唯一)。如果没有传入索引参数,则默认会自动创建一个从0到N-1的整数索引。
- dtype:数据的类型
通过已有数据创建
- 指定内容,默认索引
pd.Series(np.arange(10))
# 运行结果
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
dtype: int64
- 指定索引
pd.Series([6.7,5.6,3,10,2], index=[1,2,3,4,5])
# 运行结果
1 6.7
2 5.6
3 3.0
4 10.0
5 2.0
dtype: float64
- 通过字典数据创建
color_count = pd.Series({'red':100, 'blue':200, 'green': 500, 'yellow':1000})
color_count
# 运行结果
blue 200
green 500
red 100
yellow 1000
dtype: int64
1.2 Series的属性
为了更方便地操作Series对象中的索引和数据,Series中提供了两个属性index和values
- index
color_count.index
# 结果
Index(['blue', 'green', 'red', 'yellow'], dtype='object')
- values
color_count.values
# 结果
array([ 200, 500, 100, 1000])
也可以使用索引来获取数据:
color_count[2]
# 结果
100
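除了位置下标,也可以用索引标签来取值,下面是基于上文color_count的一个小示例:
# 通过索引标签获取数据
color_count['red']
# 结果
100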
2.DataFrame
DataFrame是一个类似于二维数组或表格(如excel)的对象,既有行索引,又有列索引
- 行索引,表明不同行,横向索引,叫index,0轴,axis=0
- 列索引,表明不同列,纵向索引,叫columns,1轴,axis=1
2.1 DataFrame的创建
# 导入pandas
import pandas as pd
pd.DataFrame(data=None, index=None, columns=None)
- 参数:
  - index:行标签。如果没有传入索引参数,则默认会自动创建一个从0到N-1的整数索引。
  - columns:列标签。如果没有传入列标签参数,则默认会自动创建一个从0到N-1的整数索引。
通过已有数据创建
举例一:
pd.DataFrame(np.random.randn(2,3))
回忆咱们在前面直接使用np创建的数组显示方式,比较两者的区别。
举例二:创建学生成绩表
# 生成10名同学,5门功课的数据
score = np.random.randint(40, 100, (10, 5))
# 结果
array([[92, 55, 78, 50, 50],
[71, 76, 50, 48, 96],
[45, 84, 78, 51, 68],
[81, 91, 56, 54, 76],
[86, 66, 77, 67, 95],
[46, 86, 56, 61, 99],
[46, 95, 44, 46, 56],
[80, 50, 45, 65, 57],
[41, 93, 90, 41, 97],
[65, 83, 57, 57, 40]])
但是这样的数据形式很难看到存储的是什么的样的数据,可读性比较差!!
问题:如何让数据更有意义的显示?
# 使用Pandas中的数据结构
score_df = pd.DataFrame(score)
给分数数据增加行列索引,显示效果更佳
效果:
- 增加行、列索引
# 构造列索引序列(科目)
subjects = ["语文", "数学", "英语", "政治", "体育"]
# 构造行索引序列(学生)
stu = ['同学' + str(i) for i in range(score_df.shape[0])]
# 添加行、列索引
data = pd.DataFrame(score, columns=subjects, index=stu)
2.2 DataFrame的属性
- shape
data.shape
# 结果
(10, 5)
- index
DataFrame的行索引列表
data.index
# 结果
Index(['同学0', '同学1', '同学2', '同学3', '同学4', '同学5', '同学6', '同学7', '同学8', '同学9'], dtype='object')
- columns
DataFrame的列索引列表
data.columns
# 结果
Index(['语文', '数学', '英语', '政治', '体育'], dtype='object')
- values
直接获取其中array的值
data.values
array([[92, 55, 78, 50, 50],
[71, 76, 50, 48, 96],
[45, 84, 78, 51, 68],
[81, 91, 56, 54, 76],
[86, 66, 77, 67, 95],
[46, 86, 56, 61, 99],
[46, 95, 44, 46, 56],
[80, 50, 45, 65, 57],
[41, 93, 90, 41, 97],
[65, 83, 57, 57, 40]])
- T
转置
data.T
结果
- head(5):显示前5行内容
如果不补充参数,默认5行。填入参数N则显示前N行
data.head(5)
- tail(5):显示后5行内容
如果不补充参数,默认5行。填入参数N则显示后N行
data.tail(5)
2.3 DataFrame索引的设置
需求:
2.3.1 修改行列索引值
stu = ["学生_" + str(i) for i in range(score_df.shape[0])]
# 必须整体全部修改
data.index = stu
注意:以下修改方式是错误的
# 错误修改方式
data.index[3] = '学生_3'
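下面是一个示意(假设data即上面的成绩表):直接给索引的某个元素赋值会抛出TypeError,正确做法是构造一个新的完整索引再整体赋值。
# 直接修改索引的单个元素会报错,通常是:
# TypeError: Index does not support mutable operations
# 正确做法:构造新的完整索引,整体替换
new_index = list(data.index)
new_index[3] = '学生_3'
data.index = new_index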
2.3.2 重设索引
- reset_index(drop=False)
- 设置新的下标索引
- drop:默认为False,不删除原来索引,如果为True,删除原来的索引值
# 重置索引,drop=False
data.reset_index()
# 重置索引,drop=True
data.reset_index(drop=True)
2.3.3 以某列值设置为新的索引
- set_index(keys, drop=True)
- keys : 列索引名称或者列索引名称的列表
- drop : boolean, default True. 将该列当做新的索引后,删除原来的列
设置新索引案例
1、创建
df = pd.DataFrame({'month': [1, 4, 7, 10],
'year': [2012, 2014, 2013, 2014],
'sale':[55, 40, 84, 31]})
month sale year
0 1 55 2012
1 4 40 2014
2 7 84 2013
3 10 31 2014
2、以月份设置新的索引
df.set_index('month')
sale year
month
1 55 2012
4 40 2014
7 84 2013
10 31 2014
3、设置多个索引,以年和月份
df = df.set_index(['year', 'month'])
df
sale
year month
2012 1 55
2014 4 40
2013 7 84
2014 10 31
注:通过刚才的设置,这样DataFrame就变成了一个具有MultiIndex的DataFrame。
3.MultiIndex与Panel
3.1 MultiIndex
MultiIndex是多级索引(也称层次化索引),是pandas的重要功能,可以在Series、DataFrame对象上拥有2个以及2个以上的索引。
3.1.1 multiIndex的特性
打印刚才的df的行索引结果
df.index
MultiIndex(levels=[[2012, 2013, 2014], [1, 4, 7, 10]],
labels=[[0, 2, 1, 2], [0, 1, 2, 3]],
names=['year', 'month'])
多级或分层索引对象。
- index属性
- names:levels的名称
- levels:每个level的取值列表
df.index.names
# FrozenList(['year', 'month'])
df.index.levels
# FrozenList([[2012, 2013, 2014], [1, 4, 7, 10]])
3.1.2 multiIndex的创建
arrays = [[1, 1, 2, 2], ['red', 'blue', 'red', 'blue']]
pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))
# 结果
MultiIndex(levels=[[1, 2], ['blue', 'red']],
codes=[[0, 0, 1, 1], [1, 0, 1, 0]],
names=['number', 'color'])
3.2 Panel
3.2.1 panel的创建
- class pandas.Panel(data=None, items=None, major_axis=None, minor_axis=None)
- 作用:存储3维数组的Panel结构
- 参数:
  - data : ndarray或者dataframe
  - items : 索引或类似数组的对象,axis=0
  - major_axis : 索引或类似数组的对象,axis=1
  - minor_axis : 索引或类似数组的对象,axis=2
p = pd.Panel(data=np.arange(24).reshape(4,3,2),
items=list('ABCD'),
major_axis=pd.date_range('20130101', periods=3),
minor_axis=['first', 'second'])
# 结果
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 3 (major_axis) x 2 (minor_axis)
Items axis: A to D
Major_axis axis: 2013-01-01 00:00:00 to 2013-01-03 00:00:00
Minor_axis axis: first to second
3.2.2 查看panel数据
p[:,:,"first"]
p["B",:,:]
注:Panel从Pandas 0.20.0版本开始被弃用,推荐的用于表示3D数据的方法是带有MultiIndex的DataFrame。
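下面是一个示意(仅作演示,数据与上面的Panel例子相同):用带MultiIndex的DataFrame表示同样的三维数据,把前两个维度放进行索引,第三个维度作为列。
import numpy as np
import pandas as pd

# 与上面Panel例子相同的数据:4(items) x 3(major_axis) x 2(minor_axis)
arr = np.arange(24).reshape(4, 3, 2)
index = pd.MultiIndex.from_product(
    [list('ABCD'), pd.date_range('20130101', periods=3)],
    names=['item', 'date'])
# 行索引为(item, date)的MultiIndex,列为原来的minor_axis
df3d = pd.DataFrame(arr.reshape(12, 2), index=index, columns=['first', 'second'])
df3d.loc['B']  # 取出item为'B'的二维数据,作用类似原来的 p['B', :, :]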
4 小结
- pandas的优势【了解】
- 增强图表可读性
- 便捷的数据处理能力
- 读取文件方便
- 封装了Matplotlib、Numpy的画图和计算
- series【知道】
- 创建
- pd.Series([], index=[])
- pd.Series({})
- 属性
- 对象.index
- 对象.values
- DataFrame【掌握】
- 创建
- pd.DataFrame(data=None, index=None, columns=None)
- 属性
- shape -- 形状
- index -- 行索引
- columns -- 列索引
- values -- 查看值
- T -- 转置
- head() -- 查看头部内容
- tail() -- 查看尾部内容
- DataFrame索引
- 修改的时候,需要进行全局修改
- 对象.reset_index()
- 对象.set_index(keys)
- MultiIndex与Panel【了解】
- multiIndex:
- 多级(层次化)索引,可以在二维的DataFrame上表达更高维的数据
- 创建:
- pd.MultiIndex.from_arrays()
- 属性:
- 对象.index
- panel:
- pd.Panel(data, items, major_axis, minor_axis)
- panel数据要是想看到,则需要进行索引到dataframe或者series才可以
5.3 基本数据操作
为了更好地理解这些基本操作,我们将读取一个真实的股票数据。关于文件操作,后面再介绍,这里先用一下API。
# 读取文件
data = pd.read_csv("./data/stock_day.csv")
# 删除一些列,让数据更简单些,再去做后面的操作
data = data.drop(["ma5","ma10","ma20","v_ma5","v_ma10","v_ma20"], axis=1)
1 索引操作
Numpy当中我们已经讲过使用索引选取序列和切片选择,pandas也支持类似的操作,也可以直接使用列名、行名称,甚至组合使用。
1.1 直接使用行列索引(先列后行)
获取'2018-02-27'这天的'close'的结果
# 直接使用行列索引名字的方式(先列后行)
data['open']['2018-02-27']
23.53
# 不支持的操作
# 错误
data['2018-02-27']['open']
# 错误
data[:1, :2]
1.2 结合loc或者iloc使用索引
获取从'2018-02-27':'2018-02-22','open'的结果
# 使用loc:只能指定行列索引的名字
data.loc['2018-02-27':'2018-02-22', 'open']
2018-02-27 23.53
2018-02-26 22.80
2018-02-23 22.88
Name: open, dtype: float64
# 使用iloc可以通过索引的下标去获取
# 获取前3天数据,前5列的结果
data.iloc[:3, :5]
open high close low
2018-02-27 23.53 25.88 24.16 23.53
2018-02-26 22.80 23.78 23.53 22.80
2018-02-23 22.88 23.37 22.82 22.71
1.3 使用ix组合索引
Warning: Starting in 0.20.0, the .ix indexer is deprecated, in favor of the more strict .iloc and .loc indexers.
获取行第1天到第4天,['open', 'close', 'high', 'low']这个四个指标的结果
# 使用ix进行下标和名称组合索引
data.ix[0:4, ['open', 'close', 'high', 'low']]
# 推荐使用loc和iloc来获取的方式
data.loc[data.index[0:4], ['open', 'close', 'high', 'low']]
data.iloc[0:4, data.columns.get_indexer(['open', 'close', 'high', 'low'])]
open close high low
2018-02-27 23.53 24.16 25.88 23.53
2018-02-26 22.80 23.53 23.78 22.80
2018-02-23 22.88 22.82 23.37 22.71
2018-02-22 22.25 22.28 22.76 22.02
2 赋值操作
对DataFrame当中的close列进行重新赋值为1
# 直接修改原来的值
data['close'] = 1
# 或者
data.close = 1
3 排序
排序有两种形式,一种对于索引进行排序,一种对于内容进行排序
3.1 DataFrame排序
- 使用df.sort_values(by=, ascending=)
- 单个键或者多个键进行排序,
- 参数:
- by:指定排序参考的键
- ascending:默认升序
- ascending=False:降序
- ascending=True:升序
# 按照开盘价大小进行排序 , 使用ascending指定按照大小排序
data.sort_values(by="open", ascending=True).head()
# 按照多个键进行排序
data.sort_values(by=['open', 'high'])
- 使用df.sort_index给索引进行排序
这个股票的日期索引原来是从大到小,现在重新排序,从小到大
# 对索引进行排序
data.sort_index()
3.2 Series排序
- 使用series.sort_values(ascending=True)进行排序
series排序时,只有一列,不需要参数
data['p_change'].sort_values(ascending=True).head()
2015-09-01 -10.03
2015-09-14 -10.02
2016-01-11 -10.02
2015-07-15 -10.02
2015-08-26 -10.01
Name: p_change, dtype: float64
- 使用series.sort_index()进行排序
与df一致
# 对索引进行排序
data['p_change'].sort_index().head()
2015-03-02 2.62
2015-03-03 1.44
2015-03-04 1.57
2015-03-05 2.02
2015-03-06 8.51
Name: p_change, dtype: float64
4 总结
- 1.索引【掌握】
- 直接索引 -- 先列后行,需要通过索引的字符串进行获取
- loc -- 先行后列,需要通过索引的字符串进行获取
- iloc -- 先行后列,是通过下标进行索引
- ix -- 先行后列, 可以用上面两种方法混合进行索引
- 2.赋值【知道】
- data["列名"] = 值
- data.列名 = 值
- 3.排序【知道】
- dataframe
- 对象.sort_values()
- 对象.sort_index()
- series
- 对象.sort_values()
- 对象.sort_index()
- dataframe
5.4 DataFrame运算
1 算术运算
- add(other)
比如进行数学运算加上具体的一个数字
data['open'].add(1)
2018-02-27 24.53
2018-02-26 23.80
2018-02-23 23.88
2018-02-22 23.25
2018-02-14 22.49
- sub(other):减法,用法与add(other)类似
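下面是一个使用sub的小示例(假设data为重新读取的上述股票数据):用收盘价减去开盘价,得到每日价差。
# close减去open,得到每日的价差
data['close'].sub(data['open']).head()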
2 逻辑运算
2.1 逻辑运算符号
- 例如筛选data["open"] > 23的日期数据
- data["open"] > 23返回逻辑结果
data["open"] > 23
2018-02-27 True
2018-02-26 False
2018-02-23 False
2018-02-22 False
2018-02-14 False
# 逻辑判断的结果可以作为筛选的依据
data[data["open"] > 23].head()
- 完成多个逻辑判断,
data[(data["open"] > 23) & (data["open"] < 24)].head()
2.2 逻辑运算函数
- query(expr)
- expr:查询字符串
通过query使得刚才的过程更加方便简单
data.query("open<24 & open>23").head()
- isin(values)
例如判断'open'是否为23.53和23.85
# 可以指定值进行一个判断,从而进行筛选操作
data[data["open"].isin([23.53, 23.85])]
3 统计运算
3.1 describe
综合分析:能够直接得出很多统计结果,count, mean, std, min, max 等
# 计算平均值、标准差、最大值、最小值
data.describe()
3.2 统计函数
Numpy当中已经详细介绍,在这里我们演示min(最小值), max(最大值), mean(平均值), median(中位数), var(方差), std(标准差),mode(众数)结果:
函数 | 作用 |
---|---|
count | Number of non-NA observations |
sum | Sum of values |
mean | Mean of values |
median | Arithmetic median of values |
min | Minimum |
max | Maximum |
mode | Mode |
abs | Absolute Value |
prod | Product of values |
std | Bessel-corrected sample standard deviation |
var | Unbiased variance |
idxmax | compute the index labels with the maximum |
idxmin | compute the index labels with the minimum |
对于单个函数去进行统计的时候,坐标轴还是按照默认列"columns"(axis=0, default);如果要对行"index"进行统计,需要指定axis=1。
- max()、min()
# 使用统计函数:0 代表列求结果, 1 代表行求统计结果
data.max(0)
open 34.99
high 36.35
close 35.21
low 34.01
volume 501915.41
price_change 3.03
p_change 10.03
turnover 12.56
my_price_change 3.41
dtype: float64
- std()、var()
# 方差
data.var(0)
open 1.545255e+01
high 1.662665e+01
close 1.554572e+01
low 1.437902e+01
volume 5.458124e+09
price_change 8.072595e-01
p_change 1.664394e+01
turnover 4.323800e+00
my_price_change 6.409037e-01
dtype: float64
# 标准差
data.std(0)
open 3.930973
high 4.077578
close 3.942806
low 3.791968
volume 73879.119354
price_change 0.898476
p_change 4.079698
turnover 2.079375
my_price_change 0.800565
dtype: float64
- median():中位数
中位数:将数据从小到大排列,位于最中间的那个数即为中位数;如果数据个数为偶数,则取中间两个数的平均值。
df = pd.DataFrame({'COL1' : [2,3,4,5,4,2],
'COL2' : [0,1,2,3,4,2]})
df.median()
COL1 3.5
COL2 2.0
dtype: float64
- idxmax()、idxmin()
# 求出最大值的位置
data.idxmax(axis=0)
open 2015-06-15
high 2015-06-10
close 2015-06-12
low 2015-06-12
volume 2017-10-26
price_change 2015-06-09
p_change 2015-08-28
turnover 2017-10-26
my_price_change 2015-07-10
dtype: object
# 求出最小值的位置
data.idxmin(axis=0)
open 2015-03-02
high 2015-03-02
close 2015-09-02
low 2015-03-02
volume 2016-07-06
price_change 2015-06-15
p_change 2015-09-01
turnover 2016-07-06
my_price_change 2015-06-15
dtype: object
3.3 累计统计函数
函数 | 作用 |
---|---|
cumsum | 计算前1/2/3/…/n个数的和 |
cummax | 计算前1/2/3/…/n个数的最大值 |
cummin | 计算前1/2/3/…/n个数的最小值 |
cumprod | 计算前1/2/3/…/n个数的积 |
那么这些累计统计函数怎么用?
以上这些函数可以对series和dataframe操作
这里我们按照时间的从前往后来进行累计
- 排序
# 排序之后,进行累计求和
data = data.sort_index()
- 对p_change进行求和
stock_rise = data['p_change']
# 进行累计求和
stock_rise.cumsum()
2015-03-02 2.62
2015-03-03 4.06
2015-03-04 5.63
2015-03-05 7.65
2015-03-06 16.16
2015-03-09 16.37
2015-03-10 18.75
2015-03-11 16.36
2015-03-12 15.03
2015-03-13 17.58
2015-03-16 20.34
2015-03-17 22.42
2015-03-18 23.28
2015-03-19 23.74
2015-03-20 23.48
2015-03-23 23.74
那么如何让这个连续求和的结果更好的显示呢?
如果要使用plot函数,需要导入matplotlib.
import matplotlib.pyplot as plt
# plot显示图形
stock_rise.cumsum().plot()
# 需要调用show,才能显示出结果
plt.show()
关于plot,稍后会介绍API的选择
4 自定义运算
- apply(func, axis=0)
- func:自定义函数
- axis=0:默认是列,axis=1为行进行运算
- 定义一个对列,最大值-最小值的函数
data[['open', 'close']].apply(lambda x: x.max() - x.min(), axis=0)
open 22.74
close 22.85
dtype: float64
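如果要按行运算,把axis改为1即可。下面是一个小示例(假设data仍为上面的股票数据):计算每一天最高价与最低价之差。
# axis=1:对每一行应用函数,求每日最高价与最低价之差
data.apply(lambda x: x['high'] - x['low'], axis=1).head()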
5 小结
- 算术运算【知道】
- 逻辑运算【知道】
- 1.逻辑运算符号
- 2.逻辑运算函数
- 对象.query()
- 对象.isin()
- 统计运算【知道】
- 1.对象.describe()
- 2.统计函数
- 3.累积统计函数
- 自定义运算【知道】
- apply(func, axis=0)
5.5 Pandas画图
1 pandas.DataFrame.plot
- DataFrame.plot(kind='line')
- kind : str,需要绘制图形的种类
- ‘line’ : line plot (default)
- ‘bar’ : vertical bar plot
- ‘barh’ : horizontal bar plot
- ‘hist’ : histogram
- ‘pie’ : pie plot
- ‘scatter’ : scatter plot
2 pandas.Series.plot
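Series.plot的用法与DataFrame.plot基本一致。下面是一个小示意(假设data为前面处理过的股票数据,列名与上文一致):
import matplotlib.pyplot as plt

# Series.plot:对单独一列(Series)直接画折线图
data['p_change'].plot(kind='line')
plt.show()
# DataFrame.plot:同时画出多列,比如开盘价与收盘价的对比
data[['open', 'close']].plot(kind='line')
plt.show()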
5.6 文件读取与存储
我们的数据大部分存在于文件当中,所以pandas会支持复杂的IO操作,pandas的API支持众多的文件格式,如CSV、SQL、XLS、JSON、HDF5。
注:最常用的HDF5和CSV文件
1 CSV
1.1 read_csv
- pandas.read_csv(filepath_or_buffer, sep=',', usecols=None)
  - filepath_or_buffer:文件路径
  - sep:分隔符,默认用","隔开
  - usecols:指定读取的列名,列表形式
- 举例:读取之前的股票的数据
# 读取文件,并且指定只获取'open', 'close'指标
data = pd.read_csv("./data/stock_day.csv", usecols=['open', 'close'])
open close
2018-02-27 23.53 24.16
2018-02-26 22.80 23.53
2018-02-23 22.88 22.82
2018-02-22 22.25 22.28
2018-02-14 21.49 21.92
1.2 to_csv
- DataFrame.to_csv(path_or_buf=None, sep=',', columns=None, header=True, index=True, mode='w', encoding=None)
  - path_or_buf:文件路径
  - sep:分隔符,默认用","隔开
  - columns:选择需要的列索引
  - header:boolean or list of string, default True,是否写进列索引值
  - index:是否写进行索引
  - mode:'w' 重写,'a' 追加
- 举例:保存读取出来的股票数据
- 保存'open'列的数据,然后读取查看结果
# 选取10行数据保存,便于观察数据
data[:10].to_csv("./data/test.csv", columns=['open'])
# 读取,查看结果
pd.read_csv("./data/test.csv")
Unnamed: 0 open
0 2018-02-27 23.53
1 2018-02-26 22.80
2 2018-02-23 22.88
3 2018-02-22 22.25
4 2018-02-14 21.49
5 2018-02-13 21.40
6 2018-02-12 20.70
7 2018-02-09 21.20
8 2018-02-08 21.79
9 2018-02-07 22.69
会发现将索引存入到文件当中,变成单独的一列数据。如果需要删除,可以指定index参数,删除原来的文件,重新保存一次。
# index=False:保存时不会将索引值变成一列数据
data[:10].to_csv("./data/test.csv", columns=['open'], index=False)
2 HDF5
2.1 read_hdf与to_hdf
HDF5文件的读取和存储需要指定一个键,值为要存储的DataFrame
- pandas.read_hdf(path_or_buf, key=None, **kwargs)
  - 从h5文件当中读取数据
  - path_or_buf:文件路径
  - key:读取的键
  - return:The selected object
- DataFrame.to_hdf(path_or_buf, key, **kwargs)
2.2 案例
- 读取文件
day_close = pd.read_hdf("./data/day_close.h5")
如果读取的时候报错提示缺少tables模块,需要安装tables模块才能读取HDF5文件:
pip install tables
- 存储文件
day_close.to_hdf("./data/test.h5", key="day_close")
再次读取的时候, 需要指定键的名字
new_close = pd.read_hdf("./data/test.h5", key="day_close")
注意:优先选择使用HDF5文件存储
- HDF5在存储的时候支持压缩,使用的方式是blosc,这个是速度最快的也是pandas默认支持的
- 使用压缩可以提高磁盘利用率,节省空间
- HDF5还是跨平台的,可以轻松迁移到hadoop 上面
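如果想在保存时启用压缩,可以通过complib、complevel参数指定(一个小示意,文件名仅作举例,假设day_close为上面读取的数据):
# 使用blosc压缩保存,complevel取0-9,数值越大压缩率越高
day_close.to_hdf("./data/test_compressed.h5", key="day_close", complib="blosc", complevel=9)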
3 JSON
JSON是我们常用的一种数据交换格式,前面在前后端的交互经常用到,也会在存储的时候选择这种格式。所以我们需要知道Pandas如何进行读取和存储JSON格式。
3.1 read_json
- pandas.read_json(path_or_buf=None, orient=None, typ='frame', lines=False)
  - 将JSON格式转换成默认的Pandas DataFrame格式
  - orient : string,Indication of expected JSON string format.
    - 'split' : dict like {index -> [index], columns -> [columns], data -> [values]}
      - split 将索引总结到索引,列名到列名,数据到数据,三部分都分开
    - 'records' : list like [{column -> value}, ... , {column -> value}]
      - records 以 columns:values 的形式输出
    - 'index' : dict like {index -> {column -> value}}
      - index 以 index:{columns:values}... 的形式输出
    - 'columns' : dict like {column -> {index -> value}},默认该格式
      - columns 以 columns:{index:values} 的形式输出
    - 'values' : just the values array
      - values 直接输出值
  - lines : boolean, default False
    - 按照每行读取json对象
  - typ : default 'frame',指定转换成的对象类型series或者dataframe
3.2 read_json 案例
- 数据介绍
这里使用一个新闻标题讽刺数据集,格式为json。is_sarcastic:1表示讽刺的,否则为0;headline:新闻报道的标题;article_link:链接到原始新闻文章。存储格式为:
{"article_link": "https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5", "headline": "former versace store clerk sues over secret 'black code' for minority shoppers", "is_sarcastic": 0}
{"article_link": "https://www.huffingtonpost.com/entry/roseanne-revival-review_us_5ab3a497e4b054d118e04365", "headline": "the 'roseanne' revival catches up to our thorny political mood, for better and worse", "is_sarcastic": 0}
- 读取
orient指定存储的json格式,lines指定按照行去变成一个样本
json_read = pd.read_json("./data/Sarcasm_Headlines_Dataset.json", orient="records", lines=True)
结果为:
3.3 to_json
- DataFrame.to_json(path_or_buf=None, orient=None, lines=False)
- 将Pandas 对象存储为json格式
- path_or_buf=None:文件地址
- orient:存储的json形式,{'split', 'records', 'index', 'columns', 'values'}
- lines:一个对象存储为一行
3.4 案例
- 存储文件
json_read.to_json("./data/test.json", orient='records')
结果
[{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/versace-black-code_us_5861fbefe4b0de3a08f600d5","headline":"former versace store clerk sues over secret 'black code' for minority shoppers","is_sarcastic":0},{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/roseanne-revival-review_us_5ab3a497e4b054d118e04365","headline":"the 'roseanne' revival catches up to our thorny political mood, for better and worse","is_sarcastic":0},{"article_link":"https:\/\/local.theonion.com\/mom-starting-to-fear-son-s-web-series-closest-thing-she-1819576697","headline":"mom starting to fear son's web series closest thing she will have to grandchild","is_sarcastic":1},{"article_link":"https:\/\/politics.theonion.com\/boehner-just-wants-wife-to-listen-not-come-up-with-alt-1819574302","headline":"boehner just wants wife to listen, not come up with alternative debt-reduction ideas","is_sarcastic":1},{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/jk-rowling-wishes-snape-happy-birthday_us_569117c4e4b0cad15e64fdcb","headline":"j.k. rowling wishes snape happy birthday in the most magical way","is_sarcastic":0},{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/advancing-the-worlds-women_b_6810038.html","headline":"advancing the world's women","is_sarcastic":0},....]
- 修改lines参数为True
json_read.to_json("./data/test.json", orient='records', lines=True)
结果
{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/versace-black-code_us_5861fbefe4b0de3a08f600d5","headline":"former versace store clerk sues over secret 'black code' for minority shoppers","is_sarcastic":0}
{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/roseanne-revival-review_us_5ab3a497e4b054d118e04365","headline":"the 'roseanne' revival catches up to our thorny political mood, for better and worse","is_sarcastic":0}
{"article_link":"https:\/\/local.theonion.com\/mom-starting-to-fear-son-s-web-series-closest-thing-she-1819576697","headline":"mom starting to fear son's web series closest thing she will have to grandchild","is_sarcastic":1}
{"article_link":"https:\/\/politics.theonion.com\/boehner-just-wants-wife-to-listen-not-come-up-with-alt-1819574302","headline":"boehner just wants wife to listen, not come up with alternative debt-reduction ideas","is_sarcastic":1}
{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/jk-rowling-wishes-snape-happy-birthday_us_569117c4e4b0cad15e64fdcb","headline":"j.k. rowling wishes snape happy birthday in the most magical way","is_sarcastic":0}...
4 小结
- pandas的CSV、HDF5、JSON文件的读取【知道】
- 对象.read_**()
- 对象.to_**()
5.7 高级处理-缺失值处理
1 如何处理nan
- 获取缺失值的标记方式(NaN或者其他标记方式)
- 如果缺失值的标记方式是NaN
  - 判断数据中是否包含NaN:
    - pd.isnull(df)
    - pd.notnull(df)
  - 存在缺失值nan:
    - 1、删除存在缺失值的:dropna(axis='rows')
      - 注:不会修改原数据,需要接受返回值
    - 2、替换缺失值:fillna(value, inplace=True)
      - value:替换成的值
      - inplace:True:会修改原数据;False:不修改原数据,生成新的对象
- 如果缺失值没有使用NaN标记,比如使用"?"
  - 先替换'?'为np.nan,然后继续处理
2 电影数据的缺失值处理
- 电影数据文件获取
# 读取电影数据
movie = pd.read_csv("./data/IMDB-Movie-Data.csv")
2.1 判断缺失值是否存在
- pd.notnull()
pd.notnull(movie)
Rank Title Genre Description Director Actors Year Runtime (Minutes) Rating Votes Revenue (Millions) Metascore
0 True True True True True True True True True True True True
1 True True True True True True True True True True True True
2 True True True True True True True True True True True True
3 True True True True True True True True True True True True
4 True True True True True True True True True True True True
5 True True True True True True True True True True True True
6 True True True True True True True True True True True True
7 True True True True True True True True True True False True
np.all(pd.notnull(movie))
- pd.isnull()
2.2 存在缺失值nan,并且是np.nan
- 1、删除
pandas删除缺失值,使用dropna的前提是,缺失值的类型必须是np.nan
# 不修改原数据
movie.dropna()
# 可以定义新的变量接受或者用原来的变量名
data = movie.dropna()
- 2、替换缺失值
# 替换存在缺失值的样本的两列
# 替换填充平均值,中位数
# movie['Revenue (Millions)'].fillna(movie['Revenue (Millions)'].mean(), inplace=True)
替换所有缺失值:
for i in movie.columns:
    if np.all(pd.notnull(movie[i])) == False:
        print(i)
        movie[i].fillna(movie[i].mean(), inplace=True)
2.3 不是缺失值nan,有默认标记的
数据是这样的:
wis = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data")
以上数据在读取时,可能会报如下错误:
URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:833)>
解决办法:
# 全局取消证书验证
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
处理思路分析:
- 1、先替换‘?’为np.nan
- df.replace(to_replace=, value=)
- to_replace:替换前的值
- value:替换后的值
- df.replace(to_replace=, value=)
# 把一些其它值标记的缺失值,替换成np.nan
wis = wis.replace(to_replace='?', value=np.nan)
- 2、再进行缺失值的处理
# 删除
wis = wis.dropna()
3 小结
- isnull、notnull判断是否存在缺失值【知道】
- np.any(pd.isnull(movie)) # 里面如果有一个缺失值,就返回True
- np.all(pd.notnull(movie)) # 里面如果有一个缺失值,就返回False
- dropna删除np.nan标记的缺失值【知道】
- movie.dropna()
- fillna填充缺失值【知道】
- movie[i].fillna(value=movie[i].mean(), inplace=True)
- replace替换具体某些值【知道】
- wis.replace(to_replace="?", value=np.NaN)
5.8 高级处理-数据离散化
1 为什么要离散化
连续属性离散化的目的是为了简化数据结构,数据离散化技术可以用来减少给定连续属性值的个数。离散化方法经常作为数据挖掘的工具。
2 什么是数据的离散化
连续属性的离散化就是在连续属性的值域上,将值域划分为若干个离散的区间,最后用不同的符号或整数 值代表落在每个子区间中的属性值。
离散化有很多种方法,这里使用一种最简单的方式去操作
- 原始人的身高数据:165,174,160,180,159,163,192,184
- 假设按照身高分几个区间段:150~165, 165~180,180~195
这样我们将数据分到了三个区间段,可以对应地标记为矮、中、高三个类别,最终要处理成一个"哑变量"矩阵
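按照这个思路,下面给出一个小示意(区间边界按pandas默认的左开右闭处理,仅作演示):
import pandas as pd

# 上文的身高数据
height = pd.Series([165, 174, 160, 180, 159, 163, 192, 184])
# 按150~165、165~180、180~195划分区间,并打上矮/中/高的标签
groups = pd.cut(height, bins=[150, 165, 180, 195], labels=["矮", "中", "高"])
# 转成"哑变量"矩阵(one-hot编码)
pd.get_dummies(groups, prefix="身高")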
3 股票的涨跌幅离散化
我们对股票每日的"p_change"进行离散化
3.1 读取股票的数据
先读取股票的数据,筛选出p_change数据
data = pd.read_csv("./data/stock_day.csv")
p_change= data['p_change']
3.2 将股票涨跌幅数据进行分组
使用的工具:
- pd.qcut(data, q):
  - 对数据进行等频分组(每组数量大致相同),一般会与value_counts搭配使用,统计每组的个数
- series.value_counts():统计分组次数
# 自行分组
qcut = pd.qcut(p_change, 10)
# 计算分到每个组数据个数
qcut.value_counts()
自定义区间分组:
- pd.cut(data, bins)
# 自己指定分组区间
bins = [-100, -7, -5, -3, 0, 3, 5, 7, 100]
p_counts = pd.cut(p_change, bins)
3.3 股票涨跌幅分组数据变成one-hot编码
- 什么是one-hot编码
把每个类别生成一个布尔列,每个样本只在其所属类别对应的那一列取值为1,其余列为0,因此又被称为独热编码。
把下图中左边的表格转化为使用右边形式进行表示:
- pandas.get_dummies(data, prefix=None)
  - data:array-like, Series, or DataFrame
  - prefix:分组名字
# 得出one-hot编码矩阵
dummies = pd.get_dummies(p_counts, prefix="rise")
4 小结
- 数据离散化【知道】
- 可以用来减少给定连续属性值的个数
- 在连续属性的值域上,将值域划分为若干个离散的区间,最后用不同的符号或整数值代表落在每个子区间中的属性值。
- qcut、cut实现数据分组【知道】
- qcut:大致分为相同的几组
- cut:自定义分组区间
- get_dummies实现哑变量矩阵【知道】
5.9 高级处理-合并
如果你的数据由多张表组成,那么有时候需要将不同的内容合并在一起分析
1 pd.concat实现数据合并
- pd.concat([data1, data2], axis=1)
- 按照行或列进行合并:axis=0沿行方向纵向拼接(按列索引对齐),axis=1沿列方向横向拼接(按行索引对齐)
比如我们将刚才处理好的one-hot编码与原数据合并
# 按照行索引进行
pd.concat([data, dummies], axis=1)
2 pd.merge
- pd.merge(left, right, how='inner', on=None)
- 可以指定按照两组数据的共同键值对合并或者左右各自
- left : DataFrame
- right : 另一个DataFrame
- on : 指定的共同键
- how : 按照什么方式连接
Merge method | SQL Join Name | Description |
---|---|---|
left | LEFT OUTER JOIN | Use keys from left frame only |
right | RIGHT OUTER JOIN | Use keys from right frame only |
outer | FULL OUTER JOIN | Use union of keys from both frames |
inner | INNER JOIN | Use intersection of keys from both frames |
2.1 pd.merge合并
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
# 默认内连接
result = pd.merge(left, right, on=['key1', 'key2'])
- 左连接
result = pd.merge(left, right, how='left', on=['key1', 'key2'])
- 右连接
result = pd.merge(left, right, how='right', on=['key1', 'key2'])
- 外连接
result = pd.merge(left, right, how='outer', on=['key1', 'key2'])
3 总结
- pd.concat([数据1, 数据2], axis=**)【知道】
- pd.merge(left, right, how=, on=)【知道】
- how -- 以何种方式连接
- on -- 连接的键的依据是哪几个
5.10 高级处理-交叉表与透视表
1 交叉表与透视表什么作用
探究股票的涨跌与星期几是否有关?
以下图当中表示,week代表星期几,1,0代表这一天股票的涨跌幅是好还是坏,里面的数据代表比例
可以理解为所有时间为星期一等等的数据当中涨跌幅好坏的比例
- 交叉表:交叉表用于计算一列数据对于另外一列数据的分组个数(用于统计分组频率的特殊透视表)
- pd.crosstab(value1, value2)
- 透视表:透视表是将原有的DataFrame的列分别作为行索引和列索引,然后对指定的列应用聚集函数
- data.pivot_table()
- DataFrame.pivot_table([], index=[])
2 案例分析
2.1 数据准备
- 准备两列数据,星期数据以及涨跌幅是好是坏数据
- 进行交叉表计算
# 寻找星期几跟股票涨跌的关系
# 1、先把对应的日期找到星期几
date = pd.to_datetime(data.index).weekday
data['week'] = date
# 2、把p_change按照大小分类,以0为界限
data['posi_neg'] = np.where(data['p_change'] > 0, 1, 0)
# 通过交叉表找寻两列数据的关系
count = pd.crosstab(data['week'], data['posi_neg'])
但是我们看到count只是每个星期日子的好坏天数,并没有得到比例,该怎么去做?
- 对于每个星期一等的总天数求和,运用除法运算求出比例
# 算数运算,先求和
sum = count.sum(axis=1).astype(np.float32)
# 进行相除操作,得出比例
pro = count.div(sum, axis=0)
2.2 查看效果
使用plot画出这个比例,使用stacked的柱状图
pro.plot(kind='bar', stacked=True)
plt.show()
2.3 使用pivot_table(透视表)实现
使用透视表,刚才的过程更加简单
# 通过透视表,将整个过程变成更简单一些
data.pivot_table(['posi_neg'], index='week')
3 小结
- 交叉表与透视表的作用【知道】
- 交叉表:计算一列数据对于另外一列数据的分组个数
- 透视表:指定某一列对另一列的关系
5.11 高级处理-分组与聚合
分组与聚合通常是分析数据的一种方式,通常与一些统计函数一起使用,查看数据的分组情况
想一想其实刚才的交叉表与透视表也有分组的功能,所以算是分组的一种形式,只不过他们主要是计算次数或者计算比例!!看其中的效果:
1 什么是分组与聚合
2 分组API
- DataFrame.groupby(key, as_index=False)
- key:分组的列数据,可以多个
- 案例:不同颜色的不同笔的价格数据
col =pd.DataFrame({'color': ['white','red','green','red','green'], 'object': ['pen','pencil','pencil','ashtray','pen'],'price1':[5.56,4.20,1.30,0.56,2.75],'price2':[4.75,4.12,1.60,0.75,3.15]})
color object price1 price2
0 white pen 5.56 4.75
1 red pencil 4.20 4.12
2 green pencil 1.30 1.60
3 red ashtray 0.56 0.75
4 green pen 2.75 3.15
- 进行分组,对颜色分组,price进行聚合
# 分组,求平均值
col.groupby(['color'])['price1'].mean()
col['price1'].groupby(col['color']).mean()
color
green 2.025
red 2.380
white 5.560
Name: price1, dtype: float64
# 分组,数据的结构不变
col.groupby(['color'], as_index=False)['price1'].mean()
color price1
0 green 2.025
1 red 2.380
2 white 5.560
3 星巴克零售店铺数据
现在我们有一组关于全球星巴克店铺的统计数据,如果我想知道美国的星巴克数量和中国的哪个多,或者我想知道中国每个省份星巴克的数量的情况,那么应该怎么办?
3.1 数据获取
从文件中读取星巴克店铺数据
# 导入星巴克店的数据
starbucks = pd.read_csv("./data/starbucks/directory.csv")
3.2 进行分组聚合
# 按照国家分组,求出每个国家的星巴克零售店数量
count = starbucks.groupby(['Country']).count()
画图显示结果
count['Brand'].plot(kind='bar', figsize=(20, 8))
plt.show()
假设我们加入省市一起进行分组
# 设置多个索引,set_index()
starbucks.groupby(['Country', 'State/Province']).count()
仔细观察这个结构,与我们前面讲的哪个结构类似??
与前面的MultiIndex结构类似
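既然分组结果带有MultiIndex,就可以像前面一样用loc按外层索引取子集。下面是一个小示意(假设数据中国家以'CN'这样的代码表示,且包含Brand列):
# 按国家、省份分组计数
count2 = starbucks.groupby(['Country', 'State/Province']).count()
# 利用MultiIndex的外层索引,取出中国各省份的门店数量
count2.loc['CN']['Brand'].sort_values(ascending=False).head()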
4 小结
- groupby进行数据的分组【知道】
- pandas中,抛开聚合谈分组,无意义
案例
1 需求
现在我们有一组从2006年到2016年1000部最流行的电影数据
数据来源:https://www.kaggle.com/damianpanek/sunday-eda/data
- 问题1:我们想知道这些电影数据中评分的平均分,导演的人数等信息,我们应该怎么获取?
- 问题2:对于这一组电影数据,如果我们想知道Rating、Runtime (Minutes)的分布情况,应该如何呈现数据?
- 问题3:对于这一组电影数据,如果我们希望统计电影分类(genre)的情况,应该如何处理数据?
2 实现
首先获取导入包,获取数据
%matplotlib inline
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
#文件的路径
path = "./data/IMDB-Movie-Data.csv"
#读取文件
df = pd.read_csv(path)
2.1 问题一:
我们想知道这些电影数据中评分的平均分,导演的人数等信息,我们应该怎么获取?
- 得出评分的平均分
使用mean函数
df["Rating"].mean()
- 得出导演人数信息
求出唯一值,然后进行形状获取
## 导演的人数
# df["Director"].unique().shape[0]
np.unique(df["Director"]).shape[0]
644
2.2 问题二:
对于这一组电影数据,如果我们想知道Rating、Runtime (Minutes)的分布情况,应该如何呈现数据?
- 直接呈现,以直方图的形式
选择分数列数据,进行plot
df["Rating"].plot(kind='hist',figsize=(20,8))
- Rating进行分布展示
进行绘制直方图
plt.figure(figsize=(20,8),dpi=80)
plt.hist(df["Rating"].values,bins=20)
plt.show()
修改刻度的间隔
# 求出最大最小值
max_ = df["Rating"].max()
min_ = df["Rating"].min()
# 生成刻度列表
t1 = np.linspace(min_,max_,num=21)
# [ 1.9 2.255 2.61 2.965 3.32 3.675 4.03 4.385 4.74 5.095 5.45 5.805 6.16 6.515 6.87 7.225 7.58 7.935 8.29 8.645 9. ]
# 修改刻度
plt.xticks(t1)
# 添加网格
plt.grid()
- Runtime (Minutes)进行分布展示
进行绘制直方图
plt.figure(figsize=(20,8),dpi=80)
plt.hist(df["Runtime (Minutes)"].values,bins=20)
plt.show()
修改间隔
# 求出最大最小值
max_ = df["Runtime (Minutes)"].max()
min_ = df["Runtime (Minutes)"].min()
# # 生成刻度列表
t1 = np.linspace(min_,max_,num=21)
# 修改刻度
plt.xticks(np.linspace(min_,max_,num=21))
# 添加网格
plt.grid()
2.3 问题三:
对于这一组电影数据,如果我们希望统计电影分类(genre)的情况,应该如何处理数据?
- 思路分析
- 思路
- 1、创建一个全为0的dataframe,列索引置为电影的分类,temp_df
- 2、遍历每一部电影,temp_df中把分类出现的列的值置为1
- 3、求和
- 1、创建一个全为0的dataframe,列索引置为电影的分类,temp_df
# 进行字符串分割
temp_list = [i.split(",") for i in df["Genre"]]
# 获取电影的分类
genre_list = np.unique([i for j in temp_list for i in j])
# 增加新的列
temp_df = pd.DataFrame(np.zeros([df.shape[0],genre_list.shape[0]]),columns=genre_list)
- 2、遍历每一部电影,temp_df中把分类出现的列的值置为1
for i in range(1000):
    # temp_list[i] 形如 ['Action', 'Adventure', 'Animation']
    temp_df.ix[i, temp_list[i]] = 1
print(temp_df.sum().sort_values())
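补充:新版本的pandas已经移除了ix索引器,上面的标记操作可以改用loc完成(示意):
# 用loc按行标签和列标签列表进行同样的标记
for i in range(1000):
    temp_df.loc[i, temp_list[i]] = 1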
- 3、求和,绘图
temp_df.sum().sort_values(ascending=False).plot(kind="bar",figsize=(20,8),fontsize=20,colormap="cool")
Musical 5.0
Western 7.0
War 13.0
Music 16.0
Sport 18.0
History 29.0
Animation 49.0
Family 51.0
Biography 81.0
Fantasy 101.0
Mystery 106.0
Horror 119.0
Sci-Fi 120.0
Romance 141.0
Crime 150.0
Thriller 195.0
Adventure 259.0
Comedy 279.0
Action 303.0
Drama 513.0
dtype: float64
series
series的创建
import pandas as pd
import numpy as np
pd.Series(np.arange(9))
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
dtype: int64
pd.Series([1.2, 2.3, 4.5, 5.0], index=[1, 2, 3, 4])
1 1.2
2 2.3
3 4.5
4 5.0
dtype: float64
color_count = pd.Series({"red":10, "green":20, "blue":100})
color_count
blue 100
green 20
red 10
dtype: int64
series的属性
color_count.index
Index(['blue', 'green', 'red'], dtype='object')
color_count.values
array([100, 20, 10])
color_count[0]
100
color_count[1]
20
DataFrame
DataFrame创建
pd.DataFrame(np.random.randn(2,3))
score = np.random.randint(40, 100, (10, 5))
score
array([[98, 85, 84, 93, 83],
[73, 68, 64, 47, 87],
[91, 58, 86, 92, 46],
[62, 86, 99, 75, 66],
[49, 73, 46, 61, 81],
[43, 96, 47, 65, 90],
[98, 84, 60, 73, 54],
[59, 93, 58, 83, 43],
[62, 66, 51, 88, 89],
[66, 60, 63, 57, 51]])
score_df = pd.DataFrame(score)
score_df
subjects = ["语文", "数学", "英语", "政治", "体育"]
stu = ["同学"+ str(i) for i in range(score_df.shape[0])]
data = pd.DataFrame(score, columns=subjects, index=stu)
stu
['同学0', '同学1', '同学2', '同学3', '同学4', '同学5', '同学6', '同学7', '同学8', '同学9']
data
DataFrame的属性
data.shape
(10, 5)
data.index
Index(['同学0', '同学1', '同学2', '同学3', '同学4', '同学5', '同学6', '同学7', '同学8', '同学9'], dtype='object')
data.columns
Index(['语文', '数学', '英语', '政治', '体育'], dtype='object')
data.values
array([[98, 85, 84, 93, 83],
[73, 68, 64, 47, 87],
[91, 58, 86, 92, 46],
[62, 86, 99, 75, 66],
[49, 73, 46, 61, 81],
[43, 96, 47, 65, 90],
[98, 84, 60, 73, 54],
[59, 93, 58, 83, 43],
[62, 66, 51, 88, 89],
[66, 60, 63, 57, 51]])
data.T
DataFrame索引值的设置
stu = ["同学_"+ str(i) for i in range(score_df.shape[0])]
data.index = stu
# stu
data
# data.index[2] = "同学__"
data.reset_index()
data.reset_index(drop=True)
df = pd.DataFrame({'month': [1, 4, 7, 10],
'year': [2012, 2014, 2013, 2014],
'sale':[55, 40, 84, 31]})
df
df.set_index("year")
df = df.set_index(["year", "month"])
df
MultiIndex与Panel
MultiIndex
df
df.index
MultiIndex(levels=[[2012, 2013, 2014], [1, 4, 7, 10]],
labels=[[0, 2, 1, 2], [0, 1, 2, 3]],
names=['year', 'month'])
df.index.names
FrozenList(['year', 'month'])
df.index.levels
FrozenList([[2012, 2013, 2014], [1, 4, 7, 10]])
arrays = [[1, 1, 2, 2], ["r", "b", "r","b"]]
pd.MultiIndex.from_arrays(arrays, names=("num", "col"))
MultiIndex(levels=[[1, 2], ['b', 'r']],
labels=[[0, 0, 1, 1], [1, 0, 1, 0]],
names=['num', 'col'])
panel
p = pd.Panel(data=np.arange(24).reshape(4,3,2),
items=list('ABCD'),
major_axis=pd.date_range('20130101', periods=3),
minor_axis=['first', 'second'])
p
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 3 (major_axis) x 2 (minor_axis)
Items axis: A to D
Major_axis axis: 2013-01-01 00:00:00 to 2013-01-03 00:00:00
Minor_axis axis: first to second
In [1]:
import pandas as pd
In [2]:
data = pd.read_csv("./data/stock_day.csv")
In [4]:
data.head()
Out[4]:
open | high | close | low | volume | price_change | p_change | ma5 | ma10 | ma20 | v_ma5 | v_ma10 | v_ma20 | turnover | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2018-02-27 | 23.53 | 25.88 | 24.16 | 23.53 | 95578.03 | 0.63 | 2.68 | 22.942 | 22.142 | 22.875 | 53782.64 | 46738.65 | 55576.11 | 2.39 |
2018-02-26 | 22.80 | 23.78 | 23.53 | 22.80 | 60985.11 | 0.69 | 3.02 | 22.406 | 21.955 | 22.942 | 40827.52 | 42736.34 | 56007.50 | 1.53 |
2018-02-23 | 22.88 | 23.37 | 22.82 | 22.71 | 52914.01 | 0.54 | 2.42 | 21.938 | 21.929 | 23.022 | 35119.58 | 41871.97 | 56372.85 | 1.32 |
2018-02-22 | 22.25 | 22.76 | 22.28 | 22.02 | 36105.01 | 0.36 | 1.64 | 21.446 | 21.909 | 23.137 | 35397.58 | 39904.78 | 60149.60 | 0.90 |
2018-02-14 | 21.49 | 21.99 | 21.92 | 21.48 | 23331.04 | 0.44 | 2.05 | 21.366 | 21.923 | 23.253 | 33590.21 | 42935.74 | 61716.11 | 0.58 |
In [5]:
data = data.drop(["ma5","ma10","ma20","v_ma5","v_ma10","v_ma20"], axis=1)
In [7]:
data.head()
Out[7]:
open | high | close | low | volume | price_change | p_change | turnover | |
---|---|---|---|---|---|---|---|---|
2018-02-27 | 23.53 | 25.88 | 24.16 | 23.53 | 95578.03 | 0.63 | 2.68 | 2.39 |
2018-02-26 | 22.80 | 23.78 | 23.53 | 22.80 | 60985.11 | 0.69 | 3.02 | 1.53 |
2018-02-23 | 22.88 | 23.37 | 22.82 | 22.71 | 52914.01 | 0.54 | 2.42 | 1.32 |
2018-02-22 | 22.25 | 22.76 | 22.28 | 22.02 | 36105.01 | 0.36 | 1.64 | 0.90 |
2018-02-14 | 21.49 | 21.99 | 21.92 | 21.48 | 23331.04 | 0.44 | 2.05 | 0.58 |
索引操作
In [8]:
data["open"]["2018-02-27"] # 直接索引,必须是先列后行
Out[8]:
23.53
In [11]:
# data["2018-02-27"]["open"]
In [13]:
# data[:1 ,:2]
In [14]:
data.loc["2018-02-27":"2018-02-14", "open":"close"]
Out[14]:
open | high | close | |
---|---|---|---|
2018-02-27 | 23.53 | 25.88 | 24.16 |
2018-02-26 | 22.80 | 23.78 | 23.53 |
2018-02-23 | 22.88 | 23.37 | 22.82 |
2018-02-22 | 22.25 | 22.76 | 22.28 |
2018-02-14 | 21.49 | 21.99 | 21.92 |
In [15]:
data.iloc[:5, :3]
Out[15]:
open | high | close | |
---|---|---|---|
2018-02-27 | 23.53 | 25.88 | 24.16 |
2018-02-26 | 22.80 | 23.78 | 23.53 |
2018-02-23 | 22.88 | 23.37 | 22.82 |
2018-02-22 | 22.25 | 22.76 | 22.28 |
2018-02-14 | 21.49 | 21.99 | 21.92 |
In [16]:
data.ix[0:5, ["open", "close"]]
Out[16]:
open | close | |
---|---|---|
2018-02-27 | 23.53 | 24.16 |
2018-02-26 | 22.80 | 23.53 |
2018-02-23 | 22.88 | 22.82 |
2018-02-22 | 22.25 | 22.28 |
2018-02-14 | 21.49 | 21.92 |
In [17]:
data.loc[data.index[0:5], ["open", "close"]]
Out[17]:
open | close | |
---|---|---|
2018-02-27 | 23.53 | 24.16 |
2018-02-26 | 22.80 | 23.53 |
2018-02-23 | 22.88 | 22.82 |
2018-02-22 | 22.25 | 22.28 |
2018-02-14 | 21.49 | 21.92 |
In [19]:
data.columns.get_indexer(["open", "close"])
Out[19]:
array([0, 2])
In [20]:
data.iloc[0:5, data.columns.get_indexer(["open", "close"])]
Out[20]:
open | close | |
---|---|---|
2018-02-27 | 23.53 | 24.16 |
2018-02-26 | 22.80 | 23.53 |
2018-02-23 | 22.88 | 22.82 |
2018-02-22 | 22.25 | 22.28 |
2018-02-14 | 21.49 | 21.92 |
赋值操作
In [22]:
data["close"] = 1
In [24]:
data.head()
Out[24]:
open | high | close | low | volume | price_change | p_change | turnover | |
---|---|---|---|---|---|---|---|---|
2018-02-27 | 23.53 | 25.88 | 1 | 23.53 | 95578.03 | 0.63 | 2.68 | 2.39 |
2018-02-26 | 22.80 | 23.78 | 1 | 22.80 | 60985.11 | 0.69 | 3.02 | 1.53 |
2018-02-23 | 22.88 | 23.37 | 1 | 22.71 | 52914.01 | 0.54 | 2.42 | 1.32 |
2018-02-22 | 22.25 | 22.76 | 1 | 22.02 | 36105.01 | 0.36 | 1.64 | 0.90 |
2018-02-14 | 21.49 | 21.99 | 1 | 21.48 | 23331.04 | 0.44 | 2.05 | 0.58 |
In [26]:
data.close = 10
In [27]:
data.head()
Out[27]:
open | high | close | low | volume | price_change | p_change | turnover | |
---|---|---|---|---|---|---|---|---|
2018-02-27 | 23.53 | 25.88 | 10 | 23.53 | 95578.03 | 0.63 | 2.68 | 2.39 |
2018-02-26 | 22.80 | 23.78 | 10 | 22.80 | 60985.11 | 0.69 | 3.02 | 1.53 |
2018-02-23 | 22.88 | 23.37 | 10 | 22.71 | 52914.01 | 0.54 | 2.42 | 1.32 |
2018-02-22 | 22.25 | 22.76 | 10 | 22.02 | 36105.01 | 0.36 | 1.64 | 0.90 |
2018-02-14 | 21.49 | 21.99 | 10 | 21.48 | 23331.04 | 0.44 | 2.05 | 0.58 |
排序
In [30]:
data.sort_values(by="open", ascending=False).head()
Out[30]:
open | high | close | low | volume | price_change | p_change | turnover | |
---|---|---|---|---|---|---|---|---|
2015-06-15 | 34.99 | 34.99 | 10 | 31.69 | 199369.53 | -3.52 | -10.00 | 6.82 |
2015-06-12 | 34.69 | 35.98 | 10 | 34.01 | 159825.88 | 0.82 | 2.38 | 5.47 |
2015-06-10 | 34.10 | 36.35 | 10 | 32.23 | 269033.12 | 0.51 | 1.53 | 9.21 |
2017-11-01 | 33.85 | 34.34 | 10 | 33.10 | 232325.30 | -0.61 | -1.77 | 5.81 |
2015-06-11 | 33.17 | 34.98 | 10 | 32.51 | 173075.73 | 0.54 | 1.59 | 5.92 |
In [32]:
data.sort_values(by=["open", "high"]).head()
Out[32]:
open | high | close | low | volume | price_change | p_change | turnover | |
---|---|---|---|---|---|---|---|---|
2015-03-02 | 12.25 | 12.67 | 10 | 12.20 | 96291.73 | 0.32 | 2.62 | 3.30 |
2015-09-02 | 12.30 | 14.11 | 10 | 12.30 | 70201.74 | -1.10 | -8.17 | 2.40 |
2015-03-03 | 12.52 | 13.06 | 10 | 12.52 | 139071.61 | 0.18 | 1.44 | 4.76 |
2015-03-04 | 12.80 | 12.92 | 10 | 12.61 | 67075.44 | 0.20 | 1.57 | 2.30 |
2015-03-05 | 12.88 | 13.45 | 10 | 12.87 | 93180.39 | 0.26 | 2.02 | 3.19 |
In [34]:
data.sort_index().head()
Out[34]:
open | high | close | low | volume | price_change | p_change | turnover | |
---|---|---|---|---|---|---|---|---|
2015-03-02 | 12.25 | 12.67 | 10 | 12.20 | 96291.73 | 0.32 | 2.62 | 3.30 |
2015-03-03 | 12.52 | 13.06 | 10 | 12.52 | 139071.61 | 0.18 | 1.44 | 4.76 |
2015-03-04 | 12.80 | 12.92 | 10 | 12.61 | 67075.44 | 0.20 | 1.57 | 2.30 |
2015-03-05 | 12.88 | 13.45 | 10 | 12.87 | 93180.39 | 0.26 | 2.02 | 3.19 |
2015-03-06 | 13.17 | 14.48 | 10 | 13.13 | 179831.72 | 1.12 | 8.51 | 6.16 |
In [37]:
data["high"].sort_values().head()
Out[37]:
2015-03-02 12.67
2015-03-04 12.92
2015-03-03 13.06
2015-09-07 13.38
2015-03-05 13.45
Name: high, dtype: float64
In [39]:
data["high"].sort_index().head()
Out[39]:
2015-03-02 12.67
2015-03-03 13.06
2015-03-04 12.92
2015-03-05 13.45
2015-03-06 14.48
Name: high, dtype: float64
In [79]:
import pandas as pd
import matplotlib.pyplot as plt
In [35]:
data = pd.read_csv("./data/stock_day.csv")
In [84]:
data.head()
Out[84]:
open | high | close | low | volume | price_change | p_change | turnover | |
---|---|---|---|---|---|---|---|---|
2015-03-02 | 12.25 | 12.67 | 12.52 | 12.20 | 96291.73 | 0.32 | 2.62 | 3.30 |
2015-03-03 | 12.52 | 13.06 | 12.70 | 12.52 | 139071.61 | 0.18 | 1.44 | 4.76 |
2015-03-04 | 12.80 | 12.92 | 12.90 | 12.61 | 67075.44 | 0.20 | 1.57 | 2.30 |
2015-03-05 | 12.88 | 13.45 | 13.16 | 12.87 | 93180.39 | 0.26 | 2.02 | 3.19 |
2015-03-06 | 13.17 | 14.48 | 14.28 | 13.13 | 179831.72 | 1.12 | 8.51 | 6.16 |
In [37]:
data = data.drop(["ma5","ma10","ma20","v_ma5","v_ma10","v_ma20"], axis=1)
In [38]:
data.head()
Out[38]:
open | high | close | low | volume | price_change | p_change | turnover | |
---|---|---|---|---|---|---|---|---|
2018-02-27 | 23.53 | 25.88 | 24.16 | 23.53 | 95578.03 | 0.63 | 2.68 | 2.39 |
2018-02-26 | 22.80 | 23.78 | 23.53 | 22.80 | 60985.11 | 0.69 | 3.02 | 1.53 |
2018-02-23 | 22.88 | 23.37 | 22.82 | 22.71 | 52914.01 | 0.54 | 2.42 | 1.32 |
2018-02-22 | 22.25 | 22.76 | 22.28 | 22.02 | 36105.01 | 0.36 | 1.64 | 0.90 |
2018-02-14 | 21.49 | 21.99 | 21.92 | 21.48 | 23331.04 | 0.44 | 2.05 | 0.58 |
算术运算
In [41]:
data["open"].add(10).head()
Out[41]:
2018-02-27 33.53
2018-02-26 32.80
2018-02-23 32.88
2018-02-22 32.25
2018-02-14 31.49
Name: open, dtype: float64
In [43]:
# data["open"]+10 # 一般不会这么使用
逻辑运算
In [45]:
# data["close"] > 20
In [48]:
data[data["open"] > 23].head()
Out[48]:
open | high | close | low | volume | price_change | p_change | turnover | |
---|---|---|---|---|---|---|---|---|
2018-02-27 | 23.53 | 25.88 | 24.16 | 23.53 | 95578.03 | 0.63 | 2.68 | 2.39 |
2018-02-01 | 23.71 | 23.86 | 22.42 | 22.22 | 66414.64 | -1.30 | -5.48 | 1.66 |
2018-01-31 | 23.85 | 23.98 | 23.72 | 23.31 | 49155.02 | -0.11 | -0.46 | 1.23 |
2018-01-30 | 23.71 | 24.08 | 23.83 | 23.70 | 32420.43 | 0.05 | 0.21 | 0.81 |
2018-01-29 | 24.40 | 24.63 | 23.77 | 23.72 | 65469.81 | -0.73 | -2.98 | 1.64 |
In [51]:
data[(data["open"]>23)&(data["open"]<24)].head()
Out[51]:
open | high | close | low | volume | price_change | p_change | turnover | |
---|---|---|---|---|---|---|---|---|
2018-02-27 | 23.53 | 25.88 | 24.16 | 23.53 | 95578.03 | 0.63 | 2.68 | 2.39 |
2018-02-01 | 23.71 | 23.86 | 22.42 | 22.22 | 66414.64 | -1.30 | -5.48 | 1.66 |
2018-01-31 | 23.85 | 23.98 | 23.72 | 23.31 | 49155.02 | -0.11 | -0.46 | 1.23 |
2018-01-30 | 23.71 | 24.08 | 23.83 | 23.70 | 32420.43 | 0.05 | 0.21 | 0.81 |
2018-01-16 | 23.40 | 24.60 | 24.40 | 23.30 | 101295.42 | 0.96 | 4.10 | 2.54 |
In [53]:
data.query("open<24 & open>23").head()
Out[53]:
open | high | close | low | volume | price_change | p_change | turnover | |
---|---|---|---|---|---|---|---|---|
2018-02-27 | 23.53 | 25.88 | 24.16 | 23.53 | 95578.03 | 0.63 | 2.68 | 2.39 |
2018-02-01 | 23.71 | 23.86 | 22.42 | 22.22 | 66414.64 | -1.30 | -5.48 | 1.66 |
2018-01-31 | 23.85 | 23.98 | 23.72 | 23.31 | 49155.02 | -0.11 | -0.46 | 1.23 |
2018-01-30 | 23.71 | 24.08 | 23.83 | 23.70 | 32420.43 | 0.05 | 0.21 | 0.81 |
2018-01-16 | 23.40 | 24.60 | 24.40 | 23.30 | 101295.42 | 0.96 | 4.10 | 2.54 |
In [57]:
data[data["open"].isin([23.23, 23.71])]
Out[57]:
open | high | close | low | volume | price_change | p_change | turnover | |
---|---|---|---|---|---|---|---|---|
2018-02-01 | 23.71 | 23.86 | 22.42 | 22.22 | 66414.64 | -1.30 | -5.48 | 1.66 |
2018-01-30 | 23.71 | 24.08 | 23.83 | 23.70 | 32420.43 | 0.05 | 0.21 | 0.81 |
2017-12-19 | 23.23 | 23.66 | 23.46 | 23.23 | 43068.70 | 0.31 | 1.34 | 1.08 |
统计运算
In [58]:
data.describe()
Out[58]:
open | high | close | low | volume | price_change | p_change | turnover | |
---|---|---|---|---|---|---|---|---|
count | 643.000000 | 643.000000 | 643.000000 | 643.000000 | 643.000000 | 643.000000 | 643.000000 | 643.000000 |
mean | 21.272706 | 21.900513 | 21.336267 | 20.771835 | 99905.519114 | 0.018802 | 0.190280 | 2.936190 |
std | 3.930973 | 4.077578 | 3.942806 | 3.791968 | 73879.119354 | 0.898476 | 4.079698 | 2.079375 |
min | 12.250000 | 12.670000 | 12.360000 | 12.200000 | 1158.120000 | -3.520000 | -10.030000 | 0.040000 |
25% | 19.000000 | 19.500000 | 19.045000 | 18.525000 | 48533.210000 | -0.390000 | -1.850000 | 1.360000 |
50% | 21.440000 | 21.970000 | 21.450000 | 20.980000 | 83175.930000 | 0.050000 | 0.260000 | 2.500000 |
75% | 23.400000 | 24.065000 | 23.415000 | 22.850000 | 127580.055000 | 0.455000 | 2.305000 | 3.915000 |
max | 34.990000 | 36.350000 | 35.210000 | 34.010000 | 501915.410000 | 3.030000 | 10.030000 | 12.560000 |
In [60]:
data.max(0)
Out[60]:
open 34.99
high 36.35
close 35.21
low 34.01
volume 501915.41
price_change 3.03
p_change 10.03
turnover 12.56
dtype: float64
In [62]:
# data.max(1)
In [63]:
df = pd.DataFrame({'COL1' : [2,3,4,5,4,2],
'COL2' : [0,1,2,3,4,2]})
In [64]:
df
Out[64]:
COL1 | COL2 | |
---|---|---|
0 | 2 | 0 |
1 | 3 | 1 |
2 | 4 | 2 |
3 | 5 | 3 |
4 | 4 | 4 |
5 | 2 | 2 |
In [65]:
df.median()
Out[65]:
COL1 3.5
COL2 2.0
dtype: float64
In [66]:
data.idxmax()
Out[66]:
open 2015-06-15
high 2015-06-10
close 2015-06-12
low 2015-06-12
volume 2017-10-26
price_change 2015-06-09
p_change 2015-08-28
turnover 2017-10-26
dtype: object
In [67]:
data.idxmin()
Out[67]:
open 2015-03-02
high 2015-03-02
close 2015-09-02
low 2015-03-02
volume 2016-07-06
price_change 2015-06-15
p_change 2015-09-01
turnover 2016-07-06
dtype: object
In [70]:
data = data.sort_index()
In [71]:
data.head()
Out[71]:
open | high | close | low | volume | price_change | p_change | turnover | |
---|---|---|---|---|---|---|---|---|
2015-03-02 | 12.25 | 12.67 | 12.52 | 12.20 | 96291.73 | 0.32 | 2.62 | 3.30 |
2015-03-03 | 12.52 | 13.06 | 12.70 | 12.52 | 139071.61 | 0.18 | 1.44 | 4.76 |
2015-03-04 | 12.80 | 12.92 | 12.90 | 12.61 | 67075.44 | 0.20 | 1.57 | 2.30 |
2015-03-05 | 12.88 | 13.45 | 13.16 | 12.87 | 93180.39 | 0.26 | 2.02 | 3.19 |
2015-03-06 | 13.17 | 14.48 | 14.28 | 13.13 | 179831.72 | 1.12 | 8.51 | 6.16 |
In [73]:
stock_rise = data["p_change"]
In [75]:
# stock_rise.cumsum()
In [80]:
stock_rise.cumsum().plot()
plt.show()
自定义运算
In [88]:
data[["open", "close"]].apply(lambda x: x.max()-x.min(), axis=0)
Out[88]:
open 22.74
close 22.85
dtype: float64
In [1]:
import pandas as pd
csv
In [2]:
data = pd.read_csv("./data/stock_day.csv", usecols=["open","close"])
In [3]:
data.head()
Out[3]:
open | close | |
---|---|---|
2018-02-27 | 23.53 | 24.16 |
2018-02-26 | 22.80 | 23.53 |
2018-02-23 | 22.88 | 22.82 |
2018-02-22 | 22.25 | 22.28 |
2018-02-14 | 21.49 | 21.92 |
In [4]:
# data.to_csv("./data/test.csv", columns=["close"])
data.to_csv("./data/test.csv", columns=["close"], index=False)
In [5]:
data = pd.read_csv("./data/test.csv")
In [6]:
data.head()
Out[6]:
close | |
---|---|
0 | 24.16 |
1 | 23.53 |
2 | 22.82 |
3 | 22.28 |
4 | 21.92 |
hdf5
In [7]:
day_close = pd.read_hdf("./data/day_close.h5")
In [8]:
day_close.head()
Out[8]:
000001.SZ | 000002.SZ | 000004.SZ | 000005.SZ | 000006.SZ | 000007.SZ | 000008.SZ | 000009.SZ | 000010.SZ | 000011.SZ | ... | 001965.SZ | 603283.SH | 002920.SZ | 002921.SZ | 300684.SZ | 002922.SZ | 300735.SZ | 603329.SH | 603655.SH | 603080.SH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16.30 | 17.71 | 4.58 | 2.88 | 14.60 | 2.62 | 4.96 | 4.66 | 5.37 | 6.02 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 17.02 | 19.20 | 4.65 | 3.02 | 15.97 | 2.65 | 4.95 | 4.70 | 5.37 | 6.27 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 17.02 | 17.28 | 4.56 | 3.06 | 14.37 | 2.63 | 4.82 | 4.47 | 5.37 | 5.96 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 16.18 | 16.97 | 4.49 | 2.95 | 13.10 | 2.73 | 4.89 | 4.33 | 5.37 | 5.77 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 16.95 | 17.19 | 4.55 | 2.99 | 13.18 | 2.77 | 4.97 | 4.42 | 5.37 | 5.92 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 3562 columns
In [9]:
day_close.to_hdf("./data/test.h5", key="day_close")
In [10]:
new_data = pd.read_hdf("./data/test.h5", key="day_close")
In [11]:
new_data.head()
Out[11]:
000001.SZ | 000002.SZ | 000004.SZ | 000005.SZ | 000006.SZ | 000007.SZ | 000008.SZ | 000009.SZ | 000010.SZ | 000011.SZ | ... | 001965.SZ | 603283.SH | 002920.SZ | 002921.SZ | 300684.SZ | 002922.SZ | 300735.SZ | 603329.SH | 603655.SH | 603080.SH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16.30 | 17.71 | 4.58 | 2.88 | 14.60 | 2.62 | 4.96 | 4.66 | 5.37 | 6.02 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 17.02 | 19.20 | 4.65 | 3.02 | 15.97 | 2.65 | 4.95 | 4.70 | 5.37 | 6.27 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 17.02 | 17.28 | 4.56 | 3.06 | 14.37 | 2.63 | 4.82 | 4.47 | 5.37 | 5.96 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 16.18 | 16.97 | 4.49 | 2.95 | 13.10 | 2.73 | 4.89 | 4.33 | 5.37 | 5.77 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 16.95 | 17.19 | 4.55 | 2.99 | 13.18 | 2.77 | 4.97 | 4.42 | 5.37 | 5.92 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 3562 columns
json
In [12]:
data = pd.read_json("./data/Sarcasm_Headlines_Dataset.json", orient="records", lines=True)
In [13]:
data.head()
Out[13]:
article_link | headline | is_sarcastic | |
---|---|---|---|
0 | https://www.huffingtonpost.com/entry/versace-b... | former versace store clerk sues over secret 'b... | 0 |
1 | https://www.huffingtonpost.com/entry/roseanne-... | the 'roseanne' revival catches up to our thorn... | 0 |
2 | https://local.theonion.com/mom-starting-to-fea... | mom starting to fear son's web series closest ... | 1 |
3 | https://politics.theonion.com/boehner-just-wan... | boehner just wants wife to listen, not come up... | 1 |
4 | https://www.huffingtonpost.com/entry/jk-rowlin... | j.k. rowling wishes snape happy birthday in th... | 0 |
In [14]:
data.to_json("./data/test.json", orient="records")
In [15]:
data.to_json("./data/test.json", orient="records", lines=True)
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
缺失值处理
In [2]:
movie = pd.read_csv("./data/IMDB-Movie-Data.csv")
In [3]:
movie.head()
Out[3]:
Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
1 | 2 | Prometheus | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
2 | 3 | Split | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 |
3 | 4 | Sing | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea... | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0 |
4 | 5 | Suicide Squad | Action,Adventure,Fantasy | A secret government agency recruits some of th... | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D... | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0 |
缺失值是nan
In [4]:
np.all(pd.notnull(movie)) # 里面如果有一个缺失值,那么会返回False,说明有缺失值
Out[4]:
False
In [5]:
np.any(pd.isnull(movie)) # 里面如果有一个缺失值,那么会返回True,说明有缺失值
Out[5]:
True
In [6]:
data = movie.dropna()
In [7]:
np.all(pd.notnull(data)) # 里面如果有一个缺失值,那么会返回False,说明有缺失值
Out[7]:
True
In [8]:
movie["Revenue (Millions)"].mean()
Out[8]:
82.95637614678898
In [9]:
movie["Revenue (Millions)"].fillna(movie["Revenue (Millions)"].mean(), inplace=True)
In [10]:
movie.head()
Out[10]:
Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
1 | 2 | Prometheus | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
2 | 3 | Split | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 |
3 | 4 | Sing | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea... | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0 |
4 | 5 | Suicide Squad | Action,Adventure,Fantasy | A secret government agency recruits some of th... | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D... | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0 |
In [11]:
for i in movie.columns:
if np.any(pd.isnull(movie[i])) == True:
print(i)
movie[i].fillna(movie[i].mean(), inplace=True)
Metascore
In [12]:
np.any(pd.isnull(movie)) # 里面如果有一个缺失值,那么会返回True,说明有缺失值
Out[12]:
False
In [13]:
wis = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data")
In [14]:
wis.head()
Out[14]:
1000025 | 5 | 1 | 1.1 | 1.2 | 2 | 1.3 | 3 | 1.4 | 1.5 | 2.1 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
1 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
2 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
3 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
4 | 1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | 4 |
缺失值是其他符号
In [15]:
wis = wis.replace(to_replace="?", value=np.nan)
In [16]:
wis.head()
Out[16]:
1000025 | 5 | 1 | 1.1 | 1.2 | 2 | 1.3 | 3 | 1.4 | 1.5 | 2.1 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
1 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
2 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
3 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
4 | 1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | 4 |
In [17]:
wis = wis.dropna()
In [18]:
np.any(pd.isnull(wis)) # 里面如果有一个缺失值,那么会返回True,说明有缺失值
Out[18]:
False
数据离散化
In [19]:
data = pd.read_csv("./data/stock_day.csv")
In [20]:
data.head()
Out[20]:
open | high | close | low | volume | price_change | p_change | ma5 | ma10 | ma20 | v_ma5 | v_ma10 | v_ma20 | turnover | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2018-02-27 | 23.53 | 25.88 | 24.16 | 23.53 | 95578.03 | 0.63 | 2.68 | 22.942 | 22.142 | 22.875 | 53782.64 | 46738.65 | 55576.11 | 2.39 |
2018-02-26 | 22.80 | 23.78 | 23.53 | 22.80 | 60985.11 | 0.69 | 3.02 | 22.406 | 21.955 | 22.942 | 40827.52 | 42736.34 | 56007.50 | 1.53 |
2018-02-23 | 22.88 | 23.37 | 22.82 | 22.71 | 52914.01 | 0.54 | 2.42 | 21.938 | 21.929 | 23.022 | 35119.58 | 41871.97 | 56372.85 | 1.32 |
2018-02-22 | 22.25 | 22.76 | 22.28 | 22.02 | 36105.01 | 0.36 | 1.64 | 21.446 | 21.909 | 23.137 | 35397.58 | 39904.78 | 60149.60 | 0.90 |
2018-02-14 | 21.49 | 21.99 | 21.92 | 21.48 | 23331.04 | 0.44 | 2.05 | 21.366 | 21.923 | 23.253 | 33590.21 | 42935.74 | 61716.11 | 0.58 |
In [21]:
p_change = data["p_change"]
In [22]:
p_change.head()
Out[22]:
2018-02-27 2.68
2018-02-26 3.02
2018-02-23 2.42
2018-02-22 1.64
2018-02-14 2.05
Name: p_change, dtype: float64
In [23]:
# 自动分成差不多数量的类别
qcut = pd.qcut(p_change, 10)
qcut.value_counts()
Out[23]:
(5.27, 10.03] 65
(0.26, 0.94] 65
(-0.462, 0.26] 65
(-10.030999999999999, -4.836] 65
(2.938, 5.27] 64
(1.738, 2.938] 64
(-1.352, -0.462] 64
(-2.444, -1.352] 64
(-4.836, -2.444] 64
(0.94, 1.738] 63
Name: p_change, dtype: int64
In [24]:
# 指定分组区间
bins = [-100, -7, -5, -3, 0, 3, 5, 7, 100]
p_count = pd.cut(p_change, bins)
In [25]:
p_count.value_counts()
Out[25]:
(0, 3] 215
(-3, 0] 188
(3, 5] 57
(-5, -3] 51
(7, 100] 35
(5, 7] 35
(-100, -7] 34
(-7, -5] 28
Name: p_change, dtype: int64
In [26]:
dummies = pd.get_dummies(p_count, prefix="rise")
dummies.head()
Out[26]:
rise_(-100, -7] | rise_(-7, -5] | rise_(-5, -3] | rise_(-3, 0] | rise_(0, 3] | rise_(3, 5] | rise_(5, 7] | rise_(7, 100] | |
---|---|---|---|---|---|---|---|---|
2018-02-27 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2018-02-26 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2018-02-23 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2018-02-22 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2018-02-14 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
合并
In [27]:
pd.concat([data, dummies],axis=1)
Out[27]:
open | high | close | low | volume | price_change | p_change | ma5 | ma10 | ma20 | ... | v_ma20 | turnover | rise_(-100, -7] | rise_(-7, -5] | rise_(-5, -3] | rise_(-3, 0] | rise_(0, 3] | rise_(3, 5] | rise_(5, 7] | rise_(7, 100] | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2018-02-27 | 23.53 | 25.88 | 24.16 | 23.53 | 95578.03 | 0.63 | 2.68 | 22.942 | 22.142 | 22.875 | ... | 55576.11 | 2.39 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2018-02-26 | 22.80 | 23.78 | 23.53 | 22.80 | 60985.11 | 0.69 | 3.02 | 22.406 | 21.955 | 22.942 | ... | 56007.50 | 1.53 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2018-02-23 | 22.88 | 23.37 | 22.82 | 22.71 | 52914.01 | 0.54 | 2.42 | 21.938 | 21.929 | 23.022 | ... | 56372.85 | 1.32 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2018-02-22 | 22.25 | 22.76 | 22.28 | 22.02 | 36105.01 | 0.36 | 1.64 | 21.446 | 21.909 | 23.137 | ... | 60149.60 | 0.90 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2018-02-14 | 21.49 | 21.99 | 21.92 | 21.48 | 23331.04 | 0.44 | 2.05 | 21.366 | 21.923 | 23.253 | ... | 61716.11 | 0.58 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2018-02-13 | 21.40 | 21.90 | 21.48 | 21.31 | 30802.45 | 0.28 | 1.32 | 21.342 | 22.103 | 23.387 | ... | 65161.68 | 0.77 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2018-02-12 | 20.70 | 21.40 | 21.19 | 20.63 | 32445.39 | 0.82 | 4.03 | 21.504 | 22.338 | 23.533 | ... | 68686.33 | 0.81 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2018-02-09 | 21.20 | 21.46 | 20.36 | 20.19 | 54304.01 | -1.50 | -6.86 | 21.920 | 22.596 | 23.645 | ... | 70552.47 | 1.36 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2018-02-08 | 21.79 | 22.09 | 21.88 | 21.75 | 27068.16 | 0.09 | 0.41 | 22.372 | 23.009 | 23.839 | ... | 73852.45 | 0.68 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2018-02-07 | 22.69 | 23.11 | 21.80 | 21.29 | 53853.25 | -0.50 | -2.24 | 22.480 | 23.258 | 23.929 | ... | 74925.33 | 1.35 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2018-02-06 | 22.80 | 23.55 | 22.29 | 22.20 | 55555.00 | -0.97 | -4.17 | 22.864 | 23.607 | 24.029 | ... | 75738.95 | 1.39 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
2018-02-05 | 22.45 | 23.39 | 23.27 | 22.25 | 52341.39 | 0.65 | 2.87 | 23.172 | 23.928 | 24.112 | ... | 77070.00 | 1.31 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2018-02-02 | 22.40 | 22.70 | 22.62 | 21.53 | 33242.11 | 0.20 | 0.89 | 23.272 | 24.114 | 24.184 | ... | 79929.71 | 0.83 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2018-02-01 | 23.71 | 23.86 | 22.42 | 22.22 | 66414.64 | -1.30 | -5.48 | 23.646 | 24.365 | 24.279 | ... | 88480.92 | 1.66 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2018-01-31 | 23.85 | 23.98 | 23.72 | 23.31 | 49155.02 | -0.11 | -0.46 | 24.036 | 24.583 | 24.411 | ... | 91666.75 | 1.23 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2018-01-30 | 23.71 | 24.08 | 23.83 | 23.70 | 32420.43 | 0.05 | 0.21 | 24.350 | 24.671 | 24.365 | ... | 92943.35 | 0.81 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2018-01-29 | 24.40 | 24.63 | 23.77 | 23.72 | 65469.81 | -0.73 | -2.98 | 24.684 | 24.728 | 24.294 | ... | 93456.22 | 1.64 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2018-01-26 | 24.27 | 24.74 | 24.49 | 24.22 | 50601.83 | 0.11 | 0.45 | 24.956 | 24.694 | 24.221 | ... | 91980.51 | 1.27 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2018-01-25 | 24.99 | 24.99 | 24.37 | 24.23 | 104097.59 | -0.93 | -3.68 | 25.084 | 24.669 | 24.109 | ... | 92262.67 | 2.61 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
2018-01-24 | 25.49 | 26.28 | 25.29 | 25.20 | 134838.00 | -0.20 | -0.79 | 25.130 | 24.599 | 23.997 | ... | 89522.22 | 3.37 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2018-01-23 | 25.15 | 25.53 | 25.50 | 24.93 | 104205.76 | 0.39 | 1.55 | 24.992 | 24.450 | 23.844 | ... | 85876.80 | 2.61 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2018-01-22 | 25.14 | 25.40 | 25.13 | 24.75 | 68292.08 | -0.01 | -0.04 | 24.772 | 24.296 | 23.644 | ... | 84970.00 | 1.71 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2018-01-19 | 24.60 | 25.34 | 25.13 | 24.42 | 128449.11 | 0.53 | 2.15 | 24.432 | 24.254 | 23.537 | ... | 82975.10 | 3.21 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2018-01-18 | 24.40 | 24.88 | 24.60 | 24.30 | 67435.14 | 0.01 | 0.04 | 24.254 | 24.192 | 23.441 | ... | 78252.92 | 1.69 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2018-01-17 | 24.42 | 24.92 | 24.60 | 23.80 | 92242.51 | 0.20 | 0.82 | 24.068 | 24.239 | 23.378 | ... | 77049.61 | 2.31 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2018-01-16 | 23.40 | 24.60 | 24.40 | 23.30 | 101295.42 | 0.96 | 4.10 | 23.908 | 24.058 | 23.321 | ... | 74590.92 | 2.54 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2018-01-15 | 24.01 | 24.23 | 23.43 | 23.30 | 69768.17 | -0.80 | -3.30 | 23.820 | 23.860 | 23.257 | ... | 71006.65 | 1.75 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
2018-01-12 | 23.70 | 25.15 | 24.24 | 23.42 | 120303.53 | 0.56 | 2.37 | 24.076 | 23.748 | 23.236 | ... | 69690.35 | 3.01 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2018-01-11 | 23.67 | 23.85 | 23.67 | 23.21 | 48525.75 | -0.12 | -0.50 | 24.130 | 23.548 | 23.197 | ... | 65928.23 | 1.21 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2018-01-10 | 24.10 | 24.60 | 23.80 | 23.40 | 70125.79 | -0.14 | -0.58 | 24.410 | 23.394 | 23.204 | ... | 66934.89 | 1.76 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2015-04-13 | 19.60 | 21.30 | 21.13 | 19.50 | 171822.69 | 1.70 | 8.75 | 19.228 | 17.812 | 16.563 | ... | 111752.31 | 5.88 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
2015-04-10 | 19.55 | 19.89 | 19.43 | 19.20 | 112962.15 | -0.19 | -0.97 | 18.334 | 17.276 | 16.230 | ... | 106228.29 | 3.87 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2015-04-09 | 18.28 | 19.89 | 19.62 | 18.02 | 183119.05 | 1.20 | 6.51 | 17.736 | 16.826 | 15.964 | ... | 104829.10 | 6.27 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
2015-04-08 | 17.60 | 18.53 | 18.42 | 17.60 | 157725.97 | 0.88 | 5.02 | 17.070 | 16.394 | 15.698 | ... | 101658.57 | 5.40 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
2015-04-07 | 16.54 | 17.98 | 17.54 | 16.50 | 122471.85 | 0.88 | 5.28 | 16.620 | 16.120 | 15.510 | ... | 98832.94 | 4.19 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
2015-04-03 | 16.44 | 16.77 | 16.66 | 16.25 | 91962.88 | 0.22 | 1.34 | 16.396 | 15.904 | 15.348 | ... | 99956.63 | 3.15 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2015-04-02 | 16.21 | 16.50 | 16.44 | 16.21 | 66336.32 | 0.15 | 0.92 | 16.218 | 15.772 | 15.229 | ... | 104350.08 | 2.27 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2015-04-01 | 16.18 | 16.48 | 16.29 | 16.00 | 68609.42 | 0.12 | 0.74 | 15.916 | 15.666 | 15.065 | ... | 105692.28 | 2.35 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2015-03-31 | 16.78 | 16.88 | 16.17 | 16.07 | 84467.62 | -0.25 | -1.52 | 15.718 | 15.568 | 14.896 | ... | 105615.58 | 2.89 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2015-03-30 | 15.99 | 16.63 | 16.42 | 15.99 | 85090.45 | 0.65 | 4.12 | 15.620 | 15.469 | 14.722 | ... | 108345.78 | 2.91 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2015-03-27 | 14.90 | 15.86 | 15.77 | 14.90 | 120352.13 | 0.84 | 5.63 | 15.412 | 15.314 | 14.527 | ... | 108905.84 | 4.12 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
2015-03-26 | 15.14 | 15.35 | 14.93 | 14.91 | 84877.75 | -0.37 | -2.42 | 15.326 | 15.184 | 14.462 | ... | 108303.41 | 2.91 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2015-03-25 | 15.97 | 15.97 | 15.30 | 15.18 | 97174.40 | -0.38 | -2.42 | 15.416 | 15.102 | 14.436 | ... | 109604.83 | 3.33 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2015-03-24 | 15.38 | 16.16 | 15.68 | 15.28 | 153390.08 | 0.30 | 1.95 | 15.418 | 15.002 | 14.385 | ... | 110336.03 | 5.25 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2015-03-23 | 15.34 | 15.56 | 15.38 | 15.25 | 89461.32 | 0.04 | 0.26 | 15.318 | 14.899 | 14.304 | ... | 107645.16 | 3.06 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2015-03-20 | 15.38 | 15.48 | 15.34 | 15.18 | 76800.13 | -0.04 | -0.26 | 15.216 | 14.792 | 14.232 | ... | 108857.41 | 2.63 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2015-03-19 | 15.20 | 15.64 | 15.38 | 15.11 | 93644.19 | 0.07 | 0.46 | 15.042 | 14.686 | 14.153 | ... | 111147.22 | 3.21 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2015-03-18 | 15.18 | 15.66 | 15.31 | 15.02 | 121538.71 | 0.13 | 0.86 | 14.788 | 14.464 | 14.058 | ... | 112493.60 | 4.16 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2015-03-17 | 14.90 | 15.44 | 15.18 | 14.63 | 158770.77 | 0.31 | 2.08 | 14.586 | 14.223 | 13.954 | ... | 111739.85 | 5.43 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2015-03-16 | 14.52 | 15.05 | 14.87 | 14.51 | 94468.30 | 0.40 | 2.76 | 14.480 | 13.975 | 13.843 | ... | 107464.31 | 3.23 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2015-03-13 | 14.13 | 14.50 | 14.47 | 14.08 | 61342.22 | 0.36 | 2.55 | 14.368 | 13.740 | 13.740 | ... | 108763.91 | 2.10 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2015-03-12 | 14.11 | 14.80 | 14.11 | 13.95 | 84978.37 | -0.19 | -1.33 | 14.330 | 13.659 | 13.659 | ... | 114032.98 | 2.91 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2015-03-11 | 14.80 | 15.08 | 14.30 | 14.14 | 119708.43 | -0.35 | -2.39 | 14.140 | 13.603 | 13.603 | ... | 117664.81 | 4.10 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2015-03-10 | 14.20 | 14.80 | 14.65 | 14.01 | 101213.51 | 0.34 | 2.38 | 13.860 | 13.503 | 13.503 | ... | 117372.87 | 3.46 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2015-03-09 | 14.14 | 14.85 | 14.31 | 13.80 | 144945.66 | 0.03 | 0.21 | 13.470 | 13.312 | 13.312 | ... | 120066.09 | 4.96 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2015-03-06 | 13.17 | 14.48 | 14.28 | 13.13 | 179831.72 | 1.12 | 8.51 | 13.112 | 13.112 | 13.112 | ... | 115090.18 | 6.16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
2015-03-05 | 12.88 | 13.45 | 13.16 | 12.87 | 93180.39 | 0.26 | 2.02 | 12.820 | 12.820 | 12.820 | ... | 98904.79 | 3.19 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2015-03-04 | 12.80 | 12.92 | 12.90 | 12.61 | 67075.44 | 0.20 | 1.57 | 12.707 | 12.707 | 12.707 | ... | 100812.93 | 2.30 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2015-03-03 | 12.52 | 13.06 | 12.70 | 12.52 | 139071.61 | 0.18 | 1.44 | 12.610 | 12.610 | 12.610 | ... | 117681.67 | 4.76 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2015-03-02 | 12.25 | 12.67 | 12.52 | 12.20 | 96291.73 | 0.32 | 2.62 | 12.520 | 12.520 | 12.520 | ... | 96291.73 | 3.30 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
643 rows × 22 columns
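pd.concat simply stacks objects along an axis and aligns them on the index: axis=1 (used above) glues columns side by side, while axis=0 appends rows. A minimal sketch with toy frames:

import pandas as pd

a = pd.DataFrame({"x": [1, 2]}, index=["r1", "r2"])
b = pd.DataFrame({"y": [10, 20]}, index=["r1", "r2"])

pd.concat([a, b], axis=1)   # two rows, columns x and y side by side
pd.concat([a, a], axis=0)   # four rows, index labels r1/r2 repeated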
In [28]:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
In [29]:
left
Out[29]:
A | B | key1 | key2 | |
---|---|---|---|---|
0 | A0 | B0 | K0 | K0 |
1 | A1 | B1 | K0 | K1 |
2 | A2 | B2 | K1 | K0 |
3 | A3 | B3 | K2 | K1 |
In [30]:
right
Out[30]:
C | D | key1 | key2 | |
---|---|---|---|---|
0 | C0 | D0 | K0 | K0 |
1 | C1 | D1 | K1 | K0 |
2 | C2 | D2 | K1 | K0 |
3 | C3 | D3 | K2 | K0 |
In [31]:
pd.merge(left, right, on=["key1", "key2"])
Out[31]:
A | B | key1 | key2 | C | D | |
---|---|---|---|---|---|---|
0 | A0 | B0 | K0 | K0 | C0 | D0 |
1 | A2 | B2 | K1 | K0 | C1 | D1 |
2 | A2 | B2 | K1 | K0 | C2 | D2 |
In [32]:
pd.merge(left, right, on=["key1", "key2"], how="inner")
Out[32]:
A | B | key1 | key2 | C | D | |
---|---|---|---|---|---|---|
0 | A0 | B0 | K0 | K0 | C0 | D0 |
1 | A2 | B2 | K1 | K0 | C1 | D1 |
2 | A2 | B2 | K1 | K0 | C2 | D2 |
In [33]:
pd.merge(left, right, on=["key1", "key2"], how="left")
Out[33]:
A | B | key1 | key2 | C | D | |
---|---|---|---|---|---|---|
0 | A0 | B0 | K0 | K0 | C0 | D0 |
1 | A1 | B1 | K0 | K1 | NaN | NaN |
2 | A2 | B2 | K1 | K0 | C1 | D1 |
3 | A2 | B2 | K1 | K0 | C2 | D2 |
4 | A3 | B3 | K2 | K1 | NaN | NaN |
In [34]:
pd.merge(left, right, on=["key1", "key2"], how="right")
Out[34]:
A | B | key1 | key2 | C | D | |
---|---|---|---|---|---|---|
0 | A0 | B0 | K0 | K0 | C0 | D0 |
1 | A2 | B2 | K1 | K0 | C1 | D1 |
2 | A2 | B2 | K1 | K0 | C2 | D2 |
3 | NaN | NaN | K2 | K0 | C3 | D3 |
In [35]:
pd.merge(left, right, on=["key1", "key2"], how="outer")
Out[35]:
A | B | key1 | key2 | C | D | |
---|---|---|---|---|---|---|
0 | A0 | B0 | K0 | K0 | C0 | D0 |
1 | A1 | B1 | K0 | K1 | NaN | NaN |
2 | A2 | B2 | K1 | K0 | C1 | D1 |
3 | A2 | B2 | K1 | K0 | C2 | D2 |
4 | A3 | B3 | K2 | K1 | NaN | NaN |
5 | NaN | NaN | K2 | K0 | C3 | D3 |
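The four merges above differ only in which key combinations survive: inner keeps keys present in both tables, left/right keep every key from one side and fill the other side with NaN, and outer keeps the union. The same idea in a compact, self-contained sketch with a single key column (separate toy frames; the names left and right are reused only for readability):

import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": [1, 2, 3]})
right = pd.DataFrame({"key": ["K0", "K1", "K3"], "B": [10, 20, 30]})

pd.merge(left, right, on="key", how="inner")   # K0, K1
pd.merge(left, right, on="key", how="left")    # K0, K1, K2 (B is NaN for K2)
pd.merge(left, right, on="key", how="right")   # K0, K1, K3 (A is NaN for K3)
pd.merge(left, right, on="key", how="outer")   # K0, K1, K2, K3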
Crosstab and pivot table
In [36]:
data.head()
Out[36]:
open | high | close | low | volume | price_change | p_change | ma5 | ma10 | ma20 | v_ma5 | v_ma10 | v_ma20 | turnover | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2018-02-27 | 23.53 | 25.88 | 24.16 | 23.53 | 95578.03 | 0.63 | 2.68 | 22.942 | 22.142 | 22.875 | 53782.64 | 46738.65 | 55576.11 | 2.39 |
2018-02-26 | 22.80 | 23.78 | 23.53 | 22.80 | 60985.11 | 0.69 | 3.02 | 22.406 | 21.955 | 22.942 | 40827.52 | 42736.34 | 56007.50 | 1.53 |
2018-02-23 | 22.88 | 23.37 | 22.82 | 22.71 | 52914.01 | 0.54 | 2.42 | 21.938 | 21.929 | 23.022 | 35119.58 | 41871.97 | 56372.85 | 1.32 |
2018-02-22 | 22.25 | 22.76 | 22.28 | 22.02 | 36105.01 | 0.36 | 1.64 | 21.446 | 21.909 | 23.137 | 35397.58 | 39904.78 | 60149.60 | 0.90 |
2018-02-14 | 21.49 | 21.99 | 21.92 | 21.48 | 23331.04 | 0.44 | 2.05 | 21.366 | 21.923 | 23.253 | 33590.21 | 42935.74 | 61716.11 | 0.58 |
In [37]:
data.index
Out[37]:
Index(['2018-02-27', '2018-02-26', '2018-02-23', '2018-02-22', '2018-02-14',
'2018-02-13', '2018-02-12', '2018-02-09', '2018-02-08', '2018-02-07',
...
'2015-03-13', '2015-03-12', '2015-03-11', '2015-03-10', '2015-03-09',
'2015-03-06', '2015-03-05', '2015-03-04', '2015-03-03', '2015-03-02'],
dtype='object', length=643)
In [38]:
time = pd.to_datetime(data.index)
In [39]:
time.weekday
Out[39]:
Int64Index([1, 0, 4, 3, 2, 1, 0, 4, 3, 2,
...
4, 3, 2, 1, 0, 4, 3, 2, 1, 0],
dtype='int64', length=643)
In [40]:
time.day
Out[40]:
Int64Index([27, 26, 23, 22, 14, 13, 12, 9, 8, 7,
...
13, 12, 11, 10, 9, 6, 5, 4, 3, 2],
dtype='int64', length=643)
In [41]:
data["week"] = time.weekday
In [42]:
data.head()
Out[42]:
open | high | close | low | volume | price_change | p_change | ma5 | ma10 | ma20 | v_ma5 | v_ma10 | v_ma20 | turnover | week | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2018-02-27 | 23.53 | 25.88 | 24.16 | 23.53 | 95578.03 | 0.63 | 2.68 | 22.942 | 22.142 | 22.875 | 53782.64 | 46738.65 | 55576.11 | 2.39 | 1 |
2018-02-26 | 22.80 | 23.78 | 23.53 | 22.80 | 60985.11 | 0.69 | 3.02 | 22.406 | 21.955 | 22.942 | 40827.52 | 42736.34 | 56007.50 | 1.53 | 0 |
2018-02-23 | 22.88 | 23.37 | 22.82 | 22.71 | 52914.01 | 0.54 | 2.42 | 21.938 | 21.929 | 23.022 | 35119.58 | 41871.97 | 56372.85 | 1.32 | 4 |
2018-02-22 | 22.25 | 22.76 | 22.28 | 22.02 | 36105.01 | 0.36 | 1.64 | 21.446 | 21.909 | 23.137 | 35397.58 | 39904.78 | 60149.60 | 0.90 | 3 |
2018-02-14 | 21.49 | 21.99 | 21.92 | 21.48 | 23331.04 | 0.44 | 2.05 | 21.366 | 21.923 | 23.253 | 33590.21 | 42935.74 | 61716.11 | 0.58 | 2 |
In [43]:
data["p_n"] = np.where(data["p_change"] > 0, 1, 0)
In [44]:
data.head()
Out[44]:
open | high | close | low | volume | price_change | p_change | ma5 | ma10 | ma20 | v_ma5 | v_ma10 | v_ma20 | turnover | week | p_n | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2018-02-27 | 23.53 | 25.88 | 24.16 | 23.53 | 95578.03 | 0.63 | 2.68 | 22.942 | 22.142 | 22.875 | 53782.64 | 46738.65 | 55576.11 | 2.39 | 1 | 1 |
2018-02-26 | 22.80 | 23.78 | 23.53 | 22.80 | 60985.11 | 0.69 | 3.02 | 22.406 | 21.955 | 22.942 | 40827.52 | 42736.34 | 56007.50 | 1.53 | 0 | 1 |
2018-02-23 | 22.88 | 23.37 | 22.82 | 22.71 | 52914.01 | 0.54 | 2.42 | 21.938 | 21.929 | 23.022 | 35119.58 | 41871.97 | 56372.85 | 1.32 | 4 | 1 |
2018-02-22 | 22.25 | 22.76 | 22.28 | 22.02 | 36105.01 | 0.36 | 1.64 | 21.446 | 21.909 | 23.137 | 35397.58 | 39904.78 | 60149.60 | 0.90 | 3 | 1 |
2018-02-14 | 21.49 | 21.99 | 21.92 | 21.48 | 23331.04 | 0.44 | 2.05 | 21.366 | 21.923 | 23.253 | 33590.21 | 42935.74 | 61716.11 | 0.58 | 2 | 1 |
In [45]:
count = pd.crosstab(data["week"], data["p_n"])
count
Out[45]:
p_n | 0 | 1 |
---|---|---|
week | ||
0 | 63 | 62 |
1 | 55 | 76 |
2 | 61 | 71 |
3 | 63 | 65 |
4 | 59 | 68 |
In [46]:
# use a name that does not shadow the built-in sum()
total = count.sum(axis=1).astype(np.float32)
total
Out[46]:
week
0 125.0
1 131.0
2 132.0
3 128.0
4 127.0
dtype: float32
In [47]:
ret = count.div(total, axis=0)
In [48]:
ret
Out[48]:
p_n | 0 | 1 |
---|---|---|
week | ||
0 | 0.504000 | 0.496000 |
1 | 0.419847 | 0.580153 |
2 | 0.462121 | 0.537879 |
3 | 0.492188 | 0.507812 |
4 | 0.464567 | 0.535433 |
In [49]:
ret.plot(kind="bar", stacked=True)
plt.show()
In [50]:
data.pivot_table(["p_n"], index="week")
Out[50]:
p_n | |
---|---|
week | |
0 | 0.496000 |
1 | 0.580153 |
2 | 0.537879 |
3 | 0.507812 |
4 | 0.535433 |
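Since p_n only takes the values 0 and 1 and the default aggregation function of pivot_table is the mean, the table above is exactly the share of rising days per weekday. Assuming the same data with the week and p_n columns created earlier, a groupby gives the same numbers:

# the default aggfunc of pivot_table is "mean", so these two are equivalent here
data.pivot_table(["p_n"], index="week")
data.groupby("week")["p_n"].mean()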
Grouping and aggregation
In [51]:
col = pd.DataFrame({'color': ['white', 'red', 'green', 'red', 'green'],
                    'object': ['pen', 'pencil', 'pencil', 'ashtray', 'pen'],
                    'price1': [5.56, 4.20, 1.30, 0.56, 2.75],
                    'price2': [4.75, 4.12, 1.60, 0.75, 3.15]})
In [52]:
col
Out[52]:
color | object | price1 | price2 | |
---|---|---|---|---|
0 | white | pen | 5.56 | 4.75 |
1 | red | pencil | 4.20 | 4.12 |
2 | green | pencil | 1.30 | 1.60 |
3 | red | ashtray | 0.56 | 0.75 |
4 | green | pen | 2.75 | 3.15 |
In [53]:
col.groupby(["color"])["price1"].mean()
Out[53]:
color
green 2.025
red 2.380
white 5.560
Name: price1, dtype: float64
In [54]:
col["price1"].groupby(col["color"]).mean()
Out[54]:
color
green 2.025
red 2.380
white 5.560
Name: price1, dtype: float64
In [55]:
col.groupby(["color"], as_index=False)["price1"].mean()
Out[55]:
color | price1 | |
---|---|---|
0 | green | 2.025 |
1 | red | 2.380 |
2 | white | 5.560 |
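groupby is not limited to a single mean; agg can apply several aggregation functions at once. A self-contained sketch on the same small color/price table:

import pandas as pd

col = pd.DataFrame({'color': ['white', 'red', 'green', 'red', 'green'],
                    'price1': [5.56, 4.20, 1.30, 0.56, 2.75],
                    'price2': [4.75, 4.12, 1.60, 0.75, 3.15]})

# mean of price1 plus min/max of price2, computed per color
col.groupby("color").agg({"price1": "mean", "price2": ["min", "max"]})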
In [56]:
starbucks = pd.read_csv("./data/starbucks/directory.csv")
In [57]:
starbucks.head()
Out[57]:
Brand | Store Number | Store Name | Ownership Type | Street Address | City | State/Province | Country | Postcode | Phone Number | Timezone | Longitude | Latitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Starbucks | 47370-257954 | Meritxell, 96 | Licensed | Av. Meritxell, 96 | Andorra la Vella | 7 | AD | AD500 | 376818720 | GMT+1:00 Europe/Andorra | 1.53 | 42.51 |
1 | Starbucks | 22331-212325 | Ajman Drive Thru | Licensed | 1 Street 69, Al Jarf | Ajman | AJ | AE | NaN | NaN | GMT+04:00 Asia/Dubai | 55.47 | 25.42 |
2 | Starbucks | 47089-256771 | Dana Mall | Licensed | Sheikh Khalifa Bin Zayed St. | Ajman | AJ | AE | NaN | NaN | GMT+04:00 Asia/Dubai | 55.47 | 25.39 |
3 | Starbucks | 22126-218024 | Twofour 54 | Licensed | Al Salam Street | Abu Dhabi | AZ | AE | NaN | NaN | GMT+04:00 Asia/Dubai | 54.38 | 24.48 |
4 | Starbucks | 17127-178586 | Al Ain Tower | Licensed | Khaldiya Area, Abu Dhabi Island | Abu Dhabi | AZ | AE | NaN | NaN | GMT+04:00 Asia/Dubai | 54.54 | 24.51 |
In [58]:
count = starbucks.groupby(["Country"]).count()
In [59]:
count["Brand"].plot(kind="bar", figsize=(20, 8))
plt.show()
In [60]:
starbucks.groupby(["Country", "State/Province"]).count()
Out[60]:
Brand | Store Number | Store Name | Ownership Type | Street Address | City | Postcode | Phone Number | Timezone | Longitude | Latitude | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Country | State/Province | |||||||||||
AD | 7 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
AE | AJ | 2 | 2 | 2 | 2 | 2 | 2 | 0 | 0 | 2 | 2 | 2 |
AZ | 48 | 48 | 48 | 48 | 48 | 48 | 7 | 20 | 48 | 48 | 48 | |
DU | 82 | 82 | 82 | 82 | 82 | 82 | 16 | 50 | 82 | 82 | 82 | |
FU | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 0 | 2 | 2 | 2 | |
RK | 3 | 3 | 3 | 3 | 3 | 3 | 0 | 3 | 3 | 3 | 3 | |
SH | 6 | 6 | 6 | 6 | 6 | 6 | 0 | 5 | 6 | 6 | 6 | |
UQ | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | |
AR | B | 21 | 21 | 21 | 21 | 21 | 21 | 18 | 5 | 21 | 21 | 21 |
C | 73 | 73 | 73 | 73 | 73 | 73 | 71 | 24 | 73 | 73 | 73 | |
M | 5 | 5 | 5 | 5 | 5 | 5 | 2 | 0 | 5 | 5 | 5 | |
S | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 0 | 3 | 3 | 3 | |
X | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 0 | 6 | 6 | 6 | |
AT | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
5 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | |
9 | 14 | 14 | 14 | 14 | 14 | 14 | 14 | 13 | 14 | 14 | 14 | |
AU | NSW | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 0 | 9 | 9 | 9 |
QLD | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 0 | 8 | 8 | 8 | |
VIC | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 0 | 5 | 5 | 5 | |
AW | AW | 3 | 3 | 3 | 3 | 3 | 3 | 0 | 3 | 3 | 3 | 3 |
AZ | BA | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 3 | 3 | 3 | 3 |
SAB | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |
BE | BE | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 0 | 4 | 4 | 4 |
VAN | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | |
VBR | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0 | 2 | 2 | 2 | |
VLG | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 1 | 10 | 10 | 10 | |
WAL | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0 | 2 | 2 | 2 | |
BG | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 |
23 | 4 | 4 | 4 | 4 | 4 | 4 | 1 | 0 | 4 | 4 | 4 | |
BH | 13 | 16 | 16 | 16 | 16 | 16 | 16 | 2 | 10 | 16 | 16 | 16 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
US | MO | 188 | 188 | 188 | 188 | 188 | 188 | 188 | 175 | 188 | 188 | 188 |
MS | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 28 | 32 | 32 | 32 | |
MT | 36 | 36 | 36 | 36 | 36 | 36 | 36 | 36 | 36 | 36 | 36 | |
NC | 338 | 338 | 338 | 338 | 338 | 338 | 338 | 322 | 338 | 338 | 338 | |
ND | 13 | 13 | 13 | 13 | 13 | 13 | 13 | 13 | 13 | 13 | 13 | |
NE | 58 | 58 | 58 | 58 | 58 | 58 | 58 | 56 | 58 | 58 | 58 | |
NH | 29 | 29 | 29 | 29 | 29 | 29 | 29 | 27 | 29 | 29 | 29 | |
NJ | 261 | 261 | 261 | 261 | 261 | 261 | 261 | 250 | 261 | 261 | 261 | |
NM | 76 | 76 | 76 | 76 | 76 | 76 | 76 | 75 | 76 | 76 | 76 | |
NV | 253 | 253 | 253 | 253 | 253 | 253 | 253 | 230 | 253 | 253 | 253 | |
NY | 645 | 645 | 645 | 645 | 645 | 645 | 645 | 627 | 645 | 645 | 645 | |
OH | 378 | 378 | 378 | 378 | 378 | 378 | 377 | 357 | 378 | 378 | 378 | |
OK | 79 | 79 | 79 | 79 | 79 | 79 | 79 | 76 | 79 | 79 | 79 | |
OR | 359 | 359 | 359 | 359 | 359 | 359 | 359 | 343 | 359 | 359 | 359 | |
PA | 357 | 357 | 357 | 357 | 357 | 357 | 357 | 350 | 357 | 357 | 357 | |
RI | 27 | 27 | 27 | 27 | 27 | 27 | 27 | 27 | 27 | 27 | 27 | |
SC | 131 | 131 | 131 | 131 | 131 | 131 | 131 | 125 | 131 | 131 | 131 | |
SD | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | |
TN | 180 | 180 | 180 | 180 | 180 | 180 | 180 | 162 | 180 | 180 | 180 | |
TX | 1042 | 1042 | 1042 | 1042 | 1042 | 1042 | 1042 | 1002 | 1042 | 1042 | 1042 | |
UT | 101 | 101 | 101 | 101 | 101 | 101 | 101 | 99 | 101 | 101 | 101 | |
VA | 432 | 432 | 432 | 432 | 432 | 432 | 432 | 413 | 432 | 432 | 432 | |
VT | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | |
WA | 757 | 757 | 757 | 757 | 757 | 757 | 757 | 738 | 757 | 757 | 757 | |
WI | 145 | 145 | 145 | 145 | 145 | 145 | 145 | 144 | 145 | 145 | 145 | |
WV | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 23 | 25 | 25 | 25 | |
WY | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 22 | 23 | 23 | 23 | |
VN | HN | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
SG | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 17 | 19 | 19 | 19 | |
ZA | GT | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 3 | 3 | 3 |
545 rows × 11 columns
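The country-level bar chart above is crowded because every country is plotted; a natural follow-up is to sort the counts and keep only the largest markets. A sketch assuming the same ./data/starbucks/directory.csv file:

import pandas as pd
import matplotlib.pyplot as plt

starbucks = pd.read_csv("./data/starbucks/directory.csv")

# top 10 countries by number of stores
top10 = starbucks.groupby("Country")["Brand"].count().sort_values(ascending=False).head(10)
top10.plot(kind="bar", figsize=(20, 8))
plt.show()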
In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [3]:
movie = pd.read_csv("./data/IMDB-Movie-Data.csv")
In [4]:
movie.head()
Out[4]:
Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
1 | 2 | Prometheus | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
2 | 3 | Split | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 |
3 | 4 | Sing | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea... | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0 |
4 | 5 | Suicide Squad | Action,Adventure,Fantasy | A secret government agency recruits some of th... | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D... | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0 |
Given this movie dataset, how do we obtain information such as the average rating and the number of distinct directors?
In [8]:
movie["Rating"].mean()
Out[8]:
6.723199999999999
In [10]:
np.unique(movie["Director"]).shape[0]
Out[10]:
644
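np.unique works here, but pandas also has built-in shortcuts for these summary questions; for example (assuming the same movie DataFrame):

# number of distinct directors, pandas-native
movie["Director"].nunique()

# count, mean, std and quartiles of the rating in one call
movie["Rating"].describe()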
For this movie dataset, how should we present the data if we want to see the distributions of Rating and Runtime (Minutes)?
In [13]:
# distribution of Rating
movie["Rating"].plot(kind="hist")
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x123e2ccf8>
In [16]:
# distribution of Rating
# 1. create the figure
plt.figure(figsize=(20, 8), dpi=100)
# 2. draw the histogram
plt.hist(movie["Rating"].values, bins=20)
# 2.1 add x-axis ticks
max_ = movie["Rating"].max()
min_ = movie["Rating"].min()
t1 = np.linspace(min_, max_, num=21)
plt.xticks(t1)
# 2.2 add a grid
plt.grid()
# 3. show the figure
plt.show()
In [17]:
# distribution of Runtime (Minutes)
# 1. create the figure
plt.figure(figsize=(20, 8), dpi=100)
# 2. draw the histogram
plt.hist(movie["Runtime (Minutes)"].values, bins=20)
# 2.1 add x-axis ticks
max_ = movie["Runtime (Minutes)"].max()
min_ = movie["Runtime (Minutes)"].min()
t1 = np.linspace(min_, max_, num=21)
plt.xticks(t1)
# 2.2 add a grid
plt.grid()
# 3. show the figure
plt.show()
For this movie dataset, how should we process the data if we want to count how many movies fall into each genre?
In [22]:
# movie["Genre"]
temp_list = [i.split(",") for i in movie["Genre"]]
In [23]:
temp_list
Out[23]:
[['Action', 'Adventure', 'Sci-Fi'],
['Adventure', 'Mystery', 'Sci-Fi'],
['Horror', 'Thriller'],
['Animation', 'Comedy', 'Family'],
['Action', 'Adventure', 'Fantasy'],
['Action', 'Adventure', 'Fantasy'],
['Comedy', 'Drama', 'Music'],
['Comedy'],
['Action', 'Adventure', 'Biography'],
['Adventure', 'Drama', 'Romance'],
['Adventure', 'Family', 'Fantasy'],
['Biography', 'Drama', 'History'],
['Action', 'Adventure', 'Sci-Fi'],
['Animation', 'Adventure', 'Comedy'],
['Action', 'Comedy', 'Drama'],
['Animation', 'Adventure', 'Comedy'],
['Biography', 'Drama', 'History'],
['Action', 'Thriller'],
['Biography', 'Drama'],
['Drama', 'Mystery', 'Sci-Fi'],
['Adventure', 'Drama', 'Thriller'],
['Drama'],
['Crime', 'Drama', 'Horror'],
['Animation', 'Adventure', 'Comedy'],
['Action', 'Adventure', 'Sci-Fi'],
['Comedy'],
['Action', 'Adventure', 'Drama'],
['Horror', 'Thriller'],
['Comedy'],
['Action', 'Adventure', 'Drama'],
['Comedy'],
['Drama', 'Thriller'],
['Action', 'Adventure', 'Sci-Fi'],
['Action', 'Adventure', 'Comedy'],
['Action', 'Horror', 'Sci-Fi'],
['Action', 'Adventure', 'Sci-Fi'],
['Adventure', 'Drama', 'Sci-Fi'],
['Action', 'Adventure', 'Fantasy'],
['Action', 'Adventure', 'Western'],
['Comedy', 'Drama'],
['Animation', 'Adventure', 'Comedy'],
['Drama'],
['Horror'],
['Biography', 'Drama', 'History'],
['Drama'],
['Action', 'Adventure', 'Fantasy'],
['Drama', 'Thriller'],
['Adventure', 'Drama', 'Fantasy'],
['Action', 'Adventure', 'Sci-Fi'],
['Drama'],
['Action', 'Adventure', 'Fantasy'],
['Action', 'Adventure', 'Fantasy'],
['Comedy', 'Drama'],
['Action', 'Crime', 'Thriller'],
['Action', 'Crime', 'Drama'],
['Adventure', 'Drama', 'History'],
['Crime', 'Horror', 'Thriller'],
['Drama', 'Romance'],
['Comedy', 'Drama', 'Romance'],
['Biography', 'Drama'],
['Action', 'Adventure', 'Sci-Fi'],
['Horror', 'Mystery', 'Thriller'],
['Crime', 'Drama', 'Mystery'],
['Drama', 'Romance', 'Thriller'],
['Drama', 'Mystery', 'Sci-Fi'],
['Action', 'Adventure', 'Comedy'],
['Drama', 'History', 'Thriller'],
['Action', 'Adventure', 'Sci-Fi'],
['Drama'],
['Action', 'Drama', 'Thriller'],
['Drama', 'History'],
['Action', 'Drama', 'Romance'],
['Drama', 'Fantasy'],
['Drama', 'Romance'],
['Animation', 'Adventure', 'Comedy'],
['Action', 'Adventure', 'Fantasy'],
['Action', 'Sci-Fi'],
['Adventure', 'Drama', 'War'],
['Action', 'Adventure', 'Fantasy'],
['Action', 'Comedy', 'Fantasy'],
['Action', 'Adventure', 'Sci-Fi'],
['Comedy', 'Drama'],
['Biography', 'Comedy', 'Crime'],
['Crime', 'Drama', 'Mystery'],
['Action', 'Crime', 'Thriller'],
['Action', 'Adventure', 'Sci-Fi'],
['Crime', 'Drama'],
['Action', 'Adventure', 'Fantasy'],
['Crime', 'Drama', 'Mystery'],
['Action', 'Crime', 'Drama'],
['Crime', 'Drama', 'Mystery'],
['Action', 'Adventure', 'Fantasy'],
['Drama'],
['Comedy', 'Crime', 'Drama'],
['Action', 'Adventure', 'Sci-Fi'],
['Action', 'Comedy', 'Crime'],
['Animation', 'Drama', 'Fantasy'],
['Horror', 'Mystery', 'Sci-Fi'],
['Drama', 'Mystery', 'Thriller'],
['Crime', 'Drama', 'Thriller'],
['Biography', 'Crime', 'Drama'],
['Action', 'Adventure', 'Fantasy'],
['Adventure', 'Drama', 'Sci-Fi'],
['Crime', 'Mystery', 'Thriller'],
['Action', 'Adventure', 'Comedy'],
['Crime', 'Drama', 'Thriller'],
['Comedy'],
['Action', 'Adventure', 'Drama'],
['Drama'],
['Drama', 'Mystery', 'Sci-Fi'],
['Action', 'Horror', 'Thriller'],
['Biography', 'Drama', 'History'],
['Romance', 'Sci-Fi'],
['Action', 'Fantasy', 'War'],
['Adventure', 'Drama', 'Fantasy'],
['Comedy'],
['Horror', 'Thriller'],
['Action', 'Biography', 'Drama'],
['Drama', 'Horror', 'Mystery'],
['Animation', 'Adventure', 'Comedy'],
['Adventure', 'Drama', 'Family'],
['Adventure', 'Mystery', 'Sci-Fi'],
['Adventure', 'Comedy', 'Romance'],
['Action'],
['Action', 'Thriller'],
['Adventure', 'Drama', 'Family'],
['Action', 'Adventure', 'Sci-Fi'],
['Adventure', 'Crime', 'Mystery'],
['Comedy', 'Family', 'Musical'],
['Adventure', 'Drama', 'Thriller'],
['Drama'],
['Adventure', 'Comedy', 'Drama'],
['Drama', 'Horror', 'Thriller'],
['Drama', 'Music'],
['Action', 'Crime', 'Thriller'],
['Crime', 'Drama', 'Thriller'],
['Crime', 'Drama', 'Thriller'],
['Drama', 'Romance'],
['Mystery', 'Thriller'],
['Mystery', 'Thriller', 'Western'],
['Action', 'Adventure', 'Sci-Fi'],
['Comedy', 'Family'],
['Biography', 'Comedy', 'Drama'],
['Drama'],
['Drama', 'Western'],
['Drama', 'Mystery', 'Romance'],
['Comedy', 'Drama'],
['Action', 'Drama', 'Mystery'],
['Comedy'],
['Action', 'Adventure', 'Crime'],
['Adventure', 'Family', 'Fantasy'],
['Adventure', 'Sci-Fi', 'Thriller'],
['Drama'],
['Action', 'Crime', 'Drama'],
['Drama', 'Horror', 'Mystery'],
['Action', 'Horror', 'Sci-Fi'],
['Action', 'Adventure', 'Sci-Fi'],
['Comedy', 'Drama', 'Romance'],
['Action', 'Comedy', 'Fantasy'],
['Action', 'Comedy', 'Mystery'],
['Thriller', 'War'],
['Action', 'Comedy', 'Crime'],
['Action', 'Adventure', 'Sci-Fi'],
['Action', 'Adventure', 'Crime'],
['Action', 'Adventure', 'Thriller'],
['Drama', 'Fantasy', 'Romance'],
['Action', 'Adventure', 'Comedy'],
['Biography', 'Drama', 'History'],
['Action', 'Drama', 'History'],
['Action', 'Adventure', 'Thriller'],
['Crime', 'Drama', 'Thriller'],
['Animation', 'Adventure', 'Family'],
['Adventure', 'Horror'],
['Drama', 'Romance', 'Sci-Fi'],
['Animation', 'Adventure', 'Comedy'],
['Action', 'Adventure', 'Family'],
['Action', 'Adventure', 'Drama'],
['Action', 'Comedy'],
['Horror', 'Mystery', 'Thriller'],
['Action', 'Adventure', 'Comedy'],
['Comedy', 'Romance'],
['Horror', 'Mystery'],
['Drama', 'Family', 'Fantasy'],
['Sci-Fi'],
['Drama', 'Thriller'],
['Drama', 'Romance'],
['Drama', 'War'],
['Drama', 'Fantasy', 'Horror'],
['Crime', 'Drama'],
['Comedy', 'Drama', 'Romance'],
['Drama', 'Romance'],
['Drama'],
['Crime', 'Drama', 'History'],
['Horror', 'Sci-Fi', 'Thriller'],
['Action', 'Drama', 'Sport'],
['Action', 'Adventure', 'Sci-Fi'],
['Crime', 'Drama', 'Thriller'],
['Adventure', 'Biography', 'Drama'],
['Biography', 'Drama', 'Thriller'],
['Action', 'Comedy', 'Crime'],
['Action', 'Adventure', 'Sci-Fi'],
['Drama', 'Fantasy', 'Horror'],
['Biography', 'Drama', 'Thriller'],
['Action', 'Adventure', 'Sci-Fi'],
['Action', 'Adventure', 'Mystery'],
['Action', 'Adventure', 'Sci-Fi'],
['Drama', 'Horror'],
['Comedy', 'Drama', 'Romance'],
['Comedy', 'Romance'],
['Drama', 'Horror', 'Thriller'],
['Action', 'Adventure', 'Drama'],
['Drama'],
['Action', 'Adventure', 'Sci-Fi'],
['Action', 'Drama', 'Mystery'],
['Action', 'Adventure', 'Fantasy'],
['Action', 'Adventure', 'Fantasy'],
['Action', 'Adventure', 'Sci-Fi'],
['Action', 'Adventure', 'Comedy'],
['Drama', 'Horror'],
['Action', 'Comedy'],
['Action', 'Adventure', 'Sci-Fi'],
['Animation', 'Adventure', 'Comedy'],
['Horror', 'Mystery'],
['Crime', 'Drama', 'Mystery'],
['Comedy', 'Crime'],
['Drama'],
['Comedy', 'Drama', 'Romance'],
['Action', 'Adventure', 'Sci-Fi'],
['Action', 'Adventure', 'Family'],
['Horror', 'Sci-Fi', 'Thriller'],
['Drama', 'Fantasy', 'War'],
['Crime', 'Drama', 'Thriller'],
['Action', 'Adventure', 'Drama'],
['Action', 'Adventure', 'Thriller'],
['Action', 'Adventure', 'Drama'],
['Drama', 'Romance'],
['Biography', 'Drama', 'History'],
['Drama', 'Horror', 'Thriller'],
['Adventure', 'Comedy', 'Drama'],
['Action', 'Adventure', 'Romance'],
['Action', 'Drama', 'War'],
['Animation', 'Adventure', 'Comedy'],
['Animation', 'Adventure', 'Comedy'],
['Action', 'Adventure', 'Sci-Fi'],
['Adventure', 'Family', 'Fantasy'],
['Drama', 'Musical', 'Romance'],
['Drama', 'Sci-Fi', 'Thriller'],
['Comedy', 'Drama'],
['Action', 'Comedy', 'Crime'],
['Biography', 'Comedy', 'Drama'],
['Comedy', 'Drama', 'Romance'],
['Drama', 'Thriller'],
['Biography', 'Drama', 'History'],
['Action', 'Adventure', 'Sci-Fi'],
['Horror', 'Mystery', 'Thriller'],
['Comedy'],
['Action', 'Adventure', 'Sci-Fi'],
['Action', 'Drama', 'Sci-Fi'],
['Horror'],
['Drama', 'Thriller'],
['Comedy', 'Drama', 'Romance'],
['Drama', 'Thriller'],
['Comedy', 'Drama'],
['Drama'],
['Action', 'Adventure', 'Comedy'],
['Drama', 'Horror', 'Thriller'],
['Comedy'],
['Drama', 'Sci-Fi'],
['Action', 'Adventure', 'Sci-Fi'],
['Horror'],
['Action', 'Adventure', 'Thriller'],
['Adventure', 'Fantasy'],
['Action', 'Comedy', 'Crime'],
['Comedy', 'Drama', 'Music'],
['Animation', 'Adventure', 'Comedy'],
['Action', 'Adventure', 'Mystery'],
['Action', 'Comedy', 'Crime'],
['Crime', 'Drama', 'History'],
['Comedy'],
['Action', 'Adventure', 'Sci-Fi'],
['Crime', 'Mystery', 'Thriller'],
['Action', 'Adventure', 'Crime'],
['Thriller'],
['Biography', 'Drama', 'Romance'],
['Action', 'Adventure'],
['Action', 'Fantasy'],
['Action', 'Comedy'],
['Action', 'Adventure', 'Sci-Fi'],
['Action', 'Comedy', 'Crime'],
['Thriller'],
['Action', 'Drama', 'Horror'],
['Comedy', 'Music', 'Romance'],
['Comedy'],
['Drama'],
['Action', 'Adventure', 'Fantasy'],
['Drama', 'Romance'],
['Animation', 'Adventure', 'Comedy'],
['Comedy', 'Drama'],
['Biography', 'Crime', 'Drama'],
['Drama', 'History'],
['Action', 'Crime', 'Thriller'],
['Action', 'Biography', 'Drama'],
['Horror'],
['Comedy', 'Romance'],
['Comedy', 'Romance'],
['Comedy', 'Crime', 'Drama'],
['Adventure', 'Family', 'Fantasy'],
['Crime', 'Drama', 'Thriller'],
['Action', 'Crime', 'Thriller'],
['Comedy', 'Romance'],
['Biography', 'Drama', 'Sport'],
['Drama', 'Romance'],
['Drama', 'Horror'],
['Adventure', 'Fantasy'],
['Adventure', 'Family', 'Fantasy'],
['Action', 'Drama', 'Sci-Fi'],
['Action', 'Adventure', 'Sci-Fi'],
['Action', 'Horror'],
['Comedy', 'Horror', 'Thriller'],
['Action', 'Crime', 'Thriller'],
['Crime', 'Drama', 'Music'],
['Drama'],
['Action', 'Crime', 'Thriller'],
['Action', 'Sci-Fi', 'Thriller'],
['Biography', 'Drama'],
['Action', 'Adventure', 'Fantasy'],
['Drama', 'Horror', 'Sci-Fi'],
['Biography', 'Comedy', 'Drama'],
['Crime', 'Horror', 'Thriller'],
['Crime', 'Drama', 'Mystery'],
['Animation', 'Adventure', 'Comedy'],
['Action', 'Biography', 'Drama'],
['Biography', 'Drama'],
['Biography', 'Drama', 'History'],
['Action', 'Biography', 'Drama'],
['Drama', 'Fantasy', 'Horror'],
['Comedy', 'Drama', 'Romance'],
['Drama', 'Sport'],
['Drama', 'Romance'],
['Comedy', 'Romance'],
['Action', 'Crime', 'Thriller'],
['Action', 'Crime', 'Drama'],
['Action', 'Drama', 'Thriller'],
['Adventure', 'Family', 'Fantasy'],
['Action', 'Adventure'],
['Action', 'Adventure', 'Romance'],
['Adventure', 'Family', 'Fantasy'],
['Crime', 'Drama'],
['Comedy', 'Horror'],
['Comedy', 'Fantasy', 'Romance'],
['Drama'],
['Drama'],
['Comedy', 'Drama'],
['Comedy', 'Drama', 'Romance'],
['Adventure', 'Sci-Fi', 'Thriller'],
['Action', 'Adventure', 'Fantasy'],
['Comedy', 'Drama'],
['Biography', 'Drama', 'Romance'],
['Comedy', 'Fantasy'],
['Comedy', 'Drama', 'Fantasy'],
['Comedy'],
['Horror', 'Thriller'],
['Action', 'Adventure', 'Sci-Fi'],
['Adventure', 'Comedy', 'Horror'],
['Comedy', 'Mystery'],
['Drama'],
['Adventure', 'Drama', 'Fantasy'],
['Drama', 'Sport'],
['Action', 'Adventure'],
['Action', 'Adventure', 'Drama'],
['Action', 'Drama', 'Sci-Fi'],
['Action', 'Mystery', 'Sci-Fi'],
['Action', 'Crime', 'Drama'],
['Action', 'Crime', 'Fantasy'],
['Biography', 'Comedy', 'Drama'],
['Action', 'Crime', 'Thriller'],
['Biography', 'Crime', 'Drama'],
['Drama', 'Sport'],
['Adventure', 'Comedy', 'Drama'],
['Action', 'Adventure', 'Thriller'],
['Comedy', 'Fantasy', 'Horror'],
['Drama', 'Sport'],
['Horror', 'Thriller'],
['Drama', 'History', 'Thriller'],
['Animation', 'Action', 'Adventure'],
['Action', 'Adventure', 'Drama'],
['Action', 'Comedy', 'Family'],
['Action', 'Adventure', 'Drama'],
['Action', 'Adventure', 'Sci-Fi'],
['Action', 'Adventure', 'Sci-Fi'],
['Action', 'Comedy'],
['Action', 'Crime', 'Drama'],
['Biography', 'Drama'],
['Comedy', 'Romance'],
['Comedy'],
['Drama', 'Fantasy', 'Romance'],
['Action', 'Adventure', 'Sci-Fi'],
['Comedy'],
['Comedy', 'Sci-Fi'],
['Comedy', 'Drama'],
['Animation', 'Action', 'Adventure'],
['Horror'],
['Action', 'Biography', 'Crime'],
['Animation', 'Adventure', 'Comedy'],
['Drama', 'Romance'],
['Drama', 'Mystery', 'Thriller'],
['Drama', 'History', 'Thriller'],
['Animation', 'Adventure', 'Comedy'],
['Action', 'Adventure', 'Sci-Fi'],
['Adventure', 'Comedy'],
['Action', 'Thriller'],
['Comedy', 'Music'],
['Animation', 'Adventure', 'Comedy'],
['Crime', 'Drama', 'Thriller'],
['Action', 'Adventure', 'Crime'],
['Comedy', 'Drama', 'Horror'],
['Drama'],
['Drama', 'Mystery', 'Romance'],
['Adventure', 'Family', 'Fantasy'],
['Drama'],
['Action', 'Drama', 'Thriller'],
['Drama'],
['Action', 'Horror', 'Romance'],
['Action', 'Drama', 'Fantasy'],
['Action', 'Crime', 'Drama'],
['Drama', 'Fantasy', 'Romance'],
['Action', 'Crime', 'Thriller'],
['Action', 'Mystery', 'Thriller'],
['Horror', 'Mystery', 'Thriller'],
['Action', 'Horror', 'Sci-Fi'],
['Comedy', 'Drama'],
['Comedy'],
['Action', 'Adventure', 'Horror'],
['Action', 'Adventure', 'Thriller'],
['Action', 'Crime', 'Drama'],
['Comedy', 'Crime', 'Drama'],
['Drama', 'Romance'],
['Drama', 'Thriller'],
['Action', 'Comedy', 'Crime'],
['Comedy'],
['Adventure', 'Family', 'Fantasy'],
['Drama', 'Romance'],
['Animation', 'Family', 'Fantasy'],
['Drama', 'Romance'],
['Thriller'],
['Adventure', 'Horror', 'Mystery'],
['Action', 'Sci-Fi'],
['Adventure', 'Comedy', 'Drama'],
['Animation', 'Action', 'Adventure'],
['Drama', 'Horror'],
['Action', 'Adventure', 'Sci-Fi'],
['Comedy', 'Drama'],
['Action', 'Horror', 'Mystery'],
['Action', 'Thriller'],
['Action', 'Adventure', 'Sci-Fi'],
['Drama'],
['Comedy', 'Drama', 'Romance'],
['Comedy', 'Crime'],
['Comedy', 'Romance'],
['Drama', 'Romance'],
['Crime', 'Drama', 'Thriller'],
['Horror', 'Mystery', 'Thriller'],
['Biography', 'Drama'],
['Drama', 'Mystery', 'Sci-Fi'],
['Adventure', 'Comedy', 'Family'],
['Action', 'Adventure', 'Crime'],
['Action', 'Crime', 'Mystery'],
['Mystery', 'Thriller'],
['Action', 'Sci-Fi', 'Thriller'],
['Action', 'Comedy', 'Crime'],
['Biography', 'Crime', 'Drama'],
['Biography', 'Drama', 'History'],
['Action', 'Adventure', 'Sci-Fi'],
['Adventure', 'Family', 'Fantasy'],
['Biography', 'Drama', 'History'],
['Biography', 'Comedy', 'Drama'],
['Drama', 'Thriller'],
['Horror', 'Thriller'],
['Drama'],
['Drama', 'War'],
['Comedy', 'Drama', 'Romance'],
['Drama', 'Romance', 'Sci-Fi'],
['Action', 'Crime', 'Drama'],
['Comedy', 'Drama'],
['Animation', 'Action', 'Adventure'],
['Adventure', 'Comedy', 'Drama'],
['Comedy', 'Drama', 'Family'],
['Drama', 'Romance', 'Thriller'],
['Comedy', 'Crime', 'Drama'],
['Animation', 'Comedy', 'Family'],
['Drama', 'Horror', 'Sci-Fi'],
['Action', 'Adventure', 'Drama'],
['Action', 'Horror', 'Sci-Fi'],
['Action', 'Crime', 'Sport'],
['Drama', 'Horror', 'Sci-Fi'],
['Drama', 'Horror', 'Sci-Fi'],
['Action', 'Adventure', 'Comedy'],
['Mystery', 'Sci-Fi', 'Thriller'],
['Crime', 'Drama', 'Thriller'],
['Animation', 'Adventure', 'Comedy'],
['Action', 'Sci-Fi', 'Thriller'],
['Drama', 'Romance'],
['Crime', 'Drama', 'Thriller'],
['Comedy', 'Drama', 'Music'],
['Drama', 'Fantasy', 'Romance'],
['Crime', 'Drama', 'Thriller'],
['Crime', 'Drama', 'Thriller'],
['Comedy', 'Drama', 'Romance'],
['Comedy', 'Romance'],
['Drama', 'Sci-Fi', 'Thriller'],
['Drama', 'War'],
['Action', 'Crime', 'Drama'],
['Sci-Fi', 'Thriller'],
['Adventure', 'Drama', 'Horror'],
['Comedy', 'Drama', 'Music'],
['Comedy', 'Drama', 'Romance'],
['Action', 'Adventure', 'Drama'],
['Action', 'Crime', 'Drama'],
['Adventure', 'Fantasy'],
['Drama', 'Romance'],
['Biography', 'History', 'Thriller'],
['Crime', 'Drama', 'Thriller'],
['Action', 'Drama', 'History'],
['Biography', 'Comedy', 'Drama'],
['Crime', 'Drama', 'Thriller'],
['Action', 'Biography', 'Drama'],
['Action', 'Drama', 'Sci-Fi'],
['Adventure', 'Horror'],
['Action', 'Adventure', 'Sci-Fi'],
['Action', 'Adventure', 'Mystery'],
['Comedy', 'Drama', 'Romance'],
['Horror', 'Thriller'],
['Action', 'Sci-Fi', 'Thriller'],
['Action', 'Sci-Fi', 'Thriller'],
['Biography', 'Drama'],
['Action', 'Crime', 'Drama'],
['Action', 'Crime', 'Mystery'],
['Action', 'Adventure', 'Comedy'],
['Crime', 'Drama', 'Thriller'],
['Crime', 'Drama'],
['Mystery', 'Thriller'],
['Mystery', 'Sci-Fi', 'Thriller'],
['Action', 'Mystery', 'Sci-Fi'],
['Drama', 'Romance'],
['Drama', 'Thriller'],
['Drama', 'Mystery', 'Sci-Fi'],
['Comedy', 'Drama'],
['Adventure', 'Family', 'Fantasy'],
['Biography', 'Drama', 'Sport'],
['Drama'],
['Comedy', 'Drama', 'Romance'],
['Biography', 'Drama', 'Romance'],
['Action', 'Adventure', 'Sci-Fi'],
['Drama', 'Sci-Fi', 'Thriller'],
['Drama', 'Romance', 'Thriller'],
['Mystery', 'Thriller'],
['Mystery', 'Thriller'],
['Action', 'Drama', 'Fantasy'],
['Action', 'Adventure', 'Biography'],
['Adventure', 'Comedy', 'Sci-Fi'],
['Action', 'Adventure', 'Thriller'],
['Fantasy', 'Horror'],
['Horror', 'Mystery'],
['Animation', 'Adventure', 'Comedy'],
['Action', 'Adventure', 'Drama'],
['Adventure', 'Family', 'Fantasy'],
['Action', 'Adventure', 'Sci-Fi'],
['Comedy', 'Drama'],
['Comedy', 'Drama'],
['Crime', 'Drama', 'Thriller'],
['Comedy', 'Romance'],
['Animation', 'Comedy', 'Family'],
['Comedy', 'Drama'],
['Comedy', 'Drama'],
['Biography', 'Drama', 'Sport'],
['Action', 'Adventure', 'Fantasy'],
['Action', 'Drama', 'History'],
['Action', 'Adventure', 'Sci-Fi'],
['Action', 'Adventure', 'Mystery'],
['Crime', 'Drama', 'Mystery'],
['Action'],
['Action', 'Adventure', 'Family'],
['Comedy', 'Romance'],
['Comedy', 'Drama', 'Romance'],
['Biography', 'Drama', 'Sport'],
['Action', 'Fantasy', 'Thriller'],
['Biography', 'Drama', 'Sport'],
['Action', 'Drama', 'Fantasy'],
['Adventure', 'Sci-Fi', 'Thriller'],
['Animation', 'Adventure', 'Comedy'],
['Drama', 'Mystery', 'Thriller'],
['Drama', 'Romance'],
['Crime', 'Drama', 'Mystery'],
['Comedy', 'Romance', 'Sport'],
['Comedy', 'Family'],
['Drama', 'Horror', 'Mystery'],
['Action', 'Drama', 'Sport'],
['Action', 'Adventure', 'Comedy'],
['Drama', 'Mystery', 'Sci-Fi'],
['Animation', 'Action', 'Comedy'],
['Action', 'Crime', 'Drama'],
['Action', 'Crime', 'Drama'],
['Comedy', 'Drama', 'Romance'],
['Animation', 'Action', 'Adventure'],
['Crime', 'Drama'],
['Drama'],
['Drama'],
['Comedy', 'Crime'],
['Drama'],
['Action', 'Adventure', 'Fantasy'],
['Drama', 'Fantasy', 'Romance'],
['Comedy', 'Drama'],
['Drama', 'Fantasy', 'Thriller'],
['Biography', 'Crime', 'Drama'],
['Comedy', 'Drama', 'Romance'],
['Action', 'Crime', 'Drama'],
['Sci-Fi'],
['Action', 'Biography', 'Drama'],
['Action', 'Comedy', 'Romance'],
['Adventure', 'Comedy', 'Drama'],
['Comedy', 'Crime', 'Drama'],
['Action', 'Fantasy', 'Horror'],
['Drama', 'Horror'],
['Horror'],
['Action', 'Thriller'],
['Action', 'Adventure', 'Mystery'],
['Action', 'Adventure', 'Fantasy'],
['Comedy', 'Drama', 'Romance'],
['Crime', 'Drama', 'Mystery'],
['Adventure', 'Comedy', 'Family'],
['Comedy', 'Drama', 'Romance'],
['Comedy'],
['Comedy', 'Drama', 'Horror'],
['Drama', 'Horror', 'Thriller'],
['Animation', 'Adventure', 'Family'],
['Comedy', 'Romance'],
['Mystery', 'Romance', 'Sci-Fi'],
['Crime', 'Drama'],
['Drama', 'Horror', 'Mystery'],
['Comedy'],
['Biography', 'Drama'],
['Comedy', 'Drama', 'Thriller'],
['Comedy', 'Western'],
['Drama', 'History', 'War'],
['Drama', 'Horror', 'Sci-Fi'],
['Drama'],
['Comedy', 'Drama'],
['Fantasy', 'Horror', 'Thriller'],
['Drama', 'Romance'],
['Action', 'Comedy', 'Fantasy'],
['Drama', 'Horror', 'Musical'],
['Crime', 'Drama', 'Mystery'],
['Horror', 'Mystery', 'Thriller'],
['Comedy', 'Music'],
['Drama'],
['Biography', 'Crime', 'Drama'],
['Drama'],
['Action', 'Adventure', 'Comedy'],
['Crime', 'Drama', 'Mystery'],
['Drama'],
['Action', 'Comedy', 'Crime'],
['Comedy', 'Drama', 'Romance'],
['Crime', 'Drama', 'Mystery'],
['Action', 'Comedy', 'Crime'],
['Drama'],
['Drama', 'Romance'],
['Crime', 'Drama', 'Mystery'],
['Adventure', 'Comedy', 'Romance'],
['Comedy', 'Crime', 'Drama'],
['Adventure', 'Drama', 'Thriller'],
['Biography', 'Crime', 'Drama'],
['Crime', 'Drama', 'Thriller'],
['Drama', 'History', 'Thriller'],
['Action', 'Adventure', 'Sci-Fi'],
['Action', 'Comedy'],
['Horror'],
['Action', 'Crime', 'Mystery'],
['Comedy', 'Romance'],
['Comedy'],
['Action', 'Drama', 'Thriller'],
['Action', 'Adventure', 'Sci-Fi'],
['Drama', 'Mystery', 'Thriller'],
['Comedy', 'Drama', 'Romance'],
['Action', 'Fantasy', 'Horror'],
['Drama', 'Romance'],
['Biography', 'Drama'],
['Biography', 'Drama'],
['Action', 'Adventure', 'Sci-Fi'],
['Animation', 'Adventure', 'Comedy'],
['Drama', 'Mystery', 'Thriller'],
['Action', 'Horror', 'Sci-Fi'],
['Drama', 'Romance'],
['Biography', 'Drama'],
['Action', 'Adventure', 'Drama'],
['Adventure', 'Drama', 'Fantasy'],
['Drama', 'Family'],
['Comedy', 'Drama', 'Romance'],
['Drama', 'Romance', 'Sci-Fi'],
['Action', 'Adventure', 'Thriller'],
['Comedy', 'Romance'],
['Crime', 'Drama', 'Horror'],
['Comedy', 'Fantasy'],
['Action', 'Comedy', 'Crime'],
['Adventure', 'Drama', 'Romance'],
['Action', 'Crime', 'Drama'],
['Crime', 'Horror', 'Thriller'],
['Romance', 'Sci-Fi', 'Thriller'],
['Comedy', 'Drama', 'Romance'],
['Crime', 'Drama'],
['Crime', 'Drama', 'Mystery'],
['Action', 'Adventure', 'Sci-Fi'],
['Animation', 'Fantasy'],
['Animation', 'Adventure', 'Comedy'],
['Drama', 'Mystery', 'War'],
['Comedy', 'Romance'],
['Animation', 'Comedy', 'Family'],
['Comedy'],
['Horror', 'Mystery', 'Thriller'],
['Action', 'Adventure', 'Drama'],
['Comedy'],
['Drama'],
['Adventure', 'Biography', 'Drama'],
['Comedy'],
['Horror', 'Thriller'],
['Action', 'Drama', 'Family'],
['Comedy', 'Fantasy', 'Horror'],
['Comedy', 'Romance'],
['Drama', 'Mystery', 'Romance'],
['Action', 'Adventure', 'Comedy'],
['Thriller'],
['Comedy'],
['Adventure', 'Comedy', 'Sci-Fi'],
['Comedy', 'Drama', 'Fantasy'],
['Mystery', 'Thriller'],
['Comedy', 'Drama'],
['Adventure', 'Drama', 'Family'],
['Horror', 'Thriller'],
['Action', 'Drama', 'Romance'],
['Drama', 'Romance'],
['Action', 'Adventure', 'Fantasy'],
['Comedy'],
['Action', 'Biography', 'Drama'],
['Drama', 'Mystery', 'Romance'],
['Adventure', 'Drama', 'Western'],
['Drama', 'Music', 'Romance'],
['Comedy', 'Romance', 'Western'],
['Thriller'],
['Comedy', 'Drama', 'Romance'],
['Horror', 'Thriller'],
['Adventure', 'Family', 'Fantasy'],
['Crime', 'Drama', 'Mystery'],
['Horror', 'Mystery'],
['Comedy', 'Crime', 'Drama'],
['Action', 'Comedy', 'Romance'],
['Biography', 'Drama', 'History'],
['Adventure', 'Drama'],
['Drama', 'Thriller'],
['Drama'],
['Action', 'Adventure', 'Fantasy'],
['Action', 'Biography', 'Drama'],
['Drama', 'Music'],
['Comedy', 'Drama'],
['Drama', 'Thriller', 'War'],
['Action', 'Mystery', 'Thriller'],
['Horror', 'Sci-Fi', 'Thriller'],
['Comedy', 'Drama', 'Romance'],
['Action', 'Sci-Fi'],
['Action', 'Adventure', 'Fantasy'],
['Drama', 'Mystery', 'Romance'],
['Drama'],
['Action', 'Adventure', 'Thriller'],
['Action', 'Crime', 'Thriller'],
['Animation', 'Action', 'Adventure'],
['Drama', 'Fantasy', 'Mystery'],
['Drama', 'Sci-Fi'],
['Animation', 'Adventure', 'Comedy'],
['Horror', 'Thriller'],
['Action', 'Thriller'],
['Comedy'],
['Biography', 'Drama'],
['Action', 'Mystery', 'Thriller'],
['Action', 'Mystery', 'Sci-Fi'],
['Crime', 'Drama', 'Thriller'],
['Comedy', 'Romance'],
['Comedy', 'Drama', 'Romance'],
['Biography', 'Drama', 'Thriller'],
['Drama'],
['Action', 'Adventure', 'Family'],
['Animation', 'Comedy', 'Family'],
['Action', 'Crime', 'Drama'],
['Comedy'],
['Comedy', 'Crime', 'Thriller'],
['Comedy', 'Romance'],
['Animation', 'Comedy', 'Drama'],
['Action', 'Crime', 'Thriller'],
['Comedy', 'Romance'],
['Adventure', 'Biography', 'Drama'],
['Animation', 'Adventure', 'Comedy'],
['Crime', 'Drama', 'Mystery'],
['Action', 'Comedy', 'Sci-Fi'],
['Comedy', 'Fantasy', 'Horror'],
['Comedy', 'Crime'],
['Animation', 'Action', 'Adventure'],
['Action', 'Drama', 'Thriller'],
['Fantasy', 'Horror'],
['Crime', 'Drama', 'Thriller'],
['Action', 'Adventure', 'Fantasy'],
['Comedy', 'Drama', 'Romance'],
['Biography', 'Drama', 'Romance'],
['Action', 'Drama', 'History'],
['Action', 'Adventure', 'Comedy'],
['Horror', 'Thriller'],
['Horror', 'Mystery', 'Thriller'],
['Comedy', 'Romance'],
['Animation', 'Adventure', 'Comedy'],
['Crime', 'Drama', 'Mystery'],
['Crime', 'Drama', 'Mystery'],
['Adventure', 'Biography', 'Drama'],
['Horror', 'Mystery', 'Thriller'],
['Horror', 'Thriller'],
['Drama', 'Romance', 'War'],
['Adventure', 'Fantasy', 'Mystery'],
['Action', 'Adventure', 'Sci-Fi'],
['Biography', 'Drama'],
['Drama', 'Thriller'],
['Horror', 'Thriller'],
['Drama', 'Horror', 'Thriller'],
['Action', 'Adventure', 'Fantasy'],
['Action', 'Horror', 'Thriller'],
['Comedy'],
['Drama', 'Sport'],
['Comedy', 'Family'],
['Drama', 'Romance'],
['Action', 'Adventure', 'Comedy'],
['Comedy'],
['Mystery', 'Romance', 'Thriller'],
['Crime', 'Drama'],
['Action', 'Comedy'],
['Crime', 'Drama', 'Mystery'],
['Biography', 'Drama', 'Romance'],
['Comedy', 'Crime'],
['Drama', 'Thriller'],
['Drama'],
['Animation', 'Adventure', 'Comedy'],
['Action', 'Thriller'],
['Drama', 'Thriller'],
['Animation', 'Adventure', 'Comedy'],
['Crime', 'Drama', 'Mystery'],
['Thriller'],
['Biography', 'Drama', 'Sport'],
['Crime', 'Drama', 'Thriller'],
['Drama', 'Music'],
['Crime', 'Drama', 'Thriller'],
['Drama', 'Romance'],
['Animation', 'Action', 'Adventure'],
['Comedy', 'Drama'],
['Action', 'Adventure', 'Drama'],
['Biography', 'Crime', 'Drama'],
['Horror'],
['Biography', 'Drama', 'Mystery'],
['Drama', 'Romance'],
['Animation', 'Drama', 'Romance'],
['Comedy', 'Family'],
['Drama'],
['Mystery', 'Thriller'],
['Drama', 'Fantasy', 'Horror'],
['Drama', 'Romance'],
['Biography', 'Drama', 'History'],
['Comedy', 'Family'],
['Action', 'Adventure', 'Thriller'],
['Comedy', 'Drama'],
['Action', 'Adventure', 'Fantasy'],
['Action', 'Thriller'],
['Drama', 'Romance'],
['Comedy', 'Drama', 'Romance'],
['Drama', 'Horror', 'Sci-Fi'],
['Comedy', 'Horror', 'Romance'],
['Drama'],
['Action', 'Adventure', 'Sci-Fi'],
['Action', 'Adventure', 'Fantasy'],
['Action', 'Adventure', 'Drama'],
['Biography', 'Comedy', 'Drama'],
['Drama', 'Mystery', 'Romance'],
['Animation', 'Adventure', 'Comedy'],
['Drama', 'Romance', 'Sci-Fi'],
['Drama'],
['Drama', 'Fantasy'],
['Drama', 'Romance'],
['Comedy', 'Horror', 'Thriller'],
['Comedy', 'Drama', 'Romance'],
['Crime', 'Drama'],
['Comedy', 'Romance'],
['Action', 'Drama', 'Family'],
['Comedy', 'Drama', 'Romance'],
['Action', 'Thriller', 'War'],
['Action', 'Comedy', 'Horror'],
['Biography', 'Drama', 'Sport'],
['Adventure', 'Comedy', 'Drama'],
['Comedy', 'Romance'],
['Comedy', 'Romance'],
['Comedy', 'Drama', 'Romance'],
['Action', 'Adventure', 'Crime'],
['Comedy', 'Romance'],
['Animation', 'Action', 'Adventure'],
['Action', 'Crime', 'Sci-Fi'],
['Drama'],
['Comedy', 'Drama', 'Romance'],
['Crime', 'Thriller'],
['Comedy', 'Horror', 'Sci-Fi'],
['Drama', 'Thriller'],
['Drama', 'Fantasy', 'Horror'],
['Thriller'],
['Adventure', 'Drama', 'Family'],
['Mystery', 'Sci-Fi', 'Thriller'],
['Biography', 'Crime', 'Drama'],
['Drama', 'Fantasy', 'Horror'],
['Action', 'Adventure', 'Thriller'],
['Crime', 'Drama', 'Horror'],
['Crime', 'Drama', 'Fantasy'],
['Adventure', 'Family', 'Fantasy'],
['Action', 'Adventure', 'Drama'],
['Action', 'Comedy', 'Horror'],
['Comedy', 'Drama', 'Family'],
['Action', 'Thriller'],
['Action', 'Adventure', 'Sci-Fi'],
['Adventure', 'Drama', 'Fantasy'],
['Drama'],
['Drama'],
['Comedy'],
['Drama'],
['Comedy', 'Drama', 'Music'],
['Drama', 'Fantasy', 'Music'],
['Drama'],
['Thriller'],
['Comedy', 'Horror'],
['Action', 'Comedy', 'Sport'],
['Horror'],
['Comedy', 'Drama'],
['Action', 'Drama', 'Thriller'],
['Drama', 'Romance'],
['Horror', 'Mystery'],
['Adventure', 'Drama', 'Fantasy'],
['Thriller'],
['Comedy', 'Romance'],
['Action', 'Sci-Fi', 'Thriller'],
['Fantasy', 'Mystery', 'Thriller'],
['Biography', 'Drama'],
['Crime', 'Drama'],
['Action', 'Adventure', 'Sci-Fi'],
['Adventure'],
['Comedy', 'Drama'],
['Comedy', 'Drama'],
['Comedy', 'Drama', 'Romance'],
['Adventure', 'Comedy', 'Drama'],
['Action', 'Sci-Fi', 'Thriller'],
['Comedy', 'Romance'],
['Action', 'Fantasy', 'Horror'],
['Crime', 'Drama', 'Thriller'],
['Action', 'Drama', 'Thriller'],
['Crime', 'Drama', 'Mystery'],
['Crime', 'Drama', 'Mystery'],
['Drama', 'Sci-Fi', 'Thriller'],
['Biography', 'Drama', 'History'],
['Crime', 'Horror', 'Thriller'],
['Drama'],
['Drama', 'Mystery', 'Thriller'],
['Adventure', 'Biography'],
['Adventure', 'Biography', 'Crime'],
['Action', 'Horror', 'Thriller'],
['Action', 'Adventure', 'Western'],
['Horror', 'Thriller'],
['Drama', 'Mystery', 'Thriller'],
['Comedy', 'Drama', 'Musical'],
['Horror', 'Mystery'],
['Biography', 'Drama', 'Sport'],
['Comedy', 'Family', 'Romance'],
['Drama', 'Mystery', 'Thriller'],
['Comedy'],
['Drama'],
['Drama', 'Thriller'],
['Biography', 'Drama', 'Family'],
['Comedy', 'Drama', 'Family'],
['Drama', 'Fantasy', 'Musical'],
['Comedy'],
['Adventure', 'Family'],
['Adventure', 'Comedy', 'Fantasy'],
['Horror', 'Thriller'],
['Drama', 'Romance'],
['Horror'],
['Biography', 'Drama', 'History'],
['Action', 'Adventure', 'Fantasy'],
['Drama', 'Family', 'Music'],
['Comedy', 'Drama', 'Romance'],
['Action', 'Adventure', 'Horror'],
['Comedy'],
['Crime', 'Drama', 'Mystery'],
['Horror'],
['Drama', 'Music', 'Romance'],
['Adventure', 'Comedy'],
['Comedy', 'Family', 'Fantasy']]
In [42]:
genre_list = np.unique([i for j in temp_list for i in j])
genre_list
Out[42]:
array(['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime',
'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music',
'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Sport', 'Thriller',
'War', 'Western'], dtype='<U9')
In [29]:
zeros = np.zeros([movie.shape[0], genre_list.shape[0]])
In [30]:
temp_movie = pd.DataFrame(zeros, columns=genre_list)
In [32]:
temp_movie.head()
Out[32]:
Action | Adventure | Animation | Biography | Comedy | Crime | Drama | Family | Fantasy | History | Horror | Music | Musical | Mystery | Romance | Sci-Fi | Sport | Thriller | War | Western | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
In [33]:
# .ix has been removed from newer pandas; .loc with a list of column labels does the same thing
for i in range(movie.shape[0]):
    temp_movie.loc[i, temp_list[i]] = 1
In [35]:
temp_movie.head()
Out[35]:
Action | Adventure | Animation | Biography | Comedy | Crime | Drama | Family | Fantasy | History | Horror | Music | Musical | Mystery | Romance | Sci-Fi | Sport | Thriller | War | Western | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
In [38]:
genre = temp_movie.sum().sort_values(ascending=False)
genre
Out[38]:
Drama 513.0
Action 303.0
Comedy 279.0
Adventure 259.0
Thriller 195.0
Crime 150.0
Romance 141.0
Sci-Fi 120.0
Horror 119.0
Mystery 106.0
Fantasy 101.0
Biography 81.0
Family 51.0
Animation 49.0
History 29.0
Sport 18.0
Music 16.0
War 13.0
Western 7.0
Musical 5.0
dtype: float64
In [41]:
genre.plot(kind="bar", colormap="cool", figsize=(20, 8), fontsize=16)
Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x128bdea58>
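For reference, the zeros matrix plus loop above can usually be replaced by a single call to Series.str.get_dummies, which splits each Genre string on the separator and builds the indicator columns in one step (the result is integer 0/1 rather than float, but the per-genre totals come out the same):

# one-hot encode the comma-separated Genre column in one call
genre_dummies = movie["Genre"].str.get_dummies(sep=",")
genre_dummies.sum().sort_values(ascending=False)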