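Preface
The previous section built the movieList dataset, but its raw form is hard for a person to read, so here we change how it is stored and save it as a CSV file.
Building the data
The structure of movieList shows that it is a list of tuples. To make it easier to inspect, we store it as a DataFrame indexed by date.
Add a function that builds the DataFrame, taking movieList and a date as parameters: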
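import pandas as pd
def buildDataFrame(movieList, date):
    # Every row for one day shares the same date label as its index.
    index = [date] * len(movieList)
    df = pd.DataFrame(movieList, columns=['name', 'box', 'boxRatio', 'playRatio', 'attendance'], index=index)
    # parseHtml returns strings; convert the box office to float so it can be compared.
    df['box'] = df['box'].astype('float64')
    return df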
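As a minimal sketch of what this function produces, the snippet below builds the same kind of frame from three rows copied from the 2017-08-01 output in this post, so it runs without any network access. The name build_frame is illustrative only and is not part of the script; Python 3 is assumed.

```python
import pandas as pd

# Three rows copied from the 2017-08-01 output in this post. The parser
# returns every field as a string, so the numbers are quoted here too.
movie_list = [
    ('战狼2', '29249.43', '86.3%', '56.4%', '42.4%'),
    ('建军大业', '3248.29', '9.6%', '22.3%', '19.3%'),
    ('神偷奶爸3', '464.67', '1.4%', '4.7%', '13.7%'),
]

COLUMNS = ['name', 'box', 'boxRatio', 'playRatio', 'attendance']


def build_frame(rows, day):
    """Illustrative stand-in for buildDataFrame: one row per film, indexed by date."""
    frame = pd.DataFrame(rows, columns=COLUMNS, index=[day] * len(rows))
    frame['box'] = frame['box'].astype('float64')  # strings -> floats
    return frame


df = build_frame(movie_list, '2017-08-01')
print(df.dtypes)
print(df.loc['2017-08-01', 'box'].sum())
```

Because every row carries the same date label, df.loc['2017-08-01'] selects the whole day at once.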
Run it to see the result:
if __name__ == "__main__":
    data = getData(url)
    html = getHtml(data)
    movieList = parseHtml(html)
    df = buildDataFrame(movieList, '2017-08-01')
    print(df)
The output is:
In [92]: runfile('C:/Users/Administrator/.spyder2/temp.py', wdir='C:/Users/Administrator/.spyder2')
name box boxRatio playRatio attendance
2017-08-01 战狼2 29249.43 86.3% 56.4% 42.4%
2017-08-01 建军大业 3248.29 9.6% 22.3% 19.3%
2017-08-01 神偷奶爸3 464.67 1.4% 4.7% 13.7%
2017-08-01 大耳朵图图之美食狂想曲 225.61 0.7% 3.9% 10.0%
2017-08-01 绣春刀II:修罗战场 186.07 0.5% 3.0% 11.4%
2017-08-01 闪光少女 137.56 0.4% 1.4% 15.6%
2017-08-01 悟空传 130.83 0.4% 2.2% 10.2%
2017-08-01 豆福传 66.90 0.2% 2.8% 6.9%
2017-08-01 父子雄兵 44.96 0.1% 0.9% 11.4%
2017-08-01 大护法 29.90 0.1% 0.4% 10.5%
2017-08-01 阿唐奇遇 23.77 0.1% 0.5% 9.8%
2017-08-01 夜半凶铃 17.66 0.1% 0.7% 9.1%
2017-08-01 血战湘江 17.26 0.1% 0.0% 64.3%
2017-08-01 京城81号II 9.74 0.0% 0.2% 9.3%
2017-08-01 绿野仙踪之奥兹国奇幻之旅 7.40 0.0% 0.2% 6.9%
2017-08-01 深夜食堂2 7.34 0.0% 0.1% 12.1%
2017-08-01 地球:神奇的一天 4.25 0.0% 0.0% 100%
2017-08-01 冈仁波齐 4.04 0.0% 0.1% 12.2%
2017-08-01 李三娘 2.26 0.0% 0.0% 70.2%
2017-08-01 喵星人 2.06 0.0% 0.1% 10.1%
2017-08-01 战狼 1.75 0.0% 0.0% 5.5%
2017-08-01 鲛珠传 1.73 0.0% 0.0% 100%
2017-08-01 阳光萌星社 1.64 0.0% 0.0% 69.4%
2017-08-01 我是马布里 1.60 0.0% 0.0% 76.7%
2017-08-01 重返·狼群 1.59 0.0% 0.0% 9.5%
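As we can see, this presentation is much easier to read.
Let's find the films whose box office on 2017-08-01 exceeded 1 million yuan (the box column is in units of 10,000 yuan, so the threshold is 100):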
In [93]: df[df.box > 100]
Out[93]:
name box boxRatio playRatio attendance
2017-08-01 战狼2 29249.43 86.3% 56.4% 42.4%
2017-08-01 建军大业 3248.29 9.6% 22.3% 19.3%
2017-08-01 神偷奶爸3 464.67 1.4% 4.7% 13.7%
2017-08-01 大耳朵图图之美食狂想曲 225.61 0.7% 3.9% 10.0%
2017-08-01 绣春刀II:修罗战场 186.07 0.5% 3.0% 11.4%
2017-08-01 闪光少女 137.56 0.4% 1.4% 15.6%
2017-08-01 悟空传 130.83 0.4% 2.2% 10.2%
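The expression df.box > 100 yields a boolean Series, and indexing the frame with it keeps only the rows marked True. A small sketch of that mechanism, using three rows from the table above (Python 3 assumed; the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {'name': ['战狼2', '悟空传', '豆福传'], 'box': [29249.43, 130.83, 66.90]},
    index=['2017-08-01'] * 3,
)

mask = df['box'] > 100   # boolean Series, one entry per row
hits = df[mask]          # keeps only the rows where mask is True
print(hits)

# Conditions can be combined with & and |; each one needs its own parentheses.
mid = df[(df['box'] > 100) & (df['box'] < 1000)]
print(mid['name'].tolist())
```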
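Fetching data for multiple days
Parameterizing the date
To fetch data for multiple days, the date in the URL has to be passed in as a parameter.
Modify the getData function and rename it to getDataByDate: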
def getDataByDate(date):
    url = 'https://piaofang.maoyan.com/dayoffice?date=%s&cnt=10' % date
    headers = {
        "authority": "piaofang.maoyan.com",
        "method": "GET",
        "path": "/dayoffice?date=%s&cnt=10" % date,
        "scheme": "https",
        "accept": "*/*",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "zh-CN,zh;q=0.8",
        "referer": "https://piaofang.maoyan.com/?date=%s" % date,
        "uid": "e4e5902fc42ad5e198b207d76af1d82e7056cb82",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
        "x-requested-with": "XMLHttpRequest"
    }
    req = requests.get(url, headers=headers)
    return req.content
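Since the date is interpolated straight into the URL, a malformed value only shows up once the request fails. One possible safeguard, sketched below with the standard library only, is to check the date first and build the query string with urlencode. The helper name build_url and its cnt default are assumptions made for illustration, and Python 3 is assumed.

```python
from datetime import datetime
from urllib.parse import urlencode

BASE_URL = 'https://piaofang.maoyan.com/dayoffice'  # endpoint taken from getDataByDate


def build_url(day, cnt=10):
    """Hypothetical helper: validate the date, then build the request URL."""
    # strptime raises ValueError when the string does not parse as year-month-day.
    datetime.strptime(day, '%Y-%m-%d')
    return BASE_URL + '?' + urlencode({'date': day, 'cnt': cnt})


print(build_url('2017-08-02'))
```

The resulting string could then be passed to requests.get in place of the hand-formatted URL.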
Then fetch the data for 2017-08-02:
if __name__ == "__main__":
    day = '2017-08-02'
    data = getDataByDate(day)
    html = getHtml(data)
    movieList = parseHtml(html)
    df = buildDataFrame(movieList, day)
    print(df)
The output is:
In [107]: runfile('C:/Users/Administrator/.spyder2/temp.py', wdir='C:/Users/Administrator/.spyder2')
name box boxRatio playRatio attendance
2017-08-02 战狼2 27898.31 88.8% 59.1% 39.2%
2017-08-02 建军大业 2229.18 7.1% 20.4% 15.5%
2017-08-02 神偷奶爸3 431.69 1.4% 4.8% 12.8%
2017-08-02 大耳朵图图之美食狂想曲 203.25 0.6% 3.8% 9.4%
2017-08-02 绣春刀II:修罗战场 168.69 0.5% 2.8% 11.3%
2017-08-02 闪光少女 122.52 0.4% 1.4% 14.4%
2017-08-02 悟空传 108.73 0.3% 2.0% 9.6%
2017-08-02 豆福传 59.12 0.2% 2.5% 7.0%
2017-08-02 父子雄兵 41.56 0.1% 0.8% 11.9%
2017-08-02 血战湘江 27.31 0.1% 0.0% 61.7%
2017-08-02 大护法 25.56 0.1% 0.4% 10.3%
2017-08-02 阿唐奇遇 21.95 0.1% 0.5% 8.5%
2017-08-02 夜半凶铃 15.29 0.0% 0.6% 9.3%
2017-08-02 京城81号II 8.72 0.0% 0.2% 8.4%
2017-08-02 皮绳上的魂 7.07 0.0% 0.0% 99.2%
2017-08-02 绿野仙踪之奥兹国奇幻之旅 5.30 0.0% 0.2% 6.3%
2017-08-02 我是马布里 4.38 0.0% 0.0% 93.1%
2017-08-02 深夜食堂2 4.31 0.0% 0.1% 9.5%
2017-08-02 冈仁波齐 4.25 0.0% 0.1% 13.0%
2017-08-02 李三娘 3.86 0.0% 0.0% 69.3%
2017-08-02 阳光萌星社 2.30 0.0% 0.0% 65.6%
2017-08-02 喵星人 2.25 0.0% 0.1% 13.1%
2017-08-02 破·局 1.93 0.0% 0.0% 75.2%
2017-08-02 龙之战 1.52 0.0% 0.0% 98.0%
2017-08-02 重返·狼群 1.46 0.0% 0.0% 10.3%
2017-08-02 战狼 1.37 0.0% 0.0% 1.7%
2017-08-02 穆桂英挂帅 1.12 0.0% 0.0% 100%
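Fetching data for a date range
Fetching data for a specific date now works, so we abstract and encapsulate one step further to fetch data for a given date range.
Add a function that generates the sequence of dates: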
from datetime import date, timedelta
def buildDates(start, days):
    day = timedelta(days=1)
    for i in range(days):
        yield start + day*i
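To see what the generator yields, here is a short standalone sketch that mirrors buildDates under the illustrative name build_dates and starts two days before a month boundary:

```python
from datetime import date, timedelta


def build_dates(start, days):
    """Yield `days` consecutive dates beginning at `start` (mirrors buildDates)."""
    for offset in range(days):
        yield start + timedelta(days=offset)


# Cross a month boundary to show that date arithmetic handles the rollover.
days = [str(d) for d in build_dates(date(2017, 7, 30), 4)]
print(days)  # ['2017-07-30', '2017-07-31', '2017-08-01', '2017-08-02']
```

Calling str on a date gives the YYYY-MM-DD form that the URL expects, which is why the string conversion is all the script needs.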
Wrap the logic that used to live in the main block into a new getData function that fetches all the data within the given date range:
def getData(Y, M, D, days):
    start = date(Y, M, D)
    frames = []
    for d in buildDates(start, days):
        day = str(d)
        data = getDataByDate(day)
        html = getHtml(data)
        movieList = parseHtml(html)
        frames.append(buildDataFrame(movieList, day))
    # DataFrame.append was removed in pandas 2.0; concatenate once instead.
    return pd.concat(frames) if frames else pd.DataFrame()
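The accumulation step can be tried without touching the network. The sketch below replaces the fetch-and-parse chain with a stub, fetch_day, that returns canned rows taken from this post's output; fetch_day, collect, and the two-column layout are assumptions made for illustration. It gathers the per-day frames in a list and calls pd.concat once, which works on current pandas (DataFrame.append was removed in pandas 2.0). Python 3 is assumed.

```python
import pandas as pd
from datetime import date, timedelta

COLUMNS = ['name', 'box']

# Canned rows (taken from this post's output) standing in for the network call.
SAMPLE = {
    '2017-08-01': [('战狼2', 29249.43), ('建军大业', 3248.29)],
    '2017-08-02': [('战狼2', 27898.31), ('建军大业', 2229.18)],
}


def fetch_day(day):
    """Hypothetical stub for getDataByDate + getHtml + parseHtml."""
    return SAMPLE.get(day, [])


def collect(start, days):
    frames = []
    for offset in range(days):
        day = str(start + timedelta(days=offset))
        rows = fetch_day(day)
        frames.append(pd.DataFrame(rows, columns=COLUMNS, index=[day] * len(rows)))
    # One concat at the end avoids copying the growing frame on every pass.
    return pd.concat(frames) if frames else pd.DataFrame(columns=COLUMNS)


df = collect(date(2017, 8, 1), 2)
print(df)
print(df.groupby(level=0)['box'].sum())  # total box office per day
```

With the date as the index, groupby(level=0) aggregates per day directly.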
Run it from the main block to see the result, fetching the data from 2017-08-01 through 2017-08-03:
if __name__ == "__main__":
    df = getData(2017, 8, 1, 3)
    print(df)
The result is:
runfile('C:/Users/Administrator/.spyder2/temp.py', wdir='C:/Users/Administrator/.spyder2')
name box boxRatio playRatio attendance
2017-08-01 战狼2 29249.43 86.3% 56.4% 42.4%
2017-08-01 建军大业 3248.29 9.6% 22.3% 19.3%
2017-08-01 神偷奶爸3 464.67 1.4% 4.7% 13.7%
2017-08-01 大耳朵图图之美食狂想曲 225.61 0.7% 3.9% 10.0%
2017-08-01 绣春刀II:修罗战场 186.07 0.5% 3.0% 11.4%
2017-08-01 闪光少女 137.56 0.4% 1.4% 15.6%
2017-08-01 悟空传 130.83 0.4% 2.2% 10.2%
2017-08-01 豆福传 66.90 0.2% 2.8% 6.9%
2017-08-01 父子雄兵 44.96 0.1% 0.9% 11.4%
2017-08-01 大护法 29.90 0.1% 0.4% 10.5%
2017-08-01 阿唐奇遇 23.77 0.1% 0.5% 9.8%
2017-08-01 夜半凶铃 17.66 0.1% 0.7% 9.1%
2017-08-01 血战湘江 17.26 0.1% 0.0% 64.3%
2017-08-01 京城81号II 9.74 0.0% 0.2% 9.3%
2017-08-01 绿野仙踪之奥兹国奇幻之旅 7.40 0.0% 0.2% 6.9%
2017-08-01 深夜食堂2 7.34 0.0% 0.1% 12.1%
2017-08-01 地球:神奇的一天 4.25 0.0% 0.0% 100%
2017-08-01 冈仁波齐 4.04 0.0% 0.1% 12.2%
2017-08-01 李三娘 2.26 0.0% 0.0% 70.2%
2017-08-01 喵星人 2.06 0.0% 0.1% 10.1%
2017-08-01 战狼 1.75 0.0% 0.0% 5.5%
2017-08-01 鲛珠传 1.73 0.0% 0.0% 100%
2017-08-01 阳光萌星社 1.64 0.0% 0.0% 69.4%
2017-08-01 我是马布里 1.60 0.0% 0.0% 76.7%
2017-08-01 重返·狼群 1.59 0.0% 0.0% 9.5%
2017-08-02 战狼2 27898.31 88.8% 59.1% 39.2%
2017-08-02 建军大业 2229.18 7.1% 20.4% 15.5%
2017-08-02 神偷奶爸3 431.69 1.4% 4.8% 12.8%
2017-08-02 大耳朵图图之美食狂想曲 203.25 0.6% 3.8% 9.4%
2017-08-02 绣春刀II:修罗战场 168.69 0.5% 2.8% 11.3%
... ... ... ... ... ...
2017-08-02 龙之战 1.52 0.0% 0.0% 98.0%
2017-08-02 重返·狼群 1.46 0.0% 0.0% 10.3%
2017-08-02 战狼 1.37 0.0% 0.0% 1.7%
2017-08-02 穆桂英挂帅 1.12 0.0% 0.0% 100%
2017-08-03 战狼2 22786.77 54.7% 44.7% 43.4%
2017-08-03 三生三世十里桃花 16933.81 40.7% 32.2% 43.7%
2017-08-03 建军大业 1186.55 2.8% 10.2% 16.7%
2017-08-03 神偷奶爸3 259.93 0.6% 3.2% 10.4%
2017-08-03 大耳朵图图之美食狂想曲 134.45 0.3% 2.4% 9.1%
2017-08-03 闪光少女 58.17 0.1% 0.7% 11.8%
2017-08-03 绣春刀II:修罗战场 53.99 0.1% 1.1% 8.6%
2017-08-03 谁是球王 51.46 0.1% 2.0% 28.5%
2017-08-03 心理罪 33.21 0.1% 0.0% 88.4%
2017-08-03 豆福传 26.74 0.1% 1.2% 5.7%
2017-08-03 悟空传 26.13 0.1% 0.7% 7.0%
2017-08-03 血战湘江 24.51 0.1% 0.0% 49.9%
2017-08-03 阿唐奇遇 12.50 0.0% 0.3% 7.8%
2017-08-03 父子雄兵 11.56 0.0% 0.3% 8.8%
2017-08-03 大护法 10.00 0.0% 0.2% 8.0%
2017-08-03 夜半凶铃 7.45 0.0% 0.3% 8.0%
2017-08-03 我是马布里 6.35 0.0% 0.0% 41.6%
2017-08-03 李三娘 3.60 0.0% 0.0% 0.0%
2017-08-03 绿野仙踪之奥兹国奇幻之旅 3.01 0.0% 0.1% 5.5%
2017-08-03 阳光萌星社 2.69 0.0% 0.0% 66.2%
2017-08-03 京城81号II 2.15 0.0% 0.1% 9.1%
2017-08-03 冈仁波齐 1.94 0.0% 0.0% 10.0%
2017-08-03 深夜食堂2 1.45 0.0% 0.0% 8.6%
2017-08-03 战狼 1.19 0.0% 0.0% 4.7%
2017-08-03 喵星人 1.17 0.0% 0.0% 19.8%
2017-08-03 重返·狼群 1.10 0.0% 0.0% 11.7%
[78 rows x 5 columns]
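Saving to a CSV file
Saving a DataFrame as a CSV file is simple: call the DataFrame's to_csv method.
def writeToCSV(df, path):
    df.to_csv(path, encoding='utf-8')
Call this function from the main block to write the three days of data to a CSV file (the data directory must already exist):
if __name__ == "__main__":
    df = getData(2017, 8, 1, 3)
    writeToCSV(df, 'data/out.csv')
After it finishes, a new out.csv file appears in the target directory; opening it shows that all of our data has been saved.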
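A quick way to check the saved file is to read it back. The sketch below writes two rows from this post's output to a temporary directory and restores them. The index_label and utf-8-sig choices are suggestions rather than part of the original script: the first names the date column in the header row, and the second writes a byte-order mark that helps some spreadsheet programs detect UTF-8. Python 3 is assumed.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {'name': ['战狼2', '建军大业'], 'box': [29249.43, 3248.29]},
    index=['2017-08-01', '2017-08-01'],
)

# Write into a fresh temporary directory so the example leaves no stray files
# in the working directory.
path = os.path.join(tempfile.mkdtemp(), 'out.csv')

# index_label names the otherwise anonymous date column in the header row.
df.to_csv(path, index_label='date', encoding='utf-8-sig')

# index_col=0 turns the first CSV column back into the index.
restored = pd.read_csv(path, index_col=0, encoding='utf-8-sig')
print(restored)
```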
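Complete code
To make later processing easier, the ratio columns are now stored as numbers. The complete code for this section is as follows: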
# -*- coding: utf-8 -*-
import json
import re
import requests
import pandas as pd
from datetime import date, timedelta
def getHtml(jsonData):
    data = json.loads(jsonData)
    # Keep the text as a unicode string and strip newlines and spaces so the regex can match.
    return data['ticketList'].replace('\n', '').replace(' ', '')
def parseHtml(html):
    # The percent sign sits outside each capture group, so ratios come back as bare numbers.
    reg = r"<ul.+?><liclass='c1'><b>(.+?)</b>.+?</li>"
    reg += r"<liclass=\"c2\"><b>(.+?)</b>.+?</li>"
    reg += r"<liclass=\"c3\">(.+?)%</li>"
    reg += r"<liclass=\"c4\">(.+?)%</li>"
    reg += r"<liclass=\"c5\"><spanstyle=\"margin-right:-.1rem\">(.+?)%</span>"
    pattern = re.compile(reg)
    movieList = re.findall(pattern, html)
    return movieList
def getDataByDate(date):
    url = 'https://piaofang.maoyan.com/dayoffice?date=%s&cnt=10' % date
    headers = {
        "authority": "piaofang.maoyan.com",
        "method": "GET",
        "path": "/dayoffice?date=%s&cnt=10" % date,
        "scheme": "https",
        "accept": "*/*",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "zh-CN,zh;q=0.8",
        "referer": "https://piaofang.maoyan.com/?date=%s" % date,
        "uid": "e4e5902fc42ad5e198b207d76af1d82e7056cb82",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
        "x-requested-with": "XMLHttpRequest"
    }
    req = requests.get(url, headers=headers)
    return req.content
def buildDataFrame(movieList, date):
    # Every row for one day shares the same date label as its index.
    index = [date] * len(movieList)
    df = pd.DataFrame(movieList, columns=['name', 'box', 'boxRatio', 'playRatio', 'attendance'], index=index)
    # Convert every numeric column from string to float.
    for col in ['box', 'boxRatio', 'playRatio', 'attendance']:
        df[col] = df[col].astype('float64')
    return df
def buildDates(start, days):
    day = timedelta(days=1)
    for i in range(days):
        yield start + day*i
def getData(Y, M, D, days):
    start = date(Y, M, D)
    frames = []
    for d in buildDates(start, days):
        day = str(d)
        data = getDataByDate(day)
        html = getHtml(data)
        movieList = parseHtml(html)
        frames.append(buildDataFrame(movieList, day))
    # DataFrame.append was removed in pandas 2.0; concatenate once instead.
    return pd.concat(frames) if frames else pd.DataFrame()
def writeToCSV(df, path):
    df.to_csv(path, encoding='utf-8')
if __name__ == "__main__":
    df = getData(2017, 8, 1, 3)
    print(df)
    writeToCSV(df, 'data/out.csv')
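The listing above drops the percent sign inside the regular expression. When the strings already carry it, as in the tables printed earlier in this post, the same conversion can be done in pandas instead. A minimal sketch, assuming Python 3 and using two rows from the 2017-08-01 output:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['战狼2', '地球:神奇的一天'],
    'boxRatio': ['86.3%', '0.0%'],
    'attendance': ['42.4%', '100%'],
})

for col in ['boxRatio', 'attendance']:
    # Drop the trailing percent sign, then parse what is left as a float.
    df[col] = df[col].str.rstrip('%').astype('float64')

print(df.dtypes)
# Numeric columns can now be compared directly.
print(df[df['attendance'] > 50]['name'].tolist())
```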