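1. Project architecture
1.1 Introduction
A system that combines web-scraping-based analysis of large-scale movie data with login and registration screens backed by MySQL.
1.2 System structure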
dao
|-- __init__.py
|-- movie_dao.py # DAO-layer interface class for movies
|-- login_dao.py # DAO-layer interface class for users
ui
|-- __init__.py
|-- register_ui.py # registration screen
|-- login_ui.py # login screen
|-- movie_ui.py # main movie screen
data
|-- __init__.py
|-- movies_tablets.csv # movie table data
|-- moviesBoxOffice.csv # historical movie data
|-- recentlyMovies.csv # now-showing movie data
|-- top10_data.csv # top-10 movie data
|-- top_movie.csv # movie ranking data
utils
|-- __init__.py
|-- db_helper.py # database helper class
|-- movie_util.py # movie ranking and keyword-search interfaces
|-- pyec.py # chart generation (provides Showing, see 3.6)
main.py # program entry point
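1.3 Required packages
numpy: an array-based scientific computing library for Python.
pandas: a Python data analysis package, built for data analysis tasks.
requests: a widely used HTTP library for Python that makes sending requests and handling responses convenient.
json: the standard-library module for encoding and decoding JSON data.
sklearn: scikit-learn, a third-party machine learning library (installed as scikit-learn, imported as sklearn).
webbrowser: the standard-library module that opens URLs in the default browser.
tkinter: Python's standard GUI library, used to build GUI applications quickly.
collections: the standard-library module that provides additional container types.
pyecharts: a Python data visualization library built on ECharts that generates many kinds of interactive charts.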
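Before running the project it can help to confirm that the modules listed above are importable. The following is a minimal sketch, not part of the original project; the helper name `missing_modules` is hypothetical, and it only checks whether each top-level module can be located.

```python
from importlib.util import find_spec

# Top-level module names taken from the list above.
REQUIRED_MODULES = [
    "numpy", "pandas", "requests", "json", "sklearn",
    "webbrowser", "tkinter", "collections", "pyecharts",
]


def missing_modules(names):
    """Return the names that the import system cannot locate."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```

Anything reported as missing, apart from the standard-library modules, would normally be installed with pip.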
2. Using the project
Run login_ui.py in the ui package.
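3. Project screenshots
3.1 Login screen (login_ui.py):
It calls the user_login function in login_dao.py to verify the login.
```python
def user_login(self, username: str, password: str):
    # Create the database helper
    db = db_helper()
    # SQL statement with placeholders
    sql = "select * from t_user where luser = %s and lpwd = %s"
    # Values bound to the placeholders
    values = [username, password]
    # Run the query through the helper
    inserts = db.execute_query(str(sql), values)
    # Check the result
    if inserts:
        print("Login succeeded!")
        return 1
    else:
        print("Login failed: wrong username or password")
        return 0
```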
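The implementation of db_helper is not shown in this post, so the following is only a sketch of the same parameterized-query idea. It assumes an in-memory sqlite3 database as a stand-in for MySQL, and the sample user is hypothetical. Note that sqlite3 uses `?` placeholders, while the MySQL query above uses `%s`.

```python
import sqlite3


def user_login(conn, username: str, password: str) -> int:
    """Return 1 if a matching row exists in t_user, otherwise 0."""
    # The values are bound by the driver, never spliced into the SQL text.
    sql = "select 1 from t_user where luser = ? and lpwd = ?"
    row = conn.execute(sql, (username, password)).fetchone()
    return 1 if row else 0


# In-memory stand-in for the t_user table (hypothetical sample data).
conn = sqlite3.connect(":memory:")
conn.execute("create table t_user (luser text, lpwd text)")
conn.execute("insert into t_user values (?, ?)", ("alice", "secret"))

print(user_login(conn, "alice", "secret"))       # correct credentials
print(user_login(conn, "alice", "wrong"))        # wrong password
print(user_login(conn, "alice", "' or '1'='1"))  # injection attempt, compared as plain text
```

Because the values are passed separately from the SQL string, an input such as `' or '1'='1` is compared as an ordinary password and does not change the query.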
3.2 Clicking Register opens the registration screen (register_ui.py)
```python
def do_search_kw(self):
    # Destroy the current window
    self.root.destroy()
    # Open the new window
    from ui.register_ui import register_ui
    re = register_ui()
    re.show()
```
3.3 movie_util.py mainly scrapes and reads movie data
(1) recently()
This function scrapes information about the ten recently released movies with the highest box office.
```python
import json
from datetime import datetime

import pandas as pd
import requests

url = "https://ys.endata.cn/enlib-api/api/movie/getMovie_BoxOffice_Day_Chart.do"
header = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36',
    "Cookie": 'JSESSIONID=b2685bfa-aa4f-4359-ae96-57befaf8d1ec; route=4e39643a15b7003e568cadd862137cf3; Hm_lvt_82932fc4fc199c08b9a83c4c9d02af11=1649834963,1649852471,1649859039,1649900037; Hm_lpvt_82932fc4fc199c08b9a83c4c9d02af11=1649917933'
}
post_BoxOffice_Day_data = {
    'r': 0.7572955414768414,
    'datetype': 'Day',
    'date': datetime.now().strftime('%Y-%m-%d'),
    'sdate': datetime.now().strftime('%Y-%m-%d'),
    'edate': datetime.now().strftime('%Y-%m-%d'),
    'bserviceprice': 1
}
```
The code block above is the preparation before the scraper runs: it holds the target URL, the request headers the scraper needs, and the form data sent with the request.
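The form data repeats the same formatted date three times. As a small sketch of one way to build it, the hypothetical helper `build_day_payload` below formats the date once and takes the day as a parameter. It assumes that `r` is just an arbitrary random value, since the original hard-codes it without explanation.

```python
import random
from datetime import date
from typing import Optional


def build_day_payload(day: date, r: Optional[float] = None) -> dict:
    """Build the form data for a single-day box-office query."""
    day_text = day.strftime("%Y-%m-%d")  # formatted once, reused three times
    return {
        # Assumption: 'r' may be any random float; the original hard-codes one.
        "r": random.random() if r is None else r,
        "datetype": "Day",
        "date": day_text,
        "sdate": day_text,
        "edate": day_text,
        "bserviceprice": 1,
    }


payload = build_day_payload(date(2022, 4, 14), r=0.5)
print(payload["date"], payload["sdate"], payload["edate"])
```

Passing the day in explicitly also makes it easy to query a past date instead of today.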
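```python
# Send the POST request and parse the JSON response
res = requests.post(url, headers=header, data=post_BoxOffice_Day_data).text
json_data = json.loads(res)
data0 = json_data['data']['table0']
data1 = json_data['data']['table1']
```
The code block above runs the scraper and parses the response as JSON, which makes the data easy to extract later.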
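If the site changes its response or rejects the request, indexing `['data']['table0']` directly fails with an unhelpful KeyError or TypeError. The following is a sketch of a more defensive parse. The helper name `extract_tables` and the sample response text are hypothetical; only the `data`, `table0` and `table1` keys come from the code above.

```python
import json


def extract_tables(response_text: str, names=("table0", "table1")) -> dict:
    """Parse a JSON response and return the requested tables from its 'data' field."""
    payload = json.loads(response_text)
    data = payload.get("data") or {}
    missing = [name for name in names if name not in data]
    if missing:
        raise KeyError(f"response is missing tables: {missing}")
    return {name: data[name] for name in names}


# Hypothetical response text with the shape the code above expects.
sample = '{"data": {"table0": [{"Irank": 1}], "table1": []}}'
tables = extract_tables(sample)
print(len(tables["table0"]), len(tables["table1"]))
```

On the network side, requests accepts a `timeout=` argument on `requests.post`, and `Response.raise_for_status()` raises on HTTP error codes, so both could be applied before the text reaches a parser like this one.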
```python
movie_rank = []
movie_details_MovieName = []
movie_details_BoxOffice = []
movie_details_ShowCount = []
movie_details_AudienceCount = []
movie_details_Attendance = []
movie_percent_BoxOfficePercent = []
movie_percent_ShowCountPercent = []
movie_percent_AudienceCountPercent = []
```
The code above defines some of the required data fields.
```python
for i in range(10):
    movie_rank.append(data0[i]['Irank'])
    movie_details_MovieName.append(data0[i]['MovieName'])
    movie_details_BoxOffice.append(data0[i]['BoxOffice'])
    movie_details_ShowCount.append(data0[i]['ShowCount'])
    movie_details_AudienceCount.append(data0[i]['AudienceCount'])
    movie_details_Attendance.append(data0[i]['Attendance'])
```
The loop above pulls the values out of the JSON data.
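The same extraction can be driven by a list of field names, which avoids one list variable and one append per field. This is a sketch of that alternative, not the author's code: `collect_columns` and the sample rows are hypothetical, while the field names are the keys read from data0 above. Slicing also avoids an IndexError when fewer than ten rows come back.

```python
# Keys read from each row of data0 in the loop above.
DETAIL_FIELDS = ["Irank", "MovieName", "BoxOffice", "ShowCount", "AudienceCount", "Attendance"]


def collect_columns(rows, fields, limit=10):
    """Return {field: [values...]} for at most `limit` rows."""
    top = rows[:limit]  # slicing never raises, even when fewer rows exist
    return {field: [row[field] for row in top] for field in fields}


# Hypothetical rows shaped like data0.
rows = [
    {"Irank": 1, "MovieName": "A", "BoxOffice": 120.5, "ShowCount": 300,
     "AudienceCount": 9000, "Attendance": 12.3},
    {"Irank": 2, "MovieName": "B", "BoxOffice": 80.0, "ShowCount": 200,
     "AudienceCount": 5000, "Attendance": 9.8},
]
columns = collect_columns(rows, DETAIL_FIELDS)
print(columns["MovieName"])
```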
```python
# Column names stay in Chinese because they become the CSV header
top10_data = pd.DataFrame({
    "影片排名": movie_rank,  # rank
    "影片名称": movie_details_MovieName,  # title
    "影片票房": movie_details_BoxOffice,  # box office
    "影片场次": movie_details_ShowCount,  # screenings
    "影片人次": movie_details_AudienceCount,  # admissions
    "上座率": movie_details_Attendance,  # attendance rate
    "影片票房占比": movie_percent_BoxOfficePercent,  # box-office share
    "影片场次占比": movie_percent_ShowCountPercent,  # screening share
    "影片人次占比": movie_percent_AudienceCountPercent,  # admission share
    "一线城市票房": movie_city1_BoxOffice,  # tier-1 cities
    "一线城市场次": movie_city1_ShowCount,
    "一线城市人次": movie_city1_AudienceCount,
    "二线城市票房": movie_city2_BoxOffice,  # tier-2 cities
    "二线城市场次": movie_city2_ShowCount,
    "二线城市人次": movie_city2_AudienceCount,
    "三线城市票房": movie_city3_BoxOffice,  # tier-3 cities
    "三线城市场次": movie_city3_ShowCount,
    "三线城市人次": movie_city3_AudienceCount,
    "四线城市票房": movie_city4_BoxOffice,  # tier-4 cities
    "四线城市场次": movie_city4_ShowCount,
    "四线城市人次": movie_city4_AudienceCount,
    "其它票房": movie_others_BoxOffice,  # other regions
    "其它场次": movie_others_ShowCount,
    "其它人次": movie_others_AudienceCount
})
print(top10_data)
top10_data.to_csv("data/top10_data.csv", encoding='gbk', index=False)
```
The code above builds the table from the collected columns, prints it, and saves it to a CSV file.
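pandas raises a ValueError when a DataFrame is built from lists of different lengths, which can happen here if one field is missing for some movie. The following sketch shows a check that could run before the DataFrame is built and that reports each column's length. The helper name `check_equal_lengths` and the sample columns are hypothetical.

```python
def check_equal_lengths(columns: dict) -> int:
    """Return the common column length, or raise ValueError listing every length."""
    lengths = {name: len(values) for name, values in columns.items()}
    distinct = set(lengths.values())
    if len(distinct) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return distinct.pop() if distinct else 0


# Hypothetical columns of matching length.
good = {"影片排名": [1, 2, 3], "影片名称": ["A", "B", "C"]}
print(check_equal_lengths(good))
```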
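3.4 Clicking History Movie Search:
```python
# Historical movie data
def history(self):
    # showerror(title="Failure", message="Failed to fetch historical movie data")
    if self.treeview is not None:
        self.clear_tree(self.treeview)  # clear the table
    self.create_tree_history()
    self.btn_top2['text'] = '正在努力搜索'  # label: "Searching..."
    # Calls the module-level history() function, not this method
    rows = history(int(self.movie_num_entry.get()))
    self.add_tree(rows, self.treeview)
    self.btn_top2['state'] = NORMAL
    self.btn_top2['text'] = '历史电影搜索'  # label: "History Movie Search"
```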
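The handler above passes the entry text straight to `int()`, which raises ValueError when the box is empty or holds non-numeric text. The following is a minimal sketch of a safer conversion. The helper name `parse_movie_count` and its default and maximum values are assumptions, not part of the project.

```python
def parse_movie_count(text: str, default: int = 10, maximum: int = 100) -> int:
    """Convert entry text to a row count between 1 and `maximum`."""
    try:
        value = int(text.strip())
    except ValueError:
        return default  # empty or non-numeric input
    return max(1, min(value, maximum))


for raw in ["5", "", "abc", "-3", "1000"]:
    print(repr(raw), parse_movie_count(raw))
```

In the handler this would replace the bare `int(self.movie_num_entry.get())` call.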
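The handler calls the following function in movie_util.py:
```python
def history(num: int):
    # Read the historical box-office CSV and return the first num rows as lists
    data = pd.read_csv("C:/Users/LoveB/PycharmProjects/pythonProject1/python/yy_movie/data/moviesBoxOffice.csv", encoding='gbk')
    data = np.array(data[:num]).tolist()
    print(data)
    return data
```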
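The absolute path above only works on the author's machine. The following sketch reads the first rows of a CSV from a path built with pathlib. To stay dependency-free it uses the standard csv module instead of pandas, so the values come back as strings. The helper name `read_first_rows` and the sample file contents are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def read_first_rows(csv_path: Path, num: int, encoding: str = "gbk") -> list:
    """Return up to `num` data rows (header skipped) as lists of strings."""
    with open(csv_path, newline="", encoding=encoding) as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return [row for _, row in zip(range(num), reader)]


# Demo with a temporary file standing in for data/moviesBoxOffice.csv.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    sample = data_dir / "moviesBoxOffice.csv"
    sample.write_text("影片名称,票房\n甲,100\n乙,80\n丙,60\n", encoding="gbk")
    rows = read_first_rows(sample, 2)
print(rows)
```

Inside the project, the data directory could be derived from the module's own location, for example `Path(__file__).resolve().parent.parent / "data"` from a file in utils. With pandas, `pd.read_csv(path, nrows=num)` similarly limits how many rows are read.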
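3.5 Now-showing movie search
```python
# Now-showing movie search
def showing(self):
    # showerror(title="Failure", message="The scraper failed or timed out; please contact the developer")
    if self.treeview is not None:
        self.clear_tree(self.treeview)  # clear the table
    self.create_tree_showing()
    self.btn_top['text'] = '正在努力搜索'  # label: "Searching..."
    # Calls the module-level showing() function, which rewrites recentlyMovies.csv
    showing(int(self.movie_num_entry.get()))
    rows = np.array(pd.read_csv("C:/Users/LoveB/PycharmProjects/pythonProject1/python/yy_movie/data/recentlyMovies.csv", encoding='gbk')).tolist()
    self.add_tree(rows, self.treeview)  # add the rows to the tree
    self.btn_top['state'] = NORMAL
    self.btn_top['text'] = '在映电影搜索'  # label: "Now-Showing Movie Search"
```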
```python
def showing(num: int):
    header = {
        "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36',
        "Cookie": 'JSESSIONID=edf01a0d-deae-4143-9071-2e7eda2c5055; route=4e39643a15b7003e568cadd862137cf3; Hm_lvt_82932fc4fc199c08b9a83c4c9d02af11=1649859039,1649900037,1649983572,1649988152; Hm_lpvt_82932fc4fc199c08b9a83c4c9d02af11=1650016413'
    }
    url_total = "https://ys.endata.cn/enlib-api/api/movie/getMovie_BoxOffice_Day_List.do"
    total_post_data = {
        'r': 0.08330054546930543,
        'datetype': 'Day',
        'date': datetime.now().strftime('%Y-%m-%d'),
        'sdate': datetime.now().strftime('%Y-%m-%d'),
        'edate': datetime.now().strftime('%Y-%m-%d'),
        'bserviceprice': 1,
        'columnslist': '100,102,103,119,105,107,109,106,112,129,142,143,163,164,165',
        'pageindex': 1,
        'pagesize': 20,
        'order': 103,
        'ordertype': 'desc',
    }
    # First request: read the total number of rows
    total_res = requests.post(url_total, headers=header, data=total_post_data).text
    total_json_data = json.loads(total_res)
    pagesize = total_json_data['data']['table2'][0]['TotalCounts']
    # Second request: same form data, but ask for every row in one page
    total_post_data['pagesize'] = pagesize
    total_res = requests.post(url_total, headers=header, data=total_post_data).text
    total_json_data = json.loads(total_res)['data']['table1']
    print(total_json_data)
    print('total=', len(total_json_data))
    movies_rank = []
    movies_MovieName = []
    movies_BoxOffice = []
    movies_ReleaseDate = []
    movies_TotalBoxOffice = []
    movies_ShowCount = []
    movies_AudienceCount = []
    movies_BoxOfficePercent = []
    movies_ReleaseDay = []
    movies_ShowDay = []
    movies_HjBoxOffice = []
    movies_HjShowCount = []
    movies_HjBoxOfficePercent = []
    movies_HjShowCountPercent = []
    movies_HjAudienceCountPercent = []
    movies_MaoYanWantToSee = []
    movies_TaoPiaoPiaoWantToSee = []
    movies_DouBanWantToSee = []
    # Never read past the rows that were actually returned
    for i in range(min(num, len(total_json_data))):
        if total_json_data[i]['EntMovieID'] != 0:
            movies_rank.append(total_json_data[i]['Irank'])
            movies_MovieName.append(total_json_data[i]['MovieName'])
            movies_BoxOffice.append(total_json_data[i]['BoxOffice'])
            movies_ReleaseDate.append(total_json_data[i]['ReleaseDate'])
            movies_TotalBoxOffice.append(total_json_data[i]['TotalBoxOffice'])
            movies_ShowCount.append(total_json_data[i]['ShowCount'])
            movies_AudienceCount.append(total_json_data[i]['AudienceCount'])
            movies_BoxOfficePercent.append(total_json_data[i]['BoxOfficePercent'])
            movies_ReleaseDay.append(total_json_data[i]['ReleaseDay'])
            movies_ShowDay.append(total_json_data[i]['ShowDay'])
            movies_HjBoxOffice.append(total_json_data[i]['HjBoxOffice'])
            movies_HjShowCount.append(total_json_data[i]['HjShowCount'])
            movies_HjBoxOfficePercent.append(total_json_data[i]['HjBoxOfficePercent'])
            movies_HjShowCountPercent.append(total_json_data[i]['HjShowCountPercent'])
            movies_HjAudienceCountPercent.append(total_json_data[i]['HjAudienceCountPercent'])
            # Per-movie request for the want-to-see counts
            post_data = {
                'r': 0.3270070971758279,
                'entmovieid': total_json_data[i]['EntMovieID']
            }
            res = json.loads(
                requests.post(url="https://ys.endata.cn/enlib-api/api/movie/getMovie_HeadBoxOfficeByMovieID.do",
                              headers=header, data=post_data).text)
            print(total_json_data[i]['EntMovieID'])
            print(res)
            movies_MaoYanWantToSee.append(res['data']['table0'][0]['MaoYanWantToSee'])
            print(movies_MaoYanWantToSee)
            movies_TaoPiaoPiaoWantToSee.append(res['data']['table0'][0]['TaoPiaoPiaoWantToSee'])
            movies_DouBanWantToSee.append(res['data']['table0'][0]['DouBanWantToSee'])
    # Column names stay in Chinese because they become the CSV header
    total_data = pd.DataFrame({
        "排名": movies_rank,  # rank
        "影片名称": movies_MovieName,  # title
        "当前票房": movies_BoxOffice,  # current box office
        "上映日期": movies_ReleaseDate,  # release date
        "累计票房": movies_TotalBoxOffice,  # cumulative box office
        "当前场次": movies_ShowCount,  # current screenings
        "当前人次": movies_AudienceCount,  # current admissions
        "票房占比": movies_BoxOfficePercent,  # box-office share
        "累计上映天数": movies_ReleaseDay,  # days since release
        "当前统计天数": movies_ShowDay,  # days counted
        "淘票票想看数": movies_TaoPiaoPiaoWantToSee,  # Taopiaopiao want-to-see count
        "猫眼想看数": movies_MaoYanWantToSee,  # Maoyan want-to-see count
        "豆瓣想看数": movies_DouBanWantToSee,  # Douban want-to-see count
        "黄金场票房": movies_HjBoxOffice,  # prime-time box office
        "黄金场场次": movies_HjShowCount,  # prime-time screenings
        "黄金场票房占比": movies_HjBoxOfficePercent,  # prime-time box-office share
        "黄金场场次占比": movies_HjShowCountPercent,  # prime-time screening share
        "黄金场人次占比": movies_HjAudienceCountPercent  # prime-time admission share
    })
    total_data.to_csv("C:/Users/LoveB/PycharmProjects/pythonProject1/python/yy_movie/data/recentlyMovies.csv", encoding='gbk', index=False)
```
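The function above first requests a small page to learn the total row count, then requests all rows at once, and finally keeps the rows whose EntMovieID is non-zero among the first `num`. The sketch below isolates that logic so it can be tried without network access. The names `fetch_all_rows`, `take_valid` and `fake_fetch` are hypothetical, and `fake_fetch` only mimics the response shape used above.

```python
from typing import Callable


def fetch_all_rows(fetch: Callable[[dict], dict], base_payload: dict, probe_size: int = 20) -> list:
    """Ask for one small page to learn the total, then request every row at once."""
    probe = fetch({**base_payload, "pageindex": 1, "pagesize": probe_size})
    total = probe["data"]["table2"][0]["TotalCounts"]
    full = fetch({**base_payload, "pageindex": 1, "pagesize": total})
    return full["data"]["table1"]


def take_valid(rows: list, num: int) -> list:
    """Return the rows with a non-zero EntMovieID among the first `num` rows."""
    return [row for row in rows[:num] if row["EntMovieID"] != 0]


# Hypothetical stand-in for the HTTP call.
ALL_ROWS = [{"EntMovieID": i % 3, "MovieName": f"movie-{i}"} for i in range(5)]


def fake_fetch(payload: dict) -> dict:
    size = payload["pagesize"]
    return {"data": {"table1": ALL_ROWS[:size], "table2": [{"TotalCounts": len(ALL_ROWS)}]}}


rows = fetch_all_rows(fake_fetch, {"datetype": "Day"}, probe_size=2)
print(len(rows), [row["MovieName"] for row in take_valid(rows, 10)])
```

In real use, `fetch` would wrap the `requests.post` call and JSON parsing shown above. Slicing with `rows[:num]` keeps the loop from reading past the end when `num` exceeds the number of rows returned.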
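3.6 Charts
```python
def clicking(self):
    # showerror(title="Failure", message="Failed to open the now-showing movie analysis")
    recently()  # scrape the latest top-10 data
    from python.yy_movie.utils.pyec import Showing
    Showing()
    # Open the analysis page in the default browser
    webbrowser.open("C:/Users/LoveB/PycharmProjects/pythonProject1/python/yy_movie/在映电影分析.html")
```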
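`webbrowser.open` is handed a bare Windows path above. One portable option is to convert the path to a file:// URI first. This is a sketch under that assumption. The helper names `report_uri` and `open_report` are hypothetical, and `report.html` is only a placeholder file name.

```python
import os
import webbrowser
from pathlib import Path


def report_uri(html_path) -> str:
    """Return a file:// URI for a local HTML file."""
    # as_uri() needs an absolute path, so make it absolute first.
    return Path(os.path.abspath(html_path)).as_uri()


def open_report(html_path) -> bool:
    """Open the report in the default browser; returns what webbrowser.open returns."""
    return webbrowser.open(report_uri(html_path))


print(report_uri("report.html"))
```

`Path.as_uri()` also percent-encodes non-ASCII characters, which matters for a file name such as 在映电影分析.html.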
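4. Summary
That is the overall flow of the movie project. For more detail, the complete code is available from my profile page. If this helped, please like, follow, and leave a comment~~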