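Task:
Fetch information on all movies listed at https://movie.douban.com/
Retrieve the tags automatically instead of hard-coding them
Save the scraped information to a database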
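Before scraping:
1. Determine whether the data is in the HTML or loaded via XHR
Call requests.get(url) and check whether the response contains the data you want
Present -> the data is in the HTML
Absent -> the data is loaded via XHR
2. Process the data
HTML -> parse with bs4/xpath
XHR -> find the corresponding endpoint -> requests.get(endpoint URL) -> JSON
Right-click, Inspect -> Network -> XHR -> select a request -> Response
Check each endpoint's response in turn to see whether it holds the data you want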
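The two checks above can be sketched as small helpers. This is a minimal illustration, not part of the original scripts: the function names locate_data and parse_xhr_payload are hypothetical, and the inputs are plain strings so the logic can be tried without any network access.

```python
import json


def locate_data(page_source: str, sample_value: str) -> str:
    """Return "html" if a value seen in the browser appears in the raw
    page source returned by requests.get(url).text, otherwise "xhr"."""
    return "html" if sample_value in page_source else "xhr"


def parse_xhr_payload(body: str, key: str) -> list:
    """Decode a JSON response body and return the list stored under key.
    Returns an empty list if the body is not valid JSON or lacks the key."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    value = payload.get(key, [])
    return value if isinstance(value, list) else []


# The movie title is missing from the raw source, so it must come from XHR
print(locate_data("<html><div id='app'></div></html>", "Some Movie"))
# A JSON body shaped like the tags response
print(parse_xhr_payload('{"tags": ["热门", "最新"]}', "tags"))
```

In practice, sample_value would be a title or rating copied from the rendered page, and body would be the text returned by the endpoint found in the Network tab.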
Steps and code:
Code in the download module (download.py):
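import requests
import time


def get_text(url):
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0"
    }
    # try catches exceptions; on error, control jumps to the except branch
    try:
        print(f"Fetching {url}...")
        # Send the GET request with the custom headers; the timeout prevents hanging forever
        response = requests.get(url, headers=headers, timeout=10)
        # Raise an exception if the status code is 4xx or 5xx
        response.raise_for_status()
        # Use the encoding detected from the response body
        response.encoding = response.apparent_encoding
        # Pause between requests so the site is not hit too quickly
        time.sleep(1)
        return response.text
    except requests.RequestException:
        print(f"Failed to fetch {url}!")
        return ""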
Running the code:
from download import get_text
import json

tags_url = "https://movie.douban.com/j/search_tags?type=movie&source=index"

# Fetch all tags
tags_json = get_text(tags_url)
# The response is a JSON object whose "tags" key holds a list such as 热门, 最新, and so on
print(tags_json)
tags_obj = json.loads(tags_json)["tags"]

for tag in tags_obj:
    subjects_url = f"https://movie.douban.com/j/search_subjects?type=movie&tag={tag}&page_limit=50&page_start=0"
    subjects_json = get_text(subjects_url)
    # get_text returns "" on failure, so skip this tag instead of crashing in json.loads
    if not subjects_json:
        continue
    subjects_obj = json.loads(subjects_json)["subjects"]
    # print(subjects_obj)  # uncomment to inspect the raw data
    for subject in subjects_obj:
        print(tag, subject["title"], subject["rate"], subject["cover"])
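The tags are Chinese strings, so they have to be percent-encoded when placed in the query string (requests does this implicitly when it prepares the URL). A minimal sketch that makes the encoding explicit with the standard library follows; the helper name build_subjects_url is hypothetical, and the parameter names are the ones that appear in the URL used above.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/j/search_subjects"


def build_subjects_url(tag, page_start=0, page_limit=50):
    """Build the subjects URL for one tag, percent-encoding the tag as UTF-8."""
    params = {
        "type": "movie",
        "tag": tag,
        "page_limit": page_limit,
        "page_start": page_start,
    }
    return f"{BASE_URL}?{urlencode(params)}"


print(build_subjects_url("热门"))
```

Building the URL this way keeps the query parameters in one place, which makes it easier to change page_limit or page_start later.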
Saving the results to the database:
from download import get_text
import json
import pymysql

tags_url = "https://movie.douban.com/j/search_tags?type=movie&source=index"

# Fetch all tags
tags_json = get_text(tags_url)
# The response is a JSON object whose "tags" key holds a list such as 热门, 最新, and so on
print(tags_json)
tags_obj = json.loads(tags_json)["tags"]

# Connect to the database (replace these values with your own connection details)
db = pymysql.connect(host="192.168.67.148", port=3306, user="root", password="123456", database="mytest")
print(db)
# Create a cursor, which executes the SQL statements
cursor = db.cursor()
# Drop the table first in case it already exists
cursor.execute("drop table if exists homework")
# Create the homework table; name is widened because titles can exceed 20 characters,
# and rate is nullable so that a movie without a rating can still be stored
sql = """create table homework(
    tag varchar(20),
    name varchar(255),
    rate float,
    cover varchar(255)
);
"""
cursor.execute(sql)

# Scrape the data and write it to the table
sql = "insert into homework(tag,name,rate,cover) values(%s,%s,%s,%s)"
for tag in tags_obj:
    subjects_url = f"https://movie.douban.com/j/search_subjects?type=movie&tag={tag}&page_limit=50&page_start=0"
    subjects_json = get_text(subjects_url)
    # get_text returns "" on failure, so skip this tag instead of crashing in json.loads
    if not subjects_json:
        continue
    subjects_obj = json.loads(subjects_json)["subjects"]
    all_list = []
    for subject in subjects_obj:
        # If rate is an empty string, store NULL rather than an invalid float
        rate = subject["rate"] or None
        all_list.append([tag, subject["title"], rate, subject["cover"]])
    # Insert the scraped rows for this tag
    cursor.executemany(sql, all_list)
    db.commit()

cursor.close()
db.close()
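The row-building and bulk-insert steps can be tried without a MySQL server. The sketch below swaps in Python's built-in sqlite3 module purely as a stand-in, so it is not the pymysql code above: sqlite3 uses ? placeholders where pymysql uses %s, and the column types differ. The helpers to_row and save_rows and the sample data are hypothetical.

```python
import sqlite3


def to_row(tag, subject):
    """Turn one subject dict into a (tag, name, rate, cover) tuple."""
    raw_rate = subject.get("rate") or ""
    try:
        rate = float(raw_rate)
    except (TypeError, ValueError):
        rate = None  # a missing or non-numeric rating is stored as NULL
    return (tag, subject.get("title", ""), rate, subject.get("cover", ""))


def save_rows(conn, rows):
    """Create the table if needed and insert all rows in one call."""
    conn.execute(
        "create table if not exists homework(tag text, name text, rate real, cover text)"
    )
    # Parameterized placeholders keep scraped values out of the SQL text
    conn.executemany(
        "insert into homework(tag, name, rate, cover) values(?, ?, ?, ?)", rows
    )
    conn.commit()


# Made-up sample data shaped like the "subjects" list
sample = [
    {"title": "A", "rate": "9.7", "cover": "https://example.com/a.jpg"},
    {"title": "B", "rate": "", "cover": "https://example.com/b.jpg"},
]
conn = sqlite3.connect(":memory:")
save_rows(conn, [to_row("demo", s) for s in sample])
rows = conn.execute("select name, rate from homework order by name").fetchall()
print(rows)  # [('A', 9.7), ('B', None)]
```

Converting the rating before the insert means a movie without a rating ends up as NULL instead of causing the whole batch to fail.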
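Checking the results:
Log in to the database you connected to
use mytest; (switch to the mytest database)
select * from homework; (view the table that holds the saved data)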