此次的目标是爬取电影天堂最新200页的最新电影的电影名称和下载链接,电影的下载链接在二级页面,所以需要先匹配一级页面的所有链接,然后逐个请求二级页面,代码如下:
""" 爬取电影天堂2019年的电影名称和链接 """ import requests import csv from fake_useragent import UserAgent from lxml import etree import re import time import random class DianyingtiantangSpider(object): def __init__(self): self.url = 'https://www.dytt8.net/html/gndy/dyzz/list_23_{}.html' # self.url2 = 'https://www.dytt8.net/html/gndy/dyzz/20190918/59138.html' def get_headers(self): """ 构造请求头 :return: """ ua = UserAgent() headers = { "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3", "Accept-Encoding": "gzip, deflate, br", "Accept-Language": "zh-CN,zh;