The community already has scraping code for this site, but when I tried it in PyCharm the only output was:
<script src="/_guard/auto.js"></script>
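After a lot of searching, it turned out that the site has added an anti-scraping mechanism. Through debugging I found that adding a cookie to the request headers fixes it, and the page is now scraped normally.

import re
import requests

url = "https://www.dytt89.com/"
headers = {
    # guardok cookie found while debugging; without it the site returns only the /_guard/auto.js script
    "Cookie": "guardok=6I18VOAyw6EqY0iBU/du7SV3hZbFGROUfDRCf8hFXC0wf8/Lez2mxNaCGb3Zij0faZJBnZQIukMBTdiNqc7cvw==",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0"
}

response = requests.get(url=url, headers=headers)
response.encoding = "gb2312"  # decode the page as gb2312
response_text = response.text
# print(response_text)
response.close()

# Extract the <ul> that follows the "2024必看热片" heading from the response data
obj_ul = re.compile(r"2024必看热片.*?<ul>(?P<ul>.*?)</ul>", re.S)
for ul_match in obj_ul.finditer(response_text):
    print(ul_match.group('ul'))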
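To tell a blocked response apart from a real page, the check can be made explicit before running the regex. The following is only a rough sketch: it assumes the blocked response always contains the `/_guard/auto.js` script path seen in the output above. The helper names `is_guard_page` and `extract_hot_list` and the sample HTML strings are made up for illustration and do not come from the site or from `requests`.

```python
import re

# Script path seen in the blocked response; assumed to mark every guard page.
GUARD_MARKER = "/_guard/auto.js"

# Same pattern as in the post: capture the first <ul> after the heading.
HOT_LIST = re.compile(r"2024必看热片.*?<ul>(?P<ul>.*?)</ul>", re.S)


def is_guard_page(html):
    """Return True if the HTML looks like the anti-scraping placeholder."""
    return GUARD_MARKER in html


def extract_hot_list(html):
    """Return the inner HTML of each matching <ul>, or raise if blocked."""
    if is_guard_page(html):
        raise RuntimeError("got the guard page; the guardok cookie is missing or no longer accepted")
    return [m.group("ul") for m in HOT_LIST.finditer(html)]


# Hypothetical sample inputs, not real responses from the site.
blocked = '<script src="/_guard/auto.js"></script>'
sample = "<h2>2024必看热片</h2><ul><li>A</li></ul>"

print(is_guard_page(blocked))
print(extract_hot_list(sample))
```

In the script above, `extract_hot_list(response_text)` could replace the final loop, so that a rejected cookie produces a clear error instead of empty output.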