While scraping, I ran into many pages that could not be fetched: empty pages, error status codes (400, 412), images that do not render even though their links are present in the page source, and so on. The usual cause is that the server has detected the browser is automated.
This article records a general workaround.
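Detection often starts with something as simple as the User-Agent header (headless Chrome, for example, ships with `HeadlessChrome` in its UA string). As an illustrative sketch of what a server-side check might look like — the marker list and function name here are made up for illustration, not any real site's logic:

```python
# Illustrative server-side check: flag requests whose User-Agent
# carries a known automation marker. The markers are examples only.
AUTOMATION_MARKERS = ("headlesschrome", "phantomjs", "selenium", "python-requests")

def looks_automated(user_agent: str) -> bool:
    """Return True if the User-Agent contains an obvious automation marker."""
    ua = user_agent.lower()
    return any(marker in ua for marker in AUTOMATION_MARKERS)
```

This is why simply overriding the User-Agent (as the Selenium snippet below does) is often the first thing to try.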
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# chromedriver_path = ''  # alternatively, point Service at a local chromedriver
options = webdriver.ChromeOptions()
options.add_argument("--disable-extensions")
options.add_argument("--disable-gpu")
# Drop the "Chrome is being controlled by automated test software" banner
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)

# The options must actually be passed in, otherwise none of the above takes effect
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

# Override the User-Agent header via the Chrome DevTools Protocol
driver.execute_cdp_cmd("Network.enable", {})
driver.execute_cdp_cmd("Network.setExtraHTTPHeaders", {"headers": {"User-Agent": "browserClientA"}})

# Hide navigator.webdriver before any page script runs
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """
    Object.defineProperty(navigator, 'webdriver', {
        get: () => undefined
    })
    """
})

driver.get('***url***')
time.sleep(3)  # wait for the page to load before reading its source
html = driver.page_source
print(html)
driver.quit()  # quit() shuts down the driver process; close() only closes the window
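The two CDP calls above take plain JSON payloads, so they can be pulled out into module-level constants and a small helper, making the stealth setup reusable across scripts. A minimal sketch (the `apply_stealth` name is my own; the header value `browserClientA` is the same placeholder used above — substitute a realistic User-Agent in practice):

```python
# Reusable CDP payloads for the stealth setup shown above.
STEALTH_HEADERS = {"headers": {"User-Agent": "browserClientA"}}

# Script injected into every new document before page scripts run,
# hiding the navigator.webdriver flag that Selenium normally exposes.
STEALTH_SCRIPT = {
    "source": """
    Object.defineProperty(navigator, 'webdriver', {
        get: () => undefined
    })
    """
}

def apply_stealth(driver):
    """Apply the CDP-based stealth tweaks to an existing Chrome driver."""
    driver.execute_cdp_cmd("Network.enable", {})
    driver.execute_cdp_cmd("Network.setExtraHTTPHeaders", STEALTH_HEADERS)
    driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", STEALTH_SCRIPT)
```

Any object with an `execute_cdp_cmd(cmd, params)` method works, so the helper can also be exercised against a stub in tests without launching a browser.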
A one-stop method (no need to configure all of these parameters by hand):
The undetected_chromedriver package applies many of these arguments and injected script snippets internally. If you hit the same problem scraping from another language, its source is a useful reference for what to replicate.
import undetected_chromedriver as uc
from webdriver_manager.chrome import ChromeDriverManager

# undetected_chromedriver patches chromedriver and sets the stealth tweaks itself
driver = uc.Chrome(driver_executable_path=ChromeDriverManager().install())