For learning purposes only. For learning purposes only. For learning purposes only.
While studying web scraping, one of the lessons covered crawling http://scxk.nmpa.gov.cn:81/xk. After a recent site update, that has become much harder: an encryption step was added, so requests keep coming back without data. A request now looks like this:
http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?hKHnQfLv=5RzcDnoZGWKeUOstQcpwLAZnI_YXd8U22RSxBylWTxaaJuoHxQ0mZT6eNeV4UWLZZ84VMQtGsFW0JXDkNYFzgx.1MTLPtQBTJTdOZmeia2NI75DSDRiktWm8GAKT6Vaz.LXqMyVvCOA0ZZ_0zXI8rxeBogx.FmWMyY05UnRA0Abi.5_CjmuHNwijNpLGdIyev6v1RcpGINeBb8E4H8gLr6byITdTxVyOMG70lC2zjbNMyHEifKtrFD2WFTwzlSl8YzNazLbgHyYpMDF4AUSVvc6JzifOaWzZiIuQUN9yxUG3&8X7Yi61c=4gJZWYBU.vueIqDtVqOZBVV2kDTLKhwQoWmd1Tyr8i9R4wg1LtILP.stGr7zOvspClrkmY2hU09XQa1ka9SlDR7Z6DCMHfDW1sx1ih_UCwkZuSoCErd.Pn57QXV5fs5rM
The request now carries the hKHnQfLv and 8X7Yi61c parameters, which I could not reverse-engineer (I found no method; they appear to be generated by jquery.myPagination.js and portal.js). So I gave up on cracking the API directly and switched to Selenium, which drives a real browser to make the requests (these can still occasionally be blocked), combined with the crudest possible workaround: if a request is blocked, just wait and retry. The final working code is below.
import time
from lxml import etree
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service


def getContent():
    count = 0
    while True:
        option = Options()
        # Run Chrome headless (left disabled here so the browser is visible)
        # option.add_argument("--headless")
        # Reduce the chance of being detected as an automated browser
        option.add_experimental_option('excludeSwitches', ['enable-automation'])
        option.add_experimental_option('useAutomationExtension', False)
        # Start the browser
        service = Service(executable_path='./chromedriver', log_path='./chrome.log')
        web = webdriver.Chrome(service=service, options=option)
        # Hide navigator.webdriver before any page script runs
        web.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
            "source": """
                Object.defineProperty(navigator, 'webdriver', {
                    get: () => undefined
                })
            """
        })
        web.get('http://scxk.nmpa.gov.cn:81/xk')
        page_text = web.page_source
        html_content = etree.HTML(page_text)
        items_content = html_content.xpath('/html/body//ul[@id="gzlist"]/li')
        # quit() shuts down the driver process as well, not just the window
        web.quit()
        if len(items_content) > 0:
            return page_text
        print('No data returned, sleeping', (10 + count * 5), 'seconds')
        time.sleep(10 + count * 5)
        count += 1


if __name__ == '__main__':
    content = getContent()
    html = etree.HTML(content)
    items = html.xpath('/html/body//ul[@id="gzlist"]/li')
    for item in items:
        print(item.xpath('./dl/@title')[0])
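The retry-on-block logic above (sleep 10 + 5·count seconds between attempts, growing linearly) can be factored out into a small reusable helper. A minimal sketch, where `fetch` and `is_blocked` are hypothetical stand-ins for the Selenium page load and the empty-result check:

```python
import time


def retry_with_backoff(fetch, is_blocked, base=10, step=5, max_tries=None):
    """Call fetch() until is_blocked(result) is False, or max_tries is hit.

    After each blocked attempt, sleeps base + step * attempt seconds,
    the same linear backoff schedule used in getContent() above.
    """
    count = 0
    while max_tries is None or count < max_tries:
        result = fetch()
        if not is_blocked(result):
            return result
        delay = base + step * count
        print('No data returned, sleeping', delay, 'seconds')
        time.sleep(delay)
        count += 1
    return None


# The delay schedule between attempts is 10, 15, 20, 25, ... seconds:
delays = [10 + 5 * i for i in range(4)]
```

With this helper, `getContent()` reduces to passing in a `fetch` that spins up the browser and returns the page source, and an `is_blocked` that checks whether the `gzlist` XPath matched any items.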