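I've written a script in Python to get the links to different articles from a webpage. When I run my script, I get them flawlessly. The problem is that the article links span multiple pages, because there are too many of them to fit on a single page. If I click the next page button, I can see in the developer tools that it triggers an AJAX call through a POST request. Since there is no link attached to the next page button, I can't find any way to go to the next page and parse the links from there. I've tried a POST request with this form data, but it doesn't seem to work. Where am I going wrong?
This is the information I get from Chrome dev tools when I click the next page button:
GENERAL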
=======================================================
Request URL: https://www.ncbi.nlm.nih.gov/pubmed/
Request Method: POST
Status Code: 200 OK
Remote Address: 130.14.29.110:443
Referrer Policy: origin-when-cross-origin
RESPONSE HEADERS
=======================================================
Cache-Control: private
Connection: Keep-Alive
Content-Encoding: gzip
Content-Security-Policy: upgrade-insecure-requests
Content-Type: text/html; charset=UTF-8
Date: Fri, 29 Jun 2018 10:27:42 GMT
Keep-Alive: timeout=1, max=9
NCBI-PHID: 396E3400B36089610000000000C6005E.m_12.03.m_8
NCBI-SID: CE8C479DB3510951_0083SID
Referrer-Policy: origin-when-cross-origin
Server: Apache
Set-Cookie: ncbi_sid=CE8C479DB3510951_0083SID; domain=.nih.gov; path=/; expires=Sat, 29 Jun 2019 10:27:42 GMT
Set-Cookie: WebEnv=1Jqk9ZOlyZSMGjHikFxNDsJ_ObuK0OxHkidgMrx8vWy2g9zqu8wopb8_D9qXGsLJQ9mdylAaDMA_T-tvHJ40Sq_FODOo33__T-tAH%40CE8C479DB3510951_0083SID; domain=.nlm.nih.gov; path=/; expires=Fri, 29 Jun 2018 18:27:42 GMT
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-UA-Compatible: IE=Edge
X-XSS-Protection: 1; mode=block
REQUEST HEADERS
========================================================
Accept: text/html, */*; q=0.01
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Connection: keep-alive
Content-Length: 395
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Cookie: ncbi_sid=CE8C479DB3510951_0083SID; _ga=GA1.2.1222765292.1530204312; _gid=GA1.2.739858891.1530204312; _gat=1; WebEnv=18Kcapkr72VVldfGaODQIbB2bzuU50uUwU7wrUi-x-bNDgwH73vW0M9dVXA_JOyukBSscTE8Qmd1BmLAi2nDUz7DRBZpKj1wuA_QB%40CE8C479DB3510951_0083SID; starnext=MYGwlsDWB2CmAeAXAXAbgA4CdYDcDOsAhpsABZoCu0IA9oQCZxLJA===
Host: www.ncbi.nlm.nih.gov
NCBI-PHID: 396E3400B36089610000000000C6005E.m_12.03
Origin: https://www.ncbi.nlm.nih.gov
Referer: https://www.ncbi.nlm.nih.gov/pubmed
User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36
X-Requested-With: XMLHttpRequest
FORM DATA
========================================================
p$l: AjaxServer
portlets: id=relevancesortad:sort=;id=timelinead:blobid=NCID_1_120519284_130.14.22.215_9001_1530267709_1070655576_0MetA0_S_MegaStore_F_1:yr=:term=%222015%22%5BDate%20-%20Publication%5D%20%3A%20%223000%22%5BDate%20-%20Publication%5D;id=reldata:db=pubmed:querykey=1;id=searchdetails;id=recentactivity
load: yes
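To make the request above concrete, here is a minimal sketch of how that POST might be assembled with the Python standard library, without actually sending it. The helper name `build_next_page_request` is hypothetical. The field values are copied from the FORM DATA section exactly as dev tools displayed them; whether the server expects the `portlets` value encoded this way is an assumption, and the session cookies (`ncbi_sid`, `WebEnv`) are left out here even though the server may require them.

```python
from urllib.parse import urlencode
from urllib.request import Request

URL = "https://www.ncbi.nlm.nih.gov/pubmed/"

# Form fields copied verbatim from the FORM DATA section of the dev-tools dump.
form_data = {
    "p$l": "AjaxServer",
    "portlets": (
        "id=relevancesortad:sort=;id=timelinead:blobid="
        "NCID_1_120519284_130.14.22.215_9001_1530267709_1070655576_0MetA0_S_MegaStore_F_1"
        ":yr=:term=%222015%22%5BDate%20-%20Publication%5D%20%3A%20"
        "%223000%22%5BDate%20-%20Publication%5D"
        ";id=reldata:db=pubmed:querykey=1;id=searchdetails;id=recentactivity"
    ),
    "load": "yes",
}


def build_next_page_request(url, fields):
    """Build (but do not send) a form-encoded POST like the one the browser makes.

    Hypothetical helper for illustration; cookies are intentionally omitted.
    """
    body = urlencode(fields).encode("utf-8")
    headers = {
        # Headers taken from the REQUEST HEADERS section above.
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "X-Requested-With": "XMLHttpRequest",
        "Referer": "https://www.ncbi.nlm.nih.gov/pubmed",
        "User-Agent": "Mozilla/5.0",
    }
    return Request(url, data=body, headers=headers, method="POST")


request = build_next_page_request(URL, form_data)
print(request.get_method(), request.full_url)
print(len(request.data), "bytes in the encoded body")
```

Passing `request` to `urllib.request.urlopen` would send it. Comparing the encoded body length against the `Content-Length: 395` the browser reported may be a quick way to check whether the payload matches what the browser actually sent.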
This is my script so far (the GET request works flawlessly if uncommented, but only for the first page):
^{pr2}$
I don't want to go for any solution that involves a browser simulator. Thanks in advance.