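Asynchronous loading in Jianshu articles
The first assignment for the crawler team was to scrape Jianshu's seven-day trending list. As those who did it will know, part of the data on an article page is loaded asynchronously. For the view, comment, and like counts, the strategy was to match them with regular expressions; the included collections were obtained by capturing the network requests and finding the one that carries the data.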
![3629157-f9b39c0f361d092b.jpg](https://i-blog.csdnimg.cn/blog_migrate/9980f31afe7371543883840a7c5ed3a7.webp?x-image-process=image/format,png)
![3629157-048705c27701a641.jpg](https://i-blog.csdnimg.cn/blog_migrate/b4493abd739e077f825f5a525e3d4798.webp?x-image-process=image/format,png)
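The regular-expression strategy mentioned above can be sketched roughly as follows. This is only an illustration: `sample_html`, the key names `views_count`, `comments_count`, `likes_count`, and the helper `extract_counts` are assumptions made up for the example, not data captured from Jianshu, so check the real page source before relying on them.

```python
import re

# Hypothetical excerpt of the JSON embedded in an article page's HTML source.
# The key names and the numbers are assumptions for illustration only.
sample_html = (
    '<script type="application/json">'
    '{"note":{"views_count":1024,"comments_count":12,"likes_count":88}}'
    '</script>'
)


def extract_counts(html):
    """Return the three counters found in html; a missing key maps to None."""
    counts = {}
    for key in ("views_count", "comments_count", "likes_count"):
        # Match  "key": 123  and capture the digits.
        match = re.search(r'"%s"\s*:\s*(\d+)' % key, html)
        counts[key] = int(match.group(1)) if match else None
    return counts


print(extract_counts(sample_html))
```

The regex only needs the raw HTML text, so a plain HTTP request is enough for these three fields; the included collections are not in that text, which is why they needed a separate request.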
Selenium code

    from selenium import webdriver

    url = 'http://www.jianshu.com/p/c9bae3e9e252'

    def get_info(url):
        include_title = []
        # Written against Selenium 3; Selenium 4 dropped PhantomJS and find_element_by_*
        driver = webdriver.PhantomJS()
        # Wait up to 20 seconds for an element to appear before giving up
        driver.implicitly_wait(20)
        driver.get(url)
        author = driver.find_element_by_xpath('//span[@class="name"]/a').text
        date = driver.find_element_by_xpath('//span[@class="publish-time"]').text
        word = driver.find_element_by_xpath('//span[@class="wordage"]').text
        view = driver.find_element_by_xpath('//span[@class="views-count"]').text
        comment = driver.find_element_by_xpath('//span[@class="comments-count"]').text
        like = driver.find_element_by_xpath('//span[@class="likes-count"]').text
        # The included collections are loaded asynchronously; collect every name
        included_names = driver.find_elements_by_xpath('//div[@class="include-collection"]/a/div')
        for i in included_names:
            include_title.append(i.text)
        print(author, date, word, view, comment, like, include_title)
        driver.quit()  # release the browser process

    get_info(url)

Since this only handles a single page, the results are printed rather than stored in a database.
![3629157-0b2ead888f6baaa9.jpg](https://i-blog.csdnimg.cn/blog_migrate/8e1d350c1e1219971694cb692977b607.webp?x-image-process=image/format,png)
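For readers on a newer Selenium release, here is a hedged port of the same function. It assumes Selenium 4 with Chrome available locally and swaps PhantomJS for headless Chrome, which is a substitution and not what the script above uses. The names `FIELD_XPATHS` and `COLLECTION_XPATH` are introduced only for this sketch; the XPath strings are copied from the script above and may no longer match the live site.

```python
# XPath for each single-valued field, copied from the original script.
FIELD_XPATHS = {
    "author": '//span[@class="name"]/a',
    "date": '//span[@class="publish-time"]',
    "word": '//span[@class="wordage"]',
    "view": '//span[@class="views-count"]',
    "comment": '//span[@class="comments-count"]',
    "like": '//span[@class="likes-count"]',
}
COLLECTION_XPATH = '//div[@class="include-collection"]/a/div'


def get_info(url):
    """Open url in headless Chrome and return the scraped fields as a dict."""
    # Imported here so the module loads even where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless")  # assumption: no visible window needed
    driver = webdriver.Chrome(options=options)
    try:
        driver.implicitly_wait(20)  # wait up to 20 s for each element lookup
        driver.get(url)
        info = {
            name: driver.find_element(By.XPATH, xpath).text
            for name, xpath in FIELD_XPATHS.items()
        }
        info["include_title"] = [
            el.text for el in driver.find_elements(By.XPATH, COLLECTION_XPATH)
        ]
        return info
    finally:
        driver.quit()  # always release the browser process
```

Calling `get_info('http://www.jianshu.com/p/c9bae3e9e252')` returns a dict instead of printing, which makes it easier to store the result later.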
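Code analysis
Because Selenium loads the page with its JavaScript executed, an XPath worked out in Chrome's Inspect panel can be used directly to extract the data. Taking the included collections as an example: inspect the element, build the XPath from what you see, and there is no need to hunt through the network requests.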
![3629157-087250fff9940f5e.jpg](https://i-blog.csdnimg.cn/blog_migrate/7b1f0399aacd1b0b93f3483bb6758361.webp?x-image-process=image/format,png)
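To see what the collections XPath actually selects, the sketch below runs it against a small hand-written fragment. The fragment and its collection names are invented for illustration, and the standard library's `xml.etree.ElementTree` (which supports only a subset of XPath) stands in for the browser, so this shows the path logic and not Selenium itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written fragment standing in for the rendered DOM.
# The collection names and links are invented for illustration.
rendered = """
<body>
  <div class="include-collection">
    <a href="/c/1"><div class="name">Python</div></a>
    <a href="/c/2"><div class="name">Web scraping</div></a>
  </div>
</body>
""".strip()

root = ET.fromstring(rendered)
# ElementTree needs a relative path, so '//div' is written as './/div'.
nodes = root.findall('.//div[@class="include-collection"]/a/div')
include_title = [node.text for node in nodes]
print(include_title)
```

Each `a` under the `include-collection` container contributes one inner `div`, which is why the script uses `find_elements` (plural) and loops over the result.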