■ Preface:
When scraping a page that loads its content asynchronously, the most important thing is to understand how the page actually works.
An AJAX-driven page cannot be fetched in a single request: you have to reproduce the AJAX call yourself, capture the data it returns, and then parse that data to get the result you want.
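As a quick illustration of "reproducing the AJAX call yourself": you can rebuild the request the page's JavaScript would send and inspect the final URL before ever hitting the server. The endpoint and parameter values below are the placeholders from the spider later in this post, not a real site.

```python
import requests

# Placeholder AJAX endpoint and query parameters (in practice you capture
# these from the browser's developer tools on the real site).
url = "http://www.xxx.com/xxx/shop/queryshopproduct.html"
params = {
    "ran": "0.7350681925523111",
    "shopid": "018",
    "pageno": "1",
    "order": "1",
    "ordertype": "2",
    "showtype": "1",
}

# Build (without sending) the exact GET request the page's JavaScript
# would issue, so the assembled URL can be checked first.
prepared = requests.Request("GET", url, params=params).prepare()
print(prepared.url)
```

Once the URL looks right, `requests.get(url, params=params)` sends the same request for real.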
■ Required imports
import requests
from lxml import etree
# -*- coding: utf-8 -*-
import scrapy
import requests
from lxml import etree


class FindallnameSpider(scrapy.Spider):
    name = 'findAllName'
    start_urls = ['http://www.xxx.com/']

    def parse(self, response):
        # Pick up the link to the shop page from the landing page.
        jumpUrl = response.xpath("//div[@class='g_biChan']/div[1]/a/@href").extract_first()
        print('1【【【jumpUrl is : ' + jumpUrl + '】】】')
        yield scrapy.Request(response.urljoin(jumpUrl), callback=self.parse2)

    def parse2(self, response):
        # Reproduce the page's AJAX request directly with requests; the URL
        # and parameters were captured from Chrome's Network tab.
        page1url = "http://www.xxx.com/xxx/shop/queryshopproduct.html?ran=0.7350681925523111"
        param = "&shopid=018&pageno=1&order=1&ordertype=2&showtype=1"
        r = requests.get(page1url + param).text
        selector = etree.HTML(r)
        for level1s in selector.xpath("//div[@class='g_shouJinwyy']/div[normalize-space(@class)='g_tuiJianyue']"):
            info = {
                'productDescript': level1s.xpath(".//a[@class='g_shouNamede']/text()"),
                'price': level1s.xpath(".//p/text()"),
            }
            print(info['productDescript'])
            print(info['price'])
The key points are:
1 requests.get(URL).text
This fetches the response body of URL as text.
2 selector = etree.HTML(r)
This converts the text into an HTML element object. Traversing an HTML object with xpath is far easier than slicing a raw string, which is why we convert here; for convenience there are also alternatives such as BeautifulSoup.
3 The URL and parameters of the AJAX request can be found under the Network tab in Chrome.
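Points 1 and 2 can be sketched end to end without a live request by parsing an inline HTML snippet. The snippet below mirrors the class names used by the spider above, but its content is made up; in the real flow `html_text` would come from requests.get(url).text.

```python
from lxml import etree

# Stand-in for the text returned by requests.get(url).text, so the
# example runs offline; structure and class names mimic the spider above.
html_text = """
<div class="g_shouJinwyy">
  <div class="g_tuiJianyue">
    <a class="g_shouNamede">Sample product</a>
    <p>9.99</p>
  </div>
</div>
"""

# etree.HTML turns the raw text into an element tree we can xpath over.
selector = etree.HTML(html_text)

results = []
for item in selector.xpath("//div[@class='g_shouJinwyy']/div[normalize-space(@class)='g_tuiJianyue']"):
    results.append({
        'productDescript': item.xpath(".//a[@class='g_shouNamede']/text()"),
        'price': item.xpath(".//p/text()"),
    })

print(results)
```

Note that xpath's text() always returns a list of matching text nodes, which is why each value above comes back wrapped in a list.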