Analyzing the Lianjia mobile pages
In the agent list URL, the number after "pg" is the page number.
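The pagination rule can be sketched as a small helper. This is only an illustration: the base URL is the one used in the spider's start_urls later in this post, while the function names `page_url` and `page_urls` and the assumption that pages are numbered from 1 are mine, not something the site documents.

```python
# Base listing URL, copied from the spider's start_urls; the page number follows "pg".
BASE_URL = 'http://m.lianjia.com/bj/jingjiren/ao22pg'


def page_url(page):
    """Return the agent-list URL for a page number (assumed to start at 1)."""
    if page < 1:
        raise ValueError('page numbers are assumed to start at 1')
    return BASE_URL + str(page)


def page_urls(first, last):
    """Return the list URLs for pages first..last, inclusive."""
    return [page_url(n) for n in range(first, last + 1)]


if __name__ == '__main__':
    # Print the URLs of the first three list pages.
    for url in page_urls(1, 3):
        print(url)
```

A list built this way could be assigned to start_urls if more than one page needs to be crawled.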
Agent detail page URL:
It can be read from the href attribute of each entry in the agent list, selected with XPath.
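As a rough sketch of that idea, the snippet below selects links with an XPath expression and reads their href attributes. To keep it runnable without Scrapy or network access, it uses the limited XPath support in the standard library's xml.etree.ElementTree instead of Scrapy selectors. The markup, the class names, and the detail-page paths are all hypothetical; the real page structure has to be inspected in the browser.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, simplified markup for an agent list; the real page differs.
SAMPLE_HTML = """
<ul class="agent-list">
  <li><a class="agent" href="/bj/jingjiren/0001/">Agent A</a></li>
  <li><a class="agent" href="/bj/jingjiren/0002/">Agent B</a></li>
</ul>
"""

# The list page the links were found on, used to resolve relative hrefs.
LIST_PAGE = 'http://m.lianjia.com/bj/jingjiren/ao22pg2'


def detail_urls(markup, page_url):
    """Return absolute detail-page URLs for every agent link in the markup."""
    root = ET.fromstring(markup)
    # XPath: every <a class="agent"> that is a direct child of an <li>.
    links = root.findall(".//li/a[@class='agent']")
    # Read each href attribute and resolve it against the list page URL.
    return [urljoin(page_url, a.get('href')) for a in links]


if __name__ == '__main__':
    for url in detail_urls(SAMPLE_HTML, LIST_PAGE):
        print(url)
```

Inside a Scrapy spider the same selection would normally be written with response.xpath(...) and an expression ending in /@href, with response.urljoin used to resolve relative links.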
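1. Create the project and generate the spider file with genspider. Note that the command below uses the default basic template; the crawl template would need the -t crawl option.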
scrapy startproject lianjia01
cd lianjia01
scrapy genspider lianjia m.lianjia.com
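2. Define items.py. Only the agent's name and area of responsibility are extracted here (the item also declares a tran_num field).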
import scrapy


class Lianjia01Item(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    region = scrapy.Field()
    tran_num = scrapy.Field()
3. Write the spider file: import the required packages and classes, and override start_urls:
import scrapy
from lianjia01.items import Lianjia01Item
class LianjiaSpider(scrapy.Spider):
    name = 'lianjia'
    allowed_domains = ['m.lianjia.com']
    start_urls = ['http://m.lianjia.com/bj/jingjiren/ao22pg' + str(2)