Scrapy web-scraping framework
1. Installing Scrapy
pip install scrapy
yum -y install scrapy    (on CentOS/RHEL, if your distribution packages it)
vim .bashrc
alias scrapy="/home/user1/python3/bin/scrapy"    (note the leading slash in the path)
source .bashrc
2. Create a Scrapy project to scrape Lianjia rental listings
-
Open the directory where the project should live, hold the Shift key, right-click, and choose "Open command window here"
scrapy startproject project_name
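For reference, `scrapy startproject` generates a layout along these lines (Scrapy's standard project template; `project_name` is a placeholder):

```
project_name/
    scrapy.cfg            # deployment configuration
    project_name/
        __init__.py
        items.py          # field definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item storage
        settings.py       # project settings
        spiders/
            __init__.py   # spider files go in this directory
```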
-
items.py — define the fields to be scraped
-
middlewares.py — set request headers, proxy IPs, and so on
-
pipelines.py — store the scraped data
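The storage step can be sketched as a plain Python class in pipelines.py; this is a minimal illustration that appends each item as one JSON line, assuming the item behaves like a dict. The class name and output filename are my own choices, and a real project would also have to register the class in ITEM_PIPELINES in settings.py.

```python
import json


class JsonLinesPipeline:
    """Illustrative pipeline: write every scraped item to a JSON-lines file."""

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open("items.jl", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every yielded item: write it as one JSON line.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes: close the file.
        self.file.close()
```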
-
Create the spider file
cd spiders
scrapy genspider lianjiazufang sh.lianjia.com    (genspider takes a domain, not a full URL)
Method 1: run directly in the terminal
scrapy crawl lianjiazufang
Method 2: create a main.py and run that script
from scrapy import cmdline
cmdline.execute("scrapy crawl lianjiazufang".split())
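The reason for the .split() in main.py is that cmdline.execute expects an argv-style list rather than a single string:

```python
# str.split (no arguments) splits on whitespace, turning the command
# string into the argv-style list that cmdline.execute expects:
argv = "scrapy crawl lianjiazufang".split()
print(argv)  # ['scrapy', 'crawl', 'lianjiazufang']
```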
Overall structure
lianjiazufang.py
import scrapy
from lianjia.items import LianjiaItem
import re


class LianjiazufangSpider(scrapy.Spider):
    name = 'lianjiazufang'
    allowed_domains = ['sh.lianjia.com']
    start_urls = ['https://sh.lianjia.com/zufang/pg1/']

    def parse(self, response):
        # Each rental listing sits in its own content__list--item div.
        name_item_list = response.xpath('//div[@class="content__list--item"]')
        print(response.request.headers)
        for name_item in name_item_list:
            info = {}
            # default='' guards against .strip() crashing when a node is missing
            info["content_title"] = name_item.xpath('.//div/p/a/text()').extract_first(default='').strip()
            info["content_url"] = "https://sh.lianjia.com" + name_item.xpath('.//div/p/a/@href').extract_first(default='')