Scrapy Framework (Part 1)
1. Introduction to Scrapy
Scrapy is an efficient, structured web-scraping framework written in pure Python, used mainly for crawling static pages.
- Installing on Windows: download any prebuilt wheel you need from https://www.lfd.uci.edu/~gohlke/pythonlibs/, then install it and Scrapy:
  python -m pip install <absolute path to the downloaded package>
  python -m pip install scrapy
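On Windows the dependency that most often needs a prebuilt wheel is Twisted. A concrete run might look like the following; the filename is purely illustrative, so pick the wheel matching your Python version and architecture:

python -m pip install C:\Downloads\Twisted-20.3.0-cp36-cp36m-win_amd64.whl
python -m pip install scrapy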
- Installing on Ubuntu
  - Create a virtual environment:
    mkvirtualenv -p /usr/bin/python3.6 <env_name>
  - Enter the virtual environment and install the build dependencies:
    sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
    If apt reports that the lock is held, remove the lock files first:
    sudo rm /var/cache/apt/archives/lock
    sudo rm /var/lib/dpkg/lock
  - For Python 3, also install:
    sudo apt-get install python3-dev
  - Install the Scrapy framework:
    pip install scrapy
2. Using the Scrapy Framework
- Create a new project
  Command:
  scrapy startproject <project_name> [project_dir]
  This generates a project root directory (the layout is shown below).
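A freshly generated project follows Scrapy's standard layout; for a project named tanzhou (the name used by the pipeline examples later in this section) it looks like this:

tanzhou/
    scrapy.cfg            # deploy configuration
    tanzhou/              # the project's Python package
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # put your spiders here
            __init__.py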
- Write a spider
  - By hand (create the file under the spiders folder):

import scrapy


class TzSpider(scrapy.Spider):  # inherit from the Spider class
    name = 'tz'  # the spider's name; must be unique
    allowed_domains = ['www.shiguangkey.com']  # restrict the crawl scope
    start_urls = ['https://www.shiguangkey.com/course/list']  # list of initial URLs

    def parse(self, response):  # called after each initial URL is fetched
        pass
  - From the command line:

# run from the project root, the directory containing scrapy.cfg
scrapy genspider <filename> <initial url>
# e.g.: scrapy genspider tzc www.shiguangkey.com
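genspider fills in Scrapy's basic template from the two arguments; for the example above the generated file looks roughly like this:

import scrapy


class TzcSpider(scrapy.Spider):
    name = 'tzc'
    allowed_domains = ['www.shiguangkey.com']
    start_urls = ['http://www.shiguangkey.com/']

    def parse(self, response):
        pass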
- List the runnable spiders
  Command:
  scrapy list
- Run a spider
  Command:
  scrapy crawl <spider_name>
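While testing, crawl's -o option is handy for dumping the scraped items straight to a file (the format is inferred from the extension):

scrapy crawl tz -o result.json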
- Choose whether to obey the robots.txt protocol
  In settings.py, change ROBOTSTXT_OBEY = True to:
  ROBOTSTXT_OBEY = False
Case study 1: scraping the Boss Zhipin site

import scrapy
import json
import re


class TzSpider(scrapy.Spider):
    name = 'boss'  # the spider's name; must be unique
    start_urls = ['https://www.zhipin.com/job_detail/?query=%E5%B5%8C%E5%85%A5%E5%BC%8F&city=101210100&industry=&position=']  # list of initial URLs

    # called after each initial URL is fetched
    def parse(self, response):
        # all the row nodes
        trs = response.xpath('//div[@class="job-primary"]')
        with open('嵌入式开发.json', 'a+', encoding='utf-8') as f:
            for tr in trs:
                data = {
                    '岗位': tr.xpath('.//div[@class="job-title"]/text()').extract_first(),
                    '薪水': tr.xpath('.//span[@class="red"]/text()').extract_first(),
                    '公司名称': tr.xpath('.//div[@class="company-text"]/h3/a/text()').extract_first(),
                    '公司地址': tr.xpath('.//div[@class="info-primary"]/p/text()').extract_first(),
                    '工作经验': re.findall(r"data=\'(.*?)\'", str(tr.xpath('.//div[@class="info-primary"]/p/text()')[1])),
                    '学历要求': re.findall(r"data=\'(.*?)\'", str(tr.xpath('.//div[@class="info-primary"]/p/text()')[2])),
                    '规模': re.findall(r'\d*人.*', tr.xpath('string(.//div[@class="company-text"]/p)').extract_first())[0]
                }
                json.dump(data, f, ensure_ascii=False)
                f.write('\n')
        next_url = 'https://www.zhipin.com' + response.xpath('//a[@ka="page-next"]/@href').extract_first()
        # if 'page=10' in next_url:
        #     return
        # return a Request for the next page
        return scrapy.Request(next_url)
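Run it with scrapy crawl boss from the project root. One caveat with the pagination above: on the last page the page-next XPath matches nothing, extract_first() returns None, and the string concatenation raises a TypeError, so the crawl ends with an error. A small guard (my addition, not part of the original) ends it cleanly:

next_href = response.xpath('//a[@ka="page-next"]/@href').extract_first()
if next_href:  # only follow the link if a next page actually exists
    return scrapy.Request('https://www.zhipin.com' + next_href)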
3. Scrapy's Execution Flow
Scrapy is built from six cooperating components: the scheduler, the engine, the downloader, the pipelines, the middlewares, and the spiders.
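A rough sketch of how data moves between them, following the standard Scrapy architecture:

Spider --requests--> Engine --> Scheduler (queues and de-duplicates)
Scheduler --next request--> Engine --downloader middleware--> Downloader
Downloader --response--> Engine --spider middleware--> Spider (parse)
Spider --items--> Engine --> Item Pipelines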
4. Scrapy Pipelines
- Enable the pipeline
  In settings.py, turn the pipeline on (the number is the pipeline's order; lower values run first):

ITEM_PIPELINES = {
    'tanzhou.pipelines.TanzhouPipeline': 300,
}
- Define the item (the fields to be extracted)

import scrapy


class TanzhouItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    job_name = scrapy.Field()
    company_name = scrapy.Field()
    work_place = scrapy.Field()
    salary = scrapy.Field()
    public_day = scrapy.Field()
- Write the pipeline

import json


class TanzhouPipeline(object):
    def process_item(self, item, spider):
        # handle the item; called once for every item the spider yields
        data = str(item).replace('\n', '')
        self.f.write(json.dumps(data, ensure_ascii=False))
        self.f.write('\n')
        return item

    def open_spider(self, spider):
        # runs when the spider opens
        print('------------------爬取开始--------------------')
        self.f = open('工作.json', 'w+', encoding='utf-8')

    def close_spider(self, spider):
        # runs when the spider closes
        print('------------------爬取结束--------------------')
        self.f.close()
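Note that str(item) produces the item's repr, so what lands in the file is a JSON-encoded string rather than a JSON object. If proper JSON objects are wanted, dumping dict(item) instead would do it (a suggested tweak, not part of the original):

# item behaves like a dict, so dict(item) serializes field-by-field
self.f.write(json.dumps(dict(item), ensure_ascii=False))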
- Write the spider

import scrapy
from ..items import TanzhouItem


class ZpwSpider(scrapy.Spider):
    name = 'zpw'
    # allowed_domains = ['https://search.51job.com/jobsearch/search_result.php']
    start_urls = ['https://search.51job.com/list/080200,000000,0000,00,9,99,python,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=']

    def parse(self, response):
        trs = response.xpath('//div[@class="el"]')[4:-1]
        if trs:
            for tr in trs:
                item = TanzhouItem()
                item['job_name'] = tr.xpath('./p//a/@title').extract_first()
                item['company_name'] = tr.xpath('./span[@class="t2"]//a/@title').extract_first()
                item['work_place'] = tr.xpath('./span[@class="t3"]/text()').extract_first()
                item['salary'] = tr.xpath('./span[@class="t4"]/text()').extract_first()
                item['public_day'] = tr.xpath('./span[@class="t5"]/text()').extract_first()
                yield item
            next_page = response.xpath('//li[@class="bk"]')[1].xpath('./a/@href').extract_first()
            yield scrapy.Request(next_page)
        else:
            # no rows left: return nothing to end the crawl
            return
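With the pipeline enabled in settings.py, running the spider writes one line per scraped item to 工作.json:

scrapy crawl zpw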