I'd been meaning to optimize this project for a while, and it's finally done. It wasn't very hard; there are two main improvements:
1. Crawl every job listing (follow pagination to the end)
2. Store the results in a database (MongoDB)
First, let's look at the spider, tencent_spider:
# -*- coding: utf-8 -*-
import scrapy

from tencent.items import TencentItem


class TencentSpiderSpider(scrapy.Spider):
    name = 'tencent_spider'
    allowed_domains = ['hr.tencent.com']
    start_urls = ['https://hr.tencent.com/position.php?keywords=python&start=0#a']

    def parse(self, response):
        # Skip the header row and the trailing pagination row
        trs = response.xpath("//table[@class='tablelist']//tr")[1:-1]
        for tr in trs:
            tds = tr.xpath(".//td")
            name = tds[0].xpath("./a/text()").get()
            category = tds[1].xpath("./text()").get()
            num = tds[2].xpath("./text()").get()
            city = tds[3].xpath("./text()").get()
            item = TencentItem(name=name, category=category, num=num, city=city)
            yield item
        # Follow the "next page" link; stop when it is missing or inactive
        next_page = response.xpath("//a[@id='next']/@href").get()
        if next_page and not next_page.startswith('javascript'):
            url = response.urljoin(next_page)
            yield scrapy.Request(url=url, callback=self.parse)
Here is the part that changed:
next_page = response.xpath("//a[@id='next']/@href").get()
url = response.urljoin(next_page)
yield scrapy.Request(url=url, callback=self.parse)
XPath Helper shows that this selector matches exactly one element, so it is unique; taking its href attribute gives us the link to the next page. We then issue a new request with parse as the callback, so every following page is parsed the same way.
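The href on the "next" link is relative, and response.urljoin() resolves it against the current page's URL; it delegates to the standard library's urllib.parse.urljoin, which we can call directly to sketch what happens (the URLs below follow the site's pagination pattern and are for illustration):

```python
from urllib.parse import urljoin

# The base is the current page's URL; the href comes from the "next" link.
base = 'https://hr.tencent.com/position.php?keywords=python&start=0#a'
next_href = 'position.php?keywords=python&start=10#a'
print(urljoin(base, next_href))
# -> https://hr.tencent.com/position.php?keywords=python&start=10#a
```
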
One error worth recording here:
scrapy TypeError: Cannot mix str and non-str arguments
The cause was this line:
next_page = response.xpath("//a[@id='next']/@href")
Without the get() call, the expression returns a SelectorList rather than a string, and response.urljoin() refuses to join a string with a non-string.
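The error message itself comes from urllib.parse, which refuses to mix a str base with a non-str argument. It can be reproduced without Scrapy by passing any non-string object (the FakeSelectorList class below is a hypothetical stand-in for Scrapy's SelectorList):

```python
from urllib.parse import urljoin

class FakeSelectorList(list):
    """Hypothetical stand-in for the SelectorList that .xpath() returns."""

base = 'https://hr.tencent.com/position.php'
try:
    # Forgetting .get() means passing a list-like object, not a str
    urljoin(base, FakeSelectorList(['position.php?start=10']))
except TypeError as e:
    print(e)  # Cannot mix str and non-str arguments
```
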
Next, pipelines.py:
import pymongo


class MonGoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # Use the item's class name as the collection name
        name = item.__class__.__name__
        try:
            self.db[name].insert_one(dict(item))
            print("stored successfully")
        except Exception as e:
            print("ERROR:", e.args)
        return item

    def close_spider(self, spider):
        self.client.close()
This pipeline writes each crawled item into MongoDB.
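Note how process_item() picks its collection: it uses the item's class name, so everything yielded as a TencentItem lands in a collection called "TencentItem". A tiny sketch of that naming rule (a plain class stands in for the real scrapy.Item subclass here):

```python
# Plain class standing in for the scrapy.Item subclass; the naming
# rule only depends on __class__.__name__.
class TencentItem:
    pass

item = TencentItem()
collection_name = item.__class__.__name__
print(collection_name)  # TencentItem
```
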
Then settings.py. Turn off robots.txt compliance:
ROBOTSTXT_OBEY = False
Enable a download delay:
DOWNLOAD_DELAY = 1
Default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36'
                  ' (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
}
Pipelines: register our own pipeline class, plus the MongoDB settings it reads:
ITEM_PIPELINES = {
    'tencent.pipelines.MonGoPipeline': 300,
}
MONGO_URI = 'localhost'
MONGO_DB = 'tencent'
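The MONGO_URI and MONGO_DB keys here must match the names that from_crawler() looks up in the pipeline. A minimal sketch of that wiring, with a plain dict standing in for crawler.settings (the from_settings classmethod below is a hypothetical stand-in for from_crawler, for illustration only):

```python
# Illustrative only: a plain dict stands in for crawler.settings.
class MonGoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_settings(cls, settings):  # stand-in for from_crawler(cls, crawler)
        # These key strings must match the names used in settings.py
        return cls(
            mongo_uri=settings.get('MONGO_URI'),
            mongo_db=settings.get('MONGO_DB'),
        )

settings = {'MONGO_URI': 'localhost', 'MONGO_DB': 'tencent'}
pipeline = MonGoPipeline.from_settings(settings)
print(pipeline.mongo_uri, pipeline.mongo_db)  # localhost tencent
```
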
Finally items.py, unchanged from last time:
import scrapy


class TencentItem(scrapy.Item):
    name = scrapy.Field()
    category = scrapy.Field()
    num = scrapy.Field()
    city = scrapy.Field()
Now run start.py:
from scrapy import cmdline
cmdline.execute("scrapy crawl tencent_spider".split())
Check the database: 558 documents.
Now check how many positions Tencent's careers page lists: also 558, so we crawled them all.
That wraps it up; a note for the record on putting Scrapy to use.