I'd been meaning to optimize this project for a while, and it's finally done. It wasn't very hard; there are two main improvements:
1. Crawl every job listing (follow pagination to the end)
2. Store the results in a database (MongoDB)
First, let's look at the spider, tencent_spider:
# -*- coding: utf-8 -*-
import scrapy

from tencent.items import TencentItem


class TencentSpiderSpider(scrapy.Spider):
    name = 'tencent_spider'
    allowed_domains = ['hr.tencent.com']
    start_urls = ['https://hr.tencent.com/position.php?keywords=python&start=0#a']

    def parse(self, response):
        # Skip the header row and the trailing pagination row
        trs = response.xpath("//table[@class='tablelist']//tr")[1:-1]
        for tr in trs:
            tds = tr.xpath(".//td")
            name = tds[0].xpath("./a/text()").get()
            category = tds[1].xpath("./text()").get()
            num = tds[2].xpath("./text()").get()
            city = tds[3].xpath("./text()").get()
            item = TencentItem(name=name, category=category, num=num, city=city)
            yield item
        # Follow the "next page" link; stop when it is missing or inactive
        next_page = response.xpath("//a[@id='next']/@href").get()
        if next_page and not next_page.startswith('javascript'):
            url = response.urljoin(next_page)
            yield scrapy.Request(url=url, callback=self.parse)
Here is the part that changed:
next_page = response.xpath("//a[@id='next']/@href").get()
url = response.urljoin(next_page)
yield scrapy.Request(url=url, callback=self.parse)
XPath Helper shows that this selector matches exactly one element, so it is unique; taking its href attribute gives us the link to the next page. We then issue a new request with parse as the callback, so every following page is parsed the same way.
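The href on the "next" link is relative, and response.urljoin() resolves it against the current page's URL; it delegates to the standard library's urllib.parse.urljoin, which we can call directly to sketch what happens (the URLs below follow the site's pagination pattern and are for illustration):

```python
from urllib.parse import urljoin

# The base is the current page's URL; the href comes from the "next" link.
base = 'https://hr.tencent.com/position.php?keywords=python&start=0#a'
next_href = 'position.php?keywords=python&start=10#a'
print(urljoin(base, next_href))
# -> https://hr.tencent.com/position.php?keywords=python&start=10#a
```
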
One error worth recording here:
scrapy TypeError: Cannot mix str and non-str arguments
The cause was this line:
next_page = response.xpath("//a[@id='next']/@href")
Without the get() call, the expression returns a SelectorList rather than a string, and response.urljoin() refuses to join a string with a non-string.
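The error message itself comes from urllib.parse, which refuses to mix a str base with a non-str argument. It can be reproduced without Scrapy by passing any non-string object (the FakeSelectorList class below is a hypothetical stand-in for Scrapy's SelectorList):

```python
from urllib.parse import urljoin

class FakeSelectorList(list):
    """Hypothetical stand-in for the SelectorList that .xpath() returns."""

base = 'https://hr.tencent.com/position.php'
try:
    # Forgetting .get() means passing a list-like object, not a str
    urljoin(base, FakeSelectorList(['position.php?start=10']))
except TypeError as e:
    print(e)  # Cannot mix str and non-str arguments
```
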
Next, pipelines.py:
import pymongo


class MonGoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # Use the item's class name as the collection name
        name = item.__class__.__name__
        try:
            self.db[name].insert_one(dict(item))
            print("stored successfully")
        except Exception as e:
            print("ERROR:", e.args)
        return item

    def close_spider(self, spider):
        self.client.close()
This pipeline writes each crawled item into MongoDB.
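Note how process_item() picks its collection: it uses the item's class name, so everything yielded as a TencentItem lands in a collection called "TencentItem". A tiny sketch of that naming rule (a plain class stands in for the real scrapy.Item subclass here):

```python
# Plain class standing in for the scrapy.Item subclass; the naming
# rule only depends on __class__.__name__.
class TencentItem:
    pass

item = TencentItem()
collection_name = item.__class__.__name__
print(collection_name)  # TencentItem
```
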
Then settings.py. Turn off robots.txt compliance:
ROBOTSTXT_OBEY = False
Enable a download delay:
DOWNLOAD_DELAY = 1
Default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36'
                  ' (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
}
Pipelines: register our own pipeline class, plus the MongoDB settings it reads:
ITEM_PIPELINES = {
    'tencent.pipelines.MonGoPipeline': 300,
}
MONGO_URI = 'localhost'
MONGO_DB = 'tencent'
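The MONGO_URI and MONGO_DB keys here must match the names that from_crawler() looks up in the pipeline. A minimal sketch of that wiring, with a plain dict standing in for crawler.settings (the from_settings classmethod below is a hypothetical stand-in for from_crawler, for illustration only):

```python
# Illustrative only: a plain dict stands in for crawler.settings.
class MonGoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_settings(cls, settings):  # stand-in for from_crawler(cls, crawler)
        # These key strings must match the names used in settings.py
        return cls(
            mongo_uri=settings.get('MONGO_URI'),
            mongo_db=settings.get('MONGO_DB'),
        )

settings = {'MONGO_URI': 'localhost', 'MONGO_DB': 'tencent'}
pipeline = MonGoPipeline.from_settings(settings)
print(pipeline.mongo_uri, pipeline.mongo_db)  # localhost tencent
```
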
Finally items.py, unchanged from last time:
import scrapy


class TencentItem(scrapy.Item):
    name = scrapy.Field()
    category = scrapy.Field()
    num = scrapy.Field()
    city = scrapy.Field()
Now run start.py:
from scrapy import cmdline
cmdline.execute("scrapy crawl tencent_spider".split())
Check the database: 558 documents.
Now check how many positions Tencent's careers page lists: also 558, so we crawled them all.
That wraps it up; a note for the record on putting Scrapy to use.