1. Case study: Tencent Careers job postings and their corresponding responsibilities
1.1 Page analysis:
- Static or dynamic? → dynamic (Ajax)
- Pagination in Scrapy: 1) list out a few URLs and fill them in with .format(), or 2) find the next-page URL directly on the page; either way, finish with yield scrapy.Request(url, callback=...)
- Implementation (the spider, items.py, pipelines.py)
Note: when you need to click through to another URL, first check DevTools for a ready-made request and look for a pattern in the URLs.
Places to check: DevTools, the Response and Preview tabs, the page source, and any JSON responses.
1.2 Details:
1.2.1 Pagination:
Job-list URLs:
https://careers.tencent.com/search.html?index=1   (page 1)
https://careers.tencent.com/search.html?index=2   (page 2)
https://careers.tencent.com/search.html?index=3   (page 3)
Detail URL for each job's responsibilities (some of this data may also appear in the list pages, but probably not all of it):
https://careers.tencent.com/jobdesc.html?postId=1357242769973714944
https://careers.tencent.com/jobdesc.html?postId=1357242775422115840
[only the postId differs]
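Before writing the spider, it is worth confirming in a console that the Ajax endpoint found in DevTools really returns the job list as JSON. A quick sketch outside Scrapy (the API URL and the Data/Posts keys are taken from the spider code in 1.2.2 below):

import requests

# job-list API captured from DevTools (Network -> XHR); pageIndex drives pagination
api = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1622635924738&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn'

data = requests.get(api.format(1)).json()
for post in data['Data']['Posts']:
    # each post carries the job name and the PostId used by the detail API
    print(post['RecruitPostName'], post['PostId'])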
1.2.2 Code:
import scrapy
import json
from recruit.items import RecruitItem


class PositionSpider(scrapy.Spider):
    name = 'position'
    allowed_domains = ['tencent.com']
    # JSON API behind the job-list page; pageIndex is filled in with .format()
    first_url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1622635924738&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn'
    # JSON API behind the detail page; postId is filled in with .format()
    second_url = 'https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1622637678428&postId={}&language=zh-cn'
    start_urls = [first_url.format(1)]

    def parse(self, response):
        # The response body is JSON, so parse it with json.loads instead of re/xpath/bs4
        data = json.loads(response.text)
        for job in data['Data']['Posts']:
            item = RecruitItem()
            item['job_name'] = job['RecruitPostName']
            post_id = job['PostId']
            # Build the detail-page URL for this job
            detail_url = self.second_url.format(post_id)
            # Request the detail page to get the job responsibilities
            yield scrapy.Request(
                url=detail_url,
                callback=self.detail_content,
                meta={'item': item}
            )
        # Pagination: range(2, 3) only adds page 2 here; widen the range for more pages
        for page in range(2, 3):
            url = self.first_url.format(page)
            yield scrapy.Request(url=url)  # callback defaults to self.parse

    def detail_content(self, response):
        # Retrieve the partially filled item passed along via meta
        item = response.meta.get('item')   # equivalent to response.meta['item']
        data = json.loads(response.text)
        item['job_duty'] = data['Data']['Responsibility']
        print(item)
        yield item  # yield the finished item so the pipelines receive it
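The spider imports RecruitItem from recruit.items, which is not shown in these notes; a minimal items.py consistent with the two fields used above (an assumption, not the original file) could be:

import scrapy

class RecruitItem(scrapy.Item):
    job_name = scrapy.Field()   # RecruitPostName from the list API
    job_duty = scrapy.Field()   # Responsibility from the detail API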
Summary:
Which approach should you use to crawl data from now on? Get the functionality working first, then optimize the program.
Choose according to which techniques you are most comfortable with.
Supplement:
An extension of last lesson's classical-poetry (gushiwen) spider: additionally crawl each poem's translation and notes.
Code (poem.py):
import scrapy
import re
import requests
from lxml import etree
from poems.items import PoemsItem


class PoemSpider(scrapy.Spider):
    name = 'poem'
    allowed_domains = ['gushiwen.org', 'gushiwen.cn']
    start_urls = ['https://www.gushiwen.org/default_1.aspx']

    def parse(self, response):
        gsw_divs = response.xpath('//div[@class="sons"]/div[@class="cont"]')
        # Skip non-poem blocks (only entries with a "yizhu" div are real poems)
        # and collect the link to each poem's detail page
        if gsw_divs:
            for gsw_div in gsw_divs:
                if gsw_div.xpath('./div[@class="yizhu"]'):
                    href = gsw_div.xpath('./p/a/@href').get()
                    href_url = response.urljoin(href)
                    yield scrapy.Request(url=href_url, callback=self.parse_1)
        # Pagination:
        # next_href = response.xpath('//a[@id="amore"]/@href').get()
        # if next_href:
        #     next_url = response.urljoin(next_href)  # urljoin() completes a relative URL
        #     yield scrapy.Request(
        #         url=next_url,
        #         callback=self.parse  # can be omitted, since parse is the default callback
        #     )

    # Here the response is the poem's detail page
    def parse_1(self, response):
        # Extract the four basic fields: title, author, dynasty, content
        html_text = response.xpath('//div[@id="sonsyuanwen"][1]/div[@class="cont"]')
        title = html_text.xpath('./h1/text()').get()
        author = html_text.xpath('./p[@class="source"]/a[1]/text()').get()
        dynasty = html_text.xpath('./p[@class="source"]/a[2]/text()').get()
        content_list = html_text.xpath('./div[@class="contson"]//text()').getall()
        content = ''.join(content_list).strip()
        # Look for the "展开阅读全文" (expand full text) link: if it exists, fetch the
        # translation and notes from the Ajax page; otherwise extract them from this page
        pd = response.xpath('//a[@style="text-decoration:none;"]/@href').get()
        # alternative xpath: '//div[@style="text-align:center; margin-top:-5px;"]/a/@href'
        if pd:
            # The href looks like javascript:...,'<id>'; pull out the id
            ID = re.match(r"javascript:.*?,'(.*?)'", str(pd)).group(1)
            base_url = 'https://so.gushiwen.cn/nocdn/ajaxfanyi.aspx?id={}'
            parse_html = etree.HTML(requests.get(base_url.format(ID)).text)
            html_fanyi = parse_html.xpath('//div[@class="contyishang"]//text()')
        else:
            html_fanyi = response.xpath(
                '//div[@class="left"]/div[@class="sons"][2]/div[@class="contyishang"]//text()').getall()
        # Join the text, then split it into translation and notes on the '|' marker
        html_fanyi = ''.join(html_fanyi).replace('\n', '').replace('译文及注释', '').replace('译文', '').replace('注释', '|')
        if html_fanyi:
            fanyi = html_fanyi.split('|')
            translation = fanyi[0]
            notes = fanyi[1] if len(fanyi) > 1 else ''
        else:
            translation = ''
            notes = ''
        item = PoemsItem(title=title,
                         author=author,
                         dynasty=dynasty,
                         content=content,
                         translation=translation,
                         notes=notes)
        yield item
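As with the previous case, the PoemsItem class imported from poems.items is not shown; a minimal version matching the six fields used above (an assumption, not the original file) could be:

import scrapy

class PoemsItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    dynasty = scrapy.Field()
    content = scrapy.Field()
    translation = scrapy.Field()
    notes = scrapy.Field()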
Summary:
- Make sure the URL is correct. Scrapy output is normally not garbled; if it is, add FEED_EXPORT_ENCODING = 'utf-8' to settings.py (see the snippet after this list).
- If you keep getting empty lists, the likely causes are: a wrong XPath (i.e. the extraction itself), incorrect parsing, anti-crawling measures, or a required login.
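For reference, the encoding fix is a single line in settings.py:

# settings.py
FEED_EXPORT_ENCODING = 'utf-8'   # keep exported feeds (JSON/CSV) readable as UTF-8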
2. scrapy shell
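scrapy shell lets you test selectors interactively before putting them into a spider. A minimal sketch, reusing the poem list URL and the XPath from the spider above:

$ scrapy shell 'https://www.gushiwen.org/default_1.aspx'
# inside the shell, a response object for that URL is already available:
>>> response.status
>>> response.xpath('//div[@class="sons"]/div[@class="cont"]')
>>> fetch('https://careers.tencent.com/search.html?index=1')   # load another URL in the same session
>>> response.url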
3. More on the settings.py file
settings.py holds the project's configuration items; variables written in UPPERCASE denote shared, public configuration.
It is the bridge between external resources and Scrapy; when you are unsure where to put some simple logic or constants, settings.py is a reasonable place.
Example:
Reference MYSQL_HOST = '127.0.0.1' from settings.py in both the spider file and the pipeline.
1. In the spider file:
First way:
from mySpider.settings import MYSQL_HOST
item['db_host'] = MYSQL_HOST
Second way:
item['db_host'] = self.settings.get('MYSQL_HOST')
2. In pipelines.py:
The first way is the same as above.
Second way:
item['pip_host'] = spider.settings.get('MYSQL_HOST')
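Putting the patterns above together, a minimal sketch (the project name mySpider, MYSQL_HOST, and the db_host / pip_host keys follow the notes; the spider and pipeline class names are illustrative):

# settings.py
MYSQL_HOST = '127.0.0.1'

# spider file (e.g. spiders/example.py)
import scrapy
from mySpider.settings import MYSQL_HOST        # first way: import the constant directly

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com/']

    def parse(self, response):
        item = {}
        item['db_host'] = MYSQL_HOST                        # first way
        item['db_host'] = self.settings.get('MYSQL_HOST')   # second way: via the crawler settings
        yield item

# pipelines.py
class ExamplePipeline:
    def process_item(self, item, spider):
        # second way in a pipeline: the spider object carries the settings
        item['pip_host'] = spider.settings.get('MYSQL_HOST')
        return item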