A New Approach to Web Scraping: Intermediate

Latest recommended article published on 2022-07-06 10:08:52

anxuxiao

Category: Python web scraping. Tags: functions, web scraping, jobs

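The previous post created the spider's main file by hand, inside the spiders folder of a project made with startproject.
That step can now be replaced by a single command. Source: http://python.gotrained.com/scrapy-tutorial-web-scraping-craigslist/
Enter the following on the command line: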

scrapy genspider jobs https://newyork.craigslist.org/search/egr
Here jobs is the name you choose for the spider, and the URL after it is the page where the crawl starts.

Afterwards a file named jobs.py appears in the spiders directory. It is the spider's main file and looks like this:

# -*- coding: utf-8 -*-
import scrapy


class JobsSpider(scrapy.Spider):
    name = 'jobs'
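    allowed_domains = ['https://newyork.craigslist.org/search/egr']
    # Change this to newyork.craigslist.org (a domain, not a URL); otherwise an error is reported
    start_urls = ['http://https://newyork.craigslist.org/search/egr/']
    # The doubled scheme is invalid too; change it to 'https://newyork.craigslist.org/search/egr'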

    def parse(self, response):
        pass
# Everything above is generated automatically by the genspider command
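The allowed_domains fix mentioned in the comment above amounts to keeping only the host part of the URL. A minimal sketch using only the standard library (the variable names are illustrative, not something genspider produces):

```python
from urllib.parse import urlparse

# URL passed to genspider in this post
start_url = "https://newyork.craigslist.org/search/egr"

# allowed_domains expects bare host names, so keep only the network location
domain = urlparse(start_url).netloc
allowed_domains = [domain]

print(allowed_domains)
```

Passing a bare domain to genspider in the first place should avoid the problem, although the exact output may differ between Scrapy versions.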

[Figure: overall directory structure of the project]

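Next we focus on the parse method, since it is the most important part. Starting with the simplest case, look at this single line of parsing code: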


titles = response.xpath('//a[@class="result-title hdrlnk"]/text()').extract()
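Reading the expression piece by piece: // starts the search anywhere in the document; a selects tags of that type; [@class="..."] keeps only tags whose class attribute has exactly that value; text() selects the text inside the a element. extract() returns a list of every value on the page that matches, which is stored in titles here.
extract_first() returns only the first match.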

The corresponding page source:

<a href="/brk/egr/6085878649.html" data-id="6085878649" class="result-title hdrlnk">Chief Engineer</a>
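To try the idea without running a crawl, here is a small sketch that uses Python's standard-library xml.etree.ElementTree as a stand-in for Scrapy's selector. This is an assumption made for illustration: ElementTree supports only a subset of XPath, needs a leading "." in the path, and has no text() step, so .text is read instead. findall plays the role of extract(), and find the role of extract_first().

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for the results page (only a few links are kept)
page = """
<ul>
  <li><a href="/brk/egr/6085878649.html" data-id="6085878649" class="result-title hdrlnk">Chief Engineer</a></li>
  <li><a href="/brk/egr/6112478644.html" data-id="6112478644" class="result-title hdrlnk">Project Architect</a></li>
  <li><a href="#" class="restore-link">restore</a></li>
</ul>
"""
root = ET.fromstring(page)

# Same attribute test as in the Scrapy expression, in ElementTree's XPath subset
path = ".//a[@class='result-title hdrlnk']"

# Every match, comparable to extract()
titles = [a.text for a in root.findall(path)]

# Only the first match (None when nothing matches), comparable to extract_first()
first = root.find(path)
first_title = first.text if first is not None else None

print(titles)
print(first_title)
```

The restore link is skipped because its class attribute does not equal "result-title hdrlnk".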

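The previous post mentioned that after running scrapy shell xxxxxx.com and entering print(response), you get <200 https://newyork.craigslist.org/search/egr>. This shows that response is the response object for that URL (status 200); accordingly, print(response.body) prints the content of the whole page. When scraping we only want part of that content, which is what response.xpath() is for, as in the titles extraction above. We can then call print(titles). The whole spider now looks like this: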

import scrapy

class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["craigslist.org"]
    start_urls = ['https://newyork.craigslist.org/search/egr']

    def parse(self, response):
        titles = response.xpath('//a[@class="result-title hdrlnk"]/text()').extract()
        print(titles)

Running $ scrapy crawl jobs prints the following:

[u'Junior/ Mid-Level  Architect for Immediate Hire', u'SE BUSCA LLANTERO/ LOOKING FOR TIRE CAR WORKER CON EXPERIENCIA', u'Draftsperson/Detailer', u'Controls/ Instrumentation Engineer', u'Project Manager', u'Tunnel Inspectors - Must be willing to Relocate to Los Angeles', u'Senior Designer - Large Scale', u'Construction Estimator/Project Manager', u'CAD Draftsman/Estimator', u'Project Manager']

That is a bit messy. To produce output in a fixed structure, use:

for title in titles:
    yield {'Title': title}
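    # yield looks similar to print here, but it hands each item to Scrapy,
    # which logs it and can export it; print only writes to the console.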
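The difference between yield and print is easier to see outside Scrapy. In the plain-Python sketch below, the parse function is a hypothetical stand-in for the spider method, and list() plays the part of the Scrapy engine that collects whatever the generator produces:

```python
def parse(titles):
    # Hypothetical stand-in for the spider's parse(): one dict per title
    for title in titles:
        yield {'Title': title}

titles = ['Draftsperson/Detailer', 'Project Manager']

# Calling a generator function runs nothing yet; the caller pulls items one by one.
# Scrapy's engine plays this role for a real spider.
items = list(parse(titles))

for item in items:
    print(item)
```

Because the items are returned to the caller rather than just printed, they can be passed on to exporters, which is what the -o option relies on.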

Scrapy can store the scraped data in formats such as CSV, JSON, or XML. For example, to save the titles output above:

scrapy crawl jobs -o result.csv

A new .csv file then appears in the current directory.
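For a spider that yields {'Title': ...} dicts, the file contains a header row with the field name followed by one row per item. The sketch below reproduces that shape with the standard-library csv module; it is an approximation for illustration, not Scrapy's own exporter, and the two items are sample values taken from the output above.

```python
import csv
import io

# Sample items of the same shape the spider yields
items = [{'Title': 'Draftsperson/Detailer'}, {'Title': 'Project Manager'}]

# Write to an in-memory buffer instead of a real file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['Title'], lineterminator='\n')
writer.writeheader()      # header row: the dict keys
writer.writerows(items)   # one row per yielded item

csv_text = buffer.getvalue()
print(csv_text)
```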
Building on the above, we now scrape more details for each job listing.

<li class="result-row" data-pid="6112478644">
    <a href="/brk/egr/6112478644.html" class="result-image gallery empty"></a>
    <p class="result-info">
        <span class="icon icon-star" role="button">
            <span class="screen-reader-text">favorite this post</span>
        </span>
        <time class="result-date" datetime="2017-05-01 12:35" title="Mon 01 May 12:35:41 PM">May 1</time>
        <a href="/brk/egr/6112478644.html" data-id="6112478644" class="result-title hdrlnk">Project Architect</a>
        <span class="result-meta">
            <span class="result-hood"> (Brooklyn)</span>
            <span class="result-tags">
                <span class="maptag" data-pid="6112478644">map</span>
            </span>
            <span class="banish icon icon-trash" role="button">
                <span class="screen-reader-text">hide this posting</span>
            </span>
            <span class="unbanish icon icon-trash red" role="button" aria-hidden="true"></span>
            <a href="#" class="restore-link">
                <span class="restore-narrow-text">restore</span>
                <span class="restore-wide-text">restore this posting</span>
            </a>
        </span>
    </p>
</li>

For the source above, use:

jobs = response.xpath('//p[@class="result-info"]')
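# Note there is no extract() here: jobs acts as a container (a list of selectors)
# that holds all the information, and we extract from each entry in turn.

for job in jobs:
    title = job.xpath('a/text()').extract_first()
    yield {'Title': title}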
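The two-step pattern (select the containers first, then run a relative path inside each one) can be tried offline. This sketch again uses xml.etree.ElementTree as a stand-in for Scrapy's selectors, which is an assumption made for illustration, and a shortened copy of the row shown above:

```python
import xml.etree.ElementTree as ET

# Shortened copy of one result row from the page source
row = """
<li class="result-row" data-pid="6112478644">
    <p class="result-info">
        <a href="/brk/egr/6112478644.html" data-id="6112478644" class="result-title hdrlnk">Project Architect</a>
        <span class="result-meta">
            <span class="result-hood"> (Brooklyn)</span>
            <a href="#" class="restore-link">restore</a>
        </span>
    </p>
</li>
"""
root = ET.fromstring(row)

# Step 1: select the containers without pulling any text out yet
jobs = root.findall(".//p[@class='result-info']")

results = []
for job in jobs:
    # Step 2: a relative path is evaluated inside this container only.
    # "a" matches direct children, so the nested restore link is ignored.
    link = job.find("a")
    results.append({'Title': link.text, 'Href': link.get('href')})

print(results)
```

The same reasoning explains why job.xpath('a/text()') in the spider returns the job title and not the text of the restore link.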

Other fields are extracted the same way:
for job in jobs:
    title = job.xpath('a/text()').extract_first()
    # [2:-1] strips the leading " (" and the trailing ")" from text such as " (Brooklyn)"
    address = job.xpath('span[@class="result-meta"]/span[@class="result-hood"]/text()').extract_first("")[2:-1]
    # This gives a relative path such as /brk/egr/6112478644.html
    relative_url = job.xpath('a/@href').extract_first()
    # urljoin turns the relative path into an absolute https:// URL
    absolute_url = response.urljoin(relative_url)
    yield {'URL': absolute_url, 'Title': title, 'Address': address}
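In essence, response.urljoin resolves a link against the URL of the current page, much like urllib.parse.urljoin from the standard library. The sketch below checks both the URL joining and the [2:-1] slice on the sample values from the row above, with no Scrapy involved:

```python
from urllib.parse import urljoin

# URL of the page being parsed and a relative link found on it
page_url = "https://newyork.craigslist.org/search/egr"
relative_url = "/brk/egr/6112478644.html"

# A leading "/" replaces the whole path of the page URL
absolute_url = urljoin(page_url, relative_url)

# The neighbourhood text comes wrapped as " (Brooklyn)"
raw_hood = " (Brooklyn)"
address = raw_hood[2:-1]

# With extract_first("") a missing neighbourhood becomes "", and slicing "" is still ""
missing_address = ""[2:-1]

print(absolute_url)
print(address)
```

This is also why extract_first("") is used for the address: without the default, a missing neighbourhood would be None and the slice would raise a TypeError.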

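Summary: we have gone from simple to more involved, from extracting a single title to extracting several fields. Next we recursively follow the subsequent pages and extract their information too.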

First, find the source of the link to the next page:

<a href="/search/egr?s=120" class="button next" title="next page">next > </a>

Then build the full URL of the next page, using the href attribute as before:

relative_next_url = response.xpath('//a[@class="button next"]/@href').extract_first()
absolute_next_url = response.urljoin(relative_next_url)
yield Request(absolute_next_url, callback=self.parse)
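# The next page is parsed exactly like this one, so the same callback (parse) is reused recursively.
# Request is imported with "from scrapy import Request", as in the complete program below.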
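The recursion can be pictured with a small simulation that needs no network. Everything here is hypothetical scaffolding: SITE is an in-memory stand-in for the website, and crawl is a toy scheduler doing what the Scrapy engine does with yielded requests.

```python
from urllib.parse import urljoin

# Hypothetical in-memory "site": page URL -> (titles on the page, relative next link or None)
SITE = {
    "https://newyork.craigslist.org/search/egr": (["Chief Engineer"], "/search/egr?s=120"),
    "https://newyork.craigslist.org/search/egr?s=120": (["Project Architect"], None),
}

def parse(url):
    """Yield the items of one page, then the URL of the next page if there is one."""
    titles, relative_next_url = SITE[url]
    for title in titles:
        yield {'Title': title}
    if relative_next_url:
        yield urljoin(url, relative_next_url)

def crawl(start_url):
    """Toy scheduler: feed every yielded URL back into parse()."""
    pending, items = [start_url], []
    while pending:
        for result in parse(pending.pop()):
            if isinstance(result, str):
                pending.append(result)   # a follow-up page to visit
            else:
                items.append(result)     # a scraped item
    return items

items = crawl("https://newyork.craigslist.org/search/egr")
print(items)
```

The crawl stops on its own once a page has no next link, which is the case a real spider also has to handle on the last results page.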

The complete final program is:

import scrapy
from scrapy import Request

class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["craigslist.org"]
    start_urls = ["https://newyork.craigslist.org/search/egr"]

    def parse(self, response):
        jobs = response.xpath('//p[@class="result-info"]')

        for job in jobs:
            relative_url = job.xpath('a/@href').extract_first()
            absolute_url = response.urljoin(relative_url)
            title = job.xpath('a/text()').extract_first()
            address = job.xpath('span[@class="result-meta"]/span[@class="result-hood"]/text()').extract_first("")[2:-1]

            yield Request(absolute_url, callback=self.parse_page, meta={'URL': absolute_url, 'Title': title, 'Address':address})

        relative_next_url = response.xpath('//a[@class="button next"]/@href').extract_first()
        # The last page has no next link, so only follow it when one was found
        if relative_next_url:
            absolute_next_url = response.urljoin(relative_next_url)
            yield Request(absolute_next_url, callback=self.parse)

    def parse_page(self, response):
        url = response.meta.get('URL')
        title = response.meta.get('Title')
        address = response.meta.get('Address')

        description = "".join(line for line in response.xpath('//*[@id="postingbody"]/text()').extract())

        compensation = response.xpath('//p[@class="attrgroup"]/span[1]/b/text()').extract_first()
        employment_type = response.xpath('//p[@class="attrgroup"]/span[2]/b/text()').extract_first()

        yield{'URL': url, 'Title': title, 'Address':address, 'Description':description, 'Compensation':compensation, 'Employment Type':employment_type}
