Web Scraping Study Notes, Part 4

Roy0608

Category: python. Tags: web scraping, python, data analysis

Article link: https://blog.csdn.net/Roy0608/article/details/102476021

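IV. Scraping Python job listings from Lagou with the Scrapy framework

1. Define the item fields to scrape
2. Write the spider code and configure settings.py

1. Define the item fields to scrape

We want to scrape six fields for each posting: job title, company location, company name, salary, required work experience, and required education. The code for items.py is: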

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class LagouspiderItem(scrapy.Item):
    # define the fields for your item here like:

    positionname = scrapy.Field()  # job title
    address = scrapy.Field()  # company location
    company = scrapy.Field()  # company name
    salary = scrapy.Field()  # salary
    experience = scrapy.Field()  # required work experience
    education = scrapy.Field()  # required education

2. Write the spider code and configure settings.py

1. Scraping a single page

# -*- coding: utf-8 -*-
# Scrape Python job listings from Lagou
import scrapy

from LaGouSpider.items import LagouspiderItem


class LagouSpider(scrapy.Spider):
    name = 'LaGou'
    allowed_domains = ['lagou.com']
    start_urls = ['https://www.lagou.com/zhaopin/Python/1/?filterOption=3']

    def parse(self, response):
        for line in response.xpath('//div[@class="list_item_top"]'):
            item = LagouspiderItem()
            # Extraction step: each field is selected with XPath; the exact paths depend on the page structure
            item['positionname'] = line.xpath('./div[@class="position"]/div[1]/a/h3/text()').extract_first()
            item['address'] = line.xpath('./div[@class="position"]/div[1]/a/span/em/text()').extract_first().split('·')[0]
            item['company'] = line.xpath('./div[@class="company"]/div[1]/a/text()').extract_first()
            item['salary'] = line.xpath('./div[@class="position"]/div[2]/div/span/text()').extract_first()
            item['experience'] = line.xpath('./div[@class="position"]/div[2]/div/text()[3]').extract_first().replace('\n','').replace(' ','').split('/')[0]
            item['education'] = line.xpath('./div[@class="position"]/div[2]/div/text()[3]').extract_first().replace('\n','').replace(' ','').split('/')[1]
            yield item

Some of the scraped fields contain newline characters (\n) and spaces, so replace() is used to strip them after extraction.
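The cleanup described above can be factored into a small helper. This is a minimal sketch, not the author's code: `split_requirements` is a hypothetical name, and the sample string assumes the page renders the text as "experience / education". Unlike the chained calls in the spider, it also tolerates the `None` that `extract_first()` returns when an XPath matches nothing.

```python
def split_requirements(raw):
    """Split text such as ' 经验3-5年 / 本科\n' into (experience, education).

    Returns empty strings for missing parts instead of raising.
    """
    if raw is None:  # extract_first() returns None when the XPath matches nothing
        return '', ''
    # Same cleanup as in the spider: drop newlines and spaces
    cleaned = raw.replace('\n', '').replace(' ', '')
    # partition() splits on the first '/' and never raises if '/' is absent
    experience, _, education = cleaned.partition('/')
    return experience, education


# Assumed sample input, not taken from a real response
print(split_requirements(' 经验3-5年 / 本科\n'))
print(split_requirements(None))
```

Inside `parse`, the two fields could then be filled from a single XPath lookup instead of two identical ones.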
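The output of the run is as follows:

Clearly, the run failed. The main reason is that Lagou has anti-scraping measures in place, so we need to add a UA (USER_AGENT) request header in settings.py.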

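2. Modify settings.py

Open the page to be scraped, right-click and choose Inspect, select the Network tab, then press Ctrl+R to reload the page. Click one of the files listed under Name and find its User-Agent header. Next, open settings.py, paste the UA into the USER_AGENT line and uncomment it, and change ROBOTSTXT_OBEY (the robots.txt protocol setting) to False. Running the spider again gives:
The run now succeeds. Next, we modify the code to implement the automatic pagination covered in the previous chapter.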
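For reference, the two settings.py changes described in this step might look like the fragment below. The User-Agent value is only a placeholder and an assumption; the string to use is whatever your own browser shows in the Network panel.

```python
# settings.py (fragment)

# Placeholder value: paste the User-Agent string copied from your own browser.
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36')

# Scrapy's generated settings.py sets this to True; the article switches it off.
ROBOTSTXT_OBEY = False
```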

3. Automatic pagination

The URLs follow a regular pattern:
Page 1
https://www.lagou.com/zhaopin/Python/1/?filterOption=3
Page 2
https://www.lagou.com/zhaopin/Python/2/?filterOption=3
Page 3
https://www.lagou.com/zhaopin/Python/3/?filterOption=3
…
So, following the method from the previous chapter, only start_urls needs to change:

start_urls = ['https://www.lagou.com/zhaopin/Python/{}/?filterOption=3'.format(i) for i in range(1,31)]
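As a quick sanity check of that comprehension, the sketch below builds the same list outside Scrapy and prints its size and endpoints. The variable `base` is introduced here only for readability.

```python
base = 'https://www.lagou.com/zhaopin/Python/{}/?filterOption=3'

# range(1, 31) yields 1 through 30, so pages 1 to 30 are requested
start_urls = [base.format(i) for i in range(1, 31)]

print(len(start_urls))   # number of pages requested
print(start_urls[0])     # first page URL
print(start_urls[-1])    # last page URL
```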

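Then run the spider and save the results to pymsg.csv: scrapy crawl LaGou -o pymsg.csv. In total, 345 records were scraped. (Lagou's anti-scraping measures are fairly thorough: if you send too many requests in a short period, fewer and fewer records come back.)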
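One common way to reduce request bursts is to slow the crawl down in settings.py. The fragment below is a sketch using standard Scrapy settings; the specific values are assumptions, it was not part of the original setup, and whether it is enough to avoid Lagou's limits is untested here.

```python
# settings.py (fragment), values are illustrative assumptions

DOWNLOAD_DELAY = 2                   # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # send one request at a time to lagou.com
AUTOTHROTTLE_ENABLED = True          # let Scrapy adapt the delay to server response times
```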

4. Data analysis

Finally, using the address field, we look at job demand in different cities and visualize it. The code is:

import pandas as pd
df_data_py = pd.read_csv("./pymsg.csv")  # read the CSV file

diff_city_py = []  # distinct cities
for i in df_data_py['address']:  # iterate over the address field
    if i not in diff_city_py:
        diff_city_py.append(i)
# print(diff_city_py)

num_py = []  # number of postings per city, in the same order as diff_city_py
for i in diff_city_py:
    count = 0
    for j in df_data_py['address']:
        if j == i:
            count += 1
    num_py.append(count)
python = dict(zip(diff_city_py, num_py))  # pair each city with its posting count as a dict
# print(python)
pd.DataFrame([python]).to_csv('city_num.csv')  # save the result to a CSV file
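The two nested loops above amount to counting occurrences per city. A shorter alternative, shown here as a sketch with the standard library's `Counter`, produces the same city-to-count mapping; in pandas, `df_data_py['address'].value_counts()` would serve the same purpose. The `addresses` list is a made-up sample standing in for the address column, not the scraped data.

```python
from collections import Counter

# Made-up sample standing in for df_data_py['address']; not the scraped data.
addresses = ['北京', '上海', '北京', '深圳', '上海', '北京']

city_counts = Counter(addresses)  # maps city -> number of postings

# most_common() lists cities from most to fewest postings
for city, count in city_counts.most_common():
    print(city, count)
```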

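Next, copy the CSV file to the desktop and open it. The Chinese text in the file appears garbled, which is an encoding problem: open the file in Notepad++, change the encoding to UTF-8-BOM, and save it, after which the text displays correctly. The first column is the index. Because the data is simple, the chart can be made directly from the CSV file, although the matplotlib package would also work. The visualization result is:
Beijing, Shanghai, and Shenzhen together account for 71% of the total demand for Python positions. A similar analysis could be done on salaries.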
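The garbled-text problem described above can also be avoided when the file is written, by using Python's `utf-8-sig` codec, which puts a byte-order mark (BOM) at the start of the file. The sketch below uses the standard `csv` module; the counts and the file name `city_num_demo.csv` are made up for illustration and are not the article's results. pandas' `to_csv` accepts the same `encoding='utf-8-sig'` argument.

```python
import csv
import os
import tempfile

# Made-up counts for illustration only; not the article's results.
city_counts = {'北京': 3, '上海': 2, '深圳': 1}

# Hypothetical file name, written to the system temp directory
path = os.path.join(tempfile.gettempdir(), 'city_num_demo.csv')

# 'utf-8-sig' writes a BOM first, which spreadsheet programs use to detect UTF-8
with open(path, 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(['city', 'count'])
    writer.writerows(city_counts.items())

# Read the first three bytes back to confirm the BOM is there
with open(path, 'rb') as f:
    head = f.read(3)
print(head == b'\xef\xbb\xbf')  # True when the BOM is present
```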
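That completes this chapter's scraping, data analysis, and visualization for Lagou. The next chapter covers how to automatically scrape book information across the whole JD.com (京东) books section.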