Scrapy Crawler Framework Study Notes: A Simple Crawler in Practice

Goker123

Tags: python, web crawler

Article link: https://blog.csdn.net/weixin_43848766/article/details/121547094

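Scrapy framework: a crawler for the Sunshine Government Affairs platform (阳光政务平台)

Target site: the Sunshine Government Affairs platform (wzzdg.sun0769.com)
Content to scrape: post title, publish date, detail text, and attached images
The project files are linked at the end of the post.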

items.py setup

Configure the items.py file.

Inside the item class in items.py (YangguangItem(scrapy.Item) in this project), add:

# Define the fields to scrape here
title = scrapy.Field()         # title
href = scrapy.Field()          # detail-page URL
publish_date = scrapy.Field()  # publish date
content_img = scrapy.Field()   # image URLs on the detail page
content = scrapy.Field()       # detail text

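This class gathers all the scraped fields in one place. It also helps catch spelling mistakes in the spider: assigning to a field name that is not declared here raises a KeyError.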

settings.py configuration

Uncomment and set the User-Agent

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36 Edg/96.0.1054.29'  
# Any User-Agent string works; copy one from a request shown in the browser's developer tools

Set the log level
```
LOG_LEVEL = "WARNING"
```
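Scrapy logs through Python's standard logging module, so the effect of this threshold can be illustrated without Scrapy. A minimal sketch, assuming a hypothetical logger named demo: with the level at WARNING, DEBUG and INFO messages are dropped while WARNING and above still pass.

```python
import logging

# "demo" is a hypothetical logger name used only for this illustration.
logger = logging.getLogger("demo")
logger.setLevel(logging.WARNING)  # same threshold as LOG_LEVEL = "WARNING"

# Messages below the threshold are dropped; WARNING and above pass.
info_enabled = logger.isEnabledFor(logging.INFO)
warning_enabled = logger.isEnabledFor(logging.WARNING)
error_enabled = logger.isEnabledFor(logging.ERROR)
print(info_enabled, warning_enabled, error_enabled)
```

While debugging selectors it can help to lower the level again, since request and item messages are logged below WARNING.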
Enable the item pipeline. If it is not enabled, items yielded by the spider are never passed to pipelines.py.
```
ITEM_PIPELINES = {
    'yangguang.pipelines.YangguangPipeline': 300,
}
```

Main spider

Code

import scrapy
from ..items import YangguangItem  # import the item class


class YgSpider(scrapy.Spider):
    name = 'yg'
    allowed_domains = ['wzzdg.sun0769.com']
    # 100 list pages in total
    start_urls = ['http://wzzdg.sun0769.com/political/index/politicsNewest?id=1&page='
                  + str(x) for x in range(1, 101)]

    def parse(self, response):
        # One <li> per post on the list page
        li_list = response.xpath('//ul[@class="title-state-ul"]/li')
        for li in li_list:
            item = YangguangItem()
            # Extract the list-page fields
            item["title"] = li.xpath('./span[3]/a/text()').extract_first()
            href = li.xpath('./span[3]/a/@href').extract_first()
            if href is None:
                continue  # no detail link in this row, nothing to follow
            item["href"] = response.urljoin(href)  # build the absolute detail-page URL
            item["publish_date"] = li.xpath('./span[5]/text()').extract_first()
            yield scrapy.Request(
                item["href"],
                callback=self.parse_detail,  # parse the detail page in parse_detail
                meta={"item": item}  # pass the partially filled item along
            )

    # Callback for detail pages
    def parse_detail(self, response):
        item = response.meta["item"]
        # Extract the detail-page fields
        item["content"] = response.xpath('//div[@class="mr-three"]/div[2]/pre/text()').extract()
        item["content_img"] = response.xpath('//div[@class="mr-three"]/div[3]/img/@src').extract()
        yield item  # hand the finished item to the pipeline
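
The two URL-building steps in the spider can be tried on their own. A minimal sketch using only the standard library: it generates the 100 list-page URLs and resolves a relative detail link against a list page. The href value below is a made-up placeholder, not a real path on the site.

```python
from urllib.parse import urljoin

BASE = "http://wzzdg.sun0769.com/political/index/politicsNewest?id=1&page="

# One URL per list page, pages 1 through 100.
page_urls = [f"{BASE}{page}" for page in range(1, 101)]

# Hypothetical relative link, standing in for an @href taken from a list page.
sample_href = "/example/detail?id=123"

# urljoin keeps the scheme and host of the page and swaps in the new path.
detail_url = urljoin(page_urls[0], sample_href)

print(len(page_urls), page_urls[-1])
print(detail_url)
```

Resolving the link against the page URL, rather than gluing strings together, also copes with links that are already absolute.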

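Notes
- Be sure to import the item class; otherwise YangguangItem is undefined in the spider and no data can be stored.
- allowed_domains must be correct. In particular, every URL requested for a custom callback (parse_detail here) has to fall within allowed_domains; otherwise the request is filtered as offsite and the callback never runs.
  For example, if the callback's request goes to [https://book.douban.com/top250?start=1]
  while allowed_domains = ["movie.douban.com"], the callback is never called. Listing the parent domain, allowed_domains = ["douban.com"], does cover book.douban.com, because subdomains of a listed domain are allowed.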
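
Scrapy's offsite filtering compares the host of each request with the entries in allowed_domains and accepts subdomains of a listed domain. The rule can be sketched as follows. This is a simplified illustration, not Scrapy's own implementation, and host_allowed is a hypothetical helper name.

```python
from urllib.parse import urlparse


def host_allowed(url, allowed_domains):
    """Hypothetical helper: True if the URL's host is a listed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


url = "https://book.douban.com/top250?start=1"

# The parent domain covers its subdomains.
print(host_allowed(url, ["douban.com"]))
# A sibling subdomain does not.
print(host_allowed(url, ["movie.douban.com"]))
# A full URL is not a domain name, so it never matches here.
print(host_allowed(url, ["https://book.douban.com/"]))
```

When a callback silently never runs, checking the request host against this rule is a quick first step.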

pipelines.py configuration

Code

import re

class YangguangPipeline:
    def process_item(self, item, spider):
        item["content"] = self.process_content(item["content"])  # replace content with the cleaned list
        print(item)  # print the scraped item
        return item

    # Strip line breaks from content and drop entries that end up empty
    def process_content(self, content):
        content = [re.sub('\r\n', "", i) for i in content]
        content = [re.sub('\n', "", i) for i in content]
        content = [i for i in content if len(i) > 0]  # drop empty strings from the list
        return content
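
The cleaning step can be checked without running the crawler. A minimal sketch, assuming a made-up sample list: clean_content is a hypothetical standalone variant that uses a single pattern for the line breaks, so unlike the pipeline method it also removes a lone \r.

```python
import re


def clean_content(fragments):
    """Remove line breaks from each fragment, then drop fragments left empty."""
    stripped = [re.sub(r"[\r\n]+", "", text) for text in fragments]
    return [text for text in stripped if text]


# Made-up sample resembling the text() fragments of a <pre> block.
sample = ["\r\n    First paragraph.\r\n", "\n", "Second paragraph."]
cleaned = clean_content(sample)
print(cleaned)
```

Leading spaces are kept on purpose here; calling str.strip() on each fragment would be the next step if indentation is unwanted.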

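Custom run.py

As mentioned in an earlier post, a Scrapy spider is started by typing scrapy crawl xxx on the command line.
Working from the command line is inconvenient. Adding a run.py file to the project directory lets you start the spider by running run.py directly in PyCharm and see the output there.

Code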

from scrapy import cmdline

cmdline.execute('scrapy crawl yg'.split())

Result

{'content': ['event details'],
 'content_img': ['image URL'],
 'href': 'detail-page URL',
 'publish_date': 'publish date',
 'title': 'title'}

Link: project files
