This article is for learning purposes only (learning purposes only, really!), not for any other use. Pointers and suggestions from more experienced readers are welcome 🌹🌹🌹
Step 1: Create the Scrapy project
scrapy startproject xxx
cd xxx
scrapy genspider xx xx.com
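These commands generate the standard Scrapy skeleton (shown here with the placeholder project name xxx and spider name xx):

xxx/
    scrapy.cfg
    xxx/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            xx.py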
Step 2: Define the items
In items.py, define the fields you want to store, according to your needs:
class StandardItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()           # document title
    index = scrapy.Field()           # index number
    classification = scrapy.Field()  # classification
    ....
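Items behave like dicts once instantiated, which is how the spider in Step 4 fills them in; a quick illustration (the value is just a placeholder):

item = StandardItem()
item['title'] = 'some title'  # set a declared field
print(dict(item))             # {'title': 'some title'}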
Step 3: Update the settings
Some basics in settings.py; change them as needed (or, if any expert has advice for a beginner like me, please share):
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
DOWNLOADER_MIDDLEWARES = {
    'Standard.middlewares.SeleniumMiddle': None,
}
# I wrote a Selenium middleware, but later found that requesting the data API
# directly also works for the dynamic content, so None disables it here.
ITEM_PIPELINES = {
    'Standard.pipelines.StandardPipeline': 300,
}
# Be sure to uncomment the pipeline, or nothing gets stored!
AUTOTHROTTLE_ENABLED = True  # required, otherwise the settings below have no effect
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
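The middleware itself isn't shown in this post; purely for reference, here is a minimal sketch of what Standard.middlewares.SeleniumMiddle might look like. Only the class name comes from the settings entry above; the headless option and the fixed 2-second wait are my assumptions, not the original code:

from time import sleep

from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddle:
    def __init__(self):
        option = webdriver.ChromeOptions()
        option.add_argument('--headless')  # assumption: no visible browser window
        self.driver = webdriver.Chrome(options=option)

    def process_request(self, request, spider):
        # Let Chrome render the page so JavaScript-built content is in the HTML
        self.driver.get(request.url)
        sleep(2)  # crude fixed wait; an explicit WebDriverWait would be more robust
        # Returning a Response here short-circuits Scrapy's own downloader
        return HtmlResponse(url=request.url, body=self.driver.page_source,
                            encoding='utf-8', request=request)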
Step 4: Write the spider
The rough idea is as follows:
import json
import re
from time import sleep

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

from Standard.items import StandardItem


class xxxSpider(scrapy.Spider):
    name = 'xx'
    allowed_domains = ['xx.xxx.xxxx.cn']
    start_urls = ['......']

    def __init__(self):
        option = webdriver.ChromeOptions()
        # Hide the usual automation fingerprints from the site
        option.add_experimental_option('excludeSwitches', ['enable-automation'])
        option.add_argument('--disable-blink-features=AutomationControlled')
        self.driver = webdriver.Chrome(service=Service('/Users/...../chromedriver'), options=option)
        self.url_list = []

    def parse(self, response):
        # Selenium is only used here to read the total count off the rendered pager
        startUrls = ["........."]
        self.driver.get(startUrls[0])
        sleep(2)  # wait for the pager to render
        pages = self.driver.find_elements(By.XPATH, '//div[@class="table-content-wrap"]/div[@class="pagination"]/a')
        pages = pages[-3].text  # third-from-last pager link holds the total record count
        page_num = (int(pages) // 5) + 1  # 5 records per page
        for i in range(1, page_num + 1):
            url = f'http://........?page={i}'
            yield scrapy.Request(url=url, callback=self.get_data)

    def get_data(self, response):
        # The paged endpoint returns JSON; collect each article's detail URL
        datas = json.loads(response.text)
        items = datas.get('articles')
        for each in items:
            item = StandardItem()
            url = each.get('url')
            item['url'] = url
            yield scrapy.Request(url=url, callback=self.get_datas, meta={'item': item})

    def get_datas(self, response):
        # Pull the fields off the detail page with regular expressions
        item = response.meta['item']
        item['title'] = re.findall(r'<h1 class="title document-number">(.*?)</h1>', response.text, re.S)[0]
        item['index'] = re.findall(
            r'<td class="first">\s+索引号:\s+</td>\s+<td class="td-value-xl">\s+<span title=".*?">(.*?)</span>\s+</td>',
            response.text, re.S)[0]
        try:
            item['classification'] = re.findall(
                r'<td class="second">\s+分类:\s+</td>\s+<td class="td-value">\s+<span>(.*?)</span>\s+</td>',
                response.text, re.S)[0]
        except IndexError:  # the field is missing on some pages
            item['classification'] = ''
        .........

    def closed(self, reason):
        # Release the browser once the spider finishes
        self.driver.quit()
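With the spider in place, run it by the name declared in the class (here 'xx'); the -o flag and file name are just an example for also dumping items to a feed file:

scrapy crawl xx
scrapy crawl xx -o result.json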
Step 5: Save the data
Write your storage logic in pipelines.py and you're done (I ended up with 266 records, which matches the 266 shown on the site ✅).
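The pipeline itself isn't shown here either; as one possible storage method, here is a minimal sketch of Standard.pipelines.StandardPipeline that writes each item as one line of JSON. The class name comes from the settings above; the file name standard.jsonl is my assumption, so swap in your own storage (a database, CSV, etc.):

import json

class StandardPipeline:
    def open_spider(self, spider):
        self.file = open('standard.jsonl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps the Chinese field values readable in the file
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()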