Scraping car complaints with Scrapy

The goal is to fetch complaint records posted within the past six months.

Computing the cutoff time

        # Truncate "now" to midnight by round-tripping through a date string,
        # then step back 26 weeks (~6 months) to get the earliest date to keep.
        today = datetime.datetime.now()
        today = datetime.datetime.strptime(today.strftime("%Y-%m-%d"), "%Y-%m-%d")
        self.endTime = today - datetime.timedelta(weeks=26)
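A more direct way to build the same cutoff, assuming only the calendar date matters, is to combine today's date with midnight instead of formatting and re-parsing a string (a sketch, not part of the original spider):

        # Equivalent cutoff: today at 00:00 minus 26 weeks.
        midnight = datetime.datetime.combine(datetime.date.today(), datetime.time.min)
        endTime = midnight - datetime.timedelta(weeks=26)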

Extracting the detail-page links from each listing page and saving them to a CSV file

        # Each row's onclick attribute carries the detail-page path in parentheses.
        detailPage = response.xpath('//div[@class="col-xs-12 plr_0 ts_list"]/table[2]//tr/@onclick').extract()
        match = re.compile(r'[(](.*)[)]', re.S)
        for i in range(len(detailPage)):
            # Pull the text between the parentheses and prepend the site root.
            pageID = re.findall(match, detailPage[i])[0]
            detailPage[i] = domain + pageID  # `domain` is defined elsewhere in the spider
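The snippet stops at building the URL list; presumably each URL is then scheduled for parse_detail, with an identifier carried through meta, since parse_detail below reads response.meta["id"] and response.meta["url"]. A minimal sketch, assuming the URL itself serves as both values:

        # Sketch: schedule each detail page; the meta dict is how
        # parse_detail later recovers the id for error reporting.
        for url in detailPage:
            yield scrapy.Request(url, callback=self.parse_detail,
                                 meta={"id": url, "url": url})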

Getting the next-page link

        domain2 = 'https://www.aqsiqauto.com'
        if self.next == 0:
            # Still within the six-month window: follow the pagination link.
            nextPage = response.xpath('//li[@class="next"]/a/@href').extract()
            if nextPage:
                self.pageNum = self.pageNum + 1
                nextPage = domain2 + nextPage[0]
                yield scrapy.Request(nextPage, callback=self.parse)
            else:
                # No "next" link found: log the page number and the failure.
                error1 = 'failed to get the link for page ' + str(self.pageNum)
                with open("error.csv", 'a', encoding='utf-8', newline='') as f:
                    writer = csv.writer(f)
                    writer.writerow([self.pageNum, error1])

        else:
            # A detail page older than the cutoff set self.next = 1:
            # write a summary and stop paginating.
            with open(writeFile, "a", encoding="utf-8", newline="") as f:  # writeFile is defined elsewhere
                writer = csv.writer(f)
                writer.writerow(["number of detail records", self.total])
                writer.writerow(["pages crawled", self.pageNum])
            print("Done!")

Extracting the fields from each detail page

    def parse_detail(self, response):

        # print("fetching the detail page with id=%s" % response.meta["url"])

        # Each field falls back to ['null'] when the node is missing;
        # otherwise the first match is stripped of surrounding whitespace.
        brand = response.xpath('//div[@class="col-xs-12 TS_content plr_0"]/div/div[4]//span/text()').extract()
        if not brand:
            brand = ['null']
        else:
            brand[0] = brand[0].strip()

        series = response.xpath('//div[@class="col-xs-12 TS_content plr_0"]/div/div[5]//div/text()').extract()
        if not series:
            series = ['null']
        else:
            series[0] = series[0].strip()

        details = response.xpath('//div[@class="col-xs-12 TS_content plr_0"]/div/div[8]//span/text()').extract()
        if not details:
            details = ['null']
        else:
            details[0] = details[0].strip()

        problem = response.xpath('//div[@class="col-xs-12 TS_content plr_0"]/div/div[7]//span/text()').extract()
        if not problem:
            problem = ['null']
        else:
            problem[0] = problem[0].strip()

        try:
            time = response.xpath('//div[@class="col-xs-12 TS_content plr_0"]/div/div[3]//div/text()').extract()
            if not time:
                time = ['null']
                error = [response.meta["id"] + '  missing time field']
                with open("error.csv", "a", encoding='utf-8', newline='') as f:
                    writer = csv.writer(f)
                    writer.writerow(error)
            else:
                time[0] = time[0].strip()
                pageTime = datetime.datetime.strptime(time[0], '%Y-%m-%d')
                # If this complaint predates the cutoff, flag the spider to stop paginating.
                difference = (pageTime - self.endTime).days
                if difference < 0:
                    self.next = 1
        except Exception as e:
            with open('error.csv', 'a', encoding='utf-8', newline='') as f:
                writer = csv.writer(f)
                writer.writerow([self.pageNum, response.meta['id'], e.args])

        # print("flag=%d"%self.next)
        self.total = self.total+1
        # print("获取%d条信息"%self.total)
        item = QautoItem()
        item['brand'] = brand
        item['series'] = series
        item['details'] = details
        item['problem'] = problem
        item['time'] = time
        yield item
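The item fields match the assignments above. The corresponding items.py is not shown in the post, but a Scrapy Item with those five fields would look like this (a sketch, with the field names taken from the code):

    import scrapy


    class QautoItem(scrapy.Item):
        # One field per value assigned in parse_detail.
        brand = scrapy.Field()
        series = scrapy.Field()
        details = scrapy.Field()
        problem = scrapy.Field()
        time = scrapy.Field()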