Crawling the "Statistical News" section of the National Bureau of Statistics with pyspider

123jinse

Article link: https://blog.csdn.net/qq_29541277/article/details/80479781

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2018-05-28 10:33:42
# Project: tongjiju

from pyspider.libs.base_handler import *
from lxml import etree


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://www.stats.gov.cn/tjgz/tjdt/index.html', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        s = etree.HTML(response.text)
        a = s.xpath('/html/body/div/div/div[3]/div[2]/ul/li/a/@href')
        # Print the hrefs to check whether they are usable links. They are not:
        # they look like ./201805/t20180518_1600030.html, so the leading dot must be handled.
        print(a)
        for each in a:
            # Do not replace every dot: the trailing ".html" contains one too.
            # Only check whether the href starts with a dot and replace that first occurrence.
            if each.startswith('.'):
                b = each.replace('.', 'http://www.stats.gov.cn/tjgz/tjdt', 1)
                # Print again to verify the constructed link; it checks out.
                print(b)
                self.crawl(b, callback=self.detail_page)
        for i in range(1, 6):
            # Build the links to the following pages. The page count can be changed freely; this is only a test.
            next_href = 'http://www.stats.gov.cn/tjgz/tjdt/index_' + str(i) + '.html'
            print(next_href)
            self.crawl(next_href, callback=self.index_page)

    @config(priority=2)
    def detail_page(self, response):
        return {
            # Define whatever fields you want to extract here.
            # Use XPath or CSS, whichever you prefer. The built-in CSS selector suggestion is not always correct, so check it.
            "url": response.url,
            "title": response.doc('title').text(),
        }
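
The link handling in the script can also be tried outside pyspider. Below is a minimal sketch, assuming Python 3, that uses the standard library's `urljoin` instead of replacing the leading dot by hand. The helper names `resolve_links` and `paginated_urls` are hypothetical and are not part of pyspider or of the original script. Only the listing URL and the sample href come from the code above.

```python
from urllib.parse import urljoin

# Listing page taken from the script above.
INDEX_URL = 'http://www.stats.gov.cn/tjgz/tjdt/index.html'


def resolve_links(base_url, hrefs):
    """Turn relative hrefs (e.g. './201805/x.html') into absolute URLs.

    Hypothetical helper: urljoin resolves each href against the page it
    was found on, so hrefs that are already absolute pass through unchanged.
    """
    return [urljoin(base_url, href) for href in hrefs]


def paginated_urls(base_url, first_page, last_page):
    """Build the index_<n>.html URLs that sit next to base_url.

    Hypothetical helper: the index_<n>.html naming is the pattern the
    script above assumes for the following pages.
    """
    return [urljoin(base_url, 'index_%d.html' % n)
            for n in range(first_page, last_page + 1)]


if __name__ == '__main__':
    sample = ['./201805/t20180518_1600030.html']
    print(resolve_links(INDEX_URL, sample))
    print(paginated_urls(INDEX_URL, 1, 5))
```

Resolving against the page URL avoids hard-coding the site prefix in two places, and it also copes with hrefs that start with `../` or `/`, which a first-dot replacement would mangle.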