Week 4 Crawler Log
Preface
This week I used the Scrapy framework to crawl Baidu News, following each search result through to its second-level page to collect article data. I worked through the problems that came up along the way, and I am using this project as a template while exploring a general-purpose crawler for web news.
I. The Scrapy crawler framework
Basic structure and code
1. Define the item structure in items.py to hold the scraped data; the detail field stores the text extracted from the second-level page. The structure is as follows:
class SpiderBaiduItem(scrapy.Item):
    title = scrapy.Field()      # headline of the news result
    url = scrapy.Field()        # link to the article page
    source = scrapy.Field()     # news source
    timestamp = scrapy.Field()  # publication date
    detail = scrapy.Field()     # text extracted from the second-level page
2. Define the spider myspider, which handles the site structure and yields items for later storage. The code:
import scrapy
from ..items import SpiderBaiduItem
import datetime


class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    # allowed_domains = ['baidu.com']
    start_urls = ['https://www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news&rsv_dl=ns_'
                  'pc&word=%E5%B1%B1%E4%B8%9C%E5%A4%A7%E5%AD%A6&x_bfe_rqs=03E80&x_bfe_tjscore=0.100000&'
                  'tngroupname=organic_news&newVideo=12&pn=0']
    url = ('https://www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news&rsv_dl=ns_'
           'pc&word=%E5%B1%B1%E4%B8%9C%E5%A4%A7%E5%AD%A6&x_bfe_rqs=03E80&x_bfe_tjscore=0.100000&'
           'tngroupname=organic_news&newVideo=12&pn={}')
    page = 0

    def parse(self, response):
        # On the first-level results page, extract the title, link, source
        # and timestamp of each result.
        current_time = datetime.datetime.now()
        today = '{}年{}月{}日'.format(current_time.year, current_time.month, current_time.day)
        title_list = []
        for each in response.xpath("//div[@class='result-op c-container xpath-log new-pmd']//h3[@class='news-title_1YtI1']"):
            # A title may be split across several text nodes; join the pieces.
            parts = each.xpath("a/text()").extract()
            title_list.append(''.join(part.strip(' ') for part in parts))
        url_list = response.xpath("//div[@class='result-op c-container xpath-log new-pmd']//h3//a/@href").extract()
        source_list = response.xpath("//div[@class='result-op c-container xpath-log new-pmd']//div[@class='news-source']//span[@class='c-color-gray c-font-normal c-gap-right']/text()").extract()
        timestamp_list = response.xpath("//div[@class='result-op c-container xpath-log new-pmd']//div[@class='news-source']//span[@class='c-color-gray2 c-font-normal']/text()").extract()
        for i in range(len(title_list)):
            timestamp = timestamp_list[i]
            # Relative timestamps such as "3小时前" are replaced with today's date.
            if '小时前' in timestamp:
                timestamp = today
            item = SpiderBaiduItem(title=title_list[i], url=url_list[i],
                                   source=source_list[i], timestamp=timestamp)
            # Follow the link to the second-level page, carrying the item in meta.
            yield scrapy.Request(url=url_list[i], meta={'item': item}, callback=self.parse_detail)
        print('first-level page pn={} crawled'.format(self.page))
        # Baidu paginates with the pn parameter in steps of 10; stop at pn=260.
        if self.page < 260:
            self.page += 10
            yield scrapy.Request(url=self.url.format(self.page), callback=self.parse)

    def parse_detail(self, response):
        # On the second-level (article) page, concatenate the anchor texts of
        # all links whose href is neither the QQ nor the Baidu homepage.
        item = response.meta['item']
        total_message = ''
        for each in response.xpath("//a"):
            href = each.xpath("@href").extract()
            text = each.xpath("text()").extract()
            if href and href[0] != 'http://www.qq.com' and href[0] != 'https://www.baidu.com':
                if text:
                    total_message += text[0] + '.'
        item['detail'] = total_message
        yield item
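The parse method above only rewrites timestamps of the form "N小时前" (N hours ago) to today's date; Baidu also emits other relative forms. A small standalone helper could centralize this conversion. This is a sketch: the name normalize_timestamp and the extra "分钟前"/"昨天" cases are my own additions, not part of the spider above.

```python
import datetime


def normalize_timestamp(raw, now=None):
    """Map Baidu's relative timestamps to an absolute 'YYYY年M月D日' date.

    Handles 'N小时前'/'N分钟前' (hours/minutes ago) and '昨天' (yesterday);
    anything else is assumed to already be an absolute date and returned as-is.
    """
    now = now or datetime.datetime.now()
    if '小时前' in raw or '分钟前' in raw:
        day = now
    elif '昨天' in raw:
        day = now - datetime.timedelta(days=1)
    else:
        return raw  # e.g. '2021年5月3日'
    return '{}年{}月{}日'.format(day.year, day.month, day.day)
```

In the spider, the `if '小时前' in timestamp` branch would then shrink to a single call, `timestamp = normalize_timestamp(timestamp_list[i])`.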
Problems encountered
The biggest obstacle when crawling Baidu News was Baidu's anti-scraping measures: the site's robots.txt disallows crawling, the server fingerprints the crawler's User-Agent, and IPs that issue requests too frequently get banned. We therefore rotate the User-Agent randomly and wait a random interval between requests. This is configured in the project's settings.py:
BOT_NAME = 'Spider_baidu'
SPIDER_MODULES = ['Spider_baidu.spiders']
NEWSPIDER_MODULE = 'Spider_baidu.spiders'

# Wait roughly 2 seconds between requests; with RANDOMIZE_DOWNLOAD_DELAY the
# actual wait is drawn randomly around DOWNLOAD_DELAY.
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True

# Use a random User-Agent per request instead of a single fixed one, so the
# site sees a browser rather than a crawler.
# USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # disable the default UA middleware
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,   # enable the random-UA middleware
}

COOKIES_ENABLED = True
URLLENGTH_LIMIT = 5000
METAREFRESH_ENABLED = False
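When RANDOMIZE_DOWNLOAD_DELAY is enabled, Scrapy draws the actual wait for each request uniformly from 0.5x to 1.5x of DOWNLOAD_DELAY. A quick sketch of the resulting range for our setting of 2 seconds (the function random_delay is only an illustration of this behaviour, not part of our project):

```python
import random

DOWNLOAD_DELAY = 2  # value from settings.py


def random_delay(base=DOWNLOAD_DELAY):
    # Mirrors Scrapy's behaviour when RANDOMIZE_DOWNLOAD_DELAY is True:
    # a uniform draw between 0.5x and 1.5x of the configured delay.
    return random.uniform(0.5 * base, 1.5 * base)

# With DOWNLOAD_DELAY = 2, every request waits between 1 and 3 seconds.
```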
Summary
Crawling Baidu News and its second-level pages gave us a much better understanding of how to scrape news sites. Since we will likely need to crawl Tencent News, Toutiao, and similar sites next, the follow-up goal is a general-purpose crawler that can handle these news sites with one codebase.
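One way to move toward such a general-purpose crawler is to keep per-site extraction rules in a configuration table instead of hard-coding XPaths in each spider. The sketch below is an assumption about how this could look: the Baidu selectors echo the spider above, while the news.qq.com key and its selectors are hypothetical placeholders to be filled in as each site is studied.

```python
from urllib.parse import urlparse

# Hypothetical per-site XPath rule table for a generic news spider.
SITE_RULES = {
    'www.baidu.com': {
        'title': "//h3[@class='news-title_1YtI1']//a/text()",
        'source': "//span[@class='c-color-gray c-font-normal c-gap-right']/text()",
    },
    'news.qq.com': {  # placeholder selectors, not verified against the site
        'title': "//h1/text()",
        'source': "//div[@class='source']/text()",
    },
}


def rules_for(url):
    """Return the XPath rule set for a URL's domain, or None if unknown."""
    return SITE_RULES.get(urlparse(url).netloc)
```

A generic parse callback would then look up `rules_for(response.url)` and apply the returned selectors, so adding a new news site only means adding one entry to the table.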