First, create a Scrapy project named wxapp from the cmd terminal:
scrapy startproject wxapp
Then, still in the cmd terminal, generate the spider file wxapp_spider.py, with the allowed domain "wxapp-union.com":
scrapy genspider -t crawl wxapp_spider "wxapp-union.com"
Open the generated project in PyCharm and make the following changes in settings.py:

Set LOG_LEVEL = "ERROR" so that only error-level messages are logged, preventing the console from being flooded with log output while debugging.

Set ROBOTSTXT_OBEY = False to stop obeying the site's robots.txt.

Set DOWNLOAD_DELAY = 2 so there is a 2-second delay between downloads, to avoid getting the IP banned.

Set the default request headers as follows:

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
}

Enable the item pipeline as follows:

ITEM_PIPELINES = {
    'wxapp.pipelines.WxappPipeline': 300,
}

Set start_urls to "http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1", the first list page of the tutorial articles, shown in the figure below. The spider.py code is as follows:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from wxapp.items import WxappItem


class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

    rules = (
        # Follow pagination links of the article list pages, without parsing them
        Rule(LinkExtractor(allow=r'.+mod=list&catid=2&page=\d'), follow=True),
        # Parse each article detail page; do not follow links inside it
        Rule(LinkExtractor(allow=r'.+article-.+\.html'), callback='parse_detail', follow=False),
    )

    def parse_detail(self, response):
        title = response.xpath("//h1[@class='ph']/text()").get()
        author_p = response.xpath("//p[@class='authors']")
        author = author_p.xpath(".//a/text()").get()
        pub_time = author_p.xpath(".//span/text()").get()
        # The article body is split across many text nodes; join them into one string
        content = response.xpath("//td[@id='article_content']//text()").getall()
        content = "".join(content).strip()
        item = WxappItem(title=title, author=author, pub_time=pub_time, content=content)
        yield item
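As a quick sanity check on the two Rule patterns above, the regexes can be tested against sample URLs with the standard re module (the two URLs below are illustrative examples in the site's URL format):

```python
import re

# The same allow patterns used in the spider's rules
list_pattern = r'.+mod=list&catid=2&page=\d'
article_pattern = r'.+article-.+\.html'

list_url = "http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=3"
article_url = "http://www.wxapp-union.com/article-1234-1.html"

# The pagination rule matches list pages but not article pages
print(bool(re.match(list_pattern, list_url)))        # True
print(bool(re.match(list_pattern, article_url)))     # False
# The detail rule matches article pages
print(bool(re.match(article_pattern, article_url)))  # True
```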
The items.py code is as follows:
import scrapy


class WxappItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    pub_time = scrapy.Field()
    content = scrapy.Field()
The pipelines.py code is as follows:
import os

from scrapy.exporters import JsonLinesItemExporter


class WxappPipeline(object):
    def __init__(self):
        # Make sure the output directory exists before opening the file
        os.makedirs("./data", exist_ok=True)
        self.fp = open("./data/wxjs.json", "wb")
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding="utf-8")

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()
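JsonLinesItemExporter writes one JSON object per line rather than a single JSON array, which is why items can be appended to the file one by one as they are scraped. A minimal sketch of the same format using only the standard json module (the sample records are made up for illustration):

```python
import io
import json

# Two hypothetical scraped items
items = [
    {"title": "Example article", "author": "someone", "pub_time": "2020-1-1", "content": "..."},
    {"title": "Another article", "author": "someone else", "pub_time": "2020-1-2", "content": "..."},
]

buf = io.StringIO()
for item in items:
    # ensure_ascii=False keeps non-ASCII text readable, as in the pipeline above
    buf.write(json.dumps(item, ensure_ascii=False) + "\n")

# Each line is an independent JSON document
lines = buf.getvalue().splitlines()
parsed = [json.loads(line) for line in lines]
print(parsed[0]["title"])  # Example article
```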
Create a start.py file in the project root so the crawl command can be launched from within PyCharm; its code is as follows:
from scrapy import cmdline

cmdline.execute("scrapy crawl wxapp_spider".split())
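cmdline.execute expects the command as a list of arguments rather than a single string, which is exactly what str.split produces here:

```python
# Splitting on whitespace turns the command string into an argv-style list
cmd = "scrapy crawl wxapp_spider".split()
print(cmd)  # ['scrapy', 'crawl', 'wxapp_spider']
```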
The final saved JSON file is shown in the figure below: