Python Web Crawler Notes (3)

1. Common Scrapy Framework Commands

| Command      | Description                                      | Format                                     |
| ------------ | ------------------------------------------------ | ------------------------------------------ |
| startproject | Create a new project                             | scrapy startproject <name> [dir]           |
| genspider    | Create a new spider                              | scrapy genspider [options] <name> <domain> |
| settings     | Get the crawler's configuration settings         | scrapy settings [options]                  |
| crawl        | Run a spider                                     | scrapy crawl <spider>                      |
| list         | List all spiders in the project                  | scrapy list                                |
| shell        | Launch the interactive shell for debugging a URL | scrapy shell [url]                         |

2. Using Scrapy

scrapy startproject demo                     # create a new project named demo

scrapy genspider spiderdemo www.baidu.com    # create a spider for www.baidu.com

scrapy crawl spiderdemo                      # run the spider
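
For reference, running scrapy genspider spiderdemo www.baidu.com creates a spider module under demo/spiders/ roughly like the following. This is a sketch of the default template; the exact content varies with the Scrapy version.

# demo/spiders/spiderdemo.py (approximate generated skeleton)
import scrapy

class SpiderdemoSpider(scrapy.Spider):
    name = 'spiderdemo'
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        # parsing logic for downloaded pages goes here
        pass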

3. The Request Class

| Attribute / Method | Description                                                        |
| ------------------ | ------------------------------------------------------------------ |
| .url               | URL of the Request                                                 |
| .method            | HTTP method: 'GET', 'POST', etc.                                   |
| .headers           | Request headers, dictionary-like                                   |
| .body              | Request content body, as a string                                  |
| .meta              | User-added extra information, passed between Scrapy's internal modules |
| .copy()            | Copy the request                                                   |
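
A minimal sketch of constructing a Request by hand and reading these attributes; the URL and header values below are placeholders, not part of the project.

# example: building and copying a Request (placeholder values)
import scrapy

req = scrapy.Request(
    url='https://example.com/page',           # placeholder URL
    method='GET',
    headers={'User-Agent': 'demo-agent'},     # dictionary-like request headers
    meta={'note': 'extra info passed between Scrapy modules'},
)
req2 = req.copy()        # .copy() returns an independent copy of the request
print(req2.url, req2.method, req2.meta['note'])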

4. The Response Class

| Attribute / Method | Description                                          |
| ------------------ | ---------------------------------------------------- |
| .url               | URL of the Response                                  |
| .status            | HTTP status code                                     |
| .headers           | Response headers                                     |
| .body              | Response content body, as a string                   |
| .flags             | A set of flags                                       |
| .request           | The Request object that produced this Response       |
| .copy()            | Copy the response                                    |
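
These attributes are normally read inside a spider callback; a minimal sketch is shown below, where the CSS selector and the yielded field name are illustrative.

# example: inspecting a Response inside a spider callback (illustrative selector)
def parse(self, response):
    print(response.url)                      # URL the response corresponds to
    print(response.status)                   # HTTP status code, e.g. 200
    print(response.headers.get('Content-Type'))
    print(len(response.body))                # raw response body
    print(response.request.url)              # the Request that produced this Response
    for title in response.css('title::text').extract():
        yield {'page_title': title}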

5. A Targeted Stock-Data Crawler Built on Scrapy

# pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

class Python123DemoPipeline(object):
    def process_item(self, item, spider):
        return item


class BaidustocksPipeline(object):
    def open_spider(self, spider):
        # called when the spider starts: open the output file
        self.f = open('Stock.txt', 'w')

    def close_spider(self, spider):
        # called when the spider finishes: close the output file
        self.f.close()

    def process_item(self, item, spider):
        # write each item to Stock.txt as one dict per line
        try:
            line = str(dict(item)) + '\n'
            self.f.write(line)
        except:
            pass
        return item
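
This pipeline only takes effect once it is listed in ITEM_PIPELINES (see settings.py below). The same open_spider/process_item/close_spider hooks could just as well write JSON lines; the following variant is a hypothetical sketch, not part of the original project.

# example: a JSON-lines variant of the pipeline (hypothetical)
import json

class JsonStocksPipeline(object):
    def open_spider(self, spider):
        self.f = open('Stock.jsonl', 'w')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        # one JSON object per line; ensure_ascii=False keeps Chinese keys readable
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item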

# demo.py
# -*- coding: utf-8 -*-
import scrapy
import re


class DemoSpider(scrapy.Spider):
    name = 'demo'
    # allowed_domains = ['python123.io']
    start_urls = ['http://quote.eastmoney.com/stocklist.html']
    stock_info_url = 'https://gupiao.baidu.com/stock/'

    def parse(self, response):
        # find stock codes (sh/sz plus 6 digits) in every link on the listing page
        for href in response.css('a::attr(href)').extract():
            try:
                stock = re.findall(r"[s][hz]\d{6}", href)[0]
                url = 'https://gupiao.baidu.com/stock/' + stock + '.html'
                yield scrapy.Request(url, callback=self.parse_stock)
            except:
                continue

    def parse_stock(self, response):
        infoDict = {}
        stockInfo = response.css('.stock-bets')
        name = stockInfo.css('.bets-name').extract()[0]
        keyList = stockInfo.css('dt').extract()
        valueList = stockInfo.css('dd').extract()
        for i in range(len(keyList)):
            # strip the surrounding <dt>/<dd> markup to get plain key/value text
            key = re.findall(r'>.*</dt>', keyList[i])[0][1:-5]
            try:
                val = re.findall(r'\d+\.?.*</dd>', valueList[i])[0][0:-5]
            except:
                val = "--"
            infoDict[key] = val
        infoDict.update(
            {'股票名称': re.findall('\s.*\(', name)[0].split()[0]
                       + re.findall('\>.*\<', name)[0][1:-1]})
        yield infoDict
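
The stock-code pattern used in parse can be checked on its own; the hrefs below are made-up examples for illustration.

# example: testing the stock-code regex outside Scrapy (sample hrefs are illustrative)
import re

hrefs = [
    'http://quote.eastmoney.com/sh600000.html',   # hypothetical Shanghai code
    'http://quote.eastmoney.com/sz000001.html',   # hypothetical Shenzhen code
    'http://quote.eastmoney.com/help.html',       # link without a stock code
]
for href in hrefs:
    match = re.findall(r"[s][hz]\d{6}", href)
    print(href, '->', match[0] if match else 'no match')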

# settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for python123demo project

#

# For simplicity, this file contains only settings considered important or

# commonly used. You can find more settings consulting the documentation:

#

# https://doc.scrapy.org/en/latest/topics/settings.html

# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html

# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'python123demo'

SPIDER_MODULES = ['python123demo.spiders']

NEWSPIDER_MODULE = 'python123demo.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent

#USER_AGENT = 'python123demo (+http://www.yourdomain.com)'

# Obey robots.txt rules

ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)

#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)

# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay

# See also autothrottle settings and docs

#DOWNLOAD_DELAY = 3

# The download delay setting will honor only one of:

#CONCURRENT_REQUESTS_PER_DOMAIN = 16

#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)

#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)

#TELNETCONSOLE_ENABLED = False

# Override the default request headers:

#DEFAULT_REQUEST_HEADERS = {

# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',

# 'Accept-Language': 'en',

#}

# Enable or disable spider middlewares

# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html

#SPIDER_MIDDLEWARES = {

# 'python123demo.middlewares.Python123DemoSpiderMiddleware': 543,

#}

# Enable or disable downloader middlewares

# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html

#DOWNLOADER_MIDDLEWARES = {

# 'python123demo.middlewares.Python123DemoDownloaderMiddleware': 543,

#}

# Enable or disable extensions

# See https://doc.scrapy.org/en/latest/topics/extensions.html

#EXTENSIONS = {

# 'scrapy.extensions.telnet.TelnetConsole': None,

#}

# Configure item pipelines

# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html

ITEM_PIPELINES = {
    'python123demo.pipelines.BaidustocksPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)

# See https://doc.scrapy.org/en/latest/topics/autothrottle.html

#AUTOTHROTTLE_ENABLED = True

# The initial download delay

#AUTOTHROTTLE_START_DELAY = 5

# The maximum download delay to be set in case of high latencies

#AUTOTHROTTLE_MAX_DELAY = 60

# The average number of requests Scrapy should be sending in parallel to

# each remote server

#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received:

#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)

# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings

#HTTPCACHE_ENABLED = True

#HTTPCACHE_EXPIRATION_SECS = 0

#HTTPCACHE_DIR = 'httpcache'

#HTTPCACHE_IGNORE_HTTP_CODES = []

#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
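
With BaidustocksPipeline registered in ITEM_PIPELINES above, the crawler is started from the project directory with:

scrapy crawl demo

and each collected item is written to Stock.txt by the pipeline.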
