1. Common Scrapy Framework Commands

Command        Description                                  Format
startproject   Create a new project                         scrapy startproject <name> [dir]
genspider      Create a new spider                          scrapy genspider [options] <name> <domain>
settings       Get the spider configuration values          scrapy settings [options]
crawl          Run a spider                                 scrapy crawl <spider>
list           List all spiders in the project              scrapy list
shell          Start the interactive URL-debugging shell    scrapy shell [url]
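Of these, shell is the handiest for trying out selectors before writing any spider code. A minimal session sketch (the URL is just the stock-list page used in section 5; any reachable page works):

scrapy shell 'http://quote.eastmoney.com/stocklist.html'
>>> response.status                               # HTTP status code of the fetch
>>> response.css('a::attr(href)').extract()[:5]   # first few link targets
>>> exit()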
2. Using Scrapy

scrapy startproject demo                     # create a new project named demo
scrapy genspider spiderdemo www.baidu.com    # create a spider for Baidu
scrapy crawl spiderdemo                      # run the spider
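After scrapy startproject demo, the generated layout is roughly the standard Scrapy scaffolding below (the spider file only appears once genspider has been run):

demo/
    scrapy.cfg                # deployment configuration
    demo/
        __init__.py
        items.py              # item definitions
        middlewares.py        # spider and downloader middlewares
        pipelines.py          # item pipelines
        settings.py           # project settings
        spiders/
            __init__.py
            spiderdemo.py     # created by the genspider command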
3. The Request Class

Attribute/Method   Description
.url               URL of the request
.method            Request method: 'GET', 'POST', etc.
.headers           Dictionary-style request headers
.body              Request body content, string type
.meta              User-added extra information, passed between Scrapy components
.copy()            Copy the request
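A short sketch of these fields in use; .meta is the usual way to carry data from a request into its response callback (the URL and key names here are illustrative, not from the original project):

import scrapy

# Build a request and attach extra information via .meta;
# the same dict reappears as response.meta in the callback.
req = scrapy.Request(
    url='https://example.com/stock/sh600000.html',  # illustrative URL
    method='GET',
    meta={'stock_code': 'sh600000'},
)
print(req.url, req.method)   # inspect .url and .method
req2 = req.copy()            # .copy() returns an equivalent request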
4. The Response Class

Attribute/Method   Description
.url               URL of the response
.status            HTTP status code
.headers           Response headers
.body              Response body content, string type
.flags             A set of flags
.request           The Request object that produced this response
.copy()            Copy the response
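Inside a callback these attributes are read directly off the response; a minimal sketch (the spider below is illustrative, not part of the stock project):

import scrapy

class DetailSpider(scrapy.Spider):
    name = 'detail_demo'
    start_urls = ['https://example.com/']   # illustrative URL

    def parse(self, response):
        if response.status != 200:           # .status: HTTP status code
            return
        # .request points back at the Request that produced this response,
        # so anything attached to its .meta earlier is available again
        stock = response.request.meta.get('stock_code', '--')
        self.logger.info('%s: %d bytes, stock=%s',
                         response.url, len(response.body), stock)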
5. A Targeted Stock-Data Crawler Built on Scrapy
# pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class Python123DemoPipeline(object):
    def process_item(self, item, spider):
        return item


class BaidustocksPipeline(object):
    def open_spider(self, spider):
        # Called once when the spider starts: open the output file
        self.f = open('Stock.txt', 'w')

    def close_spider(self, spider):
        # Called once when the spider finishes: close the output file
        self.f.close()

    def process_item(self, item, spider):
        # Write each item as one line of text; skip anything that fails
        try:
            line = str(dict(item)) + '\n'
            self.f.write(line)
        except:
            pass
        return item
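Two design points worth noting: open_spider and close_spider bracket the whole crawl, so the output file is opened once rather than once per item, and the pipeline only runs at all after it is registered under ITEM_PIPELINES in settings.py (done below with priority 300).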
# demo.py
# -*- coding: utf-8 -*-
import scrapy
import re


class DemoSpider(scrapy.Spider):
    name = 'demo'
    # allowed_domains = ['python123.io']
    start_urls = ['http://quote.eastmoney.com/stocklist.html']
    stock_info_url = 'https://gupiao.baidu.com/stock/'

    def parse(self, response):
        # Find stock codes such as sh600000 / sz000001 in every link target
        for href in response.css('a::attr(href)').extract():
            try:
                stock = re.findall(r"[s][hz]\d{6}", href)[0]
                url = self.stock_info_url + stock + '.html'
                yield scrapy.Request(url, callback=self.parse_stock)
            except:
                continue

    def parse_stock(self, response):
        infoDict = {}
        stockInfo = response.css('.stock-bets')
        name = stockInfo.css('.bets-name').extract()[0]
        keyList = stockInfo.css('dt').extract()
        valueList = stockInfo.css('dd').extract()
        for i in range(len(keyList)):
            # Strip the surrounding <dt>/<dd> markup to get plain text
            key = re.findall(r'>.*</dt>', keyList[i])[0][1:-5]
            try:
                val = re.findall(r'\d+\.?.*</dd>', valueList[i])[0][0:-5]
            except:
                val = '--'
            infoDict[key] = val
        infoDict.update(
            {'股票名称': re.findall(r'\s.*\(', name)[0].split()[0]
                       + re.findall(r'\>.*\<', name)[0][1:-1]})
        yield infoDict
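To recap the flow: parse() scans every link on the stock-list page for codes of the form sh/sz plus six digits and schedules one request per stock, while parse_stock() scrapes the dt/dd key-value pairs from the detail page and yields a plain dict, which Scrapy then hands to BaidustocksPipeline.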
# settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for python123demo project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'python123demo'
SPIDER_MODULES = ['python123demo.spiders']
NEWSPIDER_MODULE = 'python123demo.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'python123demo (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'python123demo.middlewares.Python123DemoSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'python123demo.middlewares.Python123DemoDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'python123demo.pipelines.BaidustocksPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
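With the pipeline registered above, scrapy crawl demo drives the whole chain and appends one line per stock to Stock.txt. Note that the gupiao.baidu.com pages this example targets may no longer be online, so treat the selectors and URLs as a template to adapt rather than something to run verbatim.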