Scrapy
On Windows, installing Scrapy may fail while building its Twisted dependency with:
error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools"
The workaround is to download a prebuilt Twisted wheel matching your Python version and architecture from
https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
and install it directly before retrying Scrapy:
pip install D:\Python\Python37\Scripts\Twisted-18.9.0-cp37-cp37m-win_amd64.whl
C:\Windows\System32>pip install scrapy
C:\Windows\System32>scrapy -h
Scrapy 1.5.1 - no active project
Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
Engine
- Controls the data flow between all components
- Triggers events based on conditions
Downloader
- Downloads web pages according to requests
Scheduler
- Schedules and manages all crawl requests
Downloader Middleware
- Purpose: user-configurable control over the flow between the Engine, the Scheduler, and the Downloader
- Functions: modify, discard, or add requests or responses (see the sketch after this list)
Spider
- Parses the responses returned by the Downloader
- Produces scraped items
- Produces additional crawl requests (Request)
Item Pipelines
- Process the scraped items produced by Spiders in pipeline fashion
- Consist of a sequence of operations, like an assembly line; each operation is an Item Pipeline class
- Possible operations include cleaning, validating, and deduplicating the HTML data in scraped items, and storing the data in a database
Spider Middleware
- Purpose: reprocess requests and scraped items
- Functions: modify, discard, or add requests or scraped items
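As a concrete illustration of those middleware hooks, here is a minimal sketch of a custom Downloader Middleware (class name and header value are hypothetical); it would be enabled through the DOWNLOADER_MIDDLEWARES setting in settings.py:

from scrapy.exceptions import IgnoreRequest

class DemoDownloaderMiddleware:
    def process_request(self, request, spider):
        # Modify the request before the Downloader fetches it.
        request.headers.setdefault('User-Agent', 'demo-bot/0.1')
        return None  # None means: continue normal processing

    def process_response(self, request, response, spider):
        # Discard blocked responses; otherwise pass the response through.
        if response.status == 403:
            raise IgnoreRequest('blocked: %s' % request.url)
        return response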
scrapy <command> [options] [args]
startproject   Create a new project                  scrapy startproject <name> [dir]
genspider      Create a new spider                   scrapy genspider [options] <name> <domain>
settings       Get spider settings values            scrapy settings [options]
crawl          Run a spider                          scrapy crawl <spider>
list           List all spiders in the project       scrapy list
shell          Start the interactive scraping shell  scrapy shell [url]
http://python123.io/ws/demo.html
C:\Windows\System32>D:
D:\>scrapy startproject python123demo
New Scrapy project 'python123demo', using template directory 'd:\\python\\python37\\lib\\site-packages\\scrapy\\templates\\project', created in:
D:\python123demo
You can start your first spider with:
cd python123demo
scrapy genspider example example.com
scrapy.cfg        Deployment configuration file for the Scrapy project
__init__.py       Initialization script
items.py          Items code template (inherit the class)
middlewares.py    Middlewares code template (inherit the class)
pipelines.py      Pipelines code template (inherit the class)
settings.py       Scrapy settings file
spiders/          Spiders code template directory (inherit the class)
    __init__.py   Init file, no need to modify
    __pycache__/  Cache directory, no need to modify
D:\>cd \python123demo
D:\python123demo>scrapy genspider demo python123.io
Created spider 'demo' using template 'basic' in module:
python123demo.spiders.demo
D:\python123demo\python123demo\spiders
demo.py
# -*- coding: utf-8 -*-
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
    allowed_domains = ['python123.io']
    start_urls = ['http://python123.io/']

    def parse(self, response):
        pass
Edit demo.py so that it downloads and saves the demo page:
# -*- coding: utf-8 -*-
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
    # allowed_domains = ['python123.io']
    start_urls = ['http://python123.io/ws/demo.html']

    def parse(self, response):
        # Save the response body under the last segment of the URL.
        fname = response.url.split('/')[-1]
        with open(fname, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s.' % fname)
D:\python123demo>scrapy crawl demo
The yield keyword
- A generator is a function that keeps producing values
- Any function containing a yield statement is a generator
- A generator produces one value at a time (at the yield statement); the function is then frozen, and produces the next value when resumed
Generator version:
>>> def gen(n):
        for i in range(n):
            yield i ** 2

>>> for i in gen(5):
        print(i, " ", end="")

0 1 4 9 16
Regular version:
>>> def square(n):
        ls = [i ** 2 for i in range(n)]
        return ls

>>> for i in square(5):
        print(i, " ", end="")

0 1 4 9 16
Full version. Because a generator produces values lazily instead of materializing the whole list, a Scrapy parse callback yields its requests and items one at a time:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'

    def start_requests(self):
        # Generator equivalent of the start_urls attribute.
        urls = [
            'http://python123.io/ws/demo.html'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        fname = response.url.split('/')[-1]
        with open(fname, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s.' % fname)
The Request class
class scrapy.http.Request()
- A Request object represents an HTTP request
- Generated by a Spider, executed by the Downloader
.url      The URL of the request
.method   The HTTP method: 'GET', 'POST', etc.
.headers  Dictionary-style request headers
.body     The request body, string type
.meta     User-added extension info, used to pass information between Scrapy components
.copy()   Copy this request
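A small sketch of building a Request by hand (the header and meta values are only illustrative):

import scrapy

req = scrapy.Request(
    url='http://python123.io/ws/demo.html',
    method='GET',
    headers={'User-Agent': 'demo'},
    meta={'page': 1},  # carried through Scrapy and visible on the response
)
print(req.url, req.method)  # http://python123.io/ws/demo.html GET
req2 = req.copy()           # an independent copy of the request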
Response类
calss scrapy.http.Response()
- Response对象表示一个HTTP响应
- 由Downloader生成,由Spider处理
.url Response对应的URL地址
.status HTTP状态码,默认是200
.headers Response对应的头部信息
.body Response对应的内容信息,字符串类型
.flags 一组标记
.request 产生Response类型对应的Request对象
.copy 复制该响应
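A parse callback typically touches these attributes like this (a sketch along the lines of the demo spider above):

def parse(self, response):
    self.log('%s returned %d' % (response.url, response.status))
    if response.status == 200:
        fname = response.url.split('/')[-1]
        with open(fname, 'wb') as f:
            f.write(response.body)  # raw payload of the response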
The Item class
class scrapy.item.Item()
- An Item object represents a piece of information extracted from an HTML page
- Generated by a Spider, processed by Item Pipelines
- An Item is dict-like and can be operated on like a dictionary
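For instance, a hypothetical item with two fields, showing the dict-style access:

import scrapy

class StockItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()

item = StockItem(name='demo')
item['price'] = '10.5'  # dict-style assignment
print(dict(item))       # {'name': 'demo', 'price': '10.5'}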
CSS Selector
<HTML>.css('a::attr(href)').extract()
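Inside scrapy shell <url> this looks like the following (the output shown is only illustrative):

>>> response.css('a::attr(href)').extract()   # all href attribute values
['http://example.com/page1', 'http://example.com/page2']
>>> response.css('a::text').extract_first()   # text of the first <a> element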
Stock-data Scrapy crawler
East Money stock list: http://quote.eastmoney.com/stocklist.html
Baidu Stocks: https://gupiao.baidu.com/stock/
A single stock: https://gupiao.baidu.com/stock/xxx.html
C:\Windows\System32>cd ..\..\Python3.7.0
C:\Python3.7.0>scrapy startproject BaiduStocks
New Scrapy project 'BaiduStocks', using template directory 'd:\\python\\python37\\lib\\site-packages\\scrapy\\templates\\project', created in:
C:\Python3.7.0\BaiduStocks
You can start your first spider with:
cd BaiduStocks
scrapy genspider example example.com
C:\Python3.7.0>cd BaiduStocks
C:\Python3.7.0\BaiduStocks>scrapy genspider stocks baidu.com
Created spider 'stocks' using template 'basic' in module:
BaiduStocks.spiders.stocks
CONCURRENT_REQUESTS             Maximum number of requests the Downloader downloads concurrently, default 32
CONCURRENT_ITEMS                Maximum number of items the Item Pipelines process concurrently, default 100
CONCURRENT_REQUESTS_PER_DOMAIN  Maximum number of concurrent requests per target domain, default 8
CONCURRENT_REQUESTS_PER_IP      Maximum number of concurrent requests per target IP, default 0 (only effective when non-zero)
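These options are tuned in the project's settings.py, e.g. (the values here are just the defaults listed above):

# settings.py
CONCURRENT_REQUESTS = 32             # global cap on concurrent downloads
CONCURRENT_ITEMS = 100               # concurrent items in the pipelines
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # per-domain cap
CONCURRENT_REQUESTS_PER_IP = 0       # 0 disables the per-IP cap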
# -*- coding: utf-8 -*-
import scrapy
import re


class StocksSpider(scrapy.Spider):
    name = 'stocks'
    start_urls = ['http://quote.eastmoney.com/stocklist.html']

    def parse(self, response):
        # Collect stock codes such as sh600000 / sz000001 from the list page.
        for href in response.css('a::attr(href)').extract():
            try:
                stock = re.findall(r"[s][hz]\d{6}", href)[0]
                url = 'https://gupiao.baidu.com/stock/' + stock + '.html'
                yield scrapy.Request(url, callback=self.parse_stock)
            except:
                continue

    def parse_stock(self, response):
        infoDict = {}
        stockInfo = response.css('.stock-bets')
        name = stockInfo.css('.bets-name').extract()[0]
        keyList = stockInfo.css('dt').extract()
        valueList = stockInfo.css('dd').extract()
        for i in range(len(keyList)):
            # Strip the surrounding <dt>...</dt> / <dd>...</dd> markup.
            key = re.findall(r'>.*</dt>', keyList[i])[0][1:-5]
            try:
                val = re.findall(r'\d+\.?.*</dd>', valueList[i])[0][0:-5]
            except:
                val = '--'
            infoDict[key] = val
        infoDict.update(
            {'股票名称': re.findall(r'\s.*\(', name)[0].split()[0] +
                         re.findall(r'\>.*\<', name)[0][1:-1]})
        yield infoDict
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class BaidustocksPipeline(object):
    def process_item(self, item, spider):
        return item


class BaidustocksInfoPipeline(object):
    def open_spider(self, spider):
        # Called when the spider starts: open the output file.
        self.f = open('BaiduStockInfo.txt', 'w')

    def close_spider(self, spider):
        # Called when the spider finishes: close the output file.
        self.f.close()

    def process_item(self, item, spider):
        try:
            line = str(dict(item)) + '\n'
            self.f.write(line)
        except:
            pass
        return item
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'BaiduStocks.pipelines.BaidustocksInfoPipeline': 300,
}
On Windows, running the spider also requires the pywin32 bindings:
C:\Python3.7.0\BaiduStocks>pip install pypiwin32
C:\Python3.7.0\BaiduStocks>scrapy crawl stocks