1. tornado -> what's this
1.1 How to install
pip install tornado
1.2 Hello, world
import tornado.ioloop
import tornado.web

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("Hello, world")

application = tornado.web.Application([
    (r"/", MainHandler),
])

if __name__ == "__main__":
    application.listen(8888)
    tornado.ioloop.IOLoop.instance().start()
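Running this script starts a server on port 8888; opening http://localhost:8888/ in a browser returns "Hello, world".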
1.3 A more complex example
- URL router
class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("You requested the main page")

class StoryHandler(tornado.web.RequestHandler):
    def get(self, story_id):
        self.write("You requested the story " + story_id)

application = tornado.web.Application([
    (r"/", MainHandler),
    (r"/story/([0-9]+)", StoryHandler),
])
Incoming request URLs are matched against these regular expressions and dispatched to the corresponding handler class; regex capture groups (the story ID above) are passed to the handler method as arguments.
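Capture groups may also be named, in which case Tornado passes them to the handler as keyword arguments. A minimal sketch (the /user route and UserHandler are illustrative, not part of the original):

class UserHandler(tornado.web.RequestHandler):
    def get(self, name):
        self.write("Hello, " + name)

application = tornado.web.Application([
    (r"/user/(?P<name>[a-z]+)", UserHandler),
])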
- Templates
template.html:
<html>
  <head>
    <title>{{ title }}</title>
  </head>
  <body>
    <ul>
      {% for item in items %}
        <li>{{ escape(item) }}</li>
      {% end %}
    </ul>
  </body>
</html>
handler.py
class MainHandler(tornado.web.RequestHandler):
    def get(self):
        items = ["Item 1", "Item 2", "Item 3"]
        self.render("template.html", title="My title", items=items)
- Configuration
import os

import tornado.ioloop
import tornado.web

app = tornado.web.Application(
    [
        (r"/", MainHandler),
    ],
    cookie_secret="__TODO:_GENERATE_YOUR_OWN_RANDOM_VALUE_HERE__",
    login_url="/auth/login",
    template_path=os.path.join(os.path.dirname(__file__), "templates"),
    static_path=os.path.join(os.path.dirname(__file__), "static"),
    xsrf_cookies=True,
)
app.listen(8080)
tornado.ioloop.IOLoop.instance().start()
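With cookie_secret and login_url configured as above, handlers can require a logged-in user. A minimal sketch, assuming a login handler elsewhere sets a signed "user" cookie:

class BaseHandler(tornado.web.RequestHandler):
    def get_current_user(self):
        # Returns None when the signed cookie is missing; in that case
        # @tornado.web.authenticated redirects the request to login_url.
        return self.get_secure_cookie("user")

class ProfileHandler(BaseHandler):
    @tornado.web.authenticated
    def get(self):
        # get_secure_cookie returns bytes, so concatenate as bytes.
        self.write(b"Hello, " + self.current_user)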
More reference examples: https://github.com/facebook/tornado/demo
Reference 1: http://www.tornadoweb.org/en/stable/overview.html
2. scrapy -> what's this
2.1 How to install
apt-get install scrapy
yum install scrapy
pip install scrapy ## installs the latest version (pip install -U scrapy upgrades an existing install)
2.2 How to use
2.2.1 After installing Scrapy, the scrapy command becomes available. Run:
scrapy startproject <project-name>
Scrapy automatically creates a project directory containing the basic required files.
The structure is as follows:
<project-name>/          # project root
    scrapy.cfg
    <project-name>/      # the project's Python module
        __init__.py
        items.py         # module where Items are defined
        pipelines.py     # module where pipeline classes are implemented
        settings.py      # project settings file
        spiders/         # directory holding the Spider implementations
            __init__.py
            ...
Once you have implemented a spider module of your own in the spiders/ directory, change to the top-level <project-name> directory and run:
scrapy crawl <spider-name>
to run the spider you implemented. Note: <spider-name> is the name attribute set inside the spider module, as shown below:
2.2.2 Spider
Description: a Spider takes the given URLs as entry points, requests those pages, and then processes the page content.
Page content is processed with Selectors (see 2.2.3):
from scrapy.selector import Selector
from scrapy.spider import BaseSpider  # old-style base class; newer Scrapy uses scrapy.Spider
from scrapy.http import Request
from myproject.items import MyItem

class MySpider(BaseSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        sel = Selector(response)
        # Emit an item for every <h3> on the page...
        for h3 in sel.xpath('//h3').extract():
            yield MyItem(title=h3)
        # ...and follow every link, parsing it with this same method.
        for url in sel.xpath('//a/@href').extract():
            yield Request(url, callback=self.parse)
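Since this spider sets name = 'example.com', it is launched with scrapy crawl example.com from the project directory.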
2.2.3 Selector
Description: sel.css() selects page elements using CSS selectors (sel.xpath() does the same with XPath).
You can run the command scrapy shell <url> to enter an interactive console and test element selection against a live page, for example:
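A sample session (the URL and output here are illustrative):

scrapy shell 'http://www.example.com'
>>> sel.css('title::text').extract()
[u'Example Domain']
>>> sel.xpath('//h1/text()').extract()
[u'Example Domain']

The parse method below uses the same kind of CSS selectors inside a spider: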
def parse(self, response):
    sel = Selector(response)
    url = response.url
    if url.endswith("/"):
        url = url[:-1]
    # self.title, self.base_url and self.chap_callback are attributes of the
    # surrounding spider (not shown here).
    if url not in self.title:
        self.title[url] = sel.css("#maininfo #info h1::text").extract()[0]
    chapters = sel.css(".box_con dd a::attr(href)").extract()
    for chapter in chapters:
        chapter = self.base_url + chapter
        print(chapter)
        yield Request(chapter, callback=self.chap_callback)
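Note that ::text and ::attr(...) are Scrapy extensions to standard CSS, selecting text nodes and attribute values respectively.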
2.2.4 Item
Description: Items are the entity classes used to hold scraped data:
from scrapy.item import Item, Field

class Product(Item):
    name = Field()
    price = Field()
    stock = Field()
    last_updated = Field(serializer=str)
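The transcripts below assume a Product instance was created first:
>>> product = Product(name='Desktop PC', price=1000)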
Getting field values:
>>> product['name']
Desktop PC
>>> product.get('name')
Desktop PC
>>> product.get('last_updated', 'not set')
not set
>>> 'last_updated' in product.fields # is last_updated a declared field?
True
>>> 'lala' in product.fields # is lala a declared field?
False
Setting field values:
>>> product['last_updated'] = 'today'
>>> product['last_updated']
today
>>> product['lala'] = 'test' # setting unknown field
Traceback (most recent call last):
...
KeyError: 'Product does not support field: lala'
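An Item can also be copied into an ordinary dict, which is what makes serialization straightforward (the JSON pipeline in the next section relies on this):
>>> dict(product)  # all populated fields as a plain dict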
2.2.5 Pipeline
Description: pipelines process the Items returned by spiders.
import json

class JsonWriterPipeline(object):
    def __init__(self):
        self.file = open('items.jl', 'w')

    def process_item(self, item, spider):
        # Serialize each item as one JSON object per line (JSON Lines format).
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
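The settings below also reference a PricePipeline, which these notes never define; a minimal sketch of what such a pipeline might look like (the class name matches the settings entry, but the VAT logic is illustrative):

from scrapy.exceptions import DropItem

class PricePipeline(object):
    vat_factor = 1.15

    def process_item(self, item, spider):
        # Adjust the price field, or drop items that lack a price.
        if item.get('price'):
            item['price'] = item['price'] * self.vat_factor
            return item
        raise DropItem("Missing price in %s" % item)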
Enabling pipelines: in settings.py, declare which pipeline classes should process Items; the number after each class is its priority, and the lower the number, the earlier it runs.
ITEM_PIPELINES = {
    'myproject.pipeline.PricePipeline': 300,
    'myproject.pipeline.JsonWriterPipeline': 800,
}
Reference 2: http://doc.scrapy.org/en/latest/intro/tutorial.html