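Downloading and installing the items below went smoothly, so I'll skip the details.
Install Python 3.5. Make sure to tick the option that adds it to the PATH environment variable, otherwise you'll have to sort that out later.
Install PTVS (Python Tools for Visual Studio) for VS2015.
Install IronPython.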
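1. Installing twisted fails:
The error says vcvarsall.bat is missing.
Fix:
The usual cause is an incomplete VS2015 installation. Mount the ISO again and add the Visual C++ language component.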
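Before re-running the VS2015 setup, a rough check of whether vcvarsall.bat is where I would expect it can save some time. The sketch below assumes that VS2015 sets the VS140COMNTOOLS environment variable and that vcvarsall.bat sits in the VC folder two levels above it; the helper name vcvarsall_candidate is made up for this post, and this is not how Python's build tools actually locate the compiler.

```python
import ntpath
import os


def vcvarsall_candidate(environ=os.environ):
    """Guess where vcvarsall.bat should be for VS2015, or return None.

    Assumption: VS140COMNTOOLS points at <VS root>\\Common7\\Tools\\.
    """
    tools_dir = environ.get("VS140COMNTOOLS")
    if not tools_dir:
        return None
    # ntpath is used so the Windows path logic behaves the same on any OS.
    return ntpath.normpath(
        ntpath.join(tools_dir, "..", "..", "VC", "vcvarsall.bat")
    )


if __name__ == "__main__":
    path = vcvarsall_candidate()
    if path is None:
        print("VS140COMNTOOLS is not set; VS2015 may not be installed")
    else:
        print(path, "exists" if os.path.isfile(path) else "is missing")
```

If the file is reported missing, the Visual C++ component is most likely not installed.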
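2. Installing scrapy fails:
The error says libxml2 is missing.
Fix:
Go to the following address:
http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml
Find the lxml build that matches your Python version and architecture, download it, then run the following cmd command: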
pip install F:\python3.5\lxml-3.6.4-cp35-cp35m-win32.whl
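Picking the right file is the easy place to slip up here, so this is a small sketch of how the wheel file name can be read. It follows the wheel naming convention (name, version, optional build tag, Python tag, ABI tag, platform tag); the function names parse_wheel_name and interpreter_summary are my own, and the Python tag computed for the running interpreter assumes CPython.

```python
import struct
import sys


def parse_wheel_name(filename):
    """Split a wheel file name into its tags."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: %r" % filename)
    parts = filename[:-len(".whl")].split("-")
    if len(parts) not in (5, 6):  # 6 parts when a build tag is present
        raise ValueError("unexpected wheel name: %r" % filename)
    return {
        "name": parts[0],
        "version": parts[1],
        "python": parts[-3],
        "abi": parts[-2],
        "platform": parts[-1],
    }


def interpreter_summary():
    """Return the (python tag, pointer size in bits) of this interpreter."""
    python_tag = "cp%d%d" % sys.version_info[:2]  # assumes CPython
    bits = struct.calcsize("P") * 8
    return python_tag, bits


if __name__ == "__main__":
    print(parse_wheel_name("lxml-3.6.4-cp35-cp35m-win32.whl"))
    print(interpreter_summary())
```

A 32-bit Python 3.5 should report cp35 and 32, which matches the cp35 and win32 tags of the file used above; a 64-bit interpreter would need the win_amd64 build instead.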
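3. Running scrapy raises an error:
The error says there is no win32api module.
Fix:
The cause is that pywin32 is missing. Go to http://sourceforge.net/projects/pywin32/files/,
find the build that matches your Python version, download it, and run the installer.
If it then reports that Python 3.5 was not found:
Fix: run the following script: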
#
# script to register Python 2.0 or later for use with win32all
# and other extensions that require Python registry settings
#
# written by Joakim Loew for Secret Labs AB / PythonWare
#
# source:
# http://www.pythonware.com/products/works/articles/regpy20.htm
#
# modified by Valentine Gogichashvili as described in http://www.mail-archive.com/distutils-sig@python.org/msg10512.html
import sys
from winreg import *

# tweak as necessary
version = sys.version[:3]
installpath = sys.prefix
regpath = "SOFTWARE\\Python\\Pythoncore\\%s\\" % (version)
installkey = "InstallPath"
pythonkey = "PythonPath"
pythonpath = "%s;%s\\Lib\\;%s\\DLLs\\" % (
    installpath, installpath, installpath
)

def RegisterPy():
    try:
        reg = OpenKey(HKEY_CURRENT_USER, regpath)
    except EnvironmentError as e:
        try:
            reg = CreateKey(HKEY_CURRENT_USER, regpath)
            SetValue(reg, installkey, REG_SZ, installpath)
            SetValue(reg, pythonkey, REG_SZ, pythonpath)
            CloseKey(reg)
        except:
            print("*** Unable to register!")
            return
        print("--- Python", version, "is now registered!")
        return
    if (QueryValue(reg, installkey) == installpath and
            QueryValue(reg, pythonkey) == pythonpath):
        CloseKey(reg)
        print("=== Python", version, "is already registered!")
        return
    CloseKey(reg)
    print("*** Unable to register!")
    print("*** You probably have another Python installation!")

if __name__ == "__main__":
    RegisterPy()
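Once pywin32 and the other packages are installed, a quick way to confirm that nothing is still missing is to ask Python whether each module can be found. This is only a minimal sketch: the helper name is_importable and the list of modules are my own choice, based on the packages mentioned so far.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Modules that the errors in steps 1 to 3 were about.
    for name in ("twisted", "lxml", "win32api", "scrapy"):
        print(name, "OK" if is_importable(name) else "missing")
```

Any module reported as missing points back to the corresponding step above.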
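4. Hello World
Change into the directory where you keep your code and run the following cmd command:
scrapy startproject tutorial
Create a new file dmoz_spider.py in the tutorial/spiders directory with the following content: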
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
Change into the project's root directory and run the following cmd command:
scrapy crawl dmoz
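To see which files the spider writes, the file name logic from parse can be tried on its own without running a crawl. The sketch below copies that one expression into a helper that I have called page_filename; it is plain Python and does not need Scrapy.

```python
def page_filename(url):
    """Mirror the spider's rule: use the second-to-last '/'-separated piece."""
    return url.split("/")[-2]


start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]

if __name__ == "__main__":
    for url in start_urls:
        print(page_filename(url))
```

Note that the rule relies on the trailing slash: for a URL that ends in .../Python/Books without the slash, the same expression picks "Python" instead of "Books".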
Example: a CrawlSpider with Rules
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# A bare scrapy.Item() declares no fields and raises KeyError on assignment,
# so the fields are declared here (ExampleItem is a name chosen for this example).
class ExampleItem(scrapy.Item):
    id = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not 'subsection.php') and follow them
        # (no callback means follow defaults to True)
        Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),
        # Extract links matching 'item.php' and parse them with the spider's parse_item method
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        item = ExampleItem()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
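To get a feel for how the allow and deny patterns in the first rule interact, here is a simplified approximation in plain Python. It is not Scrapy's implementation: it only assumes that a link is dropped when any deny pattern matches, and otherwise kept when any allow pattern matches (or when no allow patterns are given). The name is_followed and the sample URLs are made up for illustration.

```python
import re

ALLOW = (r'category\.php',)
DENY = (r'subsection\.php',)


def is_followed(url, allow=ALLOW, deny=DENY):
    """Approximate the allow/deny filtering of a link extractor."""
    # A matching deny pattern excludes the link regardless of allow.
    if any(re.search(pattern, url) for pattern in deny):
        return False
    # With no allow patterns, everything not denied is accepted.
    return not allow or any(re.search(pattern, url) for pattern in allow)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/category.php?id=3",
        "http://www.example.com/subsection.php?parent=category.php",
        "http://www.example.com/about.html",
    ):
        print(url, is_followed(url))
```

The second sample URL contains both patterns and is rejected, which is the "but not subsection.php" behaviour described in the rule's comment.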