Installing Scrapy on Windows and Getting Started


The download and installation steps below went smoothly, so the details are omitted.

Install Python 3.5. Be sure to check the option that adds Python to the PATH environment variable, or you will have to fix it by hand later.
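To confirm the PATH entry took effect (a quick check, not from the original steps), open a new cmd window and run:

python --version

If PATH is set correctly, this prints Python 3.5.x.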

Install PTVS (Python Tools for Visual Studio) for VS2015.

Install IronPython.


Pitfalls hit while installing the various packages, mostly while installing Scrapy:
1. Error installing Twisted:
The build fails with a message that vcvarsall.bat is missing.
Solution:
This means the VS2015 installation is incomplete; re-run the installer from the ISO and add the Visual C++ language components.
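Alternatively (an option not mentioned in the original post), the same Gohlke index used for lxml below also hosts prebuilt Twisted wheels, which avoids needing the compiler at all:

:: hypothetical filename - download the cp35/win32 wheel that matches your setup
pip install F:\python3.5\Twisted-16.4.1-cp35-cp35m-win32.whl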


2. Error installing Scrapy:
The build fails with a message that libxml2 is missing.
Solution:
Go to
http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml
find the build matching your Python version, download it, and run the following cmd command:
pip install F:\python3.5\lxml-3.6.4-cp35-cp35m-win32.whl
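To verify the wheel installed correctly (a quick sanity check added here), import lxml from cmd and print its version tuple:

python -c "import lxml.etree; print(lxml.etree.LXML_VERSION)"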




3. Error running Scrapy:
It complains that win32api is missing.
Solution:
The cause is a missing pywin32 package. Go to http://sourceforge.net/projects/pywin32/files/
find the build matching your Python version, download it, and run the installer.
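As an alternative (untested here, so treat it as an assumption), the pypiwin32 package on PyPI repackages pywin32 for installation via pip:

:: assumption: a pypiwin32 build exists for your Python version
pip install pypiwin32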


If the pywin32 installer then reports that it cannot find Python 3.5:

Solution: the installer looks the interpreter up in the Windows registry; run the following script to register it:

#
# script to register Python 2.0 or later for use with win32all
# and other extensions that require Python registry settings
#
# written by Joakim Loew for Secret Labs AB / PythonWare
#
# source:
# http://www.pythonware.com/products/works/articles/regpy20.htm
#
# modified by Valentine Gogichashvili as described in http://www.mail-archive.com/distutils-sig@python.org/msg10512.html

import sys
from winreg import *

# tweak as necessary
version = sys.version[:3]   # e.g. "3.5"
installpath = sys.prefix    # e.g. "F:\\python3.5"

regpath = "SOFTWARE\\Python\\Pythoncore\\%s\\" % (version)
installkey = "InstallPath"
pythonkey = "PythonPath"
pythonpath = "%s;%s\\Lib\\;%s\\DLLs\\" % (
    installpath, installpath, installpath
)

def RegisterPy():
    try:
        # if the key already exists, fall through to the consistency check below
        reg = OpenKey(HKEY_CURRENT_USER, regpath)
    except EnvironmentError:
        try:
            # key is missing: create it and record the install and module paths
            reg = CreateKey(HKEY_CURRENT_USER, regpath)
            SetValue(reg, installkey, REG_SZ, installpath)
            SetValue(reg, pythonkey, REG_SZ, pythonpath)
            CloseKey(reg)
        except:
            print("*** Unable to register!")
            return
        print("--- Python", version, "is now registered!")
        return
    if (QueryValue(reg, installkey) == installpath and
        QueryValue(reg, pythonkey) == pythonpath):
        CloseKey(reg)
        print("=== Python", version, "is already registered!")
        return
    CloseKey(reg)
    print("*** Unable to register!")
    print("*** You probably have another Python installation!")

if __name__ == "__main__":
    RegisterPy()
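Run the script with the interpreter you want to register (save it under any name, e.g. a hypothetical register.py, and run it with F:\python3.5\python.exe). As a quick sanity check added here (assuming Python 3.5 and the HKEY_CURRENT_USER hive used above), the key can be read back from cmd:

python -c "import winreg; print(winreg.QueryValue(winreg.HKEY_CURRENT_USER, r'SOFTWARE\Python\Pythoncore\3.5\InstallPath'))"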



4. Hello World


Change into the directory where you want to keep the code and run the following cmd command:
scrapy startproject tutorial
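This creates a project skeleton roughly like the following (the exact file list varies slightly between Scrapy versions):

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory for your spiders
            __init__.py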


Create a new file named dmoz_spider.py under the tutorial/spiders directory with the following contents:
import scrapy


class DmozSpider(scrapy.Spider):
    name = "dmoz"                      # unique name used by 'scrapy crawl'
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # save the raw page body to a file named after the last URL segment
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)


Change into the project's root directory and run the following cmd command:
scrapy crawl dmoz
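If the crawl succeeds, two files named Books and Resources should appear in the directory where you ran the command, since parse() derives each filename from the second-to-last segment of the URL (a quick illustration in a plain Python session):

>>> "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/".split("/")[-2]
'Books'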



Example: a CrawlSpider with Rules


import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MyItem(scrapy.Item):
    # fields must be declared before they can be assigned on an Item
    id = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # extract links matching 'category.php' (but not 'subsection.php')
        # and follow them (with no callback, follow defaults to True)
        Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

        # extract links matching 'item.php' and parse them with the spider's
        # parse_item method
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)

        item = MyItem()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
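The example.com URLs above are placeholders, so this spider is a template rather than something that returns real items. To try the XPath expressions against a live page before wiring them into parse_item, Scrapy's interactive shell is convenient:

scrapy shell "http://www.example.com"

Inside the shell, expressions like response.xpath('//td[@id="item_id"]/text()') can be evaluated interactively.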


Navigation:

An Old Programmer's Quick Journey into Python



