Reference posts:
https://blog.csdn.net/ssw_1990/article/details/51254227
https://www.cnblogs.com/3wtoucan/p/6042444.html
Everything up to this point follows the book. In the small program of section 3.4, the line

from wikiSpider.items import Article

raises ImportError: No module named 'wikiSpider'. A quick search turned up no syntax for importing a module straight from a parent directory, so I took the roundabout route of hard-coding an absolute path onto sys.path. With that extra snippet added, the final code looks like this:
from scrapy.selector import Selector
from scrapy import Spider
import os
import sys

# Work around the ImportError by putting the project root on sys.path
# (raw string so the backslashes are not treated as escape sequences)
sys.path.append(r"E:\practicework\scraping\wikiSpider")
from wikiSpider.items import Article

class ArticleSpider(Spider):
    name = "article"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Main_page",
                  "http://en.wikipedia.org/wiki/Python_%28programming_language%29"]

    def parse(self, response):
        item = Article()
        title = response.xpath('//h1/text()')[0].extract()
        print("Title is: " + title)
        item['title'] = title
        return item
Running it then produced another error, ImportError: No module named 'win32api'. Fine; in PowerShell, from the project directory, I ran

pip install pypiwin32

to install it, then from inside the first wikiSpider folder ran

scrapy crawl article

and it ran successfully, producing a pile of data.
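Missing optional dependencies like win32api only surface once the crawler is already running. One way to catch them up front is to probe importability without importing; a generic sketch, not Scrapy-specific (the module names are just examples):

```python
import importlib.util

def module_available(name):
    """Return True if `name` could be imported, without actually importing it."""
    return importlib.util.find_spec(name) is not None

# On Windows, module_available("win32api") would tell you whether
# pypiwin32 is installed before scrapy crawl fails mid-run.
print(module_available("sys"))                 # True: always present
print(module_available("definitely_missing"))  # False: not installed
```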
Next I ran the second version of the code:
from scrapy.contrib.spiders import CrawlSpider, Rule              #2
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor  #2
import os
import sys

# Same sys.path workaround as before (raw string for the backslashes)
sys.path.append(r"E:\practicework\scraping\wikiSpider")
from wikiSpider.items import Article

class ArticleSpider(CrawlSpider):
    name = "article"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Python_%28programming_language%29"]
    # Follow only /wiki/ links with no colon after the prefix
    # (colons mark namespace pages such as Talk: or File:)
    rules = [Rule(SgmlLinkExtractor(allow=('(/wiki/)((?!:).)*$',)),
                  callback="parse_item", follow=True)]

    def parse(self, response):
        item = Article()
        title = response.xpath('//h1/text()')[0].extract()
        print("Title is: " + title)
        item['title'] = title
        return item
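The allow pattern in the Rule, '(/wiki/)((?!:).)*$' as given in the book, is meant to keep only article links: paths starting with /wiki/ in which no later position begins with a colon. This can be checked directly with the re module, independent of Scrapy:

```python
import re

# The Rule's allow pattern: "/wiki/" followed by a run of characters,
# none of which is a colon, through to the end of the string.
pattern = r'(/wiki/)((?!:).)*$'

print(bool(re.search(pattern, '/wiki/Python_%28programming_language%29')))  # True
print(bool(re.search(pattern, '/wiki/Talk:Python')))                        # False
```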
This one fails with ImportError: No module named 'sgmllib'. sgmllib was deprecated in Python 2.6 and removed in Python 3, so on my 3.5 there is genuinely no way to make this import work; as far as I found there was no drop-in replacement library (later Scrapy releases remove SgmlLinkExtractor entirely in favour of scrapy.linkextractors.LinkExtractor), so I had no choice but to switch Python versions.
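sgmllib itself is gone for good in Python 3, but if all you need is sgmllib-style link extraction, the standard library's html.parser covers the same ground. A minimal sketch (the class name and sample HTML are my own, not from the book):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href attributes from <a> tags, roughly what a link extractor does."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<p><a href="/wiki/Python">Python</a> <a href="/wiki/Talk:Python">talk</a></p>')
print(collector.links)  # ['/wiki/Python', '/wiki/Talk:Python']
```

A colon filter like the Rule's allow regex could then be applied to collector.links before following them.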
On CrawlSpider, Rule, and SgmlLinkExtractor, see: https://blog.csdn.net/u012150179/article/details/34913315
On XPath, see: https://segmentfault.com/q/1010000005865480 and http://www.ruanyifeng.com/blog/2009/07/xpath_path_expressions.html
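The XPath used in both spiders, //h1/text(), simply selects the text of the page's <h1>. To get a feel for such path expressions outside Scrapy, the standard library's xml.etree.ElementTree supports a small XPath subset; a sketch on made-up, well-formed HTML:

```python
import xml.etree.ElementTree as ET

html = "<html><body><h1>Python (programming language)</h1><p>body text</p></body></html>"
root = ET.fromstring(html)

# ".//h1" is the ElementTree spelling of "//h1"; .text yields what
# Scrapy's '//h1/text()' selects.
title = root.find(".//h1").text
print(title)  # Python (programming language)
```

Note ElementTree needs well-formed markup; for real pages, Scrapy's own selectors (or lxml) are the robust option.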