《NLTK基础教程》(NLTK Essentials) Reading Notes, Part 007

This chapter is about web crawling.
Running the book's code as-is immediately throws a deprecation warning:

ScrapyDeprecationWarning: Module `scrapy.spider` is deprecated, use `scrapy.spiders` instead
  from scrapy.spider import BaseSpider
d:/Computer Science/Python_High_Level/nltk/chapter 7/tutorial/tutorial/spiders/NewsSpider.py:2: ScrapyDeprecationWarning: __main__.NewsSpider inherits from deprecated class scrapy.spiders.BaseSpider, please inherit from scrapy.spiders.Spider. (warning only on first subclass, there may be others)
  class NewsSpider(BaseSpider):

The fix is simply to change the import to from scrapy import Spider and have the class inherit from Spider instead of BaseSpider.

If the terminal shows the following error:

Unknown command: crawl

Reference: https://blog.csdn.net/u012490863/article/details/54743479
We have to cd into the tutorial project directory first; only then does scrapy crawl news work.

Next you may find a module missing: ImportError: No module named 'win32api'
Reference: https://stackoverflow.com/questions/21343774/importerror-no-module-named-win32api
Just pip install the Windows extensions package (pypiwin32 / pywin32) and you are set.

When connecting to www.nytimes.com, the request may hang and eventually fail ("the connected host has failed to respond, the connection attempt failed"). So we swap in another page; the LeetCode problem set page https://leetcode-cn.com/problemset/all/ came to mind, so we change the code to crawl that instead:

from scrapy import Spider

class NewsSpider(Spider):
    name = "news"
    allowed_domains = ["leetcode-cn.com"]
    start_urls = [
        "https://leetcode-cn.com/problemset/all/"
    ]

    def parse(self, response):
        # Dump the raw HTML of the page into a local file.
        filename = "leetcode.txt"
        with open(filename, 'wb') as f:
            f.write(response.body)

After the run finishes, the terminal shows output like this:

2018-07-11 22:58:27 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: tutorial)
2018-07-11 22:58:27 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.4.0, Python 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:18:55) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h  27 Mar 2018), cryptography 2.2.2, Platform Windows-10-10.0.17134-SP0
2018-07-11 22:58:27 [scrapy.crawler] INFO: Overridden settings: {'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial'}
2018-07-11 22:58:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2018-07-11 22:58:27 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-07-11 22:58:27 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-07-11 22:58:27 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-07-11 22:58:27 [scrapy.core.engine] INFO: Spider opened
2018-07-11 22:58:27 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-11 22:58:27 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-07-11 22:58:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://leetcode-cn.com/robots.txt> (referer: None)
2018-07-11 22:58:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://leetcode-cn.com/problemset/all/> (referer: None)
2018-07-11 22:58:28 [scrapy.core.engine] INFO: Closing spider (finished)
2018-07-11 22:58:28 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 453,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 9672,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 7, 11, 14, 58, 28, 268793),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 7, 11, 14, 58, 27, 887253)}
2018-07-11 22:58:28 [scrapy.core.engine] INFO: Spider closed (finished)

If you run into the DEBUG: Forbidden by robots.txt problem, see:
https://stackoverflow.com/questions/37274835/getting-forbidden-by-robots-txt-scrapy
and change the code in settings.py accordingly.
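
In short, what that answer boils down to is disabling robots.txt compliance in the project's settings.py (use this responsibly):

# tutorial/settings.py
ROBOTSTXT_OBEY = False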

The resulting leetcode.txt is just a big pile of HTML, too long to paste here.


Google's page has clearly been changed by its engineers since the book was written; the <div class="topic"> element is nowhere to be found, so we stick with the LeetCode page.
We change the div we care about from topic to col-md-9 blog-main.
In the terminal, run: scrapy shell https://leetcode-cn.com/problemset/all/
The following appears first:

2018-07-11 23:36:43 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: tutorial)
2018-07-11 23:36:43 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.4.0, Python 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:18:55) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h  27 Mar 2018), cryptography 2.2.2, Platform Windows-10-10.0.17134-SP0
2018-07-11 23:36:43 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'tutorial', 'LOGSTATS_INTERVAL': 0, 'NEWSPIDER_MODULE': 'tutorial.spiders', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['tutorial.spiders']}
2018-07-11 23:36:43 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-07-11 23:36:43 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-07-11 23:36:43 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-07-11 23:36:43 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-07-11 23:36:43 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-07-11 23:36:43 [scrapy.core.engine] INFO: Spider opened
2018-07-11 23:36:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://leetcode-cn.com/robots.txt> (referer: None)
2018-07-11 23:36:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://leetcode-cn.com/problemset/all/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x000001F1506EAE48>
[s]   item       {}
[s]   request    <GET https://leetcode-cn.com/problemset/all/>
[s]   response   <200 https://leetcode-cn.com/problemset/all/>
[s]   settings   <scrapy.settings.Settings object at 0x000001F1506F09E8>
[s]   spider     <NewsSpider 'news' at 0x1f151943828>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

Then, typing in the first line following the book's example, another deprecation warning pops up right away:

In [1]: sel.xpath('//div[@class="col-md-9 blog-main"]').extract()
2018-07-11 23:38:41 [py.warnings] WARNING: shell:1: ScrapyDeprecationWarning: "sel" shortcut is deprecated. Use "response.xpath()", "response.css()" or "response.selector" instead

Changing sel to response gives the following result:

In [1]: response.xpath('//div[@class="col-md-9 blog-main"]').extract()
Out[1]: ['<div class="col-md-9 blog-main">\n      <div class="row" id="question-app"></div>\n    </div>']

Continuing with the book's code:

In [2]: response.xpath('//title/text()')
Out[2]: [<Selector xpath='//title/text()' data='LeetCode 题库'>]

In [3]: response.xpath('//title/text()').extract()
Out[3]: ['LeetCode 题库']
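
A side note that is not part of the original session: Scrapy selectors also provide extract_first(), which returns just the first match as a plain string instead of a one-element list:

response.xpath('//title/text()').extract_first()   # 'LeetCode 题库', not ['LeetCode 题库']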

In [5]: response.xpath('//ul/li')
Out[5]: [] # empty here -- presumably because the problem list is injected by JavaScript into the empty div id="question-app" seen above, so the static HTML that Scrapy downloads contains no ul/li elements

In [9]: response.xpath('//div')
Out[9]:
[<Selector xpath='//div' data='<div class="content-wrapper">\n\n     \n\n  '>,
 <Selector xpath='//div' data='<div id="lc_navbar" class="navbar navbar'>,
 <Selector xpath='//div' data='<div id="lc_navbar_placeholder"></div>'>,
 <Selector xpath='//div' data='<div id="base_content">\n      \n  <div id'>,
 <Selector xpath='//div' data='<div id="announcement" class="container"'>,
 <Selector xpath='//div' data='<div id="notice"></div>'>,
 <Selector xpath='//div' data='<div class="container">\n  <!-- end scrol'>,
 <Selector xpath='//div' data='<div class="row" id="category-app"></div'>,
 <Selector xpath='//div' data='<div class="row">\n    <div class="col-md'>,
 <Selector xpath='//div' data='<div class="col-md-9 blog-main">\n      <'>,
 <Selector xpath='//div' data='<div class="row" id="question-app"></div'>,
 <Selector xpath='//div' data='<div class="col-md-3 blog-sidebar">\n    '>,
 <Selector xpath='//div' data='<div class="row sidebar-module">\n       '>,
 <Selector xpath='//div' data='<div class="col-md-offset-2 col-md-10">\n'>,
 <Selector xpath='//div' data='<div id="user-progress-app"></div>'>,
 <Selector xpath='//div' data='<div id="list-card-app"></div>'>,
 <Selector xpath='//div' data='<div class="row sidebar-module topic" id'>,
 <Selector xpath='//div' data='<div class="col-md-offset-2 col-md-10">\n'>,
 <Selector xpath='//div' data='<div class="tags tags-fade" id="current-'>,
 <Selector xpath='//div' data='<div id="expand-topic" data-id="Open" cl'>,
 <Selector xpath='//div' data='<div class="btn btn-default btn-round bt'>,
 <Selector xpath='//div' data='<div class="container">\n      <hr>\n     '>,
 <Selector xpath='//div' data='<div class="row">\n        \n        <div '>,
 <Selector xpath='//div' data='<div class="col-sm-5 copyright">\n       '>,
 <Selector xpath='//div' data='<div class="text-right col-sm-7">\n      '>,
 <Selector xpath='//div' data='<div class="links">\n            <a href='>,
 <Selector xpath='//div' data='<div class="row chinese-license">\n      '>,
 <Selector xpath='//div' data='<div class="col-sm-6 text-right col-sm-p'>,
 <Selector xpath='//div' data='<div class="col-sm-6 col-sm-pull-6 icp-b'>,
 <Selector xpath='//div' data='<div class="ICP license">\n              '>,
 <Selector xpath='//div' data='<div class="modal fade simple-modal" id='>,
 <Selector xpath='//div' data='<div class="modal-center">\n      <div cl'>,
 <Selector xpath='//div' data='<div class="modal-dialog">\n        <div '>,
 <Selector xpath='//div' data='<div class="modal-content">\n          <d'>,
 <Selector xpath='//div' data='<div class="modal-header">\n            <'>,
 <Selector xpath='//div' data='<div class="modal-body">\n            <di'>,
 <Selector xpath='//div' data='<div class="row text-center">\n          '>,
 <Selector xpath='//div' data='<div class="col-sm-4">\n                <'>,
 <Selector xpath='//div' data='<div class="col-sm-4">\n                <'>,
 <Selector xpath='//div' data='<div class="col-sm-4">\n                <'>]

The extract() output is far too long to list here.
Everything after that is just more operations on the page, so I won't go over it.


Moving on to section 7.3.2.
Right off the bat we run into:

ImportError: No module named 'sgmllib'

Reference: https://github.com/scrapy/scrapy/issues/2254
It turns out SgmlLinkExtractor has long been deprecated, so rewrite the code as described on that page.
You will also hit another deprecation warning:

ScrapyDeprecationWarning: Module `scrapy.contrib.spiders` is deprecated, use `scrapy.spiders` instead
  from scrapy.contrib.spiders import CrawlSpider, Rule

So the final import section should be:

from scrapy import Spider
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector

After tweaking things back and forth, the finished NewsSpider.py looks like this:

from scrapy import Spider
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item, Field
import scrapy

class NewsItem(scrapy.Item):
    title = Field()
    topic = Field()
    desc = Field()

class NewsSpider(CrawlSpider):
    name = "news"
    allowed_domains = ["news.google.com"]
    start_urls = [
        "https://news.google.com"
    ]

    rules = (
        Rule(LinkExtractor(allow=('cnn.com',), deny=('http://edition.cnn.com/',))),
        Rule(LinkExtractor(allow=('news.google.com',)), callback="parse_news_item"),
        )

    def parse(self, response):
        sel = Selector(response)
        item = NewsItem()
        item['title'] = sel.xpath('//title/text()').extract()
        # The book uses a single leading slash ('/div[...]'), which can never
        # match below the document root; '//div[...]' is what is intended.
        item['topic'] = sel.xpath('//div[@class="topic"]').extract()
        item['desc'] = sel.xpath('//td/text()').extract()

        return item

        # filename = "leetcode.txt"
        # open(filename, 'wb').write(response.body)
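
Two caveats about this spider: the second Rule points at a parse_news_item callback that is never defined, and overriding parse() on a CrawlSpider replaces the method its rule-following machinery relies on, which is presumably why only the start URL itself ends up being scraped. Purely as a sketch (the method name just matches the rule above; none of this comes from the book), the missing callback could be added inside NewsSpider while leaving parse() to CrawlSpider:

    # Hypothetical rule callback: if parse() is not overridden, CrawlSpider
    # follows the rules and hands matching pages to this method.
    def parse_news_item(self, response):
        item = NewsItem()
        item['title'] = response.xpath('//title/text()').extract()
        item['topic'] = response.xpath('//div[@class="topic"]').extract()
        item['desc'] = response.xpath('//td/text()').extract()
        return item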

As mentioned earlier, the div is no longer there on Google's side, so the final result is just:

{'desc': [], 'title': ['Google News'], 'topic': []}

Still, the code runs from start to finish without errors, which is quite satisfying.


I won't go over the deprecated-import issues again; just follow the warnings.
I have no idea why the book puts a quote right after def; it must be a typo, or some Python 2-only notation, since Python 3 certainly has nothing like it. I rewrote that snippet as:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    # Callbacks are given as method-name strings, which is the form the Scrapy
    # docs use and avoids depending on definition order inside the class body.
    sitemap_rules = [
        ('/electronics/', 'parse_electronics'),
        ('/apparel/', 'parse_apparel'),
    ]

    def parse_electronics(self, response):
        return

    def parse_apparel(self, response):
        return

7.5
First, I'm not sure why a new project is created here with no command shown, but the settings.py listed further down gives it away, so we also create a new project called myproject and make the same modifications as before.

Then open pipelines.py and write the following:

from scrapy import Item
from scrapy.exceptions import DropItem
from scrapy import signals
import datetime
import json

class CleanPipeline():
    # Normalize the description field: trim, lowercase, strip the '#$' markers.
    def process_item(self, item, spider):
        if item['desc']:
            item['desc'] = item['desc'].strip().lower().replace('#$', '')
            return item

class AgePipeline():
    # Derive an Age field (in years) from the DOB field.
    def process_item(self, item, spider):
        if item['DOB']:
            dob = datetime.datetime.strptime(item['DOB'], '%d-%m-%y').date()
            # The book parses the literal string 'currentdate' here, which would
            # raise a ValueError; using today's date instead.
            item['Age'] = (datetime.date.today() - dob).days / 365
            return item

class DuplicatesPipeline():
    # Drop items whose id has already been seen during this crawl.
    def __init__(self):
        self.ids_seen = set()
    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item

class JsonWriterPipeline():
    # Append each item to items.txt as one JSON object per line.
    def __init__(self):
        # 'w' (text mode) rather than 'wb': json.dumps returns a str in Python 3.
        self.file = open('items.txt', 'w')
    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

Not sure why the book repeats a chunk of code here and never actually writes CleanPipeline; the code above adds a CleanPipeline adapted from the publisher's reference code at https://github.com/PacktPublishing/Natural-Language-Processing-Python-and-NLTK/blob/master/Module%201/Chapter%207/itempiplines.py
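
One thing the notes don't spell out at this point: pipelines only run once they are registered in the project's settings.py via ITEM_PIPELINES. Assuming the myproject name used above, the registration would look roughly like this (the numbers set the execution order, lower first):

# myproject/settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.CleanPipeline': 100,
    'myproject.pipelines.AgePipeline': 200,
    'myproject.pipelines.DuplicatesPipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 400,
}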

That's all.
