Small problems encountered with web crawlers and Scrapy

胖头鱼00

Article link: https://blog.csdn.net/b806071099/article/details/85235746

1.LOGSTATS_INTERVAL = 60.0

The stats logging interval defaults to 60 s. For my own use I set it to 5 s.

2. To install PIL, run pip install pillow.

3. When installing pyautogui, the oldest version number can be chosen.

4. SSH remote login: ssh remote_user@remote_ip

4.1 The scp command

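5. Command for running a .py file with gunicorn

# your_flask_module is the name of your Flask file without the .py suffix
gunicorn --config gunicorn_config.py your_flask_module:app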
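The command above expects a gunicorn_config.py file, which the post does not show. A minimal sketch of what such a file might contain follows; every value here is an illustrative assumption, not the author's configuration.

```python
# gunicorn_config.py (illustrative values, not taken from the original post)
bind = "127.0.0.1:8000"   # address and port to listen on
workers = 2               # number of worker processes
timeout = 30              # seconds before a silent worker is restarted
accesslog = "-"           # "-" writes the access log to stdout
```

Gunicorn reads these module-level names as settings, so the file needs no imports.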

6. Logging an exception with exception(e)

logger.exception(e)
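A minimal sketch of where logger.exception(e) is usually called: inside an except block, where it logs at ERROR level and appends the current traceback. The logger name "demo" and the function risky_division are made up for illustration.

```python
import io
import logging

# Send log output to an in-memory buffer so the result can be inspected.
log_buffer = io.StringIO()
logger = logging.getLogger("demo")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(log_buffer))
logger.propagate = False


def risky_division(a, b):
    """Divide a by b, logging the full traceback on failure."""
    try:
        return a / b
    except ZeroDivisionError as e:
        # exception() logs at ERROR level and appends the current traceback.
        logger.exception(e)
        return None


result = risky_division(1, 0)
logged_text = log_buffer.getvalue()
```

Calling logger.exception outside an except block still logs the message, but there is no active traceback to attach.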

1. Starting the MongoDB server:

sudo mongod --dbpath=/var/lib/mongodb

3. Downgrading pip

python3 -m pip install --user --upgrade pip==9.0.3

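4. Encoding conversion problems

# Python 2 only: in Python 3 reload() is not a builtin and sys.setdefaultencoding() does not exist
import sys
reload(sys)
sys.setdefaultencoding("utf-8")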
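The snippet above works only on Python 2. On Python 3 the usual approach is to decode and encode explicitly. A small sketch, assuming the incoming bytes are GBK-encoded (the sample data is made up):

```python
# Bytes as they might arrive from a GBK-encoded page (sample data).
raw_bytes = "爬虫".encode("gbk")

# Decode with the source encoding to get a str, then encode as UTF-8.
text = raw_bytes.decode("gbk")
utf8_bytes = text.encode("utf-8")

# errors="replace" avoids an exception when the bytes do not match the encoding.
lossy_text = raw_bytes.decode("utf-8", errors="replace")
```

Decoding with the wrong encoding and errors="replace" does not fail, but it yields replacement characters instead of the original text.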

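5. Handling cookies

# Paste the raw Cookie header string here, e.g. "k1=v1; k2=v2"
a = " "
cookies = {}
for line in a.split(';'):
    # Skip empty fragments that contain no key=value pair
    if '=' not in line:
        continue
    print(line)
    key, value = line.split('=', 1)
    cookies[key.strip()] = value
print(cookies)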
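As an alternative to the manual loop above, the standard library's http.cookies.SimpleCookie can parse a Cookie header string. A sketch with made-up cookie names and values:

```python
from http.cookies import SimpleCookie

# Sample Cookie header string (made-up values).
raw_cookie = "sessionid=abc123; theme=dark"

parsed = SimpleCookie()
parsed.load(raw_cookie)

# Convert the Morsel objects into a plain dict of name -> value.
cookies = {name: morsel.value for name, morsel in parsed.items()}
```

SimpleCookie is stricter than a plain split: it may stop parsing at a fragment it considers malformed, so the manual loop can be the safer choice for messy browser-copied strings.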

6. Create a main.py file in the crawler project root for debugging

from scrapy import cmdline
# coser is the name of a spider inside the project, not the project name
cmdline.execute('scrapy crawl coser'.split())

7. Using XPath to select elements that contain a given string:

Reference article:
https://www.cnblogs.com/c-x-a/p/10339032.html

"//div[@class='pbm mbm bbda cl']//li[contains(string(),'用户组')]/span/a/text()"

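extract() and extract_first()

def parse_content(self, response):
    # extract() returns a list of all matches; extract_first() returns the first match, or None if there is none
    title = response.xpath("//h2[@id='activity-name']/text()").extract_first()
    if title is not None:
        print(title.strip())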

8. Scrapy proxy settings (http and https)

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called

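        # self.daili() returns a proxy in the form 11.22.33.44:66
        daili = self.daili()
        daili_http = 'http://' + daili
        daili_https = 'https://' + daili
        # HttpProxyMiddleware reads the proxy from request.meta['proxy']
        request.meta['proxy'] = daili_http
        # REMOTE_ADDR is not one of Scrapy's documented special meta keys; kept from the original
        request.meta['REMOTE_ADDR'] = daili_https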
        return None

1. Scrapy downloader middlewares: the effect of the order value

{
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 500,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 550,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware': 830,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}

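When a request is sent, middlewares with smaller values run first.
When a response comes back, middlewares with larger values run first.
The smaller the value, the closer the middleware is to the engine and the farther it is from the downloader.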
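The ordering rule can be sketched in plain Python by sorting a few of the entries listed above. This only illustrates the order of calls; it is not how Scrapy builds its middleware chain internally.

```python
# A subset of the order values listed above (short names for readability).
middlewares = {
    "RobotsTxtMiddleware": 100,
    "UserAgentMiddleware": 400,
    "RetryMiddleware": 500,
    "HttpProxyMiddleware": 750,
}

# Requests travel from the engine to the downloader: ascending order values.
request_order = sorted(middlewares, key=middlewares.get)

# Responses travel from the downloader to the engine: descending order values.
response_order = sorted(middlewares, key=middlewares.get, reverse=True)
```

A custom middleware that must see the request after the proxy has been set would therefore need a value above 750.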

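2. Crawling second-level (multi-level) pages with Scrapy

Another option: multiple spider files plus a database (crawl level by level)
Demo of the general method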

        
# -*- coding: utf-8 -*-
import scrapy
from Tencent.items import TencentItem


class TencentSpider(scrapy.Spider):
    # Spider name
    name = 'tencent'
    # Domains the spider is allowed to crawl
    allowed_domains = ['www.xxx.com']
    # Base URL, used to build absolute URLs from relative links
    base_url = 'https://www.xxx.com/'
    # Entry URL where crawling starts
    start_urls = ['https://www.xxx.com/position.php']
    # Initial value of the page counter
    count = 1
    # Number of pages to crawl; 1 means only one page is crawled
    page_end = 1
 
    def parse(self, response):
        # Select the odd and even rows of the listing table
        nodeList = response.xpath("//table[@class='tablelist']/tr[@class='odd'] | //table[@class='tablelist']/tr[@class='even']")
        for node in nodeList:
            item = TencentItem()

            item['title'] = node.xpath("./td[1]/a/text()").extract()[0]
            if len(node.xpath("./td[2]/text()")):
                item['position'] = node.xpath("./td[2]/text()").extract()[0]
            else:
                item['position'] = ''
            item['num'] = node.xpath("./td[3]/text()").extract()[0]
            item['address'] = node.xpath("./td[4]/text()").extract()[0]
            item['time'] = node.xpath("./td[5]/text()").extract()[0]
            item['url'] = self.base_url + node.xpath("./td[1]/a/@href").extract()[0]
            # Request the detail page, passing the partly filled item along in meta
            yield scrapy.Request(item['url'], meta={'item': item}, callback=self.detail_parse)

            # A lower-level page follows, so the item is not yielded here
            # yield item
 
        # Follow pagination; extract_first() returns None when there is no next link
        nextPage = response.xpath("//a[@id='next']/@href").extract_first()
        # Stop at the page limit or at the last page
        if nextPage and nextPage != 'javascript:;' and self.count < self.page_end:
            # Increment the page counter
            self.count = self.count + 1
            # Request the next page
            yield scrapy.Request(self.base_url + nextPage, callback=self.parse)
    def detail_parse(self, response):
        # Receive the data collected on the parent page
        item = response.meta['item']
        # Extract data from the first-level detail page
        item['zhize'] = response.xpath("//*[@id='position_detail']/div/table/tr[3]/td/ul[1]").xpath('string(.)').extract()[0]
        item['yaoqiu'] = response.xpath("//*[@id='position_detail']/div/table/tr[4]/td/ul[1]").xpath('string(.)').extract()[0]
        # Request the second-level detail page
        yield scrapy.Request(item['url'] + "&123", meta={'item': item}, callback=self.detail_parse2)
        # A lower-level page follows, so the item is not returned here
        # return item

    def detail_parse2(self, response):
        # Receive the data collected on the parent pages
        item = response.meta['item']
        # Extract data from the second-level detail page
        item['test'] = "111111111111111111"
        # Return the finished item to the engine
        return item

3. Scrapy Request and Response objects

Request object

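Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])
# The parameters are described below.
url (required)  URL of the page to request, a str.
callback  Page parsing function (callback), a callable. It is called once the page requested by the Request has been downloaded. If omitted, the spider's parse method is used.
method  HTTP request method, GET by default.
headers  HTTP request headers, a dict, e.g. {"Accept": "text/html", "User-Agent": "Mozilla/5.0"}. If the value of an entry is None, that header is not sent; for example, {"Cookie": None} prevents cookies from being sent.
body  HTTP request body, bytes or str.
cookies  Cookie information, a dict.
meta  Metadata dict of the Request. It passes information to other components of the framework, such as middlewares, which read it through request.meta. It is also used to pass information to the response callback.
encoding  Encoding of the url and body arguments, utf-8 by default. A str argument is encoded with this encoding.
priority  Priority of the request, 0 by default. Requests with higher priority are downloaded first.
dont_filter  False by default: repeated requests for the same URL are filtered out to avoid duplicate downloads. Set it to True to allow the same URL to be downloaded again, which is useful when a page changes dynamically.
errback  Callback invoked when the request raises an exception or returns an HTTP error.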
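The effect of dont_filter can be sketched with a simplified duplicate filter. This is a stand-in under stated assumptions: the schedule function and the (url, dont_filter) tuples are hypothetical, and Scrapy's real filter compares request fingerprints rather than raw URL strings.

```python
def schedule(requests):
    """Return the URLs that would be downloaded, in order.

    Each request is a (url, dont_filter) tuple. Simplified stand-in for
    Scrapy's scheduler and duplicate filter.
    """
    seen = set()
    accepted = []
    for url, dont_filter in requests:
        if not dont_filter:
            if url in seen:
                continue  # duplicate request is dropped
            seen.add(url)
        accepted.append(url)
    return accepted


queue = [
    ("https://www.xxx.com/a", False),
    ("https://www.xxx.com/a", False),  # filtered as a duplicate
    ("https://www.xxx.com/a", True),   # dont_filter=True bypasses the filter
]
downloaded = schedule(queue)
```

Of the three queued requests, only the second is dropped; the third goes through because it skips the filter entirely.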

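Response object

A Response object describes an HTTP response. Response is only a base class; depending on the content of the response there are the following subclasses:
1. TextResponse
2. HtmlResponse
3. XmlResponse
HtmlResponse and XmlResponse are subclasses of TextResponse, and the differences between them are small.
When a page finishes downloading, the downloader creates an object of one of the Response subclasses according to the Content-Type in the HTTP response headers. The pages we usually crawl are HTML text, so an HtmlResponse object is created.
Since most downloaded pages are HTML text, the HtmlResponse object is described here.
Its attributes and methods are as follows.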
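url: URL of the HTTP response, a str

status: status code of the HTTP response, an int

headers: headers of the HTTP response, a dict-like object; use its get or getlist method to read values

body: body of the HTTP response, bytes (images, binary streams)

text: body of the HTTP response as text, a str

 - response.text = response.body.decode(response.encoding)

encoding: encoding of the HTTP response body

request: the Request object that produced this HTTP response

meta: same as response.request.meta. Information for the response callback can be passed in through the meta argument when the Request is built, and the callback reads it back through response.meta

selector: a Selector object used to extract data from the Response, mainly through xpath and css and the processing of their results

xpath(query): selects nodes and their content

css(query): selects nodes and their content

urljoin(url): builds an absolute URL. When the url argument is a relative address, the absolute URL is computed from response.url.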
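The last two ideas can be tried without Scrapy. As far as I know, response.urljoin behaves much like the standard library's urllib.parse.urljoin with response.url as the base (HTML base tags aside), so the sketch below uses that function directly. The page URL and links are made up.

```python
from urllib.parse import urljoin

# Hypothetical page URL, standing in for response.url.
page_url = "https://www.xxx.com/jobs/position.php?page=1"

# A relative link is resolved against the directory of the page.
relative_link = urljoin(page_url, "detail.php?id=7")
# A root-relative link is resolved against the host.
rooted_link = urljoin(page_url, "/about.html")
# An absolute link is returned unchanged.
absolute_link = urljoin(page_url, "https://example.org/x")

# response.text corresponds to the body decoded with the response encoding.
body = "<h2>标题</h2>".encode("utf-8")
text = body.decode("utf-8")
```

Using urljoin instead of string concatenation avoids doubled or missing slashes when the href format varies across pages.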

1. Scrapy reports "Connection was closed cleanly"

It turned out that the Host request header was wrong. After commenting out Host, data could be collected normally.

2. Sending cookies with a requests call

response = requests.post(url, data=data, cookies=cookies)