Python Distributed Crawler Must-Learn Framework Scrapy for Building a Search Engine - 4: Crawling the Site

Jobbole (伯乐在线)

1. Crawl every link under the subdomain, avoiding duplicates.

2. The latest-articles page all-posts/page/22 lets you browse all the articles.

Analyze the URL list and keep track of the next page.

Installing Scrapy

***********

Method 1: install wheel, pywin32 and the rest; see https://www.cnblogs.com/dalyday/p/9277212.html. The write-up is very earnest. I didn't end up using it, although it would have needed

pywin32-227-cp37-cp37m-win_amd64.whl and so on.

Method 2: another write-up, https://www.cnblogs.com/airnew/p/10152438.html, installs the dependencies in this order: lxml -> zope.interface -> pyopenssl -> twisted -> scrapy.

My method: I simply switched the PyPI mirror in PyCharm from Douban to Tsinghua University and it just worked. The Douban mirror is missing a lot of packages and the official index is too slow to download from here; the Tsinghua mirror is excellent.
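For reference, the equivalent on the command line is pointing pip at the Tsinghua mirror (the mirror URL below is the commonly used one, shown only as an example):

pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple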

************

In File - Settings, click the + on the right, type scrapy, and install it.

In the Terminal at the bottom left, run scrapy startproject ArticleSpider; the new project is created:

(venv) D:\zz\PycharmProjects\test>scrapy startproject ArticleSpider
New Scrapy project 'ArticleSpider', using template directory 'd:\zz\pycharmprojects\test\venv\lib\site-packages\scrapy\templates\project', created in:
    D:\zz\PycharmProjects\test\ArticleSpider

You can start your first spider with:
    cd ArticleSpider
    scrapy genspider example example.com

The resulting project layout:

ArticleSpider/: the project's Python module; code will be imported from here

__init__.py: must exist, even though the file is empty

ArticleSpider/items.py: the project's item definitions (where the scraped data is modelled)

ArticleSpider/pipelines.py: the project's pipeline file

ArticleSpider/settings.py: the project's settings file

ArticleSpider/middlewares.py: the project's middleware file

ArticleSpider/spiders/: the directory that holds the spider code

scrapy.cfg: deployment configuration, similar to Django's

Create a spider targeting jobbole inside spiders:

cd ArticleSpider
scrapy genspider jobbole blog.jobbole.com
# -*- coding: utf-8 -*-
import scrapy


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/']

    def parse(self, response):
        pass

Tip: File - Settings - File and Code Templates lets you set the encoding and author in the Python Script template.

scrapy crawl jobbole runs the actual crawl:

(venv) D:\zz\PycharmProjects\test\ArticleSpider>scrapy crawl jobbole
2020-02-04 10:24:29 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: ArticleSpider)
2020-02-04 10:24:29 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Windows-10-10.0.18362-SP0
2020-02-04 10:24:29 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'ArticleSpider', 'NEWSPIDER_MODULE': 'ArticleSpider.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['ArticleSpider.spiders']}
2020-02-04 10:24:29 [scrapy.extensions.telnet] INFO: Telnet Password: 4f68092a27495123
2020-02-04 10:24:29 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2020-02-04 10:24:30 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-02-04 10:24:30 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-02-04 10:24:30 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-02-04 10:24:30 [scrapy.core.engine] INFO: Spider opened
2020-02-04 10:24:30 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-02-04 10:24:30 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-02-04 10:24:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://blog.jobbole.com/robots.txt> (referer: None)
2020-02-04 10:24:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://blog.jobbole.com/> (referer: None)
2020-02-04 10:24:31 [scrapy.core.engine] INFO: Closing spider (finished)
2020-02-04 10:24:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 440,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 1736,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 1.154336,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 2, 4, 2, 24, 31, 512274),
 'log_count/DEBUG': 2,
 'log_count/INFO': 10,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2020, 2, 4, 2, 24, 30, 357938)}
2020-02-04 10:24:31 [scrapy.core.engine] INFO: Spider closed (finished)

To debug the crawl above, create a main.py under ArticleSpider and debug from there:

__author__ = 'zz'
# debugging entry point
from scrapy.cmdline import execute
# execute lets us run scrapy commands from a script, which makes debugging easy
import sys
# the project directory must be on sys.path so execute can find the project
# sys.path.append("D:\zz\PycharmProjects\test")
# a hard-coded path is not portable
import os

# print(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
# os.path.abspath(__file__) is this file's absolute path
# os.path.dirname() gives its parent directory
execute(["scrapy", "crawl", "jobbole"])

You also need to turn off robots.txt compliance in settings.py.
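That is, in settings.py set:

ROBOTSTXT_OBEY = False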

Debugging main.py (the file above) will stop at the pass in jobbole.py (set a breakpoint there), where you can inspect the contents of response.

XPath

1. XPath overview

2. XPath terminology

3. XPath syntax

1. XPath overview

1) XPath uses path expressions to navigate XML and HTML documents

2) XPath includes a standard function library

3) XPath is a W3C standard

2. XPath node relationships

1) parent nodes

2) child nodes

3) sibling nodes

4) ancestor nodes

5) descendant nodes

3. XPath syntax

article: selects all child nodes of the article element

/article: selects the root element article

article/a: selects all a elements that are children of article

//div: selects all div elements, wherever they appear in the document

article//div: selects all div elements that are descendants of article, no matter where they sit inside article

//@class: selects all attributes named class

/article/div[1]: the first div

/article/div[last()]: the last div

/article/div[last()-1]: the second-to-last div

//div[@lang]: selects all div elements that have a lang attribute

//div[@lang='eng']: selects all div elements whose lang attribute is eng

/div/*: all child nodes of div

//*: selects all elements

//div[@*]: selects all div elements that have at least one attribute

//div/a | //div/p: selects all a and p elements under div elements

//span | //ul: selects all span and ul elements

article/div/p | //span: p elements under div under article, plus all span elements
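As a quick way to try a few of these expressions outside a spider, here is a minimal sketch using Scrapy's Selector on a made-up HTML snippet (the markup is invented purely for illustration):

from scrapy import Selector

# toy markup, invented only to exercise the expressions above
html = """
<article>
  <div class="entry-header"><h1>Title one</h1></div>
  <div lang="eng"><a href="/post/1">read</a></div>
</article>
"""
sel = Selector(text=html)
print(sel.xpath('//div[@class="entry-header"]/h1/text()').extract())  # ['Title one']
print(sel.xpath('//div[@lang="eng"]/a/@href').extract_first())        # '/post/1'
print(sel.xpath('//@class').extract())                                # ['entry-header']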

class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    #start_urls = ['http://blog.jobbole.com/']
    start_urls = ['http://blog.jobbole.com/caijing/gsyw/89113/']
    #start_urls = ['https://zz.58.com/shoujidiannaoshumahuishou/39393312698905x.shtml?adtype=1&adact=3&psid=135555298207200372225374949&iuType=b_1&link_abtest=&ClickID=1&PGTID=0d300024-0015-6db4-56f5-8b74c976468b&slot=1000106']
    def parse(self, response):
        print(response.body)
        xpath = '/html/body/div[3]/div[2]/div[1]/h2'
        # Firefox copies a full-path XPath while Chrome copies a key-attribute one; full paths break on JS-generated markup, so the Chrome style is recommended
        xpath = '//*[@id="basicinfo"]/div[1]/h1/text()'
        xpath = '//div[@class="entry-header"]/div[1]/h1/text()'
        # text() extracts the text content
        re_selector = response.xpath(xpath)

This page was meant for testing, but jobbole has been hammered so hard that it now has anti-scraping enabled (YunSuoAutoJump(), the YunSuo lock; breaking it is left for a separate post), so I switched to a 58.com page to keep testing. Both full-path and key-attribute XPaths work as long as they are written correctly, and text() extracts the text without problems. Incidentally, 58 also blocks you if you refresh too often.

***************

Note: in Python 3 the old list-returning range was removed and xrange's lazy behaviour became the new range, so just use range directly.
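A quick check of that behaviour in Python 3:

r = range(5)      # lazy sequence, equivalent to Python 2's xrange
print(r)          # range(0, 5)
print(list(r))    # [0, 1, 2, 3, 4]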

***************

In the terminal:

scrapy shell http://blog.jobbole.com/caijing/gsyw/89113/

# this opens an interactive shell for that page, so you don't have to fetch it over and over
title = response.xpath( '//div[@class="entry-header"]/div[1]/h1/text()')

Typing title shows the Selector it returned.

title.extract() returns a list.

title.extract()[0] returns the first element.
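extract_first(), which the code further below uses, is a safer variant: it returns a default value instead of raising an IndexError when nothing matches:

title.extract_first("")  # first match, or "" if the selector matched nothing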

create_date = response.xpath('//p[@class="entry-meta-hide-on-mobile"]')

create_date = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/text()').extract()

create_date = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/text()').extract()[0].strip()

create_date = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/text()').extract()[0].strip().replace("·", "").strip()

praise_nums = response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract()[0]

fav_nums= response.xpath("//span[contains(@class,'bookmark-btn')]/text()").extract()[0]

match_re = re.match(r".*?(\d+).*", fav_nums)

if match_re:

   fav_nums = match_re.group(1)

comment_nums = response.xpath("//a[@href='#article-comment']/span/text()").extract()[0]

match_re = re.match(r".*?(\d+).*", comment_nums)

if match_re:

   comment_nums = match_re.group(1)

content = response.xpath("//div[@class='entry']").extract()[0]

tag_list = response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()

tag_list = [element for element in tag_list if not element.strip().endswith("评论")]

tags=",".join(tag_list)

  tag_list =['a', 'b', '评论', 'c']
  tag_list = [element for element in tag_list if not element.strip().endswith("评论")]  # drop the entries that end with 评论 (the comment count)
  tags = ",".join(tag_list)

CSS selectors

*: selects all nodes

#container: selects the node whose id is container

.container: selects all nodes whose class contains container

li a: selects all a nodes inside li nodes

ul + p: selects the first p element immediately after a ul

div#container > ul: selects ul elements that are direct children of the div with id container

ul ~ p: selects all p siblings that follow a ul

a[title]: selects all a elements that have a title attribute

a[href="http://jobbole.com"]: selects all a elements whose href is exactly http://jobbole.com

a[href*="jobbole"]: selects all a elements whose href contains jobbole

a[href^="http"]: selects all a elements whose href starts with http

a[href$=".jpg"]: selects all a elements whose href ends with .jpg

input[type=radio]:checked: selects the radio inputs that are checked

div:not(#container): selects all div elements whose id is not container

li:nth-child(3): selects the third li element

tr:nth-child(2n): even-numbered tr elements
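The same Selector works for trying out the CSS forms; a small sketch on invented markup:

from scrapy import Selector

html = '<div id="container"><ul><li><a href="http://jobbole.com/a.jpg" title="t">x</a></li></ul></div>'
sel = Selector(text=html)
print(sel.css('#container ul li a::attr(href)').extract_first())   # 'http://jobbole.com/a.jpg'
print(sel.css('a[href$=".jpg"]::text').extract())                   # ['x']
print(sel.css('div#container > ul').extract_first() is not None)    # True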

**************************************

def parse_detail(self,response):   
        front_image_url = response.meta.get("front_image_url","")
        # article cover image
        title = response.css('.entry-header h1::text').extract_first()  # ::text selects the text inside h1; without it you get the raw markup
        create_date =  response.css('.entry-meta-hide-on-mobile::text').extract_first().strip().replace('·', '').strip()
        praise_num = response.css('h10::text').extract_first()
        fav_num = response.css('.bookmark-btn::text').extract_first()
        comment_num = response.css('a[href="#article-comment"] span::text').extract_first()
        math_re = re.match(".*?(\d+).*", fav_num)
        if math_re:
            fav_num = int(math_re.group(1))
        else:
            fav_num = 0
        math_re = re.match(".*?(\d+).*", comment_num)
        if math_re:
            comment_num = int(math_re.group(1))
        else:
            comment_num = 0
        content = response.css('.entry').extract_first()
        tag_list = response.css('.entry-meta-hide-on-mobile a::text').extract()
        tag_list =  [element for element in tag_list if not element.strip().endswith("评论")]
        tag_list = ",".join(tag_list)
        pass

Crawling the articles from the list page

"""

1. Get the article URLs from the list page and hand them to scrapy for download; the detail-parsing callback then extracts the individual fields.

2. Get the next-page URL and hand it to scrapy for download; when it finishes, the result goes back to parse.

"""

# parse all article URLs on the list page

import scrapy
import re
# regular expressions
from scrapy.http import Request
# used to issue requests
from urllib import parse
# joins URLs; in Python 2 this was import urlparse


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/all-posts/']

    def parse(self, response):
        post_nodes = response.css("#archive.floated-thumb .post-thumb a")
        for post_node in post_nodes:
            image_url = post_node.css("img::attr(src)").extract_first("")
            post_url = post_node.css("::attr(href)").extract_first("")
            yield Request(url=parse.urljoin(response.url, post_url), meta={"front_image_url": image_url}, callback=self.parse_detail)
            # pass the cover-image URL along in meta; urljoin turns a relative href into an absolute URL

        # earlier version that only extracted the hrefs (kept for reference, superseded by the loop above):
        # post_urls = response.css("#archive.floated-thumb .post-thumb a::attr(href)").extract()
        # for post_url in post_urls:
        #     yield Request(url=parse.urljoin(response.url, post_url), callback=self.parse_detail)
        #     # Request(url=post_url, callback=self.parse_detail)

        # extract the next page and hand it back to scrapy
        next_url = response.css(".next.page-numbers::attr(href)").extract_first("")
        if next_url:
            yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse)

items.py

import scrapy


class ArticlespiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

class JobBoleArticleItem(scrapy.Item):
    title = scrapy.Field()  # Field is the only field type there is
    create_date = scrapy.Field()
    url = scrapy.Field()
    url_object_id = scrapy.Field()    # normalized form of the url (its md5)
    front_image_url = scrapy.Field()  # cover image URL
    front_image_path = scrapy.Field() # local path of the downloaded cover image
    praise_nums = scrapy.Field()
    comment_nums = scrapy.Field()
    fav_nums = scrapy.Field()
    tags = scrapy.Field()
    content = scrapy.Field()

Import the item into jobbole.py:

from ArticleSpider.items import JobBoleArticleItem
import datetime

def parse_detail(self, response):   # rewritten
    article_item = JobBoleArticleItem()
    article_item['title'] = title
    article_item['url'] = response.url
    try:
        create_date = datetime.datetime.strptime(create_date, "%Y/%m/%d").date()
    except Exception as e:
        create_date = datetime.datetime.now().date()
    article_item['create_date'] = create_date
    article_item['front_image_url'] = [front_image_url]
    # the images pipeline configured in settings expects a list for this field, hence the []
    article_item['praise_nums'] = praise_nums
    article_item['comment_nums'] = comment_nums
    article_item['fav_nums'] = fav_nums
    article_item['tags'] = tags
    article_item['content'] = content
    yield article_item

pipelines.py (mainly responsible for storing the data) receives the item automatically; nothing needs to change here yet:

class ArticlespiderPipeline(object):
    def process_item(self, item, spider):
        return item

But settings.py does need a tweak: uncomment the following.

ITEM_PIPELINES = {
    'ArticleSpider.pipelines.ArticlespiderPipeline': 300,
}

To download the images automatically from their URLs, change the above to:

import os

ITEM_PIPELINES = {
    'ArticleSpider.pipelines.ArticlespiderPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,  # the number sets the run order; lower runs first
}
IMAGES_URLS_FIELD = "front_image_url"
project_dir = os.path.abspath(os.path.dirname(__file__))
IMAGES_STORE = os.path.join(project_dir, 'images')

Without the PIL library the image download fails, so install pillow. Then change article_item['front_image_url'] = front_image_url to article_item['front_image_url'] = [front_image_url].

To save the local path of each downloaded image and associate it with the item:

from scrapy.pipelines.images import ImagesPipeline

class ArticlespiderPipeline(object):
    def process_item(self, item, spider):
        return item

class ArticleImagePipeline(ImagesPipeline):
    def item_completed(self, results, item, info):
        if "front_image_url" in item:   # only handle items that actually have an image
            for ok, value in results:
                image_file_path = value["path"]
            item["front_image_path"] = image_file_path
        return item
# results also contains the saved file path and the corresponding source URL

To filter the images by size, configure thresholds in settings:

IMAGES_MIN_HEIGHT=100

IMAGES_MIN_WIDTH=100

Modify settings to use the newly created ArticleImagePipeline:

ITEM_PIPELINES = {
    'ArticleSpider.pipelines.ArticlespiderPipeline': 300,
    # 'scrapy.pipelines.images.ImagesPipeline': 1,  # the number sets the run order
    'ArticleSpider.pipelines.ArticleImagePipeline': 1,
}
MD5 hashing

__author__ = 'zz'
import hashlib

def get_md5(url):
    if isinstance(url, str):  # str is unicode in Python 3, so encode it to bytes first
        url = url.encode("utf-8")
    m = hashlib.md5()
    m.update(url)
    return  m.hexdigest()

if  __name__ == "__main__":
    print(get_md5("http://jobbole.com".encode("utf-8")))

In parse_detail in jobbole.py:

article_item["url_object_id"] = get_md5(response.url)

With that done, we are ready to insert into a database.

pipelines.py: writing the items to a JSON file

import codecs
# codecs handles the encoding for us
import json

# export to a custom file
class JsonWithEncodingPipeline(object):
    def __init__(self):
        self.file = codecs.open("article.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        lines = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(lines)
        return item

    def spider_closed(self, spider):
        self.file.close()

Every time you test a different pipeline, update ITEM_PIPELINES in settings accordingly:

ITEM_PIPELINES = {
    'ArticleSpider.pipelines.JsonWithEncodingPipeline': 300,
    # 'scrapy.pipelines.images.ImagesPipeline': 1,  # the number sets the run order
    'ArticleSpider.pipelines.ArticleImagePipeline': 1,
}

from scrapy.exporters import JsonItemExporter
# use the JSON exporter that scrapy provides
class JsonExporterPipleLine(object):
    def __init__(self):
        self.file = open("article.json", "wb")
        self.exporter = JsonItemExporter(self.file, encoding="utf-8", ensure_ascii=False)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

Setting up the database table
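The insert statements below only use title, url, create_date and fav_nums, so a minimal table along these lines would work (the exact column types and the url_object_id primary key are my assumption, not spelled out in these notes):

CREATE TABLE jobbole_article (
    url_object_id VARCHAR(50) NOT NULL PRIMARY KEY,
    title VARCHAR(200) NOT NULL,
    url VARCHAR(300) NOT NULL,
    create_date DATE,
    fav_nums INT NOT NULL DEFAULT 0
) ENGINE=InnoDB DEFAULT CHARSET=utf8;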

Install the MySQL driver mysqlclient:

sudo apt-get install libmysqlclient-dev  # Ubuntu

sudo yum install python-devel mysql-devel  # CentOS

import MySQLdb

# synchronous inserts: the insert speed can't keep up with the parsing speed
class MysqlPipeline(object):
    def __init__(self):
        self.conn = MySQLdb.connect('host', 'user', 'password', 'dbname', charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):  # process_item must be implemented in a pipeline
        insert_sql = """
            insert into jobbole_article(title, url, create_date, fav_nums)
            values (%s, %s, %s, %s)
        """
        self.cursor.execute(insert_sql, (item["title"], item["url"], item["create_date"], item["fav_nums"]))
        self.conn.commit()

# asynchronous inserts
Add the connection settings to settings.py:
MYSQL_HOST=""
MYSQL_DBNAME=""
MYSQL_USER=""
MYSQL_PASSWORD=""

from twisted.enterprise import adbapi
import MySQLdb.cursors

# asynchronous version
class MysqlTwistedPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):  # cls is the class itself
        dbparms = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            passwd=settings["MYSQL_PASSWORD"],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )  # the keyword names are fixed: they must match the parameters of MySQLdb.connect
        dbpool = adbapi.ConnectionPool("MySQLdb", **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        # run the insert asynchronously through twisted
        query = self.dbpool.runInteraction(self.do_insert, item)
        # log any errors
        query.addErrback(self.handle_error)

    def handle_error(self, failure):  # extra item and spider arguments are optional here
        # handle exceptions raised by the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        insert_sql = """
            insert into jobbole_article(title, url, create_date, fav_nums)
            values (%s, %s, %s, %s)
        """
        cursor.execute(insert_sql, (item["title"], item["url"], item["create_date"], item["fav_nums"]))

scrapy-djangoitem can simplify the work above.
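A rough sketch of what that looks like, assuming you already have a Django model called Article (the model import path here is hypothetical, not from these notes):

# pip install scrapy-djangoitem
from scrapy_djangoitem import DjangoItem
from article.models import Article   # hypothetical Django model

class JobBoleArticleItem(DjangoItem):
    django_model = Article            # item fields are generated from the Django model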

A brief intro to ItemLoader, which simplifies the extraction code. Modify jobbole.py so that each field of the item is populated through a loader:

from scrapy.loader import ItemLoader
from ArticleSpider.items import ArticleItemLoader
# itemloader

# the manual extraction above can be commented out
front_image_url = response.meta.get("front_image_url", "")
# item_loader = ItemLoader(item=JobBoleArticleItem(), response=response)
item_loader = ArticleItemLoader(item=JobBoleArticleItem(), response=response)
# the custom ArticleItemLoader makes every field keep only the first value
item_loader.add_css("title", ".entry-header h1::text")
# item_loader.add_xpath() works the same way
item_loader.add_value("url", response.url)
item_loader.add_value("url_object_id", get_md5(response.url))
item_loader.add_value("front_image_url", [front_image_url])
...
article_item = item_loader.load_item()  # build the item
yield article_item

items.py then needs to change accordingly:

import scrapy
from scrapy.loader.processors import MapCompose, TakeFirst, Join
import datetime
import re
from scrapy.loader import ItemLoader

def add_jobbole(value):
    return value + "-jobbole"

def date_convert(value):
    try:
        create_date = datetime.datetime.strptime(value, "%Y/%m/%d").date()
    except Exception as e:
        create_date = datetime.datetime.now().date()
    return create_date

# shared by all the numeric fields
def get_nums(value):
    math_re = re.match(".*?(\d+).*", value)
    if math_re:
        nums = int(math_re.group(1))
    else:
        nums = 0
    return nums

def remove_comment_tages(value):
    # drop the comment-count entry from the tags
    if "评论" in value:
        return ""
    else:
        return value

class ArticleItemLoader(ItemLoader):
    # custom loader: every field keeps only the first extracted value
    default_output_processor = TakeFirst()

def return_value(value):
    return value

class JobBoleArticleItem(scrapy.Item):
    title = scrapy.Field(
        input_processor=MapCompose(add_jobbole)
        # input_processor=MapCompose(lambda x: x + "-jobbole", add_jobbole)  # processors can be chained
    )
    create_date = scrapy.Field(
        input_processor=MapCompose(date_convert)
        # output_processor=TakeFirst()  # the custom loader already applies TakeFirst everywhere
    )
    praise_nums = scrapy.Field(
        input_processor=MapCompose(get_nums)
    )
    tags = scrapy.Field(
        input_processor=MapCompose(remove_comment_tages),
        output_processor=Join(",")
    )
    front_image_url = scrapy.Field(
        output_processor=MapCompose(return_value)
    )

After thinking it over, I want to take another run at Jobbole and get past its protection. Download PyExecJS, then:

import execjs
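PyExecJS just runs JavaScript from Python, so the general pattern is to feed it the site's JS and call the function that produces the anti-bot value. A minimal usage sketch (the JS function here is a stand-in, not the real YunSuo script):

import execjs

# toy JS standing in for whatever the protection script computes
ctx = execjs.compile("""
function make_token(seed) {
    return (seed * 2 + 1).toString();
}
""")
print(ctx.call("make_token", 20))   # prints 41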

 
