Scraping Douban Movie Short Reviews with Scrapy

carry-s

Category: SCRAPY. Tags: scrapy, simulated login, Douban short reviews

Article link: https://blog.csdn.net/u013402772/article/details/51159588

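When I crawled the movie information earlier, I also saved each movie's short-review URL.
So to crawl the short reviews, all that is needed is to put the URLs already stored in the database into start_urls.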

spider.py

# -*- coding: utf-8 -*-
from scrapy.selector import Selector
from scrapy.spiders import Spider
from scrapy.http import Request
from comments.items import CommentsItem
import MySQLdb

class CommentSpider(Spider):
    name = "comments"
    #allowed_domains=["movie.douban.com"]
    db = MySQLdb.connect("localhost","root","123456","python" )
    cursor = db.cursor()
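    # The review links (comment_url) were saved to the database while crawling
    # the movie information; read them back to use as start_urls
    cursor.execute("select comment_url from doubanmovie")
    #data = cursor.fetchone()  # fetch one row
    data = cursor.fetchall()  # fetch all rows
    start_urls = data  # assigned unchanged; this is what triggers the error described below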
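    def parse(self, response):
        sel = Selector(response)
        Url = response.url
        # Keep everything up to and including 'comments' (8 characters)
        start_index = Url.find('comments')
        URL = Url[0:start_index+8]
        # The digits in that prefix are the movie ID
        ID = ''.join(filter(str.isdigit, URL))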
        # Each short review's header block has class "comment-info"
        comments = sel.xpath('//*[@class="comment-info"]')
        for comment in comments:
            item = CommentsItem()
            item['ID'] = ID
            item['user_name'] = comment.xpath('a/text()').extract()
            item['user_score'] = comment.xpath('span[1]/@title').extract()
            yield item
        # Follow the "next" link; its href is relative, so prepend the base URL
        for url in sel.xpath("//*[@class='next']/@href").extract():
            yield Request(URL+url,callback=self.parse)
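
The URL handling in parse can be tried out on its own, without running the spider. The following is a minimal sketch for Python 3, not the author's code: the helper name split_comment_url, the sample page URL, and the sample "next" href are all assumptions for illustration. It also shows the standard library's urljoin as an alternative to string concatenation for building the next-page link.

```python
import re
from urllib.parse import urljoin


def split_comment_url(url):
    """Return (base_url, movie_id) for a short-review page URL.

    base_url is everything up to and including 'comments';
    movie_id is made of the digits found in that prefix.
    """
    end = url.find('comments')
    if end == -1:
        raise ValueError('not a comments URL: %r' % url)
    base_url = url[:end + len('comments')]
    movie_id = ''.join(re.findall(r'\d', base_url))
    return base_url, movie_id


# Hypothetical page URL and "next" href, shaped like the ones the spider handles.
page = 'https://movie.douban.com/subject/1292052/comments?start=0&limit=20'
next_href = '?start=20&limit=20'

base_url, movie_id = split_comment_url(page)
# urljoin keeps the path of the current page and swaps in the new query string.
next_page = urljoin(page, next_href)
```

For a query-only href like the one assumed here, urljoin(page, next_href) and base_url + next_href give the same result; urljoin would also cope if the site returned an absolute or path-relative link instead.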

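Running the spider failed with
TypeError('Request url must be str or unicode, got %s:' % type(url).__name__).
It turned out that data, the result fetched from the database, is a tuple (fetchall() returns a tuple of row tuples).
Assigning start_urls = data directly does not work, because in Scrapy start_urls is a list and each of its elements must be a string.
So I added some code: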

    temp = list(data)
    start_urls = []
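
The listing breaks off after these two lines. One way such a conversion could be completed is sketched below; it is an assumption, not the author's actual code. The helper rows_to_urls and the sample rows are hypothetical, and the rows mimic the shape fetchall() returns for a single-column query: a tuple of one-element tuples.

```python
def rows_to_urls(rows):
    """Flatten single-column database rows into a list of URL strings."""
    urls = []
    for row in rows:
        url = row[0]  # each row is a one-element tuple
        if url:       # skip NULL or empty values
            urls.append(str(url).strip())
    return urls


# Hypothetical rows, shaped like cursor.fetchall() output for
# "select comment_url from doubanmovie".
data = (
    ('https://movie.douban.com/subject/1292052/comments',),
    ('https://movie.douban.com/subject/1291546/comments ',),
    (None,),
)

start_urls = rows_to_urls(data)
```

In the spider, the same idea would read start_urls = rows_to_urls(cursor.fetchall()), which gives Scrapy the list of strings it expects.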
