Use Scrapy to crawl movie information from the ashvsash movie site. This example simply prints the information instead of storing it in a database; with a small change you can enable a pipeline and use PyMySQL or a MongoDB client library to filter and persist the data. Note: extraction fails on some pages, so the XPaths need careful tuning. The code:
# -*- coding: utf-8 -*-
import scrapy


# Helper that prints a dict in aligned columns, for easy inspection.
def my_print(a_map):
    for key in a_map:
        print("%-15s %-50s" % (key, a_map[key]))


debug = 1


class MovicesSpider(scrapy.Spider):
    name = "movices"
    allowed_domains = ["ashvsash.com"]
    start_urls = ['http://ashvsash.com/']

    def parse_node_thumbnail_article_info(self, thumbnail, article, info):
        url = thumbnail.xpath("./a/@href").extract()
        title = article.xpath(".//a[@title]/@title").extract()
        info_date = info.xpath("./span[@class='info_date info_ico']/text()").extract()
        info_views = info.xpath("./span[@class='info_views info_ico']/text()").extract()
        info_category = info.xpath("./span[@class='info_category info_ico']/a/text()").extract()
        if debug:
            print("\nURL:", url[0])
            print("date = ", info_date[0])
            print("views = ", info_views[0])
            print("category = ", info_category[0])
            print("title = ", title[0])
        return {'url': url[0], 'date': info_date[0], 'views': info_views[0],
                'title': title[0], 'category': info_category[0]}

    def parse_movie_detail_page(self, response):
        result = {}
        movie_info = response.meta['movie_info']
        result['url:'] = movie_info['url']
        result['views:'] = movie_info['views']
        result['title:'] = movie_info['title']
        try:
            key = response.xpath(r'//*[@id="post_content"]/p[2]/span[1]/text()').extract()[0] + ":"
            value = response.xpath(r'//*[@id="post_content"]/p[2]/span[2]/a/text()').extract()[0]
            result[key] = value
            key = response.xpath(r'//*[@id="post_content"]/p[2]/span[3]/text()').extract()[0] + ":"
            value = response.xpath(r'//*[@id="post_content"]/p[2]/span[4]/a/text()').extract()
            result[key] = value
            key = response.xpath(r'//*[@id="post_content"]/p[2]/span[6]/text()').extract()[0]
            value = response.xpath(r'//*[@id="post_content"]/p[2]/text()[6]').extract()
            result[key] = value
            key = response.xpath(r'//*[@id="post_content"]/p[2]/span[8]/text()').extract()[0]
            value = response.xpath(r'//*[@id="post_content"]/p[2]/text()[10]').extract()[0]
            result[key] = value
            print("-----------------------------------------------------------------")
            my_print(result)
            print("-----------------------------------------------------------------")
        except Exception:
            # Some detail pages fail to parse; simply skip them for now.
            print("<<<<<<------------------------------------------------------------")

    def parse(self, response):
        post_container = response.xpath("//ul[@id='post_container']")
        new_urls = response.xpath(r'//div[@class="pagination"]/a/@href').extract()
        # Re-queue the pagination links (next, 2, 3, 4, ...).
        for url in new_urls:
            yield scrapy.Request(url=url, callback=self.parse)
        li = post_container.xpath(".//li")
        for item in li:
            node_thumbnail = item.xpath("./div[@class='thumbnail']")
            node_article = item.xpath("./div[@class='article']")
            node_info = item.xpath("./div[@class='info']")
            movie_info = self.parse_node_thumbnail_article_info(node_thumbnail, node_article, node_info)
            yield scrapy.Request(url=movie_info['url'], callback=self.parse_movie_detail_page,
                                 meta={'movie_info': movie_info})
If you additionally define an Item and map its fields onto a database table, storing the data becomes straightforward. Tested with Scrapy under Python 3.6 on Windows 7. If errors occur during installation, download the relevant wheel packages from
http://www.lfd.uci.edu/~gohlke/pythonlibs/ and install them locally with pip. The most likely problem is building lxml, which requires the Visual Studio compiler; check the error messages printed during installation for details.
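As a sketch of that mapping (the table name `movies` and its columns are assumptions, not part of the spider; the PyMySQL call is shown only in a comment), the dict returned by the spider could be turned into a parameterized INSERT like this:

```python
# Sketch: map a scraped dict onto a parameterized SQL INSERT.
# The table name "movies" and its columns are illustrative assumptions;
# adapt them to your actual schema.

def build_insert(table, row):
    """Build a parameterized INSERT statement and its value tuple."""
    columns = sorted(row)                       # deterministic column order
    placeholders = ", ".join(["%s"] * len(columns))
    sql = "INSERT INTO %s (%s) VALUES (%s)" % (
        table, ", ".join(columns), placeholders)
    return sql, tuple(row[c] for c in columns)

movie = {'url': 'http://ashvsash.com/example', 'date': '2017-06-01',
         'views': '123', 'title': 'Example', 'category': 'Action'}
sql, params = build_insert('movies', movie)
print(sql)
# Inside a pipeline's process_item you would then run, e.g. with PyMySQL:
#   cursor.execute(sql, params)
```

Keeping the SQL parameterized (placeholders plus a value tuple) also protects against injection from scraped text.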
Using scrapy-redis is also fairly simple: `pip install scrapy-redis`, subclass the spider class that scrapy-redis provides, then push URLs into Redis and run the spider; a scrapy-redis spider pulls its URLs from Redis by default. Remember to configure Redis in the settings. That gives you a distributed crawler: a master crawler collects URLs and pushes them to Redis, while the distributed workers pop URLs and do the detailed parsing. Workers can, of course, also push new URLs back onto the queue.
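A minimal configuration sketch of that setup, assuming scrapy-redis is installed and Redis runs on localhost (the spider name `movices_redis` and the `redis_key` value are illustrative, not from the original project):

```python
# settings.py -- route scheduling and de-duplication through Redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = "redis://localhost:6379"

# Spider -- subclass RedisSpider instead of scrapy.Spider; it reads its
# start URLs from the Redis list named by redis_key.
from scrapy_redis.spiders import RedisSpider

class MovicesRedisSpider(RedisSpider):
    name = "movices_redis"
    redis_key = "movices:start_urls"   # the master pushes URLs here

    def parse(self, response):
        # ... same parsing logic as the spider above ...
        pass
```

The master side can seed the queue with, for example, `redis-cli lpush movices:start_urls http://ashvsash.com/`, after which every running worker competes for URLs from the shared list.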