PhantomJS download page: http://phantomjs.org/download.html
Here we pick phantomjs-2.1.1-windows.zip, as shown below:
The download is a zip archive; extract it to the desktop to make it easy to work with, as shown below:
After extracting, open the bin directory inside and copy its path, as shown below:
Paste that path into the PATH environment variable, as shown below:
Note: before pasting, check the end of the existing PATH value. If it already ends with a \, add a semicolon after the \; if there is no trailing \, append a \ and a semicolon. Otherwise the phantomjs command will not be found when we run our code later.
Once installation is done, type phantomjs in a terminal to verify it works, as shown below:
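If you prefer to verify the PATH change from Python instead of a new terminal window, the standard library can look the executable up (a hypothetical sanity check, not part of the tutorial's steps):

```python
import shutil

# Look up the phantomjs executable on the current PATH.
# shutil.which returns the full path if found, or None if the
# PATH entry was not added correctly.
path = shutil.which('phantomjs')
print(path if path else 'phantomjs not found on PATH')
```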
Next we will crawl a website and store the results in MySQL.
Target site: http://www.ygdy8.net/html/gndy/index.html
First, create the project scaffold from the command line:
1. cd into the folder where you want the project to live
2. Run scrapy startproject tiantangdianying (the name means "Movie Heaven")
3. Once the project is created, enter it: cd tiantangdianying
4. Inside the project, generate a spider: scrapy genspider tiantang ygdy8.net
The project is now set up, so without further ado, let's open tiantang.py and write the code to scrape what we want.
The goals of this crawl:
1. Get the classic-films section of the Movie Heaven site
2. Follow each section's "更多" (More) link and collect the detail-page URL of every movie
3. On each detail page, extract the movie title and its download link
4. Save the extracted data to a MySQL database
Open tiantang.py, as shown below:
Once inside, change the start URL, as shown below:
Getting the "更多" (More) links:
# -*- coding: utf-8 -*-
import scrapy
from ..items import TiantangdianyingItem


class DianyingSpider(scrapy.Spider):
    name = 'tiantang'
    allowed_domains = ['ygdy8.net']
    start_urls = ['http://www.ygdy8.net/html/gndy/index.html']

    def parse(self, response):
        div_list = response.xpath('//div[@class="title_all"]/p/em/a/@href').extract()
        for div in div_list:
            div_url = 'http://www.ygdy8.net' + div
            print(div_url)
Output:
Getting each movie's detail-page link:
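String concatenation works here because the extracted hrefs are root-relative, but `urllib.parse.urljoin` (or `response.urljoin` in Scrapy) resolves both root-relative and page-relative paths correctly; a small sketch:

```python
from urllib.parse import urljoin

base = 'http://www.ygdy8.net/html/gndy/index.html'

# A root-relative href, like the ones extracted above.
print(urljoin(base, '/html/gndy/china/index.html'))
# -> http://www.ygdy8.net/html/gndy/china/index.html

# A page-relative href is also resolved against the current directory.
print(urljoin(base, 'list_4_2.html'))
# -> http://www.ygdy8.net/html/gndy/list_4_2.html
```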
    def get_detail_page_url(self, response):
        common_url = response.xpath('//div[@class="co_area2"]//ul//td//a[2]/@href').extract()
        for common in common_url:
            url = 'http://www.ygdy8.net' + common
            print(url)
Output:
Getting the next-page link:
        next_list = response.xpath('//div[@class="x"]//a[text()="下一页"]/@href').extract_first('')
        url = 'http://www.ygdy8.net/html/gndy/china/' + next_list
        print(url)
Output:
Next, we get the movie title and the download address:
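One thing to watch: `extract_first('')` returns an empty string on the last page, so the concatenation above would build a request for the bare directory URL. A small helper (hypothetical, just to illustrate the guard) makes the behavior explicit:

```python
def build_next_page_url(next_href, base='http://www.ygdy8.net/html/gndy/china/'):
    """Return the absolute next-page URL, or None when there is no next page."""
    if not next_href:  # extract_first('') yields '' on the last page
        return None
    return base + next_href

print(build_next_page_url('list_4_2.html'))
# -> http://www.ygdy8.net/html/gndy/china/list_4_2.html
print(build_next_page_url(''))
# -> None
```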
    def get_title_and_href(self, response):
        title = response.xpath('//div[@class="title_all"]//font/text()').extract_first()
        href = response.xpath('//td[@style="WORD-WRAP: break-word"]/a/@href').extract_first()
        print(title)
        print(href)
Output:
We already imported items at the top of the file; now let's use it:
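The XPath above pulls the href attribute out of the download table cell. Outside Scrapy, the same idea can be tried on a snippet with the standard library's `xml.etree.ElementTree` (a rough stand-in for illustration only; Scrapy's selectors are far more forgiving with real-world HTML):

```python
import xml.etree.ElementTree as ET

# A simplified stand-in for the download cell on a detail page.
snippet = '<td style="WORD-WRAP: break-word"><a href="ftp://example.com/movie.mkv">link</a></td>'
cell = ET.fromstring(snippet)
# Find the <a> child and read its href attribute.
href = cell.find('a').get('href')
print(href)
# -> ftp://example.com/movie.mkv
```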
        item = TiantangdianyingItem()
        item['title'] = title
        item['href'] = href
        yield item
Configure items.py:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TiantangdianyingItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    href = scrapy.Field()
Edit settings.py:
Then open MySQL and create a database and a table:
Create two columns in the table, name and href:
Configure the pipeline in pipelines.py:
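The settings change that makes the pipeline take effect is enabling ITEM_PIPELINES; a typical fragment (the class path assumes the project layout created above):

```python
# settings.py (relevant fragment)
ITEM_PIPELINES = {
    'tiantangdianying.pipelines.TiantangdianyingPipeline': 300,
}
```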
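The exact MySQL DDL depends on your setup, but the shape of the table can be sketched. Here the standard library's sqlite3 is used as a stand-in so the snippet runs without a MySQL server; in MySQL you would typically use VARCHAR columns and an AUTO_INCREMENT id:

```python
import sqlite3

# In-memory stand-in for the `dianying` database.
conn = sqlite3.connect(':memory:')
# Two data columns, name and href, matching the fields the spider yields.
conn.execute('CREATE TABLE new_table (id INTEGER PRIMARY KEY, name TEXT, href TEXT)')
# Parameterized insert, the same pattern the pipeline uses.
conn.execute('INSERT INTO new_table (name, href) VALUES (?, ?)',
             ('阳光灿烂的日子', 'ftp://example.com/movie.mkv'))
row = conn.execute('SELECT name, href FROM new_table').fetchone()
print(row)
```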
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql


class TiantangdianyingPipeline(object):
    def __init__(self):
        self.connect = pymysql.connect(host='localhost', user='root', password='123456',
                                       db='dianying', port=3306, charset='utf8')
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        # Use a parameterized query instead of string formatting so that
        # titles containing quotes do not break the SQL statement.
        self.cursor.execute('INSERT INTO new_table (name, href) VALUES (%s, %s)',
                            (item['title'], item['href']))
        self.connect.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.connect.close()
Finally, run scrapy crawl tiantang from the command line,
and the data is saved to MySQL, as shown below:
Here is the complete tiantang.py:
# -*- coding: utf-8 -*-
import scrapy
from ..items import TiantangdianyingItem


class DianyingSpider(scrapy.Spider):
    name = 'tiantang'
    allowed_domains = ['ygdy8.net']
    start_urls = ['http://www.ygdy8.net/html/gndy/index.html']

    def parse(self, response):
        # Collect the "更多" (More) links of each section on the index page.
        div_list = response.xpath('//div[@class="title_all"]/p/em/a/@href').extract()
        for div in div_list:
            div_url = 'http://www.ygdy8.net' + div
            # print(div_url)
            yield scrapy.Request(url=div_url, callback=self.get_detail_page_url)

    def get_detail_page_url(self, response):
        # Collect the detail-page link of every movie in the list.
        common_url = response.xpath('//div[@class="co_area2"]//ul//td//a[2]/@href').extract()
        for common in common_url:
            url = 'http://www.ygdy8.net' + common
            # print(url)
            yield scrapy.Request(url=url, callback=self.get_title_and_href)
        # Follow the "下一页" (next page) link, if there is one.
        next_list = response.xpath('//div[@class="x"]//a[text()="下一页"]/@href').extract_first('')
        if next_list:
            url = 'http://www.ygdy8.net/html/gndy/china/' + next_list
            # print(url)
            yield scrapy.Request(url=url, callback=self.get_detail_page_url)

    def get_title_and_href(self, response):
        # Extract the movie title and its download link from the detail page.
        title = response.xpath('//div[@class="title_all"]//font/text()').extract_first()
        href = response.xpath('//td[@style="WORD-WRAP: break-word"]/a/@href').extract_first()
        print(title)
        print(href)
        item = TiantangdianyingItem()
        item['title'] = title
        item['href'] = href
        yield item
I hope this write-up is helpful. If you know a better way to scrape this data, please leave a comment below so we can all learn from it.