Foreword
This post walks through setting up an environment for the PySpider crawler framework.
I. What is PySpider?
pyspider is an open-source crawler framework written by Binux. Its main capabilities are:
1) crawling specific pages across multiple sites and scheduling updates/re-crawls;
2) extracting structured information from pages;
3) flexibility and extensibility, with stability and monitoring built in.
Advantages of pyspider:
1. It ships with a WebUI, which makes debugging crawlers very convenient;
2. Crawl-flow monitoring and crawler project management are easy;
3. Common databases are supported;
4. PhantomJS is supported, so JavaScript-rendered pages can be crawled;
5. Per-request priorities, scheduled (periodic) crawls, and similar features are supported.
Disadvantages:
1. It does not cope well with sites that have strong anti-crawling measures;
2. It is not suited to very large-scale crawls; the framework is fairly lightweight;
3. Its extensibility is limited.
II. Installation steps
1. Install pyspider
Install it with pip install pyspider.
An error appears during installation: the pycurl package is missing. Download a prebuilt pycurl wheel and install it locally, making sure the wheel matches your Python version and architecture.
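As a sketch of the workaround (the wheel filename below is an assumption; pick the one that matches your Python version and platform):

```shell
# Installing pyspider often fails on Windows while compiling pycurl
pip install pyspider

# Workaround: install a prebuilt pycurl wheel first, then retry.
# The filename is an example for CPython 3.6 on 64-bit Windows.
pip install pycurl-7.43.0.3-cp36-cp36m-win_amd64.whl
pip install pyspider
```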
pyspider now installs successfully, but launching the WebUI fails: the installed Werkzeug version is too new for pyspider, so downgrade it.
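pyspider has not been updated for Werkzeug 1.0+, so pinning an older release is the usual fix. The exact version below is an assumption that is commonly reported to work:

```shell
# Downgrade Werkzeug to a release that pyspider's WebUI still supports
pip install werkzeug==0.16.1
```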
After that, pyspider starts successfully.
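A minimal sketch of launching all pyspider components and opening the WebUI (the default WebUI port is 5000):

```shell
# Start the scheduler, fetcher, processor and WebUI in one process
pyspider all
# then open http://localhost:5000 in a browser
```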
2. Debugging and usage
Open the WebUI in a browser.
After creating a crawl task for a URL, an error appears. This is probably a DNS-resolution problem; here it was resolved by disabling IPv6 on the machine.
PhantomJS also needs to be installed, otherwise JavaScript-rendered pages cannot be viewed properly: pip install phantomjs (or download the PhantomJS binary and put it on your PATH).
SSL verification error
This error occurs when requesting https:// URLs whose certificates fail SSL verification. Configure a parameter on the request to skip the check.
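In pyspider the switch is validate_cert, settable globally via crawl_config or per request. A minimal sketch (disabling certificate checking weakens security, so use it only for debugging):

```python
# Disable SSL certificate validation for all fetches made by a handler.
# In a real handler this dict lives on the BaseHandler subclass.
crawl_config = {
    "validate_cert": False,
}

# The same switch can also be passed per request:
# self.crawl(url, callback=self.detail_page, validate_cert=False)
print(crawl_config["validate_cert"])
```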
To write results to open-source Hive, install the JDBC bridge package jaydebeapi (pip install jaydebeapi), then do a simple connection test.
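A minimal connection sketch, assuming jaydebeapi is installed and a Hive JDBC driver jar is available locally (the host, port, credentials, and jar path below are placeholders, not the values from this setup):

```python
def query_hive(sql):
    """Run one statement over JDBC via jaydebeapi and return the rows."""
    import jaydebeapi  # requires: pip install jaydebeapi

    conn = jaydebeapi.connect(
        "org.apache.hive.jdbc.HiveDriver",       # JDBC driver class
        "jdbc:hive2://127.0.0.1:10000/default",  # placeholder connection URL
        ["hive", "password"],                    # placeholder credentials
        "/path/to/hive-jdbc-driver.jar",         # placeholder driver jar
    )
    try:
        cur = conn.cursor()
        cur.execute(sql)
        rows = cur.fetchall()
        cur.close()
        return rows
    finally:
        conn.close()
```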
The pyspider handler below crawls travel-note pages from qunar.com and stores the extracted fields in a database, either MySQL or Hive.
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2020-09-16 10:56:49
# Project: test2
from pyspider.libs.base_handler import *
import pymysql
from jpype import *
import os.path

class Handler(BaseHandler):
    crawl_config = {
    }

    # Open the MySQL connection once when the handler is created
    def __init__(self):
        self.db = pymysql.connect(host='172.20.xxx.xx', port=3316, user='root',
                                  passwd='664856849', db='test', charset='utf8')

    # Connect to Inceptor (Transwarp Hive) over JDBC via JPype and insert one record
    def conn_inceptor(self, url, title, date, who, how, cash, day, text, image):
        conn = None
        stmt = None
        try:
            jarpath = os.path.join(os.path.abspath('.'), 'C:/User/14713/Desktop/logs/')
            if not isJVMStarted():
                startJVM("C:/software/Java/jdk1.8.0_191/jre/bin/server/jvm.dll", "-ea",
                         "-Djava.class.path=%s" % (jarpath + 'inceptor-driver.jar'))
            print(jarpath, 'hive-jdbc-0.12.0-transwarp-6.0.2.jar')
            sql = 'insert into pyspider_qunar(url, title, date_trav, who, how, cash, day_trav, text_trav, image) values ("%s","%s","%s","%s","%s","%s","%s","%s","%s")' % (url, title, date, who, how, cash, day, text, image)
            java.lang.Class.forName("org.apache.hive.jdbc.HiveDriver")
            # JClass("org.apache.hive.jdbc.HiveDriver")
            conn = java.sql.DriverManager.getConnection('jdbc:hive2://172.20.xxx.xx:10000/default', 'hive', '123456')
            stmt = conn.createStatement()
            # sql is already a str in Python 3, so no decode() is needed
            state = stmt.execute(sql)
            print(state)
        except Exception as e:
            print(e)
        finally:
            if stmt is not None:
                stmt.close()
            if conn is not None:
                conn.close()

    # Insert one record into the MySQL database
    def add_mysql(self, url, title, date, who, how, cash, day, text, image):
        try:
            cursor = self.db.cursor()
            sql = 'insert into pyspider_qunar(url, title, date_trav, who, how, cash, day_trav, text_trav, image) values ("%s","%s","%s","%s","%s","%s","%s","%s","%s")' % (url, title, date, who, how, cash, day, text, image)
            # print(sql)
            cursor.execute(sql)
            print(cursor.lastrowid)
            self.db.commit()
        except Exception as e:
            print(e)
            self.db.rollback()

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://travel.qunar.com/travelbook/list.htm', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('li > .tit > a').items():
            self.crawl(each.attr.href, callback=self.detail_page, fetch_type='js')
        next_url = response.doc('.next').attr.href
        self.crawl(next_url, callback=self.index_page)

    @config(priority=2)
    def detail_page(self, response):
        url = response.url
        title = response.doc('title').text()
        date_trav = response.doc('.when .data').text()
        day_trav = response.doc('.howlong .data').text()
        who = response.doc('.who .data').text()
        text_trav = response.doc('#b_panel_schedule').text()
        image = response.doc('.cover_img').attr.src
        how = response.doc('.how .data').text()
        cash = response.doc('.howmuch .data').text()
        # self.add_mysql(url, title, date_trav, who, how, cash, day_trav, text_trav, image)
        self.conn_inceptor(url, title, date_trav, who, how, cash, day_trav, text_trav, image)
        # Note: JPype cannot restart a JVM after shutdownJVM(), so shutting it
        # down here would break every detail page after the first; keep the JVM
        # running for the lifetime of the process instead.
        # shutdownJVM()
        return {
            "url": url,
            "title": title,
            "date": date_trav,
            "day": day_trav,
            "who": who,
            "text": text_trav,
            "image": image,
            "how": how,
            "cash": cash,
        }
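The insert statements above build SQL by string formatting, which breaks as soon as the scraped text contains a quote character and is also injection-prone; parameterized queries avoid both problems. A sketch using the stdlib sqlite3 module as a stand-in for the real database (with pymysql the placeholder is %s instead of ?):

```python
import sqlite3

# In-memory database as a stand-in for the real pyspider_qunar table
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pyspider_qunar (url TEXT, title TEXT)")

# Values are passed separately from the SQL, so quotes in scraped
# text are handled safely by the driver
row = ("http://travel.qunar.com/x", 'A "quoted" title')
db.execute("INSERT INTO pyspider_qunar (url, title) VALUES (?, ?)", row)
db.commit()

print(db.execute("SELECT title FROM pyspider_qunar").fetchone()[0])
# → A "quoted" title
```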
Summary
The above covered installing PySpider and a first round of simple debugging and usage.