(1) Install selenium:
pip3 install selenium
(2) Download chromedriver
Since the automation drives the Chrome browser, download the chromedriver that matches your installed Chrome version.
Matching builds can be downloaded from http://chromedriver.storage.googleapis.com/index.html
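Before wiring selenium into the spider, it is worth a quick smoke test that selenium can actually drive the downloaded chromedriver. A minimal sketch, assuming chromedriver is on your PATH:
from selenium import webdriver

opt = webdriver.ChromeOptions()
opt.add_argument('--headless')
driver = webdriver.Chrome(options=opt)  # assumes chromedriver is on PATH
driver.get('https://example.com')
print(driver.title)  # prints "Example Domain" if everything is wired up
driver.quit()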
(3) Create the spider
Enter the scrapy project directory and create a new spider, which generates NetworkStatus.py under spiders/ (genspider takes a domain, not a full URL):
~:scrapy genspider NetworkStatus xxxxxx.com
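genspider fills NetworkStatus.py with a skeleton roughly like the one below (the exact template varies a little between Scrapy versions); the steps that follow replace it piece by piece:
import scrapy

class NetworkstatusSpider(scrapy.Spider):
    name = 'NetworkStatus'
    allowed_domains = ['xxxxxx.com']
    start_urls = ['https://xxxxxx.com/']

    def parse(self, response):
        pass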
Set up the items.py file:
import scrapy

class NetworkStatus(scrapy.Item):
    timestamp = scrapy.Field()
    Hashrate = scrapy.Field()
    Difficulty = scrapy.Field()
    PPS = scrapy.Field()
    FPPS = scrapy.Field()
    Next_Difficulty = scrapy.Field()
    Date_to_Next_Difficulty = scrapy.Field()
    Time = scrapy.Field()
    BlocksLeft = scrapy.Field()
    count = scrapy.Field()
    unconfirmed = scrapy.Field()
    tx_Rate = scrapy.Field()
    Median_block_size = scrapy.Field()
    fees_recommended = scrapy.Field()
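A NetworkStatus item behaves like a dict, so the spider fills it by key. A quick sketch, where the import path assumes the project package is named exponent and the values are made up for illustration:
from exponent.items import NetworkStatus

item = NetworkStatus()
item['timestamp'] = 1583900000   # hypothetical value
item['Hashrate'] = '110.5 EH/s'  # hypothetical value
print(dict(item))                # {'timestamp': 1583900000, 'Hashrate': '110.5 EH/s'}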
Set up the pipelines.py file:
import pymongo

class NetworkStatusPipeline(object):
    def __init__(self):
        host = '127.0.0.1'
        port = 27017
        dbname = 'exponent'
        sheetname = 'NetworkStatus'
        # Create the MongoDB connection
        client = pymongo.MongoClient(host=host, port=port)
        # Select the database
        mydb = client[dbname]
        # Collection that will hold the scraped data
        self.post = mydb[sheetname]

    def process_item(self, item, spider):
        data = dict(item)
        # insert_one() replaces the deprecated insert() in pymongo 3+
        self.post.insert_one(data)
        return item
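If you prefer not to hard-code the connection details, Scrapy's from_crawler hook can read them from settings.py instead. A sketch, where MONGO_HOST and MONGO_PORT are assumed setting names you would add to settings.py yourself:
import pymongo

class NetworkStatusPipeline(object):
    def __init__(self, host, port):
        client = pymongo.MongoClient(host=host, port=port)
        self.post = client['exponent']['NetworkStatus']

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_HOST / MONGO_PORT are hypothetical keys defined in settings.py
        return cls(
            host=crawler.settings.get('MONGO_HOST', '127.0.0.1'),
            port=crawler.settings.getint('MONGO_PORT', 27017),
        )

    def process_item(self, item, spider):
        self.post.insert_one(dict(item))
        return item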
Set up the settings.py file:
# Change ROBOTSTXT_OBEY from True to False.
# It defaults to True, which makes Scrapy obey robots.txt and stay within the crawl scope the site allows.
# Setting it to False tells Scrapy to ignore the robots.txt rules.
ROBOTSTXT_OBEY = False
# Enable the pipeline
ITEM_PIPELINES = {
'exponent.pipelines.NetworkStatusPipeline': 300
}
Write NetworkStatus.py:
import time

import scrapy
from bs4 import BeautifulSoup as bs4
from selenium import webdriver

from exponent.items import NetworkStatus


class NetworkstatusSpider(scrapy.Spider):
    # Spider name: must be unique and must not be omitted
    name = 'NetworkStatus'
    collection = 'network_status'
    custom_settings = {
        'ITEM_PIPELINES': {'exponent.pipelines.NetworkStatusPipeline': 300}
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
    }

    def start_requests(self):
        # URL to crawl
        url = 'https://xxxxxx.com'
        # Fetch the page's cookies first
        cookie = self.get_cookie(url)
        yield scrapy.Request(url, callback=self.parse, headers=self.headers, cookies=cookie)

    # Use selenium to obtain the page's cookies
    @staticmethod
    def get_cookie(url):
        opt = webdriver.ChromeOptions()
        # Run headless so no browser window pops up when selenium starts
        opt.add_argument('--headless')
        opt.add_argument('--disable-gpu')
        # 'options' replaces the deprecated 'chrome_options' keyword
        driver = webdriver.Chrome(options=opt)
        # Load the page
        driver.get(url)
        time.sleep(4)
        # The page I take cookies from needs no login, so the cookies can be read directly;
        # Selenium provides get_cookies() for exactly this.
        cookies = driver.get_cookies()
        cookie = {}
        # Scrapy expects cookies as a dict, so convert the list Selenium returns
        for s in cookies:
            cookie[s['name']] = s['value']
        # Shut the browser down once the cookies are collected
        # (quit() also terminates the chromedriver process, unlike close())
        driver.quit()
        return cookie

    def parse(self, response):
        if response.status == 200:
            body = response.text
            item = NetworkStatus()
            soup = bs4(body, 'html.parser')
            # Extract the page data with bs4
            ......
            yield item
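What replaces the ellipsis depends entirely on the target page's markup. Purely as an illustration, extraction with bs4 tends to look like the sketch below; every class name here is a hypothetical placeholder, not the real site's:
# All selectors below are hypothetical placeholders
node = soup.find('div', class_='hashrate')
if node is not None:
    item['Hashrate'] = node.get_text(strip=True)
node = soup.find('div', class_='difficulty')
if node is not None:
    item['Difficulty'] = node.get_text(strip=True)
item['timestamp'] = int(time.time())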
Run the spider:
scrapy crawl NetworkStatus
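Once the crawl finishes, a quick pymongo query confirms the pipeline actually wrote to MongoDB (a sketch using the same connection details as the pipeline above):
import pymongo

client = pymongo.MongoClient(host='127.0.0.1', port=27017)
collection = client['exponent']['NetworkStatus']
print(collection.count_documents({}))  # number of stored items
print(collection.find_one())           # one sample document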
Reference: https://blog.csdn.net/weixin_43430036/article/details/84871624