Logging in to websites and scraping information with Python: an introduction to modules for website scraping, login, and posting

Latest recommended article published on 2021-02-21 08:32:26

AnnalineYuuki

Article tags: logging in to websites and scraping information with Python

Article link: https://blog.csdn.net/weixin_33309848/article/details/112904765

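1. Use cases

For a detailed description of Selenium, see its documentation. Here we use Python + Selenium Remote Control (RC) + Firefox to implement the following typical features:

1) Screen scraping: the program automatically saves the image that a web page shows in the browser as a picture, similar to the page thumbnails on digg-style sites. There are two kinds of screen scraping: capturing only the visible area of the current browser page (for example, the google.com home page), and capturing the complete page when the page scrolls (for example, the www.sina.com.cn home page spans several screens and has to be saved in full).

2) Retrieving content generated by JavaScript

For example, suppose a program should automatically crawl and download all the songs in Baidu's New Songs TOP100. Taking the download of "抱紧你" by Elva Hsiao (萧亚轩) as the example, the rough steps are:

a) Open the Baidu New Songs TOP100 page, http://list.mp3.baidu.com/top/top100.html, and obtain the search URL listed after each song, either by regular-expression matching or by parsing the page with an HTML parser such as mechanize or Beautiful Soup.

b) On the search results page, take the URL of the first result and follow it to the mp3's actual download page.

c) On the actual download page, parse the HTML content; you will find that the mp3's actual download address is empty.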

The actual download address is set by a JavaScript script:

    var encurl = "…", newurl = "";
    var urln_obj = G("urln"), urla_obj = G("urla");
    newurl = decode(encurl);
    urln_obj.href = urla_obj.href = song_1287289709 = newurl;

where the function G(str) is:

    function G(str) {
        return document.getElementById(str);
    };

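So parsing the page directly does not yield the download address. Python has to drive a browser engine that executes the JavaScript, and only then can it read the resulting download address.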
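The effect can be reproduced without a browser. The following is a minimal sketch in Python 3 using only the standard library; `SAMPLE_PAGE` is a hypothetical, simplified stand-in for the download page (not the real Baidu markup), and `HrefById` and `static_href` are illustrative helper names. It shows that a static parser only ever sees the empty `href` that ships in the HTML, because the script that fills it in is never executed.

```python
from html.parser import HTMLParser

# Hypothetical, simplified stand-in for the download page: the anchor with
# id "urla" ships with an empty href, and a script fills it in at run time.
# The "..." in encurl is a placeholder, not real data.
SAMPLE_PAGE = """
<html><body>
<a id="urla" href="">download</a>
<script>
var encurl = "...", newurl = "";
document.getElementById("urla").href = decode(encurl);
</script>
</body></html>
"""


class HrefById(HTMLParser):
    """Record the href attribute of the <a> element with a given id."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.href = None  # stays None if no matching element is found

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("id") == self.element_id:
            self.href = attrs.get("href")


def static_href(html, element_id):
    """Return the href as written in the HTML source; no script is run."""
    parser = HrefById(element_id)
    parser.feed(html)
    return parser.href


if __name__ == "__main__":
    # The parser finds the element, but its href is still the empty string.
    print(repr(static_href(SAMPLE_PAGE, "urla")))
```

Any purely static approach (regular expressions, Beautiful Soup, and so on) hits the same limit, which is why the rest of this article hands the page to a real browser.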

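2. Selenium RC basics

The official documentation describes Selenium RC's operating mechanism and architecture in detail.

Selenium RC consists of two main parts, Selenium Server and the client libraries:

Selenium Server corresponds to the selenium-server-xx directory in the Selenium RC package, where xx is the version number.

Selenium RC provides client drivers for Java, Python, Ruby, Perl, .NET, PHP, and other languages, in the following directories: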

selenium-dotnet-client-driver-xx

selenium-java-client-driver-xx

selenium-perl-client-driver-xx

selenium-php-client-driver-xx

selenium-python-client-driver-xx

selenium-ruby-client-driver-xx

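A program written in Python or another supported language calls the client driver to issue browser commands (for example, open a specified URL), and the client driver passes the commands to Selenium Server. Selenium Server receives, parses, and executes the Selenium commands sent by the client, translates them into commands for the particular browser, and then calls the corresponding browser API to perform the actual browser operation.

In effect, Selenium Server acts as an HTTP proxy between the client program and the browser.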
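To make this division of labor more concrete, here is a rough sketch of how a client driver might encode a single command before sending it to Selenium Server. It rests on an assumption about the RC wire format (a form-encoded `cmd`, numbered arguments `1`, `2`, ..., and a `sessionId`, sent to `/selenium-server/driver/`); check the client driver source before relying on it. `encode_command` and `DRIVER_PATH` are hypothetical names for illustration and are not part of the client driver.

```python
from urllib.parse import urlencode

# Assumed endpoint on the Selenium Server; treat this as illustrative.
DRIVER_PATH = "/selenium-server/driver/"


def encode_command(verb, args, session_id=None):
    """Build the form-encoded body for one command (hypothetical helper)."""
    params = [("cmd", verb)]
    # Arguments are passed positionally under the keys "1", "2", ...
    for position, arg in enumerate(args, start=1):
        params.append((str(position), arg))
    if session_id is not None:
        params.append(("sessionId", session_id))
    return urlencode(params)


if __name__ == "__main__":
    # "abc123" is a made-up session id used only for this illustration.
    print("POST http://localhost:4444" + DRIVER_PATH)
    print(encode_command("open", ["/"], session_id="abc123"))
```

Seen this way, a call such as `sel.open('/')` in the client is just one small HTTP request; everything browser-specific happens on the server side.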

3. Example:

1) Download Selenium RC from http://seleniumhq.org/download/. The tests here use selenium-remote-control-1.0.3.zip.

2) Unzip selenium-remote-control-1.0.3.zip.

3) Run Selenium Server:

    cd selenium-remote-control-1.0.3\selenium-server-1.0.3

    java -jar selenium-server.jar

Selenium Server listens on port 4444 by default; the default is set in org.openqa.selenium.server.RemoteControlConfiguration.

4) Test code:

    # coding=gbk
    from selenium import selenium  # Python client driver shipped with Selenium RC

    def selenium_init(browser, url, para):
        # Connect to the Selenium Server on localhost:4444 and launch the browser
        sel = selenium('localhost', 4444, browser, url)
        sel.start()
        # Set the 60 s timeout before open() so that it applies to the page load
        sel.set_timeout(60000)
        sel.open(para)
        sel.window_focus()
        sel.window_maximize()
        return sel

    def selenium_capture_screenshot(sel):
        # Capture what is currently visible on screen
        sel.capture_screenshot("d:\\singlescreen.png")

    def selenium_get_value(sel):
        # Evaluate JavaScript in the page to read values that the script filled in
        innertext = sel.get_eval("this.browserbot.getCurrentWindow().document.getElementById('urla').innerHTML")
        url = sel.get_eval("this.browserbot.getCurrentWindow().document.getElementById('urla').href")
        print("The innerHTML is :" + innertext + "\n")
        print("The url is :" + url + "\n")

    def selenium_capture_entire_page_screenshot(sel):
        # Capture the complete page, including the parts that need scrolling
        sel.capture_entire_page_screenshot("d:\\entirepage.png", "background=#CCFFDD")

    if __name__ == "__main__":
        # Read the JavaScript-generated download link, then take a screenshot
        sel1 = selenium_init('*firefox3', 'http://202.108.23.172', '/m?word=mp3,http://www.slyizu.com/mymusic/VnV5WXtqXHxiV3ZrWnpnXXdrWHhrW3h9VnRkWXZtXHp1V3loWnlrXXZlMw$$.mp3,,[%B1%A7%BD%F4%C4%E3+%CF%F4%D1%C7%D0%F9]&ct=134217728&tn=baidusg,%B1%A7%BD%F4%C4%E3%20%20&si=%B1%A7%BD%F4%C4%E3;;%CF%F4%D1%C7%D0%F9;;0;;0&lm=16777216&sgid=1')
        selenium_get_value(sel1)
        selenium_capture_screenshot(sel1)
        sel1.stop()

        # Capture the full, multi-screen sina.com.cn home page
        sel2 = selenium_init('*firefox3', 'http://www.sina.com.cn', '/')
        selenium_capture_entire_page_screenshot(sel2)
        sel2.stop()
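Two things can go wrong when running the script above: Selenium Server may not be running, and an exception between start() and stop() leaves the browser open. The following is a minimal sketch in Python 3 of two helpers that could guard against both. `server_is_listening` and `rc_session` are hypothetical names, not part of Selenium RC, and the sketch assumes only that the session object has the start(), open(), and stop() methods used above.

```python
import socket
from contextlib import contextmanager


def server_is_listening(host="localhost", port=4444, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


@contextmanager
def rc_session(sel, path):
    """Start `sel`, open `path`, and always stop the browser afterwards."""
    sel.start()
    try:
        sel.open(path)
        yield sel
    finally:
        # Runs even if open() or the body of the with-block raises
        sel.stop()


# Possible usage with the RC client from the script above
# (needs a running Selenium Server, so it is left as a comment):
#
# if server_is_listening():
#     sel = selenium('localhost', 4444, '*firefox3', 'http://www.sina.com.cn')
#     with rc_session(sel, '/') as s:
#         s.capture_entire_page_screenshot("d:\\entirepage.png", "background=#CCFFDD")
```

The port check only confirms that something accepts connections on 4444; it does not verify that the listener is actually Selenium Server.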

A few notes:

1) selenium-remote-control-1.0.3/selenium-python-client-driver-1.0.1/doc/selenium.selenium-class.html documents the commands that Selenium supports; it is worth spending some time reading it.

2) In __init__(self, host, port, browserStartCommand, browserURL), browserStartCommand selects the browser to use. Selenium currently supports the following values:

*firefox

*mock

*firefoxproxy

*pifirefox

*chrome

*iexploreproxy

*iexplore

*firefox3

*safariproxy

*googlechrome

*konqueror

*firefox2

*safari

*piiexplore

*firefoxchrome

*opera

*iehta

*custom

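3) capture_entire_page_screenshot currently supports only Firefox and IE.

With Firefox, capture_entire_page_screenshot is straightforward: no special setup is needed and Selenium handles it automatically. Firefox is therefore the recommended browser if you use capture_entire_page_screenshot.

IE must run in non-HTA mode (browserStartCommand set to *iexploreproxy), and the toolkit from http://snapsie.sourceforge.net/ must be installed. For details, see the article "Using captureEntirePageScreenshot with Selenium".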