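Scrapy cannot execute JavaScript by itself, so it is stuck when a page builds its content with AJAX or other JS scripts. Several articles online suggest using Webkit as the Downloader, so I decided to look into it.
Related article: http://www.gnu.org/software/pythonwebkit/
Related article: Combining scrapy with webkit to scrape JS-generated pages (http://blog.mdcsoft.cn/archives/201111/707.html)
The first step is to install python-webkit, so let's see what python-webkit actually is: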
The Python Webkit DOM Project makes python a full peer of javascript when it comes to accessing and manipulating the full features available to Webkit, such as HTML5. Everything that can be done with javascript, such as getElementsbyTagName and appendChild, event callbacks through onclick, timeout callbacks through window.setTimeout, and even AJAX using XMLHttpRequest, can also be done from python.
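In short:
Python Webkit puts Python on an equal footing with JavaScript inside Webkit. Anything JavaScript can do there can also be done from Python, for example getElementsByTagName, appendChild, onclick event callbacks, window.setTimeout timeout callbacks, and even AJAX through XMLHttpRequest.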
What is Python-Webkit?
Python-Webkit is a python extension to Webkit to add full, complete access to Webkit's DOM - Document Object Model. On its own, however, Python-Webkit doesn't actually do anything, because it is only through WebkitDFB, WebkitGTK or WebkitQt4 that Webkit "Document Objects" are actually created (and displayed, on-screen). Thus it is necessary to make a small patch to each of PyWebkitGTK and PyWebkitQt4, to "break out" access to the DOM, but for WebkitDFB, as it is very new, has its own c-based python module, included as part of PythonWebkit. Both PyWebkitDFB and PyWebkitGTK have been done, already.
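In short:
Python-Webkit is a Python extension to Webkit that gives complete access to Webkit's DOM. On its own it does nothing, though: DOM objects are only created and displayed through WebkitDFB, WebkitGTK or WebkitQt4. PyWebkitGTK and PyWebkitQt4 each need a small patch to expose the DOM, while WebkitDFB ships its own C-based Python module as part of PythonWebkit.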
WebKitDFB is WebKit on top of DirectFB without using GTK+ or Qt
How is Python-Webkit used?
Here is a simple example of modifying that script to show the addition of a button, a text node and even event handling. To any Web Developer who has used javascript, this should look incredibly familiar:

    def _button_click_event(self, event):
        print "button click", event

    def _mouse_over_event(self, event):
        print "mouse over", event, event.x, event.y

    def _view_load_finished_cb(self, view, frame):
        doc = frame.get_dom_document()
        nodes = doc.getElementsByTagName('body')
        body = nodes.item(0)
        d = doc.createElement("div")
        b = doc.createElement("Button")
        b.innerHTML = "hello"
        b.onclick = self._button_click_event
        d.appendChild(b)
        txt = doc.createTextNode("hello world")
        body.appendChild(txt)
        body.appendChild(d)
        body.tabIndex = 5
        #body.addEventListener("mouseover", self._mouse_over_event, False)
        body.onmouseover = self._mouse_over_event
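In short:
The example modifies an existing script so that the loaded page gets an extra button and a text node, with event handling attached.
Code walkthrough:
1. Get the DOM document from the frame (frame.get_dom_document()).
2. Get the element with the tag name "body" (getElementsByTagName, then item(0)).
3. Create the new nodes and insert them into body (createElement, createTextNode, appendChild).
4. Attach the event handlers by assigning to onclick and onmouseover; the equivalent addEventListener call appears in the code only as a commented-out line.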
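The first three steps above can be tried without installing python-webkit. The sketch below is only a stand-in: it uses the standard library's xml.dom.minidom, which happens to expose the same W3C DOM method names, and the PAGE markup and the add_button helper are made up for illustration. minidom has no innerHTML and no event model, so the button label is added as a text node and step 4 is left out.

```python
from xml.dom import minidom

# Hypothetical markup standing in for the page WebKit would have loaded.
PAGE = "<html><body><p>original</p></body></html>"


def add_button(doc):
    """Repeat steps 2 and 3 on an already-parsed DOM document."""
    # Step 2: find the <body> element.
    body = doc.getElementsByTagName("body").item(0)

    # Step 3: build <div><Button>hello</Button></div> and a text node.
    div = doc.createElement("div")
    button = doc.createElement("Button")
    # minidom has no innerHTML, so the label is an explicit text node.
    button.appendChild(doc.createTextNode("hello"))
    div.appendChild(button)

    body.appendChild(doc.createTextNode("hello world"))
    body.appendChild(div)
    return body


# Step 1: obtain the DOM (here by parsing a string, not from a WebKit frame).
doc = minidom.parseString(PAGE)
body = add_button(doc)
print(body.toxml())
```

With python-webkit the same calls would run against the live document returned by frame.get_dom_document(), so the change would show up in the rendered page rather than in a printed string.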
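Python pulls the DOM out of Webkit and modifies it on the fly, and Webkit then displays the result. My guess is that the scrapy integration works in a similar way.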
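One way this guess could look on the Scrapy side is a downloader middleware that hands back already-rendered HTML. This is a rough sketch under several assumptions, not the method used in the articles linked earlier: the class name JsRenderMiddleware and the render callable (URL in, rendered HTML string out) are hypothetical, and a real render would have to load the page in Webkit and serialize the DOM once loading finishes. The process_request(request, spider) signature and HtmlResponse follow Scrapy's classic downloader middleware interface.

```python
class JsRenderMiddleware:
    """Downloader middleware sketch: answer requests with JS-rendered HTML."""

    def __init__(self, render, response_cls=None):
        # render: hypothetical callable, url -> rendered HTML string.
        self.render = render
        # response_cls is injectable so the sketch can be exercised
        # without Scrapy installed; by default HtmlResponse is used.
        self.response_cls = response_cls

    def process_request(self, request, spider):
        response_cls = self.response_cls
        if response_cls is None:
            from scrapy.http import HtmlResponse  # requires Scrapy
            response_cls = HtmlResponse

        html = self.render(request.url)
        # Returning a response here tells Scrapy to skip its own download
        # and pass this response on to the spider instead.
        return response_cls(
            request.url,
            body=html.encode("utf-8"),
            encoding="utf-8",
            request=request,
        )
```

In a real project the middleware would be enabled through the DOWNLOADER_MIDDLEWARES setting, and Scrapy builds middleware instances itself, so the render callable would have to be supplied some other way than a constructor argument (for example via a from_crawler classmethod).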