Amazon Crawler

Latest recommended article published on 2024-07-29 17:11:52

疯狂为赵天宇打call的KOU桑

Article link: https://blog.csdn.net/gmr2453929471/article/details/49312879

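This post records the difficulties the author ran into while implementing an Amazon crawler, including choosing a programming language, problems using HttpUnit and jwebunit, and the challenges of trying Selenium and PhantomJS. In the end, the author switched to Jsoup and found that the bookDesc information in the raw page sits inside a <noscript> tag, which suggested how to proceed.

Summary generated by CSDN using AI technology.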

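The OOD course project is to write a library management system.

It looks simple, but there are so many things to decide that it is driving an indecisive person like me crazy!

I planned to write the crawler in Python and call it from Java, but on closer inspection that turned out to be a real hassle.

Jython does not support many third-party libraries, so I never intended to use it. I also came across something like Process proc = Runtime.getRuntime().exec(...), but passing arguments that way seemed awkward. I have never done socket communication between a Java process and a Python process. I had thought about having Java write to a file that Python then reads, but that still felt like too much trouble. What to do? After agonizing over it all morning, I decided to write the crawler in Java after all, and once I started it turned out to be quite simple.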
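For reference, passing arguments to a Python script from Java does not have to be painful. Below is a minimal sketch using ProcessBuilder from the standard library, where each argument is its own list element, so spaces need no quoting. The class name PythonLauncher, the script name crawler.py and the keyword are all made up for illustration, and it assumes a python executable on the PATH. This is not what I ended up using.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PythonLauncher {

    // Builds the command line: interpreter, script path, then each argument as its own element.
    public static List<String> buildCommand(String interpreter, String script, String... args) {
        List<String> command = new ArrayList<>();
        command.add(interpreter);
        command.add(script);
        command.addAll(Arrays.asList(args));
        return command;
    }

    // Runs the command and returns everything the child process printed.
    public static String run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true); // merge stderr into stdout so one reader drains both
        Process proc = pb.start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(proc.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        int exitCode = proc.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command exited with code " + exitCode + ": " + out);
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical script name and search keyword, for illustration only.
        List<String> command = buildCommand("python", "crawler.py", "Design Patterns");
        System.out.println(command);
        // System.out.println(run(command)); // uncomment once crawler.py exists
    }
}
```

The Python side would read the keyword from sys.argv and print its result to stdout, which run() collects. That avoids both sockets and the temporary file.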

First, Jsoup serves as the Java counterpart to BeautifulSoup.

A blog post on how to use it: http://blog.csdn.net/xcy13638760/article/details/20996167

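    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // Fetch the Amazon search-result page and parse it.
    Document doc = Jsoup.connect(amazonURL).get();
    // First search result: the link to the book's detail page.
    Element e1 = doc.select("[class=a-link-normal s-access-detail-page  a-text-normal]").first();
    if (e1 == null) {
        return; // no search result found
    }
    String jumpURL = e1.attr("href");
    String bname = e1.attr("title");
    // The author is the second link with this exact class attribute.
    Elements links = doc.select("[class=a-link-normal a-text-normal]");
    String author = links.size() > 1 ? links.get(1).text() : "";
    // The price element may be missing, so guard against null.
    Element e3 = doc.select("[class=a-size-small a-color-secondary a-text-strike]").first();
    String price = (e3 != null) ? e3.text() : "";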

For pages whose JavaScript needs to be executed, there is a tool called HttpUnit.

For usage, see http://www.httpunit.org/doc/manual/index.html

When you unpack the HttpUnit distribution, you should find the following directory layout:

httpunit
   +--- jars // contains jars required to build, test, and run HttpUnit
   |
   +--- lib  // contains the HttpUnit jar
   |
   +--- doc  // contains documentation
   |      |
   |      +--- tutorial  // a brief tutorial in test-first development of a servlet-based web site
   |      |
   |      +--- api       // the javadoc
   |      |
   |      +--- manual    // this user manual
   |
   +--- examples // some example programs written with HttpUnit
   |
   +--- src      // the HttpUnit source code
   |
   +--- test     // unit tests for HttpUnit - a good source for more examples

Only the lib and jars directories are required to run HttpUnit.

------------------------------------------------

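    import com.meterware.httpunit.*;

    // HttpUnit attempt: fetch the page and look up the description iframe by id.
    WebConversation wc = new WebConversation();
    HttpUnitOptions.setScriptingEnabled(false); // turn off HttpUnit's built-in JavaScript engine
    WebResponse resp = wc.getResponse(amazonURL);
    HTMLElement e4 = resp.getElementWithID("bookDesc_iframe");
    // Left over from other attempts; these do not compile alongside the HttpUnit code above:
    // System.out.println(tester.getServerResponse());      // apparently jWebUnit; tester is undefined here
    // WebClient wc = new WebClient(BrowserVersion.CHROME);  // HtmlUnit class; wc is already declared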

Damn it, this thing just doesn't work.

org.mozilla.javascript.EcmaError: TypeError: Cannot find function createElement. (httpunit#8)
    at org.mozilla.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3229)
    at org.mozilla.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3219)
    at org.mozilla.javascript.ScriptRuntime.typeError(ScriptRuntime.java:3235)
    at org.mozilla.javascript.ScriptRuntime.typeError1(ScriptRuntime.java:3247)
    at org.mozilla.javascript.ScriptRuntime.notFunctionError(ScriptRuntime.java:3307)
    at org.mozilla.javascript.ScriptRuntime.getPropFunctionAndThis(ScriptRuntime.java:1991)
    at org.mozilla.javascript.Interpreter.interpretLoop(Interpreter.java:2932)
    at script(httpunit:8)
    at script(httpunit:7)
    at script(httpunit:57325)