Notes on crawling website data with Java: XPath and extracting script content

  I've been writing crawlers in Java recently. It isn't sophisticated work: under the hood it boils down to a matching algorithm and some tree algorithms, and the libraries wrap all of that up for you, which makes Java convenient, almost brain-dead simple, and admittedly a bit low-tech. My own efficiency hasn't been great, but I did manage to crawl the search pages of JD, Tmall, Taobao, and Alibaba. Although Tmall, Taobao, and Alibaba belong to the same company, their pages are completely different, and Alibaba was by far the hardest to crawl. I often think that the Spring, Spring Boot, Spring MVC, and MyBatis I taught myself over the past year are all just frameworks calling other people's interfaces; it feels like I haven't really learned anything, and at best I'm a code mover. How to climb out of that well, stop being the frog at the bottom of it, and become a real programmer is a question worth thinking about.

1. How to get XPath support in Java: first, my project layout (screenshot omitted); the classes highlighted in red are the core pieces, the rest are personal extensions you can ignore.

  • Maven project pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.etoak</groupId>
    <artifactId>crawl</artifactId>
    <version>1.0-SNAPSHOT</version>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>7</source>
                    <target>7</target>
                </configuration>
            </plugin>
        </plugins>
    </build>

    <dependencies>
        <dependency>
            <groupId>com.googlecode.juniversalchardet</groupId>
            <artifactId>juniversalchardet</artifactId>
            <version>1.0.3</version>
        </dependency>
        <dependency>
            <groupId>org.kie.modules</groupId>
            <artifactId>org-apache-commons-httpclient</artifactId>
            <version>6.2.0.CR2</version>
            <type>pom</type>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.1</version>
        </dependency>
        <dependency>
            <groupId>commons-lang</groupId>
            <artifactId>commons-lang</artifactId>
            <version>2.6</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.7.25</version>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.10.3</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.47</version>
        </dependency>
        <dependency>
            <groupId>com.vaadin.external.google</groupId>
            <artifactId>android-json</artifactId>
            <version>0.0.20131108.vaadin1</version>
        </dependency>
    </dependencies>

</project>
  • First, JsoupParserUtils.java — a utility class that provides all the XPath-related methods (a short usage sketch follows the listing).
package com.etoak.crawl.util;


import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;


import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.apache.commons.lang.StringUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.select.Elements;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.w3c.dom.Comment;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import com.sun.org.apache.xerces.internal.dom.ElementImpl;


/**
 * XPath parsing utilities built on top of Jsoup.
 *
 * @author liuhh
 */
@SuppressWarnings("restriction")
public class JsoupParserUtils {


 	 protected final static DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();


 	 private final static Logger log = LoggerFactory.getLogger(JsoupParserUtils.class);


 	 private final static XPath xPath = XPathFactory.newInstance().newXPath();


 	 protected static TransformerFactory tf = TransformerFactory.newInstance();


 	 private static final Lock LOCK = new ReentrantLock();


	/**
	 * Returns the number of nodes matched by the xpath under the given element.
	 *
	 * @param ele
	 * @param xpath
	 * @return
	 */
 	 public static int getEleChildNum(final org.jsoup.nodes.Element ele, final String xpath) {
 		 try {
 			 Object res = parse(ele, xpath, XPathConstants.NODESET);
 			 if (null != res && res instanceof NodeList) {
 				 NodeList nodeList = (NodeList) res;
 				 return nodeList == null ? 0 : nodeList.getLength();
 			 }
 		 } catch (Exception e) {
			log.error("Error getting the number of child nodes for xpath: {}, reason: " + e.getMessage(), xpath);
 		 }
 		 return 0;
 	 }


	/**
	 * Checks whether a node matching the xpath exists in the document.
	 *
	 * @param ele
	 * @param xpath
	 * @return
	 */
 	 public static boolean exists(final org.jsoup.nodes.Element ele, final String xpath) {
 		 try {
 			 Object res = parse(ele, xpath, XPathConstants.BOOLEAN);
 			 if (null != res && res instanceof Boolean) {
 				 return (boolean) res;
 			 }
 			 return false;
 		 } catch (Exception e) {
			log.error("Error checking whether xpath: {} exists, reason: " + e.getMessage(), xpath);
 		 }
 		 return false;
 	 }


	/**
	 * Gets the W3C Element matched by the xpath.
	 *
	 * @param ele
	 * @param xpath
	 * @return
	 */
 	 public static ElementImpl getW3cElementImpl(final org.jsoup.nodes.Element ele, final String xpath) {
 		 try {
 			 Object res = parse(ele, xpath, XPathConstants.NODE);
 			 if (null != res && res instanceof ElementImpl) {
 				 return (ElementImpl) res;
 			 }
 			 return null;
 		 } catch (Exception e) {
			log.error("Error getting the W3C Element for xpath: {}, reason: " + e.getMessage(), xpath);
 		 }
 		 return null;
 	 }


	/**
	 * Gets the Jsoup Element matched by the xpath.
	 *
	 * @param ele
	 * @param xpath
	 * @return
	 */
 	 public static org.jsoup.nodes.Element getJsoupElement(final org.jsoup.nodes.Element ele, final String xpath) {
 		 try {
 			 Object res = parse(ele, xpath, XPathConstants.NODE);
 			 if (null != res && res instanceof ElementImpl) {
 				 ElementImpl elementImpl = (ElementImpl) res;
 				 return getJsoupEle(elementImpl);
 			 }
 			 return null;
 		 } catch (Exception e) {
			log.error("Error getting the Jsoup Element for xpath: {}, reason: " + e.getMessage(), xpath);
 		 }
 		 return null;
 	 }


	/**
	 * Gets the Jsoup Elements matched by the xpath.
	 *
	 * @param ele
	 * @param xpath
	 * @return
	 */
 	 public static Elements getJsoupElements(final org.jsoup.nodes.Element ele, final String xpath) {
 		 try {
 			 NodeList nodeList = getNodeList(ele, xpath);
 			 if (null != nodeList && nodeList.getLength() > 0) {
 				 int len = nodeList.getLength();
 				 Elements elements = new Elements();
 				 for (int i = 0; i < len; i++) {
 					 Node node = nodeList.item(i);
 					 if (null != node && node instanceof ElementImpl) {
 						 org.jsoup.nodes.Element element = getJsoupEle(((ElementImpl) node));
 						 elements.add(element);
 					 }
 				 }
 				 return elements;
 			 }
 		 } catch (Exception e) {
			log.error("Error getting the Jsoup Elements for xpath: {}, reason: " + e.getMessage(), xpath);
 		 }
 		 return null;
 	 }


	/**
	 * Extracts a W3C NodeList from a Jsoup Element using the xpath.
	 *
	 * @param ele
	 * @param xpath
	 * @return
	 */
 	 public static NodeList getNodeList(final org.jsoup.nodes.Element ele, final String xpath) {
 		 try {
 			 Object res = parse(ele, xpath, XPathConstants.NODESET);
 			 if (null != res && res instanceof NodeList) {
 				 return (NodeList) res;
 			 }
 		 } catch (Exception e) {
 			 log.error(e.getMessage(), e);
 		 }


 		 return null;
 	 }


	/**
	 * Gets the string value matched by the xpath (a text node or attribute value).
	 *
	 * @param ele
	 * @param xpath
	 * @return
	 */
 	 public static String getXpathString(final org.jsoup.nodes.Element ele, final String xpath) {
 		 try {
 			 int textNum = getEleChildNum(ele, xpath);
 			 if (1 == textNum) {
 				 Object res = parse(ele, xpath, XPathConstants.STRING);
 				 if (null != res) {
 					 return res.toString();
 				 }
 			 } else {
 				 List<String> res = getXpathListString(ele, xpath);
 				 if (res != null && res.size() > 0) {
 					 StringBuilder stringBuilder = new StringBuilder();
 					 for (Iterator<String> iterator = res.iterator(); iterator.hasNext();) {
 						 String text = iterator.next();
 						 if (null != text) {
 							 stringBuilder.append(text.replace("\r\n", "."));
 						 }
 					 }
 					 return stringBuilder.toString();
 				 }
 			 }
 			 return null;
 		 } catch (Exception e) {
 			 e.printStackTrace();
			log.error("Error querying a string for xpath: {}, reason: " + e.getMessage(), xpath);
 		 }
 		 return null;
 	 }


	/**
	 * Queries the list of string values matched by the xpath.
	 *
	 * @param ele
	 * @param xpath
	 * @return
	 */
 	 public static List<String> getXpathListString(final org.jsoup.nodes.Element ele, final String xpath) {
 		 try {
 			 Object res = parse(ele, xpath, XPathConstants.NODESET);
 			 if (null != res && res instanceof NodeList) {
 				 NodeList nodeList = (NodeList) res;
 				 int length = nodeList.getLength();
 				 if (length <= 0) {
 					 return null;
 				 }
 				 List<String> list = new ArrayList<>();
 				 for (int i = 0; i < length; i++) {
 					 Node node = nodeList.item(i);
 					 list.add(null == node ? null : node.getNodeValue());
 				 }
 				 return list;
 			 }
 			 return null;
 		 } catch (Exception e) {
			log.error("Error querying the string list for xpath: {}, reason: " + e.getMessage(), xpath);
 		 }
 		 return null;
 	 }


	/**
	 * Evaluates the xpath against the document and returns the raw result.
	 *
	 * @param doc
	 * @param xPathStr
	 * @param qName
	 * @return
	 */
 	 public static Object parse(final org.jsoup.nodes.Element doc, final String xPathStr, final QName qName) {
 		 Node node = fromJsoup(doc);
 		 return parse(node, xPathStr, qName);
 	 }


 	 /**
	  	  * 
	  	  * @param doc
	  	  * @param xPathStr
	  	  * @param qName
	  	  * @return
	  	  */
 	 public static Object parse(final Node doc, final String xPathStr, final QName qName) {
 		 try {
 			 if (doc == null) {
				log.warn("The document to parse is null!");
 				 return null;
 			 }
 			 if (StringUtils.isBlank(xPathStr)) {
				log.warn("The XPath expression to evaluate is blank!");
 				 return null;
 			 }
 			 if (null == qName) {
				log.warn("The result type (QName) is null!");
 				 return null;
 			 }
 			 try {
 				 LOCK.lock();
 				 Object res = xPath.evaluate(xPathStr, doc, qName);
 				 return res;
 			 } finally {
 				 // TODO: handle finally clause
 				 LOCK.unlock();
 			 }
 		 } catch (Exception e) {
			log.warn("Error evaluating xpath: {}, result type: {}, reason: {}!", xPathStr, qName, e.getMessage());
 		 }
 		 return null;
 	 }


	/**
	 * Converts a W3C ElementImpl into a Jsoup Element.
	 *
	 * @param elementImpl
	 * @return
	 */
 	 public static org.jsoup.nodes.Element getJsoupEle(final ElementImpl elementImpl) {
 		 try {
 			 String value = getW3cDocString(elementImpl);
 			 org.jsoup.nodes.Document document = Jsoup.parse(value);
 			 return document.body().child(0);
 		 } catch (Exception e) {
 			 // TODO: handle exception
			log.error("Error converting an ElementImpl into a Jsoup Element, reason: " + e.getMessage());
 			 return null;
 		 }


 	 }


	/**
	 * Converts a W3C Document into a Jsoup Document.
	 *
	 * @param doc
	 * @return
	 */
 	 public static org.jsoup.nodes.Document fromW3C(final Document doc) throws Exception {
 		 String string = getW3cDocString(doc);
 		 org.jsoup.nodes.Document res = Jsoup.parse(string);
 		 return res;


 	 }


	/**
	 * Converts a Jsoup Document/Element into a W3C Node.
	 *
	 * @param in
	 * @return
	 */
 	 public static Node fromJsoup(final org.jsoup.nodes.Element in) {
 		 DocumentBuilder builder;
 		 try {
 			 if (null == in) {
 				 return null;
 			 }
 			 builder = factory.newDocumentBuilder();
 			 Document out = builder.newDocument();
 			 if (in instanceof org.jsoup.nodes.Document) {
 				 List<org.jsoup.nodes.Node> childs = in.childNodes();
 				 if (childs != null && childs.size() > 0) {
 					 org.jsoup.nodes.Element rootEl = in.child(0);
 					 NodeTraversor traversor = new NodeTraversor(new W3CBuilder(out));
 					 traversor.traverse(rootEl);
 					 return out;
 				 } else {
 					 // out.setNodeValue(in.);
 					 return out;
 				 }
 			 }else if (in instanceof org.jsoup.nodes.Element) {
 				 NodeTraversor traversor = new NodeTraversor(new W3CBuilder(out));
 				 traversor.traverse(in);
 				 return out;
 			 }


 		 } catch (ParserConfigurationException e) {
 			 return null;
 		 }
 		 return null;
 	 }


	/**
	 * Serializes a W3C node to a string.
	 *
	 * @param doc
	 * @return
	 * @throws Exception
	 */
 	 public static String getW3cDocString(final Node doc) throws Exception {
 		 try (StringWriter writer = new StringWriter()) {
 			 DOMSource domSource = new DOMSource(doc);
 			 StreamResult result = new StreamResult(writer);
 			 LOCK.lock();
 			 try {
 				 Transformer transformer = tf.newTransformer();
 				 transformer.transform(domSource, result);
 				 return writer.toString();
 			 } finally {
 				 LOCK.unlock();
 			 }
 		 } catch (TransformerException e) {
 			 throw new IllegalStateException(e);
 		 }
 	 }


	/**
	 * Copies the attributes of a Jsoup node onto a W3C Element.
	 *
	 * @param source
	 * @param el
	 */
 	 public static void copyAttributes(final org.jsoup.nodes.Node source, final Element el) {
 		 for (Attribute attribute : source.attributes()) {
 			 el.setAttribute(attribute.getKey(), attribute.getValue());
 		 }
 	 }


}


class W3CBuilder implements NodeVisitor {
 	 private final Document doc;
 	 private Element dest;


 	 public W3CBuilder(Document doc) {
 		 this.doc = doc;
 	 }


 	 public void head(final org.jsoup.nodes.Node source, int depth) {
 		 if (source instanceof org.jsoup.nodes.Element) {
 			 org.jsoup.nodes.Element sourceEl = (org.jsoup.nodes.Element) source;
 			 Element el = doc.createElement(sourceEl.tagName());
 			 JsoupParserUtils.copyAttributes(sourceEl, el);
 			 if (dest == null) {
 				 doc.appendChild(el);
 			 } else {
 				 dest.appendChild(el);
 			 }
 			 dest = el;
 		 } else if (source instanceof org.jsoup.nodes.TextNode) {
 			 org.jsoup.nodes.TextNode sourceText = (org.jsoup.nodes.TextNode) source;
 			 Text text = doc.createTextNode(sourceText.getWholeText());
 			 dest.appendChild(text);
 		 } else if (source instanceof org.jsoup.nodes.Comment) {
 			 org.jsoup.nodes.Comment sourceComment = (org.jsoup.nodes.Comment) source;
 			 Comment comment = doc.createComment(sourceComment.getData());
 			 dest.appendChild(comment);
 		 } else if (source instanceof org.jsoup.nodes.DataNode) {
 			 org.jsoup.nodes.DataNode sourceData = (org.jsoup.nodes.DataNode) source;
 			 Text node = doc.createTextNode(sourceData.getWholeData());
 			 dest.appendChild(node);
 		 } else {


 		 }
 	 }


 	 public void tail(final org.jsoup.nodes.Node source, int depth) {
 		 if (source instanceof org.jsoup.nodes.Element && dest.getParentNode() instanceof Element) {
 			 dest = (Element) dest.getParentNode();
 		 }
 	 }
}
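Before moving on, here is a minimal usage sketch of this utility class on a small hand-written HTML snippet. The demo class name, the HTML, and the XPath expressions are mine, made up purely for illustration:

package com.etoak.crawl.main;

import com.etoak.crawl.util.JsoupParserUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupParserUtilsDemo {
    public static void main(String[] args) {
        // a tiny hand-written page, just to exercise the utility methods
        String html = "<html><body><ul id='list'>"
                + "<li class='item'>first</li>"
                + "<li class='item'>second</li>"
                + "</ul></body></html>";
        Document doc = Jsoup.parse(html);

        // does a node matching the xpath exist?
        System.out.println(JsoupParserUtils.exists(doc, "//ul[@id='list']"));             // true
        // how many nodes match?
        System.out.println(JsoupParserUtils.getEleChildNum(doc, "//li[@class='item']"));  // 2
        // text of the first item
        System.out.println(JsoupParserUtils.getXpathString(doc, "//li[@class='item'][1]/text()"));
        // iterate all matching items as Jsoup elements
        Elements items = JsoupParserUtils.getJsoupElements(doc, "//li[@class='item']");
        if (items != null) {
            for (Element item : items) {
                System.out.println(item.text());
            }
        }
    }
}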
  • CharsetDetector.java — the charset-detection utility.
/*
 * Copyright (C) 2014 hu
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version 2
 * of the License, or (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
 */
package com.etoak.crawl.util;

import org.mozilla.universalchardet.UniversalDetector;

import java.io.UnsupportedEncodingException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Automatic charset detection.
 *
 * @author hu
 */
public class CharsetDetector {

    // page-encoding detection code borrowed from Nutch
    private static final int CHUNK_SIZE = 2000;

    private static Pattern metaPattern = Pattern.compile(
            "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>",
            Pattern.CASE_INSENSITIVE);
    private static Pattern charsetPattern = Pattern.compile(
            "charset=\\s*([a-z][_\\-0-9a-z]*)", Pattern.CASE_INSENSITIVE);
    private static Pattern charsetPatternHTML5 = Pattern.compile(
            "<meta\\s+charset\\s*=\\s*[\"']?([a-z][_\\-0-9a-z]*)[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // page-encoding detection borrowed from Nutch
    private static String guessEncodingByNutch(byte[] content) {
        int length = Math.min(content.length, CHUNK_SIZE);

        String str = "";
        try {
            str = new String(content, "ascii");
        } catch (UnsupportedEncodingException e) {
            return null;
        }

        Matcher metaMatcher = metaPattern.matcher(str);
        String encoding = null;
        if (metaMatcher.find()) {
            Matcher charsetMatcher = charsetPattern.matcher(metaMatcher.group(1));
            if (charsetMatcher.find()) {
                encoding = new String(charsetMatcher.group(1));
            }
        }
        if (encoding == null) {
            metaMatcher = charsetPatternHTML5.matcher(str);
            if (metaMatcher.find()) {
                encoding = new String(metaMatcher.group(1));
            }
        }
        if (encoding == null) {
            if (length >= 3 && content[0] == (byte) 0xEF
                    && content[1] == (byte) 0xBB && content[2] == (byte) 0xBF) {
                encoding = "UTF-8";
            } else if (length >= 2) {
                if (content[0] == (byte) 0xFF && content[1] == (byte) 0xFE) {
                    encoding = "UTF-16LE";
                } else if (content[0] == (byte) 0xFE
                        && content[1] == (byte) 0xFF) {
                    encoding = "UTF-16BE";
                }
            }
        }

        return encoding;
    }

    /**
     * Guesses the likely charset from a byte array; returns UTF-8 if detection fails.
     *
     * @param bytes the byte array to examine
     * @return the likely charset, or UTF-8 if detection fails
     */
    public static String guessEncodingByMozilla(byte[] bytes) {
        String DEFAULT_ENCODING = "UTF-8";
        UniversalDetector detector = new UniversalDetector(null);
        detector.handleData(bytes, 0, bytes.length);
        detector.dataEnd();
        String encoding = detector.getDetectedCharset();
        detector.reset();
        if (encoding == null) {
            encoding = DEFAULT_ENCODING;
        }
        return encoding;
    }

    /**
     * Guesses the likely charset from a byte array; returns UTF-8 if detection fails.
     * @param content the byte array to examine
     * @return the likely charset, or UTF-8 if detection fails
     */
    public static String guessEncoding(byte[] content) {
        String encoding;
        try {
            encoding = guessEncodingByNutch(content);
        } catch (Exception ex) {
            return guessEncodingByMozilla(content);
        }

        if (encoding == null) {
            encoding = guessEncodingByMozilla(content);
            return encoding;
        } else {
            return encoding;
        }
    }
}

 

  • Page.java — the object returned after a page has been fetched.

 

package com.etoak.crawl.page;


import com.etoak.crawl.util.CharsetDetector;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.UnsupportedEncodingException;

/*
* Page
*   1: holds the relevant content of the fetched response;
* */
public class Page {

    private byte[] content ;
    private String html ;  // page source as a string
    private Document doc  ;// parsed DOM document
    private String charset ;// character encoding
    private String url ;// the page URL
    private String contentType ;// content type


    public Page(byte[] content , String url , String contentType){
        this.content = content ;
        this.url = url ;
        this.contentType = contentType ;
    }

    public String getCharset() {
        return charset;
    }
    public String getUrl(){return url ;}
    public String getContentType(){ return contentType ;}
    public byte[] getContent(){ return content ;}

    /**
     * Returns the page source as a string.
     *
     * @return the page source as a string
     */
    public String getHtml() {
        if (html != null) {
            return html;
        }
        if (content == null) {
            return null;
        }
        if(charset==null){
            charset = CharsetDetector.guessEncoding(content); // guess the character encoding from the content
        }
        try {
            this.html = new String(content, charset);
            return html;
        } catch (UnsupportedEncodingException ex) {
            ex.printStackTrace();
            return null;
        }
    }

    /*
    *  Get the parsed document.
    * */
    public Document getDoc(){
        if (doc != null) {
            return doc;
        }
        try {
            this.doc = Jsoup.parse(getHtml(), url);
            return doc;
        } catch (Exception ex) {
            ex.printStackTrace();
            return null;
        }
    }


}

 

  • RequestAndResponseTool.java (its sendRequstAndGetResponse method) — the class that sends the URL request.
     Note that I add request headers at the //zz comments; without them you will fail to crawl a great many sites.
    package com.etoak.crawl.page;
    
    import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.HttpException;
    import org.apache.commons.httpclient.HttpStatus;
    import org.apache.commons.httpclient.methods.GetMethod;
    import org.apache.commons.httpclient.params.HttpMethodParams;
    
    import java.io.IOException;
    
    public class RequestAndResponseTool {
    
    
        public static Page  sendRequstAndGetResponse(String url) {
            Page page = null;
            // 1. Create the HttpClient object and set its parameters
            HttpClient httpClient = new HttpClient();
            // set the HTTP connection timeout to 5s
            httpClient.getHttpConnectionManager().getParams().setConnectionTimeout(5000);
            // 2. Create the GetMethod object and set its parameters
            GetMethod getMethod = new GetMethod(url);
            //zz set the request headers
            getMethod.setRequestHeader("user-agent","Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36");
            //zz set the encoding, Content-Encoding: gzip
            //getMethod.setRequestHeader("Content-Encoding","GBKs");
            // set the GET request timeout to 5s
            getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 5000);
            // set up request retry handling
            getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER, new DefaultHttpMethodRetryHandler());
            // 3. Execute the HTTP GET request
            try {
                int statusCode = httpClient.executeMethod(getMethod);
            // check the response status code
                if (statusCode != HttpStatus.SC_OK) {
                    System.err.println("Method failed: " + getMethod.getStatusLine());
                }
            // 4. Process the HTTP response content
                byte[] responseBody = getMethod.getResponseBody();// read the body as a byte array
                String contentType = getMethod.getResponseHeader("Content-Type").getValue(); // content type of the response
                page = new Page(responseBody,url,contentType); // wrap it in a Page
            } catch (HttpException e) {
            // fatal exception: the protocol may be wrong or the response content malformed
                System.out.println("Please check your provided http address!");
                e.printStackTrace();
            } catch (IOException e) {
            // network error
                e.printStackTrace();
            } finally {
            // release the connection
                getMethod.releaseConnection();
            }
            return page;
        }
    }
    

     

  • Write a main method to test it
package com.etoak.crawl.main;

import com.etoak.crawl.page.Page;
import com.etoak.crawl.page.RequestAndResponseTool;
import com.etoak.crawl.util.JsoupParserUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.List;

/**
 * Too young too simple,sometimes naive
 *
 * @author zhang zhuang
 * @E-mail 418665906@qq.com
 * @Date 2019/3/30 20:07
 */
public class sadas {

    public static void main(String[] args) throws Exception {
        Page page = RequestAndResponseTool.sendRequstAndGetResponse("https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&wq=s&pvid=7e72beb42fbe499aa4d76b684082477d");
        Document doc = Jsoup.parse(page.getHtml());
        System.out.println(doc.body());
        Element element = JsoupParserUtils.getJsoupElement(doc, ""); // fill in a real XPath here; "" is just a placeholder

    }
}

Result: the program prints the parsed page body to the console (screenshot of the output omitted).

Now for some crawling tips.

My way of working:

1. First, find the pattern in the URLs. Crawling data always means crawling a whole series of pages; for example, to refresh the database of my own e-commerce project I once crawled more than a thousand records from JD (screenshot omitted).

It all comes from reading the URL. For example:

https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=s&cid2=653&cid3=655&page=3&s=57&click=0

Every parameter in JD's GET URL is useful. keyword is obviously the search keyword and enc=utf-8 is the encoding; page is the page number, but when I clicked page 2 the URL showed page=3. Browse a little more and the reason becomes clear: the page lazy-loads its second half, so each visible page spans two page values. Watching the s parameter for a while, I found it tracks the number of items shown per page. Once that clicked, I rewrote the URL like this to avoid the dynamic loading: 'https://search.jd.com/Search?keyword=%E8%A1%A3%E6%9C%8D&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&stock=1&page='+str(n)+'&s='+str(1+(n-1)*30)+'&click=0&scrolling=y'

Here n is the page number and s follows from the per-page item count; I had observed that clicking page 1 already lazy-loads page 2, and by setting s myself I found I could keep that loading from happening.
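For reference, here is what that URL construction looks like in Java. The parameter names and the 1 + (n - 1) * 30 offset formula come from my observations above, so treat them as assumptions about JD's search page rather than a documented interface:

import java.net.URLEncoder;

public class JdUrlBuilder {

    /** Builds the search URL for page n, using the observed s offset to avoid the lazy-loaded half pages. */
    public static String buildUrl(String keyword, int n) throws Exception {
        String kw = URLEncoder.encode(keyword, "UTF-8");
        int s = 1 + (n - 1) * 30;   // offset of the first item on page n, as observed
        return "https://search.jd.com/Search?keyword=" + kw
                + "&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&stock=1"
                + "&page=" + n + "&s=" + s + "&click=0&scrolling=y";
    }

    public static void main(String[] args) throws Exception {
        for (int n = 1; n <= 3; n++) {
            System.out.println(buildUrl("衣服", n));
        }
    }
}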

2. Split up the page you crawled. The principle is to first select one large block that covers all the data you need, and then traverse its pieces one by one.

For example:

In the (omitted) screenshot, what I need is all the information in one product block, so I first select the block above it, because that parent block contains every product on the page, and then:

List<Element> elements = JsoupParserUtils.getJsoupElements(doc, "your own xpath"); — then iterate that list and apply further XPath expressions to each element to pull out the information of every product block.
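As a sketch of that loop (the container XPath and the two relative XPaths below are placeholders I made up; replace them with expressions that match the HTML you actually fetched):

package com.etoak.crawl.main;

import com.etoak.crawl.page.Page;
import com.etoak.crawl.page.RequestAndResponseTool;
import com.etoak.crawl.util.JsoupParserUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ProductListDemo {
    public static void main(String[] args) {
        Page page = RequestAndResponseTool.sendRequstAndGetResponse(
                "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8");
        Document doc = Jsoup.parse(page.getHtml());

        // the outer block that yields one node per product (placeholder XPath)
        Elements items = JsoupParserUtils.getJsoupElements(doc, "//li[@class='gl-item']");
        if (items != null) {
            for (Element item : items) {
                // drill further into each product block with relative XPaths (also placeholders)
                String name = JsoupParserUtils.getXpathString(item, ".//div[@class='p-name']//em/text()");
                String price = JsoupParserUtils.getXpathString(item, ".//div[@class='p-price']//i/text()");
                System.out.println(name + " : " + price);
            }
        }
    }
}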

3. A tip: if you looked at the screenshots carefully you may have noticed I was using a tool, a Chrome extension for XPath; download it yourself (you may need a proxy to reach the Chrome Web Store). One warning, though: the XPath it suggests is often not quite right, because it is based on the HTML as the browser has parsed it, so you must use

System.out.println(page.getHtml());

to inspect the HTML you actually crawled and adjust the expression before you can get the information you want.
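A trivial way to do that inspection is to dump what the crawler actually received into a file and compare it with the browser's Elements panel; a small sketch:

package com.etoak.crawl.main;

import com.etoak.crawl.page.Page;
import com.etoak.crawl.page.RequestAndResponseTool;

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DumpHtmlDemo {
    public static void main(String[] args) throws Exception {
        Page page = RequestAndResponseTool.sendRequstAndGetResponse(
                "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8");
        // save the HTML the crawler saw, so it can be diffed against what the browser shows
        Files.write(Paths.get("fetched.html"), page.getHtml().getBytes(StandardCharsets.UTF_8));
        System.out.println("saved " + page.getHtml().length() + " characters to fetched.html");
    }
}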

(2) Another thing to note: on some pages, Taobao for example, the data you want sits in a JSON blob inside a script tag.

Use

List<Element> elements1 = doc.getElementsByTag("script");
Elements e = doc.getElementsByTag("script").eq(7);

to grab the script elements and their data, then parse the JSON; the details of parsing are left for you to look up.
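As a sketch of that extraction with fastjson (already declared in the pom): the g_page_config marker below is what Taobao's search page used at the time I looked, and the brace-slicing is a crude heuristic, so verify both against the script content you actually fetch.

package com.etoak.crawl.main;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ScriptJsonDemo {

    /** Pulls the JSON assigned to g_page_config out of the page's script tags, or returns null. */
    public static JSONObject extractPageConfig(Document doc) {
        for (Element script : doc.getElementsByTag("script")) {
            String data = script.data();                 // raw script content, not rendered text
            int idx = data.indexOf("g_page_config");
            if (idx < 0) {
                continue;
            }
            int start = data.indexOf('{', idx);          // JSON starts at the first '{' after the marker
            int end = data.indexOf("};", start);         // crude: assume the assignment ends with "};"
            if (start >= 0 && end > start) {
                return JSON.parseObject(data.substring(start, end + 1));
            }
        }
        return null;
    }
}

You would call extractPageConfig(page.getDoc()) on the Taobao search page you fetched and then drill into the returned JSONObject for the item list.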
