Automatically fetching and parsing a product page

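  • The example here is the adidas US website.
  • Input: a product URL. Output: the product information (title, description, images, etc.) and the attribute information (color, size, price, stock, skuId).
  • The approach is simple: open the page and work out which tags hold each piece of content you need.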
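To make the target of the scrape concrete, the fields listed above could be collected into a small data model. This is only a sketch: the class names `ProductInfo` and `SkuInfo`, their field names, and the choice to key variants by skuId are all assumptions for illustration and do not come from the original code. The values in `main` are made-up placeholders.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical container for one purchasable variant (illustrative names only).
class SkuInfo {
    final String skuId;
    final String color;
    final String size;
    final BigDecimal price;
    final int stock;

    SkuInfo(String skuId, String color, String size, BigDecimal price, int stock) {
        this.skuId = skuId;
        this.color = color;
        this.size = size;
        this.price = price;
        this.stock = stock;
    }
}

// Hypothetical container for the page-level product information.
class ProductInfo {
    final String title;
    final String description;
    final List<String> imageUrls = new ArrayList<>();
    // Keyed by skuId; LinkedHashMap keeps the order in which variants were added.
    final Map<String, SkuInfo> skus = new LinkedHashMap<>();

    ProductInfo(String title, String description) {
        this.title = title;
        this.description = description;
    }

    void addSku(SkuInfo sku) {
        skus.put(sku.skuId, sku);
    }

    // True if at least one variant has a positive stock count.
    boolean hasStock() {
        for (SkuInfo sku : skus.values()) {
            if (sku.stock > 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ProductInfo product = new ProductInfo("Sample shoe", "Placeholder description");
        product.addSku(new SkuInfo("SKU001", "Black", "9", new BigDecimal("59.99"), 0));
        product.addSku(new SkuInfo("SKU002", "White", "9", new BigDecimal("59.99"), 3));
        System.out.println(product.skus.size() + " variants, in stock: " + product.hasStock());
    }
}
```

Each parsing step would then fill in one part of this structure: the page-level selectors populate `ProductInfo`, and the per-variant selectors populate the `SkuInfo` entries.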

Fetching the page

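// Sends a POST request and parses the response body into a Jsoup Document.
public static Document getHttpPostResponseWithDocument(String url, String referrer, List<NameValuePair> params,
                                                       DecompressingHttpClient httpClient) throws IOException {
    HttpResponse response = getHttpPostResponse(url, referrer, params, httpClient);
    try {
        // Decode the body as UTF-8 and let Jsoup build the DOM.
        return Jsoup.parse(EntityUtils.toString(response.getEntity(), "UTF-8"));
    } finally {
        // Release the connection even if reading or parsing fails.
        EntityUtils.consume(response.getEntity());
    }
}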

// Sends a GET request; the Referer header is set only when a referrer is supplied.
public static HttpResponse getHttpGetResponse(String url, String referrer, DecompressingHttpClient httpClient) throws IOException {
    HttpGet get = new HttpGet(url);
    setHeaders(get); // common request headers (helper not shown here)
    if (StringUtils.isNotBlank(referrer)) {
        get.setHeader("Referer", referrer);
    }
    return httpClient.execute(get);
}
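For readers who want to try the same request setup without Apache HttpClient, here is a minimal sketch using only the JDK's `HttpURLConnection`. It is not the original author's code: the class name `PageRequest`, the method `prepareGet`, the User-Agent string, the timeouts, and the example.com URLs are all assumptions. The method only prepares the request; nothing is sent until the response is read.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

class PageRequest {
    // Placeholder value; the original setHeaders(...) helper is not shown, so this is an assumption.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; example-crawler)";

    // Prepares a GET request. The Referer header is added only for a non-blank referrer.
    static HttpURLConnection prepareGet(String url, String referrer) {
        try {
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setRequestProperty("User-Agent", USER_AGENT);
            if (referrer != null && !referrer.trim().isEmpty()) {
                conn.setRequestProperty("Referer", referrer);
            }
            // Assumed timeouts (10 s each) so a slow server cannot hang the crawler.
            conn.setConnectTimeout(10_000);
            conn.setReadTimeout(10_000);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder URLs; no network traffic happens here because the response is never read.
        HttpURLConnection conn = prepareGet("https://example.com/product", "https://example.com/");
        System.out.println("Referer header: " + conn.getRequestProperty("Referer"));
    }
}
```

Reading `conn.getInputStream()` would actually send the request; the resulting body could then be handed to `Jsoup.parse` in the same way as in the methods above.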

Checking whether the product is in stock

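// The product counts as in stock only if the ".addtocart" block exists
// and its markup contains an add-to-cart button.
public boolean isInStock() {
    // Jsoup's select() returns an empty Elements rather than null, so isEmpty() is enough.
    Elements addToCartElements = doc.select(".addtocart");
    if (addToCartElements.isEmpty()) {
        return false;
    }
    return addToCartElements.toString().contains("add-to-cart-button");
}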

Extracting the colors

public ExecInfo parse(String url, Map<String, String> colorMap) {
    ExecResult<Document> execResult = getOneSkuInfoPage(url);
    if (!execResult.isSucc()) {
        // Without the page there is nothing to parse, so stop instead of only logging.
        LogUtils.info(execResult.getMsg());
        return ExecInfo.fail(execResult.getMsg());
    }
    if (!isInStock()) {
        LogUtils.info("out of stock!");
        return ExecInfo.fail("out of stock!");
    }

    // Currently selected color: the name sits in a <span>, the SKU in parentheses.
    Elements curColorElements = doc.select(".product-color");
    if (curColorElements.isEmpty()) {
        return ExecInfo.fail("Failed to get the color of the current product");
    }
    String curColorHtml = curColorElements.toString();
    Matcher curColorMatcher = Pattern.compile("<span class=\"product-color-clear\">([^<]*)</span>").matcher(curColorHtml);
    Matcher curSkuMatcher = Pattern.compile("\\(([0-9A-Za-z]*)\\)").matcher(curColorHtml);
    if (curColorMatcher.find() && curSkuMatcher.find()) {
        LogUtils.info("CURRENT COLOR: " + curSkuMatcher.group(1) + ", " + curColorMatcher.group(1));
    }

    // Color variants listed on the page: each thumbnail carries the SKU in
    // data-articleno and the color name in title.
    // (An alternative container selector is "#colorVariationsCarousel".)
    Pattern skuPattern = Pattern.compile("data-articleno=\"([0-9A-Za-z]*)");
    Pattern titlePattern = Pattern.compile("title=\"([^\"]*)");
    for (Element row : doc.select(".color-variation-row")) {
        for (Element colorElement : row.select(".color-variations-thumb-color")) {
            String thumbHtml = colorElement.toString();
            Matcher skuMatcher = skuPattern.matcher(thumbHtml);
            Matcher titleMatcher = titlePattern.matcher(thumbHtml);
            if (skuMatcher.find() && titleMatcher.find()) {
                colorMap.put(skuMatcher.group(1), titleMatcher.group(1));
            }
        }
    }
    LogUtils.info(colorMap.toString());
    return ExecInfo.succ();
}
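The regex step in the loop above can be tried in isolation, without Jsoup or a live page. The sketch below applies the same two patterns to the serialized HTML of a single thumbnail element. The class name `ColorThumbParser`, the method `parseThumb`, and the sample markup (including the article number `AB1234` and the color name) are made up for illustration; the real page markup may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ColorThumbParser {
    // Same patterns as in parse(): capture the attribute value up to the closing quote.
    private static final Pattern SKU_PATTERN = Pattern.compile("data-articleno=\"([0-9A-Za-z]*)");
    private static final Pattern TITLE_PATTERN = Pattern.compile("title=\"([^\"]*)");

    // Adds SKU -> color name to colorMap if both attributes are present.
    // Returns true when an entry was added.
    static boolean parseThumb(String elementHtml, Map<String, String> colorMap) {
        Matcher skuMatcher = SKU_PATTERN.matcher(elementHtml);
        Matcher titleMatcher = TITLE_PATTERN.matcher(elementHtml);
        if (skuMatcher.find() && titleMatcher.find()) {
            colorMap.put(skuMatcher.group(1), titleMatcher.group(1));
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical markup for one thumbnail; not copied from the real site.
        String thumb = "<div class=\"color-variations-thumb-color\" "
                + "data-articleno=\"AB1234\" title=\"Core Black\"></div>";
        Map<String, String> colorMap = new LinkedHashMap<>();
        parseThumb(thumb, colorMap);
        System.out.println(colorMap);
    }
}
```

Because the two patterns are matched independently, the order of the attributes in the tag does not matter. When the element is already a Jsoup `Element`, reading `colorElement.attr("data-articleno")` and `colorElement.attr("title")` directly is likely a sturdier alternative to running regexes over the serialized HTML.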