What ways does WebMagic offer for parsing HTML? Learning WebMagic: parsing JSON

Latest recommended article published on 2023-02-18 10:20:49

weixin_39577052


When a crawled URL returned JSON, I tried to parse it like this:

JsonPathSelector json = new JsonPathSelector(page.getRawText());

List name = json.selectList("$.data.itemList[*].brand.name");

List uri = json.selectList("$.data.itemList[*].brand.uri");

But when I stepped into the call with F5 in the debugger, I landed in the source code of this class. What was going on?

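Only after reading us.codecraft.webmagic.selector.JsonPathSelectorTest did I realize that I had passed the wrong arguments.

The constructor JsonPathSelector(String jsonPathStr) takes jsonPathStr, that is, the extraction rule (the JSONPath expression) as a string.

The methods String select(String text) and List<String> selectList(String text) both take text, that is, the JSON string to extract from.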
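The mix-up described above can be fixed by swapping the two arguments. The following is a minimal sketch, assuming webmagic-core (with its json-path dependency) is on the classpath. The class name `JsonPathUsage`, the helper method names, and the sample payload are made up for illustration; only the two JSONPath expressions come from the post.

```java
import java.util.List;

import us.codecraft.webmagic.selector.JsonPathSelector;

class JsonPathUsage {
    // Hypothetical payload shaped to match the post's paths; the values are invented.
    static final String RAW_TEXT =
            "{\"data\":{\"itemList\":["
            + "{\"brand\":{\"name\":\"A\",\"uri\":\"/brand/a\"}},"
            + "{\"brand\":{\"name\":\"B\",\"uri\":\"/brand/b\"}}]}}";

    // The JSONPath rule goes to the constructor; the JSON text goes to selectList.
    static List<String> brandNames(String rawText) {
        return new JsonPathSelector("$.data.itemList[*].brand.name").selectList(rawText);
    }

    static List<String> brandUris(String rawText) {
        return new JsonPathSelector("$.data.itemList[*].brand.uri").selectList(rawText);
    }

    public static void main(String[] args) {
        // Inside a PageProcessor, rawText would come from page.getRawText().
        System.out.println(brandNames(RAW_TEXT));
        System.out.println(brandUris(RAW_TEXT));
    }
}
```

Newer WebMagic releases also appear to offer page.getJson().jsonPath(...), which reads much like html.xpath(...); it is worth checking that your version has it before relying on it.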

package us.codecraft.webmagic.selector;

import org.junit.Test;

import java.util.List;

import static org.assertj.core.api.Assertions.assertThat;

/**
 * @author code4crafter@gmai.com
 */
public class JsonPathSelectorTest {

    private String text = "{ \"store\": {\n" +
            " \"book\": [ \n" +
            " { \"category\": \"reference\",\n" +
            " \"author\": \"Nigel Rees\",\n" +
            " \"title\": \"Sayings of the Century\",\n" +
            " \"price\": 8.95\n" +
            " },\n" +
            " { \"category\": \"fiction\",\n" +
            " \"author\": \"Evelyn Waugh\",\n" +
            " \"title\": \"Sword of Honour\",\n" +
            " \"price\": 12.99,\n" +
            " \"isbn\": \"0-553-21311-3\"\n" +
            " }\n" +
            " ],\n" +
            " \"bicycle\": {\n" +
            " \"color\": \"red\",\n" +
            " \"price\": 19.95\n" +
            " }\n" +
            " }\n" +
            "}";

    @Test
    public void testJsonPath() {
        System.out.println(text);
        JsonPathSelector jsonPathSelector = new JsonPathSelector("$.store.book[*].author");
        String select = jsonPathSelector.select(text);
        List<String> list = jsonPathSelector.selectList(text);
        assertThat(select).isEqualTo("Nigel Rees");
        assertThat(list).contains("Nigel Rees","Evelyn Waugh");
        jsonPathSelector = new JsonPathSelector("$.store.book[?(@.category == 'reference')]");
        list = jsonPathSelector.selectList(text);
        select = jsonPathSelector.select(text);
        System.out.println("select:\t"+select);
        System.out.println("list:\t"+list);
        assertThat(select).isEqualTo("{\"author\":\"Nigel Rees\",\"price\":8.95,\"category\":\"reference\",\"title\":\"Sayings of the Century\"}");
        assertThat(list).contains("{\"author\":\"Nigel Rees\",\"price\":8.95,\"category\":\"reference\",\"title\":\"Sayings of the Century\"}");
    }

}

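I don't think this design is ideal. Within a single page the JSON string stays the same while the extraction rules differ. If every rule requires a new JsonPathSelector, how many objects end up being created? It is also a different way of writing extraction rules compared with the following style: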

String brand_price = html.xpath("//span[@id=\"item-sellprice\"]/text()").toString();

String brand_img = html.xpath("//img[@id=\"brand-img\"]/@src").toString();

String brand_describe = html.xpath("//p[@id=\"brand-describe\"]/text()").toString();

String location_text = html.xpath("//span[@id=\"location-text\"]/text()").toString();
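On the object-creation worry: a selector whose constructor takes only the rule does not need to be rebuilt per page, so one instance per rule can be created up front and reused. The sketch below shows that pattern with a hypothetical `RuleSelector` stand-in built on `java.util.regex`, so it runs without WebMagic. The class, the field names, and the regex rules are all assumptions for illustration, and regex is used only as a stand-in rule language, not as a recommended way to parse real JSON.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReusableRules {
    /** Hypothetical stand-in for a selector: rule at construction, text per call. */
    static final class RuleSelector {
        private final Pattern rule;

        RuleSelector(String ruleStr) {
            // The rule is compiled once, when the selector is created.
            this.rule = Pattern.compile(ruleStr);
        }

        List<String> selectList(String text) {
            // Collect capture group 1 of every match found in the given text.
            List<String> out = new ArrayList<>();
            Matcher m = rule.matcher(text);
            while (m.find()) {
                out.add(m.group(1));
            }
            return out;
        }
    }

    // One selector per rule, created once and shared by every page.
    static final RuleSelector BRAND_NAME = new RuleSelector("\"name\"\\s*:\\s*\"([^\"]*)\"");
    static final RuleSelector BRAND_URI = new RuleSelector("\"uri\"\\s*:\\s*\"([^\"]*)\"");

    public static void main(String[] args) {
        // Two invented page bodies; no new selector is built for either of them.
        String page1 = "{\"brand\":{\"name\":\"A\",\"uri\":\"/a\"}}";
        String page2 = "{\"brand\":{\"name\":\"B\",\"uri\":\"/b\"}}";
        System.out.println(BRAND_NAME.selectList(page1) + " " + BRAND_URI.selectList(page1));
        System.out.println(BRAND_NAME.selectList(page2) + " " + BRAND_URI.selectList(page2));
    }
}
```

With WebMagic itself, the analogous move would be to keep each JsonPathSelector in a static final field of the processor and pass page.getRawText() to selectList for every page, assuming the selector keeps no per-call state.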

I doubt I'm the only one who has run into this. Heh.
