A rough mental model of a crawler:
Given a URL ------- fetch the page source at that URL ------- 1. filter it by rules to extract the data you need; 2. extract further URLs from the page and repeat the cycle to keep crawling.
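The generic crawl cycle above can be sketched as a breadth-first loop over a queue of URLs. In this sketch the network is faked with a `Map` from URL to the links found on that page (a stand-in for fetching and parsing), so the flow can run standalone; `CrawlLoopSketch` and the fake link graph are illustrative names, not part of the post's code:

```java
import java.util.*;

/** Sketch of the generic crawl loop: visit a URL, extract links, enqueue unseen ones. */
public class CrawlLoopSketch {
    public static List<String> crawl(String seed, Map<String, List<String>> linkGraph) {
        List<String> visited = new ArrayList<>();   // pages we have "crawled", in order
        Set<String> seen = new HashSet<>();         // guards against crawling a URL twice
        Deque<String> queue = new ArrayDeque<>();
        queue.add(seed);
        seen.add(seed);
        while (!queue.isEmpty()) {
            String url = queue.poll();              // take the next URL
            visited.add(url);                       // here a real crawler would fetch and parse
            for (String next : linkGraph.getOrDefault(url, List.of())) {
                if (seen.add(next)) {               // enqueue only URLs not seen before
                    queue.add(next);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("c"));
        System.out.println(crawl("a", graph)); // visits each page exactly once
    }
}
```

A real crawler would replace the map lookup with an HTTP fetch and an HTML parse, and would usually cap depth or page count.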
This article implements a simple Java crawler. The looping step is left out for now; we only crawl data from one specified page.
Target page: http://www.dianping.com/shanghai/ch70/g193, a Dianping shop recommendation page. The goal is to crawl each shop's basic information.
The test method that drives the crawl:
@Test
public void getDatasByClass()
{
    // Crawl the target page, locating shop entries by the "shop-list" class
    Rule rule = new Rule(
            "http://www.dianping.com/shanghai/ch70/g193",
            "shop-list", Rule.CLASS, Rule.POST);
    List<BabyShopData> babyShopData = DpService.getShopInfo(rule);
    printf(babyShopData);
}

public void printf(List<BabyShopData> datas)
{
    for (BabyShopData data : datas)
    {
        System.out.println(String.format("Shop name: %s", data.getShopname()));
        System.out.println(String.format("Shop link: %s", data.getShoplink()));
        System.out.println(String.format("Location: %s", data.getLocation()));
        System.out.println(String.format("Star rating: %s", data.getShoplevel()));
        System.out.println(String.format("Average cost: %s", data.getAveragecost()));
        System.out.println("***********************************");
    }
}
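The `Rule` class itself is not shown in the post. A minimal sketch consistent with how it is used above might look like the following; the field names and constant values are assumptions inferred from the call sites:

```java
/** Sketch of a crawl rule; constant values and field names are assumed. */
public class Rule {
    // How to locate elements in the parsed document
    public static final int CLASS = 0;      // by class name
    public static final int ID = 1;         // by element id
    public static final int SELECTION = 2;  // by CSS selector

    // How to request the page
    public static final int GET = 10;
    public static final int POST = 11;

    private final String url;            // page to crawl
    private final String resultTagName;  // class name / id / selector to match
    private final int type;              // one of CLASS, ID, SELECTION
    private final int requestMethod;     // one of GET, POST

    public Rule(String url, String resultTagName, int type, int requestMethod) {
        this.url = url;
        this.resultTagName = resultTagName;
        this.type = type;
        this.requestMethod = requestMethod;
    }

    public String getUrl() { return url; }
    public String getResultTagName() { return resultTagName; }
    public int getType() { return type; }
    public int getRequestMethod() { return requestMethod; }
}
```

Keeping the locator type and the request method as separate constant groups avoids the ambiguity of reusing one constant for both purposes.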
The service that actually crawls the data:
public class DpService {
    private static final Logger logger = LoggerFactory.getLogger(DpService.class);

    public static List<BabyShopData> getShopInfo(Rule rule) {
        validateRule(rule);
        Connection connection = Jsoup.connect(rule.getUrl());
        Document document = null;
        Elements elements = null;
        List<BabyShopData> babyShopData = new ArrayList<BabyShopData>();
        try {
            // Fetch the page with the configured request method
            // (the original code matched on Rule.CLASS here, which is a locator
            // constant, not a request method)
            switch (rule.getRequestMethod()) {
                case Rule.GET:
                    document = connection.timeout(1000).get();
                    break;
                case Rule.POST:
                    document = connection.timeout(1000).post();
                    break;
            }
            // Locate the container elements to extract from
            switch (rule.getType()) {
                case Rule.CLASS:
                    elements = document.getElementsByClass(rule.getResultTagName());
                    break;
                case Rule.ID:
                    // wrap the single element; elements was still null here in the
                    // original code, which would throw a NullPointerException
                    elements = new Elements();
                    Element element = document.getElementById(rule.getResultTagName());
                    elements.add(element);
                    break;
                case Rule.SELECTION:
                    elements = document.select(rule.getResultTagName());
                    break;
                default:
                    // fall back to the body tag when resultTagName is empty
                    if (TextUtil.isEmpty(rule.getResultTagName())) {
                        elements = document.getElementsByTag("body");
                    }
            }
            // Each <li> under the matched container is one shop entry
            for (Element element : elements) {
                Elements items = element.getElementsByTag("li");
                for (Element item : items) {
                    BabyShopData shop = new BabyShopData();
                    shop.setShopname(item.child(0).attr("title"));
                    // strip the leading "//" of the protocol-relative link
                    shop.setShoplink(item.child(0).attr("href").substring(2));
                    shop.setShoplevel(item.getElementsByClass("item-rank-rst").get(0).attr("title"));
                    shop.setLocation(item.getElementsByClass("key-list").get(0).text());
                    shop.setAveragecost(item.getElementsByClass("price").get(0).text());
                    babyShopData.add(shop);
                }
            }
        } catch (IOException e) {
            logger.error("connection get error", e);
        }
        return babyShopData;
    }

    private static void validateRule(Rule rule) {
        String url = rule.getUrl();
        if (TextUtil.isEmpty(url)) {
            throw new RuleException2("url must not be empty!");
        }
        if (!url.startsWith("http://")) {
            throw new RuleException2("url format is invalid!");
        }
    }
}
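`BabyShopData` is a plain data holder whose definition the post does not show. Judging from the getters and setters used above, it is roughly the following; the field names are inferred from the accessor names:

```java
/** Sketch of the BabyShopData holder, inferred from the accessors used above. */
public class BabyShopData {
    private String shopname;    // shop name, taken from the link's title attribute
    private String shoplink;    // shop detail-page link
    private String location;    // shop location text
    private String shoplevel;   // star rating text
    private String averagecost; // average cost text

    public String getShopname() { return shopname; }
    public void setShopname(String shopname) { this.shopname = shopname; }
    public String getShoplink() { return shoplink; }
    public void setShoplink(String shoplink) { this.shoplink = shoplink; }
    public String getLocation() { return location; }
    public void setLocation(String location) { this.location = location; }
    public String getShoplevel() { return shoplevel; }
    public void setShoplevel(String shoplevel) { this.shoplevel = shoplevel; }
    public String getAveragecost() { return averagecost; }
    public void setAveragecost(String averagecost) { this.averagecost = averagecost; }
}
```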
The results of the crawl:
The crawler relies on the Jsoup library; see http://www.open-open.com/jsoup/ for the API reference.
The complete code can be downloaded from https://download.csdn.net/download/xiao_dondon/10604025.