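Filtering Tmall Supermarket coupon items (no crawler involved)
Why no crawler? Tmall Supermarket coupons only apply to a designated list of items, and that list fits on a single page. So this is really just a refresher on extracting data with regular expressions.
Screenshot:
These coupons are basically useless, because the eligible items have all had their prices raised beforehand.
However, Tmall is handing out some shopping vouchers for 618 (not the coupons in the screenshot above), and I want to buy some milk. So I wanted to see whether these marked-up items come out any cheaper once the Tmall shopping vouchers are stacked on top.
Screenshot of the shopping vouchers:
Here is the problem: there are far too many eligible items, more than 1,000. Just Ctrl+F for 牛奶 (milk)?
See the "..."? Some titles are too long and get cut off. If 牛奶 falls in the part hidden behind the "...", there is no way to tell where it is. The fix is to turn off overflow:hidden in the CSS.
The page now looks like this. OK, now the full titles can be searched.
But there are still too many items, and hunting through them one by one is tedious, so I decided to pull the whole list down and get some practice along the way.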
1. Download the HTML source of the page.
2. Study the structure of the HTML and extract the data with regular expressions.
① Look at the page:
<div class="mui-chaoshi-item mui-chaoshi-item-column columnCount-5" data-tag="item" data-itemid="567987627569">
<a class="mui-chaoshi-item-column-inner" href="//detail.tmall.com/item.htm?id=567987627569" target="_blank" data-itemid="567987627569">
<div class="img-wrapper"><img class="item-img " src="//img.alicdn.com/bao/uploaded/i2/725677994/TB12vbCmStYBeNjSspaXXaOOFXa_!!0-item_pic.jpg_190x190Q50s50.jpg_.webp" alt=""> <img class="soldout-mark" src="//img.alicdn.com/tps/i2/TB1BYYIHpXXXXcEXXXXZ6GBKFXX-150-150.png" style="display:none"></div>
<div class="item-main">
<div class="item-info">
<div class="item-title">Sagacity/尚贤火鸡脆饼178g*2罐(特辣+中辣)网红饼干</div>
</div>
<div class="item-imp">
<div class="imp-main">
<div class="item-price"> <b class="promotion-price"><span class="mui-price normal red"><b class="mui-price-rmb">¥</b><span class="mui-price-integer">19</span><span class="mui-price-decimal">.9</span></span>
</b>
</div>
</div> <button class="cart j_AddCart" data-itemid="567987627569" data-pic="//img.alicdn.com/bao/uploaded/i2/725677994/TB12vbCmStYBeNjSspaXXaOOFXa_!!0-item_pic.jpg" data-stardandtype="" data-token=""></button> </div>
</div>
</a>
</div>
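Each item is a div. The attributes we need are itemid (item ID), itemTitle (item title), and itemPrice (price).
Note that the integer and fractional parts of the price are stored separately, and a whole-number price such as ¥20.00 has no fractional span at all. (The integer part is in <span class="mui-price-integer"> and the fractional part in <span class="mui-price-decimal">.)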
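To make the split price concrete, here is a minimal sketch of joining the two parts back into one number. The class name PriceParts and the method parsePrice are made up for illustration, and it assumes the fractional span always starts with a dot, as in the ".9" sample above.

```java
import java.math.BigDecimal;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original code.
final class PriceParts {
    private static final Pattern INTEGER =
            Pattern.compile("<span class=\"mui-price-integer\">(\\d+)</span>");
    // Assumption: the fractional text includes its leading dot, e.g. ".9".
    private static final Pattern DECIMAL =
            Pattern.compile("<span class=\"mui-price-decimal\">(\\.\\d+)</span>");

    /** Returns the price found in one item's HTML, or null if there is no integer part. */
    static BigDecimal parsePrice(String itemHtml) {
        Matcher integer = INTEGER.matcher(itemHtml);
        if (!integer.find()) {
            return null;
        }
        Matcher decimal = DECIMAL.matcher(itemHtml);
        // A whole-number price simply has no fractional span.
        String fraction = decimal.find() ? decimal.group(1) : "";
        return new BigDecimal(integer.group(1) + fraction);
    }

    public static void main(String[] args) {
        String withDecimal = "<span class=\"mui-price-integer\">19</span>"
                + "<span class=\"mui-price-decimal\">.9</span>";
        String wholeOnly = "<span class=\"mui-price-integer\">69</span></span>";
        System.out.println(parsePrice(withDecimal)); // 19.9
        System.out.println(parsePrice(wholeOnly));   // 69
    }
}
```

BigDecimal is used here rather than double so that a price like 19.9 is kept exactly as written on the page.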
Now look at the next div (this one is a sold-out item, marked soldout) to compare an item on sale with one that is sold out:
<div class="mui-chaoshi-item mui-chaoshi-item-column columnCount-5 soldout" data-tag="item" data-itemid="521997816724">
<a class="mui-chaoshi-item-column-inner" href="//detail.tmall.com/item.htm?id=521997816724" target="_blank" data-itemid="521997816724">
<div class="img-wrapper"><img class="item-img " data-ks-lazyload="//img.alicdn.com/bao/uploaded/i2/725677994/TB1IeCbq_dYBeNkSmLyXXXfnVXa_!!0-item_pic.jpg" src="//g.alicdn.com/s.gif" alt=""> <img class="soldout-mark" src="//img.alicdn.com/tps/i2/TB1BYYIHpXXXXcEXXXXZ6GBKFXX-150-150.png" style="display:none"></div>
<div class="item-main">
<div class="item-info">
<div class="item-title">GuyLian吉利莲比利时进口金贝壳夹心巧克力礼盒装送女友生日礼物</div>
</div>
<div class="item-imp">
<div class="imp-main">
<div class="item-price"> <b class="promotion-price"><span class="mui-price normal red"><b class="mui-price-rmb">¥</b><span class="mui-price-integer">69</span></span>
</b>
</div>
</div> <button class="cart j_AddCart" data-itemid="521997816724" data-pic="//img.alicdn.com/bao/uploaded/i2/725677994/TB1IeCbq_dYBeNkSmLyXXXfnVXa_!!0-item_pic.jpg" data-stardandtype="" data-token=""></button> </div>
</div>
</a>
</div>
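The only difference in the outer div is the extra soldout class. As a rough sketch of how that could be picked up, the pattern below captures the optional class together with the item ID. SoldOutSketch and soldOutById are hypothetical names, and the pattern assumes every item uses the columnCount-5 class seen in the two samples.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original code.
final class SoldOutSketch {
    // Group 1 is the optional " soldout" class, group 2 is the item ID.
    private static final Pattern ITEM_START = Pattern.compile(
            "<div class=\"mui-chaoshi-item mui-chaoshi-item-column columnCount-5( soldout)?\""
            + " data-tag=\"item\" data-itemid=\"(\\d+)\"");

    /** Maps each item ID to true when its outer div carries the soldout class. */
    static Map<String, Boolean> soldOutById(String html) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        Matcher m = ITEM_START.matcher(html);
        while (m.find()) {
            // An optional group that did not take part in the match is null.
            result.put(m.group(2), m.group(1) != null);
        }
        return result;
    }

    public static void main(String[] args) {
        String html =
                "<div class=\"mui-chaoshi-item mui-chaoshi-item-column columnCount-5\""
                + " data-tag=\"item\" data-itemid=\"567987627569\">"
                + "<div class=\"mui-chaoshi-item mui-chaoshi-item-column columnCount-5 soldout\""
                + " data-tag=\"item\" data-itemid=\"521997816724\">";
        System.out.println(soldOutById(html)); // {567987627569=false, 521997816724=true}
    }
}
```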
② Process the HTML with Java
a. Read the file as a character stream
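// Imports needed at the top of the file for this method
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

/**
 * Reads the HTML page into a single string.
 * Line breaks are dropped, which lets ".*?" in the regexes match across
 * what were separate lines in the file.
 * @param path path of the saved HTML file
 * @return the page content (whatever was read so far if an I/O error occurs)
 */
public StringBuilder getHtml(String path) {
    File file = new File(path);
    StringBuilder old = new StringBuilder();
    // try-with-resources closes the reader even when reading fails, and avoids
    // the NullPointerException that br.close() throws if the file cannot be opened
    try (BufferedReader br = new BufferedReader(
            new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
        String c;
        while ((c = br.readLine()) != null) {
            old.append(c);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return old;
}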
b. Process each div with regular expressions
// Needs java.util.ArrayList, java.util.List, java.util.regex.Matcher, java.util.regex.Pattern
/**
 * Extracts the useful attributes from each chunk of old[] into a List<String[]>.
 * @param old the chunks produced by splitting the page, one per item
 * @return one array per item: [0] item ID, [1] title,
 *         [2] integer part of the price, [3] fractional part of the price
 */
public List<String[]> fetchAttribute(String[] old) {
    List<String[]> list = new ArrayList<>();
    String angle = "<.*?>"; // matches any tag; used to strip the markup from a match
    String idPrefix = "target=\"_blank\" data-itemid=\"";
    Pattern patternId = Pattern.compile(idPrefix + "\\d+"); // item ID
    Pattern patternTitle = Pattern.compile("<div class=\"item-title\">.*?</div>"); // item title
    Pattern patternInteger = Pattern.compile("<span class=\"mui-price-integer\">.*?</span>"); // integer part of the price
    Pattern patternDecimal = Pattern.compile("<span class=\"mui-price-decimal\">.*?</span>"); // fractional part of the price
    for (String oldNow : old) {
        Matcher matchId = patternId.matcher(oldNow);
        if (!matchId.find()) {
            continue; // not an item chunk, e.g. the part of the page before the first item
        }
        String[] attribute = new String[4];
        attribute[0] = matchId.group().replace(idPrefix, "");
        Matcher matchTitle = patternTitle.matcher(oldNow);
        attribute[1] = matchTitle.find() ? matchTitle.group().replaceAll(angle, "") : "";
        Matcher matchInteger = patternInteger.matcher(oldNow);
        attribute[2] = matchInteger.find() ? matchInteger.group().replaceAll(angle, "") : "";
        // Whole-number prices have no fractional span, so this stays empty for them
        Matcher matchDecimal = patternDecimal.matcher(oldNow);
        attribute[3] = matchDecimal.find() ? matchDecimal.group().replaceAll(angle, "") : "";
        list.add(attribute);
    }
    return list;
}
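The method above matches the whole tag and then strips the markup with replaceAll. An alternative, shown here only as a sketch, is to let a capturing group pick out the value directly. CaptureGroupSketch and firstGroup are made-up names, and the "Milk 250ml" title is an invented sample rather than a real item from the page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical variant, not the original method.
final class CaptureGroupSketch {
    // The parentheses mark the part of the match we actually want.
    static final Pattern ID =
            Pattern.compile("target=\"_blank\" data-itemid=\"(\\d+)\"");
    static final Pattern TITLE =
            Pattern.compile("<div class=\"item-title\">(.*?)</div>");

    /** Returns group 1 of the first match, or an empty string if nothing matches. */
    static String firstGroup(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String chunk = "<a class=\"mui-chaoshi-item-column-inner\""
                + " href=\"//detail.tmall.com/item.htm?id=567987627569\""
                + " target=\"_blank\" data-itemid=\"567987627569\">"
                + "<div class=\"item-title\">Milk 250ml</div>"; // invented title
        System.out.println(firstGroup(ID, chunk));    // 567987627569
        System.out.println(firstGroup(TITLE, chunk)); // Milk 250ml
    }
}
```

With a group there is no second pass over the matched text, and a title that happens to contain angle brackets is not mangled by the tag-stripping regex.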
c. Save the result as a CSV file; no database this time
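// Needs java.io.BufferedWriter, java.io.FileOutputStream, java.io.IOException,
// java.io.OutputStreamWriter, java.nio.charset.StandardCharsets, java.util.List
/**
 * Saves the extracted attributes as a CSV file.
 * @param path path of the CSV file to write
 * @param attribute one array per item: ID, title, integer part and fractional part of the price
 */
public void saveInCsv(String path, List<String[]> attribute) {
    // Write UTF-8 explicitly, matching how the page was read; FileWriter relies on
    // the default charset, which is not UTF-8 on every setup.
    // try-with-resources flushes and closes the writer, even on failure.
    try (BufferedWriter bw = new BufferedWriter(
            new OutputStreamWriter(new FileOutputStream(path), StandardCharsets.UTF_8))) {
        bw.write("Item ID,Title,Price");
        bw.newLine();
        for (String[] temp : attribute) {
            // The price is the integer part followed by the (possibly empty) fractional part
            bw.write(temp[0] + "," + temp[1] + "," + temp[2] + temp[3]);
            bw.newLine();
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}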
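One caveat: the writer above joins fields with plain commas, so a title that itself contains a comma would shift the columns. A minimal sketch of the usual CSV quoting convention follows. CsvField and quote are hypothetical names, and the sample titles are invented.

```java
// Hypothetical helper, not part of the original code.
final class CsvField {
    /** Wraps a field in double quotes when it contains a comma, quote, or line break. */
    static String quote(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            // Inside a quoted field, a double quote is written twice.
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    public static void main(String[] args) {
        System.out.println(quote("Milk 250ml"));       // Milk 250ml
        System.out.println(quote("Milk, 250ml x 12")); // "Milk, 250ml x 12"
    }
}
```

In saveInCsv this would be applied to the title as quote(temp[1]) before it is concatenated into the line.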
The main function:
String path = "coupon.html";
FetchFromHtml ffh = new FetchFromHtml();
String old=ffh.getHtml(path).toString();
String splitRegex="<div class=\"mui-chaoshi-item mui-chaoshi-item-column columnCount-5( soldout)?\" data-tag=\"item\"";
String[] spliters=old.split(splitRegex);
List<String[]> list2=ffh.fetchAttribute(spliters);
System.out.println(list2.size());
ffh.saveInCsv("out.csv", list2);
1,305 items in total.
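With the list in hand, the original question (which of these items are milk?) comes down to a title filter. Here is a rough sketch, assuming rows shaped like the String[4] arrays returned by fetchAttribute. KeywordFilter and filterByTitle are made-up names, and the two rows in main are invented sample data, not items from the page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original code.
final class KeywordFilter {
    /** Keeps the rows whose title (index 1) contains the keyword. */
    static List<String[]> filterByTitle(List<String[]> rows, String keyword) {
        List<String[]> hits = new ArrayList<>();
        for (String[] row : rows) {
            if (row[1].contains(keyword)) {
                hits.add(row);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Invented rows in the layout: ID, title, integer part, fractional part.
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"1001", "Whole milk 250ml x 12", "39", ".9"});
        rows.add(new String[] {"1002", "Spicy crackers 178g", "19", ".9"});
        for (String[] row : filterByTitle(rows, "milk")) {
            System.out.println(row[0] + " " + row[1] + " " + row[2] + row[3]);
        }
    }
}
```

On the real data the call would be filterByTitle(list2, "牛奶"), and the result can be passed straight to saveInCsv.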