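Some small processing helpers, with Java code
1. Strip tags, line breaks, and other special characters from HTML text and convert it to plain text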
String claimspath_result = claimspath.replaceAll("\\&[a-zA-Z]{1,10};", "") // remove HTML entities such as &lt; and &gt;
        .replaceAll("<[a-zA-Z]+[1-9]?[^><]*>", "") // remove opening tags and tags that have no closing tag
        .replaceAll("</[a-zA-Z]+[1-9]?>", ""); // remove closing tags
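The same three replacements can be wrapped in a reusable method. The following is only a sketch: the class name HtmlText and the method toPlainText are made up for illustration, and the final whitespace step (collapsing line breaks and runs of spaces into one space) is an added assumption about what "line breaks" in the heading refers to, since the chain above does not touch them.

```java
import java.util.regex.Pattern;

class HtmlText {
    // Same patterns as the replaceAll chain, compiled once for reuse
    private static final Pattern ENTITY = Pattern.compile("&[a-zA-Z]{1,10};");
    private static final Pattern OPEN_TAG = Pattern.compile("<[a-zA-Z]+[1-9]?[^><]*>");
    private static final Pattern CLOSE_TAG = Pattern.compile("</[a-zA-Z]+[1-9]?>");
    // Assumption: line breaks and repeated spaces should become a single space
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String toPlainText(String html) {
        String text = ENTITY.matcher(html).replaceAll("");   // drop named entities
        text = OPEN_TAG.matcher(text).replaceAll("");        // drop opening / self-closing tags
        text = CLOSE_TAG.matcher(text).replaceAll("");       // drop closing tags
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(toPlainText("<p>Claim 1:&nbsp;\n<b>a device</b></p>"));
    }
}
```

Note that entities are deleted rather than decoded, so text such as "a&nbsp;b" becomes "ab"; if that matters, replacing entities with a space first may be a better fit.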
2. Compute the difference between a given date and the current date
LocalDateTime now = LocalDateTime.now(); // current date and time
int year = now.getYear();
int month = now.getMonthValue();
int day = now.getDayOfMonth();
String appDate = resultKey.getString("appDate"); // date string with year, month, and day separated by dots
String[] dateList = appDate.split("\\."); // split on the literal dot
int year1 = Integer.parseInt(dateList[0]);
int month1 = Integer.parseInt(dateList[1]);
int day1 = Integer.parseInt(dateList[2]);
LocalDate startDate = LocalDate.of(year1, month1, day1);
LocalDate endDate = LocalDate.of(year, month, day);
double days = startDate.until(endDate, ChronoUnit.DAYS); // whole days from startDate to endDate; until() returns a long, widened to double here
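A shorter variant of the same calculation, as a sketch: it lets DateTimeFormatter do the parsing and ChronoUnit.DAYS.between do the counting. The class name DateDiff, the method daysBetween, and the "uuuu.M.d" pattern (year.month.day, with or without zero padding) are assumptions made for illustration; the current date is passed in as a parameter so the method is easy to test.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

class DateDiff {
    // Assumed input pattern: year.month.day, e.g. 2024.01.05 or 2024.1.5
    private static final DateTimeFormatter APP_DATE = DateTimeFormatter.ofPattern("uuuu.M.d");

    // Whole days from appDate to today; negative if appDate is in the future
    static long daysBetween(String appDate, LocalDate today) {
        LocalDate start = LocalDate.parse(appDate, APP_DATE);
        return ChronoUnit.DAYS.between(start, today);
    }

    public static void main(String[] args) {
        System.out.println(daysBetween("2024.01.01", LocalDate.now()));
    }
}
```

LocalDate.parse throws a DateTimeParseException on malformed input, which replaces the ArrayIndexOutOfBoundsException or NumberFormatException that the split-based version would raise.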
3. Compute article similarity (Hamming distance)
Based on the blog post "Hamming distance": https://blog.csdn.net/sinat_37239798/article/details/122893346
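The linked post is not reproduced here. As a rough sketch of one common way to use Hamming distance for text similarity, the code below builds a 64-bit SimHash fingerprint per text and counts the differing bits; this may differ from what the linked post does. The class name SimHashSketch, the FNV-1a token hash, and whitespace tokenization are all assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

class SimHashSketch {
    // 64-bit FNV-1a hash of one token (the hash function is an assumption)
    static long fnv1a64(String token) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    // SimHash: every token votes +1 or -1 on each of the 64 bit positions
    static long simHash(String[] tokens) {
        int[] votes = new int[64];
        for (String token : tokens) {
            long h = fnv1a64(token);
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                fingerprint |= 1L << bit;
            }
        }
        return fingerprint;
    }

    // Hamming distance: number of bit positions where the two fingerprints differ
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        // Assumption: tokens are obtained by splitting on whitespace
        long f1 = simHash("the quick brown fox jumps over the lazy dog".split("\\s+"));
        long f2 = simHash("the quick brown fox jumped over the lazy dog".split("\\s+"));
        System.out.println(hammingDistance(f1, f2));
    }
}
```

A smaller distance means more similar texts; the cut-off below which two articles count as near-duplicates has to be tuned on real data. Chinese text would need a word segmenter instead of whitespace splitting.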
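4. Extract the content of li tags in Java (regular expressions + Jsoup)
(1) Regular expressions can handle simple tags, mainly through Pattern and Matcher, to get the data inside li tags. This approach is not recommended, though; it is better to switch everything to Jsoup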
String text = "the fetched HTML";
String regex = "<li>(.*?)</li>"; // regular expression; matches only bare <li> tags without attributes
List<String> liListNews = getContentByRegex(text, regex, 1); // the extracted contents
public static List<String> getContentByRegex(String html, String regex, int index) {
    List<String> list = new ArrayList<>(); // start with an empty list
    Pattern pattern = Pattern.compile(regex, Pattern.DOTALL); // DOTALL lets . match line breaks as well
    Matcher match = pattern.matcher(html);
    while (match.find()) {
        list.add(match.group(index)); // capture group number `index` of each match
    }
    return list;
}
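Real pages often carry attributes on the tag (for example a class) or use uppercase tag names, which the pattern "<li>(.*?)</li>" does not match. A slightly more tolerant sketch is shown below; the class name LiRegex and the method extractLiContents are hypothetical, and nested lists are still not handled correctly, which is one reason a parser is the safer choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiRegex {
    // <li ...> with optional attributes; \b keeps tags such as <link> from matching
    private static final Pattern LI = Pattern.compile(
            "<li\\b[^>]*>(.*?)</li>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    static List<String> extractLiContents(String html) {
        List<String> contents = new ArrayList<>();
        Matcher matcher = LI.matcher(html);
        while (matcher.find()) {
            contents.add(matcher.group(1)); // inner content of one <li>
        }
        return contents;
    }

    public static void main(String[] args) {
        String html = "<ul><li class=\"a\">one</li>\n<LI>two</LI></ul>";
        System.out.println(extractLiContents(html));
    }
}
```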
(2) For complex markup, use Jsoup; here it extracts the a and span tags
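String title = "";
String href = "";
String date = "";
String domNodeObj = "the fetched HTML";
Document doc = Jsoup.parse(domNodeObj);
Elements links_href = doc.select("a[href]"); // select all <a> tags that have an href attribute
Elements links_a = doc.select("a"); // select all <a> tags
Elements links_span = doc.select("span"); // select all <span> tags
// Each loop overwrites its variable, so only the last matching element is kept
for (Element link_href : links_href) {
    // absolute URL; empty for relative links when the document was parsed without a base URI
    href = link_href.attr("abs:href");
}
for (Element link_a : links_a) {
    // text content of the <a> tag
    title = link_a.text();
}
for (Element link_span : links_span) {
    // text content of the <span> tag
    date = link_span.text();
}
// After extraction, inspect title in the debugger; the line below may be needed,
// presumably when the title text ends with the date plus one separator character
//title = title.substring(0,title.length()-date.length()-1);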
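The commented-out substring call throws or cuts the wrong characters when the title does not actually end with the date. A guarded version might look like the sketch below; the class name TitleCleaner and the method stripTrailingDate are hypothetical, and unlike the original line it trims any whitespace before the date rather than exactly one character.

```java
class TitleCleaner {
    // Removes `date` from the end of `title`, together with surrounding whitespace.
    // Returns `title` unchanged when it does not end with `date`.
    static String stripTrailingDate(String title, String date) {
        if (date.isEmpty() || !title.endsWith(date)) {
            return title; // nothing to strip
        }
        return title.substring(0, title.length() - date.length()).trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTrailingDate("Release notes 2024-01-05", "2024-01-05"));
    }
}
```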