Python HTML tree similarity: HTML Similarity, comparing HTML similarity using structural and style metrics

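HTML Similarity is a Python package for measuring the similarity of web pages. It computes structural similarity by comparing HTML tag sequences and style similarity by comparing CSS classes. The recommended joint similarity formula is k * structural similarity + (1 - k) * style similarity, where k=0.3 gives good results. The examples show how to use the package to compute similarity.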

HTML Similarity

This package provides a set of functions to measure the similarity between web pages.

Install

The quick way:

pip install html-similarity

How does it work?

Structural Similarity

Uses sequence comparison of the HTML tags to compute the similarity.

We do not implement similarity based on tree edit distance because it is slower than sequence comparison.
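The idea of comparing tag sequences can be sketched with the standard library alone. The block below is a minimal illustration and not the package's implementation: it assumes Python's html.parser for tag extraction and difflib.SequenceMatcher for the comparison, and the names TagCollector, tag_sequence, and sketch_structural_similarity are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records opening tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names found in an HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def sketch_structural_similarity(html_a, html_b):
    """Compare two documents by their tag sequences (assumed approach)."""
    matcher = SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b))
    return matcher.ratio()  # value between 0 and 1


doc_a = "<div><p>one</p><p>two</p></div>"
doc_b = "<div><p>one</p></div>"
print(sketch_structural_similarity(doc_a, doc_b))  # 0.8
```

Only the order of tags matters here; text content and attributes are ignored, which is what makes a flat sequence comparison cheaper than a tree edit distance.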

Style Similarity

Extracts the CSS classes of each HTML document and calculates the Jaccard similarity of the two sets of classes.
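The Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The following is a rough sketch of that calculation on CSS classes, using only the standard library; it is not the package's code. ClassCollector, css_classes, and sketch_style_similarity are hypothetical names, and returning 1.0 when neither document has any class is an assumption made for this sketch.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Hypothetical helper: gathers every CSS class name used in a document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                self.classes.update(value.split())


def css_classes(html):
    """Return the set of CSS class names found in an HTML string."""
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


def sketch_style_similarity(html_a, html_b):
    """Jaccard similarity of the two class sets: |A & B| / |A | B|."""
    a, b = css_classes(html_a), css_classes(html_b)
    union = a | b
    if not union:
        return 1.0  # assumption: two documents without classes count as identical
    return len(a & b) / len(union)


page_a = '<div class="menu active"></div>'
page_b = '<p class="active footer"></p>'
print(sketch_style_similarity(page_a, page_b))  # one shared class out of three
```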

Joint Similarity (Structural Similarity and Style Similarity)

The joint similarity metric is calculated as:

k * structural_similarity(document_1, document_2) + (1 - k) * style_similarity(document_1, document_2)

All the similarity metrics take values between 0 and 1.
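As a worked illustration of the formula, the sketch below combines two already computed scores. The function name sketch_joint_similarity and the range check on k are assumptions for this example, not part of the package. The input values are the ones shown in the Examples section; the joint score reported there (0.9545454545454546) is consistent with k = 0.5.

```python
def sketch_joint_similarity(structural, style, k=0.3):
    """Weighted combination: k * structural + (1 - k) * style."""
    if not 0.0 <= k <= 1.0:
        # Assumption: k is restricted to [0, 1] so the result stays in [0, 1].
        raise ValueError("k must be between 0 and 1")
    return k * structural + (1 - k) * style


structural = 0.9090909090909091  # structural score from the Examples section
style = 1.0                      # style score from the Examples section

# Equal weighting reproduces the joint score shown in the Examples section.
print(sketch_joint_similarity(structural, style, k=0.5))

# With k=0.3 the style score carries 70% of the weight.
print(sketch_joint_similarity(structural, style, k=0.3))
```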

Recommendations for joint similarity

Using k=0.3 gives us better results. The style similarity carries more information about the overall similarity than the structural similarity does.

Examples

Here is an example:

In [1]: html_1 = '''

First Document

'''

In [2]: html_2 = '''

Second document Document

'''

In [3]: from html_similarity import style_similarity, structural_similarity, similarity

In [4]: style_similarity(html_1, html_2)

Out[4]: 1.0

In [7]: structural_similarity(html_1, html_2)

Out[7]: 0.9090909090909091

In [8]: similarity(html_1, html_2)

Out[8]: 0.9545454545454546
