Python HTML tree similarity: HTML Similarity, comparing HTML similarity using structural and style metrics

Latest recommended article published on 2024-06-07 09:54:35

weixin_39817391

Tags: Python HTML tree similarity

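Copyright notice: This is an original article by the blogger, licensed under CC 4.0 BY-SA. Reposts must include a link to the original source and this notice.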

Article link: https://blog.csdn.net/weixin_39817391/article/details/111446886

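HTML Similarity is a Python package for measuring the similarity of web pages. It computes structural similarity by comparing HTML tag sequences and style similarity by comparing CSS classes. The recommended joint similarity is k * structural similarity + (1 - k) * style similarity, where k=0.3 gives good results. An example shows how to compute the similarities with the package.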

Summary generated automatically by CSDN.

HTML Similarity

This package provides a set of functions to measure the similarity between web pages.

Install

The quick way:

pip install html-similarity

How does it work?

Structural Similarity

Compares the sequences of HTML tags in the two documents to compute the similarity.

Similarity based on tree edit distance is not implemented because it is slower than sequence comparison.
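To make the idea concrete, here is a minimal sketch of tag-sequence comparison using only the Python standard library. It is not the package's own code: the names TagCollector, tag_sequence and structural_similarity_sketch are hypothetical, and the choice of html.parser and difflib.SequenceMatcher is an assumption made for illustration.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records opening tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag (and for self-closing tags such as <br/>).
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names found in an HTML string."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags


def structural_similarity_sketch(html_a, html_b):
    """Similarity ratio in [0, 1] between the two tag sequences."""
    matcher = SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b))
    return matcher.ratio()
```

The text content and attributes are ignored, so two pages with the same tag skeleton score 1.0 even if their wording differs. The scores from this sketch are not guaranteed to match the ones returned by the package.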

Style Similarity

Extracts the CSS classes of each HTML document and calculates the Jaccard similarity of the two sets of classes.
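The Jaccard similarity of two sets is the size of their intersection divided by the size of their union. A minimal sketch under stated assumptions follows: ClassCollector, css_classes and style_similarity_sketch are hypothetical names, the parsing uses the standard library rather than whatever the package uses, and returning 1.0 when neither document has any class is an assumption of this sketch.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Hypothetical helper: gathers every CSS class name used in a document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # A class attribute may hold several whitespace-separated names.
            if name == "class" and value:
                self.classes.update(value.split())


def css_classes(html):
    """Return the set of CSS class names found in an HTML string."""
    parser = ClassCollector()
    parser.feed(html)
    parser.close()
    return parser.classes


def jaccard(set_a, set_b):
    """|A intersect B| / |A union B|; two empty sets count as identical (assumption)."""
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def style_similarity_sketch(html_a, html_b):
    """Jaccard similarity of the CSS class sets of two documents."""
    return jaccard(css_classes(html_a), css_classes(html_b))
```

Because the comparison works on sets, how often a class appears and where it appears do not affect the score.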

Joint Similarity (Structural Similarity and Style Similarity)

The joint similarity metric is calculated as:

k * structural_similarity(document_1, document_2) + (1 - k) * style_similarity(document_1, document_2)

All the similarity metrics take values between 0 and 1.
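The joint metric is a weighted average of the two scores, so as long as k lies between 0 and 1 the result stays between the two inputs and therefore also between 0 and 1. The following sketch only illustrates the formula: joint_similarity_sketch is a hypothetical name, it takes two already-computed scores rather than documents, and the range check on k is an assumption of this sketch.

```python
def joint_similarity_sketch(structural, style, k):
    """Weighted average k * structural + (1 - k) * style, with k in [0, 1]."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must be between 0 and 1")
    return k * structural + (1 - k) * style


# k = 1 uses only the structural score, k = 0 only the style score.
only_structural = joint_similarity_sketch(0.8, 0.2, k=1.0)
only_style = joint_similarity_sketch(0.8, 0.2, k=0.0)
# A small k such as 0.3 weights the style score more heavily.
style_heavy = joint_similarity_sketch(0.8, 0.2, k=0.3)
```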

Recommendations for joint similarity

Using k=0.3 gives us better results. The style similarity carries more information about how similar two pages are than the structural similarity does.

Examples

Here is an example:

In [1]: html_1 = '''

First Document

'''

In [2]: html_2 = '''

Second document Document

'''

In [3]: from html_similarity import style_similarity, structural_similarity, similarity

In [4]: style_similarity(html_1, html_2)

Out[4]: 1.0

In [7]: structural_similarity(html_1, html_2)

Out[7]: 0.9090909090909091

In [8]: similarity(html_1, html_2)

Out[8]: 0.9545454545454546
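As a sanity check, the joint formula can be solved for k using the three outputs shown in the session. This is only arithmetic on the printed numbers, not a statement about the package's source: the result suggests that the call similarity(html_1, html_2) weighted the two scores equally, and the variable names below are hypothetical.

```python
# Scores copied from the session outputs.
style = 1.0
structural = 0.9090909090909091
joint = 0.9545454545454546

# joint = k * structural + (1 - k) * style, rearranged for k.
implied_k = (joint - style) / (structural - style)  # approximately 0.5

# What the joint formula gives for the same scores with the recommended k = 0.3.
recommended_joint = 0.3 * structural + (1 - 0.3) * style  # approximately 0.9727

print(implied_k, recommended_joint)
```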
