To analyze the overall quality of a document collection, the core breaks down into a few pieces:

| Metric | Meaning | Example |
|---|---|---|
| Source diversity (`source_diversity`) | Ratio of distinct sources to total documents, e.g. were these collected from 10 different sites, or scraped from a single one? | 100 documents from 20 unique sources → score 0.2 |
| Content freshness (`content_freshness`) | How recently the documents were updated; a collection where everything is from the last month is very fresh | 80% of documents updated in the last 30 days → score 0.8 |
| Information density (`information_density`) | Average word count per document; more words means denser information (a simplifying assumption) | Average of 500 words per document → normalized to a 0-1 score |
| Content specificity (`content_specificity`) | Whether the content is all short, filler sentences or has long, concrete passages | Estimated e.g. from average words per sentence |
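Before the full version, here is a minimal sketch of the four scoring formulas on their own, with toy numbers from the table. The helper names are just for illustration, and the 500-word and 12-word ceilings are the same simplifying assumptions used in the full code below:

```python
def score_diversity(unique_sources: int, total_docs: int) -> float:
    # ratio of distinct sources to documents: 20 sources / 100 docs -> 0.2
    return unique_sources / total_docs

def score_freshness(fresh_docs: int, total_docs: int) -> float:
    # share of documents updated within the freshness window (e.g. 30 days)
    return fresh_docs / total_docs

def score_density(avg_words_per_doc: float, ceiling: int = 500) -> float:
    # normalize average word count per document, capped at 1.0
    return min(avg_words_per_doc / ceiling, 1.0)

def score_specificity(avg_sentence_length: float, target: int = 12) -> float:
    # normalize average sentence length, capped at 1.0
    return min(avg_sentence_length / target, 1.0)

print(score_diversity(20, 100), score_freshness(80, 100), score_density(500), score_specificity(12))
# -> 0.2 0.8 1.0 1.0
```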
Here's a full version of the code.

🔵 Note: this assumes your `get_collection_stats()` can return a batch of document data, or that you add an extra method (shown further down) to simulate fetching documents.

The full code first:
```python
from typing import Any, Dict, List, Optional
import time
from datetime import datetime, timedelta


def analyze_source_quality(self, document_ids: Optional[List[str]] = None) -> Dict[str, Any]:
    """
    Analyze the overall quality of the data sources: source diversity,
    content freshness, information density, and content specificity.
    """
    start_time = time.time()
    self.logger.info("Analyzing source quality (real version)")
    try:
        # Fetch the full list of documents
        documents = self.get_all_documents()
        if not documents:
            raise ValueError("No documents found for quality analysis")
        total_documents = len(documents)

        # ------- 1. Source diversity -------
        sources = [doc.get("source", "unknown") for doc in documents]
        unique_sources = len(set(sources))
        source_diversity = unique_sources / total_documents

        # ------- 2. Content freshness -------
        now = datetime.utcnow()
        fresh_threshold = now - timedelta(days=30)  # updated within 30 days counts as fresh
        fresh_documents = [
            doc for doc in documents
            if datetime.strptime(doc.get("last_updated", "1970-01-01"), "%Y-%m-%d") >= fresh_threshold
        ]
        content_freshness = len(fresh_documents) / total_documents

        # ------- 3. Information density -------
        total_words = sum(len(doc.get("content", "").split()) for doc in documents)
        avg_words_per_doc = total_words / total_documents
        # Treat 500 words per document as the ceiling and clamp the score to the 0-1 range
        information_density = min(avg_words_per_doc / 500, 1.0)

        # ------- 4. Content specificity -------
        # Rough estimate from sentence length: longer sentences tend to be more specific
        sentences = [
            sentence
            for doc in documents
            for sentence in doc.get("content", "").split(".")
            if sentence.strip()
        ]
        avg_sentence_length = sum(len(s.split()) for s in sentences) / max(len(sentences), 1)
        # Assume an average of 12+ words per sentence indicates good content
        content_specificity = min(avg_sentence_length / 12, 1.0)

        return {
            "success": True,
            "collection_stats": {
                "total_documents": total_documents,
                "unique_sources": unique_sources,
                "avg_words_per_doc": avg_words_per_doc,
                "avg_sentence_length": avg_sentence_length,
            },
            "quality_metrics": {
                "source_diversity": round(source_diversity, 2),
                "content_freshness": round(content_freshness, 2),
                "information_density": round(information_density, 2),
                "content_specificity": round(content_specificity, 2),
            },
            "processing_time": time.time() - start_time,
        }
    except Exception as e:
        self.logger.error(f"Error analyzing source quality: {e}")
        return {
            "success": False,
            "error": str(e),
            "processing_time": time.time() - start_time,
        }
```
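One thing to watch: `strptime(..., "%Y-%m-%d")` will raise (and trip the `except` branch) if `last_updated` comes back as a full ISO timestamp or is missing. If your store returns mixed formats, a more forgiving parse might look like this sketch (standard library only; `parse_updated` is a hypothetical helper, not part of the code above):

```python
from datetime import datetime

def parse_updated(value: str) -> datetime:
    """Best-effort parse of a last_updated field; falls back to the epoch (treated as stale)."""
    for fmt in ("%Y-%m-%d", "%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S"):
        try:
            return datetime.strptime(value, fmt)
        except (TypeError, ValueError):
            continue
    return datetime(1970, 1, 1)
```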
If you don't have `get_all_documents`, you can add a small mock method yourself, for example:
```python
def get_all_documents(self) -> List[Dict[str, Any]]:
    """Return a batch of mock documents."""
    return [
        {"source": "news_site_a", "last_updated": "2025-04-10", "content": "This is a long article about AI development. It covers..."},
        {"source": "news_site_b", "last_updated": "2025-03-28", "content": "Short news about weather."},
        {"source": "news_site_a", "last_updated": "2025-04-12", "content": "Detailed review of the latest smartphone..."},
        {"source": "blog_site_c", "last_updated": "2025-01-05", "content": "A very short blog."},
        # add as much mock data as you like
    ]
```
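With the mock in place, here's one way to wire everything up and run it. The `DocumentStore` class name and the logger setup are assumptions for this demo, not part of your existing code:

```python
import logging

logging.basicConfig(level=logging.INFO)

class DocumentStore:
    """Hypothetical host class; swap in your real class."""
    def __init__(self):
        self.logger = logging.getLogger("document_store")

# attach the two module-level functions defined above as methods
DocumentStore.analyze_source_quality = analyze_source_quality
DocumentStore.get_all_documents = get_all_documents

report = DocumentStore().analyze_source_quality()
print(report["quality_metrics"])
# e.g. {'source_diversity': 0.75, ...}  (freshness depends on today's date vs. the mock dates)
```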
Here is what this real version of the code does:
- 🔎 Counts how many distinct sources there are (source_diversity)
- ⏳ Checks how many documents were updated recently (content_freshness)
- 🧠 Computes roughly how many words each document has (information_density)
- 📚 Looks at sentence length to estimate how specific the content is (content_specificity)
A concrete example:
Say you have 100 documents:
- 20 distinct sources (source_diversity = 20/100 = 0.2)
- 70 updated within the last 30 days (content_freshness = 0.7)
- 400 words per document on average (information_density = 400/500 = 0.8)
- An average sentence length of 10 words (content_specificity = 10/12 ≈ 0.83)