To analyze the overall quality of a document collection, the core breaks down into a few pieces:

| Metric | Meaning | Example |
|---|---|---|
| Source diversity (`source_diversity`) | Ratio of distinct sources to total documents, e.g. were these collected from 10 different sites, or scraped from a single one? | 100 documents from 20 unique sources → score 0.2 |
| Content freshness (`content_freshness`) | How recently the documents were updated; a collection where everything is from the last month is very fresh | 80% of documents updated in the last 30 days → score 0.8 |
| Information density (`information_density`) | Average word count per document; more words means denser information (a simplifying assumption) | Average of 500 words per document → normalized to a 0-1 score |
| Content specificity (`content_specificity`) | Whether the content is all short, filler sentences or has long, concrete passages | Estimated e.g. from average words per sentence |
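Before the full version, here is a minimal sketch of the four scoring formulas on their own, with toy numbers from the table. The helper names are just for illustration, and the 500-word and 12-word ceilings are the same simplifying assumptions used in the full code below:

```python
def score_diversity(unique_sources: int, total_docs: int) -> float:
    # ratio of distinct sources to documents: 20 sources / 100 docs -> 0.2
    return unique_sources / total_docs

def score_freshness(fresh_docs: int, total_docs: int) -> float:
    # share of documents updated within the freshness window (e.g. 30 days)
    return fresh_docs / total_docs

def score_density(avg_words_per_doc: float, ceiling: int = 500) -> float:
    # normalize average word count per document, capped at 1.0
    return min(avg_words_per_doc / ceiling, 1.0)

def score_specificity(avg_sentence_length: float, target: int = 12) -> float:
    # normalize average sentence length, capped at 1.0
    return min(avg_sentence_length / target, 1.0)

print(score_diversity(20, 100), score_freshness(80, 100), score_density(500), score_specificity(12))
# -> 0.2 0.8 1.0 1.0
```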
Here's a full version of the code.

🔵 Note: this assumes your `get_collection_stats()` can return a batch of document data, or that you add an extra method (shown further down) to simulate fetching documents.

The full code first:
```python
from typing import Any, Dict, List, Optional
import time
from datetime import datetime, timedelta


def analyze_source_quality(self, document_ids: Optional[List[str]] = None) -> Dict[str, Any]:
    """
    Analyze the overall quality of the data sources: source diversity,
    content freshness, information density, and content specificity.
    """
    start_time = time.time()
    self.logger.info("Analyzing source quality (real version)")
    try:
        # Fetch the full list of documents
        documents = self.get_all_documents()
        if not documents:
            raise ValueError("No documents found for quality analysis")
        total_documents = len(documents)

        # ------- 1. Source diversity -------
        sources = [doc.get("source", "unknown") for doc in documents]
        unique_sources = len(set(sources))
        source_diversity = unique_sources / total_documents

        # ------- 2. Content freshness -------
        now = datetime.utcnow()
        fresh_threshold = now - timedelta(days=30)  # updated within 30 days counts as fresh
        fresh_documents = [
            doc for doc in documents
            if datetime.strptime(doc.get("last_updated", "1970-01-01"), "%Y-%m-%d") >= fresh_threshold
        ]
        content_freshness = len(fresh_documents) / total_documents

        # ------- 3. Information density -------
        total_words = sum(len(doc.get("content", "").split()) for doc in documents)
        avg_words_per_doc = total_words / total_documents
        # Treat 500 words per document as the ceiling and clamp the score to the 0-1 range
        information_density = min(avg_words_per_doc / 500, 1.0)

        # ------- 4. Content specificity -------
        # Rough estimate from sentence length: longer sentences tend to be more specific
        sentences = [
            sentence
            for doc in documents
            for sentence in doc.get("content", "").split(".")
            if sentence.strip()
        ]
        avg_sentence_length = sum(len(s.split()) for s in sentences) / max(len(sentences), 1)
        # Assume an average of 12+ words per sentence indicates good content
        content_specificity = min(avg_sentence_length / 12, 1.0)

        return {
            "success": True,
            "collection_stats": {
                "total_documents": total_documents,
                "unique_sources": unique_sources,
                "avg_words_per_doc": avg_words_per_doc,
                "avg_sentence_length": avg_sentence_length,
            },
            "quality_metrics": {
                "source_diversity": round(source_diversity, 2),
                "content_freshness": round(content_freshness, 2),
                "information_density": round(information_density, 2),
                "content_specificity": round(content_specificity, 2),
            },
            "processing_time": time.time() - start_time,
        }
    except Exception as e:
        self.logger.error(f"Error analyzing source quality: {e}")
        return {
            "success": False,
            "error": str(e),
            "processing_time": time.time() - start_time,
        }
```
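One thing to watch: `strptime(..., "%Y-%m-%d")` will raise (and trip the `except` branch) if `last_updated` comes back as a full ISO timestamp or is missing. If your store returns mixed formats, a more forgiving parse might look like this sketch (standard library only; `parse_updated` is a hypothetical helper, not part of the code above):

```python
from datetime import datetime

def parse_updated(value: str) -> datetime:
    """Best-effort parse of a last_updated field; falls back to the epoch (treated as stale)."""
    for fmt in ("%Y-%m-%d", "%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S"):
        try:
            return datetime.strptime(value, fmt)
        except (TypeError, ValueError):
            continue
    return datetime(1970, 1, 1)
```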
If you don't have `get_all_documents`, you can add a small mock method yourself, for example:
```python
def get_all_documents(self) -> List[Dict[str, Any]]:
    """Return a batch of mock documents."""
    return [
        {"source": "news_site_a", "last_updated": "2025-04-10", "content": "This is a long article about AI development. It covers..."},
        {"source": "news_site_b", "last_updated": "2025-03-28", "content": "Short news about weather."},
        {"source": "news_site_a", "last_updated": "2025-04-12", "content": "Detailed review of the latest smartphone..."},
        {"source": "blog_site_c", "last_updated": "2025-01-05", "content": "A very short blog."},
        # add as much mock data as you like
    ]
```
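With the mock in place, here's one way to wire everything up and run it. The `DocumentStore` class name and the logger setup are assumptions for this demo, not part of your existing code:

```python
import logging

logging.basicConfig(level=logging.INFO)

class DocumentStore:
    """Hypothetical host class; swap in your real class."""
    def __init__(self):
        self.logger = logging.getLogger("document_store")

# attach the two module-level functions defined above as methods
DocumentStore.analyze_source_quality = analyze_source_quality
DocumentStore.get_all_documents = get_all_documents

report = DocumentStore().analyze_source_quality()
print(report["quality_metrics"])
# e.g. {'source_diversity': 0.75, ...}  (freshness depends on today's date vs. the mock dates)
```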
Here is what this real version of the code does:
- 🔎 Counts how many distinct sources there are (source_diversity)
- ⏳ Checks how many documents were updated recently (content_freshness)
- 🧠 Computes roughly how many words each document has (information_density)
- 📚 Looks at sentence length to estimate how specific the content is (content_specificity)
A concrete example:
Say you have 100 documents:
- 20 distinct sources (source_diversity = 20/100 = 0.2)
- 70 updated within the last 30 days (content_freshness = 0.7)
- 400 words per document on average (information_density = 400/500 = 0.8)
- An average sentence length of 10 words (content_specificity = 10/12 ≈ 0.83)