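Dialogue with Scenery: An Interactive Travel Recommendation System. Data Preprocessing and Analysis (6): Identifying Outlier Data from Text Features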

listenningb

Last modified on 2024-06-24 11:54:21

Article tags: scenic travel

First published on 2024-06-24 01:42:51

Article link: https://blog.csdn.net/qq_64391508/article/details/139909806


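While inspecting the data, we found that certain errors during crawling can introduce outliers, for example a single English text among many Chinese ones, or an error message saved in place of an article. We therefore try a text-feature-based method to identify these outliers.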

I. Problem analysis

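During training we found many error messages in the data, of the kind shown in the figure below. Entries like these bias our data, and recommending one of them as a travel guide would be a serious mistake, so we decided to handle this case. Our analysis is as follows. When a page fails to load, the captured text consists of error messages, and the vast majority of modern browser error messages are in English. An English text in this otherwise Chinese corpus can therefore be treated as an outlier, that is, data we do not need. We use a text-feature-based approach: write a function that identifies the main language of each text, then filter on the result.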
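To make the idea concrete before bringing in a library, here is a minimal sketch of a much cruder check: the share of Chinese characters among all letters in a text. This is a simple character-ratio heuristic, not the langid approach used in this post, and the function names and the 0.3 threshold are assumptions chosen only for illustration.

```python
def cjk_ratio(text: str) -> float:
    """Return the share of CJK ideographs among all alphabetic characters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cjk = sum(1 for ch in letters if "\u4e00" <= ch <= "\u9fff")
    return cjk / len(letters)


def looks_like_error_page(text: str, min_cjk_ratio: float = 0.3) -> bool:
    """Flag texts with too few Chinese characters (hypothetical threshold)."""
    return cjk_ratio(text) < min_cjk_ratio


samples = [
    "西湖的春天非常适合散步，建议早上去断桥。",
    "HTTP Error 404: Not Found",
    "",
]
# One flag per sample; empty texts are flagged as well
flags = [looks_like_error_page(s) for s in samples]
print(flags)
```

A check like this is cheap, but it cannot tell English from other Latin-script languages, which is one reason to prefer a proper language identifier.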

II. Implementation

1. Installing the langid library

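We chose the langid library mainly because it performs well at language detection and has the following advantages:

Accuracy and speed: langid identifies languages from n-gram features, which gives a good balance of accuracy and speed. It quickly identifies the main language of a text and covers most common languages.
Ease of use: langid has a small interface that is easy to integrate into Python. A single call to langid.classify(text) returns the language code of the text together with a score.
Multilingual support: langid recognizes many languages, including the major European and Asian languages and other widely used languages.
Open source: langid is an open-source project, so its code can be inspected and problems can be reported to its maintainers.
Use cases: it suits language detection on large text collections, for example text classification or anomaly detection during data cleaning.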
```
pip install langid
```
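To illustrate what "n-gram-based language identification" means, the sketch below trains a toy naive Bayes classifier on UTF-8 byte n-grams. It is not langid's implementation: the class name, the two-language training set, and the smoothing choice are all assumptions made for this example, and a real model is trained on far more text and many more languages.

```python
import math
from collections import Counter
from typing import Dict, List


def byte_ngrams(text: str, n_max: int = 2) -> List[bytes]:
    """Return all UTF-8 byte n-grams of length 1..n_max."""
    data = text.lower().encode("utf-8")
    return [data[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(data) - n + 1)]


class TinyNgramLangId:
    """Toy multinomial naive Bayes over byte n-grams (illustration only)."""

    def __init__(self, training: Dict[str, List[str]]):
        # One n-gram frequency table per language
        self.counts = {
            lang: Counter(g for text in texts for g in byte_ngrams(text))
            for lang, texts in training.items()
        }
        self.vocab_size = len(set().union(*self.counts.values()))

    def classify(self, text: str) -> str:
        scores = {}
        for lang, counts in self.counts.items():
            # Add-one smoothing so unseen n-grams do not zero out a language
            total = sum(counts.values()) + self.vocab_size
            scores[lang] = sum(math.log((counts[g] + 1) / total)
                               for g in byte_ngrams(text))
        # Uniform prior: pick the language with the highest log-likelihood
        return max(scores, key=scores.get)


# Hypothetical miniature training set
training = {
    "zh": ["西湖的风景很美，值得一去。", "这家酒店离景区很近，交通方便。"],
    "en": ["an error occurred while loading the page",
           "the requested resource could not be found"],
}
model = TinyNgramLangId(training)
labels = [model.classify("page not found error"),
          model.classify("景区的风景很美")]
print(labels)
```

Chinese and English are easy to separate this way because their UTF-8 bytes barely overlap, which is also why an English error message stands out so clearly in a Chinese corpus.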

2. Reading the data

import langid
import pandas as pd

file_path = 'C:/Users/48594/Desktop/深度学习/res/your_file.json'

def filter_outliers_by_language(file_path):
    # Read the JSON file into a DataFrame
    df = pd.read_json(file_path, encoding='utf-8')

3. Defining the function

    # Helper that detects the main language of a text
    def detect_language(text):
        try:
            # langid.classify returns a (language code, score) pair
            lang, _ = langid.classify(text)
            return lang
        except Exception as e:
            print(f"Error detecting language: {e}")
            return None
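Language identification tends to be unreliable on very short strings, and a crawled column may also contain missing values. The sketch below shows one way to guard the call. `safe_detect`, `min_chars`, and `fake_classify` are hypothetical names introduced here; the classifier is passed in as a parameter so the example runs without langid installed, and in practice `langid.classify` could be passed instead.

```python
from typing import Callable, Optional, Tuple


def safe_detect(text, classify: Callable[[str], Tuple[str, float]],
                min_chars: int = 10) -> Optional[str]:
    """Return a language code, or None when the input is unusable.

    `classify` is expected to return a (language, score) pair, as
    langid.classify does; `min_chars` is a hypothetical cut-off.
    """
    if not isinstance(text, str):
        return None  # e.g. a missing value in the crawled data
    stripped = text.strip()
    if len(stripped) < min_chars:
        return None  # too short to classify with any confidence
    lang, _score = classify(stripped)
    return lang


def fake_classify(text: str) -> Tuple[str, float]:
    """Stand-in classifier so the sketch runs without langid installed."""
    return ("en" if text.isascii() else "zh", 0.0)


results = [
    safe_detect(None, fake_classify),
    safe_detect("  ok ", fake_classify),
    safe_detect("Traceback (most recent call last)", fake_classify),
    safe_detect("西湖的春天非常适合散步和拍照", fake_classify),
]
print(results)
```

Returning None for unusable input keeps the decision about such rows separate from the decision about English rows.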

4. Storing and filtering

    # Add a new column holding the detected language of each text
    df['language'] = df['text'].apply(detect_language)

    # Drop the texts whose main language is English
    df_filtered = df[df['language'] != 'en']

    # Return the filtered DataFrame
    return df_filtered
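One detail of the filter is worth spelling out: a row whose language could not be detected (None) passes the `!= 'en'` comparison and is therefore kept. The sketch below is a plain-Python variant that instead sets undetected texts aside with the English ones and keeps the dropped records for inspection. `split_by_language` and the stub detector are assumptions made for this example, not part of the code above.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def split_by_language(records: Iterable[Dict],
                      detect: Callable[[str], Optional[str]],
                      drop_langs=frozenset({"en"})) -> Tuple[List[Dict], List[Dict]]:
    """Partition records into (kept, dropped) by detected language."""
    kept, dropped = [], []
    for record in records:
        lang = detect(record.get("text", ""))
        tagged = {**record, "language": lang}
        # Undetected texts (lang is None) are set aside for manual review
        if lang is None or lang in drop_langs:
            dropped.append(tagged)
        else:
            kept.append(tagged)
    return kept, dropped


def stub_detect(text: str) -> Optional[str]:
    """Stand-in detector so the sketch runs without langid installed."""
    if not text:
        return None
    return "en" if text.isascii() else "zh"


records = [
    {"text": "西湖的春天非常适合散步。"},
    {"text": "HTTP Error 404: Not Found"},
    {"text": ""},
]
kept, dropped = split_by_language(records, stub_detect)
print(len(kept), len(dropped))
```

Keeping the dropped records makes it easy to spot-check whether the filter removed any genuine travel guides.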

Full code

import json
import re
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from transformers import BertTokenizer, BertModel
import torch
from torch.nn.functional import normalize

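# Text preprocessing function
def preprocess_text(text):
    text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace into one space
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation and other special characters
    return text

# Load the data
file_path = 'C:/Users/48594/Desktop/深度学习/res/your_file.json'  # Replace with your own file path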

with open(file_path, 'r', encoding='utf-8') as file:
    data = json.load(file)

texts = [preprocess_text(entry['text']) for entry in data]

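# Extract the text length feature
text_lengths = [len(text) for text in texts]

# Extract the vocabulary diversity feature
# Note: split() only breaks on whitespace, so unsegmented Chinese text yields very few tokens
def compute_vocab_diversity(text):
    words = text.split()
    return len(set(words)) / len(words) if len(words) > 0 else 0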

vocab_diversity = [compute_vocab_diversity(text) for text in texts]

# Extract TF-IDF features
vectorizer = TfidfVectorizer(max_features=1000)
tfidf_matrix = vectorizer.fit_transform(texts).toarray()

# Extract BERT embedding features
class BertEmbedder:
    def __init__(self, model_name='bert-base-chinese'):
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.model = BertModel.from_pretrained(model_name)
        self.model.eval()

    def embed_text(self, text):
        inputs = self.tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=128)
        with torch.no_grad():
            outputs = self.model(**inputs)
        return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

embedder = BertEmbedder()
bert_embeddings = np.array([embedder.embed_text(text) for text in texts])

# Concatenate all features
features = np.hstack([
    np.array(text_lengths).reshape(-1, 1),
    np.array(vocab_diversity).reshape(-1, 1),
    tfidf_matrix,
    bert_embeddings,
])

# Dimensionality reduction (n_components=50 requires at least 50 texts)
pca = PCA(n_components=50)
reduced_features = pca.fit_transform(features)

# Identify outliers with Isolation Forest
isolation_forest = IsolationForest(contamination=0.05)
outliers = isolation_forest.fit_predict(reduced_features)

# Mark the outliers (-1) and the inliers (1)
outlier_indices = np.where(outliers == -1)[0]
inlier_indices = np.where(outliers == 1)[0]

print(f"Identified {len(outlier_indices)} outliers")

# Collect the outlier texts and the normal texts
outlier_texts = [texts[i] for i in outlier_indices]
inlier_texts = [texts[i] for i in inlier_indices]

# Save the outliers and the normal texts to files
outlier_file_path = 'C:/Users/48594/Desktop/深度学习/res/outliers.json'
inlier_file_path = 'C:/Users/48594/Desktop/深度学习/res/inliers.json'

with open(outlier_file_path, 'w', encoding='utf-8') as file:
    json.dump(outlier_texts, file, ensure_ascii=False, indent=4)

with open(inlier_file_path, 'w', encoding='utf-8') as file:
    json.dump(inlier_texts, file, ensure_ascii=False, indent=4)

print(f"Outliers saved to {outlier_file_path}")
print(f"Inliers saved to {inlier_file_path}")

if __name__ == "__main__":
    # Print the indices of the outliers identified above
    print("Outlier indices:", outlier_indices.tolist())
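The vocabulary diversity feature in the listing splits on whitespace, which gives almost no signal for unsegmented Chinese text because a whole passage becomes one or two tokens. As a minimal sketch of one alternative, the ratio can be computed over characters instead. `char_diversity` is a hypothetical helper introduced here; word segmentation with a tokenizer such as jieba would be another option and is not shown.

```python
def word_diversity(text: str) -> float:
    """Whitespace-token diversity, as in compute_vocab_diversity."""
    words = text.split()
    return len(set(words)) / len(words) if words else 0.0


def char_diversity(text: str) -> float:
    """Distinct characters divided by total characters, ignoring whitespace."""
    chars = [ch for ch in text if not ch.isspace()]
    return len(set(chars)) / len(chars) if chars else 0.0


repetitive = "好好好好好好好好"
varied = "西湖断桥风景优美"

# Both strings are a single whitespace token, so the word-level score is identical
word_scores = (word_diversity(repetitive), word_diversity(varied))
# The character-level score separates the repetitive string from the varied one
char_scores = (char_diversity(repetitive), char_diversity(varied))
print(word_scores, char_scores)
```

With the character-level version, spam-like repeated text gets a low score while ordinary sentences score higher, which is the contrast the feature is meant to capture.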

In our tests this approach worked well and handled the task effectively.
