Talking with the Scenery_Interactive Travel Recommendation System_Data Collection and Preprocessing

Article link: https://blog.csdn.net/chenxucn/article/details/139909864

Table of Contents

- 3. Data Cleaning

3. Data Cleaning

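3.7 Processing JSON Files

In section 3.6 of the previous article we converted the crawled text to txt format. Next, let's look at how to process JSON files. The complete code block is shown below, followed by a step-by-step walkthrough: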

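import json

# clean_text() is not defined in this section; it is assumed to come from an earlier part of this series.
def process_json_file(file_path, ad_keywords):
    """Process a JSON file and return cleaned and deduplicated texts."""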
    with open(file_path, 'r', encoding='utf-8') as file:
        data = json.load(file)

    cleaned_texts = set()
    if isinstance(data, list):
        for item in data:
            if isinstance(item, str):
                cleaned_text = clean_text(item, ad_keywords)
                if cleaned_text:
                    cleaned_texts.add(cleaned_text)
            elif isinstance(item, dict):
                for key, value in item.items():
                    if isinstance(value, str):
                        cleaned_text = clean_text(value, ad_keywords)
                        if cleaned_text:
                            cleaned_texts.add(cleaned_text)
    elif isinstance(data, dict):
        for key, value in data.items():
            if isinstance(value, str):
                cleaned_text = clean_text(value, ad_keywords)
                if cleaned_text:
                    cleaned_texts.add(cleaned_text)

    return list(cleaned_texts)

Reading and parsing the JSON file

with open(file_path, 'r', encoding='utf-8') as file:
    data = json.load(file)

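This code opens the JSON file at the given path and uses json.load to parse its contents into a Python data structure. file_path is the path to the file, and encoding='utf-8' ensures the file is read as UTF-8. The type of data returned by json.load depends on the file's content: usually a list or a dict, although a top-level string, number, boolean, or null is also valid JSON. process_json_file only handles the list and dict cases; for anything else it returns an empty list.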
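As a quick illustration of that point, the sketch below runs json.loads (the string counterpart of json.load) on three made-up JSON documents and records the Python type each one parses to. The sample strings are invented for illustration and are not taken from the crawled data.

```python
import json

# Hypothetical sample documents, not taken from the crawled data.
samples = {
    "array": '["西湖很美", {"title": "杭州游记"}]',
    "object": '{"title": "杭州游记", "views": 120}',
    "string": '"just a string"',
}

# Map each sample name to the Python type that json.loads produces.
parsed_types = {
    name: type(json.loads(text)).__name__ for name, text in samples.items()
}
print(parsed_types)
```

Only the first two shapes are handled by process_json_file; the third would produce an empty result.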

Initializing the set of cleaned texts

cleaned_texts = set()

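This line initializes an empty set, cleaned_texts, to hold the cleaned and deduplicated texts. A set removes duplicates automatically, so identical texts are stored only once.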
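A minimal sketch of that behavior, using made-up strings in place of real cleaned texts:

```python
# Made-up cleaned strings; the second and third are identical.
texts = ["西湖很美", "灵隐寺值得一去", "灵隐寺值得一去"]

cleaned_texts = set()
for text in texts:
    cleaned_texts.add(text)  # adding an element that already exists is a no-op

# Three inputs, but only two distinct entries end up in the set.
print(len(texts), len(cleaned_texts))
```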

Handling list data

if isinstance(data, list):
    for item in data:
        if isinstance(item, str):
            cleaned_text = clean_text(item, ad_keywords)
            if cleaned_text:
                cleaned_texts.add(cleaned_text)
        elif isinstance(item, dict):
            for key, value in item.items():
                if isinstance(value, str):
                    cleaned_text = clean_text(value, ad_keywords)
                    if cleaned_text:
                        cleaned_texts.add(cleaned_text)

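This part handles the case where data is a list. It first checks whether data is a list, then iterates over each element:

- If the element is a string, it calls clean_text to clean it and, if the result is non-empty, adds it to the cleaned_texts set.
- If the element is a dict, it iterates over the dict's key-value pairs and checks whether each value is a string. If so, it calls clean_text and adds the non-empty result to cleaned_texts.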

Handling dict data

elif isinstance(data, dict):
    for key, value in data.items():
        if isinstance(value, str):
            cleaned_text = clean_text(value, ad_keywords)
            if cleaned_text:
                cleaned_texts.add(cleaned_text)

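This part handles the case where data is a dict. It first checks whether data is a dict, then iterates over its key-value pairs:

- If the value is a string, it calls clean_text to clean it and, if the result is non-empty, adds it to the cleaned_texts set.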
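Both branches look only one or two levels deep: a string stored in a list inside a dict, for example, is skipped. If the crawled JSON is nested more deeply, one possible alternative (not the approach used in this article) is a small recursive generator. The name iter_strings is hypothetical, and the sample data is made up.

```python
def iter_strings(node):
    """Yield every string found anywhere inside a parsed JSON value."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    # Numbers, booleans and None are ignored.


# Made-up sample with a list nested inside a dict.
sample = [
    "西湖很美",
    {"title": "杭州游记", "tags": ["西湖", "灵隐寺"], "views": 120},
]
found = list(iter_strings(sample))
print(found)
```

Each yielded string could then be passed to clean_text in the same way as in the loops above.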

Returning the list of cleaned texts

return list(cleaned_texts)

Finally, the set cleaned_texts is converted to a list and returned, giving us the cleaned and deduplicated list of texts.
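One thing to keep in mind is that a set is unordered, so list(cleaned_texts) does not necessarily keep the order in which the texts appeared in the file. If the original order matters, a common alternative (not what the code above does) is dict.fromkeys, which removes duplicates while keeping the first occurrence of each item in place:

```python
# Made-up values standing in for cleaned texts.
texts = ["b", "a", "b", "c", "a"]

# dict keys are unique and keep insertion order (Python 3.7+).
unique_in_order = list(dict.fromkeys(texts))
print(unique_in_order)  # ['b', 'a', 'c']
```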

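With the analysis above, we have walked step by step through how TXT and JSON files are processed, how the texts are cleaned with clean_text, and how they are deduplicated, explaining the logic and purpose of each piece of code along the way.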

3.8 Saving the Cleaned Data to a File

def save_to_file(cleaned_data, output_file):
    """Save cleaned data to a file."""
    with open(output_file, 'w', encoding='utf-8') as file:
        for line in cleaned_data:
            file.write(line + '\n')

This writes the cleaned data to the specified file, one entry per line.
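A small round-trip check can confirm the one-entry-per-line layout. The sketch below repeats save_to_file so that it runs on its own, writes two made-up entries to a temporary directory, and reads them back; the file name is arbitrary.

```python
import os
import tempfile


def save_to_file(cleaned_data, output_file):
    """Save cleaned data to a file, one entry per line (same as above)."""
    with open(output_file, 'w', encoding='utf-8') as file:
        for line in cleaned_data:
            file.write(line + '\n')


# Made-up sample entries written to a temporary directory.
sample = ["西湖很美", "灵隐寺值得一去"]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_sample.txt")
    save_to_file(sample, path)
    with open(path, 'r', encoding='utf-8') as file:
        read_back = file.read().splitlines()

print(read_back == sample)
```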

3.9 Main Function

def main():
    # Define ad keywords to filter out
    ad_keywords = ['sale', 'buy now', 'free', 'click here', 'subscribe']

    # Process txt file
    txt_file_path = 'data.txt'
    cleaned_txt_data = process_txt_file(txt_file_path, ad_keywords)
    save_to_file(cleaned_txt_data, 'cleaned_data.txt')

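    # Process json file (note: save_to_file writes plain text lines, so the
    # output below is not valid JSON despite its .json extension)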
    json_file_path = 'data.json'
    cleaned_json_data = process_json_file(json_file_path, ad_keywords)
    save_to_file(cleaned_json_data, 'cleaned_data.json')

if __name__ == '__main__':
    main()

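The main function does three things:

- Defines the list of ad keywords.
- Processes the TXT file and saves the cleaned result.
- Processes the JSON file and saves the cleaned result.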
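Since save_to_file writes one entry per line, the second output is a plain text file even though it is named cleaned_data.json. If a parseable JSON file is wanted instead, one option is json.dump. The sketch below is an assumption about how that could look, not part of the original pipeline; save_as_json is a hypothetical helper name and the sample entries are made up.

```python
import json
import os
import tempfile


def save_as_json(cleaned_data, output_file):
    """Hypothetical helper: write the cleaned texts as a JSON array."""
    with open(output_file, 'w', encoding='utf-8') as file:
        # ensure_ascii=False keeps Chinese characters readable in the file.
        json.dump(cleaned_data, file, ensure_ascii=False, indent=2)


# Made-up sample entries written to a temporary directory.
sample = ["西湖很美", "灵隐寺值得一去"]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_sample.json")
    save_as_json(sample, path)
    with open(path, 'r', encoding='utf-8') as file:
        loaded = json.load(file)

print(loaded == sample)
```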

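3.10 Conclusion

By systematically cleaning the different types of data (TXT and JSON files), we significantly improved the quality and consistency of the data. The cleaning process consisted of a series of steps: removing emoji, keeping only text characters, and filtering out ad keywords. These steps ensure that the final dataset contains no irrelevant characters or duplicate entries and is reasonably clean and accurate. The experimental data and cleaning results are summarized below:

Experimental data and cleaning results

We crawled the following sample data from the Mafengwo (马蜂窝) travel website: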

Initial dataset, in total:

- 6000+ JSON objects
- 22769504 characters

Cleaned dataset, in total:

- 6000+ JSON objects
- 18198924 characters

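As the figures show, cleaning filtered out some JSON objects that did not match the expected format, but only a few dozen, which suggests that the crawler we chose for data collection worked well.
The character count, however, dropped by about 4.57 million (from 22769504 to 18198924, roughly 20%), which shows that a considerable number of text fragments did not meet the format requirements. The cleaning was therefore effective: content in the JSON files that did not meet the input requirements of the large model was removed.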
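The size of the reduction can be checked with simple arithmetic on the two character counts reported above:

```python
# Character counts reported for the initial and cleaned datasets.
chars_before = 22769504
chars_after = 18198924

removed = chars_before - chars_after
removed_ratio = removed / chars_before

print(removed)                 # 4570580
print(f"{removed_ratio:.1%}")  # 20.1%
```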