[python] Converting the SemEval 2014 dataset from XML to CSV and TXT

Article link: https://blog.csdn.net/weixin_44319595/article/details/142073726

These are study notes I put together for my own reference. If you spot a mistake, corrections are welcome~

Introduction to SemEval 2014
The 4 subtasks
Data format
XML to CSV
XML to TXT

Introduction to SemEval 2014

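The ABSA (Aspect Based Sentiment Analysis) task of SemEval 2014 targets fine-grained sentiment analysis in NLP: given a sentence, identify the aspects it mentions and the sentiment polarity expressed toward each one. The data consists of laptop reviews and restaurant reviews.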

The 4 subtasks

1. Aspect term extraction
2. Aspect term polarity
3. Aspect category detection
4. Aspect category polarity
Subtasks 1 and 2 come with both laptop and restaurant data, whereas subtasks 3 and 4 come with restaurant data only.
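
Because the laptop data carries aspect terms only, a script that reads aspect categories would find nothing in it. As a rough sketch for subtasks 1 and 2, the aspect terms could be pulled out as below. The function name `extract_aspect_terms` and the sample sentences are made up for illustration; the element and attribute names (`aspectTerm`, `term`, `polarity`, `from`, `to`) are assumed to follow the SemEval 2014 XML layout.

```python
import xml.etree.ElementTree as ET

def extract_aspect_terms(root):
    """Return (text, term, polarity, start, end) rows, one per <aspectTerm>."""
    rows = []
    for sentence in root.findall('sentence'):
        text = sentence.find('text').text
        # iter() simply yields nothing when a sentence has no <aspectTerms> block
        for term in sentence.iter('aspectTerm'):
            rows.append((text,
                         term.get('term'),
                         term.get('polarity'),
                         int(term.get('from')),
                         int(term.get('to'))))
    return rows

# Made-up sample in the laptop style: aspect terms only, no aspect categories
LAPTOP_SAMPLE = """<sentences>
    <sentence id="1">
        <text>The battery life is excellent.</text>
        <aspectTerms>
            <aspectTerm term="battery life" polarity="positive" from="4" to="16"/>
        </aspectTerms>
    </sentence>
    <sentence id="2">
        <text>I bought it last week.</text>
    </sentence>
</sentences>"""

rows = extract_aspect_terms(ET.fromstring(LAPTOP_SAMPLE))
for row in rows:
    print(row)
```

The `from` and `to` attributes are character offsets into the sentence text, so `text[start:end]` gives back the term itself.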

Data format

The raw data is in XML format.
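
To give a feel for the layout, here is a small hand-written sample in the same shape as the restaurant file. The sentence and its labels are invented for illustration; only the element and attribute names are meant to mirror the real data. The snippet parses the sample and lists what each sentence contains.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only: the review text and labels are made up
SAMPLE_XML = """<sentences>
    <sentence id="1">
        <text>The pasta was great but the waiter was rude.</text>
        <aspectTerms>
            <aspectTerm term="pasta" polarity="positive" from="4" to="9"/>
            <aspectTerm term="waiter" polarity="negative" from="28" to="34"/>
        </aspectTerms>
        <aspectCategories>
            <aspectCategory category="food" polarity="positive"/>
            <aspectCategory category="service" polarity="negative"/>
        </aspectCategories>
    </sentence>
</sentences>"""

root = ET.fromstring(SAMPLE_XML)
sentence = root.find('sentence')

# Direct children of a <sentence>: the text plus the two annotation blocks
child_tags = [child.tag for child in sentence]
print(child_tags)

# Each <aspectCategory> stores its labels as attributes, not as element text
categories = [(c.get('category'), c.get('polarity'))
              for c in sentence.iter('aspectCategory')]
print(categories)
```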

XML to CSV

xml_csv.py

import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
import pandas as pd

def xml_csv(xml):
    csv_name = 'SemEval2014_Restaurants.csv'

    # Parse the XML file
    tree = ET.parse(xml)
    root = tree.getroot()
    # Collect all <sentence> elements
    sentences = root.findall('sentence')
    # One [text, category, polarity] row per aspect category;
    # sentences without an <aspectCategories> element are skipped
    data = []

    # Iterate over every sentence
    for sentence in sentences:
        # Sentence text
        text = sentence.find('text').text

        # Check whether the sentence has an <aspectCategories> child
        aspect_categories_element = sentence.find('aspectCategories')
        if aspect_categories_element is not None:
            # All <aspectCategory> elements under <aspectCategories>
            aspect_categories = aspect_categories_element.findall('aspectCategory')

            # Read the category and polarity attributes of each one
            for aspect_category in aspect_categories:
                category = aspect_category.get('category')
                polarity = aspect_category.get('polarity')
                data.append([text, category, polarity])

    df = pd.DataFrame(data, columns=['text', 'category', 'polarity'])
    # Keep only the three standard labels (this drops e.g. 'conflict');
    # .copy() avoids pandas' SettingWithCopyWarning on the assignment below
    df = df[df['polarity'].isin(['positive', 'negative', 'neutral'])].copy()
    # Encode the labels as integers
    df['polarity'] = df['polarity'].map(
        {'positive': 1, 'neutral': 0, 'negative': -1})

    df.to_csv(path_or_buf=csv_name, index=False)

# Generate the CSV
xml = 'SemEval2014_Restaurants.xml'
xml_csv(xml)
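
The `isin` filter in xml_csv.py silently discards every row whose polarity is not one of the three listed labels; in the SemEval 2014 data that means the `conflict` label. Before relying on the CSV, it may be worth counting how many rows that affects. The sketch below does this on a made-up sample; the name `KEPT` and the sample sentences are illustrative, not part of the script above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up sample containing one 'conflict' label
SAMPLE_XML = """<sentences>
    <sentence id="1">
        <text>Great food, mixed feelings about the service.</text>
        <aspectCategories>
            <aspectCategory category="food" polarity="positive"/>
            <aspectCategory category="service" polarity="conflict"/>
        </aspectCategories>
    </sentence>
    <sentence id="2">
        <text>Overpriced.</text>
        <aspectCategories>
            <aspectCategory category="price" polarity="negative"/>
        </aspectCategories>
    </sentence>
</sentences>"""

# Labels that survive the isin(...) filter
KEPT = {'positive', 'negative', 'neutral'}

root = ET.fromstring(SAMPLE_XML)
# Count every polarity value that appears on an <aspectCategory>
counts = Counter(c.get('polarity') for c in root.iter('aspectCategory'))
# Labels that the filter would remove, with their row counts
dropped = {label: n for label, n in counts.items() if label not in KEPT}

print(counts)
print(dropped)
```

On the real file, replacing `ET.fromstring(SAMPLE_XML)` with `ET.parse('SemEval2014_Restaurants.xml').getroot()` should give the full label distribution.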

SemEval2014_Restaurants.csv
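
The file has one row per (sentence, category) pair with the columns `text`, `category`, `polarity`. As a quick way to check the result, the rows can be read back with the standard `csv` module. The two rows below are invented stand-ins for the real output, read from an in-memory string so the snippet runs without the file.

```python
import csv
import io

# Invented rows in the same column layout as the generated file;
# text containing a comma is quoted, as pandas does when writing
SAMPLE_CSV = '''text,category,polarity
"The pasta was great, but the waiter was rude.",food,1
"The pasta was great, but the waiter was rude.",service,-1
'''

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
# Polarity comes back as a string, so convert it to int explicitly
labels = [int(row['polarity']) for row in rows]

for row in rows:
    print(row['polarity'], row['category'], row['text'])
```

For the real file, `open('SemEval2014_Restaurants.csv', newline='', encoding='utf-8')` would take the place of `io.StringIO(SAMPLE_CSV)`.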

XML to TXT

xml_txt.py

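import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

# Integer codes for the three standard labels, the same mapping as in xml_csv.py
POLARITY_TO_INT = {'positive': 1, 'neutral': 0, 'negative': -1}

def xml_txt(xml):
    # Parse the XML file
    tree = ET.parse(xml)
    root = tree.getroot()
    # Collect all <sentence> elements
    sentences = root.findall('sentence')

    # Open with 'w' so that re-running the script does not append duplicate rows;
    # the with block closes the file even if an error occurs
    with open('SemEval2014_Restaurants.txt', 'w', encoding='utf-8') as txt:
        # Iterate over every sentence
        for sentence in sentences:
            # Sentence text
            text = sentence.find('text').text

            # Check whether the sentence has an <aspectCategories> child
            aspect_categories_element = sentence.find('aspectCategories')
            if aspect_categories_element is not None:
                # All <aspectCategory> elements under <aspectCategories>
                aspect_categories = aspect_categories_element.findall('aspectCategory')

                # Read the category and polarity attributes of each one
                for aspect_category in aspect_categories:
                    category = aspect_category.get('category')
                    polarity = aspect_category.get('polarity')
                    # Skip labels outside the three classes (e.g. 'conflict')
                    # instead of writing them as neutral
                    if polarity not in POLARITY_TO_INT:
                        continue
                    # One tab-separated line: polarity, category, text
                    txt.write(f"{POLARITY_TO_INT[polarity]}\t{category}\t{text}\n")

# Generate the TXT
xml = 'SemEval2014_Restaurants.xml'
xml_txt(xml)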
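
Each line of the TXT file is `polarity<TAB>category<TAB>text`. A minimal sketch of reading it back is shown below; the helper name `read_txt` and the sample lines are made up for illustration, and the sample is read from an in-memory string rather than the real file.

```python
import io

def read_txt(handle):
    """Parse polarity<TAB>category<TAB>text lines into (int, str, str) tuples."""
    rows = []
    for line in handle:
        line = line.rstrip('\n')
        if not line:
            continue  # ignore blank lines
        # Split on the first two tabs only, so a tab inside the text survives
        polarity, category, text = line.split('\t', 2)
        rows.append((int(polarity), category, text))
    return rows

# Invented lines in the same layout as the generated file
SAMPLE_TXT = (
    "1\tfood\tThe pasta was great, but the waiter was rude.\n"
    "-1\tservice\tThe pasta was great, but the waiter was rude.\n"
)

rows = read_txt(io.StringIO(SAMPLE_TXT))
for row in rows:
    print(row)
```

With the real output, passing `open('SemEval2014_Restaurants.txt', encoding='utf-8')` to `read_txt` should work the same way.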