[Data Engineering] How to Convert XML to CSV

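This article shows how to convert an XML file to CSV format for data analysis tasks. Parsing the XML structure and arranging it as comma-separated values makes the data easy to load and process.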
import xml.etree.ElementTree as ELT

import pandas as pd
from bs4 import BeautifulSoup
from tqdm import tqdm

def parse_xml_to_csv(path, save_path=None):
    """
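    Open an XML posts dump, extract plain text from each post's HTML body, and convert the result to CSV
    :param path: path to the XML document containing posts
    :param save_path: optional path of the CSV file to write; nothing is written if omitted
    :return: a dataframe of processed text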
    """

    # Use python's standard library to parse xml file
    doc = ELT.parse(path)
    root = doc.getroot()

    # Each row is a question
    all_rows = [row.attrib for row in root.findall('row')]

    # Using tqdm to display progress since preprocessing takes time
    for item in tqdm(all_rows):
        # Extract plain text from the HTML body; rows without a Body attribute yield an empty string
        soup = BeautifulSoup(item.get('Body', ''), features='html.parser')
        item['body_text'] = soup.get_text()

    # Create dataframe from our list of dict
    df = pd.DataFrame.from_dict(all_rows)
    if save_path:
        df.to_csv(save_path)
    return df
    
parse_xml_to_csv("MiniPosts.xml", "1.csv")
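The rows in a posts dump do not all carry the same attributes (in the sample below, only the second row has `AcceptedAnswerId`). pandas fills such gaps with NaN automatically. As a minimal sketch of the same row-to-CSV step using only the standard library, the following collects the union of attribute names and writes missing values as empty strings. The function name `rows_to_csv_text` and the tiny `SAMPLE_XML` document are hypothetical, made up for illustration, and the HTML-stripping step is left out.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature input with the same shape as a posts dump
SAMPLE_XML = """<posts>
  <row Id="1" PostTypeId="1" Score="9" Title="First question" />
  <row Id="2" PostTypeId="1" AcceptedAnswerId="3" Score="4" Title="Second question" />
</posts>
"""


def rows_to_csv_text(xml_text):
    """Convert every <row> element's attributes to one CSV line (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    rows = [dict(row.attrib) for row in root.findall("row")]

    # Build the header as the union of attribute names, in first-seen order
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    # restval fills in attributes that a given row does not have
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv_text(SAMPLE_XML)
print(csv_text)
```

For a dump too large to fit in memory, this approach would need to change, since both this sketch and the function above load every row before writing.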

'''
Sample input file: MiniPosts.xml

<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="5" PostTypeId="1" CreationDate="2014-05-13T23:58:30.457" Score="9" ViewCount="516" Body="&lt;p&gt;I've always been interested in machine learning, but I can't figure out one thing about starting out with a simple &quot;Hello World&quot; example - how can I avoid hard-coding behavior?&lt;/p&gt;&#xA;&#xA;&lt;p&gt;For example, if I wanted to &quot;teach&quot; a bot how to avoid randomly placed obstacles, I couldn't just use relative motion, because the obstacles move around, but I don't want to hard code, say, distance, because that ruins the whole point of machine learning.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;Obviously, randomly generating code would be impractical, so how could I do this?&lt;/p&gt;&#xA;" OwnerUserId="5" LastActivityDate="2014-05-14T00:36:31.077" Title="How can I do simple machine learning without hard-coding behavior?" Tags="&lt;machine-learning&gt;" AnswerCount="1" CommentCount="1" FavoriteCount="1" ClosedDate="2014-05-14T14:40:25.950" />
  <row Id="7" PostTypeId="1" AcceptedAnswerId="10" CreationDate="2014-05-14T00:11:06.457" Score="4" ViewCount="411" Body="&lt;p&gt;As a researcher and instructor, I'm looking for open-source books (or similar materials) that provide a relatively thorough overview of data science from an applied perspective. To be clear, I'm especially interested in a thorough overview that provides material suitable for a college-level course, not particular pieces or papers.&lt;/p&gt;&#xA;" OwnerUserId="36" LastEditorUserId="97" LastEditDate="2014-05-16T13:45:00.237" LastActivityDate="2014-05-16T13:45:00.237" Title="What open-source books (or other materials) provide a relatively thorough overview of data science?" Tags="&lt;education&gt;&lt;open-source&gt;" A