How to parse XML that contains HTML

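I'm trying to parse Adium's XML log format with Python. I want to build a clean database of chats, with all formatting and hyperlinks stripped out.

I'm currently using xmltodict to build lists/dicts that I can iterate over. But whenever I hit a hyperlink or text formatting, I run into big problems. I think it's because I'm trying to brute-force my way through the XML: xmltodict puts the additional tags deeper in the lists/dicts.

Basically, I feel like I'm heading in the wrong direction.

Here are two snippets of the XML I'm working with.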

XML 1:

<?xml version="1.0" encoding="UTF-8" ?>

time is not of the essence

it is the essence

yo

whats up?

XML 2:

<?xml version="1.0" encoding="UTF-8" ?>

2nd take, with the bonus stuff I think

This is the code I've been using (sorry, there's some junk in it):

import os

import xmltodict

def get_list_of_all_files_in_sub(dirName):
    # Create a list of the files in the given directory
    # and in all of its subdirectories
    listOfFile = os.listdir(dirName)
    allFiles = list()
    # Iterate over all the entries
    for entry in listOfFile:
        # Create full path
        fullPath = os.path.join(dirName, entry)
        # If entry is a directory then get the list of files in this directory
        if os.path.isdir(fullPath):
            allFiles = allFiles + get_list_of_all_files_in_sub(fullPath)
        else:
            allFiles.append(fullPath)
    return allFiles

def get_files_with_extension(path, file_extension=""):
    # Gets a list of all files with a certain extension in a folder and all subfolders
    files = get_list_of_all_files_in_sub(path)
    all_files_with_extension = []
    for file in files:
        if file.split(".")[-1] == file_extension:
            all_files_with_extension.append(file)
    return all_files_with_extension

allmessages = []
files = get_files_with_extension("/Users/Desktop/chats", "chatlog")

for file in files:
    print(file)
    with open(file, encoding="utf-8") as fd:
        doc = xmltodict.parse(fd.read())
    messages = doc['chat']['message']
    # This is gross, but xmltodict returns a single dict instead of a list
    # when there is only one message, so wrap it in a list to work with
    # the rest of the code.
    if not isinstance(messages, list):
        print("NOT A LIST")
        messages = [messages]
    for message in messages:
        # Check whether a SPAN exists inside DIV, i.e. whether there's a real message in it.
        if 'span' not in message["div"]:
            continue
        # If there's no sender, just leave the message out of the output.
        if message["@sender"] == "":
            continue
        time = message["@time"]
        print(time)
        username = message["@sender"]
        print(username)
        # SET THE MESSAGE
        # If there are multiple spans within one message, they come in as a list.
        # So far it has just been things like warnings and offline notifications.
        # This seems to happen with AIM messages.
        span = message["div"]['span']
        text_message = ""
        if isinstance(span, list):
            print(span)
            for submessage in span:
                # Each entry is normally a dict; text that sits inside nested
                # formatting spans is still not picked up here.
                if isinstance(submessage, dict):
                    text = submessage.get("#text", "")
                else:
                    text = submessage or ""
                if text and "Offline IM sent" not in text:
                    text_message = text
        else:
            text_message = span["#text"]
        print(text_message)
        # Merge consecutive messages from the same sender into one entry.
        if allmessages and username == allmessages[-1]["sender"]:
            if allmessages[-1]["message"].endswith('.'):
                text_message = allmessages[-1]["message"] + " " + text_message
            else:
                text_message = allmessages[-1]["message"] + ". " + text_message
            del allmessages[-1]
        newmessage = {
            'time': time,
            'sender': username,
            'message': text_message,
        }
        allmessages.append(newmessage)

for message in allmessages:
    print("{} {}: {}".format(message['time'], message['sender'], message['message']))

I noticed how xmltodict handles the HTML tags. In the output they end up like this:

OrderedDict([('span', OrderedDict([('@style', 'font-family: Arial; font-size: 10pt;'), ('#text', 'time is not of the essence')]))])

OrderedDict([('span', [OrderedDict([('@style', 'font-family: Arial; font-size: 10pt;'), ('span', OrderedDict([('@style', 'font-style: italic;'), ('#text', 'is')])), ('#text', 'it')]), OrderedDict([('@style', 'font-family: Helvetica; font-size: 12pt;')]), OrderedDict([('@style', 'font-family: Arial; font-size: 10pt;'), ('#text', 'the essence')])])])

As you can see, the #text of a formatted message gets pulled out and split apart. Is there a better approach, or does anyone have ideas on how to handle this?
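One direction that might sidestep the nested-dictionary problem is to let an XML tree API flatten the markup instead of walking dicts. The following is only a minimal sketch using the standard library's xml.etree.ElementTree, whose Element.itertext() yields all text inside an element in document order, including text inside nested formatting tags. The sample document, its element and attribute names (chat, message, div, span, sender, time), the sender and time values, and the helper names local_name and extract_messages are all assumptions made for illustration, inferred from the xmltodict output above rather than taken from a real Adium log.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: tag and attribute names are assumptions inferred
# from the xmltodict output, not a verified Adium log.
SAMPLE = """<?xml version="1.0" encoding="UTF-8" ?>
<chat>
  <message sender="alice" time="2019-01-01T10:00:00">
    <div><span style="font-family: Arial; font-size: 10pt;">time is not of the essence</span></div>
  </message>
  <message sender="alice" time="2019-01-01T10:00:05">
    <div><span style="font-family: Arial; font-size: 10pt;">it <span style="font-style: italic;">is</span></span><span style="font-family: Helvetica; font-size: 12pt;"> </span><span style="font-family: Arial; font-size: 10pt;">the essence</span></div>
  </message>
</chat>
"""


def local_name(tag):
    """Drop a '{namespace}' prefix, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def extract_messages(xml_text):
    """Return one dict per <message>, with all markup flattened to plain text."""
    root = ET.fromstring(xml_text)
    messages = []
    for elem in root.iter():
        if local_name(elem.tag) != "message":
            continue
        # itertext() walks every text node under the element in document
        # order, so nested spans no longer need special handling.
        text = " ".join("".join(elem.itertext()).split())
        sender = elem.get("sender")
        if not text or not sender:
            continue  # skip empty messages and messages without a sender
        messages.append({"time": elem.get("time"), "sender": sender, "message": text})
    return messages


for m in extract_messages(SAMPLE):
    print("{} {}: {}".format(m["time"], m["sender"], m["message"]))
```

Because the text is collected in document order, the italic "is" stays between "it" and "the essence" instead of being pulled out into a separate key. Matching on the local tag name is a guess at how to cope with a default namespace, should the real logs declare one.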
