How to parse XML that contains HTML

Latest recommended article published 2021-12-27 00:33:07

白驹映月


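I'm trying to parse Adium's XML log format with Python. I want to build a clean database of chats, with all formatting and hyperlinks stripped out.

I'm currently using xmltodict to create lists/dicts that I can iterate over. But I run into big problems whenever I hit a hyperlink or text formatting. I think it's because I'm trying to brute-force my way through the XML: xmltodict puts the additional tags deeper inside the lists/dicts.

Basically, I feel like I'm going about this the wrong way.

Here are two snippets of the XML I'm working with.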

XML 1:

<?xml version="1.0" encoding="UTF-8" ?>

time is not of the essence

it is the essence

whats up?

XML 2:

<?xml version="1.0" encoding="UTF-8" ?>

http://www.youtube.com/watch?v=LqbJx4TFFEE&feature=related

2nd take, with the bonus stuff I think

This is the code I've been using (sorry, there's some cruft in it):

import os

import xmltodict


def get_list_of_all_files_in_sub(dirName):
    # Return the full paths of all files in dirName and its subdirectories.
    allFiles = []
    for entry in os.listdir(dirName):
        fullPath = os.path.join(dirName, entry)
        # Recurse into subdirectories; collect everything else as a file.
        if os.path.isdir(fullPath):
            allFiles = allFiles + get_list_of_all_files_in_sub(fullPath)
        else:
            allFiles.append(fullPath)
    return allFiles


def get_files_with_extension(path, file_extension=""):
    # Gets a list of all files with a certain extension in a folder and all subfolders.
    files = get_list_of_all_files_in_sub(path)
    all_files_with_extension = []
    for file in files:
        if file.split(".")[-1] == file_extension:
            all_files_with_extension.append(file)
    return all_files_with_extension

allmessages = []
files = get_files_with_extension("/Users/Desktop/chats", "chatlog")

for file in files:
    print(file)
    with open(file) as fd:
        doc = xmltodict.parse(fd.read())
    messages = doc['chat']['message']

    # This is gross, but if there is only one message, xmltodict returns a
    # single dict rather than a list, so wrap it to work with the rest of the code.
    if not isinstance(messages, list):
        print("NOT A LIST")
        messages = [messages]

    for message in messages:
        # Check that a SPAN exists inside the DIV, i.e. that there is a real message in it.
        if 'span' not in message["div"]:
            continue
        # If there's no sender, don't include the message in the output.
        if message["@sender"] == "":
            continue

        time = message["@time"]
        username = message["@sender"]
        print(time)
        print(username)

        # SET THE MESSAGE
        # If there are multiple spans within one message, they come in as a list.
        # So far it's just been things like warnings and offline notifications.
        # This seems to happen with AIM messages.
        spans = message["div"]['span']
        text_message = ""
        if isinstance(spans, list):
            print(spans)
            for submessage in spans:
                print("---------------1----------------")
                print(submessage)
                print("---------------2----------------")
                # Only spans parsed to a dict with direct text carry a "#text" key.
                if isinstance(submessage, dict) and "#text" in submessage:
                    if "Offline IM sent" not in submessage["#text"]:
                        text_message = submessage["#text"]
                        print(text_message)
        else:
            # Single span: take its direct text (text nested in child tags is not included).
            text_message = spans["#text"]
            print(text_message)

        # Merge with the previous entry when the same sender is still talking.
        if len(allmessages) > 0 and username == allmessages[-1]["sender"]:
            if allmessages[-1]["message"].endswith('.'):
                text_message = allmessages[-1]["message"] + " " + text_message
            else:
                text_message = allmessages[-1]["message"] + ". " + text_message
            del allmessages[-1]

        newmessage = {'time': time, 'sender': username, 'message': text_message}
        allmessages.append(newmessage)

for message in allmessages:
    print("{} {}: {}".format(message['time'], message['sender'], message['message']))
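The merging step at the end of the loop above (consecutive messages from the same sender are joined into one entry, with a period added where one is missing) could also be done as a separate pass over the finished list. This is only a minimal sketch, assuming each entry is a dict with the time, sender and message keys used above; merge_consecutive is a hypothetical helper name, not part of the original code.

```python
from itertools import groupby


def merge_consecutive(messages):
    """Collapse runs of consecutive messages from one sender into a single entry.

    Hypothetical helper: expects dicts with 'time', 'sender' and 'message' keys.
    """
    merged = []
    for sender, run in groupby(messages, key=lambda m: m["sender"]):
        run = list(run)
        parts = [m["message"] for m in run]
        # Every part except the last gets a trailing period if it lacks one.
        head = [p if p.endswith(".") else p + "." for p in parts[:-1]]
        merged.append({
            "time": run[-1]["time"],  # keep the time of the latest message in the run
            "sender": sender,
            "message": " ".join(head + [parts[-1]]),
        })
    return merged


if __name__ == "__main__":
    demo = [
        {"time": "t1", "sender": "a", "message": "time is not of the essence"},
        {"time": "t2", "sender": "a", "message": "it is the essence"},
        {"time": "t3", "sender": "b", "message": "whats up?"},
    ]
    for row in merge_consecutive(demo):
        print("{} {}: {}".format(row["time"], row["sender"], row["message"]))
```

Keeping the merge out of the parsing loop means the parsing code only has to produce one entry per message, which may make it easier to debug.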

I noticed how xmltodict handles the HTML tags. In the output they end up like this:

OrderedDict([('span', OrderedDict([('@style', 'font-family: Arial; font-size: 10pt;'), ('#text', 'time is not of the essence')]))])

OrderedDict([('span', [OrderedDict([('@style', 'font-family: Arial; font-size: 10pt;'), ('span', OrderedDict([('@style', 'font-style: italic;'), ('#text', 'is')])), ('#text', 'it')]), OrderedDict([('@style', 'font-family: Helvetica; font-size: 12pt;')]), OrderedDict([('@style', 'font-family: Arial; font-size: 10pt;'), ('#text', 'the essence')])])])

As you can see, the #text that carries formatting gets pulled out and separated from the rest. Is there a better way to do this, or any other ideas?
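One possible direction is to skip the dict conversion and read the text straight from the element tree, since the standard library's xml.etree.ElementTree can yield all nested text of an element in document order via itertext(). The following is only a sketch under stated assumptions: the SAMPLE document is made up to mirror the chat/message/div/span structure and the sender/time attributes that the code above relies on, it is not a real Adium log, and extract_messages and local are hypothetical helper names. Real logs may declare XML namespaces, which is why the sketch compares local tag names only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modelled on the structure used in the question.
SAMPLE = """<chat>
  <message sender="alice" time="2010-01-01T10:00:00">
    <div><span style="font-family: Arial; font-size: 10pt;">it <span style="font-style: italic;">is</span></span><span style="font-family: Helvetica; font-size: 12pt;"> </span><span style="font-family: Arial; font-size: 10pt;">the essence</span></div>
  </message>
  <message sender="bob" time="2010-01-01T10:01:00">
    <div><a href="http://www.youtube.com/watch?v=LqbJx4TFFEE&amp;feature=related">http://www.youtube.com/watch?v=LqbJx4TFFEE&amp;feature=related</a></div>
  </message>
</chat>
"""


def local(tag):
    """Drop a '{namespace}' prefix, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def extract_messages(xml_text):
    """Return one dict per message with all formatting tags flattened away."""
    root = ET.fromstring(xml_text)
    rows = []
    for msg in root.iter():
        if local(msg.tag) != "message":
            continue
        # itertext() walks every descendant, so text nested inside
        # span/a/etc. is collected no matter how deep it sits.
        text = " ".join("".join(msg.itertext()).split())
        sender = msg.get("sender")
        if sender and text:
            rows.append({"time": msg.get("time"), "sender": sender, "message": text})
    return rows


if __name__ == "__main__":
    for row in extract_messages(SAMPLE):
        print("{} {}: {}".format(row["time"], row["sender"], row["message"]))
```

This keeps the visible text of links. If link text should be dropped as well, one option would be to remove the a elements from each message before calling itertext().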
