I'm trying to parse .svg files from http://kanjivg.tagaini.net/, but I can't extract the information inside them.
A part of 0f9ab.svg looks like this:
My .py file:
import lxml.etree as ET
svg = ET.parse('0f9ab.svg')
print(svg)  # <lxml.etree._ElementTree object at 0x...>
# AttributeError: 'lxml.etree._ElementTree' object has no attribute 'tag'
print(svg.tag)
# TypeError: 'lxml.etree._ElementTree' object is not subscriptable
print(svg[0])
# TypeError: 'lxml.etree._ElementTree' object is not iterable
for child in svg:
    print(child)
# None
print(svg.find("./svg"))
# []
print(svg.findall("//g"))
# []
print(svg.xpath("//g"))
Purpose
I tried all kinds of operations I could think of, but nothing gets me any data from the .svg file.
I want to extract the kanji (Japanese characters) stored in the kvg:element="kanji" attributes, which occur at different depth levels.
Question
Is lxml the wrong package for this?
If not, how do I extract information from my parsed .svg file?
Other solution
I could of course just read the file as a string and search
for kvg:element=", but I would like to use the proper way of extracting data from xml
/ svg.
I used xmltodict before, but my code became really messy extracting kvg:element, because they were at different depth levels.
Solution
.parse() returns an ElementTree, which represents the tree as a whole. To query individual nodes, you need an Element, most likely the root element of the tree.
Replace part of your code with this:
xml = ET.parse('0f9ab.svg')
svg = xml.getroot()
print(svg)  # <Element {http://www.w3.org/2000/svg}svg at 0x...>
and I think you'll have some success.
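To make the difference concrete, here is a minimal sketch that uses a small inline document in place of 0f9ab.svg (the sample markup is made up for illustration). The fallback import is only there so the snippet also runs without lxml installed; the calls used are available in both libraries. The operations that failed on the tree work on the root element:

```python
import io

try:
    import lxml.etree as ET
except ImportError:  # the standard library offers the same calls used here
    import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real file.
sample = b'<svg xmlns="http://www.w3.org/2000/svg"><g id="a"><g id="b"/></g></svg>'

tree = ET.parse(io.BytesIO(sample))  # ElementTree: the document as a whole
root = tree.getroot()                # Element: the <svg> node itself

print(root.tag)                              # {http://www.w3.org/2000/svg}svg
print(root[0].get("id"))                     # a
print([child.get("id") for child in root])   # ['a']
```

Note that the tag already carries the namespace in braces, which is why a plain "svg" or "g" never matches.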
Note also that .findall() requires a relative path and, in your case, a namespace qualifier:
print(svg.findall(".//{http://www.w3.org/2000/svg}g"))
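Building on that, here is a hedged sketch of how the kvg:element values could be collected regardless of depth. It assumes the kvg prefix is bound to http://kanjivg.tagaini.net (check the xmlns:kvg declaration in your own file), and it uses a made-up inline sample rather than the real 0f9ab.svg. The helper name kvg_elements is hypothetical:

```python
try:
    import lxml.etree as ET
except ImportError:  # the standard library offers the same calls used here
    import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
KVG_NS = "http://kanjivg.tagaini.net"  # assumption: verify against xmlns:kvg in your file

# Made-up sample with nested groups, standing in for a KanjiVG file.
SAMPLE = (
    '<svg xmlns="http://www.w3.org/2000/svg" '
    'xmlns:kvg="http://kanjivg.tagaini.net">'
    '<g id="outer" kvg:element="林">'
    '<g id="left" kvg:element="木"><path d="M0,0"/></g>'
    '<g id="right" kvg:element="木"/>'
    '</g>'
    '</svg>'
)


def kvg_elements(root):
    """Return the kvg:element value of every <g> at any depth, in document order."""
    attr = "{%s}element" % KVG_NS
    # .iter() walks the whole subtree, so nesting depth does not matter.
    return [g.get(attr) for g in root.iter("{%s}g" % SVG_NS)
            if g.get(attr) is not None]


root = ET.fromstring(SAMPLE)
print(kvg_elements(root))  # ['林', '木', '木']
```

For a real file, pass ET.parse('0f9ab.svg').getroot() to the helper instead of the inline sample. Attributes with a prefix are looked up by their full namespace in braces, just like tags.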