I apologize for raising this question again, but it still isn't solved.
It's not a very complicated problem, and I'm sure it's fairly straightforward, but I simply can't see it.
The code I use to parse the XML file opens and reads it in exactly the format I want — the print statement in the last for-loop proves this.
For example, it outputs the following:

Pivoting support handle D0584129 20090106 US
Hinge D0584130 20090106 US
Deadbolt turnpiece D0584131 20090106 US
This is exactly how I want my data written to the CSV file. However, when I try to write these as rows to the CSV itself, it only prints a single row, from the last entry in the XML file, like this: Flashlight package,D0584138,20090106,US
Below is my entire code, since it may help in understanding the whole process; the area of interest starts at the for xml_string in separated_xml(infile): line:

from bs4 import BeautifulSoup
import csv
import unicodecsv as csv

infile = "C:\\Users\\Grisha\\Documents\\Inventor\\2009_Data\\Jan\\ipg090106.xml"

# separated_xml breaks the XML apart from each root element (<?xml ...) to the next occurrence of it
def separated_xml(infile):
    file = open(infile, "r")       # Open the xml file
    buffer = [file.readline()]     # Read each line and place it in a list
    # This for-loop slices the USPTO file into sections that can be read and parsed individually.
    # It is necessary because Python expects only one root element per document, but this element
    # appears many times in each file, which causes reading errors.
    for line in file:              # Scan the opened file for root elements
        if line.startswith("<?xml "):
            yield "".join(buffer)  # "yield" emits one document per root element; .join connects the list "buffer" into a single string
            buffer = []            # Start a blank list for the next 'set' of data, beginning with the root element
        buffer.append(line)        # Pass lines into the list
    yield "".join(buffer)          # Output the final document
    file.close()

# The second, nested set of for-loops parses the reformatted data into a new list
for xml_string in separated_xml(infile):      # Iterate over each separated document
    soup = BeautifulSoup(xml_string, "lxml")  # BeautifulSoup parses the string, converting the XML to Unicode
    pub_ref = soup.findAll("publication-reference")  # Begin parsing at every instance of a publication
    lst = []                                  # Empty list to append into
    with open('./output.csv', 'wb') as f:
        writer = csv.writer(f, dialect='excel')
        for info in pub_ref:                  # Loop over all instances of publication
            # The final loop finds every instance of invention name, patent number, date, and country
            for inv_name, pat_num, date_num, country in zip(soup.findAll("invention-title"), soup.findAll("doc-number"), soup.findAll("date"), soup.findAll("country")):
                print(inv_name.text, pat_num.text, date_num.text, country.text)
                lst.append((inv_name.text, pat_num.text, date_num.text, country.text))
                writer.writerow([inv_name.text, pat_num.text, date_num.text, country.text])
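As a sanity check, the splitting generator itself does behave as described. Here is the same idea applied to an in-memory string (a sketch; my separated_xml above reads from a file path, so this uses an io.StringIO stand-in and a made-up two-document sample):

```python
import io

def separated_xml_from(handle):
    # Same splitting idea as separated_xml above, but taking any file-like object:
    # start a new document whenever a line begins with an <?xml declaration.
    buffer = [handle.readline()]
    for line in handle:
        if line.startswith("<?xml "):
            yield "".join(buffer)
            buffer = []
        buffer.append(line)
    yield "".join(buffer)

sample = (
    '<?xml version="1.0"?>\n<a>first</a>\n'
    '<?xml version="1.0"?>\n<a>second</a>\n'
)
docs = list(separated_xml_from(io.StringIO(sample)))
print(len(docs))  # 2 documents, one per <?xml declaration
```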
I have also tried placing the open and the writer outside the for-loop to check where things go wrong, but no luck. I know the file is being written one row at a time and the same row is being rewritten over and over (which is why only one row is left in the CSV file); I just can't see where.
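To show the symptom without needing the USPTO XML file, here is a minimal, self-contained reproduction of the same writing pattern (made-up batch data, and Python 3's built-in csv module for simplicity instead of unicodecsv):

```python
import csv
import os
import tempfile

# Stand-in for the per-document rows my outer loop produces.
batches = [
    [("Pivoting support handle", "D0584129")],
    [("Hinge", "D0584130")],
    [("Flashlight package", "D0584138")],
]

path = os.path.join(tempfile.mkdtemp(), "output.csv")
for batch in batches:
    # Re-opened in write mode on every pass of the outer loop,
    # just like the with-open inside my for xml_string loop above.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, dialect="excel")
        for row in batch:
            writer.writerow(row)

with open(path, newline="") as f:
    rows = list(csv.reader(f))
print(rows)  # only the last batch survives: [['Flashlight package', 'D0584138']]
```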
Thanks in advance for your help.