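I'm using Python's mechanize module to submit a simple query to a website, then splitting up the returned elements to get the data I need. But I can't seem to handle the escape sequences that come back correctly. Here is my code: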
import re
import mechanize

def stripEscape(string): #credit goes to sarnold
    # build a string of the control characters 0x01-0x1F and delete them
    delete = ""
    i=1
    while (i<0x20):
        delete += chr(i)
        i += 1
    t = string.translate(None, delete)
    return t

def getHTML(metID):
    br = mechanize.Browser()
    response = br.open("http://urlgoeshere.com")
    br.form = list(br.forms())[0]
    br["PROMPT12"] = metID
    response = br.submit()
    htmlText = response.read()
    parseHTML(htmlText)

def parseHTML(htmlText):
    htmlText.index('table')
    arr = re.split(r'(?\w{2}>)',htmlText) # everything after background tag
    logFile = open('Log.txt','wb')
    for ele in arr:
        ele = stripEscape(ele)
        if ele == '':
            arr.remove(ele)
    for ele in arr:
        logFile.write("ele: "+ele+'\n')
        if re.match('/table', ele):
            logFile.write("END OF TABLE FOUND")
            logFile.write("\nele: "+ele+'\n')
            break
        # other element filters
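The stripEscape function works fine when I pass it arguments in the interactive shell, but one of the array elements from the website is \r\n\r\n, and it "escapes" my filter. It gets written to my log file like this: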
ele: normal
ele: stuff
ele:
ele: more
ele: normal
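To narrow it down, here is a minimal sketch that seems to reproduce the behaviour without mechanize. The list contents are made up, and it is written for Python 3 (a translation table instead of `translate(None, delete)`), so treat it as an assumption about what the first loop is doing rather than the real site data:

```python
# Python 3 stand-in for stripEscape: map control characters 0x01-0x1F to None.
CONTROL_TABLE = {code: None for code in range(1, 0x20)}

# Hypothetical stand-in for the list returned by re.split.
arr = ["normal", "stuff", "\r\n\r\n", "more", "", "normal"]

for ele in arr:
    # This rebinds the local name `ele`; the string stored in arr is untouched.
    ele = ele.translate(CONTROL_TABLE)
    if ele == '':
        # Removes the first '' found in arr, which is a different element
        # from the "\r\n\r\n" entry currently being visited.
        arr.remove(ele)

print(arr)
```

If this matches what happens in parseHTML, the stripped value is never stored back in the list, and `arr.remove('')` deletes some other empty string, so the `\r\n\r\n` element is still there when the second loop runs.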
The closing table tag that slips past the filter throws all of my other filters off. Is there a better way to handle escape sequences?
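One direction that might work, as a sketch only: build a new list of stripped elements instead of editing the list while looping over it. The names `strip_control` and `clean_elements` and the sample data are hypothetical, and the code assumes Python 3, where `str.translate` takes a mapping rather than a delete string:

```python
def strip_control(text):
    """Remove ASCII control characters 0x01-0x1F (Python 3 form of stripEscape)."""
    return text.translate({code: None for code in range(1, 0x20)})


def clean_elements(elements):
    """Return a new list of stripped elements, dropping any that end up empty."""
    stripped = (strip_control(element) for element in elements)
    return [element for element in stripped if element]


# Hypothetical sample standing in for the pieces produced by re.split.
sample = ["normal", "stuff", "\r\n\r\n", "more", "", "/table>"]
print(clean_elements(sample))  # ['normal', 'stuff', 'more', '/table>']
```

Because the result is a fresh list, nothing is skipped or left unstripped, and the later `re.match('/table', ele)` check would see the cleaned strings.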