Python: searching for a Unicode string in HTML with index/find returns the wrong position...

I am trying to parse the number of results from the HTML returned by a search query, but find()/index() seems to return the wrong position. The string I am searching for contains an accented character, so I search for it as a Unicode string.

A snippet of the HTML code being parsed:

Aproximádamente 37 resultados.

and I search for it like this:

str_start = html.index(u'Aproxim\xe1damente ')

str_end = html.find(' resultados', str_start + 16)#len('Aproxim\xe1damente ')==16

print html[str_start+16:str_end] #works by changing 16 to 24

The print statement returns:

damente 37

When the expected result is:

37

It seems str_start does not point to the beginning of the string I am searching for, but to a position 8 places earlier.

print html[str_start:str_start+5]

Outputs:

l">

The problem is hard to reproduce, because it does not happen with the snippet alone, only when searching the entire HTML string. I could simply change str_start+16 to str_start+24 to get it working as intended, but that doesn't help me understand the problem. Is it a Unicode issue? Hopefully someone can shed some light on it.

Thank you.

SAMPLE CODE:

from urllib2 import Request, urlopen

url = 'http://guiasamarillas.com.mx/buscador/?actividad=Chedraui&localidad=&id_page=1'

post = None

headers = {'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2)'}

req = Request(url, post, headers)

conn = urlopen(req)

html = conn.read()

str_start = html.index(u'Aproxim\xe1damente ')

str_end = html.find(' resultados', str_start + 16)

print html[str_start+16:str_end]

Solution

Your problem ultimately boils down to the fact that in Python 2.x the str type represents a sequence of bytes, while the unicode type represents a sequence of characters. Because one character can be encoded as several bytes, the unicode representation of a string may have a different length than its str representation. In the same way, an index into the unicode representation may point to a different part of the text than the same index into the str representation.
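
A minimal sketch of that mismatch, using the phrase from the question. It is written to run unchanged on Python 2.6+ and Python 3; the variable names are illustrative and do not come from the original code.

```python
# -*- coding: utf-8 -*-
# Illustrative names only; none of these appear in the question's code.
text = u'Aproxim\xe1damente 37 resultados'   # sequence of characters
raw = text.encode('utf-8')                    # sequence of bytes

char_len = len(text)   # the accented 'a' counts as one character
byte_len = len(raw)    # the same letter occupies two bytes in UTF-8
print(char_len)        # 29
print(byte_len)        # 30

# The same substring sits at a different offset in each representation.
char_pos = text.index(u'37')
byte_pos = raw.index(b'37')
print(char_pos)        # 16
print(byte_pos)        # 17
```

Every multi-byte character before a given point pushes the byte offset further away from the character offset.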

What's happening is that when you do str_start = html.index(u'Aproxim\xe1damente '), Python implicitly decodes the html variable to unicode using the interpreter's default encoding (sys.getdefaultencoding()). That default is normally ASCII, which is why on my PC that line simply raises a UnicodeDecodeError; your default encoding has presumably been set to utf-8. Consequently, if str_start is n, then u'Aproxim\xe1damente ' starts at the nth character of the HTML. However, html itself is still a byte string, so when you later use n as a slice index to get the content after the (n+16)th character, what you actually get is the content after the (n+16)th byte. The two are not equivalent here, because earlier content in the page includes accented characters that take up 2 bytes each when encoded in utf-8. The 8-position discrepancy you observed is consistent with eight such characters appearing before the match.
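
The effect can be reproduced without the network. The sketch below uses a made-up page fragment (not the real site's HTML) with two accented characters ahead of the marker, and keeps the character offset and the byte offset in separate variables so the drift is visible. It runs on Python 2.6+ and Python 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical page fragment: two accented characters precede the marker.
page = u'<p>Gu\xeda telef\xf3nica</p><p class="total">Aproxim\xe1damente 37 resultados.</p>'
raw = page.encode('utf-8')          # stands in for what conn.read() returns

marker = u'Aproxim\xe1damente '

char_start = page.index(marker)                  # offset counted in characters
byte_start = raw.index(marker.encode('utf-8'))   # offset counted in bytes

# Each earlier 2-byte character shifts the byte offset by one.
shift = byte_start - char_start
print(shift)                        # 2

# Using the character offset to slice the byte string lands too early,
# which is the symptom described in the question.
wrong = raw[char_start:char_start + 5]
right = page[char_start:char_start + 5]
print(repr(wrong))
print(repr(right))
```

Here the slice of raw starts two bytes before the marker, inside the preceding tag, just as the question's slice started inside `l">`.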

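The best solution is simply to decode the html to unicode as soon as you receive it, so that searching and slicing both happen on the same character-based string. This small modification to your sample code does what you want, with no errors or weird behaviour, assuming the page is encoded in utf-8: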

from urllib2 import Request, urlopen

url = 'http://guiasamarillas.com.mx/buscador/?actividad=Chedraui&localidad=&id_page=1'

post = None

headers = {'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2)'}

req = Request(url, post, headers)

conn = urlopen(req)

html = conn.read().decode('utf-8')

str_start = html.index(u'Aproxim\xe1damente ')

str_end = html.find(' resultados', str_start + 16)

print html[str_start+16:str_end]
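
As a variation on the same idea, the hard-coded 16 can be avoided altogether. The sketch below swaps the index/find slicing for a regular expression, which is a different technique from the answer's code; extract_count and COUNT_RE are hypothetical names, and utf-8 is only an assumed default (the real encoding should be taken from the response's Content-Type header or the page's meta tag). It runs on Python 2.6+ and Python 3.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical helper, not part of the original answer.
COUNT_RE = re.compile(u'Aproxim\xe1damente (\\d+) resultados')

def extract_count(raw, encoding='utf-8'):
    """Decode the raw response bytes first, then search in character space."""
    text = raw.decode(encoding)
    match = COUNT_RE.search(text)
    return int(match.group(1)) if match else None

# Made-up sample bytes standing in for conn.read().
sample = u'<p>Gu\xeda</p><p>Aproxim\xe1damente 37 resultados.</p>'.encode('utf-8')
count = extract_count(sample)
print(count)   # 37
```

Because decoding happens once at the boundary, no offset is ever computed on one representation and applied to the other.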
