I am trying to parse the number of results from the HTML returned by a search query, but when I use find()/index() it seems to return the wrong position. The string I am searching for contains an accented character, so I search for it as a Unicode string.
A snippet of the HTML code being parsed:
Aproximádamente 37 resultados.
and I search for it like this:
str_start = html.index(u'Aproxim\xe1damente ')
str_end = html.find(' resultados', str_start + 16)#len('Aproxim\xe1damente ')==16
print html[str_start+16:str_end] #works by changing 16 to 24
The print statement returns:
damente 37
When the expected result is:
37
It seems str_start doesn't point at the beginning of the string I am searching for, but 8 positions before it.
print html[str_start:str_start+5]
Outputs:
l">
The problem is hard to replicate though because it doesn't happen when using the code snippet, only when searching inside the entire HTML string. I could simply change str_start+16 to str_start+24 to get it working as intended, however that doesn't help me understand the problem. Is it a Unicode issue? Hopefully someone can shed some light on the issue.
Thank you.
SAMPLE CODE:
from urllib2 import Request, urlopen
url = 'http://guiasamarillas.com.mx/buscador/?actividad=Chedraui&localidad=&id_page=1'
post = None
headers = {'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2)'}
req = Request(url, post, headers)
conn = urlopen(req)
html = conn.read()
str_start = html.index(u'Aproxim\xe1damente ')
str_end = html.find(' resultados', str_start + 16)
print html[str_start+16:str_end]
Solution
Your problem ultimately boils down to the fact that in Python 2.x the str type represents a sequence of bytes, while the unicode type represents a sequence of characters. Because one character can be encoded as several bytes, the unicode representation of a string may have a different length from the str representation of the same string. For the same reason, an index into the unicode representation may point to a different part of the text than the same index into the str representation.
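To make the length difference concrete, here is a minimal sketch using the snippet from the question. It uses explicit u'' literals so it should behave the same on Python 2 and on Python 3.3+; the variable names are only illustrative.

```python
# The snippet from the question as a unicode object (a sequence of characters).
text = u'Aproxim\xe1damente 37 resultados.'

# The same snippet as UTF-8 bytes (what a Python 2 str would hold).
data = text.encode('utf-8')

# The accented character is one character but two bytes in UTF-8.
accent_bytes = u'\xe1'.encode('utf-8')

print(len(text))          # 30 characters
print(len(data))          # 31 bytes
print(len(accent_bytes))  # 2
```

Every accented character before a given point adds one more byte to the byte offset, so the two counts drift further apart as the page goes on.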
What's happening is that when you do str_start = html.index(u'Aproxim\xe1damente '), Python implicitly decodes the html variable to unicode using the default encoding (sys.getdefaultencoding()). On a stock Python 2 install that encoding is ascii, which is why on my PC I simply get a UnicodeDecodeError when I try to execute that line; your system's default encoding has presumably been changed to utf-8. Consequently, if str_start is n, then u'Aproxim\xe1damente ' appears at the nth character of the HTML. However, when you later use it as a slice index to get the content after the (n+16)th character, what you actually get is the content after the (n+16)th byte. The two are not equivalent here because earlier parts of the page contain accented characters that each take up 2 bytes when encoded in utf-8. An offset of 8 positions is consistent with eight such characters appearing before the match.
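The drift can be reproduced without the real page. The sketch below uses a made-up HTML fragment (an assumption, not the actual site markup) with eight accented characters ahead of the target, then slices the bytes with an index that was counted in characters:

```python
# Hypothetical page fragment: eight 2-byte characters precede the target text.
page_text = (u'<p>\xe1\xe9\xed\xf3\xfa\xf1\xfc\xc1</p>'
             u'<div class="total">Aproxim\xe1damente 37 resultados.</div>')
page_bytes = page_text.encode('utf-8')  # what conn.read() would return

prefix = u'Aproxim\xe1damente '
char_index = page_text.index(prefix)                   # counted in characters
byte_index = page_bytes.index(prefix.encode('utf-8'))  # counted in bytes

# One extra byte for every 2-byte character before the match.
drift = byte_index - char_index

# Character index applied to the bytes: the slice starts `drift` bytes too early.
wrong_end = page_bytes.find(b' resultados', char_index + 16)
wrong = page_bytes[char_index + 16:wrong_end]

# Character index applied to the decoded text: the slice lines up.
right_end = page_text.find(u' resultados', char_index + 16)
right = page_text[char_index + 16:right_end]

print(drift)        # 8
print(repr(wrong))  # ends with 'damente 37', as in the question
print(repr(right))  # the expected '37'
```

The mismatched slice starts in the middle of the encoded accent, which matches the stray "damente 37" output described in the question.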
The best solution would be simply to convert the html to unicode when you receive it. This small modification to your sample code will do what you want with no errors or weird behaviour:
from urllib2 import Request, urlopen
url = 'http://guiasamarillas.com.mx/buscador/?actividad=Chedraui&localidad=&id_page=1'
post = None
headers = {'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2)'}
req = Request(url, post, headers)
conn = urlopen(req)
html = conn.read().decode('utf-8')  # decode the raw bytes once so every index below counts characters
str_start = html.index(u'Aproxim\xe1damente ')
str_end = html.find(' resultados', str_start + 16)
print html[str_start+16:str_end]
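If you want to avoid the hard-coded 16 as well, one option is to measure the prefix with len() on the same unicode objects. This is only a sketch: extract_between is a hypothetical helper, and the raw bytes below are a stand-in for conn.read() that assumes the page is UTF-8.

```python
def extract_between(text, prefix, suffix):
    """Return the text between prefix and suffix, or None if either is missing."""
    start = text.find(prefix)
    if start == -1:
        return None
    start += len(prefix)  # counted in characters, like the index itself
    end = text.find(suffix, start)
    if end == -1:
        return None
    return text[start:end]


# Stand-in for conn.read(): UTF-8 bytes with accented text before the target.
raw = (u'<title>Gu\xedas</title>'
       u'<b>Aproxim\xe1damente 37 resultados.</b>').encode('utf-8')

html = raw.decode('utf-8')  # decode once, then work only with unicode
count = extract_between(html, u'Aproxim\xe1damente ', u' resultados')
print(count)  # 37
```

Because the index, the prefix length, and the slice are all measured on decoded text, accented characters elsewhere in the page no longer shift the result.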