The following comes from a web-scraping tutorial at
http://www.toutiao.com/i6321943520135348737/?group_id=6321939698362384641&group_flags=0
:
# -*- coding: utf-8 -*-
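import re
import requests

# Fetch the page to scrape.
response = requests.get("http://top.baidu.com/category?c=1&fr=topindex")
# Decode the body as GBK; set this before reading .text.
response.encoding = "gbk"
result = response.text
print(result)
raw_input()  # Python 2: pause to inspect the output (use input() on Python 3)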
'''
pattern = 'title=".+?"'
output = re.findall(pattern, result, re.S)
for each in output:
    print(each[0])
'''
# Compile the pattern once; re.S lets "." match newlines as well.
pattern = re.compile('title=".+?"', re.S)
items = re.findall(pattern, result)
for item in items:
    print(item)
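Note that the line highlighted in red in the commented-out block (presumably print(each[0])) is wrong and gives incorrect results; sometimes it returns a bunch of output like:
u'78e4,u'84b2....
and so on.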
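A minimal offline sketch of the difference, assuming the problematic line is print(each[0]). When the pattern has no capture group, re.findall returns each whole match as a string, so indexing with [0] yields only the first character of that match. The sample_html string and the variable names below are made up for illustration and stand in for the downloaded page; the capture-group variant is one possible way to get just the title text, not something the original tutorial shows.

```python
import re

# Hypothetical stand-in for the downloaded page, so the example runs offline.
sample_html = (
    u'<a title="first entry" href="#">1</a>\n'
    u'<a title="second entry" href="#">2</a>'
)

# No capture group: findall returns each whole match as a string.
whole_matches = re.findall(r'title=".+?"', sample_html, re.S)

# Indexing a string with [0] keeps only its first character.
first_chars = [match[0] for match in whole_matches]

# One capture group: findall returns only the captured text.
titles = re.findall(r'title="(.+?)"', sample_html, re.S)

for title in titles:
    print(title)
```

Printing the loop variable itself, as the working version above does, shows the full match; the capture group additionally strips the surrounding title="..." wrapper.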