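The basic principle of a single-threaded crawler: use Requests to fetch a page's source code, then use regular expressions to extract the content of interest.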
The first web crawler: fetching page source with Requests
- Fetch the source directly
- Fetch the source with modified HTTP headers
Without headers
# -*- coding: utf-8 -*-
import requests

# Fetch the page directly and print its decoded source.
html = requests.get('http://tieba.baidu.com/f?ie=utf-8&kw=python')
print(html.text)
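The direct request above prints whatever the server returns, including error pages. A minimal sketch of a slightly more defensive fetch follows, assuming a helper named `fetch_html` (a hypothetical name, not part of Requests). It relies only on `requests.get(url, timeout=...)`, `Response.raise_for_status()`, `Response.encoding` and `Response.text`; the `get` parameter is an assumption added so the function can be exercised without network access.

```python
def fetch_html(url, get=None, encoding=None, timeout=10):
    """Return the decoded source of `url`, raising on HTTP error codes.

    `get` is a hypothetical injection point: any callable that behaves
    like requests.get. It defaults to requests.get itself.
    """
    if get is None:
        import requests  # third-party; imported lazily so `get` can be swapped
        get = requests.get
    response = get(url, timeout=timeout)  # give up after `timeout` seconds
    response.raise_for_status()           # raise on 4xx/5xx responses
    if encoding is not None:
        response.encoding = encoding      # override the guessed charset
    return response.text

# Usage (performs a real request):
# html = fetch_html('http://tieba.baidu.com/f?ie=utf-8&kw=python')
```

Setting a timeout keeps a single-threaded crawler from hanging indefinitely on one unresponsive page.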
With headers
# -*- coding: utf-8 -*-
import re

import requests

# Fill in with the request headers copied from the browser.
# headers = {}
html = requests.get('http://jp.tingroom.com/yuedu/yd300p/')
# html = requests.get('http://jp.tingroom.com/yuedu/yd300p/', headers=headers)
html.encoding = 'utf-8'  # decode the response body as UTF-8
# print(html.text)

# Titles: the text that follows the grey (#666666) style attribute.
title = re.findall('color:#666666;">(.*?)</span>', html.text, re.S)
for each in title:
    print(each)

# Chinese text: the content of the links styled with color #039.
chinese = re.findall('color: #039;">(.*?)</a>', html.text, re.S)
for each in chinese:
    print(each)
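The two `re.findall` calls above depend on the non-greedy group `(.*?)` and the `re.S` flag. The sketch below shows what each one contributes. `SAMPLE_HTML` is a made-up fragment shaped like the markup those patterns target, not real content from the site.

```python
import re

# Hypothetical fragment: two spans, the first with a line break inside it.
SAMPLE_HTML = (
    '<span style="color:#666666;">first\ntitle</span>'
    '<span style="color:#666666;">second title</span>'
)

# Non-greedy + re.S: each match stops at the nearest </span>,
# and "." is allowed to cross the line break.
non_greedy = re.findall('color:#666666;">(.*?)</span>', SAMPLE_HTML, re.S)

# Greedy + re.S: one match that runs to the LAST </span>.
greedy = re.findall('color:#666666;">(.*)</span>', SAMPLE_HTML, re.S)

# Non-greedy without re.S: "." cannot match "\n",
# so the first span is silently skipped.
without_dotall = re.findall('color:#666666;">(.*?)</span>', SAMPLE_HTML)

print(non_greedy)      # ['first\ntitle', 'second title']
print(len(greedy))     # 1
print(without_dotall)  # ['second title']
```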
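How to obtain the headers
Right-click the page in the browser and choose Inspect Element, open the Network tab, and refresh the page. Click any request in the list, open its Headers tab, and scroll down to find the request headers.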
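Headers copied from the Network panel arrive as plain `Name: value` lines, while `requests.get` expects a dict. A minimal sketch of the conversion follows, assuming a helper named `parse_raw_headers` (hypothetical). The header values in `RAW` are placeholders for illustration, not values captured from the site.

```python
def parse_raw_headers(raw):
    """Turn 'Name: value' lines copied from the browser into a dict."""
    headers = {}
    for line in raw.strip().splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ":authority".
        if not line or line.startswith(':'):
            continue
        # Split on the first colon only, so URLs in the value stay intact.
        name, sep, value = line.partition(':')
        if sep:
            headers[name.strip()] = value.strip()
    return headers

# Placeholder values; paste the real ones from the browser instead.
RAW = """
User-Agent: Mozilla/5.0 (X11; Linux x86_64)
Referer: http://jp.tingroom.com/
Accept-Language: ja,en;q=0.8
"""

headers = parse_raw_headers(RAW)
# html = requests.get('http://jp.tingroom.com/yuedu/yd300p/', headers=headers)
```

In practice a `User-Agent` that looks like a browser is often the only header a site checks, so it may be enough to start with that one and add others only if the request is still rejected.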