2 Crawler Example 1: Fetching Baidu Tieba
Suppose the forum we want to crawl is the 动漫 (anime) forum.
The URLs of the first few listing pages are:
https://tieba.baidu.com/f?kw=%E5%8A%A8%E6%BC%AB&ie=utf-8&pn=0
https://tieba.baidu.com/f?kw=%E5%8A%A8%E6%BC%AB&ie=utf-8&pn=50
https://tieba.baidu.com/f?kw=%E5%8A%A8%E6%BC%AB&ie=utf-8&pn=100
https://tieba.baidu.com/f?kw=%E5%8A%A8%E6%BC%AB&ie=utf-8&pn=150
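The pn parameter in these URLs is a post offset: each listing page holds 50 posts, so page n starts at pn = 50*(n-1). A quick sketch of that mapping (page_offset is a hypothetical helper name, not part of the tutorial code):

```python
# Each Tieba listing page shows 50 posts, so page n starts at offset 50*(n-1)
def page_offset(page):
    return 50 * (page - 1)

base = 'https://tieba.baidu.com/f?kw=%E5%8A%A8%E6%BC%AB&ie=utf-8&pn='
for page in range(1, 5):
    print(base + str(page_offset(page)))
```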
2.1 Fetching one page of HTML
from urllib import request

# Download one page and return it as a str
def loadPage(url):
    # Build a request object for the URL
    req = request.Request(url)
    print(req)  # <urllib.request.Request object at 0x007B1370>
    # Send the request and get the response object
    response = request.urlopen(req)
    print(response)  # <http.client.HTTPResponse object at 0x01F36BF0>
    # Read the response body as bytes
    html = response.read()
    # Decode the bytes as UTF-8 to get a str
    content = html.decode('utf-8')
    return content

url = 'https://tieba.baidu.com/f?kw=%E5%8A%A8%E6%BC%AB&ie=utf-8&pn=50'
content = loadPage(url)
print(content)
With this, the page's HTML can be fetched.
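In practice, some servers reject requests carrying urllib's default User-Agent. If loadPage comes back with an error page, a browser-like User-Agent can be attached to the Request. This is only a sketch: the helper name build_request and the header string are illustrative assumptions, not part of the tutorial code.

```python
from urllib import request

# Hypothetical helper: build a Request carrying a browser-like User-Agent.
# The header value below is only an illustrative example string.
def build_request(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    return request.Request(url, headers=headers)

def loadPageWithHeaders(url):
    # Same flow as loadPage, just with the custom header attached
    with request.urlopen(build_request(url)) as response:
        return response.read().decode('utf-8')
```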
2.2 Saving the page content to a local HTML file
# Save the downloaded content to a local file
def writePage(html, filename):
    print('Saving to:', filename)
    f = open(filename, 'w', encoding='utf8')
    f.write(html)
    f.close()

url = 'https://tieba.baidu.com/f?kw=%E5%8A%A8%E6%BC%AB&pn=50'
content = loadPage(url)
filename = 'tieba.html'
writePage(content, filename)
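As a quick offline check, writePage can be exercised with a stand-in string instead of a live download; demo.html here is a throwaway filename chosen for the example.

```python
# Stand-alone copy of writePage, exercised with a stand-in string
def writePage(html, filename):
    print('Saving to:', filename)
    f = open(filename, 'w', encoding='utf8')
    f.write(html)
    f.close()

writePage('<html>demo</html>', 'demo.html')
with open('demo.html', encoding='utf8') as f:
    print(f.read())  # <html>demo</html>
```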
2.3 Setting the start and end pages
Specify which page to start crawling from and which page to stop at.
# Crawl every page from beginPage to endPage (inclusive)
def tiebaSpider(url, beginPage, endPage):
    for page in range(beginPage, endPage+1):
        pn = 50 * (page - 1)  # offset of the first post on this page
        fullurl = url + str(pn)
        content = loadPage(fullurl)
        filename = 'page' + str(page) + '.html'
        writePage(content, filename)

url = 'https://tieba.baidu.com/f?kw=%E5%8A%A8%E6%BC%AB&pn='
tiebaSpider(url, 1, 4)
2.4 Taking user input
from urllib import request, parse

# Crawl every page from beginPage to endPage (inclusive)
def tiebaSpider(url, beginPage, endPage):
    for page in range(beginPage, endPage+1):
        pn = 50 * (page - 1)
        fullurl = url + '&pn=' + str(pn)
        content = loadPage(fullurl)
        filename = 'page' + str(page) + '.html'
        writePage(content, filename)

if __name__ == '__main__':
    kw = input('Enter the name of the forum to crawl: ')
    beginPage = int(input('Enter the start page: '))  # input() returns a str; convert with int()
    endPage = int(input('Enter the end page: '))
    key = parse.urlencode({'kw': kw})  # percent-encode the keyword
    print(key)
    url = 'https://tieba.baidu.com/f?' + key
    tiebaSpider(url, beginPage, endPage)
With this, the user can enter the forum name and the start and end pages, and the corresponding pages are downloaded and saved.
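For reference, parse.urlencode percent-encodes the keyword in UTF-8, which is exactly how the hand-written URLs at the top of this section were formed:

```python
from urllib import parse

# '动漫' (the anime-forum keyword) percent-encoded as UTF-8
key = parse.urlencode({'kw': '动漫'})
print(key)  # kw=%E5%8A%A8%E6%BC%AB
```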