[Crawler Basics] Using the requests library, with a Baidu Tieba scraping example

Why learn requests as the request library for web scraping, rather than urllib?

  1. requests is implemented on top of urllib
  2. requests works the same way in Python 2 and Python 3, with identical methods
  3. requests is simple and easy to use
  4. requests automatically decompresses (gzip-compressed) page content for us

What requests does

Purpose: send network requests and return the response data.

Basic usage of requests

import requests
url = 'https://www.baidu.com/'
response = requests.get(url)
print(response.text)

The printed result is:
[Screenshot: response output]
You can see garbled characters in the response.

How to fix the garbled characters in the requests response

There are two ways:
(1)

import requests
url = 'https://www.baidu.com/'
response = requests.get(url)
contents = response.content.decode('utf-8')
print(contents)

(2)

import requests
url = 'https://www.baidu.com/'
response = requests.get(url)
response.encoding = 'utf-8'
print(response.text)

The difference between response.text and response.content

(1) response.text

  • Type: str
  • Change the encoding with: response.encoding = 'utf-8'

(2) response.content

  • Type: bytes
  • Decode the raw bytes yourself with: response.content.decode('utf-8')
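
A minimal sketch contrasting the two (assuming the page is UTF-8 encoded, and reusing the Baidu homepage URL from above):

import requests

url = 'https://www.baidu.com/'
response = requests.get(url)

# response.text is a str, decoded with the encoding requests guessed (which may be wrong)
print(type(response.text))     # <class 'str'>

# response.content is the raw bytes of the body; decode it explicitly
print(type(response.content))  # <class 'bytes'>
print(response.content.decode('utf-8')[:200])  # first 200 characters, decoded by hand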

Sending a simple request

response = requests.get(url)
# Commonly used attributes of the response object:
response.text
response.content
response.status_code
response.request.headers
response.headers
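
A runnable sketch that prints these attributes (the URL is just the Baidu homepage used earlier):

import requests

url = 'https://www.baidu.com/'
response = requests.get(url)

print(response.status_code)      # HTTP status code, e.g. 200
print(response.request.headers)  # the headers requests actually sent
print(response.headers)          # the headers the server returned
print(response.text[:100])       # first 100 characters of the decoded body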

Downloading an image

url = 'https://www.baidu.com/img/bd_logo1.png?where=su'
response = requests.get(url)
with open('baidu.png','wb') as f:
	f.write(response.content)
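
For a small logo this is fine; for larger files, a streamed download avoids holding the whole body in memory. A sketch using the stream=True / iter_content features of requests (the 1024-byte chunk size is an arbitrary choice):

import requests

url = 'https://www.baidu.com/img/bd_logo1.png?where=su'
response = requests.get(url, stream=True)
with open('baidu.png', 'wb') as f:
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:  # skip empty keep-alive chunks
            f.write(chunk)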

Sending a request with headers

Why should a request carry headers?
To mimic a browser and fool the server into returning the same content it would give a real browser.
Headers take the form of a dict:

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}

Usage:

requests.get(url, headers=headers)
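
A complete sketch that also prints the headers requests actually sent, so you can confirm the browser User-Agent is used (same URL and UA string as above):

import requests

url = 'https://www.baidu.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}

# Without custom headers, requests identifies itself as python-requests/x.y.z
print(requests.get(url).request.headers['User-Agent'])

# With the headers argument, the server sees a normal browser User-Agent
response = requests.get(url, headers=headers)
print(response.request.headers['User-Agent'])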

Sending a request with parameters

If the URL carries query parameters, for example:
https://cn.bing.com/search?q=python
then the request we send with requests needs to carry those parameters, passed as a dict.
kw = {'q': 'python'}
Usage: requests.get(url, params=kw)

import requests
url = 'https://cn.bing.com/search'
kw = {'q': 'python'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}
response = requests.get(url, headers=headers, params=kw)
print(response.url)
print(response.content.decode('utf-8'))
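
Note that response.url shows the final URL with the params appended and URL-encoded. A small sketch of that behavior (the keyword '爬虫' is just an example chosen to show percent-encoding; any non-ASCII value would do):

import requests

url = 'https://cn.bing.com/search'
kw = {'q': '爬虫'}
response = requests.get(url, params=kw)
# Non-ASCII parameter values are percent-encoded automatically
print(response.url)  # e.g. https://cn.bing.com/search?q=%E7%88%AC%E8%99%AB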

Tieba exercise: crawling Baidu Tieba pages

import requests


class TiebaSpider(object):
    def __init__(self, name):
        # Base list URL for the given Tieba forum; pn is the paging offset (50 posts per page)
        self.url = 'https://tieba.baidu.com/f?kw=' + name + '&ie=utf-8&pn='
        self.name = name
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'}

    def get_urllist(self):
        # Build the URLs for the first 10 pages (pn = 0, 50, 100, ...)
        self.urllist = []
        for i in range(10):
            self.urllist.append(self.url + str(50 * i))
        return self.urllist

    def jiexi(self, url):
        # "jiexi" = fetch/parse: request one page and return its HTML text
        response = requests.get(url, headers=self.headers)
        return response.text

    def baocun(self, contents, page):
        # "baocun" = save: write one page's HTML to a local file, e.g. 'lol第1页.html' ('lol page 1')
        file_name = '{}第{}页.html'.format(self.name, page)
        with open(file_name, 'w', encoding='utf-8') as f:
            f.write(contents)

    def run(self):
        urllist = self.get_urllist()
        for page, url in enumerate(urllist, start=1):
            contents = self.jiexi(url)
            self.baocun(contents, page)


if __name__ == '__main__':
    tieba = TiebaSpider('lol')
    tieba.run()