Getting to Know Web Crawlers

xue.stone

Category: python3#爬虫 (crawlers) | Tags: python

Article link: https://blog.csdn.net/qq_42690088/article/details/118085694


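Summary (auto-generated by CSDN): This post introduces the basics of web crawling in Python, focusing on the get and post methods of the requests library. Example code shows how to use get to fetch the Baidu home page and run a search, and how to build the URL's query string with the params argument. It also explains that post is used for operations such as logging in, with an example of submitting data, discusses how to handle encodings to avoid garbled text, and shows how to download and save an image from the web.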

'''
The get method in requests (for pages that can be queried without logging in).
'''
import requests
# Baidu home page
url = 'https://www.baidu.com/'
html = requests.get(url)  # get is like pressing Enter in the address bar
# print(html.text)          # .text returns the page source

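# Baidu search page, e.g. searching for "python"
url = 'https://www.baidu.com/s?wd=python' # fine for one or just a few key-value pairs
html = requests.get(url)    # get is like pressing Enter in the address bar
# print(html.text)          # .text returns the page source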

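params = {'wd':'python'}                # params makes the query parameters after the URL easier to maintain
url = 'http://www.baidu.com/s'          # the "s" after http is not required here
html = requests.get(url,params=params)  # get is like pressing Enter; params=params appends the key-value pairs
# print(html.text)                        # .text returns the page source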
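
To make the effect of params concrete, the following offline sketch builds the same kind of URL with the standard library's urllib.parse.urlencode instead of requests. The helper name build_url and the extra 'pn' key are assumptions made for illustration only; requests performs comparable encoding internally, but this is not its actual code path.

```python
from urllib.parse import urlencode

def build_url(base, params):
    """Append params to base as a query string (hypothetical helper)."""
    if not params:
        return base
    return base + '?' + urlencode(params)

single = build_url('http://www.baidu.com/s', {'wd': 'python'})
# 'pn' is an illustrative extra key, not something the original post uses
several = build_url('http://www.baidu.com/s', {'wd': 'python', 'pn': '10'})

print(single)
print(several)
```

With more than a couple of keys, keeping them in a dict like this is easier to read and edit than a hand-written query string.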


'''
The post method in requests (used to submit data, e.g. a login form).
'''
url = 'http://httpbin.org/post'
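data = {'username':'zhangsanfeng','gongfu':'taiji'}     # data holds the form fields to submit
html = requests.post(url,data=data)     # post is like clicking a login button; data=data sends the key-value pairs, e.g. username, password, captcha
# print(html.text)                    # .text returns the response text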
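
The difference from get is where the key-value pairs travel: post places them in the request body rather than in the URL. The sketch below builds (but does not send) such a request with the standard library's urllib.request.Request, a stand-in for requests chosen so that the example runs without network access; the variable names are illustrative.

```python
from urllib.parse import urlencode
from urllib.request import Request

url = 'http://httpbin.org/post'
data = {'username': 'zhangsanfeng', 'gongfu': 'taiji'}

# Form-encode the pairs and convert to bytes, as an HTML form submission would
body = urlencode(data).encode('ascii')

# Build the request object only; nothing is sent over the network here
req = Request(url, data=body, method='POST')

print(req.get_method())   # the HTTP verb
print(req.full_url)       # the URL carries no query string
print(req.data)           # the key-value pairs live in the body
```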

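'''
Setting the encoding. Garbled text appears when bytes are decoded with a different encoding than the one used to encode them; the decoding must match the encoding.
A page declares its encoding in charset; for example, charset=utf-8 means the page is UTF-8 encoded.
Method 1: set it manually            html.encoding = 'utf-8'
Method 2: detect it automatically    html.encoding = html.apparent_encoding
'''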
html.encoding = html.apparent_encoding
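
A small offline sketch of why the mismatch produces garbled text: the same bytes are decoded once with the wrong codec and once with the right one. The sample string is made up for illustration. The wrong codec used here is ISO-8859-1, which requests falls back to for text responses whose headers declare no charset.

```python
original = '网络爬虫'                    # sample text, illustrative only
raw = original.encode('utf-8')           # bytes as a UTF-8 server would send them

garbled = raw.decode('iso-8859-1')       # wrong codec: every byte becomes one character
correct = raw.decode('utf-8')            # matching codec: the text is recovered

print(repr(garbled))
print(correct)

# The damage is reversible here because ISO-8859-1 maps every byte losslessly
repaired = garbled.encode('iso-8859-1').decode('utf-8')
```

Setting html.encoding before reading html.text plays the role of choosing the right codec in the second decode call.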


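'''
Fetching images, music, and video: these are stored as binary data, so they cannot be read through text; use content instead.
Choose the accessor according to the type of data being crawled.
'''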
url = 'https://img1.gtimg.com/ninja/2/2021/05/ninja162148683144302.jpg'
html = requests.get(url)
print(html.content)
# Write the bytes to a file
with open('1.jpg','wb') as w:
    w.write(html.content)
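
Before writing downloaded bytes to disk, it can be worth checking that they really are an image, since a failed request may return an HTML error page instead. A minimal sketch, assuming the payload should be a JPEG (JPEG data starts with the bytes FF D8 FF); save_jpeg is a hypothetical helper, and the byte strings in the usage lines are stand-ins for html.content.

```python
import os
import tempfile

JPEG_MAGIC = b'\xff\xd8\xff'   # leading bytes of JPEG data

def save_jpeg(path, payload):
    """Write payload to path in binary mode if it looks like a JPEG.

    Returns True when the file was written, False otherwise.
    """
    if not payload.startswith(JPEG_MAGIC):
        return False
    with open(path, 'wb') as f:   # 'wb': bytes must be written in binary mode
        f.write(payload)
    return True

# Stand-in bytes instead of a real download, so the sketch runs offline
fake_image = JPEG_MAGIC + b'\xe0' + b'\x00' * 16
fake_error_page = b'<html>404</html>'

target = os.path.join(tempfile.mkdtemp(), '1.jpg')
saved = save_jpeg(target, fake_image)
rejected = save_jpeg(target + '.bad', fake_error_page)
```

In the script above, the equivalent call would be save_jpeg('1.jpg', html.content).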