Examples of Using Python Web-Scraping Modules

Latest recommended article published on 2022-09-19 16:43:23

留奇

Category: Web Scraping and Data Mining

Article link: https://blog.csdn.net/lqiqil/article/details/107638617

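The official built-in scraping module: urllib

The request module of urllib makes it easy to fetch the content of a URL: it sends a GET request to the specified page and returns the HTTP response.
Basic usage: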

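# Import the request submodule of urllib
import urllib.request
# Alternatively: from urllib import request
# Request the page
response = urllib.request.urlopen("https://www.baidu.com")
print(response)
# Read the page source and decode it; the codec must match the site's encoding
text = response.read().decode("utf-8")
print(text)
# Save to a file (the with block closes the file automatically)
with open("F:/2.txt", "w", encoding="utf-8") as f:
    f.write(text)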
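
The comment above warns that the decode step must match the site's encoding. Below is a minimal sketch of one way to pick the codec from the `Content-Type` header instead of hard-coding it. The helper names `charset_from_content_type` and `decode_body` are made up for this illustration, and falling back to UTF-8 when no charset is declared is an assumption.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset declared in a Content-Type value, or default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset in lower case, or None
    return msg.get_content_charset() or default


def decode_body(body, content_type):
    """Decode raw response bytes using the declared charset."""
    charset = charset_from_content_type(content_type)
    # errors="replace" keeps going if a few bytes do not fit the codec
    return body.decode(charset, errors="replace")


print(charset_from_content_type("text/html; charset=GBK"))
print(decode_body("中国".encode("gbk"), "text/html; charset=gbk"))
```

With a live response from `urlopen`, `response.headers.get_content_charset()` performs the same lookup directly on the response headers.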

Getting information from the response:

import urllib.request
response = urllib.request.urlopen("https://www.baidu.com")
print(response)
# Get the response headers
print(response.info())
# Get the status code
print(response.getcode())
# Get the URL that was actually fetched
print(response.geturl())
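
Since Python 3.9 the standard library documentation marks `getcode()`, `geturl()` and `info()` as deprecated in favor of the `status`, `url` and `headers` attributes. Here is a small sketch of the same lookups written with attributes, assuming Python 3.9 or later; `summarize_response` is a hypothetical helper name.

```python
import urllib.request


def summarize_response(response):
    """Collect the status code, final URL and content type of a response."""
    return {
        "status": response.status,  # same value as getcode()
        "url": response.url,        # same value as geturl()
        # headers is the object that info() returns
        "content_type": response.headers.get_content_type(),
    }


# Usage (needs network access, so it is left commented out here):
# with urllib.request.urlopen("https://www.baidu.com") as response:
#     print(summarize_response(response))
```

Opening the response in a `with` block also closes the connection once the block ends.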

URL encoding and decoding with urllib:

import urllib.parse
# Encode (quote() and unquote() are documented in urllib.parse)
s = urllib.parse.quote("中国")
# Decode
v = urllib.parse.unquote(s)
print(s)
print(v)
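
Percent-encoding is mostly needed when building query strings. The following is a short sketch using `urllib.parse.quote` and `urllib.parse.urlencode`; the path `/s`, the parameter name `wd` and the helper `build_search_url` are assumptions chosen only for illustration.

```python
from urllib.parse import quote, unquote, urlencode


def build_search_url(base, keyword):
    """Append a percent-encoded query string to base."""
    return base + "?" + urlencode({"wd": keyword})


encoded = quote("中国")
print(encoded)           # %E4%B8%AD%E5%9B%BD
print(unquote(encoded))  # 中国
print(build_search_url("https://www.baidu.com/s", "中国"))
```

`urlencode` handles the `key=value` and `&` joining as well as the escaping, so it is usually the more convenient call when there are several parameters.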

Masquerading as a browser:

import urllib.request
import random
# Request the page without any custom headers
response = urllib.request.urlopen("https://www.baidu.com")
print(response)
# Read the page source and decode it; the codec must match the site's encoding
text = response.read().decode("utf-8")
print(text)
# User-Agent value copied from the request headers of a real browser
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"
}
# Build the request with Request() so the headers can be attached
req = urllib.request.Request("https://www.baidu.com", headers=header)
# Send the request
response = urllib.request.urlopen(req)
text = response.read().decode("utf-8")
print(text)
# Write to a file
with open("F:/aaa.txt", mode="w", encoding="utf-8") as f:
    f.write(text)

# When scraping the same site repeatedly, rotate several User-Agent strings to make detection less likely
agentList = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16"
]
# Build the request with Request()
req = urllib.request.Request("https://www.baidu.com")
# Attach a User-Agent picked at random from the list
req.add_header("User-Agent", random.choice(agentList))
# Send the request
response = urllib.request.urlopen(req)
text = response.read().decode("utf-8")
print(text)
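
The header rotation can be checked without sending anything, because a `Request` object exposes the headers it will carry. A sketch along those lines follows; `build_request` is a hypothetical helper, and the shortened User-Agent list is only for illustration.

```python
import random
import urllib.request

AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
]


def build_request(url, agents=AGENT_LIST):
    """Return a Request for url with a randomly chosen User-Agent."""
    req = urllib.request.Request(url)
    req.add_header("User-Agent", random.choice(agents))
    return req


req = build_request("https://www.baidu.com")
# Request stores header names capitalized, so the lookup key is "User-agent"
print(req.get_header("User-agent"))
```

Passing the result to `urllib.request.urlopen(req)` would then send the request with the chosen header.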

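Scraping data that requires login:

import urllib.parse
import urllib.request

"""
1. Simulate a login
      username, password, captcha
"""
data = {
    "user-name": "876868787",
    "password": "123456"
}
header = {
    "User-Agent": "browser User-Agent string"
}
# POST data must be URL-encoded bytes; headers go through Request(), not urlopen()
post_data = urllib.parse.urlencode(data).encode("utf-8")
# "login_website" is a placeholder; replace it with the real login URL
req = urllib.request.Request("login_website", data=post_data, headers=header)
response = urllib.request.urlopen(req)

"""
2. Masquerade as already logged in
      cookie --> session
header = {
    "User-Agent": "browser User-Agent string",
    "Cookie": "key1=value1;key2=value2;..."
}
"""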
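
For the cookie route, the standard library can also keep cookies automatically rather than having them pasted into a header by hand. Below is a minimal sketch using `http.cookiejar.CookieJar` with `urllib.request.HTTPCookieProcessor`. The URL `https://example.com/login` is a placeholder rather than a real login endpoint, and the helper names are hypothetical.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session_opener():
    """Return an opener that stores and resends cookies, plus its jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(url, form, user_agent):
    """Build a POST request carrying the URL-encoded login form."""
    body = urllib.parse.urlencode(form).encode("utf-8")
    return urllib.request.Request(url, data=body, headers={"User-Agent": user_agent})


opener, jar = make_session_opener()
req = build_login_request(
    "https://example.com/login",  # placeholder URL, replace with the real one
    {"user-name": "876868787", "password": "123456"},
    "Mozilla/5.0",
)
print(req.get_method())  # POST, because data is set
# opener.open(req) would send it; cookies set by the server then land in jar
# and are sent back on later opener.open(...) calls to the same site.
```

Reusing the same `opener` for the pages behind the login is what keeps the session alive; a fresh `urlopen` call would not carry the cookies.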

The third-party scraping module: requests

import requests
data = requests.get("http://www.qq.com")
print("Original encoding:", data.encoding)
data.encoding = "utf-8"
html = data.text
print(html)
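
requests can also attach query parameters and headers in one call. The sketch below builds the request without sending it, so the final URL and headers can be inspected first. It assumes the requests package is installed; the path `/s`, the parameter name `wd`, the short User-Agent value and the helper `prepare_get` are illustrative only.

```python
import requests


def prepare_get(url, params=None, user_agent="Mozilla/5.0"):
    """Build the GET request that requests would send, without sending it."""
    request = requests.Request(
        "GET", url, params=params, headers={"User-Agent": user_agent}
    )
    return request.prepare()


prepared = prepare_get("https://www.baidu.com/s", params={"wd": "中国"})
print(prepared.url)                    # query string is percent-encoded
print(prepared.headers["User-Agent"])
```

To actually send it, `requests.Session().send(prepared, timeout=10)` returns the usual response object. For one-off calls, `requests.get(url, params=..., headers=..., timeout=10)` does the same in a single step.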
