Python Web Scraping with requests

Latest recommended article published 2024-09-21 16:28:19

胖头猫

Category: Python  Tags: python, web scraping

Article link: https://blog.csdn.net/catfishH/article/details/125643741

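This post describes how to set request headers to mimic a browser when scraping with Python's requests library, how to handle page encoding problems, and how to save page content. It covers setting the User-Agent to avoid anti-scraping measures, checking the response status, detecting the page encoding, and handling problems that can arise when saving files. It also explains how the urllib.parse library is used for URL parsing and quoting.

Summary generated by CSDN using AI.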

Fetching a web page

import re

import requests

url = "http://baidu.com.cn"
user_agent = ('Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.4 (KHTML, like Gecko)'
              ' Chrome/22.0.1229.79 Safari/537.4')
headers = {  # request headers
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'gb18030,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
    'User-Agent': user_agent,  # pass the plain string, without extra quotes
    'Referer': url,
}
web_timeout = 10  # request timeout in seconds (example value; the original left it unset)
retry = 1  # number of additional attempts


def is_response_available(response):  # check whether the page is reachable
    # Only accept HTML pages returned with status code 200
    if response.status_code == requests.codes.ok:
        return 'html' in response.headers.get('Content-Type', '')
    return False


def handle_encoding(response):  # work out the real encoding
    # requests falls back to ISO-8859-1 when the headers name no charset,
    # so look for a charset declaration in the page itself
    if response.encoding == 'ISO-8859-1':
        charset_re = re.compile(r"((^|;)\s*charset\s*=)([^\"']*)", re.M)
        match = charset_re.search(response.text)
        # With encoding set to None, requests guesses the encoding from the content
        response.encoding = match.group(3) if match and match.group(3) else None


def fetch_page(url, retry):  # fetch the page content
    try:
        response = requests.get(url, headers=headers,
                                timeout=web_timeout, proxies=None)
        if is_response_available(response):  # status code 200 and HTML content
            handle_encoding(response)
            return response.text  # page content
    except requests.RequestException:
        if retry > 0:  # try to visit again
            return fetch_page(url, retry - 1)
    return None


page_source = fetch_page(url, retry)
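
The retry step in the listing above can also be written as a loop that is kept separate from the fetching logic. The following is only a sketch: `fetch_with_retry`, its `fetch` callable and the `retries` parameter are hypothetical names introduced for illustration and are not part of the original code.

```python
def fetch_with_retry(fetch, url, retries=1):
    """Call fetch(url); if it raises, try again up to `retries` more times.

    `fetch` is any callable that takes a URL and returns the page text
    (hypothetical: in this post it would wrap requests.get).
    Returns the page text, or None if every attempt raised.
    """
    last_error = None
    attempts = retries + 1  # the first try plus the allowed retries
    for _ in range(attempts):
        try:
            return fetch(url)
        except Exception as exc:  # a real crawler would catch requests.RequestException
            last_error = exc
    print("giving up on", url, "-", last_error)
    return None
```

With requests, the callable could be something like `lambda u: requests.get(u, headers=headers, timeout=web_timeout).text`. Passing the fetch function in keeps the retry policy independent of how the page is actually downloaded.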

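requests.get(url)

Returns a Response object containing the server's resource:
response.status_code: the HTTP status code, e.g. 200 on success or 404 if the page was not found
response.text: the content of the page at url, decoded to a string
response.encoding: the encoding of the response body, guessed from the headers
response.apparent_encoding: a fallback encoding inferred from the content itself
response.content: the response body as raw bytes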
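
As a rough illustration of how encoding and apparent_encoding fit together, the sketch below picks an encoding before response.text is read. `choose_encoding` is a hypothetical helper name; it only reads the two attributes, so any object that has them will do.

```python
def choose_encoding(response):
    """Return the encoding to use when decoding the response body.

    Trust the header-derived `encoding` unless it is missing or is the
    ISO-8859-1 fallback; in that case use `apparent_encoding`, which is
    guessed from the body itself.
    """
    declared = response.encoding
    if declared is None or declared.upper() == 'ISO-8859-1':
        return response.apparent_encoding
    return declared
```

A possible use, assuming `response = requests.get(url)`: set `response.encoding = choose_encoding(response)` and then read `response.text`.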

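URL request headers

To get past anti-scraping checks, make the request look as if it comes from a browser: press F12, open the Network tab, select any request, and copy its user-agent value.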
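
A minimal sketch of sending a browser-like User-Agent follows. `build_headers` and `fetch_html` are hypothetical helper names, and the `timeout` default is an arbitrary example value; the User-Agent string to pass in is whatever your own browser shows in the Network tab.

```python
def build_headers(user_agent, referer=None):
    """Build a small set of browser-like request headers."""
    headers = {
        'User-Agent': user_agent,  # pass the string as-is, without extra quotes
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    }
    if referer:
        headers['Referer'] = referer
    return headers


def fetch_html(url, user_agent, timeout=10):
    """Fetch url with the headers above and return the page text."""
    import requests  # third-party package: pip install requests

    response = requests.get(url, headers=build_headers(user_agent, referer=url),
                            timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of returning an error page
    return response.text
```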

Saving a web page

import os
import urllib.parse

out_dir = './output/'  # output directory (example value; not defined in the original)


def do_save_page_file(url, page_source):
    """
    Save the page text to a file named after its URL.
    Args:
        url: URL of the page
        page_source: HTML text of the page
    Returns:
        None
    """
    # Each page is saved as an independent file, named after its URL
    fname = urllib.parse.quote_plus(url)  # percent-encode the URL so it can be used as a file name
    page_file = out_dir + fname  # out_dir is the output directory
    if len(page_file) < 1:
        return
    if page_file[0] == '.':  # relative path: resolve it against this script's directory
        base = os.path.split(os.path.abspath(__file__))[0]
        page_file = base + page_file[1:]
    # Deal with the long path problem: truncate over-long paths
    if len(page_file) > 256:
        page_file = page_file[:255]
    try:
        # Open in text mode with an explicit encoding; writing encoded bytes
        # to a text-mode file raises TypeError in Python 3
        with open(page_file, 'w', encoding='utf-8') as fp:
            fp.write(page_source)
    except (OSError, UnicodeEncodeError) as e:
        print(e)
URL parsing and quoting

urllib.parse splits, joins, encodes and decodes URLs. Its functions fall into two groups: URL parsing and URL quoting.

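1. Splitting and joining

1) urllib.parse.urlparse() splits a URL into its 6 components (scheme, netloc, path, params, query, fragment).
2) urllib.parse.urlsplit() splits a URL into 5 components (scheme, netloc, path, query, fragment); params are not split out and stay in the path, separated by a semicolon.
3) urllib.parse.parse_qs() and urllib.parse.parse_qsl() parse the query string obtained above into a dictionary and a list of (name, value) pairs respectively.

4) urlunparse() is the inverse of urlparse(), and urlunsplit() is the inverse of urlsplit().
5) urljoin(base, url, allow_fragments=True) combines a base URL with a second URL, which may be relative or absolute; where the two conflict, the second one takes precedence over base.
6) urlencode() turns a dictionary or a sequence of two-element tuples into a query string, roughly the inverse of parse_qs().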
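
The functions listed above can be tried out on a sample URL. The URLs below are made up for illustration; the comments show what each call returns for them.

```python
from urllib.parse import (parse_qs, parse_qsl, urlencode, urljoin,
                          urlparse, urlsplit, urlunparse)

url = 'http://example.com/path/page;type=a?q=python&q=requests&page=2#top'

parts = urlparse(url)   # 6 components; params are split out of the path
split = urlsplit(url)   # 5 components; params stay inside the path
print(parts.path, parts.params)   # /path/page type=a
print(split.path)                 # /path/page;type=a

query_dict = parse_qs(parts.query)    # {'q': ['python', 'requests'], 'page': ['2']}
query_list = parse_qsl(parts.query)   # [('q', 'python'), ('q', 'requests'), ('page', '2')]

rebuilt = urlunparse(parts)           # puts the 6 components back together

# Relative second argument: resolved against the base URL
joined = urljoin('http://example.com/a/b.html', '../c.html')
# Absolute second argument: it replaces the base entirely
absolute = urljoin('http://example.com/a/b.html', 'http://other.org/x')

# doseq=True writes one key=value pair per list element
encoded = urlencode(query_dict, doseq=True)   # q=python&q=requests&page=2
```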

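2. Encoding special characters

In str or bytes data, letters, digits and the characters _ . - ~ are never converted, and quote() also leaves the slash alone by default; everything else is encoded as "%" followed by its hexadecimal value.
urllib.parse.quote(string, safe='/', encoding=None, errors=None)
urllib.parse.quote_plus(string, safe='', encoding=None, errors=None)
The two treat a few special characters differently: quote_plus() replaces spaces with + where quote() uses %20, and because its safe parameter defaults to the empty string it also encodes the slash.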
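
A short comparison on a made-up sample string shows how the two functions differ; the comments give the output for that string.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = 'a b/c?d=1&e=é'

quoted = quote(text)             # a%20b/c%3Fd%3D1%26e%3D%C3%A9
plus_quoted = quote_plus(text)   # a+b%2Fc%3Fd%3D1%26e%3D%C3%A9
strict = quote(text, safe='')    # a%20b%2Fc%3Fd%3D1%26e%3D%C3%A9

print(quoted)
print(plus_quoted)
print(strict)

# Each function has a matching decoder that restores the original text
print(unquote(quoted) == text)            # True
print(unquote_plus(plus_quoted) == text)  # True
```

quote() suits URL paths, where the slash must survive, while quote_plus() follows the convention used for query strings and HTML form values.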