Web Crawlers
hide_in_darkness
A complete newbie
[Python crawler] URL parsing (the urllib.parse library)
from urllib.parse import *

""" The parse module of the urllib library defines a standard interface for handling URLs """
url = "https://so.iqiyi.com/so/q_%25E9%25A9%25AF%25E9%25BE%2599%25E9%25AB%2598%25E6%2589%258B?" \
      "source=suggest&sr=17221995103565892&ssrt=20200825083940436&ssra=e2191...
Original · 2020-08-25 10:18:22 · 646 views · 0 comments
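To make the excerpt above a bit more concrete, here is a minimal sketch of splitting that URL apart with `urlparse`, `parse_qs`, and `unquote`. The URL is a shortened form of the one in the excerpt: only the query parameters that appear in full are kept, because the original is cut off. The names `URL`, `parts`, `query`, and `keyword` are my own, and the double call to `unquote` assumes the search keyword was percent-encoded twice (which the `%25` prefixes suggest).

```python
from urllib.parse import parse_qs, unquote, urlparse

# Shortened form of the URL from the excerpt (assumption: the truncated
# trailing parameter is simply left out).
URL = (
    "https://so.iqiyi.com/so/q_%25E9%25A9%25AF%25E9%25BE%2599%25E9%25AB%2598%25E6%2589%258B?"
    "source=suggest&sr=17221995103565892&ssrt=20200825083940436"
)

# urlparse splits the URL into scheme, netloc, path, params, query, fragment.
parts = urlparse(URL)

# parse_qs turns the query string into a dict mapping each key to a list of values.
query = parse_qs(parts.query)

# The keyword after "q_" looks double-encoded ("%25" is an encoded "%"),
# so it is decoded twice to recover the readable text.
keyword = unquote(unquote(parts.path.rsplit("q_", 1)[-1]))

print(parts.scheme, parts.netloc)
print(query)
print(keyword)
```

Note that `urlparse` never decodes anything by itself, so the path stays percent-encoded until `unquote` is applied explicitly.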
[Python crawler] -- The robots.txt robots exclusion protocol (the urllib.robotparser library)
from urllib.robotparser import *

# Fetch the contents of the robots protocol file
def get_robots(robot_url):
    """
    :param robot_url:
    :return:
    """
    # class urllib.robotparser.RobotFileParser(url='')
    rp = RobotFileParser()
    '''
    This class provides methods to read, parse, and answer questions about the url's r...
Original · 2020-08-25 08:13:46 · 424 views · 0 comments
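As a rough illustration of how `RobotFileParser` answers questions about a robots.txt file, the sketch below feeds it rules from a string instead of downloading them, so it runs without a network connection. The rules in `ROBOTS_TXT`, the `my-crawler` agent name, the `example.com` URLs, and the helper `build_parser` are all made up for this example and are not from the original post.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used instead of fetching a real file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text):
    """Return a RobotFileParser loaded from robots.txt text (no network)."""
    rp = RobotFileParser()
    # parse() accepts an iterable of lines, so the text is split first.
    rp.parse(robots_text.splitlines())
    return rp


rp = build_parser(ROBOTS_TXT)

# can_fetch(useragent, url) says whether the agent may request the URL.
allowed_public = rp.can_fetch("my-crawler", "https://example.com/index.html")
allowed_private = rp.can_fetch("my-crawler", "https://example.com/private/data.html")

# crawl_delay(useragent) returns the Crawl-delay value, or None if absent.
delay = rp.crawl_delay("my-crawler")

print(allowed_public, allowed_private, delay)
```

For a live site the usual pattern is `rp.set_url(robot_url)` followed by `rp.read()`, after which `can_fetch` is used in the same way.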
[Python crawler] Exceptions raised by request (the urllib.error library)
from urllib.error import *
import urllib.request as ur

try:
    url = 'https://www.baidu.com'
    res = ur.urlopen(url)
    print(res.read().decode('UTF-8'))
# exception urllib.error.ContentTooShortError(msg, content)
# This exception is raised when the urlretrieve() function detects that the amount of downloaded data is less than...
Original · 2020-08-25 10:48:40 · 274 views · 0 comments
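The excerpt stops before its `except` clauses, so here is one possible way to finish the idea, offered as a sketch rather than the original author's code. The helper `fetch` and its `(body, error)` return convention are assumptions of mine. `HTTPError` is caught before `URLError` because it is a subclass of `URLError`; the other order would hide the status code.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, timeout=5):
    """Return (body, error): exactly one of the two is None."""
    try:
        with urlopen(url, timeout=timeout) as res:
            return res.read().decode("utf-8"), None
    except HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        return None, f"HTTP error {exc.code}: {exc.reason}"
    except URLError as exc:
        # No usable response at all: DNS failure, refused connection,
        # missing local file, and so on.
        return None, f"URL error: {exc.reason}"


# A file:// URL pointing at a path that does not exist triggers URLError
# without touching the network.
body, error = fetch("file:///this/path/does/not/exist-xyz")
print(body, error)
```

`ContentTooShortError`, mentioned in the excerpt, is also a subclass of `URLError`, but it comes from `urlretrieve()` rather than `urlopen()`.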
[Python crawler] Common user agents (user_agent.py)
In a distributed crawler, we usually send requests with different user_agent values to get more varied crawl results. Here are some user_agent strings.
import random

# User agents for PC
user_agent_pc = [
    # Google Chrome
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
    'Moz...
Original · 2020-10-18 11:16:40 · 286 views · 1 comment
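One way such a list might be used is to pick a random entry for every request. The sketch below only builds the request object and does not send it. The first string is the Chrome entry from the excerpt; the Firefox string, the `USER_AGENTS` name, the `build_request` helper, and the httpbin URL are illustrative assumptions of mine.

```python
import random
from urllib.request import Request

USER_AGENTS = [
    # Chrome entry taken from the excerpt above.
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
    # Illustrative Firefox entry (assumption, not from the original list).
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) "
    "Gecko/20100101 Firefox/115.0",
]


def build_request(url, agents=USER_AGENTS):
    """Build a Request whose User-Agent header is chosen at random."""
    chosen = random.choice(agents)
    return Request(url, headers={"User-Agent": chosen})


req = build_request("https://httpbin.org/get")

# Request stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))
```

Passing the list as a parameter keeps the helper easy to reuse with a separate mobile list, if the full user_agent.py defines one.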
[Python crawler] Learning the urllib.request library
import urllib.request as ur
import urllib.parse as up
import socket
import os
import urllib.error as ue
import http.cookiejar

# Plain opener object
def get_res():
    try:
        url = 'https://httpbin.org/post'
        wd = 'python'
        data = {...
Original · 2020-08-26 08:42:27 · 375 views · 0 comments
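Since the excerpt is cut off right where the POST data is built, the following is only a guess at the general shape: encode a form with `urlencode`, wrap it in a `Request`, and prepare an opener that keeps cookies. The payload `{"wd": "python"}` and the helpers `build_post` and `build_cookie_opener` are assumptions; nothing is sent over the network in this block.

```python
import http.cookiejar
import urllib.parse as up
import urllib.request as ur


def build_post(url, fields):
    """Build a POST request with a URL-encoded form body."""
    # urlopen expects bytes for the body, so the encoded string is converted.
    data = up.urlencode(fields).encode("utf-8")
    return ur.Request(url, data=data, method="POST")


def build_cookie_opener():
    """Return (opener, jar); cookies set by responses are stored in jar."""
    jar = http.cookiejar.CookieJar()
    opener = ur.build_opener(ur.HTTPCookieProcessor(jar))
    return opener, jar


# Assumed payload: the original dict is truncated in the excerpt.
req = build_post("https://httpbin.org/post", {"wd": "python"})
opener, jar = build_cookie_opener()

print(req.get_method(), req.data)
```

To actually send it, something like `with opener.open(req, timeout=10) as res: print(res.read().decode("utf-8"))` should work, ideally wrapped in a `try` that handles `urllib.error.URLError`.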
Python crawler -- Scraping the Maoyan Movies TOP100 chart
import requests
from requests.exceptions import RequestException
import re
import json
import time

def write_to_file(content):
    with open('result.txt', 'a', encoding='UTF-8') as f:
        f.write(json.dumps(content, ensure_ascii=False)+'\n')

def pa...
Original · 2020-08-28 08:44:58 · 350 views · 0 comments
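The parsing function is cut off in the excerpt, so the sketch below shows one plausible approach using only `re` and `json`: pull fields out of each list item with a regular expression and serialize them as one JSON object per line, as `write_to_file` does. The HTML in `SAMPLE_HTML` is a made-up fragment, not the real Maoyan markup, and `PATTERN`, `parse_page`, and `to_json_lines` are hypothetical names.

```python
import json
import re

# Hypothetical HTML fragment standing in for a downloaded chart page.
SAMPLE_HTML = """
<dd><i class="board-index">1</i><p class="name"><a href="/films/1">Film A</a></p><p class="score">9.5</p></dd>
<dd><i class="board-index">2</i><p class="name"><a href="/films/2">Film B</a></p><p class="score">9.4</p></dd>
"""

# re.S lets "." match newlines, so an item may span several lines.
PATTERN = re.compile(
    r'<dd>.*?board-index">(\d+)</i>.*?name"><a.*?>(.*?)</a>'
    r'.*?score">(.*?)</p>.*?</dd>',
    re.S,
)


def parse_page(html):
    """Yield one dict per chart entry found in the HTML."""
    for rank, title, score in PATTERN.findall(html):
        yield {"rank": int(rank), "title": title.strip(), "score": float(score)}


def to_json_lines(items):
    """Serialize items as JSON Lines; ensure_ascii=False keeps non-ASCII text readable."""
    return "".join(json.dumps(item, ensure_ascii=False) + "\n" for item in items)


movies = list(parse_page(SAMPLE_HTML))
output = to_json_lines(movies)
print(output, end="")
```

In a real run the HTML would come from `requests.get(...)` inside a `try` that catches `RequestException`, with a short `time.sleep` between pages; the regular expression would also need adjusting to the site's actual markup.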