The Nature of Web Crawlers and HTTP Status Codes

Latest recommended article published on 2023-08-21 20:54:13

Laicaling


License: CC 4.0 BY-SA

Categories: web crawlers, data collection, HTTP proxies. Tags: python, http

Article link: https://blog.csdn.net/Laicaling/article/details/106145996


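This article introduces the basics of web crawlers and emphasizes the role of HTTP GET and POST in crawling. It covers the state-management and cookie objects involved when HttpClient executes an HTTP request, such as CookieSpecRegistry and CookieStore. It also shows how to generate cookies with Python's selenium library, including the 16yun proxy configuration and the ChromeDriver setup used to simulate a login. Understanding HTTP status codes and handling cookies correctly is essential for effective web crawling.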


A crawler retrieves data with the HTTP GET method and submits data with the HTTP POST method.
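As a rough illustration of the difference, the sketch below builds one GET and one POST request with Python's standard `urllib`, without sending either. The URL `http://localhost:8080/search` and the field names are placeholders chosen for this example; they do not come from the original article.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8080/search"  # placeholder URL (assumption)


def build_get(params):
    """GET: parameters travel in the query string; there is no body."""
    return Request(BASE_URL + "?" + urlencode(params), method="GET")


def build_post(form):
    """POST: form fields travel in the request body."""
    body = urlencode(form).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(BASE_URL, data=body, headers=headers, method="POST")


get_req = build_get({"q": "crawler", "page": 1})
post_req = build_post({"username": "demo", "password": "secret"})

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

Passing either object to `urllib.request.urlopen` would actually send it; the sketch stops short of that so it runs without a server.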

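The client sends a request to the server. The request consists of the request method, the URL, and the protocol version, followed by a MIME-like message containing request modifiers, client information, and possibly body content.
The server responds with a status line, which contains the protocol version of the message and a success or error code, followed by a MIME-like message containing server information, entity metadata, and possibly entity content.
HTTP messages are therefore of two types: requests from client to server and responses from server to client. Both types consist of a start line, zero or more header fields, an empty line that marks the end of the header fields, and an optional message body.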
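The message layout described above can be sketched in a few lines of Python. The parser below is a simplified illustration, not a complete HTTP implementation: it assumes an HTTP/1.x response held entirely in memory, and the sample response `RAW` is made up. The status code's first digit gives its class (1xx informational, 2xx success, 3xx redirection, 4xx client error, 5xx server error).

```python
def parse_response(raw):
    """Split a raw HTTP/1.x response into status line, headers and body."""
    head, _, body = raw.partition(b"\r\n\r\n")  # the empty line ends the headers
    lines = head.decode("iso-8859-1").split("\r\n")
    parts = lines[0].split(" ", 2)  # status line: version, code, reason phrase
    version, code = parts[0], int(parts[1])
    reason = parts[2] if len(parts) > 2 else ""
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, code, reason, headers, body


def status_class(code):
    """Map a status code to its class, determined by the first digit."""
    classes = {1: "informational", 2: "success", 3: "redirection",
               4: "client error", 5: "server error"}
    return classes.get(code // 100, "unknown")


# Made-up sample response used only for this illustration
RAW = (b"HTTP/1.1 404 Not Found\r\n"
       b"Content-Type: text/plain\r\n"
       b"Content-Length: 9\r\n"
       b"\r\n"
       b"not found")

version, code, reason, headers, body = parse_response(RAW)
print(version, code, reason, status_class(code))
print(headers, body)
```

A crawler typically branches on the class first: retry later on 5xx, follow the `Location` header on 3xx, and give up or re-authenticate on 4xx.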
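While executing an HTTP request, HttpClient adds the following state-management objects to the execution context:
• http.cookiespec-registry: a CookieSpecRegistry instance representing the actual cookie specification registry. A value set for this attribute in the local context takes precedence over the default one.
• http.cookie-spec: a CookieSpec instance representing the actual cookie specification.
• http.cookie-origin: a CookieOrigin instance representing the actual details of the origin server.
• http.cookie-store: a CookieStore instance representing the actual cookie store. A value set for this attribute in the local context takes precedence over the default one.
A local HttpContext object can be used to customize the HTTP state-management context before the request is executed, or to examine its state afterwards: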
// HttpClient 4.x API
HttpClient httpclient = new DefaultHttpClient();
HttpContext localContext = new BasicHttpContext();
HttpGet httpget = new HttpGet("http://localhost:8080/");
HttpResponse response = httpclient.execute(httpget, localContext);
// After execution, read back the state-management objects stored in the context
CookieOrigin cookieOrigin = (CookieOrigin) localContext.getAttribute(ClientContext.COOKIE_ORIGIN);
System.out.println("Cookie origin: " + cookieOrigin);
CookieSpec cookieSpec = (CookieSpec) localContext.getAttribute(ClientContext.COOKIE_SPEC);
System.out.println("Cookie spec used: " + cookieSpec);
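For readers working in Python rather than Java, the standard library's `http.cookiejar` plays a loosely comparable role: a `CookieJar` acts as the cookie store, and its policy object decides which cookies are accepted and sent back. The sketch below is only an analogy, not an equivalent of the HttpClient context. It feeds a hand-built response into the jar so that it runs without a server; `FakeResponse` and the cookie `session=abc123` are hypothetical names invented for this example.

```python
import email.message
import http.cookiejar
import urllib.request


class FakeResponse:
    """Minimal stand-in for an HTTP response (hypothetical helper)."""

    def __init__(self, set_cookie_values):
        self._headers = email.message.Message()
        for value in set_cookie_values:
            self._headers["Set-Cookie"] = value

    def info(self):
        # CookieJar.extract_cookies() reads the headers through info()
        return self._headers


jar = http.cookiejar.CookieJar()  # the cookie store

# Before the request: the store is empty
print("before:", len(jar))

first = urllib.request.Request("http://localhost:8080/")
jar.extract_cookies(FakeResponse(["session=abc123; Path=/"]), first)

# After the response: the store holds the cookie the server set
for cookie in jar:
    print("stored:", cookie.name, cookie.value, cookie.path)

# The next request to the same origin gets the cookie attached
second = urllib.request.Request("http://localhost:8080/profile")
jar.add_cookie_header(second)
print("sent:", second.get_header("Cookie"))
```

In real use the jar is usually wrapped in `urllib.request.HTTPCookieProcessor` and passed to `urllib.request.build_opener`, which performs both steps automatically on every request.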
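A cookie is a token or short packet of state information that the HTTP agent (the client) and the target server exchange to maintain a session. HttpClient uses the Cookie interface to represent an abstract cookie token. In its simplest form, an HTTP cookie is just a name/value pair. A cookie usually also carries several attributes, such as a version number, the domain for which it is valid, a path specifying the subset of URLs on the origin server to which it applies, and the maximum period for which it is valid.
Cookies come in two versions. Version 0 cookies follow the original Netscape draft and are not considered compliant with the official specification; standards-compliant cookies are expected to have version 1. HttpClient may handle cookies differently depending on their version. For example: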
// Create a version 0 (Netscape draft) cookie
BasicClientCookie netscapeCookie = new BasicClientCookie("name", "value");
netscapeCookie.setVersion(0);
netscapeCookie.setDomain(".mycompany.com");
netscapeCookie.setPath("/");
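The same "name/value pair plus attributes" structure can be inspected from Python with the standard `http.cookies.SimpleCookie` class. This is a small sketch for illustration; the header string is made up to mirror the names in the Java example, and the `Max-Age` value is an assumption.

```python
from http.cookies import SimpleCookie

# A made-up Set-Cookie value mirroring the Java example (assumption)
header = "name=value; Domain=.mycompany.com; Path=/; Max-Age=3600"

parsed = SimpleCookie()
parsed.load(header)

morsel = parsed["name"]  # one cookie: its key, its value and its attributes
print("name/value:", morsel.key, morsel.value)
for attr in ("domain", "path", "max-age"):
    print(attr, "=", morsel[attr])
```

Attribute values come back as strings, so `max-age` needs an explicit `int()` before doing arithmetic on it.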
How to Generate Cookies
We use ChromeDriver to log in and generate the cookies:
import os
import random
import time
import zipfile

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


class GenCookies(object):
    # One User-Agent string per line
    USER_AGENT = open('useragents.txt').readlines()

    # 16yun proxy configuration
    PROXY_HOST = 't.16yun.cn'  # proxy host or IP
    PROXY_PORT = 31111  # port
    PROXY_USER = 'USERNAME'  # username
    PROXY_PASS = 'PASSWORD'  # password

    @classmethod
    def get_chromedriver(cls, use_proxy=False, user_agent=None):
        # Manifest of a small Chrome extension that sets the proxy and answers its auth prompt
        manifest_json = """
        {
            "version": "1.0.0",
            "manifest_version": 2,
            "name": "Chrome Proxy",
            "permissions": [
                "proxy",
                "tabs",
                "unlimitedStorage",
                "storage",
                "<all_urls>",
                "webRequest",
                "webRequestBlocking"
            ],
            "background": {
                "scripts": ["background.js"]
            },
            "minimum_chrome_version": "22.0.0"
        }
        """
        # Background script: route all traffic through the proxy and supply its credentials
        background_js = """
        var config = {
            mode: "fixed_servers",
            rules: {
                singleProxy: {
                    scheme: "http",
                    host: "%s",
                    port: parseInt(%s)
                },
                bypassList: ["localhost"]
            }
        };
        chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});
        function callbackFn(details) {
            return {
                authCredentials: {
                    username: "%s",
                    password: "%s"
                }
            };
        }
        chrome.webRequest.onAuthRequired.addListener(
            callbackFn,
            {urls: ["<all_urls>"]},
            ['blocking']
        );
        """ % (cls.PROXY_HOST, cls.PROXY_PORT, cls.PROXY_USER, cls.PROXY_PASS)
        path = os.path.dirname(os.path.abspath(__file__))
        chrome_options = webdriver.ChromeOptions()
        if use_proxy:
            # Package the two files above as an extension and load it into Chrome
            pluginfile = 'proxy_auth_plugin.zip'
            with zipfile.ZipFile(pluginfile, 'w') as zp:
                zp.writestr("manifest.json", manifest_json)
                zp.writestr("background.js", background_js)
            chrome_options.add_extension(pluginfile)
        if user_agent:
            chrome_options.add_argument('--user-agent=%s' % user_agent)
        # Selenium 3 call signature; Selenium 4 expects service=Service(path) and options=... instead
        driver = webdriver.Chrome(
            os.path.join(path, 'chromedriver'),
            chrome_options=chrome_options)
        return driver

    def __init__(self, username, password):
        self.url = 'https://passport.weibo.cn/signin/login?entry=mweibo&r=https://m.weibo.cn/'
        # Pick one User-Agent at random; USER_AGENT is a list of lines, not a single string
        user_agent = random.choice(self.USER_AGENT).strip()
        self.browser = self.get_chromedriver(use_proxy=True, user_agent=user_agent)
        self.wait = WebDriverWait(self.browser, 20)
        self.username = username
        self.password = password

    def open(self):
        """
        Open the login page, enter the username and password, and click submit.
        :return: None
        """
        self.browser.delete_all_cookies()
        self.browser.get(self.url)
        username = self.wait.until(EC.presence_of_element_located((By.ID, 'loginName')))
        password = self.wait.until(EC.presence_of_element_located((By.ID, 'loginPassword')))
        submit = self.wait.until(EC.element_to_be_clickable((By.ID, 'loginAction')))
        username.send_keys(self.username)
        password.send_keys(self.password)
        time.sleep(1)
        submit.click()

    def password_error(self):
        """
        Check whether the page reports a wrong username or password.
        :return: True if the error message appears within 5 seconds, otherwise False
        """
        try:
            # The Chinese string is the site's own error text ("wrong username or password")
            return WebDriverWait(self.browser, 5).until(
                EC.text_to_be_present_in_element((By.ID, 'errorMsg'), '用户名或密码错误'))
        except TimeoutException:
            return False

    def get_cookies(self):
        """
        Get the cookies of the current browser session.
        :return: list of cookie dicts
        """
        return self.browser.get_cookies()

    def main(self):
        """
        Entry point.
        :return: dict with a status code (1 = success, 2 = wrong credentials) and content
        """
        self.open()
        if self.password_error():
            return {
                'status': 2,
                'content': '用户名或密码错误'  # "wrong username or password"
            }
        # If no CAPTCHA is required, the login succeeds directly
        cookies = self.get_cookies()
        return {
            'status': 1,
            'content': cookies
        }


if __name__ == '__main__':
    result = GenCookies(
        username='180000000',
        password='16yun',
    ).main()
    print(result)
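Servers generally use cookies to identify users. If you accept a cookie and send it back with subsequent requests, the server treats you as an identified, normal user. For this reason, crawling most websites requires handling cookies.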
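One possible way to reuse cookies collected by a browser session is to join them into a `Cookie` request header for later plain HTTP requests. The sketch below assumes the list-of-dicts shape returned by Selenium's `get_cookies()`; the helper `cookies_to_header` and the sample cookie names and values are hypothetical and only illustrate the idea.

```python
from urllib.request import Request


def cookies_to_header(selenium_cookies):
    """Join Selenium-style cookie dicts into a Cookie request header value."""
    return "; ".join(
        "%s=%s" % (c["name"], c["value"]) for c in selenium_cookies
    )


# Sample data in the shape returned by get_cookies(); names and values are made up
sample_cookies = [
    {"name": "session_id", "value": "abc123", "domain": ".weibo.cn", "path": "/"},
    {"name": "token", "value": "xyz789", "domain": ".weibo.cn", "path": "/"},
]

request = Request("https://m.weibo.cn/")
request.add_header("Cookie", cookies_to_header(sample_cookies))
print(request.get_header("Cookie"))
```

This ignores each cookie's domain, path, and expiry, so it is only appropriate when every cookie in the list belongs to the site being requested.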