[Python crawler] Scraping ipip information (random User-Agent + fetching and sending cookies + fetching and sending a csrf_token)


AA8j


Article link: https://blog.csdn.net/qq_44874645/article/details/114965112


This module is the single-threaded part of an IP tracing tool; it retrieves information about an IP address from ipip.
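The cookie handling named in the title can be tried without touching ipip.net. The following is a minimal sketch, assuming a hypothetical local server: `DemoHandler`, the `session=demo123` cookie, and `cookie_round_trip` are all made up for illustration. It shows that an opener built with `HTTPCookieProcessor` stores the cookie from the first response and sends it back on the second request.

```python
import http.cookiejar
import http.server
import threading
import urllib.request


class DemoHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical local server: sets a cookie, then echoes the Cookie header."""

    def do_GET(self):
        received = self.headers.get("Cookie", "")
        body = received.encode("utf-8")
        self.send_response(200)
        if not received:
            # First visit without a cookie: hand one out.
            self.send_header("Set-Cookie", "session=demo123; Path=/")
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def cookie_round_trip():
    """Return the Cookie header the server saw on the first and second request."""
    server = http.server.HTTPServer(("127.0.0.1", 0), DemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/"

    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({}),  # bypass any system proxy for the local demo
        urllib.request.HTTPCookieProcessor(jar),
    )
    try:
        with opener.open(url) as resp:
            first = resp.read().decode("utf-8")
        with opener.open(url) as resp:
            second = resp.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()
    return first, second


if __name__ == "__main__":
    print(cookie_round_trip())
```

The first element is empty because the jar had nothing to send yet; the second carries the cookie the server set, which is the behaviour the script relies on when it visits ipip.net twice with the same opener.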


Code

# -*- coding: utf-8 -*-
# @Time    : 2021/3/19 10:30
# @Author  : AA8j
# @FileName: ipip.py
# @Software: PyCharm
# @Blog    : https://blog.csdn.net/qq_44874645
import re
import http.cookiejar  # cookie storage
import urllib.parse  # form encoding
import urllib.request  # sending requests
from fake_useragent import UserAgent


def get_ipip_html(ip):
    url1 = "https://www.ipip.net"
    # The CookieJar holds every cookie the server sets.
    cookiejar = http.cookiejar.CookieJar()
    # HTTPCookieProcessor creates the cookie-handling object (handler).
    handler = urllib.request.HTTPCookieProcessor(cookiejar)
    # build_opener builds an opener that routes requests through the handler.
    opener = urllib.request.build_opener(handler)

    # Pick a random User-Agent string for this session.
    user_agent = UserAgent().random
    headers = {
        'Host': 'www.ipip.net',
        'User-Agent': user_agent
    }
    # Build the first request object.
    request1 = urllib.request.Request(url1, headers=headers)

    # The server sets a cookie on this first, cookie-less visit, and the opener
    # stores it in cookiejar automatically. Any error from this request is ignored.
    try:
        opener.open(request1)
    except Exception:
        pass

    # Later opener.open() calls send the stored cookie along.
    url2 = "https://www.ipip.net/ip.html"
    request2 = urllib.request.Request(url2, headers=headers)
    response = opener.open(request2)
    html1 = response.read().decode('utf-8')
    csrf_token = crawl_csrf_token(html1)
    # Add the required header and the form body.
    headers['Referer'] = url1
    # urlencode escapes characters such as '+' or '=' that a token may contain;
    # the data argument must be bytes.
    data = urllib.parse.urlencode({'csrf_token': csrf_token, 'ip': ip}).encode('utf-8')
    # Request the page again, this time as a POST (a Request with data is sent as POST).
    post_request = urllib.request.Request(url2, headers=headers, data=data)
    response = opener.open(post_request)
    html2 = response.read().decode('utf-8')
    return html2


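# Extract the csrf_token from the first HTML page returned
def crawl_csrf_token(html1):
    # .*? is a non-greedy capture
    csrf_token_rule = re.compile(r'csrf_token" value="(.*?)">')
    match = csrf_token_rule.search(html1)
    if match is None:
        # Fail with a clear message instead of an AttributeError on None
        raise ValueError('csrf_token not found in the page')
    return match.group(1)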


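# Extract the information
def dig_ipip_information(html):
    # Remove all whitespace characters
    html = re.sub(r'\s', '', html)
    # Extract the high-precision location within China
    # ('国内高精度' is the label text on the page, so it must stay in Chinese)
    specific_location_rule = re.compile(r'<table.*国内高精度.*?46px;">(.*?)<.*table>')
    specific_location = ''.join(specific_location_rule.findall(html))
    # Check whether the IP belongs to an IDC data center
    # (the page then contains '骨干网数据', "backbone network data")
    idc_rule = re.compile(r'骨干网数据')
    idc = ''.join(idc_rule.findall(html))
    if idc:
        idc = '是'  # '是' means "yes"; an empty string means no match

    # Build the dictionary and return it
    ipip_dict = {'specific_location': specific_location, 'idc': idc}
    return ipip_dict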


if __name__ == '__main__':
    ip = '36.110.116.43'
    ipip_html = get_ipip_html(ip)
    ip_information = dig_ipip_information(ipip_html)
    print(ip_information)

Output:

{'specific_location': '中国北京北京海淀区硅谷亮城', 'idc': ''}
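The token extraction and form-body steps of the script above can be exercised offline. The following is a minimal sketch, not the author's code: `extract_csrf_token`, `build_form_body`, and `SAMPLE_HTML` are hypothetical names, and the HTML fragment is invented for illustration rather than copied from ipip.net. It reuses the article's regular expression, returns `None` when no token is present, and URL-encodes the POST body.

```python
import re
from typing import Optional
from urllib.parse import urlencode

# Same pattern as the article's crawl_csrf_token; .*? is a non-greedy capture.
CSRF_TOKEN_RE = re.compile(r'csrf_token" value="(.*?)">')

# Hypothetical fragment, invented for this example.
SAMPLE_HTML = '<form><input type="hidden" name="csrf_token" value="abc123"></form>'


def extract_csrf_token(html: str) -> Optional[str]:
    """Return the token captured by the pattern, or None if it is absent."""
    match = CSRF_TOKEN_RE.search(html)
    return match.group(1) if match else None


def build_form_body(csrf_token: str, ip: str) -> bytes:
    """URL-encode the two form fields and return them as bytes for a POST."""
    return urlencode({"csrf_token": csrf_token, "ip": ip}).encode("utf-8")


if __name__ == "__main__":
    token = extract_csrf_token(SAMPLE_HTML)
    print(token)
    print(build_form_body(token, "36.110.116.43"))
```

Encoding the body with `urlencode` matters when a token contains characters such as `+` or `=`, which would otherwise be misread by the server as part of the form syntax.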
