Scraping proxy IPs with a Python crawler, testing them, and storing them in a database (requests + XPath)

！小白菜！y

Categories: mysql+navicat, Python crawler projects. Tags: python, crawler, tcp/ip, database, mysql

Article link: https://blog.csdn.net/qq_45834835/article/details/125122959

Preface

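Software used: PyCharm, Navicat
Tutorial followed: https://www.bilibili.com/video/BV1Hg4y1z76H?spm_id_from=333.880.my_history.page.click
Target site: the free proxy list on Kuaidaili (快代理)

Note: the body of the article starts here; the example below is provided for reference.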

I. Install the libraries

# Install directly from the Terminal
pip install requests
pip install parsel
pip install pymysql

If an installation fails (for example with "Could not find a version that satisfies the requirement urllib3"), install from a different mirror:

pip install package_name -i http://pypi.doubanio.com/simple/ --trusted-host pypi.doubanio.com

II. Steps

1. Import the libraries

import re           # regular expressions
import pymysql      # database connection
import requests     # HTTP requests
import parsel       # HTML parsing
import time         # time module

2. Code

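# Fetch the proxy data
def getData():
    proxy_list = []     # renamed from "list" so the built-in is not shadowed
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.53'
    }
    for page in range(1100, 1200):
        print('=================== Fetching page {} ================'.format(page))
        url = 'https://free.kuaidaili.com/free/inha/{}/'.format(page)

        # Send the request (the timeout keeps a stalled connection from hanging the crawl)
        response = requests.get(url, headers=headers, timeout=10)
        # Get the response body as text
        data = response.text
        # Wrap the text in a Selector so it can be queried with XPath
        html_data = parsel.Selector(data)
        # Parse the data; a leading // selects matching nodes at any depth
        path_list = html_data.xpath('//table[@class = "table table-bordered table-striped"]/tbody/tr')

        # Iterate over the table rows
        for tr in path_list:
            # Text wrapped by the td tags
            http_type = tr.xpath('./td[4]/text()').extract_first()        # protocol type
            ip_num = tr.xpath('./td[1]/text()').extract_first()           # IP
            ip_port = tr.xpath('./td[2]/text()').extract_first()          # port

            # extract_first() returns None when a cell is missing, so skip incomplete rows
            if not (http_type and ip_num and ip_port):
                continue

            # Build the proxy dict in the form {"protocol type": "ip:port"}
            dict_proxies = {http_type: ip_num + ':' + ip_port}
            print(dict_proxies)
            proxy_list.append(dict_proxies)

        # Pause once per page rather than once per row
        time.sleep(0.5)

    print(proxy_list)
    print('Number of proxies fetched:', len(proxy_list))
    return proxy_list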

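# Check the quality of each proxy IP, using Baidu as the test site
def check_ip(proxy_list):
    # Proxies that passed the check
    can_use = []
    # Request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.53'
    }
    # Iterate over the fetched proxies, each of the form {"HTTP": "ip:port"}
    for proxy in proxy_list:
        for http_type, address in proxy.items():
            # requests looks up the lowercase URL scheme in the proxies mapping,
            # so the key is lowercased and the test URL uses the same scheme
            scheme = http_type.lower()
            proxies = {scheme: 'http://' + address}
            try:
                # Send the request; 3 s is used because 0.1 s is too short for most free proxies
                response = requests.get('{}://www.baidu.com'.format(scheme), headers=headers, proxies=proxies, timeout=3)
                # A status code of 200 means the request succeeded
                if response.status_code == 200:
                    can_use.append(proxy)
                    print('Current IP:', proxy, 'passed the check')
                else:
                    print('Current IP:', proxy, 'failed with status', response.status_code)
            except Exception as e:
                print('Current IP:', proxy, 'failed the check:', e)

    return can_use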

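# Connect to the database and save the working proxies
def connectMysql(can_use):
    # Pattern that validates an "ip:port" string
    findi = re.compile(r'((25[0-5]|2[0-4]\d|[0-1]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[0-1]?\d\d?):\d+')

    # Connect once and reuse the connection for every row (replace the credentials with your own)
    connect = pymysql.Connection(host='localhost', user='root', password='152800', port=3306, database='ip')
    try:
        cursor = connect.cursor()
        sql = 'insert into agent_ip (http_type,ip_num) VALUES (%s,%s)'

        # Iterate over the proxies that passed the check
        for i in can_use:
            # Each item has the form {"protocol type": "ip:port"}
            for http_type, ip_num in i.items():
                # Skip anything that is not a well-formed ip:port
                if not findi.fullmatch(ip_num):
                    continue
                print(http_type, ip_num)
                cursor.execute(sql, [http_type, ip_num])

        connect.commit()
    finally:
        # Always release the connection, even if an insert fails
        connect.close()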

# Main function
def main():
    # Fetch the proxies
    proxy_list = getData()

    # Check the proxies
    can_use = check_ip(proxy_list)
    print('Usable proxies:', can_use)
    print('Count:', len(can_use))

    # Save to the database
    connectMysql(can_use)

if __name__ == '__main__':
    main()
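
The listing above stores each proxy as {"protocol type": "ip:port"}, with the protocol exactly as it appears on the page (for example HTTP). As far as I know, requests selects a proxy by looking up the lowercase scheme of the target URL in the proxies mapping, so a key such as "HTTP" would not be matched. The following is a minimal sketch of a normalization step, not the author's method; the helper name to_requests_proxies is hypothetical, and the "http://" prefix assumes the proxy itself is reached over plain HTTP.

```python
def to_requests_proxies(entry):
    """Convert {"HTTP": "ip:port"} into a mapping with lowercase scheme keys.

    Hypothetical helper: the key is lowercased and the address gets an
    explicit "http://" prefix (an assumption about how the proxy is reached).
    """
    proxies = {}
    for http_type, address in entry.items():
        # Skip incomplete entries, e.g. when a table cell was missing
        if not http_type or not address:
            continue
        proxies[http_type.strip().lower()] = "http://" + address.strip()
    return proxies


if __name__ == "__main__":
    print(to_requests_proxies({"HTTP": "1.2.3.4:8080"}))
```

The result can be passed as the proxies argument of requests.get. Because the lookup is by scheme, an entry keyed "http" would only apply to http:// URLs, so the test URL should use the same scheme as the entry.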

Detailed tutorial on connecting to the database: https://blog.csdn.net/qq_45834835/article/details/124491316?spm=1001.2014.3001.5501
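
The insert into agent_ip can also be written as one batched, parameterized statement on a single connection. The sketch below uses the standard-library sqlite3 module as a stand-in so that it runs without a MySQL server; this is a substitution, not what the article uses. The function names save_proxies and demo and the UNIQUE constraint are assumptions added for illustration. With pymysql the placeholder would be %s instead of ?, and MySQL spells the duplicate-skipping form INSERT IGNORE.

```python
import sqlite3


def save_proxies(connection, proxies):
    """Insert (http_type, ip_num) pairs in one batch, skipping duplicates."""
    rows = []
    for proxy in proxies:
        # Each proxy has the form {"protocol type": "ip:port"}
        for http_type, ip_num in proxy.items():
            rows.append((http_type, ip_num))
    # The connection context manager commits on success and rolls back on error
    with connection:
        connection.executemany(
            "INSERT OR IGNORE INTO agent_ip (http_type, ip_num) VALUES (?, ?)",
            rows,
        )
    return len(rows)


def demo():
    """Store three sample proxies, one of them a duplicate, and count the rows."""
    connection = sqlite3.connect(":memory:")
    # The UNIQUE constraint is an assumption; the article does not show the table definition
    connection.execute(
        "CREATE TABLE agent_ip (http_type TEXT, ip_num TEXT, UNIQUE (http_type, ip_num))"
    )
    sample = [
        {"HTTP": "1.2.3.4:8080"},
        {"HTTP": "1.2.3.4:8080"},
        {"HTTPS": "5.6.7.8:443"},
    ]
    save_proxies(connection, sample)
    stored = connection.execute("SELECT COUNT(*) FROM agent_ip").fetchone()[0]
    connection.close()
    return stored


if __name__ == "__main__":
    print(demo())
```

Batching keeps the number of round trips low, and the uniqueness constraint means re-running the crawler would not pile up repeated rows for the same proxy.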

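Summary

The crawl is quite likely to fail with an error when the network connection is unstable!
When that happens, the only option is to start the crawl over. The next optimization should be to record the position at which the crawl was interrupted, and to write the data collected so far to the database before the connection drops.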
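
One possible shape for that optimization is a checkpoint file that is rewritten after every page. The sketch below is an illustration under stated assumptions, not part of the original project: the names load_checkpoint, save_checkpoint and crawl are hypothetical, fetch_page stands for any function that returns the list of proxy dicts for one page, and a JSON file is used here instead of the database.

```python
import json
import os


def load_checkpoint(path, first_page):
    """Return (next_page, proxies) from the checkpoint file, or a fresh state."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            state = json.load(f)
        return state["next_page"], state["proxies"]
    return first_page, []


def save_checkpoint(path, next_page, proxies):
    """Write the state to a temporary file, then swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump({"next_page": next_page, "proxies": proxies}, f)
    # os.replace swaps the file in one step, so a crash cannot leave a half-written checkpoint
    os.replace(tmp_path, path)


def crawl(fetch_page, first_page, last_page, path):
    """Crawl pages first_page..last_page-1, resuming from the checkpoint if one exists."""
    page, proxies = load_checkpoint(path, first_page)
    while page < last_page:
        # fetch_page may raise on a network error; progress up to the
        # previous page is already on disk at that point
        proxies.extend(fetch_page(page))
        page += 1
        save_checkpoint(path, page, proxies)
    return proxies
```

After an interruption, calling crawl again with the same path continues from the first page that was not completed, instead of starting from the beginning.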
