Web Crawlers: Basic Principles, Implementation, and Problem Solving

Latest recommended article published on 2023-06-28 01:20:34

qianshanding0708

Article tags: database, artificial intelligence, python, java, programming languages

Original link: https://mp.weixin.qq.com/s?__biz=MzI4OTU3ODk3NQ==&mid=2247505593&idx=1&sn=6d526af68d5b7d7c7c80e0beef6f36f5&chksm=ec2f9c09db58151f166fd44ac1396c2fbf4c22984938eeae9150b28474a262f467a4af8e5604&scene=126&&sessionid=0

For more content, follow the WeChat official account: fullstack888

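I. Why Crawlers Matter

1. Foreword

Graduation season has just kicked off, and the big tech companies have started competing for talent with early-batch hiring, so many people are grinding problems on LeetCode (力扣) in the hope of standing out. To build a LeetCode ranking for our group chat, I needed to scrape data from the LeetCode site, so I recently looked into how crawlers work and what they are good for.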

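2. What a crawler can do

The main goal of a crawler is to visit a target site automatically and at scale, collecting public data so that we can aggregate or analyze it. The word "public" matters: only crawl data that the web page exposes to everyone. Anything else can be illegal and is an easy way to "program your way into prison." Also watch your request frequency so that you do not harm the site (most sites enforce limits anyway). Otherwise you become a toxic crawler.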
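
One simple way to keep the request frequency under control is to enforce a minimum gap between requests on the client side. The sketch below is only an illustration: the `Throttle` class, its parameters, and the one-second interval are assumptions of mine, not something the target site prescribes.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests (illustrative helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(1.0)
throttle.wait()  # call this right before each HTTP request
```

Calling `throttle.wait()` before every request keeps a single-threaded crawler at no more than one request per `min_interval` seconds.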

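3. Why crawlers are useful

What a crawler mainly does is collect data, which can then be processed. Companies can use this data for market analysis and to spot business opportunities. It is like buying stocks: with enough historical data we can try to predict where the market is heading, and a correct call is an opportunity.

AI is also booming right now, and AI is built on large amounts of data; the training sets we hear about are exactly such collections. When no ready-made dataset is available, we need a crawler to gather that data ourselves.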

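II. Implementing a Crawler

1. Basic principle of a crawler

A crawler simply visits the relevant site automatically and pulls out the data we want. For example, when tracking a parcel we keep revisiting a page to check the latest status; a crawler simulates that process, and may skip some steps for efficiency. In this post we use a LeetCode user's total number of solved problems as the example.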

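2. Obtaining the API

When we open a LeetCode profile page, the browser has to request data to fill it in. Opening the browser's developer tools shows every one of these requests, as in the figure below:

The right-hand side shows the content of one data query sent by my profile page. It is written in GraphQL. If you are interested we can cover GraphQL another time; for now we only need to use it.

We can trim the request down to the essentials, which gives the following GraphQL payload: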

payload = {"operation_name": "userPublicProfile",   # query request content
    "query": '''query userPublicProfile($userSlug: String!) {
userProfilePublicProfile(userSlug: $userSlug) {
    username
    submissionProgress {
        acTotal
    }
}
}
''',
    "variables": '{"userSlug":"<user to query>"}'
}
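
Since only the `userSlug` variable changes between users, the payload can be produced by a small helper. This is a minimal sketch: `build_payload` and `PROFILE_QUERY` are names I made up, and the keys simply mirror the payload above.

```python
import json

# Same query text as in the payload above
PROFILE_QUERY = '''query userPublicProfile($userSlug: String!) {
  userProfilePublicProfile(userSlug: $userSlug) {
    username
    submissionProgress {
      acTotal
    }
  }
}
'''


def build_payload(user_slug):
    """Return the request parameters for one user's public profile."""
    return {
        "operation_name": "userPublicProfile",
        "query": PROFILE_QUERY,
        # json.dumps escapes quotes and non-ASCII characters in the slug
        "variables": json.dumps({"userSlug": user_slug}),
    }


print(build_payload("kingley")["variables"])
```

Building the `variables` string with `json.dumps` rather than by hand avoids malformed JSON when a slug contains quotes or other special characters.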

3. Implementing the crawler

Let's send the request we just constructed and see what comes back:

import requests as rq
from urllib.parse import urlencode


headers = {       # request headers
    "Referer": "https://leetcode.cn",
}


payload = {"operation_name": "userPublicProfile",   # query request content
    "query": '''query userPublicProfile($userSlug: String!) {
userProfilePublicProfile(userSlug: $userSlug) {
    username
    submissionProgress {
        acTotal
    }
}
}
''',
    "variables": '{"userSlug":"romantic-haibty42"}'
}


# The payload is sent as URL query parameters; the timeout avoids hanging forever
res = rq.post("https://leetcode.cn/graphql/" + "?" + urlencode(payload), headers=headers, timeout=10)
print(res.text)

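From the output above we can see the acTotal field, which is the total number of solved problems we wanted. But when we try to query a large number of users, we run into request-rate limits.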
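
To turn the raw response text into a group ranking, we still have to parse it. The sketch below assumes the response follows the usual GraphQL convention of mirroring the query under a top-level `data` key; the function names and the sample responses are made up for illustration and are not real LeetCode output.

```python
import json


def extract_ac_total(response_text):
    """Return (username, acTotal) from a GraphQL response body, or None."""
    body = json.loads(response_text)
    profile = (body.get("data") or {}).get("userProfilePublicProfile")
    if not profile:
        return None  # unknown user, or the server returned an error
    progress = profile.get("submissionProgress") or {}
    return profile.get("username"), progress.get("acTotal")


def rank(results):
    """Sort (username, acTotal) pairs from most to fewest solved problems."""
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Made-up sample responses, for illustration only
samples = [
    '{"data": {"userProfilePublicProfile": {"username": "alice", "submissionProgress": {"acTotal": 120}}}}',
    '{"data": {"userProfilePublicProfile": {"username": "bob", "submissionProgress": {"acTotal": 345}}}}',
    '{"data": {"userProfilePublicProfile": null}}',
]
parsed = [r for r in map(extract_ac_total, samples) if r is not None]
print(rank(parsed))  # [('bob', 345), ('alice', 120)]
```

In the real crawler, `res.text` from the request above would take the place of each sample string.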

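III. Getting Around Anti-Crawling Measures

1. How anti-crawling is implemented

A common anti-crawling technique is to limit each individual IP: if one IP sends a large number of requests within a certain time window, the site stops returning data and returns an error instead. This works mainly because the site's logging system records every request.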
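
To make the mechanism concrete, here is a minimal sketch of how a per-IP limit could work on the server side, using a sliding time window. It is only an illustration of the idea; the class name and thresholds are my own assumptions, and this is not how LeetCode or any particular site actually implements it.

```python
import time
from collections import defaultdict, deque


class PerIpLimiter:
    """Allow at most max_requests per window_seconds for each IP (illustrative)."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip):
        """Record a request from ip and report whether it should be served."""
        now = self._clock()
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: the site would return an error
        hits.append(now)
        return True


limiter = PerIpLimiter(max_requests=2, window_seconds=10)
print([limiter.allow("1.2.3.4") for _ in range(3)])  # [True, True, False]
```

Because the counters are kept per IP, requests arriving from a different IP are unaffected, which is exactly why rotating proxies gets around this kind of limit.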

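2. A solution to anti-crawling

Ipidea is an IP proxy platform that offers big-data proxy services to users worldwide. According to the platform, it currently has tens of millions of real residential IPs covering more than 220 countries and regions, with over 40 million refreshed daily. These are pooled into a proxy service with API access that supports http, https, socks5, and other protocols, and it can be used either through the API or with a username and password, so it is easy to get started. Official site: https://www.ipidea.net/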
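
Whichever protocol and login method is chosen, `requests` expects the proxy as a URL in a `proxies` dict. The helper below is a hedged sketch: `build_proxies` is a hypothetical name, and the hosts, ports, and credentials are placeholders rather than real Ipidea values.

```python
from urllib.parse import quote


def build_proxies(host, port, scheme="socks5", user=None, password=None):
    """Return a proxies dict in the form accepted by requests."""
    auth = ""
    if user is not None and password is not None:
        # Percent-encode credentials so characters like '@' do not break the URL
        auth = "{}:{}@".format(quote(user, safe=""), quote(password, safe=""))
    url = "{}://{}{}:{}".format(scheme, auth, host, port)
    # The same proxy handles both plain and TLS traffic
    return {"http": url, "https": url}


print(build_proxies("127.0.0.1", 1080))
print(build_proxies("proxy.example", 8080, scheme="http", user="u", password="p@ss"))
```

Note that SOCKS support in `requests` needs the optional extra (`pip install "requests[socks]"`), and the `socks5h` scheme makes DNS resolution happen on the proxy side instead of locally.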

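3. Implementation code

With the code above in hand, adding Ipidea is simple: download the sample code from the official site and insert our code into it:

Just replace tiqu in the code with our own proxy extraction link, then put our code inside the try block of the core business function.

However, I modified it to use SOCKS5 proxies. The full code is below: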

#!/usr/bin/env python
# coding=utf-8
# SOCKS5 proxies need the optional extra: pip install "requests[socks]"
import json
import threading
import time
import requests as rq
from urllib.parse import urlencode


headers = {
    "Referer": "https://leetcode.cn",
}


payload = {"operation_name": "userPublicProfile",
    "query": '''query userPublicProfile($userSlug: String!) {
userProfilePublicProfile(userSlug: $userSlug) {
    username
    submissionProgress {
        acTotal
    }
}
}
''',
    "variables": '{"userSlug":"kingley"}'
}


username = "romantic-haibty42"


# Fetch a CSRF token through the proxy and store it in the headers (not called below)
def int_csrf(proxies, header):
    sess = rq.session()
    sess.proxies = proxies
    sess.head("https://leetcode.cn/graphql/")
    header['x-csrftoken'] = sess.cookies["csrftoken"]


testUrl = 'https://api.myip.la/en?json'  # unused below


# Core business logic: query LeetCode once through the given proxy
def testPost(host, port):
    proxies = {
        'http': 'socks5://{}:{}'.format(host, port),
        'https': 'socks5://{}:{}'.format(host, port),
    }
    try:
        # Copy the shared dicts so threads do not mutate the same objects
        header = dict(headers)
        chaxun = dict(payload)
        chaxun['variables'] = json.dumps({"userSlug": username})
        res = rq.post("https://leetcode.cn/graphql/" + "?" + urlencode(chaxun),
                      headers=header, proxies=proxies, timeout=10)
        print(host, res.text)
    except Exception as e:
        print(e)


class ThreadFactory(threading.Thread):
    def __init__(self, host, port):
        threading.Thread.__init__(self)
        self.host = host
        self.port = port

    def run(self):
        testPost(self.host, self.port)


# Proxy extraction link (JSON response, SOCKS5 mode)
tiqu = ''


while True:
    # Extract 10 proxies per call and hand each one to a thread
    resp = rq.get(url=tiqu, timeout=5)
    try:
        if resp.status_code == 200:
            dataBean = json.loads(resp.text)
        else:
            print("Fetch failed")
            time.sleep(1)
            continue
    except ValueError:
        print("Fetch failed")
        time.sleep(1)
        continue
    else:
        # Parse the JSON array of proxies
        print("code=", dataBean)
        code = dataBean["code"]
        if code == 0:
            threads = []
            for proxy in dataBean["data"]:
                threads.append(ThreadFactory(proxy["ip"], proxy["port"]))
            for t in threads:  # start the threads
                t.start()
                time.sleep(0.01)
            for t in threads:  # wait for all threads to finish
                t.join()
    break
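
The threads above only print their results. If the results need to be collected, for example to build the ranking, one alternative is a thread pool. The sketch below is my own variation, not part of the Ipidea sample: `query_all` is a hypothetical helper, and `fake_fetch` stands in for the real request so that the example runs offline. It assumes each proxy entry has `ip` and `port` keys, as in the code above.

```python
from concurrent.futures import ThreadPoolExecutor


def query_all(proxy_list, fetch, max_workers=10):
    """Run fetch(host, port) once per proxy and collect the outcomes.

    Returns a list of (host, result_or_exception) in input order.
    """
    def safe_fetch(proxy):
        try:
            return proxy["ip"], fetch(proxy["ip"], proxy["port"])
        except Exception as exc:  # one bad proxy should not stop the rest
            return proxy["ip"], exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe_fetch, proxy_list))


# Stand-in for the real request so the sketch runs offline
def fake_fetch(host, port):
    if port == 0:
        raise ConnectionError("proxy unreachable")
    return "ok via {}:{}".format(host, port)


demo = [{"ip": "10.0.0.1", "port": 1080}, {"ip": "10.0.0.2", "port": 0}]
for host, outcome in query_all(demo, fake_fetch):
    print(host, outcome)
```

In real use, `fetch` would be a function like testPost that returns the response text instead of printing it.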

The result looks like this: