Python crawler: simulating a GitHub login and running a search

Latest recommended article published on 2024-04-03 19:32:43

.含笑.

Category: Crawlers. Tags: python, crawler

Article link: https://blog.csdn.net/qq_39551311/article/details/96872316

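1. Goal

Using GitHub as the example, simulate the login process and then crawl pages that are only accessible after logging in, such as the activity feed of people you follow and your profile information. These pages are visible while logged in and unavailable once you log out.

2. Environment setup

Install the lxml and requests libraries.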
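
Before going further it can help to confirm that both libraries are importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is made up for this illustration.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be imported here."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["requests", "lxml"])
    if missing:
        print("Install first: pip install " + " ".join(missing))
    else:
        print("requests and lxml are both available")
```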

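3. Analyzing the login process

1 Log out first and clear the cookies

2 Open https://github.com/login and capture the login request with the Google Chrome developer tools

3 The request captured after clicking the login button is shown in the figure below:

The request headers include Cookie, Host, Origin, Referer, User-Agent and so on. Send these headers when requesting the login page.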

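import re

import requests
from lxml import etree


class Login(object):
    def __init__(self):
        # Browser-like request headers taken from the captured login request
        self.headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            # 'br' is left out: requests only decodes Brotli when a brotli package is installed
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'Connection': 'keep-alive',
            'Host': 'github.com',
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
        }
        # The login page and the address its form posts to
        self.login_url = 'https://github.com/login'
        self.post_url = 'https://github.com/session'
        # The Session keeps cookies between requests
        self.session = requests.Session()

    def get_token(self):
        # Request the GitHub login page
        response = self.session.get(self.login_url, headers=self.headers)
        # etree.HTML builds an XPath-capable tree and repairs broken HTML
        # (for example, it adds missing closing tags)
        selector = etree.HTML(response.text)
        # xpath() returns a list; the first match is the authenticity_token needed to log in
        self.token = selector.xpath("//input[@name='authenticity_token']/@value")[0]
        print(self.token)
        return self.token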
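
The token extraction above can be tried offline on a small sample. The sketch below is not the article's lxml code: it swaps in the standard library's `html.parser` so that it runs without a network connection or extra packages. The sample HTML, the token value and the names `TokenParser` and `extract_token` are all invented for illustration.

```python
from html.parser import HTMLParser

# Made-up sample of a login form; the real page is much larger.
SAMPLE_LOGIN_HTML = """
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="sample-token-123">
  <input type="text" name="login">
  <input type="password" name="password">
</form>
"""


class TokenParser(HTMLParser):
    """Collect the value of the first <input name="authenticity_token">."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (self.token is None and tag == "input"
                and attributes.get("name") == "authenticity_token"):
            self.token = attributes.get("value")


def extract_token(html):
    """Return the token found in `html`, or None when there is none."""
    parser = TokenParser()
    parser.feed(html)
    return parser.token


print(extract_token(SAMPLE_LOGIN_HTML))
```

Returning None when the field is missing makes it easy to notice that the page layout has changed, instead of failing later with an index error.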

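The login page gives us the value of the token. The authenticity_token field in the submitted form data is exactly the token obtained above. The next request is a POST, and it has to carry the form data, otherwise it is rejected.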

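    def login(self):
        # Form fields sent with the POST; replace the placeholders with real credentials
        post_data = {
            'utf8': '✓',
            'authenticity_token': self.get_token(),
            'login': 'your-username',
            'password': 'your-password',
        }
        response = self.session.post(self.post_url, data=post_data, headers=self.headers)
        # requests follows redirects, so 200 only means that a page came back
        if response.status_code == 200:
            print(response)
        else:
            print(response.status_code)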
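
Because requests follows redirects by default, the status code alone tends to be 200 whether or not the credentials were accepted. One possible extra check, sketched below, looks at the URL the request finally landed on. It rests on an assumption that is not stated in the article: a failed login leaves you on the `/login` or `/session` path. The function name `looks_logged_in` is hypothetical.

```python
from urllib.parse import urlsplit

# Assumption: a failed login ends on one of these paths.
LOGIN_PATHS = {"/login", "/session"}


def looks_logged_in(final_url):
    """Guess login success from the URL the POST finally landed on."""
    path = urlsplit(final_url).path.rstrip("/") or "/"
    return path not in LOGIN_PATHS


print(looks_logged_in("https://github.com/"))
print(looks_logged_in("https://github.com/session"))
```

With a real response object this would be called as `looks_logged_in(response.url)`. It is only a heuristic and should be checked against what the site actually returns.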

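You can print the returned HTML to check the result. After a successful login the session keeps the logged-in state, so you can go further: search GitHub for projects or material, download things, and so on. Look at the https://github.com/search? URL and the query parameters that follow it; once they are set up in the same way as the earlier requests, the search page can be fetched.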
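
To see what query string the search parameters turn into, the sketch below builds the URL with the standard library only. It uses the same three parameters as the article's search code (`utf8`, `q`, `type`); the helper name `build_search_url` is made up, and requests does a similar encoding internally when it is given `params`.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://github.com/search"


def build_search_url(keyword):
    """Return the search URL for `keyword` with the article's parameters."""
    params = {"utf8": "✓", "q": keyword, "type": ""}
    return SEARCH_URL + "?" + urlencode(params)


print(build_search_url("python crawler"))
```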

    def search(self):
        key_name = input('Search GitHub projects: ')
        # Query parameters appended to https://github.com/search
        params = {
            "utf8": "✓",
            "q": key_name,
            "type": ""
        }
        print(key_name)
        url = "https://github.com/search"
        response = self.session.get(url, headers=self.headers, params=params)
        print(response)

        return response.text

    def get_search(self, html):
        # Class attributes seen on the search result page:
        # class ="repo-list-item d-flex flex-column flex-md-row flex-justify-start py-4 public source"
        # class ="col-12 col-md-9 d-inline-block text-gray mb-2 pr-4"

        # Capture the text of every project description paragraph
        pattern = re.compile('<p class="col-12 col-md-9 d-inline-block text-gray mb-2 pr-4">(.*?)</p>', re.S)
        projects = pattern.findall(html)
        print(projects)

        for project in projects:
            print(project.strip())
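
The raw regex matches usually still contain line breaks, indentation and inner tags. The sketch below shows one way to tidy them, using the same pattern on a made-up HTML fragment. The sample text and the names `clean_descriptions`, `DESCRIPTION` and `TAG` are assumptions for illustration, and the class string only matches if the page still uses it.

```python
import re

# Made-up fragment imitating one search result.
SAMPLE_RESULT_HTML = """
<p class="col-12 col-md-9 d-inline-block text-gray mb-2 pr-4">
    A <em>Python</em> crawler
    example
</p>
"""

DESCRIPTION = re.compile(
    r'<p class="col-12 col-md-9 d-inline-block text-gray mb-2 pr-4">(.*?)</p>',
    re.S,
)
TAG = re.compile(r"<[^>]+>")


def clean_descriptions(html):
    """Return descriptions with inner tags removed and whitespace collapsed."""
    cleaned = []
    for raw in DESCRIPTION.findall(html):
        text = TAG.sub("", raw)
        cleaned.append(" ".join(text.split()))
    return cleaned


print(clean_descriptions(SAMPLE_RESULT_HTML))
```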

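The final results can be analyzed further later on: crawl all the projects and analyze them, for example to find which projects have the most stars or the most comments.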
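
As a rough idea of what such an analysis could look like, the sketch below ranks projects by star count. The records are hypothetical: the names and numbers are invented, and the article does not show how the star counts would be scraped.

```python
# Hypothetical records; names and numbers are invented for illustration.
projects = [
    {"name": "example/repo-a", "stars": 120},
    {"name": "example/repo-b", "stars": 4500},
    {"name": "example/repo-c", "stars": 980},
]


def top_by_stars(items, limit=2):
    """Return the `limit` items with the highest star counts."""
    return sorted(items, key=lambda item: item["stars"], reverse=True)[:limit]


for project in top_by_stars(projects):
    print(project["name"], project["stars"])
```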
