Write a program to scrape https://github.com using requests and Beautiful Soup. The goal is, for a given GitHub username (e.g., https://github.com/google), to obtain a list of repositories together with their GitHub-assigned programming language, the number of forks, and the number of stars each repository has.
Note that the repositories may be spread across several pages; we focus only on the second page and return the result as a DataFrame named result2. The output format is shown below. (Note: the repository list may change dynamically over time, so the following result is for reference only.)
Hint:
- The GitHub URL can take two query strings: the username and the page number. For example, to list all of google's repositories on the third page, we can use the following URL: https://github.com/google?page=3
- Use get_text(strip=True) to remove the surrounding whitespace and newlines from the text.
- You may encounter the exception:
ConnectionError: HTTPConnectionPool(host='xxx.xx.xxx.xxx', port=xxxx): Max retries exceeded with url: xx
if you use requests to open multiple connections without closing them. You can use the following example code to avoid this exception:
requests.adapters.DEFAULT_RETRIES = 5
s = requests.session()
s.keep_alive = False
url = 'xx'
r = s.get(url,params=xx)
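To make the two hints above concrete, here is a small offline sketch (the HTML snippet is invented; no network access is needed):

```python
from urllib.parse import urlencode
from bs4 import BeautifulSoup

# Hint 1: the profile URL takes the username in the path and the
# page number as a query string, e.g. https://github.com/google?page=3
username, page = "google", 3
url = "https://github.com/" + username + "?" + urlencode({"page": page})
print(url)  # https://github.com/google?page=3

# Hint 2: get_text(strip=True) drops the surrounding whitespace and newlines
soup = BeautifulSoup("<span>\n  Python  \n</span>", "html.parser")
print(soup.span.get_text(strip=True))  # Python
```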
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
def getRepo(username):
    # YOUR CODE HERE
    # we only focus on the **second** page
    page = 2
    params = {"page": page}
    url = "https://github.com/" + username
    # configure requests, issue the GET request, and fetch the response body
    requests.adapters.DEFAULT_RETRIES = 5
    s = requests.session()
    s.keep_alive = False
    r = s.get(url, params=params)
    text = r.text
    # parse the body as HTML
    beautiful_soup_repositories = BeautifulSoup(text, "html.parser")
    # find the <li> tags whose class is "public source d-block py-4 border-bottom"
    all_li = beautiful_soup_repositories.find_all(
        name="li", attrs={"class": "public source d-block py-4 border-bottom"})
    names = []
    programming_languages = []
    stars_numbers = []
    forks_numbers = []
    # walk over every repository <li> tag
    for li in all_li:
        a = li.find(name="a", attrs={"itemprop": "name codeRepository"})
        # read the tag text with surrounding whitespace stripped
        name = a.get_text(strip=True)
        print("name -> " + name)
        names.append(name)
        span = li.find(name="span", attrs={"itemprop": "programmingLanguage"})
        programming_language = span.get_text(strip=True)
        print("programming_language -> " + programming_language)
        programming_languages.append(programming_language)
        # fetch the repository's own page to read its star and fork counts
        repo_response = s.get(url + "/" + name)
        beautiful_soup_repository = BeautifulSoup(repo_response.text, "html.parser")
        href = "/" + username + "/" + name + "/stargazers"
        a = beautiful_soup_repository.find(name="a", attrs={"href": href})
        stars_number = a.get_text(strip=True)
        print("stars_number -> " + stars_number)
        stars_numbers.append(stars_number)
        href = "/" + username + "/" + name + "/network/members"
        a = beautiful_soup_repository.find(name="a", attrs={"href": href})
        forks_number = a.get_text(strip=True)
        print("forks_number -> " + forks_number)
        forks_numbers.append(forks_number)
        print()
    data = {"Repository": names, "Language": programming_languages, "Forks": forks_numbers, "Stars": stars_numbers}
    # show every row and column of the DataFrame without truncating column widths
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_rows', None)
    pd.set_option('display.max_colwidth', None)
    result2 = pd.DataFrame(data=data)
    return result2

result2 = getRepo("google")
print(result2)
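The Stars and Forks columns come back from get_text as strings. If numeric columns are needed downstream, a hedged sketch of the conversion (assuming the scraped counts may contain thousands separators such as "1,234" — the sample data here is made up):

```python
import pandas as pd

# made-up sample mirroring the scraped string columns
df = pd.DataFrame({"Stars": ["1,234", "56"], "Forks": ["789", "1,000"]})
for col in ["Stars", "Forks"]:
    # strip thousands separators, then convert to integers
    df[col] = df[col].str.replace(",", "", regex=False).astype(int)
print(df.dtypes)
```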
The scraper raises an error: http.client.RemoteDisconnected: Remote end closed connection without response
Cause: the server restricts access based on the User-Agent.
1. What is a User-Agent?
Some sites do not like being visited by crawlers, so they inspect whoever is connecting; if it is a crawler program, i.e. a non-human visit, they refuse further access. For the program to run normally, it has to hide its identity as a crawler. We can do this by setting the User-Agent, UA for short.
The User-Agent is stored in the request headers, and the server decides who is visiting by inspecting it. In Python, if no User-Agent is set, requests go out with a default value that contains the word "Python"; a server that checks the User-Agent will then refuse the Python program access to the site.
Python lets us change this User-Agent to mimic browser access.
2. How do we get around the restriction?
The answer is to use a random User-Agent, i.e. to draw one at random from a predefined list. A common User-Agent list:
from random import randint

USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]
random_agent = USER_AGENTS[randint(0, len(USER_AGENTS) - 1)]
headers = {
    'User-Agent': random_agent,
}
Note that random.randint(a, b) samples from the closed interval [a, b], so b itself can also be returned.
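The closed-interval indexing above can also be sidestepped with random.choice, which picks a list element directly. A minimal sketch (the shortened UA strings here are placeholders, not the full list above):

```python
import random

# placeholder list; in practice use the full USER_AGENTS list above
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr)",
]
# equivalent to USER_AGENTS[randint(0, len(USER_AGENTS) - 1)]
headers = {"User-Agent": random.choice(USER_AGENTS)}
print(headers["User-Agent"])
```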
If the error still occurs after this, consider whether the site has banned your IP. For setting up random proxy IPs, see: https://blog.csdn.net/c406495762/article/details/60137956