Web Scraping

Write a program to scrape https://github.com using requests and Beautiful Soup. The goal is, for a given GitHub username (e.g., https://github.com/google), to get a list of repositories together with their GitHub-assigned programming language, the number of forks, and the number of stars each repository has.

Note that the repositories may be spread across several pages; here we only focus on the second page and return the result as a DataFrame named result2. The output format is shown below (note: the repository list changes over time, so any sample output is for reference only).
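For reference, the returned DataFrame has one row per repository with the columns Repository, Language, Forks and Stars; the row below is a hypothetical placeholder, not real data:

  Repository Language Forks Stars
0  some-repo   Python   123  4567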

Hint:

  • The GitHub URL is built from the username (in the path) and a page-number query string. For example, to list all of google's repositories on the third page, use the following URL: https://github.com/google?page=3
  • Use get_text(strip=True) to strip the whitespace and newlines from a tag's text.
  • You may encounter the exception ConnectionError: HTTPConnectionPool(host='xxx.xx.xxx.xxx', port=xxxx): Max retries exceeded with url: xx if you use requests to open many connections without closing them. You can use the following example code to avoid this exception; a combined sketch of these hints follows the snippet below.
requests.adapters.DEFAULT_RETRIES = 5
s = requests.session()
s.keep_alive = False
url = 'xx'
r = s.get(url, params=xx)
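
To make these hints concrete, here is a minimal, self-contained sketch (the username google and page 2 are just example values) that builds the paged URL from a params dict and strips whitespace with get_text(strip=True):

import requests
from bs4 import BeautifulSoup

requests.adapters.DEFAULT_RETRIES = 5
s = requests.session()
s.keep_alive = False

# requests builds the query string from params, so the final URL is
# https://github.com/google?page=2
r = s.get("https://github.com/google", params={"page": 2})
soup = BeautifulSoup(r.text, "html.parser")

# get_text(strip=True) drops surrounding whitespace and newlines.
title = soup.find("title")
if title is not None:
    print(title.get_text(strip=True))

The full solution below follows the same pattern: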
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re


def getRepo(username):
    # YOUR CODE HERE

    # we only focus on the **second** page
    page = 2
    params = "page=" + str(page)

    url = "https://github.com/" + username

    # configure the requests session (retries, no keep-alive), send the GET request, and read the response body
    requests.adapters.DEFAULT_RETRIES = 5
    s = requests.session()
    s.keep_alive = False
    r = s.get(url, params=params)
    text = r.text

    # parse the response body as HTML
    beautiful_soup_repositories = BeautifulSoup(text, "html.parser")
    # find every <li> tag whose class is "public source d-block py-4 border-bottom"
    all_li = beautiful_soup_repositories.find_all(name='li',
                                                  attrs={"class": "public source d-block py-4 border-bottom"})

    names = []
    programming_languages = []
    stars_numbers = []
    forks_numbers = []

    # iterate over every repository <li>
    for li in all_li:
        a = li.find(name="a", attrs={"itemprop": "name codeRepository"})
		
		# 获取标签内容(排除两边的空格)
        name = a.get_text(strip=True)
        print("name -> " + name)
        names.append(name)

        span = li.find(name="span", attrs={"itemprop": "programmingLanguage"})
        programming_language = span.get_text(strip=True)
        print("programming_language -> " + programming_language)
        programming_languages.append(programming_language)

        r = s.get(url + "/" + name)
        text = r.text
        beautiful_soup_repository = BeautifulSoup(text, "html.parser")

        href = "/" + username + "/" + name + "/" + "stargazers"
        a = beautiful_soup_repository.find(name="a", attrs={"href": href})
        stars_number = a.get_text(strip=True)
        print("stars_number -> " + stars_number)
        stars_numbers.append(stars_number)

        href = "/" + username + "/" + name + "/" + "network" + "/" + "members"
        a = beautiful_soup_repository.find(name="a", attrs={"href": href})
        forks_number = a.get_text(strip=True)
        print("forks_number -> " + forks_number)
        forks_numbers.append(forks_number)

        print()

    # configure pandas to display all rows and columns without limiting column width
    data = {"Repository": names, "Language": programming_languages, "Forks": forks_numbers, "Stars": stars_numbers}
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_rows', None)
    pd.set_option('display.max_colwidth', None)
    result2 = pd.DataFrame(data=data)
    return result2


result2 = getRepo("google")
print(result2)

The scraper code fails with an error: http.client.RemoteDisconnected: Remote end closed connection without response

Cause: the server restricts access based on the User-Agent.

1. What is a User-Agent?

Some websites do not want to be visited by crawlers, so they inspect the connecting client. If the visit comes from a crawler program rather than a human click, the site refuses further access. To keep our program running normally, we need to hide its identity as a crawler. We can do this by setting the User Agent (UA for short).

The User Agent lives in the request headers, and the server decides who is visiting by inspecting it. In Python, if you do not set a User Agent, the default one is used, and it contains the word python; if the server checks the User Agent, a Python program without a custom one cannot access the site normally.
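
A quick way to confirm this: requests exposes the default User-Agent string it sends when none is set, and it clearly identifies itself as python-requests:

import requests

# The default User-Agent looks like "python-requests/2.x.x",
# which sites can easily recognize as a script rather than a browser.
print(requests.utils.default_user_agent())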

Python lets us modify this User Agent to masquerade as a browser, which is a simple but effective trick.

2. How do we get around the restriction?

The answer is to use a random User-Agent, i.e., pick one at random from a predefined list for each request. A list of common User-Agents:

from random import randint

USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]

random_agent = USER_AGENTS[randint(0, len(USER_AGENTS) - 1)]
headers = {
    'User-Agent': random_agent,
}
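
A usage sketch, assuming the headers dict above: pass it to the session's get() so the server sees the randomized, browser-like User-Agent instead of the python-requests default (the URL and page number are just example values):

import requests

requests.adapters.DEFAULT_RETRIES = 5
s = requests.session()
s.keep_alive = False
# Send the randomized User-Agent with every request.
r = s.get("https://github.com/google", params={"page": 2}, headers=headers)
print(r.status_code)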

Note that random.randint(a, b) draws from the closed interval [a, b], i.e., b itself can be returned.
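
Equivalently, random.choice picks one element directly and avoids the index arithmetic (assuming the same USER_AGENTS list):

from random import choice

random_agent = choice(USER_AGENTS)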

If the error still occurs after this, consider whether the site has banned your IP. For setting up random IPs (proxies), see: https://blog.csdn.net/c406495762/article/details/60137956
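
For reference, requests can route traffic through a proxy via its proxies argument; a minimal sketch, where the proxy address is only a placeholder to be replaced with one from your own proxy pool:

import requests

# Placeholder proxy address; substitute a working proxy before running.
proxies = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}
r = requests.get("https://github.com/google", proxies=proxies)
print(r.status_code)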
