Write a program to scrape https://github.com using requests and Beautiful Soup. The goal is, for a given GitHub username (e.g., https://github.com/google), to obtain a list of repositories together with their GitHub-assigned programming language, the number of forks, and the number of stars each repository has.
Note that the repositories may be spread across several pages; we focus only on the second page and return the result as a DataFrame named result2. The output format is shown below. (Note: the repository list may change dynamically over time, so the following result is for reference only.)
Hint:
- The GitHub URL can take two query strings: the username and the page number. For example, to list all of google's repositories on the third page, we can use the following URL: https://github.com/google?page=3
- Use get_text(strip=True) to remove the surrounding whitespace and newlines from the text.
- You may encounter the exception:
ConnectionError: HTTPConnectionPool(host='xxx.xx.xxx.xxx', port=xxxx): Max retries exceeded with url: xx
if you use requests to open multiple connections without closing them. You can use the following example code to avoid this exception:
requests.adapters.DEFAULT_RETRIES = 5
s = requests.session()
s.keep_alive = False
url = 'xx'
r = s.get(url,params=xx)
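To make the two hints above concrete, here is a small offline sketch (the HTML snippet is invented; no network access is needed):

```python
from urllib.parse import urlencode
from bs4 import BeautifulSoup

# Hint 1: the profile URL takes the username in the path and the
# page number as a query string, e.g. https://github.com/google?page=3
username, page = "google", 3
url = "https://github.com/" + username + "?" + urlencode({"page": page})
print(url)  # https://github.com/google?page=3

# Hint 2: get_text(strip=True) drops the surrounding whitespace and newlines
soup = BeautifulSoup("<span>\n  Python  \n</span>", "html.parser")
print(soup.span.get_text(strip=True))  # Python
```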
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
def getRepo(username):
    # YOUR CODE HERE
    # we only focus on the **second** page
    page = 2
    params = {"page": page}
    url = "https://github.com/" + username
    # configure requests, issue the GET request, and fetch the response body
    requests.adapters.DEFAULT_RETRIES = 5
    s = requests.session()
    s.keep_alive = False
    r = s.get(url, params=params)
    text = r.text
    # parse the body as HTML
    beautiful_soup_repositories = BeautifulSoup(text, "html.parser")
    # find the <li> tags whose class is "public source d-block py-4 border-bottom"
    all_li = beautiful_soup_repositories.find_all(
        name="li", attrs={"class": "public source d-block py-4 border-bottom"})
    names = []
    programming_languages = []
    stars_numbers = []
    forks_numbers = []
    # walk over every repository <li> tag
    for li in all_li:
        a = li.find(name="a", attrs={"itemprop": "name codeRepository"})
        # read the tag text with surrounding whitespace stripped
        name = a.get_text(strip=True)
        print("name -> " + name)
        names.append(name)
        span = li.find(name="span", attrs={"itemprop": "programmingLanguage"})
        programming_language = span.get_text(strip=True)
        print("programming_language -> " + programming_language)
        programming_languages.append(programming_language)
        # fetch the repository's own page to read its star and fork counts
        repo_response = s.get(url + "/" + name)
        beautiful_soup_repository = BeautifulSoup(repo_response.text, "html.parser")
        href = "/" + username + "/" + name + "/stargazers"
        a = beautiful_soup_repository.find(name="a", attrs={"href": href})
        stars_number = a.get_text(strip=True)
        print("stars_number -> " + stars_number)
        stars_numbers.append(stars_number)
        href = "/" + username + "/" + name + "/network/members"
        a = beautiful_soup_repository.find(name="a", attrs={"href": href})
        forks_number = a.get_text(strip=True)
        print("forks_number -> " + forks_number)
        forks_numbers.append(forks_number)
        print()
    data = {"Repository": names, "Language": programming_languages, "Forks": forks_numbers, "Stars": stars_numbers}
    # show every row and column of the DataFrame without truncating column widths
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_rows', None)
    pd.set_option('display.max_colwidth', None)
    result2 = pd.DataFrame(data=data)
    return result2

result2 = getRepo("google")
print(result2)
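The Stars and Forks columns come back from get_text as strings. If numeric columns are needed downstream, a hedged sketch of the conversion (assuming the scraped counts may contain thousands separators such as "1,234" — the sample data here is made up):

```python
import pandas as pd

# made-up sample mirroring the scraped string columns
df = pd.DataFrame({"Stars": ["1,234", "56"], "Forks": ["789", "1,000"]})
for col in ["Stars", "Forks"]:
    # strip thousands separators, then convert to integers
    df[col] = df[col].str.replace(",", "", regex=False).astype(int)
print(df.dtypes)
```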
The scraper raises an error: http.client.RemoteDisconnected: Remote end closed connection without response
Cause: the server restricts access based on the User-Agent.
1. What is a User-Agent?
Some sites do not like being visited by crawlers, so they inspect whoever is connecting; if it is a crawler program, i.e. a non-human visit, they refuse further access. For the program to run normally, it has to hide its identity as a crawler. We can do this by setting the User-Agent, UA for short.
The User-Agent is stored in the request headers, and the server decides who is visiting by inspecting it. In Python, if no User-Agent is set, requests go out with a default value that contains the word "Python"; a server that checks the User-Agent will then refuse the Python program access to the site.
Python lets us change this User-Agent to mimic browser access.
2. How do we get around the restriction?
The answer is to use a random User-Agent, i.e. to draw one at random from a predefined list. A common User-Agent list:
from random import randint

USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]
random_agent = USER_AGENTS[randint(0, len(USER_AGENTS) - 1)]
headers = {
    'User-Agent': random_agent,
}
Note that random.randint(a, b) samples from the closed interval [a, b], so b itself can also be returned.
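The closed-interval indexing above can also be sidestepped with random.choice, which picks a list element directly. A minimal sketch (the shortened UA strings here are placeholders, not the full list above):

```python
import random

# placeholder list; in practice use the full USER_AGENTS list above
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr)",
]
# equivalent to USER_AGENTS[randint(0, len(USER_AGENTS) - 1)]
headers = {"User-Agent": random.choice(USER_AGENTS)}
print(headers["User-Agent"])
```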
If the error still occurs after this, consider whether the site has banned your IP. For setting up random proxy IPs, see: https://blog.csdn.net/c406495762/article/details/60137956