Scraping and Saving TensorFlow Issues from GitHub with Python

Latest recommended article published on 2024-04-08 10:58:04

Rorschach_Warden

Tags: python, tensorflow, programming languages

Article link: https://blog.csdn.net/weixin_52595313/article/details/126857751

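I've recently been working on testing deep learning frameworks and needed to survey the problems people run into with them.

I tried quite a few approaches. GitHub pages kept misbehaving and were really hard to reach: the program would run for ages, scrape nothing, and then crash. After a heroic effort, it finally worked!

First, fetch the page source of TensorFlow's issue list. The code is as follows: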

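import time

import requests

# Walk through the paginated list of open issues, one page per iteration
for page in range(1, 87):
    while True:  # keep retrying the same page until it downloads
        try:
            html = requests.get(
                "https://github.com/tensorflow/tensorflow/issues?page=" + str(page) + "&q=is%3Aissue+is%3Aopen",
                timeout=(30, 50),  # (connect, read) timeouts in seconds
                verify=False,      # skips TLS certificate verification
            )
            print(html.text)
            print("Page " + str(page) + " fetched")
            time.sleep(1)
            break
        except requests.exceptions.RequestException:
            print("Connection was refused by the server...")
            time.sleep(2)
            print("ReConnecting...")
            continue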

I simply used while True here. How to put it: if a page won't download, the program just keeps trying. No rest allowed!
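The query string in the request URL is easy to mistype when it is assembled by hand. As a small sketch, it can be built with the standard library's urllib.parse.urlencode instead. The helper name issues_page_url is made up for this illustration and is not part of the original code.

```python
from urllib.parse import urlencode

BASE = "https://github.com/tensorflow/tensorflow/issues"


def issues_page_url(page, query="is:issue is:open"):
    """Return the URL of one page of the issue list (hypothetical helper)."""
    # urlencode percent-encodes ':' and turns spaces into '+'
    return BASE + "?" + urlencode({"page": page, "q": query})


print(issues_page_url(2))
# prints https://github.com/tensorflow/tensorflow/issues?page=2&q=is%3Aissue+is%3Aopen
```

The result can be passed straight to requests.get in place of the concatenated string.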

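Next, copy everything printed to the console into a text file. I tried three editors along the way: the output was so large that the consoles in VS Code and PyCharm overflowed. Then I discovered a gem, Sublime Text, whose console is really roomy. The other two could learn a thing or two.

You can also write the output straight to a txt file, which is just as convenient.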
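Writing straight to a file might look like the sketch below. The function name append_page and the marker line it writes before each page are assumptions made for this example; the only thing taken from the post is that step two reads a UTF-8 text file named 2.txt.

```python
def append_page(path, page_number, html_text):
    """Append one page of HTML to a UTF-8 text file (hypothetical helper)."""
    # Mode "a" appends, so pages that were already saved survive a restart
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"<!-- page {page_number} -->\n")  # marker so pages can be told apart
        f.write(html_text)
        f.write("\n")
```

In the fetch loop, the print(html.text) call would then become append_page("2.txt", <current page number>, html.text), and nothing has to be copied out of the console.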

Step two: extract the relevant information from the downloaded page source.

The code is as follows:

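import re

# Read the page source saved in step one
with open('2.txt', 'r', encoding='utf-8') as f:
    content = f.read()

# Capture everything between the issue link's path and its closing </a>
re1 = r'href="/tensorflow/tensorflow/issues/(.*?)</a>'
reResult = re.findall(re1, content)
for match in reResult:
    # Keep only matches longer than 7 characters; shorter ones are not full issue entries
    if len(match) > 7:
        print(match)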

What gets printed to the console is the number and title of each TensorFlow issue.

Then save this information to a txt file.
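The extraction and the save can also be done in one go. The sketch below uses a stricter pattern than the one above, capturing the issue number and the title separately. It assumes an issue link looks like href="/tensorflow/tensorflow/issues/<number>" ...>title</a>; GitHub's real markup may differ, and the names extract_issues and save_issues are invented for this example.

```python
import re

# Assumed shape of an issue link in the saved HTML; the real markup may differ
ISSUE_LINK = re.compile(
    r'href="/tensorflow/tensorflow/issues/(\d+)"[^>]*>(.*?)</a>', re.S
)


def extract_issues(html):
    """Return (number, title) pairs, de-duplicated, in order of appearance."""
    seen = set()
    issues = []
    for number, title in ISSUE_LINK.findall(html):
        title = re.sub(r"\s+", " ", title).strip()  # collapse line breaks and indentation
        if title and number not in seen:
            seen.add(number)
            issues.append((number, title))
    return issues


def save_issues(issues, path):
    """Write one 'number title' line per issue to a UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as f:
        for number, title in issues:
            f.write(f"{number} {title}\n")
```

With content read from 2.txt as above, save_issues(extract_issues(content), "problem1.txt") would produce the list file that the next step reads, with each line starting with the issue number.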

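Finally, the most important step: downloading and saving the issue pages themselves! This took a lot of fiddling and was truly painful.

My initial code: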

import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Read the issue list produced in step two (assumed to be saved as UTF-8)
with open("problem1.txt", "r", encoding="utf-8") as f:
    listOfLines = f.readlines()

for line in listOfLines:
    issue_id = line.strip()[0:5]  # the first five characters are the issue number
    while True:
        driver = None
        try:
            s = Service(executable_path=r'D:\Users\zhang\anaconda3\chromedriver.exe')
            driver = webdriver.Chrome(service=s)
            driver.get('https://github.com/tensorflow/tensorflow/issues/' + issue_id)
            # Print the page title
            print(driver.title)
            time.sleep(3)

            # 1. Run a Chrome DevTools command to get the page as MHTML
            res = driver.execute_cdp_cmd('Page.captureSnapshot', {})
            time.sleep(3)

            # 2. Write it to a file named after the issue number
            #    (issue titles can contain characters that are invalid in file names)
            filename = issue_id + ".mhtml"
            with open(r'D:/result/' + filename, 'w', newline='', encoding='utf-8') as f:  # newline='' added following a reader's comment
                f.write(res['data'])

            time.sleep(3)
            print("Saved one page")
            break  # without this the loop never ends

        except Exception:
            print("Connection was refused by the server...")
            time.sleep(2)
            print("ReConnecting...")
            continue

        finally:
            # Close the browser on success and on failure, so retries don't pile up Chrome windows
            if driver is not None:
                driver.quit()


# This approach kept hanging and was slow; not workable.

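This approach uses the selenium library to save the pages, and you also need to download the driver for your browser; I used the Chrome driver.

I ran this code for what felt like forever without getting anything out of it. At first, because GitHub was so hard to reach, the program simply crashed.

After I added while True the program stopped crashing, but not a single page was saved, which was really demoralizing.

It just... I could cry.

So I tried a different approach.

In the end I found inspiration in a blog post about scraping web page images with Beautiful Soup, and the scrape finally succeeded.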

import time

import requests

# Read the issue list produced in step two (assumed to be saved as UTF-8)
with open("problem1.txt", "r", encoding="utf-8") as f:
    listOfLines = f.readlines()

PATH = "D:/result/tensorflow/"
url = "https://github.com/tensorflow/tensorflow/issues/"

for line in listOfLines:
    issue_id = line.strip()[0:5]  # the first five characters are the issue number

    while True:
        try:
            r = requests.get(url + issue_id, timeout=(30, 50))
            # r.content is raw bytes, so the file must be opened in binary mode ('wb')
            with open(PATH + issue_id + ".html", 'wb') as o:
                o.write(r.content)

            print(issue_id + " saved")

            time.sleep(3)
            break

        except requests.exceptions.RequestException:
            print("Connection was refused by the server...")
            time.sleep(2)
            print("ReConnecting...")
            continue

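Of course, while True is indispensable here. In long-running code, something will inevitably go wrong along the way and crash the program; wrapping each request in while True means the code simply tries again whenever it hits a problem.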
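One caveat with an unbounded while True: if the error never goes away (a wrong URL or an unwritable path, say), the loop spins forever. A minimal sketch of a bounded variant follows, assuming a fixed number of attempts and a delay that doubles each time. The retry helper and its parameters are hypothetical and not from the original code.

```python
import time


def retry(action, attempts=5, delay=2.0, backoff=2.0, exceptions=(Exception,)):
    """Call action() until it succeeds or attempts run out, then re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions as exc:
            if attempt == attempts:
                raise  # give up: let the caller see the final error
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s...")
            time.sleep(delay)
            delay *= backoff  # wait longer before each further attempt
```

In the download loop, the body of the try block could be moved into a small function and called as retry(that_function, exceptions=(requests.exceptions.RequestException,)), so that one stubborn issue page gets reported instead of stalling the whole run.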

The result: each issue page ends up saved as an .html file under D:/result/tensorflow/.
