I have recently been working on a text-analysis project that requires crawling related web pages for analysis. I used Python's requests and BeautifulSoup packages to fetch and parse the pages. During the crawl I ran into a number of problems that I had not anticipated before starting. For example, different pages may not parse in a consistent way, which can make the parsing step fail; and hitting the server too frequently can trigger a "connection closed by remote host" error. The code below takes both of these problems into account.
import requests
import bs4
import time
# output file name
output = open("C:\\result.csv", 'w', encoding="utf-8")
# start request
request_link = "http://where-you-want-to-crawl-from"
response = requests.get(request_link)
# parse the html
soup = bs4.BeautifulSoup(response.text, "html.parser")
# try to read the href of the 31st anchor; fall back to 'NULL' if it is missing
try:
    link = str(soup.find_all('a')[30].get('href'))
except Exception:
    link = 'NULL'
# only follow the link if it points to the related app
if link.startswith("/somewords"):
    # pause between requests to avoid hammering the server
    time.sleep(2)
    # request the sub link
    response = requests.get("some_websites" + link)
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    # the info you want lives in a <div> whose class is "o-content"
    info_you_want = str(soup.find("div", {"class": "o-content"}))
    try:
        sub_link = str(soup.find("div", {"class": "crumb clearfix"})).split('</a>')[2].split('</div>')[0].strip()
    except Exception:
        sub_link = "NULL_because_exception"
    try:
        info_you_want = info_you_want.split('"o-content">')[1].split('</div>')[0].strip()
    except Exception:
        info_you_want = "NULL_because_exception"
    # strip line breaks so each record stays on one line of the CSV
    info_you_want = info_you_want.replace('\n', '').replace('\r', '')
    # write results into file
    output.write(info_you_want + "\n\n")
# the aimed link was not found; record a placeholder row instead
else:
    output.write(link + ",target_link_not_found\n")
output.close()
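
The fixed time.sleep(2) above only lowers the chance of the server dropping the connection; it does not recover when the error actually occurs. A more defensive option is to wrap each request in a retry loop with an increasing delay. The sketch below is only an illustration of that idea; the function name fetch_with_retry and its parameters are made up for this example and are not part of the code above.

import time
import requests

def fetch_with_retry(url, max_retries=3, base_delay=2):
    """Fetch a URL, retrying with an increasing delay when the
    connection is dropped or the request times out."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except (requests.ConnectionError, requests.Timeout):
            # back off a little longer after each failed attempt
            time.sleep(base_delay * (attempt + 1))
    return None  # all attempts failed; let the caller decide what to do

If fetch_with_retry returns None, the caller can skip that page or log it, so a single flaky page does not abort the whole crawl.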