How to save links from an HTML file: using BeautifulSoup to save a linked page's HTML to a file, then doing the same for every link found in that HTML...

I see. Here is my code for recursive URL parsing with Beautiful Soup:

import requests
import urllib2
from bs4 import BeautifulSoup

link_set = set()
give_url = raw_input("Enter url:\t")

def magic(give_url, link_set, count):
    # print "______________________________________________________"
    # print "Count is: " + str(count)
    # count += 1
    # print "THE URL IT IS SCRAPPING IS:" + give_url
    page = urllib2.urlopen(give_url)
    page_content = page.read()
    with open('page_content.html', 'w') as fid:
        fid.write(page_content)
    response = requests.get(give_url)
    html_data = response.text
    soup = BeautifulSoup(html_data)
    list_items = soup.find_all('a')
    for each_item in list_items:
        html_link = each_item.get('href')
        if(html_link is None):
            pass
        else:
            if(not (html_link.startswith('http') or html_link.startswith('https'))):
                link_set.add(give_url + html_link)
            else:
                link_set.add(html_link)
    # print "Total links in the given url are: " + str(len(link_set))

magic(give_url, link_set, 0)

link_set2 = set()
link_set3 = set()
for element in link_set:
    link_set2.add(element)

count = 1
for element in link_set:
    magic(element, link_set3, count)
    count += 1
    for each_item in link_set3:
        link_set2.add(each_item)
    link_set3.clear()
    count = 1

print "Total links scraped are: " + str(len(link_set2))

for element in link_set2:
    count += 1
    print "Element number " + str(count) + " processing"
    print element
    print "\n"

There are a lot of mistakes, so I am asking you all: please tell me where the code can be improved.
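Two concrete improvements stand out in the code above: relative links are glued onto the page URL with plain string concatenation (give_url + html_link), which produces broken URLs for hrefs like '/c.html', and every fetched page is written to the same 'page_content.html', so each save overwrites the previous one. A minimal Python 3 sketch of fixes (the helper names resolve_link and url_to_filename are my own, not from the original code):

```python
import re
from urllib.parse import urljoin

def resolve_link(base_url, href):
    """Join a possibly-relative href against the page URL.

    urljoin follows the same resolution rules a browser uses,
    unlike naive string concatenation.
    """
    return urljoin(base_url, href)

def url_to_filename(url):
    """Derive a distinct, filesystem-safe file name for each page,
    so saving one page does not overwrite another."""
    safe = re.sub(r'[^A-Za-z0-9._-]+', '_', url)
    return safe + '.html'

# Usage:
print(resolve_link('http://example.com/a/index.html', '/c.html'))
# → http://example.com/c.html
print(url_to_filename('http://example.com/a?q=1'))
# → http_example.com_a_q_1.html
```

As a further note, the function fetches every page twice, once with urllib2.urlopen and once with requests.get; a single requests.get would be enough to both save response.text to a file and feed it to BeautifulSoup.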
