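Runtime environment: Python 3.6.4
Goal: crawl all the links contained in a web page
Strategy:
1) Decide on the entry URL to crawl
2) Build a regular expression for link extraction according to your needs
3) Disguise the request as a browser and fetch the corresponding page
4) Use the regular expression from 2) to extract the links contained in that page
5) Filter out duplicate links
6) Follow-up processing, for example printing the links to the screen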
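Steps 2), 4) and 5) can be tried offline before any page is fetched. The following is only a minimal sketch: `SAMPLE_HTML`, `LINK_PATTERN` and `extract_unique_links` are hypothetical names introduced for illustration, and the sample markup is made up. Unlike a plain `set()`, this variant keeps the links in the order they first appear.

```python
import re

# Hypothetical sample page, used only for illustration.
SAMPLE_HTML = (
    '<a href="https://example.com/a.html">A</a>'
    '<a href="http://example.com/b/">B</a>'
    '<a href="https://example.com/a.html">A again</a>'
)

# Step 2): match http/https URLs up to whitespace, ')', '"' or ';'.
# The trailing group is non-capturing, so findall() returns plain strings.
LINK_PATTERN = re.compile(r'https?://[^\s)";]+\.(?:\w|/)*')


def extract_unique_links(html):
    """Steps 4) and 5): extract links and drop duplicates, keeping first-seen order."""
    seen = set()
    unique = []
    for link in LINK_PATTERN.findall(html):
        if link not in seen:
            seen.add(link)
            unique.append(link)
    return unique


print(extract_unique_links(SAMPLE_HTML))
# → ['https://example.com/a.html', 'http://example.com/b/']
```

Testing the pattern on a fixed string like this makes it easier to see what the expression does and does not match before pointing it at a live site.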
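import re
import urllib.request

def getlink(url):
    # Pretend to be a browser by sending a browser User-Agent header
    headers = ("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36")
    opener = urllib.request.build_opener()
    opener.addheaders = [headers]
    # Install the opener globally so that urlopen() uses it
    urllib.request.install_opener(opener)
    file = urllib.request.urlopen(url)
    # Decode the bytes to text; str() on bytes would return the "b'...'" repr instead
    data = file.read().decode("utf-8", errors="ignore")
    # Raw string avoids invalid-escape warnings; the inner group is
    # non-capturing so findall() returns strings rather than tuples
    pat = r'https?://[^\s)";]+\.(?:\w|/)*'
    link = re.compile(pat).findall(data)
    # Remove duplicate links
    link = list(set(link))
    return link

url = "https://www.csdn.net/"
linklist = getlink(url)
for link in linklist:
    print(link)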
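The regular expression only finds absolute URLs that start with http:// or https://, so relative links such as href="/about" are missed. As an alternative technique (not the method used above), the standard library's `html.parser` can read the `href` attributes directly, and `urllib.parse.urljoin` can resolve them against the page address. This is a rough sketch under those assumptions; `LinkCollector`, `collect_links` and the sample markup are hypothetical names made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                absolute = urljoin(self.base_url, value)
                # Skip links that were already collected.
                if absolute not in self.links:
                    self.links.append(absolute)


def collect_links(html, base_url):
    """Return the distinct absolute links found in html, in first-seen order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical sample markup, used only for illustration.
sample = (
    '<a href="/about">About</a> '
    '<a href="https://example.com/x">X</a> '
    '<a href="/about">About again</a>'
)
print(collect_links(sample, "https://example.com/"))
# → ['https://example.com/about', 'https://example.com/x']
```

To combine it with the script above, the decoded page text and the entry URL could be passed to `collect_links` in place of the regular-expression step.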