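Python, requests and BeautifulSoup are definitely the way to go, especially for a beginner. BeautifulSoup can handle all sorts of HTML and XML variations.
You will need to install Python, then install requests and bs4 (for example with pip). Both are easy to do by reading the requests docs and the BeautifulSoup docs.
If you don't know it yet, I'd suggest learning a bit of basic Python first.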
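Before making any network requests, it can help to see what BeautifulSoup does with a plain string of markup. This is only a minimal sketch: the HTML snippet is made up for illustration (note the unclosed `<p>` and `<li>` tags), and the names `html`, `page_title` and `hrefs` are not from the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration: the <p> and <li> tags are never closed.
html = (
    "<html><head><title>Demo page</title></head>"
    "<body><p>Hello<ul>"
    "<li><a href='/a'>A</a>"
    "<li><a href='/b'>B</a>"
    "</ul></body></html>"
)

# html.parser ships with Python, so no extra parser needs to be installed.
soup = BeautifulSoup(html, "html.parser")

# Text of the <title> tag.
page_title = soup.title.string

# href attribute of every <a> tag, in document order.
hrefs = [a.get("href") for a in soup.find_all("a")]

print(page_title)
print(hrefs)
```

The same `soup.title` and `soup.find_all('a')` calls work on markup fetched with requests, so this is a cheap way to experiment without depending on a live site.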
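Here is a simple example that fetches the title of the page you request:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup as bs

url = 'http://some.local.domain/'
response = requests.get(url)
soup = bs(response.text, 'html.parser')

# let's get the title of the page
title = soup.title
print(title)

# let's get all the links in the page
links = soup.find_all('a')
for link in links:
    print(link.get('href'))

# keep the first two links (this assumes the page has at least two <a> tags)
link1 = links[0]
link2 = links[1]

# let's follow a link we find in the page (we'll go for the first)
# the href may be relative, so resolve it against the page URL first
response = requests.get(urljoin(url, link1.get('href')), stream=True)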
# if we have an image and