Reference: https://blog.csdn.net/hdu09075340/article/details/74202339
-------------------
Reference: https://www.cnblogs.com/hhh5460/p/5044038.html
Four methods
'''
Get all links on the current page
'''
import requests
import re
from bs4 import BeautifulSoup
from lxml import etree
from selenium import webdriver
url = 'http://www.ok226.com'
r = requests.get(url)
r.encoding = 'gb2312'
# Method 1: re (crude but effective!)
matches = re.findall(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')", r.text)
for link in matches:
    print(link)
print()
# Method 2: BeautifulSoup4 (DOM tree)
soup = BeautifulSoup(r.text, 'lxml')
for a in soup.find_all('a'):
    link = a.get('href')  # .get() avoids a KeyError on <a> tags without an href
    print(link)
print()
# Method 3: lxml.etree (XPath)
tree = etree.HTML(r.text)
for link in tree.xpath("//@href"):
    print(link)
print()
# Method 4: selenium (opens a real browser!)
driver = webdriver.Firefox()
driver.get(url)
for link in driver.find_elements_by_tag_name("a"):  # Selenium 4 renames this to find_elements(By.TAG_NAME, "a")
    print(link.get_attribute("href"))
driver.close()
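All four methods return hrefs exactly as they appear in the page, which are often relative paths. A small helper (my addition, not from the referenced posts) can resolve them against the base URL with `urllib.parse.urljoin` and drop duplicates:

```python
from urllib.parse import urljoin

def normalize_links(base_url, links):
    """Resolve each href against base_url and drop duplicates, keeping order."""
    seen = set()
    result = []
    for link in links:
        # Relative paths become absolute; already-absolute URLs pass through unchanged.
        absolute = urljoin(base_url, link)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result

print(normalize_links('http://www.ok226.com/',
                      ['/a.html', 'b.html', 'http://other.com/c.html', '/a.html']))
```

Feed it the list produced by any of the four methods above before crawling further.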
----------------------------------------------------------
Reference: https://blog.csdn.net/xtingjie/article/details/73465522
----------------------------------------------------------
Reference: https://blog.csdn.net/linzch3/article/details/72884715
---------------------------------------------------------
Reference: https://www.baidu.com/s?wd=python+%E4%B8%8B%E4%B8%80%E9%A1%B5+%E6%8A%93%E5%8F%96&ie=utf-8&tn=02049043_27_pg
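The search above is about following "next page" (下一页) links. A minimal sketch of that idea, under the assumption that the pager is a plain `<a>` whose text is 下一页 (real pages will vary, so the pattern is hypothetical):

```python
import re
from urllib.parse import urljoin

def find_next_page(base_url, html):
    """Return the absolute URL of the link whose text is '下一页' (next page), or None."""
    # Hypothetical pattern: matches <a ... href="...">下一页</a>; adjust for the real site.
    m = re.search(r'<a[^>]+href=["\']([^"\']+)["\'][^>]*>\s*下一页\s*</a>', html)
    return urljoin(base_url, m.group(1)) if m else None

sample = '<div class="pager"><a href="/list_2.html">下一页</a></div>'
print(find_next_page('http://example.com/list_1.html', sample))  # http://example.com/list_2.html
```

A crawler would call this in a loop, fetching each returned URL until it gets `None`.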
----------------------------------------------------------
Reference: https://blog.csdn.net/wangxw_lzu/article/details/75092603
>>> import re, urllib.request
>>>
>>> url = 'http://www.nmc.cn'
>>> html = urllib.request.urlopen(url).read()
>>> html = html.decode('utf-8')  # needed in Python 3: urlopen().read() returns bytes
>>> links = re.findall('<a target="_blank" href="(.+?)" title',html)
>>> titles = re.findall('<a target="_blank" .+? title="(.+?)">',html)
>>> tags = re.findall('<a target="_blank" .+? title=.+?>(.+?)</a>',html)
>>> for link, title, tag in zip(links, titles, tags):
...     print(tag, url + link, title)
...
沙尘暴预警 http://www.nmc.cn/publish/country/warning/dust.html 中央气象台4月5日06时继续发布沙尘暴蓝色预警
>>>
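The three separate `findall` calls above can silently fall out of alignment if one pattern matches a tag the others miss. An alternative (my variation, not from the referenced post) is a single pattern with three capture groups, so each match yields a consistent (link, title, tag) triple:

```python
import re

# Sample fragment in the same shape the nmc.cn snippet above assumes.
html = ('<a target="_blank" href="/publish/country/warning/dust.html" '
        'title="中央气象台4月5日06时继续发布沙尘暴蓝色预警">沙尘暴预警</a>')

# One pattern, three groups: link, title and tag text stay in sync per match.
pattern = re.compile(r'<a target="_blank" href="(.+?)" title="(.+?)">(.+?)</a>')
for link, title, tag in pattern.findall(html):
    print(tag, 'http://www.nmc.cn' + link, title)
```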
----------------------------------------------------------------------------------
>>> from bs4 import BeautifulSoup
>>> import urllib.request
>>>
>>> url = 'http://www.nmc.cn'
>>> html = urllib.request.urlopen(url).read()
>>> soup = BeautifulSoup(html,'lxml')
>>> content = soup.select('#alarmtip > ul > li.waring > a')
>>>
>>> for n in content:
...     link = n.get('href')
...     title = n.get('title')
...     tag = n.text
...     print(tag, url + link, title)
...
沙尘暴预警 http://www.nmc.cn/publish/country/warning/dust.html 中央气象台4月5日06时继续发布沙尘暴蓝色预警
>>>
--------------------------------------------------------------------------
Generate a list of page URLs: see https://www.cnblogs.com/xiaxiaoxu/p/7862099.html
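When pagination follows a numeric pattern, a list comprehension is enough to generate the URL list up front (the template below is a made-up example; substitute the real site's pattern):

```python
# Hypothetical pagination template; replace with the target site's actual pattern.
template = 'http://example.com/list_{}.html'
pages = [template.format(i) for i in range(1, 6)]
print(pages)
```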
-------------------------------------------------------------------------
Final goal: all pages obtained.
Result: