Python crawler: how to get complete links (dynamic pages)

Reference: https://blog.csdn.net/hdu09075340/article/details/74202339


-------------------

Reference: https://www.cnblogs.com/hhh5460/p/5044038.html

Four methods

'''
Get all the links on the current page
'''

import requests

import re
from bs4 import BeautifulSoup
from lxml import etree
from selenium import webdriver
from selenium.webdriver.common.by import By      # By is used with find_elements() below

url = 'http://www.ok226.com'
r = requests.get(url)
r.encoding = 'gb2312'      # the page declares gb2312; set it so r.text decodes correctly


# Using re (crude and brute-force!)
matchs = re.findall(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')" , r.text)
for link in matchs:
    print(link)
    
print()


# Using BeautifulSoup4 (DOM tree)
soup = BeautifulSoup(r.text, 'lxml')
for a in soup.find_all('a'):
    link = a.get('href')        # .get() avoids a KeyError on <a> tags without an href
    print(link)
    
print()


# Using lxml.etree (XPath)
tree = etree.HTML(r.text)
for link in tree.xpath("//@href"):
    print(link)
    
print()


# Using selenium (opens a real browser!)
driver = webdriver.Firefox()
driver.get(url)
for link in driver.find_elements(By.TAG_NAME, "a"):     # Selenium 4 API; older versions used find_elements_by_tag_name
    print(link.get_attribute("href"))
driver.close()
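
All four methods above print the href values exactly as they appear in the page, which are often relative paths rather than complete links. A minimal sketch (not from the referenced post; it reuses url and the soup object built above) that resolves them into complete absolute URLs with urllib.parse.urljoin:

# Turn relative hrefs into complete absolute URLs (sketch, reuses soup/url from above)
from urllib.parse import urljoin

for a in soup.find_all('a'):
    href = a.get('href')
    if href:                              # skip anchors without an href
        print(urljoin(url, href))         # e.g. '/list.html' -> 'http://www.ok226.com/list.html'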

----------------------------------------------------------

Reference: https://blog.csdn.net/xtingjie/article/details/73465522


----------------------------------------------------------

Reference: https://blog.csdn.net/linzch3/article/details/72884715

---------------------------------------------------------

Reference (Baidu search for "python 下一页 抓取", i.e. scraping the next page): https://www.baidu.com/s?wd=python+%E4%B8%8B%E4%B8%80%E9%A1%B5+%E6%8A%93%E5%8F%96&ie=utf-8&tn=02049043_27_pg
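
The search above is about following the "next page" link of a listing. A minimal sketch of that loop; the start URL and the a.next selector are purely illustrative assumptions, not taken from any of the referenced posts:

# Follow "next page" links until none is left (sketch; start URL and 'a.next' are assumptions)
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

page_url = 'http://www.example.com/list.html'         # hypothetical start page
while page_url:
    soup = BeautifulSoup(requests.get(page_url).text, 'lxml')
    for a in soup.find_all('a', href=True):           # only anchors that actually have an href
        print(urljoin(page_url, a['href']))           # print complete absolute URLs
    next_a = soup.select_one('a.next')                 # hypothetical "next page" anchor
    page_url = urljoin(page_url, next_a['href']) if next_a else None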

----------------------------------------------------------

Reference: https://blog.csdn.net/wangxw_lzu/article/details/75092603

>>> import  re, urllib.request
>>>
>>> url = 'http://www.nmc.cn'
>>> html = urllib.request.urlopen(url).read()
>>> html = html.decode('utf-8')     # needed in Python 3: urlopen() returns bytes
>>> links = re.findall('<a target="_blank" href="(.+?)" title',html)
>>> titles = re.findall('<a target="_blank" .+? title="(.+?)">',html)
>>> tags = re.findall('<a target="_blank" .+? title=.+?>(.+?)</a>',html)
>>> for link,title,tag in zip(links,titles,tags):
...     print(tag,url+link,title)
...
沙尘暴预警 http://www.nmc.cn/publish/country/warning/dust.html 中央气象台4月5日06时继续发布沙尘暴蓝色预警
>>>

----------------------------------------------------------------------------------

>>> from bs4 import BeautifulSoup
>>> import urllib.request
>>>
>>> url = 'http://www.nmc.cn'
>>> html = urllib.request.urlopen(url).read()
>>> soup = BeautifulSoup(html,'lxml')
>>> content = soup.select('#alarmtip > ul > li.waring > a')
>>>
>>> for n in content:
...     link = n.get('href')
...     title = n.get('title')
...     tag = n.text
...     print(tag, url + link, title)
...
沙尘暴预警 http://www.nmc.cn/publish/country/warning/dust.html 中央气象台4月5日06时继续发布沙尘暴蓝色预警
>>>


--------------------------------------------------------------------------

Generate a list: reference https://www.cnblogs.com/xiaxiaoxu/p/7862099.html
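
The referenced post is about generating such a list of page URLs up front. A minimal sketch; the URL pattern and page count here are illustrative assumptions, not taken from the post:

# Build a list of paginated URLs from a format string (pattern and count are assumptions)
base = 'http://www.example.com/list_{}.html'          # hypothetical URL pattern
page_urls = [base.format(i) for i in range(1, 11)]    # pages 1..10
for u in page_urls:
    print(u)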


-------------------------------------------------------------------------

Finally: all the pages obtained


Result:

