Python crawler: how to get complete links (dynamic pages)

Reference: https://blog.csdn.net/hdu09075340/article/details/74202339


-------------------

Reference: https://www.cnblogs.com/hhh5460/p/5044038.html

Four methods

'''
Get all the links on the current page
'''

import requests

import re
from bs4 import BeautifulSoup
from lxml import etree
from selenium import webdriver
from selenium.webdriver.common.by import By      # By is used with find_elements() below

url = 'http://www.ok226.com'
r = requests.get(url)
r.encoding = 'gb2312'      # the page declares gb2312; set it so r.text decodes correctly


# Using re (crude and brute-force!)
matchs = re.findall(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')" , r.text)
for link in matchs:
    print(link)
    
print()


# Using BeautifulSoup4 (DOM tree)
soup = BeautifulSoup(r.text, 'lxml')
for a in soup.find_all('a'):
    link = a.get('href')        # .get() avoids a KeyError on <a> tags without an href
    print(link)
    
print()


# Using lxml.etree (XPath)
tree = etree.HTML(r.text)
for link in tree.xpath("//@href"):
    print(link)
    
print()


# Using selenium (opens a real browser!)
driver = webdriver.Firefox()
driver.get(url)
for link in driver.find_elements(By.TAG_NAME, "a"):     # Selenium 4 API; older versions used find_elements_by_tag_name
    print(link.get_attribute("href"))
driver.close()
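
All four methods above print the href values exactly as they appear in the page, which are often relative paths rather than complete links. A minimal sketch (not from the referenced post; it reuses url and the soup object built above) that resolves them into complete absolute URLs with urllib.parse.urljoin:

# Turn relative hrefs into complete absolute URLs (sketch, reuses soup/url from above)
from urllib.parse import urljoin

for a in soup.find_all('a'):
    href = a.get('href')
    if href:                              # skip anchors without an href
        print(urljoin(url, href))         # e.g. '/list.html' -> 'http://www.ok226.com/list.html'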

----------------------------------------------------------

Reference: https://blog.csdn.net/xtingjie/article/details/73465522


----------------------------------------------------------

Reference: https://blog.csdn.net/linzch3/article/details/72884715

---------------------------------------------------------

Reference (Baidu search for "python 下一页 抓取", i.e. scraping the next page): https://www.baidu.com/s?wd=python+%E4%B8%8B%E4%B8%80%E9%A1%B5+%E6%8A%93%E5%8F%96&ie=utf-8&tn=02049043_27_pg
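
The search above is about following the "next page" link of a listing. A minimal sketch of that loop; the start URL and the a.next selector are purely illustrative assumptions, not taken from any of the referenced posts:

# Follow "next page" links until none is left (sketch; start URL and 'a.next' are assumptions)
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

page_url = 'http://www.example.com/list.html'         # hypothetical start page
while page_url:
    soup = BeautifulSoup(requests.get(page_url).text, 'lxml')
    for a in soup.find_all('a', href=True):           # only anchors that actually have an href
        print(urljoin(page_url, a['href']))           # print complete absolute URLs
    next_a = soup.select_one('a.next')                 # hypothetical "next page" anchor
    page_url = urljoin(page_url, next_a['href']) if next_a else None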

----------------------------------------------------------

Reference: https://blog.csdn.net/wangxw_lzu/article/details/75092603

>>> import  re, urllib.request
>>>
>>> url = 'http://www.nmc.cn'
>>> html = urllib.request.urlopen(url).read()
>>> html = html.decode('utf-8')     # needed in Python 3: urlopen() returns bytes
>>> links = re.findall('<a target="_blank" href="(.+?)" title',html)
>>> titles = re.findall('<a target="_blank" .+? title="(.+?)">',html)
>>> tags = re.findall('<a target="_blank" .+? title=.+?>(.+?)</a>',html)
>>> for link,title,tag in zip(links,titles,tags):
...     print(tag,url+link,title)
...
沙尘暴预警 http://www.nmc.cn/publish/country/warning/dust.html 中央气象台4月5日06时继续发布沙尘暴蓝色预警
>>>

----------------------------------------------------------------------------------

>>> from bs4 import BeautifulSoup
>>> import urllib.request
>>>
>>> url = 'http://www.nmc.cn'
>>> html = urllib.request.urlopen(url).read()
>>> soup = BeautifulSoup(html,'lxml')
>>> content = soup.select('#alarmtip > ul > li.waring > a')
>>>
>>> for n in content:
...     link = n.get('href')
...     title = n.get('title')
...     tag = n.text
...     print(tag, url + link, title)
...
沙尘暴预警 http://www.nmc.cn/publish/country/warning/dust.html 中央气象台4月5日06时继续发布沙尘暴蓝色预警
>>>


--------------------------------------------------------------------------

Generate a list: reference https://www.cnblogs.com/xiaxiaoxu/p/7862099.html
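
The referenced post is about generating such a list of page URLs up front. A minimal sketch; the URL pattern and page count here are illustrative assumptions, not taken from the post:

# Build a list of paginated URLs from a format string (pattern and count are assumptions)
base = 'http://www.example.com/list_{}.html'          # hypothetical URL pattern
page_urls = [base.format(i) for i in range(1, 11)]    # pages 1..10
for u in page_urls:
    print(u)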


-------------------------------------------------------------------------

Finally: all the pages obtained


Result:

