Get all the links on a page
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Print the href attribute of every <a> tag on the page
def getAllLink(url):
    html = urlopen(url)
    bs = BeautifulSoup(html, 'html.parser')
    for link in bs.find_all('a'):
        if 'href' in link.attrs:  # skip anchors that have no href
            print(link.attrs['href'])
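A minimal usage sketch (the URL here is just an example):

getAllLink('https://www.baidu.com')  # prints every href found on the page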
Get all related links on a Baidu Baike page
import re  # urlopen and BeautifulSoup are imported as in the snippet above

# Within the div with class "content", find <a> tags that open in a new tab
# and whose href points at another Baike entry (paths starting with /item/)
def getAllBaikeLink(url):
    html = urlopen(url)
    bs = BeautifulSoup(html, 'html.parser')
    for link in bs.find('div', {'class': 'content'}).find_all(
            'a', {'target': '_blank'}, href=re.compile('^/item/')):
        if 'href' in link.attrs:
            print(link.attrs['href'])
----------------------------------------equivalent----------------------------------------
def getAllBaikeLink(url):
    html = urlopen(url)
    bs = BeautifulSoup(html, 'html.parser')
    # Same query, with href folded into the attribute dict instead of a keyword
    for link in bs.find('div', {'class': 'content'}).find_all(
            'a', {'target': '_blank', 'href': re.compile('^/item/')}):
        if 'href' in link.attrs:
            print(link.attrs['href'])
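Both forms behave the same: find_all() accepts attribute filters either through the attrs dict or as keyword arguments, so the choice is stylistic. One caveat with the keyword form: class is a Python reserved word, so it must be written class_. A usage sketch (the entry URL is just an example):

getAllBaikeLink('https://baike.baidu.com/item/Python')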
Traverse all internal links of a site (deduplicated)
pages = set()

# Recursively follow internal links, using the global `pages` set so that no
# page is visited twice. The base URL and the href regex are placeholders:
# fill them in for the site being crawled.
def getAllLinks(pagePath):
    global pages
    html = urlopen('https://example.com{}'.format(pagePath))  # placeholder base URL
    bs = BeautifulSoup(html, 'html.parser')
    for link in bs.find_all('a', href=re.compile('^/')):  # placeholder pattern for internal links
        if link.attrs['href'] not in pages:
            # A page we have not seen yet
            newPage = link.attrs['href']
            print(newPage)
            pages.add(newPage)
            getAllLinks(newPage)
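To start the crawl, call the function with the empty path, which together with the placeholder base URL above points at the front page:

getAllLinks('')

Since every unseen link deepens the recursion, a large site can hit Python's default recursion limit (about 1000 frames); an explicit queue of pending links avoids that.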
Traverse all internal links of a site and print page data (deduplicated)
pages = set()

# Same traversal, but also print some data from each page before following
# its links; id='xxx' stands in for a real element id on the target site.
def getAllLinksData(pagePath):
    global pages
    html = urlopen('https://example.com{}'.format(pagePath))  # placeholder base URL
    bs = BeautifulSoup(html, 'html.parser')
    try:
        print(bs.h1.get_text())
        print(bs.find(id='xxx').find('li').find('a').attrs['href'])
    except AttributeError:
        print('This page is missing some attributes')
    for link in bs.find_all('a', href=re.compile('^/')):  # placeholder pattern for internal links
        if link.attrs['href'] not in pages:
            # A page we have not seen yet
            newPage = link.attrs['href']
            print(newPage)
            pages.add(newPage)
            getAllLinksData(newPage)
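The entry call is the same as before, a sketch under the same placeholder assumptions:

getAllLinksData('')

It may also be worth catching urllib.error.HTTPError around urlopen(), since a single dead link would otherwise abort the whole crawl.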