Python Web Scraping: BeautifulSoup

1. Basic usage

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello</p>', 'lxml')
print(soup.p.string)

Output:
Hello

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello</p>', 'lxml')
print(soup.prettify())


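2. Tag selectors

Selecting elements

Syntax: soup.tag_name, where soup is the BeautifulSoup object.
Only the first matching tag is returned.
For example: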

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.title.string)
print(soup.head)
print(soup.p)


Getting the tag name (name)

Syntax: soup.tag_name.name
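As a minimal sketch of reading a tag's name: the HTML string below is made up for illustration, and html.parser is used instead of lxml only so the snippet runs without an extra install.

```python
from bs4 import BeautifulSoup

# Illustrative HTML, not taken from a real page.
html = "<html><head><title>The Dormouse's story</title></head><body><p><b>Hi</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

title_name = soup.title.name   # name of the first <title> tag
bold_name = soup.p.b.name      # name of the <b> tag nested inside the first <p>
print(title_name)
print(bold_name)
```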

Getting attributes (attrs)

Syntax: soup.tag_name.attrs['attribute_name']
Calling attrs on its own returns all of the tag's attributes as a dictionary.

html =  """<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.attrs['class'])
print(soup.a.attrs)


Getting content (string)

Syntax: soup.tag_name.string
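A short sketch of how .string behaves, using a made-up HTML fragment and html.parser (an assumption made to keep the snippet dependency-free). One point worth knowing is that .string is None when a tag has more than one child node.

```python
from bs4 import BeautifulSoup

# Illustrative fragment: <p> contains text, a <b> tag, and more text.
html = "<p>Once <b>upon</b> a time</p>"
soup = BeautifulSoup(html, "html.parser")

bold_text = soup.b.string   # <b> has a single string child
para_text = soup.p.string   # <p> has three children, so .string is None
print(bold_text)
print(para_text)
```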

Nested selection

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string)


Children and descendants

Direct children: soup.tag_name.contents returns all direct child nodes as a list.

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)

Direct children: soup.tag_name.children returns an iterator over the same child nodes.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)

Descendants: soup.tag_name.descendants returns an iterator over all children, grandchildren, and so on.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)


Parents and ancestors

Parent: soup.tag_name.parent
Ancestors: soup.tag_name.parents (an iterator)
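The two attributes can be sketched on a small, made-up document (html.parser is an assumption here, chosen so the snippet needs nothing beyond bs4). The last ancestor is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

# Illustrative document with a <span> nested four levels deep.
html = "<html><body><p class='story'><a id='link1'><span>Elsie</span></a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

parent_name = soup.span.parent.name                        # the enclosing <a>
ancestor_names = [tag.name for tag in soup.span.parents]   # nearest ancestor first
print(parent_name)
print(ancestor_names)
```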

Siblings

Next sibling: soup.tag_name.next_sibling
Previous sibling: soup.tag_name.previous_sibling
All following siblings: soup.tag_name.next_siblings
All preceding siblings: soup.tag_name.previous_siblings
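A small sketch of the sibling attributes on a made-up fragment, parsed with html.parser as an assumption. The tags are written without whitespace between them on purpose: in ordinary pretty-printed HTML the newline between two tags is itself a text node and counts as a sibling.

```python
from bs4 import BeautifulSoup

# No whitespace between the tags, so every sibling is a tag rather than a text node.
html = '<p><a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.a
next_id = first.next_sibling["id"]                       # the tag right after link1
before_first = first.previous_sibling                    # None: nothing precedes link1
later_ids = [tag["id"] for tag in first.next_siblings]

last = list(first.next_siblings)[-1]
earlier_ids = [tag["id"] for tag in last.previous_siblings]   # nearest sibling first
print(next_id, before_first, later_ids, earlier_ids)
```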

3. Standard selectors

find_all(name, attrs, recursive, text, **kwargs)

name (tag name)

soup.find_all('tag_name') returns a list of all matching tags.
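As a sketch of searching by tag name: the list markup below is invented for illustration and parsed with html.parser (an assumption). Because each result is itself a tag, find_all can be called again on it to search only inside that tag.

```python
from bs4 import BeautifulSoup

# Illustrative markup with two lists.
html = (
    '<ul id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>'
    '<ul id="list-2"><li class="element">Jay</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

lists = soup.find_all("ul")                                   # every <ul> in the document
first_list_items = [li.string for li in lists[0].find_all("li")]
print(len(lists))
print(first_list_items)
```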

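attrs (attributes)

soup.find_all(attrs={'attribute_name': 'attribute_value'}) returns a list.
Some common attributes can also be passed directly as keyword arguments. Because class is a reserved word in Python, it is written class_.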

print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))
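A self-contained version of these attribute searches might look like the following. The markup is hypothetical, made up to contain the id 'list-1' and the class 'element' used above, and html.parser is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical markup containing the id and class names being searched for.
html = (
    '<ul id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>'
    '<ul id="list-2"><li class="element">Jay</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

by_attrs = soup.find_all(attrs={"id": "list-1"})   # dictionary form
by_id = soup.find_all(id="list-1")                 # keyword form, same result
by_class = soup.find_all(class_="element")         # class_ avoids the Python keyword
print(by_attrs == by_id)
print([li.string for li in by_class])
```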

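text (text content)

soup.find_all(text='text content') returns a list of the matching strings themselves, not the tags that contain them. Recent versions of Beautiful Soup name this argument string; text is kept as an older alias.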
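A sketch of searching by text on a made-up fragment, parsed with html.parser as an assumption. It uses the newer string= spelling of the argument. A compiled regular expression can be passed in place of an exact string.

```python
import re

from bs4 import BeautifulSoup

# Illustrative list items.
html = '<ul><li>Foo</li><li>Bar</li><li>Jay</li></ul>'
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="Foo")              # exact match on the text node
pattern = soup.find_all(string=re.compile("a"))  # any text node containing "a"
print(exact)
print(pattern)
```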

find(name, attrs, recursive, text, **kwargs)

find() is used the same way as find_all(), but it returns a single element (the first match), or None if nothing matches.

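find_parents() find_parent(): all matching ancestors / the nearest matching ancestor
find_next_siblings() find_next_sibling(): all matching later siblings / the first one
find_all_next() find_next(): all matching nodes after this one / the first one
find_all_previous() find_previous(): all matching nodes before this one / the first one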
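A few of these methods can be sketched together on a made-up fragment (html.parser is an assumption). Each one accepts the same filters as find_all().

```python
from bs4 import BeautifulSoup

# Illustrative markup with two lists.
html = (
    '<ul id="list-1"><li>Foo</li><li>Bar</li></ul>'
    '<ul id="list-2"><li>Jay</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

first_li = soup.find("li")                               # first match only
missing = soup.find("table")                             # no match gives None
parent_id = first_li.find_parent("ul")["id"]             # nearest <ul> ancestor
next_text = first_li.find_next_sibling("li").string      # next <li> at the same level
next_list_id = first_li.find_next("ul")["id"]            # next <ul> in document order
print(first_li.string, missing, parent_id, next_text, next_list_id)
```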

4. CSS selectors

Pass a CSS selector string directly to select(); it returns a list of all matching elements.
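A minimal sketch of select() with a few common selector forms. The markup and its class and id names are invented for illustration, and html.parser is an assumption. select_one() is the companion method that returns only the first match.

```python
from bs4 import BeautifulSoup

# Illustrative markup; class and id names are made up.
html = (
    '<div class="panel">'
    '<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>'
    '<ul class="list list-small" id="list-2"><li class="element">Jay</li></ul>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

lists = soup.select(".panel .list")                  # descendant selector by class
second_items = soup.select("#list-2 .element")       # id followed by class
all_items = soup.select("ul > li")                   # direct-child selector
first_item = soup.select_one("#list-1 li")           # first match only
print(len(lists), len(second_items), len(all_items), first_item.string)
```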

Getting attributes

element['attribute_name'] or element.attrs['attribute_name']
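Both forms can be sketched on an element returned by a CSS selection. The markup is made up, and html.parser is an assumption. Note that class is a multi-valued attribute, so it comes back as a list, and that .get() avoids a KeyError when an attribute is absent.

```python
from bs4 import BeautifulSoup

# Illustrative element with an id and two classes.
html = '<ul class="list list-small" id="list-2"><li class="element">Jay</li></ul>'
soup = BeautifulSoup(html, "html.parser")

ul = soup.select_one("ul")
id_short = ul["id"]            # shorthand form
id_attrs = ul.attrs["id"]      # explicit form, same value
classes = ul["class"]          # multi-valued attribute: a list of class names
absent = ul.get("name")        # None instead of raising KeyError
print(id_short, id_attrs, classes, absent)
```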

Getting content

Text inside a tag: element.get_text() or element.string
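The two differ once a tag contains nested tags, which the following sketch illustrates on a made-up fragment (html.parser is an assumption): get_text() concatenates all text inside the element, while .string is None when there is more than one child node.

```python
from bs4 import BeautifulSoup

# Illustrative list: the second item mixes plain text with a nested <b> tag.
html = '<ul><li class="element">Foo</li><li class="element">Bar <b>Baz</b></li></ul>'
soup = BeautifulSoup(html, "html.parser")

items = soup.select("li")
simple_string = items[0].string       # single text child
simple_text = items[0].get_text()     # same text
nested_string = items[1].string       # two children, so None
nested_text = items[1].get_text()     # all text inside, including the <b> content
print(simple_string, simple_text, nested_string, nested_text)
```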

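5. A small example

Using what this section has covered, the example below scrapes a news page from Northwest University's COVID-19 prevention and control site: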

import requests
from bs4 import BeautifulSoup

# Function 1: request the page
def page_request(url):
    # The header value must not repeat the "User-Agent:" prefix
    ua = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1'}
    resp = requests.get(url, headers=ua)
    print("Request status: %d" % resp.status_code)
    html = resp.content.decode('utf-8')
    return html

# Function 2: parse the page
def page_analysis(html):
    soup = BeautifulSoup(html, 'lxml')
    # Print the page title
    print(soup.title.get_text())
    # Parse the news list: one <li> per news item
    info = []
    info_list = soup.select('ul.lm_list > li')
    for item in info_list:
        title = item.a.attrs['title']
        url = "http://yqfk.nwu.edu.cn/" + item.a.attrs['href']
        date = item.span.string
        info_item = {
            'title': title,
            'url': url,
            'date': date
        }
        info.append(info_item)
    print(info)
    return info
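# Write the results to a CSV file
def csv_def(info):
    import csv
    with open(r'D:\nwu.csv', 'a', encoding='utf-8-sig', newline='') as cf:
        w = csv.DictWriter(cf, fieldnames=['title', 'url', 'date'])
        # The file is opened in append mode, so write the header only when it is empty
        if cf.tell() == 0:
            w.writeheader()
        w.writerows(info)
        print("Scraping finished!")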


url = 'http://yqfk.nwu.edu.cn/xxdt.htm'
html = page_request(url)
info = page_analysis(html)
csv_def(info)
