Web Crawler Study Notes (2)

Latest recommended article published on 2021-02-04 06:08:30

黑码

Category column: Crawler Learning

Article link: https://blog.csdn.net/Littlewhite520/article/details/104225330


Web Crawler Rules: Extraction

The Beautiful Soup library is also known as beautifulsoup4 or bs4.
By convention it is imported as follows; the BeautifulSoup class is the part you mainly use:
from bs4 import BeautifulSoup
import bs4

Four parsers:
soup = BeautifulSoup('<html>data</html>', 'html.parser')
Parser                 Usage                               Requirement
bs4's HTML parser      BeautifulSoup(mk, 'html.parser')    install the bs4 library
lxml's HTML parser     BeautifulSoup(mk, 'lxml')           pip install lxml
lxml's XML parser      BeautifulSoup(mk, 'xml')            pip install lxml
html5lib's parser      BeautifulSoup(mk, 'html5lib')       pip install html5lib
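
As a minimal sketch of the parser choice above, the following builds a soup from a tiny document with 'html.parser', which relies only on Python's standard library once bs4 is installed. The markup string is a made-up example; switching to 'lxml' or 'html5lib' would need the corresponding pip install listed in the table.

```python
from bs4 import BeautifulSoup

# Made-up markup, for illustration only.
markup = "<html>data</html>"

# The second argument selects the parser; 'html.parser' needs no extra package.
soup = BeautifulSoup(markup, "html.parser")

print(soup.html.name)    # name of the outermost tag
print(soup.html.string)  # text between <html> and </html>
```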

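Five basic elements of the BeautifulSoup class:
Tag: a tag, the most basic unit for organizing information; it opens with <> and closes with </>
Name: the tag's name; the name of <p>…</p> is 'p'. Format: <tag>.name
Attributes: the tag's attributes, organized as a dictionary. Format: <tag>.attrs
NavigableString: the non-attribute string inside a tag, i.e. the string between <> and </>. Format: <tag>.string
Comment: the comment part of a string inside a tag, represented by the special Comment type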
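
The five elements can be seen together in a short sketch. The markup below is invented for the illustration; the class names come from bs4.element. Note that Comment is a subclass of NavigableString, so <tag>.string can return either one, and it is worth checking the type before treating the result as visible text.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Made-up markup: one tag holding plain text, one holding only a comment.
markup = '<p class="title" id="first">Plain text</p><b><!--A comment--></b>'
soup = BeautifulSoup(markup, "html.parser")

p = soup.p
print(isinstance(p, Tag))   # Tag: the <p> element itself
print(p.name)               # Name: 'p'
print(p.attrs)              # Attributes: a dict; 'class' is stored as a list
print(type(p.string))       # NavigableString: the text inside <p>
print(type(soup.b.string))  # Comment: the comment inside <b>

# Distinguish a comment from ordinary text before using it.
is_comment = isinstance(soup.b.string, Comment)
```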

Tag:
Any tag that exists in HTML syntax can be accessed with soup.<tag>
When the HTML document contains several elements with the same <tag>, soup.<tag> returns the first one
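
A small sketch of the "first match" behaviour, using invented markup with two <a> tags. soup.<tag> gives only the first one, find_all() gives every match, and a tag that does not occur in the document yields None.

```python
from bs4 import BeautifulSoup

# Made-up markup with two links.
markup = (
    '<p><a href="http://example.com/1" id="link1">First</a>'
    '<a href="http://example.com/2" id="link2">Second</a></p>'
)
soup = BeautifulSoup(markup, "html.parser")

first = soup.a                  # only the first <a>
all_links = soup.find_all("a")  # every <a> in document order
missing = soup.table            # no <table> in this document, so None

print(first["id"])
print([a["id"] for a in all_links])
print(missing)
```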

Example: scraping the "Best Universities" (zuihaodaxue) ranking:

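import requests
import bs4
from bs4 import BeautifulSoup


def getHTMLText(url):
    """Fetch the page and return its text, or "" on any request error."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""


def fillUnivList(ulist, html):
    """Append [rank, name, score] for each table row to ulist."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find('tbody')
    if tbody is None:  # empty or unexpected page: nothing to extract
        return
    for tr in tbody.children:
        # .children also yields whitespace strings, so keep only Tag nodes
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')  # shorthand for tr.find_all('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])


def printUnivList(ulist, num):
    # chr(12288) is the full-width space; it pads the Chinese name column
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    # Header: rank, school name, total score
    print(tplt.format("排名", "学校名称", "总分", chr(12288)))
    for u in ulist[:num]:  # slicing avoids IndexError on a short list
        print(tplt.format(u[0], u[1], u[2], chr(12288)))


def main():
    uinfo = []
    url = 'https://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)  # first 20 universities


if __name__ == '__main__':
    main()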
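
The example depends on a live page, so here is an offline sketch of the same two ideas: filtering .children down to Tag nodes, and padding the name column with the full-width space chr(12288). The table, the school names, and the helper names extract_rows and format_row are all made up for the illustration and are not part of the original program.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up table standing in for the ranking page; all values are placeholders.
SAMPLE_HTML = """
<table><tbody>
<tr><td>1</td><td>甲大学</td><td>北京</td><td>95.0</td></tr>
<tr><td>2</td><td>乙大学</td><td>上海</td><td>90.5</td></tr>
</tbody></table>
"""


def extract_rows(html):
    """Return [rank, name, score] for every <tr> inside <tbody>."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find("tbody")
    rows = []
    if tbody is None:
        return rows
    for child in tbody.children:
        # The newlines between rows show up as strings, so skip non-Tag nodes.
        if isinstance(child, Tag):
            tds = child.find_all("td")  # same result as child("td")
            rows.append([tds[0].string, tds[1].string, tds[3].string])
    return rows


def format_row(rank, name, score):
    """Centre each field in width 10; the name is padded with full-width spaces."""
    pad = chr(12288)  # U+3000, as wide as one CJK character
    return "{0:^10}\t{1:{3}^10}\t{2:^10}".format(rank, name, score, pad)


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(format_row(*row))
```

Padding with the ordinary half-width space would leave the columns ragged, because each Chinese character occupies roughly two half-width cells; padding with U+3000 keeps the name column at a constant visual width in most monospaced CJK fonts.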
