Python Web Data Scraping, Part 2

Latest recommended article published 2020-04-28 12:27:56

Charben

Category: Python web scraping. Tags: Python web scraping

Article link: https://blog.csdn.net/Charben/article/details/78339967

The BeautifulSoup Library

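In the previous scraping post we only downloaded the raw page source. That source still has to be processed further, which means parsing it, and for that we use the BeautifulSoup library.

Let's look at how it introduces itself:

“You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects.”

Don't bother translating that passage yourself; just read the Chinese BeautifulSoup documentation (http://beautifulsoup.readthedocs.io/zh_CN/latest/). It says: Beautiful Soup is a Python library for extracting data from HTML or XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the document. Beautiful Soup can save you hours or even days of work.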

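import requests
from bs4 import BeautifulSoup

r = requests.get('https://book.douban.com/subject/27031869/?icn=index-editionrecommend/')
# Using the BeautifulSoup library: parse the page source with the lxml parser
soup = BeautifulSoup(r.text, "lxml")
# Collect every <p class="comment-content"> tag
comments = soup.find_all('p', 'comment-content')
for item in comments:
    # item.string is None when the tag contains nested tags
    print(item.string)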

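Creating the soup object

soup = BeautifulSoup(document, "parser")  # document: the HTML/XML text to parse; "parser": the parser name, e.g. "lxml"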
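As a minimal sketch of the constructor call above, the snippet below parses a small inline HTML string instead of a live page. The sample markup and the choice of Python's built-in "html.parser" are assumptions made for illustration; a parser name such as "lxml" goes in the same position when that package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for a downloaded page
html = """
<html><body>
  <p class="comment-content">Great book.</p>
  <p class="comment-content">Worth <b>reading</b> twice.</p>
  <p class="other">Not a comment.</p>
</body></html>
"""

# First argument: the document text; second argument: the parser name
soup = BeautifulSoup(html, "html.parser")

# find_all(tag name, CSS class) returns every matching tag
comments = soup.find_all("p", "comment-content")

# get_text() joins all text inside a tag, nested tags included
texts = [p.get_text() for p in comments]
# .string is None as soon as a tag has more than one child
strings = [p.string for p in comments]

print(texts)
print(strings)
```

The second comment contains a nested <b> tag, so its .string is None while get_text() still returns the full sentence.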

Hands-on project: a targeted crawler for Chinese university rankings

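# -*- coding: utf-8 -*-
import requests
import bs4
from bs4 import BeautifulSoup  # the "lxml" parser used below needs the lxml package installed


### Fetch the page content; return an empty string if the request fails
def gethtml(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        # Guess the encoding from the body so Chinese text decodes correctly
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        print("Request failed")
        return ""


### Parse the crawled page and append [rank, university, total score] rows to ulist
def dealDate(ulist, html):
    soup = BeautifulSoup(html, "lxml")
    tbody = soup.find('tbody')
    if tbody is None:
        return  # no table in the page, e.g. because the request failed
    for tr in tbody.children:
        # .children also yields whitespace strings between rows; keep only tags
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')  # shorthand for tr.find_all('td')
            if len(tds) >= 3:
                ulist.append([tds[0].string, tds[1].string, tds[2].string])


### Print the first num rows of the ranking
def printUnitlist(ulist, num):
    tplt = "{:^10}\t{:^6}\t{:^10}"
    print(tplt.format("Rank", "University", "Total score"))
    for u in ulist[:num]:
        # str() guards against cells whose .string is None
        print(tplt.format(str(u[0]), str(u[1]), str(u[2])))


def main():
    uinfo = []
    url = "http://www.zuihaodaxue.com/shengyuanzhiliangpaiming2017.html"
    html = gethtml(url)
    dealDate(uinfo, html)
    printUnitlist(uinfo, 20)


main()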
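The table-walking step can be tried without network access. The sketch below runs the same tbody traversal on a hypothetical two-row table; the markup, the placeholder names and scores, and the use of "html.parser" are all assumptions for illustration, not data from the ranking site.

```python
import bs4
from bs4 import BeautifulSoup

# Hypothetical table with the shape the crawler expects: rank, name, score
html = """
<table><tbody>
<tr><td>1</td><td>University A</td><td>100.0</td></tr>
<tr><td>2</td><td>University B</td><td>98.1</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find("tbody").children:
    # .children yields the newline strings between rows too; keep only tags
    if isinstance(tr, bs4.element.Tag):
        tds = tr("td")  # calling a tag is shorthand for tr.find_all("td")
        rows.append([td.string for td in tds[:3]])

print(rows)
```

The isinstance check matters here: without it, the newline strings between the <tr> tags would be treated as rows and the call tr("td") would fail.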

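Regular Expressions

Of course, we can also use the regular expression library to extract key information.

Regular expressions (regular expression, regex, RE) are used to extract key information from a page. A regular expression is a concise way to express the features or pattern shared by a set of strings. Typical uses:

Expressing the features of a type of text
Finding or replacing a set of strings at once
Matching strings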
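The three uses listed above can be sketched with the standard re module. The sample text and the deliberately simplified e-mail pattern are assumptions for illustration; the pattern is not a complete e-mail validator.

```python
import re

text = "Contact: alice@example.com, bob@example.org"

# 1. Express the features of a type of text: a simplified e-mail pattern
email = r"[\w.]+@[\w.]+\.\w+"

# 2. Find or replace a whole set of strings at once
found = re.findall(email, text)
masked = re.sub(email, "<hidden>", text)

# 3. Match a string against the pattern
is_email = re.fullmatch(email, "carol@example.net") is not None

print(found)
print(masked)
print(is_email)
```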

Regular expression syntax:

Classic regular expressions for common strings:

Example: a regular expression that matches an IP address
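The original figure for this example is not available, so the following is one common way to write an IPv4 pattern, not necessarily the author's exact expression. Each of the four parts must lie in 0 to 255, which is split into the ranges 0-99, 100-199, 200-249, and 250-255. The helper name is_ipv4 is hypothetical.

```python
import re

# One part of the address: 0-99, 100-199, 200-249, or 250-255
octet = r"(?:[1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])"

# Three parts followed by a dot, then a final part
ip_pattern = re.compile(r"(?:" + octet + r"\.){3}" + octet)


def is_ipv4(s):
    """Return True if the whole string is a dotted IPv4 address."""
    return ip_pattern.fullmatch(s) is not None


print(is_ipv4("192.168.0.1"))
print(is_ipv4("256.1.1.1"))
print(is_ipv4("1.2.3"))
```

Using fullmatch instead of search ensures that the entire string is an address, so input such as "256.1.1.1" is rejected rather than partially matched.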

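The re library normally writes regular expressions as raw strings (a raw string is a string literal in which a backslash is kept as an ordinary character instead of starting an escape sequence): r'text'

For example, a mainland China postal code: r'[1-9]\d{5}'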
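A short sketch of what the raw string buys you, using the postal code pattern above. The sample strings being searched are assumptions for illustration.

```python
import re

# In a raw string the backslash is kept as-is, so no doubling is needed
same_pattern = r"\d" == "\\d"
raw_length = len(r"\n")  # two characters: a backslash and the letter n

# Postal code: a non-zero digit followed by five more digits
postcode = r"[1-9]\d{5}"
first = re.search(postcode, "BIT 100081")
codes = re.findall(postcode, "BIT100081 TSU100084")

print(same_pattern, raw_length)
print(first.group(0))
print(codes)
```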

Main functions of the re library:

Function details:

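re.search(pattern, string, flags=0)

pattern: the regular expression, written as a string or a raw string

string: the string to be searched

flags: control flags that modify how the regular expression is applied

re.IGNORECASE: ignore letter case when matching

re.MULTILINE: let ^ match at the start of every line of the given string, not only at the start of the string

re.DOTALL: let . match every character including newline (by default . matches every character except newline)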
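The effect of each flag is easiest to see side by side. This is a minimal sketch on a made-up three-line string; the variable names are only for illustration.

```python
import re

text = "Python\npython rocks\nJava"

# IGNORECASE: letters match regardless of case
any_case = re.findall(r"python", text, flags=re.IGNORECASE)

# MULTILINE: ^ matches at the start of every line, not just the string
first_only = re.findall(r"^\w+", text)
every_line = re.findall(r"^\w+", text, flags=re.MULTILINE)

# DOTALL: . also matches the newline character
without_dotall = re.search(r"Python.python", text)
with_dotall = re.search(r"Python.python", text, flags=re.DOTALL)

print(any_case)
print(first_only, every_line)
print(without_dotall, with_dotall.group(0) if with_dotall else None)
```

Flags can be combined with the | operator, for example flags=re.IGNORECASE | re.MULTILINE.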

re.split(pattern, string, maxsplit=0, flags=0)

maxsplit: the maximum number of splits; whatever is left over is returned as the last element (0 means no limit)
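To make maxsplit concrete, the sketch below splits a made-up string on postal codes, first without a limit and then with maxsplit=1.

```python
import re

text = "BIT100081 TSU100084 PKU100871"
postcode = r"[1-9]\d{5}"

# No limit: split at every match; the trailing empty string is what
# follows the last postal code
all_parts = re.split(postcode, text)

# maxsplit=1: split once, then return the remainder untouched
one_split = re.split(postcode, text, maxsplit=1)

print(all_parts)
print(one_split)
```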

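An equivalent way to use the re library (object-oriented style: compile once, then reuse the result many times):

pat = re.compile(r'text')  # compile the pattern string into a regular expression object

rst = pat.search(string)  # afterwards, work only with the pat object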
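A small sketch of the compiled style, reusing one pattern object across several strings. The list of input lines is an assumption for illustration.

```python
import re

# Compile once
pat = re.compile(r"[1-9]\d{5}")

lines = ["BIT 100081", "no code here", "TSU 100084"]

# Reuse the same object for every line; search returns None on no match
hits = []
for line in lines:
    m = pat.search(line)
    if m:
        hits.append(m.group(0))

# The compiled object offers the same methods as the module-level functions
all_codes = pat.findall(" ".join(lines))

print(hits)
print(all_codes)
```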

Match object: holds the information about one regular expression match
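The information a Match object carries can be inspected through its attributes and methods. The sketch below uses a made-up input string; note that search returns only the first match.

```python
import re

m = re.search(r"[1-9]\d{5}", "BIT100081 TSU100084")

searched = m.string      # the string that was searched
pattern = m.re.pattern   # the pattern text that produced the match
matched = m.group(0)     # the matched substring
start = m.start()        # index where the match begins
end = m.end()            # index just past the end of the match
span = m.span()          # (start, end) as a tuple

print(searched, pattern)
print(matched, start, end, span)
```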

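Greedy matching: by default the re library matches greedily, that is, it returns the longest substring that matches.

Minimal matching: to get the shortest match, add a question mark (?) after the quantifier, for example *?, +?, ?? or {m,n}?.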
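The difference between the two modes shows up whenever several endings are possible. A minimal sketch on a made-up string with four candidate end points:

```python
import re

text = "PYANBNCNDN"

# Greedy (default): .* takes as much as possible, so the match runs to the last N
greedy = re.search(r"PY.*N", text).group(0)

# Minimal: .*? takes as little as possible, so the match stops at the first N
minimal = re.search(r"PY.*?N", text).group(0)

print(greedy)
print(minimal)
```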

I look forward to your comments; they help me improve!
