Python Web Crawling and Information Extraction

GrandNovice

Article link: https://blog.csdn.net/XXMRXXX/article/details/101639400

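Contents

1. Getting Started with the Requests Library
2. Web Crawler Etiquette
3. Crawling Examples with the Requests Library
4. Getting Started with the Beautiful Soup Library
5. Organizing and Extracting Information
6. Example 1: Chinese University Ranking Crawler
7. The Regular Expression Library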

The Website is the API
Goal: build basic skills in targeted web data crawling and web page parsing.

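1. Getting Started with the Requests Library

1.1 Installing Requests and Its Methods

The requests library has seven main methods: request(), get(), head(), post(), put(), patch(), and delete().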

import requests
r = requests.get("http://www.baidu.com")
r.status_code

Output:
200

r.encoding = 'utf-8'
r.text

Output (abridged; section 1.2 shows the full text):
'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8> ... <title>百度一下，你就知道</title> ... </html>\r\n'

1.2 The get() Method of the Requests Library

import requests
r = requests.get("http://www.baidu.com")
print(r.status_code)  # 200 means the request succeeded

Output:
200

type(r)

Output:
requests.models.Response

r.headers

Output:

{'Content-Encoding': 'gzip', 'Transfer-Encoding': 'chunked', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Content-Type': 'text/html', 'Date': 'Sat, 28 Sep 2019 14:09:11 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:32 GMT', 'Connection': 'Keep-Alive', 'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Server': 'bfe/1.0.8.18', 'Pragma': 'no-cache'}

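Workflow: request the page, check the status code, inspect the text, then compare r.encoding with r.apparent_encoding and correct the encoding if the text is garbled.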

import requests
r = requests.get("http://www.baidu.com")
r.status_code  # 200 means the request succeeded

r.text

'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>ç\x99¾åº¦ä¸\x80ä¸\x8bï¼\x8cä½\xa0å°±ç\x9f¥é\x81\x93</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=ç\x99¾åº¦ä¸\x80ä¸\x8b class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>æ\x96°é\x97»</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>å\x9c°å\x9b¾</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>è§\x86é¢\x91</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>è´´å\x90§</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>ç\x99»å½\x95</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">ç\x99»å½\x95</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">æ\x9b´å¤\x9aäº§å\x93\x81</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>å\x85³äº\x8eç\x99¾åº¦</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>ä½¿ç\x94¨ç\x99¾åº¦å\x89\x8då¿\x85è¯»</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>æ\x84\x8fè§\x81å\x8f\x8dé¦\x88</a>&nbsp;äº¬ICPè¯\x81030173å\x8f·&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'

r.encoding

'ISO-8859-1'

r.apparent_encoding

'utf-8'

r.encoding = 'utf-8'

r.text

'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下，你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">登录</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'

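The garbled title in the first r.text above is what UTF-8 bytes look like when they are decoded as ISO-8859-1, the encoding requests falls back to when a text Content-Type header carries no charset (as in the headers shown earlier). A minimal standard-library sketch of the same effect, using the page title as sample data; it makes no network request and the variable names are illustrative:

```python
# Sample data: the Baidu page title, encoded as the UTF-8 bytes a server would send.
raw = "百度一下，你就知道".encode("utf-8")

# Decoding with the wrong codec does not fail; it silently produces mojibake.
wrong = raw.decode("iso-8859-1")

# Decoding with the codec the bytes were written in recovers the text.
right = raw.decode("utf-8")

print(repr(wrong))
print(right)
```

This is why setting r.encoding = 'utf-8' (or r.encoding = r.apparent_encoding) before reading r.text fixes the display.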

1.3 A General Code Framework for Fetching Web Pages
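A minimal sketch of such a framework, modelled on the getHTMLText function used in section 6.2. The function name get_html_text, the timeout default, and the choice to catch requests.RequestException are assumptions of this sketch:

```python
import requests


def get_html_text(url, timeout=30):
    """Return the decoded page text, or an empty string if the request fails."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()              # raise HTTPError on 4xx/5xx responses
        r.encoding = r.apparent_encoding  # use the encoding detected from the content
        return r.text
    except requests.RequestException:
        # Covers connection errors, timeouts, malformed URLs and HTTP error statuses.
        return ""
```

Calling get_html_text("http://www.baidu.com") returns the page text on success; any request failure yields an empty string instead of an exception, so callers can test the result before parsing it.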

1.4 The HTTP Protocol and Requests Library Methods

1.5 The Main Requests Methods in Detail

1.6 Unit Summary

2. Web Crawler Etiquette

2.1 Problems Caused by Web Crawlers

2.2 The Robots Protocol

A site that provides no robots.txt file places no Robots-protocol restrictions on web crawlers.
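One way to check a set of Robots-protocol rules before crawling is the standard library's urllib.robotparser. The sketch below parses a hypothetical robots.txt supplied as text, so it needs no network access; the crawler names and example.com URLs are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of fetching a real file.
rules = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# A generic crawler may fetch public pages but not anything under /private/.
print(parser.can_fetch("MyCrawler", "http://example.com/index.html"))
print(parser.can_fetch("MyCrawler", "http://example.com/private/data.html"))
# BadBot is barred from the whole site.
print(parser.can_fetch("BadBot", "http://example.com/index.html"))
```

For a live site, RobotFileParser also offers set_url() and read() to download the file from the site root instead of parsing local text.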

2.3 How to Comply with the Robots Protocol

2.4 Unit Summary

3. Crawling Examples with the Requests Library

3.1 Example 1: Fetching a JD.com Product Page

import requests
url = "https://item.jd.com/2967929.html"
try:
    r = requests.get(url)
    r.raise_for_status()  # raises HTTPError for 4xx/5xx responses
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except requests.RequestException:
    print("Fetch failed!")

<!DOCTYPE HTML>
<html lang="zh-CN">
<head>
    <!-- shouji -->
    <meta http-equiv="Content-Type" content="text/html; charset=gbk" />
    <title>【华为荣耀8】荣耀8 4GB+64GB 全网通4G手机 魅海蓝【行情 报价 价格 评测】-京东</title>
    <meta name="keywords" content="HUAWEI荣耀8,华为荣耀8,华为荣耀8报价,HUAWEI荣耀8报价"/>
    <meta name="description" content="【华为荣耀8】京东JD.COM提供华为荣耀8正品行货，并包括HUAWEI荣耀8网购指南，以及华为荣耀8图片、荣耀8参数、荣耀8评论、荣耀8心得、荣耀8技巧等信息，网购华为荣耀8上京东,放心又轻松" />
    <meta name="format-detection" content="telephone=no">
    <meta http-equiv="mobile-agent" content="format=xhtml; url=//item.m.jd.com/product/2967929.html">
    <meta http-equiv="mobile-agent" content="format=html5; url=//item.m.jd.com/product/2967929.html">
    <meta http-equiv="X-UA-Compatible" content="IE=Edge">
    <link rel="canonical" href="//item.jd.com/2967929.html"/>
        <link rel="dns-prefetch" href="//misc.360buyimg.com"/>
    <link rel="dns-prefetch" href="//static.360buyimg.com"/>
    <link rel="dns-prefetch" href="//img10.360buyimg.com"/>
    <link rel="dns

3.2 Example 2: Fetching an Amazon Product Page

import requests
url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
try:
    kv = {'user-agent': 'Mozilla/5.0'}
    r = requests.get(url, headers=kv)  # override the headers to mimic a browser
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[1000:2000])
except requests.RequestException:
    print("Fetch failed!")

Output:

b(ue,"onunload");ue.stub(ue,"onflush");

(function(d,e){function h(f,b){if(!(a.ec>a.mxe)&&f){a.ter.push(f);b=b||{};var c=f.logLevel||b.logLevel;c&&c!==k&&c!==m&&c!==n&&c!==p||a.ec++;c&&c!=k||a.ecf++;b.pageURL=""+(e.location?e.location.href:"");b.logLevel=c;b.attribution=f.attribution||b.attribution;a.erl.push({ex:f,info:b})}}function l(a,b,c,e,g){d.ueLogError({m:a,f:b,l:c,c:""+e,err:g,fromOnError:1,args:arguments},g?{attribution:g.attribution,logLevel:g.logLevel}:void 0);return!1}var k="FATAL",m="ERROR",n="WARN",p="DOWNGRADED",a={ec:0,ecf:0,
pec:0,ts:0,erl:[],ter:[],mxe:50,startTimer:function(){a.ts++;setInterval(function(){d.ue&&a.pec<a.ec&&d.uex("at");a.pec=a.ec},1E4)}};l.skipTrace=1;h.skipTrace=1;h.isStub=1;d.ueLogError=h;d.ue_err=a;e.onerror=l})(ue_csm,window);

ue.stub(ue,"event");ue.stub(ue,"onSushiUnload");ue.stub(ue,"onSushiFlush");

var ue_url='/gp/product/B01M8L5Z3Y/uedata/unsticky/459-2123700-1157113/NoPageType/ntpoffrw',
ue_sid='459-2123700-1157113',
ue_mid='AAHKV2X7AFYLW',

3.3 Example 3: Submitting Search Keywords to Baidu and 360

Submit a keyword to a search site and retrieve the results.

import requests
url = "http://www.baidu.com/s"
keyword = "Python"  # the keyword to search for
try:
    kv = {'wd': keyword}
    r = requests.get(url, params=kv)  # params adds the key-value pairs to the URL's query string
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except requests.RequestException:
    print("Fetch failed!")

Output:

http://www.baidu.com/s?wd=Python
371260
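The printed URL shows that the params dictionary ends up in the query string. A rough standard-library sketch of that encoding step, using urllib.parse.urlencode as a stand-in (it is not the code requests runs internally, and build_search_url is a helper name introduced here):

```python
from urllib.parse import urlencode

base = "http://www.baidu.com/s"


def build_search_url(base, params):
    """Append URL-encoded key-value pairs to base as a query string."""
    return base + "?" + urlencode(params)


print(build_search_url(base, {"wd": "Python"}))
# Non-ASCII keywords are percent-encoded as UTF-8 bytes.
print(build_search_url(base, {"wd": "爬虫"}))
```

Passing params to requests.get() is still the more convenient route, since the library handles the encoding and the joining in one call.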

3.4 Example 4: Downloading and Saving an Image from the Web

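# picture
import requests
import os
url = "http://b.hiphotos.baidu.com/image/pic/item/0eb30f2442a7d9337119f7dba74bd11372f001e0.jpg"
root = "F://pics//"
path = root + url.split('/')[-1]  # save path: root plus the last '/'-separated part of the URL
try:
    if not os.path.exists(root):  # create the root directory if it does not exist
        os.mkdir(root)
    if not os.path.exists(path):  # download and write the file only if it does not exist yet
        r = requests.get(url)  # fetch the image
        r.raise_for_status()
        with open(path, 'wb') as f:  # the with block closes the file automatically
            f.write(r.content)  # r.content is the response body as bytes
        print("File saved")
    else:
        print("File already exists")
except (requests.RequestException, OSError):
    print("Fetch failed")

Output:
File saved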

3.5 Example 5: Automatic IP Address Location Lookup

import requests
url = "http://m.ip138.com/ip.asp?ip="
try:
    r = requests.get(url+'202.204.80.112')
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-500:])
except requests.RequestException:
    print("Fetch failed")

Output:

value="查询" class="form-btn" />
					</form>
				</div>
				<div class="query-hd">ip138.com IP查询(搜索IP地址的地理位置)</div>
				<h1 class="query">您查询的IP：202.204.80.112</h1><p class="result">本站主数据：北京市海淀区 北京理工大学 教育网</p><p class="result">参考数据一：北京市 北京理工大学</p>

			</div>
		</div>

		<div class="footer">
			<a href="http://www.miitbeian.gov.cn/" rel="nofollow" target="_blank">沪ICP备10013467号-1</a>
		</div>
	</div>

	<script type="text/javascript" src="/script/common.js"></script></body>
</html>

3.6 Unit Summary

4. Getting Started with the Beautiful Soup Library

4.1 Installing the Beautiful Soup Library

pip install beautifulsoup4


import requests

r = requests.get("https://python123.io/ws/demo.html")
r.text

'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'

demo = r.text  # demo holds the HTML text

from bs4 import BeautifulSoup

soup = BeautifulSoup(demo, "html.parser")  # two arguments: the HTML text and the parser name "html.parser"

print(soup.prettify())

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

4.2 Basic Elements of the Beautiful Soup Library

from bs4 import BeautifulSoup
soup = BeautifulSoup(demo, "html.parser")
soup.title

<title>This is a python demo page</title>

tag = soup.a

tag

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

soup.a.name

'a'

soup.a.parent.name

'p'

soup.a.parent.parent.name

'body'

# tag attributes
tag = soup.a
tag.attrs

{'class': ['py1'],
 'href': 'http://www.icourse163.org/course/BIT-268001',
 'id': 'link1'}

tag.attrs['class']  # value of the class attribute: a list whose first element is 'py1'

['py1']

tag.attrs['href']  # the tag's link (href) attribute

'http://www.icourse163.org/course/BIT-268001'

type(tag.attrs)  # type of the tag's attributes

dict

type(tag)  # type of the tag itself

bs4.element.Tag

soup.a.string

'Basic Python'

soup.p.string

'The demo python introduces several python courses.'

type(soup.p.string)

bs4.element.NavigableString

newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>","html.parser")  # in HTML, <!-- starts a comment

newsoup.b.string

'This is a comment'

newsoup.p.string

'This is not a comment'

type(newsoup.p.string)

bs4.element.NavigableString

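Because .string drops the <!-- --> markers, the two results above look alike even though the first one came from a comment. A short sketch of telling them apart by type; it assumes bs4 is installed, and is_comment is a helper name introduced here:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)


def is_comment(node):
    """Return True if a .string value came from an HTML comment."""
    return isinstance(node, Comment)


print(is_comment(newsoup.b.string))  # the b tag holds only a comment
print(is_comment(newsoup.p.string))  # the p tag holds ordinary text
```

Comment is a subclass of NavigableString, so a check against NavigableString alone would not separate the two cases.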

4.3 Traversing HTML Content with the bs4 Library

.contents returns a list.
.children and .descendants return iterators, which are normally consumed in a for loop.

import requests
r = requests.get("https://python123.io/ws/demo.html")
demo = r.text
demo

'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'

soup = BeautifulSoup(demo, "html.parser")

soup.head

<head><title>This is a python demo page</title></head>

soup.head.contents  # list of all direct children of the head tag

[<title>This is a python demo page</title>]

soup.body.contents

['\n',
 <p class="title"><b>The demo python introduces several python courses.</b></p>,
 '\n',
 <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
 <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>,
 '\n']

len(soup.body.contents)  # number of direct children of body

5

soup.body.contents[0]

'\n'

for child in soup.body.descendants:
    print(child)

<p class="title"><b>The demo python introduces several python courses.</b></p>
<b>The demo python introduces several python courses.</b>
The demo python introduces several python courses.


<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Basic Python
 and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
Advanced Python

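.parent returns a single node (the parent tag), not a list.
.parents returns an iterator over all ancestors, normally consumed in a for loop.

# upward traversal of the tag tree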
import requests
from bs4 import BeautifulSoup
r = requests.get("https://python123.io/ws/demo.html")
demo = r.text
demo

'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'

soup = BeautifulSoup(demo, "html.parser")

soup.title.parent

<head><title>This is a python demo page</title></head>

soup.html.parent  # the parent of html is the BeautifulSoup object itself ([document]), which prints the same as the html tag

<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>

soup.parent  # the soup object has no parent, so this is None and nothing is displayed

for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)

p
body
html
[document]

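.next_sibling and .previous_sibling each return a single node (or None).
.next_siblings and .previous_siblings return iterators, normally consumed in a for loop.

# sibling (parallel) traversal of the tag tree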
import requests
from bs4 import BeautifulSoup
r = requests.get("https://python123.io/ws/demo.html")
demo = r.text
demo

'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'

soup = BeautifulSoup(demo, "html.parser")

soup.a.next_sibling

' and '

type(soup.a.next_sibling)  # the next sibling of the a tag is a string, not a tag

bs4.element.NavigableString

soup.a.next_sibling.next_sibling  # the sibling after the a tag's next sibling

<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>

soup.a.previous_sibling  # the node just before the a tag

'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'

type(soup.a.previous_sibling)  # not a tag

bs4.element.NavigableString

soup.a.previous_sibling.previous_sibling

No output: the a tag's previous sibling has no sibling before it, so the result is None.

soup.a.parent  # the parent of the a tag

<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>

4.4 HTML Formatting and Encoding with the bs4 Library

How can an HTML page be displayed more readably?

# 
import requests
from bs4 import BeautifulSoup
r = requests.get("https://python123.io/ws/demo.html")
demo = r.text
demo

'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'

soup = BeautifulSoup(demo, 'html.parser')

soup.prettify()

'<html>\n <head>\n  <title>\n   This is a python demo page\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The demo python introduces several python courses.\n   </b>\n  </p>\n  <p class="course">\n   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n    Basic Python\n   </a>\n   and\n   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n    Advanced Python\n   </a>\n   .\n  </p>\n </body>\n</html>'

print(soup.a.prettify())  # prettify() adds line breaks and indentation so the tag structure is easy to read

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>

4.5 Unit Summary

5. Organizing and Extracting Information

5.1 Three Forms of Information Markup

Three general-purpose forms of information markup:
1. XML
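An element with content is written as a pair of tags (<name> ... </name>); an element with no content can be written as a single self-closing tag (<name />); comments can be added with <!-- ... -->.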
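A small sketch of these three XML forms, parsed with the standard library's xml.etree.ElementTree. The sample document and its element names are hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one element with content, one comment, one empty element.
doc = """<person>
  <name>Example Name</name>
  <!-- an empty element needs only a single self-closing tag -->
  <photo src="sample.jpg" />
</person>"""

root = ET.fromstring(doc)
name = root.find("name")
photo = root.find("photo")

print(name.text)            # content between the paired tags
print(photo.text)           # None: the empty element has no content
print(photo.attrib["src"])  # attributes can still carry information
```

The default parser skips the comment, so only the two child elements appear in the parsed tree.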
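2. JSON
JSON organizes information as typed key-value pairs. When several values belong to one key, they are listed in square brackets; a key-value pair can also be nested inside another pair's value using curly braces.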
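A brief sketch of these JSON rules using the standard library's json module. The record below is made-up sample data, chosen to show a typed number, a list in square brackets, and a nested object in curly braces:

```python
import json

# Hypothetical record: typed values, a list, and a nested object.
text = """
{
    "name": "Example University",
    "ranked": true,
    "score": 94.6,
    "campuses": ["North", "South"],
    "address": {"city": "Beijing", "zipcode": "100081"}
}
"""

record = json.loads(text)

# Each JSON type maps to a Python type when parsed.
print(type(record["ranked"]).__name__)
print(type(record["score"]).__name__)
print(type(record["campuses"]).__name__)
print(record["address"]["zipcode"])
```

The quoted "100081" stays a string while the unquoted 94.6 becomes a float, which is what "typed" means in practice.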

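3. YAML
Neither keys nor values are wrapped in double quotes.
Indentation expresses nesting (which item belongs to which).
YAML uses untyped key-value pairs with no double quotes or other type markers around keys and values. A # starts a comment, - marks parallel list items, | introduces a block of data (typically multi-line text), and key-value pairs can be nested.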

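5.2 Comparing the Three Forms of Information Markup

Comparison:
In XML the share of useful information is low; most of the text is taken up by tags.
JSON needs less text than XML, but double quotes are repeated throughout.
YAML needs little text and no double quotes, making it the most concise of the three.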

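5.3 General Methods of Information Extraction

Best approach: combine parsing with searching. For example, parse the page, find every a tag, and read its href attribute: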

from bs4 import BeautifulSoup
import requests
r = requests.get("https://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))

http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001

5.4 Searching HTML Content with the bs4 Library

from bs4 import BeautifulSoup
import requests
r = requests.get("https://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")

soup.find_all('a')  # returns a list of all a tags in the document

[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>,
 <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

soup.find_all(['a', 'b'])  # pass a list as the first argument to match both a tags and b tags

[<b>The demo python introduces several python courses.</b>,
 <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>,
 <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

for tag in soup.find_all(True):  # True matches every tag
    print(tag.name)

html
head
title
body
p
b
p
a
a

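# regular expression library
import re
for tag in soup.find_all(re.compile('b')):  # tag names containing 'b' (the pattern is applied with search(); use '^b' for names that start with b)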
    print(tag.name)

body
b

# find p tags whose class attribute is "course"
soup.find_all('p', 'course')

[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
 <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]

# find tags whose id attribute equals 'link1'
soup.find_all(id='link1')

[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]

soup.find_all(id='link')

[]

soup.find_all(id=re.compile('link'))

[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>,
 <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

soup.find_all('a')
soup.find_all('a', recursive=False)  # empty result: no a tag is a direct child of the soup root node

[]

soup

<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>

soup.find_all(string="Basic Python")  # search for an exact string

['Basic Python']

soup.find_all(string=re.compile("Python"))  # use a regular expression to find strings containing "Python"

['Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n',
 'Basic Python',
 'Advanced Python']

5.5 Unit Summary

6. Example 1: Chinese University Ranking Crawler

6.1 Introduction to the Targeted University Ranking Crawler

The site has no robots.txt restricting crawlers, so it may be crawled.

6.2 Writing the Targeted University Ranking Crawler

Formatted output:

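import requests
from bs4 import BeautifulSoup
import bs4

# Fetch the page at url and return its text
def getHTMLText(url):
    # general fetching framework
    try:
        # fetch the URL with get(), using a 30 s timeout
        r = requests.get(url, timeout=30)
        # raise_for_status() raises an exception for an error status
        r.raise_for_status()
        # fix the encoding
        r.encoding = r.apparent_encoding
        # return the page text to the rest of the program
        return r.text
    except requests.RequestException:
        # return an empty string on failure
        return ""
    # stub from the skeleton stage, commented out once the function was written
    # return ""

# Parse the HTML page into the list ulist; this is the core step
def fillUnivList(ulist,html):
    # From the HTML structure: the tbody tag holds all universities, each tr tag inside it
    # holds one university, and the td tags in a tr hold that university's fields
    soup = BeautifulSoup(html, "html.parser")
    # find the tbody tag and iterate over its children; each tr is one university
    for tr in soup.find('tbody').children:
        # children may include non-tag strings (NavigableString), so check the type with isinstance
        if isinstance(tr, bs4.element.Tag):
            # calling a tag is shorthand for find_all: collect all td tags into the list tds
            tds = tr('td')
            # append the fields we need: rank, name, province, total score
            ulist.append([tds[0].string, tds[1].string, tds[2].string, tds[3].string])
    # stub from the skeleton stage, commented out once the function was written
    # pass

# Print the entries of ulist; num is how many to print
def printUnivList(ulist,num):
    # print the header: rank, university name, province/city, total score
    print("{:^10}\t{:^6}\t{:^10}\t{:^10}".format("排名", "学校名称", "省市", "总分"))
    # print each university's row
    for i in range(num):
        # u is the list of fields for the i-th university; use the same format as the header
        u = ulist[i]
        print("{:^10}\t{:^6}\t{:^10}\t{:^10}".format(u[0], u[1], u[2], u[3]))
    # stub from the skeleton stage, commented out once the function was written
    # print("Suc" + str(num))

# Main function
def main():
    # collect the university records in the list uinfo
    uinfo = []
    url = 'http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html'
    # fetch the HTML for the url
    html = getHTMLText(url)
    # extract the records from the HTML into uinfo
    fillUnivList(uinfo, html)
    # print the university records
    printUnivList(uinfo, 20)  # print the first 20 universities

main()
# the function bodies above were filled in after this skeleton was written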

    排名    	 学校名称 	    省市    	    总分    
    1     	 清华大学 	    北京    	   94.6   
    2     	 北京大学 	    北京    	   76.5   
    3     	 浙江大学 	    浙江    	   72.9   
    4     	上海交通大学	    上海    	   72.1   
    5     	 复旦大学 	    上海    	   65.6   
    6     	中国科学技术大学	    安徽    	   60.9   
    7     	华中科技大学	    湖北    	   58.9   
    7     	 南京大学 	    江苏    	   58.9   
    9     	 中山大学 	    广东    	   58.2   
    10    	哈尔滨工业大学	   黑龙江    	   56.7   
    11    	北京航空航天大学	    北京    	   56.3   
    12    	 武汉大学 	    湖北    	   56.2   
    13    	 同济大学 	    上海    	   55.7   
    14    	西安交通大学	    陕西    	   55.0   
    15    	 四川大学 	    四川    	   54.4   
    16    	北京理工大学	    北京    	   54.0   
    17    	 东南大学 	    江苏    	   53.6   
    18    	 南开大学 	    天津    	   52.8   
    19    	 天津大学 	    天津    	   52.3   
    20    	华南理工大学	    广东    	   52.0

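6.3 Optimizing the Targeted University Ranking Crawler

The output is essentially correct, but the columns are misaligned: the Chinese university names are padded with ordinary ASCII spaces, which are narrower than Chinese characters.

Compared with the previous section, only the printUnivList(ulist, num) function changes.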

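import requests
from bs4 import BeautifulSoup
import bs4

# Fetch the page at url and return its text
def getHTMLText(url):
    # general fetching framework
    try:
        # fetch the URL with get(), using a 30 s timeout
        r = requests.get(url, timeout=30)
        # raise_for_status() raises an exception for an error status
        r.raise_for_status()
        # fix the encoding
        r.encoding = r.apparent_encoding
        # return the page text to the rest of the program
        return r.text
    except requests.RequestException:
        # return an empty string on failure
        return ""
    # stub from the skeleton stage, commented out once the function was written
    # return ""

# Parse the HTML page into the list ulist; this is the core step
def fillUnivList(ulist,html):
    # From the HTML structure: the tbody tag holds all universities, each tr tag inside it
    # holds one university, and the td tags in a tr hold that university's fields
    soup = BeautifulSoup(html, "html.parser")
    # find the tbody tag and iterate over its children; each tr is one university
    for tr in soup.find('tbody').children:
        # children may include non-tag strings (NavigableString), so check the type with isinstance
        if isinstance(tr, bs4.element.Tag):
            # calling a tag is shorthand for find_all: collect all td tags into the list tds
            tds = tr('td')
            # append the fields we need: rank, name, province, total score
            ulist.append([tds[0].string, tds[1].string, tds[2].string, tds[3].string])
    # stub from the skeleton stage, commented out once the function was written
    # pass

# Print the entries of ulist; num is how many to print
def printUnivList(ulist,num):
    # alignment fix
    tplt = "{0:^10}\t{1:{4}^10}\t{2:^10}\t{3:^10}"  # {1:{4}^10} centers field 1 (the university name) in width 10, padded with format argument 4, chr(12288)
    # print the header: rank, university name, province/city, total score
    print(tplt.format("排名", "学校名称", "省市", "总分", chr(12288)))  # chr(12288) is the full-width (CJK) space used for padding
    # print each university's row
    for i in range(num):
        # u is the list of fields for the i-th university; use the same format as the header
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], u[3], chr(12288)))
    # stub from the skeleton stage, commented out once the function was written
    # print("Suc" + str(num))

# Main function
def main():
    # collect the university records in the list uinfo
    uinfo = []
    url = 'http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html'
    # fetch the HTML for the url
    html = getHTMLText(url)
    # extract the records from the HTML into uinfo
    fillUnivList(uinfo, html)
    # print the university records
    printUnivList(uinfo, 20)  # print the first 20 universities

main()
# the function bodies above were filled in after this skeleton was written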

    排名    	　　　学校名称　　　	    省市    	    总分    
    1     	　　　清华大学　　　	    北京    	   94.6   
    2     	　　　北京大学　　　	    北京    	   76.5   
    3     	　　　浙江大学　　　	    浙江    	   72.9   
    4     	　　上海交通大学　　	    上海    	   72.1   
    5     	　　　复旦大学　　　	    上海    	   65.6   
    6     	　中国科学技术大学　	    安徽    	   60.9   
    7     	　　华中科技大学　　	    湖北    	   58.9   
    7     	　　　南京大学　　　	    江苏    	   58.9   
    9     	　　　中山大学　　　	    广东    	   58.2   
    10    	　哈尔滨工业大学　　	   黑龙江    	   56.7   
    11    	　北京航空航天大学　	    北京    	   56.3   
    12    	　　　武汉大学　　　	    湖北    	   56.2   
    13    	　　　同济大学　　　	    上海    	   55.7   
    14    	　　西安交通大学　　	    陕西    	   55.0   
    15    	　　　四川大学　　　	    四川    	   54.4   
    16    	　　北京理工大学　　	    北京    	   54.0   
    17    	　　　东南大学　　　	    江苏    	   53.6   
    18    	　　　南开大学　　　	    天津    	   52.8   
    19    	　　　天津大学　　　	    天津    	   52.3   
    20    	　　华南理工大学　　	    广东    	   52.0
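The effect of the fill character can be seen in isolation. The following sketch pads the same names with ASCII spaces and with chr(12288); it uses only str.format, and the helper names center_ascii and center_cjk are introduced here for illustration:

```python
FULLWIDTH_SPACE = chr(12288)  # U+3000, as wide as one Chinese character


def center_ascii(name, width=10):
    """Center name in width characters, padding with ASCII spaces."""
    return "{:^{w}}".format(name, w=width)


def center_cjk(name, width=10):
    """Center name in width characters, padding with full-width spaces."""
    return "{:{fill}^{w}}".format(name, fill=FULLWIDTH_SPACE, w=width)


for name in ["清华大学", "中国科学技术大学"]:
    print("|" + center_ascii(name) + "|")
    print("|" + center_cjk(name) + "|")
```

Both helpers return strings of the same character count, but only the full-width padding occupies the same on-screen width as the Chinese characters, so the column edges line up.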

6.4 Unit Summary

7. The Regular Expression Library

7.1 The Concept of Regular Expressions

7.2 Regular Expression Syntax

7.3 Basic Use of the re Library

import re

match = re.search(r'[1-9]\d{5}', 'BIT 100081')

if match:
    print(match.group(0))  # prints 100081

import re

match = re.match(r'[1-9]\d{5}', 'BIT 100081')

if match:
    match.group(0)

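match.group(0)  # re.match() only matches at the start of the string; 'BIT 100081' starts with 'B', so match is None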

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-7-f6ba369a70a4> in <module>()
----> 1 match.group(0)


AttributeError: 'NoneType' object has no attribute 'group'

match = re.match(r'[1-9]\d{5}', '100081 BIT')

if match:
    print(match.group(0))  # prints 100081

import re

ls = re.findall(r'[1-9]\d{5}', 'BIT100081 TSU100084')

ls

['100081', '100084']

import re

re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084')  # removes each match and returns the remaining pieces as a list of strings

['BIT', ' TSU', '']

re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084', maxsplit=1)  # maxsplit=1 splits only at the first match

['BIT', ' TSU100084']

import re

# iterate over the matches and print each one
for m in re.finditer(r'[1-9]\d{5}', 'BIT100081 TSU100084'): 
    if m:
        print(m.group(0))

100081
100084

import re

re.sub(r'[1-9]\d{5}', ':zipcode', 'BIT100081 TSU100084')

'BIT:zipcode TSU:zipcode'
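Each call above passes the pattern string to a module-level function. When the same pattern is reused, it can be compiled once with re.compile() and the equivalent methods called on the resulting pattern object. A brief sketch with the same postal-code pattern and sample string (the variable name zipcode is illustrative):

```python
import re

# Compile the six-digit postal code pattern once and reuse it.
zipcode = re.compile(r'[1-9]\d{5}')
text = 'BIT100081 TSU100084'

first = zipcode.search(text).group(0)       # first match anywhere in the string
all_codes = zipcode.findall(text)           # every match, as a list of strings
pieces = zipcode.split(text, maxsplit=1)    # split only at the first match
masked = zipcode.sub(':zipcode', text)      # replace every match

print(first)
print(all_codes)
print(pieces)
print(masked)
```

The results are the same as those of re.search, re.findall, re.split, and re.sub shown above.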

7.4 The Match Object in the re Library

import re

match = re.search(r'[1-9]\d{5}', 'BIT 100081')

if match:
    print(match.group(0))

type(match)

_sre.SRE_Match    (shown as re.Match on Python 3.7 and later)

m = re.search(r'[1-9]\d{5}', 'BIT100081 TSU100084')

m.string

'BIT100081 TSU100084'

m.re

re.compile(r'[1-9]\d{5}', re.UNICODE)

m.pos

0

m.endpos

19

m.group(0)

'100081'

m.start()

3

m.end()

9

m.span()

(3, 9)

7.5 Greedy and Minimal Matching in the re Library
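By default the re library matches greedily and returns the longest possible match; adding ? after a quantifier (*?, +?, ??, {m,n}?) asks for the shortest match instead. A small sketch with a made-up sample string:

```python
import re

text = 'PYANBNCNDN'

greedy = re.search(r'PY.*N', text)    # * takes as many characters as possible
minimal = re.search(r'PY.*?N', text)  # *? takes as few characters as possible

print(greedy.group(0))
print(minimal.group(0))
```

The greedy pattern runs to the last N in the string, while the minimal pattern stops at the first N it can reach.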

7.6 Unit Summary

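8.1 Introduction to the "Taobao Product Information Targeted Crawler" Example

Each results page lists 44 products; the URL variable s gives the index of the first product on that page.
Analysis: identify the search interface used to submit queries to Taobao and the URL parameter that changes from page to page.
Taobao does not allow crawlers to fetch its search pages.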
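The paging rule can be sketched without touching the real site. In the example below, only the 44-items-per-page figure and the parameter name s come from the text; the base URL (a placeholder on example.com), the keyword parameter name q, and the function name page_urls are assumptions for illustration:

```python
from urllib.parse import urlencode

ITEMS_PER_PAGE = 44  # from the text: each results page lists 44 products


def page_urls(base_url, keyword, pages):
    """Return one search URL per results page, using s as the start offset."""
    urls = []
    for page in range(pages):
        # Page 0 starts at item 0, page 1 at item 44, page 2 at item 88, ...
        query = urlencode({"q": keyword, "s": ITEMS_PER_PAGE * page})
        urls.append(base_url + "?" + query)
    return urls


# example.com is a placeholder, not the real search endpoint.
for u in page_urls("https://example.com/search", "bag", 3):
    print(u)
```

Because the search pages are off limits to crawlers, this only illustrates how the offset is computed; the generated URLs are not meant to be fetched.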

8.2 Writing the "Taobao Product Information Targeted Crawler" Example

8.3 Unit Summary
