《Python网络数据采集》 (Web Scraping with Python), Chapters 1 & 2 (code reading notes)

Sharing code typed out while learning Python by following along with books.

First book: 《Byte Of Python》. Its code notes are at the link "ByteOfPython笔记代码", which also includes a PDF of the book.

Second book: 《Python网络数据采集》. A PDF of the book is available at: https://pan.baidu.com/s/1eSq6x5g password: a46q

This post contains the code notes for Chapter 1 (Your First Web Scraper) and Chapter 2 (Advanced HTML Parsing) of 《Python网络数据采集》, for reference.

Chapter 1: Your First Web Scraper

#-*-coding:utf-8-*-

### Fetching a page with raw urllib
# import urllib.request
# response = urllib.request.urlopen('http://localhost:8080/zhf/login!index.action')
# print(response.read().decode('utf-8'))

### Fetching a page with the BeautifulSoup module
# from urllib.request import urlopen
# from bs4 import BeautifulSoup
#
# html=urlopen("http://localhost:8080/zhf/login.jsp")
# bshtml=BeautifulSoup(html.read(),"html.parser")
# print("网页抓取成功!")
# print(bshtml.title)
# print(bshtml.head)
# print(bshtml.body)


### Defensive scraping (handling exceptions you can anticipate) with BeautifulSoup
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup

def getHtmlTitle(url):
    try:
        html=urlopen(url)
    except HTTPError as e:
        # The server returned an HTTP error status (404, 500, ...)
        return None
    except URLError as e:
        # The server could not be reached at all
        return None
    try:
        bshtml=BeautifulSoup(html.read(),"html.parser")
        title=bshtml.head.title
    except AttributeError as e:
        # Raised when a tag is missing, e.g. the page has no head or title
        return None
    return title

url="http://localhost:8080/zhf/login.jsp"
title=getHtmlTitle(url)
if title is None:
    print("Fetch failed")
else:
    print("Fetch succeeded:\n{0}".format(title))

Chapter 2: Advanced HTML Parsing

#-*-coding:utf-8-*-

######## Advanced HTML parsing

### Filter page elements by distinguishing attributes such as CSS class or id to grab specific tags.
###   .get_text() strips every HTML tag and returns only the text.
# from urllib.request import urlopen
# from bs4 import BeautifulSoup
#
# html=urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
# bshtml=BeautifulSoup(html,"html.parser")
# namelist=bshtml.findAll("span",{"class":"green"})
# for name in namelist:
#     print("name:{0}".format(name.get_text()))


# #The findAll and find methods
# #  findAll(tag, attributes, recursive, text, limit, keywords)
# #     find(tag, attributes, recursive, text, keywords)
# from urllib.request import urlopen
# from bs4 import BeautifulSoup
#
# html=urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
# bshtml=BeautifulSoup(html,"html.parser")
# princelist=bshtml.findAll(text="the prince")
# print("'the prince'出现了:{0}  次".format(len(princelist)))
#
# alltext=bshtml.findAll(id='text')
# print(alltext[0].get_text())
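The signatures above list limit and keyword arguments, which none of the examples here exercise. A minimal sketch of both against the same warandpeace.html page (class_ is BeautifulSoup's spelling for the class attribute, since class is a Python reserved word):

from urllib.request import urlopen
from bs4 import BeautifulSoup

html=urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bshtml=BeautifulSoup(html,"html.parser")

# limit caps the number of matches findAll returns
first_three=bshtml.findAll("span",{"class":"green"},limit=3)
for tag in first_three:
    print(tag.get_text())

# keyword arguments filter on attributes directly; class_ dodges the reserved word
greens=bshtml.findAll(class_="green")
print("matched {0} tags".format(len(greens)))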


### Child tags and descendant tags

### Child tags: .children
# from urllib.request import urlopen
# from bs4 import BeautifulSoup
#
# html=urlopen("http://www.pythonscraping.com/pages/page3.html")
# bshtml=BeautifulSoup(html,"html.parser")
#
# for child in bshtml.find("table",{"id":"giftList"}).children:
#     print(child)

### Descendant tags: .descendants
# from urllib.request import urlopen
# from bs4 import BeautifulSoup
#
# html=urlopen("http://www.pythonscraping.com/pages/page3.html")
# bshtml=BeautifulSoup(html,"html.parser")
#
# for child in bshtml.find("table",{"id":"giftList"}).descendants:
#     print(child)
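The difference between the two: .children yields only the table's direct children (its tr rows), while .descendants walks the entire subtree (td cells, img tags, and text nodes included). A quick sketch to make that visible, assuming the same page3.html table:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html=urlopen("http://www.pythonscraping.com/pages/page3.html")
bshtml=BeautifulSoup(html,"html.parser")
table=bshtml.find("table",{"id":"giftList"})

# .children stops at the tr rows; .descendants also yields every nested tag and text node
print("children:    {0}".format(len(list(table.children))))
print("descendants: {0}".format(len(list(table.descendants))))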

### All following sibling tags: .next_siblings
### The next sibling tag: .next_sibling
# previous_sibling: the previous sibling tag
# previous_siblings: all preceding sibling tags (see the sketch after the example below)

# from urllib.request import urlopen
# from bs4 import BeautifulSoup
#
# html=urlopen("http://www.pythonscraping.com/pages/page3.html")
# bhtml=BeautifulSoup(html,"html.parser")
#
# for tr in bhtml.find("table",{"id":"giftList"}).tr.next_siblings:
#     print(tr)
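previous_sibling and previous_siblings work the same way in the opposite direction. A minimal sketch walking backwards from the last row of the same table (note that the whitespace between tags shows up as text-node siblings too):

from urllib.request import urlopen
from bs4 import BeautifulSoup

html=urlopen("http://www.pythonscraping.com/pages/page3.html")
bshtml=BeautifulSoup(html,"html.parser")

# Start from the last tr and iterate backwards over its earlier siblings
last_row=bshtml.find("table",{"id":"giftList"}).findAll("tr")[-1]
for sibling in last_row.previous_siblings:
    print(sibling)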

### Parent tags: .parent and .parents
# from urllib.request import urlopen
# from bs4 import BeautifulSoup
#
# html=urlopen("http://www.pythonscraping.com/pages/page3.html")
# bthtml=BeautifulSoup(html,"html.parser")
#
# # .parent climbs from the img to its td cell; .previous_sibling is the price cell beside it
# obj=bthtml.find("img",{"src":"../img/gifts/img6.jpg"}).parent.previous_sibling.get_text()
# print(obj)
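.parents (plural) is mentioned above but not demonstrated; it climbs from a tag through every ancestor up to the document root. A minimal sketch on the same image tag:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html=urlopen("http://www.pythonscraping.com/pages/page3.html")
bshtml=BeautifulSoup(html,"html.parser")

# Print each ancestor tag's name (e.g. td, tr, table, ..., ending at [document])
img=bshtml.find("img",{"src":"../img/gifts/img6.jpg"})
for ancestor in img.parents:
    print(ancestor.name)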


####### Regular expressions:   re.compile("pattern")

# Email regex: [A-Za-z0-9\._+]+@[A-Za-z]+\.(com|org|edu|net)  (applied in the sketch after this block)
#
# from  urllib.request import urlopen
# from bs4 import BeautifulSoup
# import re
#
# html=urlopen("http://www.pythonscraping.com/pages/page3.html")
# bthtml=BeautifulSoup(html,"html.parser")
#
# images=bthtml.findAll("img",{"src":re.compile(r"\.\./img/gifts/img.*\.jpg")})
# for image in images:
#     print(image)
#     print(image["src"])
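The email regex noted above isn't exercised anywhere in these notes. A minimal sketch of applying it with re.findall; the group is made non-capturing ((?:...)) so findall returns whole addresses rather than just the matched suffix, and the sample string is invented for illustration:

import re

email_pattern=re.compile(r"[A-Za-z0-9\._+]+@[A-Za-z]+\.(?:com|org|edu|net)")

# findall returns every non-overlapping match in the text
sample="Contact alice@example.com or bob@example.org for details"
print(email_pattern.findall(sample))
# ['alice@example.com', 'bob@example.org']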

### Getting attributes: myImgTag.attrs["src"]
###                     myTag.attrs
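A minimal sketch of both forms against the page used above (myImgTag and myTag are the book's placeholder names):

from urllib.request import urlopen
from bs4 import BeautifulSoup

html=urlopen("http://www.pythonscraping.com/pages/page3.html")
bshtml=BeautifulSoup(html,"html.parser")

img=bshtml.find("img")
print(img.attrs)          # the tag's full attribute dictionary
print(img.attrs["src"])   # a single attribute value pulled out by name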

Chapter 3 link: 第三章 开始采集 (Starting to Crawl)

