Brother Cat Teaches You to Write Crawlers 033--First Crawler Experience-BeautifulSoup-Homework

BeautifulSoup parsers

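| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(text, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 / 3.2.2 |
| lxml HTML parser | BeautifulSoup(text, "lxml") | Fast; tolerant of malformed documents | Requires a C library to be installed |
| lxml XML parser | BeautifulSoup(text, "xml") | Fast; the only parser that supports XML | Requires a C library to be installed |
| html5lib | BeautifulSoup(text, "html5lib") | Produces HTML5-format documents | Slow; must be installed separately (pure Python, no C extension required) |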
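
Because lxml and html5lib are optional installs, a script can try the faster parser first and fall back to the built-in one. The sketch below is only one possible way to do this; the helper name make_soup and the preference order are assumptions for illustration, not part of the original homework.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Returns a (soup, parser_name) pair. "html.parser" ships with Python,
    so keeping it last in the tuple guarantees that one parser is found.
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup, used = make_soup("<p>hello <b>world")  # deliberately unclosed tags
print(used, "->", soup.find("b").text)
```

Whichever parser is picked, the unclosed tags are repaired, so the lookup of the b tag still succeeds.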

Homework 1: Scrape the articles and save them locally (one HTML file per article)

wordpress-edu-3autumn.localprod.forc.work
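import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://wordpress-edu-3autumn.localprod.forc.work/'
index = BeautifulSoup(requests.get(BASE_URL).text, 'html.parser')
# Each article title on the index page is an <h2 class="entry-title"> wrapping a link.
for title in index.find_all('h2', class_='entry-title'):
    link = title.find('a')
    print(link.text)
    # Fetch the article page; the 'lxml' parser requires the lxml package.
    page = BeautifulSoup(requests.get(link['href']).text, 'lxml')
    # Save only the article body, using the title as the file name.
    with open('{}.html'.format(link.text), 'w', encoding='utf8') as file:
        file.write(str(page.find('div', class_='entry-content')))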
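
Using the article title directly as a file name fails when the title contains characters such as "/" or "?". The following offline sketch shows one way to guard against that. The sample HTML, the example.com URLs, and the helpers safe_filename and list_articles are all hypothetical and only mimic the structure the homework relies on.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical index-page fragment with the same h2/a structure as the homework site.
SAMPLE_INDEX = """
<h2 class="entry-title"><a href="https://example.com/post-1/">What is a crawler?</a></h2>
<h2 class="entry-title"><a href="https://example.com/post-2/">HTML / CSS basics</a></h2>
"""


def safe_filename(title, replacement="_"):
    """Replace characters that are invalid in file names on common systems."""
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', replacement, title).strip(" .")
    return (cleaned or "untitled") + ".html"


def list_articles(index_html):
    """Return (file name, URL) pairs for every article link on an index page."""
    soup = BeautifulSoup(index_html, "html.parser")
    articles = []
    for heading in soup.find_all("h2", class_="entry-title"):
        link = heading.find("a")
        if link is None or not link.get("href"):
            continue  # skip headings without a usable link
        articles.append((safe_filename(link.get_text(strip=True)), link["href"]))
    return articles


for name, url in list_articles(SAMPLE_INDEX):
    print(name, "<-", url)
```

Keeping the parsing step separate from the download step also makes it possible to test the selectors without touching the network.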

Homework 2: Scrape the book titles and their prices under each category and save them to books.txt

books.toscrape.com
Final result (an excerpt of books.txt follows the code)...

import requests
from bs4 import BeautifulSoup

BASE_URL = 'http://books.toscrape.com/'
home = BeautifulSoup(requests.get(BASE_URL).text, 'html.parser')
with open('books.txt', 'w', encoding='utf8') as file:
    # The sidebar holds a nested <ul>; each inner <li> is one category.
    for category in home.find('ul', class_='nav nav-list').find('ul').find_all('li'):
        file.write(category.text.strip() + '\n')
        res = requests.get(BASE_URL + category.find('a')['href'])
        res.encoding = 'utf8'  # decode as UTF-8 so the pound sign is read correctly
        page = BeautifulSoup(res.text, 'html.parser')
        # Only the first page of each category is read (no pagination handling).
        for book in page.find_all('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3'):
            title = book.find('h3').find('a')['title']
            price = book.find('p', class_='price_color').text
            print(title)
            file.write('\t"{}" {}\n'.format(title, price))
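
The line that sets res.encoding matters because of how the price text is decoded. When a text response does not declare a charset, requests falls back to ISO-8859-1, and UTF-8 bytes read that way turn the pound sign into two characters. The snippet below reproduces the effect without any network access; decode_body is a hypothetical helper used only for illustration, and whether the site actually omits the charset is an assumption.

```python
def decode_body(raw_bytes, encoding):
    """Decode a response body the way res.text would for a given res.encoding."""
    return raw_bytes.decode(encoding)


# The bytes a server would send for a UTF-8 encoded price.
raw = "£45.17".encode("utf-8")

print(decode_body(raw, "iso-8859-1"))  # prints Â£45.17 (wrong encoding assumed)
print(decode_body(raw, "utf-8"))       # prints £45.17 (correct encoding)
```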
Resulting books.txt (excerpt):
Travel
	"It's Only the Himalayas" £45.17
	"Full Moon over Noah’s Ark: An Odyssey to Mount Ararat and Beyond" £49.43
	"See America: A Celebration of Our National Parks & Treasured Sites" £48.87
	"Vagabonding: An Uncommon Guide to the Art of Long-Term World Travel" £36.94
	"Under the Tuscan Sun" £37.33
	"A Summer In Europe" £44.34
	"The Great Railway Bazaar" £30.54
	"A Year in Provence (Provence #1)" £56.88
	"The Road to Little Dribbling: Adventures of an American in Britain (Notes From a Small Island #2)" £23.21
	"Neither Here nor There: Travels in Europe" £38.95
	"1,000 Places to See Before You Die" £26.08
Mystery
	"Sharp Objects" £47.82
	"In a Dark, Dark Wood" £19.63
	"The Past Never Ends" £56.50
	"A Murder in Time" £16.64
	"The Murder of Roger Ackroyd (Hercule Poirot #4)" £44.10
	"The Last Mile (Amos Decker #2)" £54.21
	"That Darkness (Gardiner and Renner #1)" £13.92
	"Tastes Like Fear (DI Marnie Rome #3)" £10.69
	"A Time of Torment (Charlie Parker #14)" £48.35
	"A Study in Scarlet (Sherlock Holmes #1)" £16.73
	"Poisonous (Max Revere Novels #3)" £26.80
	"Murder at the 42nd Street Library (Raymond Ambler #1)" £54.36
	"Most Wanted" £35.28
	"Hide Away (Eve Duncan #20)" £11.84
	"Boar Island (Anna Pigeon #19)" £59.48
	"The Widow" £27.26
	"Playing with Fire" £13.71
	"What Happened on Beale Street (Secrets of the South Mysteries #2)" £25.37
	"The Bachelor Girl's Guide to Murder (Herringford and Watts Mysteries #1)" £52.30
	"Delivering the Truth (Quaker Midwife Mystery #1)" £20.89
Historical Fiction
	"Tipping the Velvet" £53.74
	"Forever and Forever: The Courtship of Henry Longfellow and Fanny Appleton" £29.69
	"A Flight of Arrows (The Pathfinders #2)" £55.53
	"The House by the Lake" £36.95
	"Mrs. Houdini" £30.25
	"The Marriage of Opposites" £28.08
	"Glory over Everything: Beyond The Kitchen House" £45.84
	"Love, Lies and Spies" £20.55
	"A Paris Apartment" £39.01
	"Lilac Girls" £17.28
	"The Constant Princess (The Tudor Court #1)" £16.62
	"The Invention of Wings" £37.34
	"World Without End (The Pillars of the Earth #2)" £32.97
	"The Passion of Dolssa" £28.32
	"Girl With a Pearl Earring" £26.77
	"Voyager (Outlander #3)" £21.07
	"The Red Tent" £35.66
	"The Last Painting of Sara de Vos" £55.55
	"The Guernsey Literary and Potato Peel Pie Society" £49.53
	"Girl in the Blue Coat" £46.83
......
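
To check that the saved file has the intended shape, it can be read back into a dictionary. This is a minimal sketch that assumes exactly the layout shown above: a category name on its own line, followed by tab-indented lines of the form "title" price. The function name parse_books and the short sample text are illustrative assumptions.

```python
def parse_books(text):
    """Turn the books.txt layout into {category: [(title, price), ...]}."""
    catalog = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("\t"):
            if current is None:
                continue  # a book line before any category: skip it
            # The price is the last space-separated token on the line.
            title, price = line.strip().rsplit(" ", 1)
            # title[1:-1] drops the surrounding double quotes.
            catalog[current].append((title[1:-1], float(price.lstrip("£"))))
        else:
            current = line.strip()
            catalog[current] = []
    return catalog


SAMPLE = (
    "Travel\n"
    "\t\"It's Only the Himalayas\" £45.17\n"
    "\t\"Under the Tuscan Sun\" £37.33\n"
    "Mystery\n"
    "\t\"Sharp Objects\" £47.82\n"
)

books = parse_books(SAMPLE)
for category, items in books.items():
    print(category, len(items), min(price for _, price in items))
```

Any unindented line, including a trailing "......" placeholder, is treated as a category name, so real files should not contain such filler lines.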

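Quick links:

Brother Cat Teaches You to Write Crawlers 000--Introduction.md
Brother Cat Teaches You to Write Crawlers 001--The print() function and variables.md
Brother Cat Teaches You to Write Crawlers 002--Homework-Print Pikachu.md
Brother Cat Teaches You to Write Crawlers 003--Data type conversion.md
Brother Cat Teaches You to Write Crawlers 004--Data type conversion-Exercise.md
Brother Cat Teaches You to Write Crawlers 005--Data type conversion-Homework.md
Brother Cat Teaches You to Write Crawlers 006--Conditionals and nested conditionals.md
Brother Cat Teaches You to Write Crawlers 007--Conditionals and nested conditionals-Homework.md
Brother Cat Teaches You to Write Crawlers 008--The input() function.md
Brother Cat Teaches You to Write Crawlers 009--The input() function-The Xiao Ai AI assistant.md
Brother Cat Teaches You to Write Crawlers 010--Lists, dictionaries, loops.md
Brother Cat Teaches You to Write Crawlers 011--Lists, dictionaries, loops-Homework.md
Brother Cat Teaches You to Write Crawlers 012--Boolean values and four kinds of statements.md
Brother Cat Teaches You to Write Crawlers 013--Boolean values and four kinds of statements-Homework.md
Brother Cat Teaches You to Write Crawlers 014--PK mini-game.md
Brother Cat Teaches You to Write Crawlers 015--PK mini-game (new revised version).md
Brother Cat Teaches You to Write Crawlers 016--Functions.md
Brother Cat Teaches You to Write Crawlers 017--Functions-Homework.md
Brother Cat Teaches You to Write Crawlers 018--debug.md
Brother Cat Teaches You to Write Crawlers 019--debug-Homework.md
Brother Cat Teaches You to Write Crawlers 020--Classes and objects (part 1).md
Brother Cat Teaches You to Write Crawlers 021--Classes and objects (part 1)-Homework.md
Brother Cat Teaches You to Write Crawlers 022--Classes and objects (part 2).md
Brother Cat Teaches You to Write Crawlers 023--Classes and objects (part 2)-Homework.md
Brother Cat Teaches You to Write Crawlers 024--Encoding && decoding.md
Brother Cat Teaches You to Write Crawlers 025--Encoding && decoding-Homework.md
Brother Cat Teaches You to Write Crawlers 026--Modules.md
Brother Cat Teaches You to Write Crawlers 027--Introduction to modules.md
Brother Cat Teaches You to Write Crawlers 028--Introduction to modules-Homework-Billboard.md
Brother Cat Teaches You to Write Crawlers 029--First look at crawlers-requests.md
Brother Cat Teaches You to Write Crawlers 030--First look at crawlers-requests-Homework.md
Brother Cat Teaches You to Write Crawlers 031--Crawler basics-html.md
Brother Cat Teaches You to Write Crawlers 032--First crawler experience-BeautifulSoup.md
Brother Cat Teaches You to Write Crawlers 033--First crawler experience-BeautifulSoup-Homework.md
Brother Cat Teaches You to Write Crawlers 034--Crawlers-BeautifulSoup in practice.md
Brother Cat Teaches You to Write Crawlers 035--Crawlers-BeautifulSoup in practice-Homework-Top 250 movies.md
Brother Cat Teaches You to Write Crawlers 036--Crawlers-BeautifulSoup in practice-Homework-Top 250 movies-Homework walkthrough.md
Brother Cat Teaches You to Write Crawlers 037--Crawlers-Baby wants to listen to songs.md
Brother Cat Teaches You to Write Crawlers 038--Requests with parameters.md
Brother Cat Teaches You to Write Crawlers 039--Storing data.md
Brother Cat Teaches You to Write Crawlers 040--Storing data-Homework.md
Brother Cat Teaches You to Write Crawlers 041--Simulated login-cookie.md
Brother Cat Teaches You to Write Crawlers 042--Using session.md
Brother Cat Teaches You to Write Crawlers 043--Simulating a browser.md
Brother Cat Teaches You to Write Crawlers 044--Simulating a browser-Homework.md
Brother Cat Teaches You to Write Crawlers 045--Coroutines.md
Brother Cat Teaches You to Write Crawlers 046--Coroutines-Practice-What to eat without getting fat.md
Brother Cat Teaches You to Write Crawlers 047--The scrapy framework.md
Brother Cat Teaches You to Write Crawlers 048--Crawlers and anti-crawling.md
Brother Cat Teaches You to Write Crawlers 049--Series finale.md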

Reposted from: https://juejin.im/post/5cfc4adb51882512a675faf0
