Notes on Using Beautiful Soup 4 for Web Scraping


License: CC 4.0 BY-SA


Beautiful Soup 4.4.0

The directory structure is as follows (image not included):

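Tip:

If WinZIP cannot extract the tar.gz file, upload it to a Linux machine, extract it there with the tar command, and then copy the result back to the Windows machine.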
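Since Python has to be installed anyway, another option is to unpack the archive on Windows with the standard-library tarfile module. This is a minimal sketch: the function name extract_tar_gz is made up for illustration, and the archive name in the usage note is an assumption rather than something taken from the original setup.

```python
import tarfile
from pathlib import Path


def extract_tar_gz(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:gz" opens a gzip-compressed tar archive for reading.
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        # Only extract archives that come from a source you trust.
        archive.extractall(dest)
    return names
```

Usage would look like extract_tar_gz("beautifulsoup4-4.4.0.tar.gz", "."), assuming that is the name of the downloaded file.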

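1) Python must be installed first. Search for it and download the version that suits your system. I downloaded 3.3.1.

2) Under Python 3.3.1, building Beautiful Soup 4.4 fails, while building 4.3 succeeds.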

Take action!

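I have edited this article again because the full code is now finished: it crawls the sub-page data I needed and saves it to Excel. I'm happy to have picked up another skill!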
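The full code is not shown in this post, so the following is only a rough sketch of the same idea and not the actual implementation. It pulls link text and URLs out of an HTML string and serializes them as CSV, which Excel can open. The sample HTML, the function names, and the choice of CSV instead of a native Excel file are all assumptions.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded sub-page.
SAMPLE_HTML = """
<ul>
  <li><a href="/item/1">First item</a></li>
  <li><a href="/item/2">Second item</a></li>
</ul>
"""


def extract_rows(html):
    """Return one (text, href) row per link found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"]) for a in soup.select("a[href]")]


def rows_to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["title", "url"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_HTML)
csv_text = rows_to_csv(rows)
print(csv_text)
```

To get a file that Excel reads cleanly, csv_text can be written with open(path, "w", newline="", encoding="utf-8-sig"). The byte-order mark added by utf-8-sig helps Excel recognize the file as UTF-8.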

def function1(param1, param2):
	# ...omitted...
	return var1

a. Check whether an object or string is None (note that an empty string "" is not None)

if var1 is None:
	print("var1 is None")
else:
	print("var1 is not None")
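The difference between None and an empty string is easy to miss, so here is a small sketch that separates the two cases. The helper names describe and is_blank are invented for this illustration.

```python
def describe(value):
    """Classify a value as None, an empty string, or something else."""
    if value is None:
        return "None"
    if value == "":
        return "empty string"
    return "has a value"


def is_blank(value):
    """True for None, "" and whitespace-only strings."""
    return value is None or (isinstance(value, str) and value.strip() == "")
```

A check like is_blank is handy for scraped text, where a tag may be missing entirely (None) or present with only whitespace inside.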

b. Check whether a list is empty

if not list1:          # an empty list is falsy; same result as len(list1) == 0
	print("list is empty")
else:
	print("list is not empty")
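Both checks come up constantly with Beautiful Soup, because find() returns None when nothing matches, while find_all() returns an empty list-like result. A minimal sketch, using a made-up HTML string and the built-in "html.parser" backend:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>No tables here</p>", "html.parser")

first_table = soup.find("table")      # None when nothing matches
all_tables = soup.find_all("table")   # empty list-like result when nothing matches

if first_table is None:
    print("find() returned None")
if not all_tables:
    print("find_all() returned an empty result")
```

Testing for None before touching the result of find() avoids an AttributeError on pages where the expected tag is absent.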

Exception handling

try:
	print("this is try clause")
except Exception:      # avoid a bare except, which also swallows KeyboardInterrupt
	print("handling clause in exception")
finally:
	print("finally clause")
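To make the order of the clauses visible, the sketch below records which ones run for a good and a bad input. The function parse_int is a hypothetical example, and it also uses the optional else clause, which runs only when the try body raised nothing.

```python
def parse_int(text):
    """Return (value, log): int(text) or None, plus the clauses that ran."""
    log = []
    value = None
    try:
        log.append("try")
        value = int(text)
    except ValueError as exc:
        # Catch only the error we expect; other exceptions still propagate.
        log.append("except: %s" % exc.__class__.__name__)
    else:
        log.append("else")
    finally:
        # Runs in every case, whether or not an exception occurred.
        log.append("finally")
    return value, log
```

Calling parse_int("42") goes through try, else, and finally, while parse_int("abc") goes through try, except, and finally.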

Iteration loops

for link in soup.find_all("a", href=True):   # find_all takes a tag name, not a CSS selector
	href = link["href"]            # link.get("href") also works
	print("href is %s" % href)
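Here is a self-contained version of that loop, run against a made-up HTML string. It also shows how link["href"] and link.get("href") differ when the attribute is missing.

```python
from bs4 import BeautifulSoup

html = '<a href="/a">A</a><a name="top">no href</a><a href="/b">B</a>'
soup = BeautifulSoup(html, "html.parser")

hrefs = []
for link in soup.find_all("a", href=True):   # href=True: attribute must be present
    hrefs.append(link["href"])

# link.get("href") returns None instead of raising KeyError when the attribute is missing.
missing = soup.find("a", attrs={"name": "top"}).get("href")
print(hrefs, missing)
```

Without href=True, the loop would also visit the anchor that has no href, and link["href"] would raise KeyError there.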

soup.select("a[href]") --> selects <a> tags that have an href attribute.

soup.select('div[title*="keyword"]') --> selects <div> tags whose title attribute contains "keyword".
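Both selectors can be tried on a small made-up document. This sketch assumes a Beautiful Soup installation with CSS selector support for attribute matching, which recent releases provide through the soupsieve package.

```python
from bs4 import BeautifulSoup

html = """
<div title="python keyword list">match</div>
<div title="other">no match</div>
<a href="/x">link</a>
<a>no href</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute-presence selector: only <a> tags that carry an href.
links = soup.select("a[href]")

# Substring selector: title must contain "keyword" somewhere.
divs = soup.select('div[title*="keyword"]')

print([a["href"] for a in links])
print([d.get_text() for d in divs])
```

select() always returns a list, so an empty result can be handled with the same emptiness check as any other list. When only the first match is needed, select_one() returns a single tag or None.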