mofanpy01

Latest recommended article published on 2024-09-15 15:27:19

ccrispy

Category: python

Article link: https://blog.csdn.net/xxiizzeefather/article/details/108622154

01

Fetching the page source

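Use Python to open a web page and print its source code.
Because the page at the link provided by the instructor contains Chinese text, the bytes returned by read() have to be decoded with decode('utf-8') so that the Chinese characters display correctly.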

from urllib.request import urlopen
# If the page contains Chinese text, call decode()
html = urlopen("https://mofanpy.com/static/scraping/basic-structure.html").read().decode('utf-8')
# read() returns the raw bytes of the page; decode() turns them into a str
print(html)

The printed output is as follows:

<!DOCTYPE html>
<html lang="cn">
<head>
	<meta charset="UTF-8">
	<title>Scraping tutorial 1 | 莫烦Python</title>
	<link rel="icon" href="/static/img/description/tab_icon.png">
</head>
<body>
	<h1>爬虫测试1</h1>
	<p>
		这是一个在 <a href="/">莫烦Python</a>
		<a href="/tutorials/data-manipulation/scraping/">爬虫教程</a> 中的简单测试.
	</p>

</body>
</html>
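
The decode step can be tried without any network access. The sketch below is only an illustration: decode_body is a hypothetical helper, not part of the tutorial, and it assumes UTF-8 whenever no charset is given. With a real response object, response.headers.get_content_charset() could supply the charset argument.

```python
from typing import Optional


def decode_body(raw: bytes, charset: Optional[str] = None) -> str:
    """Decode raw response bytes, assuming UTF-8 when no charset is given."""
    return raw.decode(charset or "utf-8")


# Bytes like the ones urlopen(...).read() returns for a UTF-8 page
raw = "<h1>爬虫测试1</h1>".encode("utf-8")
text = decode_body(raw)

print(type(raw).__name__, "->", type(text).__name__)
print(text)
```

Before decoding, the Chinese characters are just escaped bytes; after decoding they are ordinary text that regular expressions can work on.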

Matching page content

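Use Python regular expressions (RegEx) to match text and filter out the information you want.
Pick the tag to extract, here <title></title>, and match it with a regular expression.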

import re
res = re.findall(r"<title>(.+?)</title>",html)
print("\nPage title is:",res[0])

The printed output is as follows:

Page title is: Scraping tutorial 1 | 莫烦Python
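
The ? in (.+?) makes the match non-greedy, so it stops at the first closing tag. A small sketch of the difference, using a made-up string with two title tags (the sample string is an assumption for illustration, not taken from the page):

```python
import re

# Hypothetical input with two <title> tags on one line
sample = "<title>First</title><title>Second</title>"

greedy = re.findall(r"<title>(.+)</title>", sample)   # .+ grabs as much as it can
lazy = re.findall(r"<title>(.+?)</title>", sample)    # .+? stops at the first </title>

print(greedy)  # ['First</title><title>Second']
print(lazy)    # ['First', 'Second']
```

On the tutorial page there is only one title, so both patterns would give the same result there; the non-greedy form is simply the safer habit.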

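The paragraph spans several lines, and by default "." does not match a newline character, so flags=re.DOTALL has to be added to let "." match newlines as well.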

res = re.findall(r"<p>(.*?)</p>",html,flags=re.DOTALL)
print("\nPage paragraph is :",res[0])

The printed output is as follows:

Page paragraph is : 
		这是一个在 <a href="/">莫烦Python</a>
		<a href="/tutorials/data-manipulation/scraping/">爬虫教程</a> 中的简单测试.
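
What re.DOTALL changes is whether "." may match a newline. A minimal sketch on a made-up multi-line paragraph (the snippet string is an assumption for illustration):

```python
import re

# Hypothetical paragraph whose content spans several lines
snippet = "<p>\n\tline one\n\tline two\n</p>"

without_flag = re.findall(r"<p>(.*?)</p>", snippet)
with_flag = re.findall(r"<p>(.*?)</p>", snippet, flags=re.DOTALL)

print(without_flag)    # [] because "." cannot cross the newline after <p>
print(len(with_flag))  # 1
```

Without the flag the pattern finds nothing at all, which is why the paragraph on the tutorial page only shows up once re.DOTALL is passed.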

All links

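res = re.findall(r'href="(.*?)"',html)
# Prefix every relative link with the site root, then join them with commas
linkall = ",".join("https://mofanpy.com" + link for link in res)
print("\nAll links: ", linkall)

The printed output is as follows:

All links:  https://mofanpy.com/static/img/description/tab_icon.png,https://mofanpy.com/,https://mofanpy.com/tutorials/data-manipulation/scraping/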
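
Building absolute links by gluing strings together is easy to get wrong. As an alternative sketch, urljoin from the standard library resolves each relative href against a base URL. The hrefs below are copied from the page source shown earlier; using the site root as the base is an assumption.

```python
from urllib.parse import urljoin

base = "https://mofanpy.com"
# hrefs as found in the page source above
hrefs = [
    "/static/img/description/tab_icon.png",
    "/",
    "/tutorials/data-manipulation/scraping/",
]

# Resolve every relative link against the base URL
absolute = [urljoin(base, h) for h in hrefs]
print(",".join(absolute))
```

urljoin also copes with hrefs that are already absolute, which plain concatenation does not.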

Personal summary

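1) Import the modules you need;
2) Fill in the URL of the HTML page and, based on the page source, decide whether the response needs to be decoded;
3) Use regular expressions (RegEx) to filter the information you fetched.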
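
The three steps can be put together in one small function. This is only a sketch under stated assumptions: scrape_summary is a hypothetical name, the input is a hard-coded byte string standing in for urlopen(...).read(), and UTF-8 is assumed as the encoding.

```python
# Step 1: import what is needed
import re
from urllib.parse import urljoin


def scrape_summary(raw: bytes, base_url: str) -> dict:
    """Decode a page and pull out its title and absolute links."""
    html = raw.decode("utf-8")                           # step 2: decode the bytes
    titles = re.findall(r"<title>(.+?)</title>", html)   # step 3: filter with RegEx
    hrefs = re.findall(r'href="(.*?)"', html)
    return {
        "title": titles[0] if titles else None,
        "links": [urljoin(base_url, h) for h in hrefs],
    }


# Stand-in for the bytes a real request would return
raw = b'<html><head><title>Demo</title></head><body><a href="/a">A</a></body></html>'
result = scrape_summary(raw, "https://example.com")
print(result)
```

Returning None for a missing title avoids the IndexError that res[0] would raise on a page without a title tag.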

Link to the tutorial I followed
