Scraping information from a simple custom HTML string

Article link: https://blog.csdn.net/E_hero_/article/details/99625577

HTML

What HTML is:
HyperText Markup Language

Exercise

import re

html="""
  <a target="_blank" href="http://www.baidu1.com">我的链接1</a>
  <A href="http://www.baidu2.com" target="_blank">我的链接2</a>
  <a href="http://www.baidu3.com" target="_blank">我的链接3</a>
"""
# Capture the value of every href attribute; re.I also matches the uppercase <A> tag
res_url="<a.*?href=\"(.*?)\".*?>"
urls=re.findall(res_url,html,re.M|re.S|re.I)
print(urls)
>>['http://www.baidu1.com', 'http://www.baidu2.com', 'http://www.baidu3.com']

# Capture the text between each opening <a ...> tag and its closing </a>
res_a="<a.*?>(.*?)</a>"
a=re.findall(res_a,html,re.M|re.S|re.I)
print(a)
>>['我的链接1', '我的链接2', '我的链接3']

"""
http://www.baidu1.com   我的链接1
http://www.baidu2.com   我的链接2
http://www.baidu3.com   我的链接3
"""
The rows written to the CSV file should follow the format above.

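import csv

with open("c:/first.csv","wt",newline="") as f:
	# csv.writer needs the open file object to write to
	w=csv.writer(f)
	for i in range(3):
		# Pair the i-th URL with the i-th link text
		temp=[urls[i],a[i]]
		#print(temp)
		w.writerow(temp)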

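Result: each row of the CSV file holds a URL followed by its link text, in the format shown above.
Stitching the two lists back together like this is risky. It only works if every link has a matching link text, and in practice that is not guaranteed, because some entries may be missing. Unless there is no other option, do not stitch separately extracted results together.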
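
To make that risk concrete, here is a minimal sketch. The input `html_missing` is a hypothetical variation of the exercise string (not part of the original example) in which the second link has no `href`. The two patterns are the same ones used above. The separately extracted lists end up with different lengths, so pairing them by position attaches a URL to the wrong link text.

```python
import re

# Hypothetical input: the second <a> tag has no href attribute.
html_missing = """
  <a target="_blank" href="http://www.baidu1.com">我的链接1</a>
  <a target="_blank">我的链接2</a>
  <a href="http://www.baidu3.com" target="_blank">我的链接3</a>
"""

flags = re.M | re.S | re.I
# Same two patterns as in the exercise, applied separately.
urls = re.findall("<a.*?href=\"(.*?)\".*?>", html_missing, flags)
texts = re.findall("<a.*?>(.*?)</a>", html_missing, flags)

# The lists no longer line up: two URLs but three link texts.
print(len(urls), len(texts))

# Pairing by position silently mismatches the second row.
pairs = list(zip(urls, texts))
print(pairs)
```

Here the third link's URL ends up next to the second link's text, and a loop over `range(3)` would instead fail with an `IndexError` on `urls[2]`.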

res_url="<a.*?href=\"(.*?)\".*?>(.*?)</a>"
r=re.findall(res_url,html,re.I|re.M|re.S)
print(r)
>>[('http://www.baidu1.com', '我的链接1'), ('http://www.baidu2.com', '我的链接2'), ('http://www.baidu3.com', '我的链接3')]
# As you can see, the captured groups of each match are kept together in one tuple, and the tuples are collected in a list

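import csv

# Use a forward slash in the path: "\s" in "c:\second.csv" is an invalid escape sequence
with open("c:/second.csv","wt",newline="") as f:
	w=csv.writer(f)
	# Each tuple (url, link text) becomes one CSV row
	for i in r:
		w.writerow(i)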
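
The single-pattern approach can also be wrapped into two small helpers, sketched below. The names `LINK_RE`, `extract_links`, and `write_links` are illustrative and do not come from the original post. The sample string is made up for the demonstration. The sketch writes to an in-memory `io.StringIO` buffer rather than a file on `c:`, so it can be run anywhere.

```python
import csv
import io
import re

# Same pattern as above, compiled once: group 1 is the href, group 2 the link text.
LINK_RE = re.compile("<a.*?href=\"(.*?)\".*?>(.*?)</a>", re.I | re.M | re.S)


def extract_links(html):
    """Return a list of (url, text) tuples, one per matched <a> tag."""
    return LINK_RE.findall(html)


def write_links(rows, fileobj):
    """Write each (url, text) tuple as one CSV row to an open text file object."""
    csv.writer(fileobj).writerows(rows)


# Made-up sample input for the demonstration.
sample = '<a href="http://www.baidu1.com" target="_blank">link 1</a>'

buffer = io.StringIO(newline="")
write_links(extract_links(sample), buffer)
print(buffer.getvalue())
```

To write a real file instead, pass the object returned by `open(path, "wt", newline="")` as `fileobj`. Because the URL and the text come from the same match, a row can never pair a URL with another link's text.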