Removing HTML tags to get strings in Python

Latest recommended article published on 2023-04-18 09:51:16

weixin_39563132


Article tags: removing HTML tags in Python

I tried to get some strings from an HTML file with BeautifulSoup, but every time I work with it I get only partial results.

I want to get the strings in every li element. So far I've been able to get everything in the ul like this:

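#!/usr/bin/python
from bs4 import BeautifulSoup

# Parse the saved page; naming the parser explicitly avoids the "no parser specified" warning
with open("page.html") as page:
    soup = BeautifulSoup(page, "html.parser")

# Select every li element inside the element with class "sidebar"
source = soup.select(".sidebar li")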

And what I get is this:

[

Def Leppard - Make Love Like A ManLive

Inxs - Never Tear Us Apart

Gary Moore - Over The Hills And Far Away

Linkin Park - Numb

Vita De Vie - Basul Si Cu Toba Mare

Nazareth - Love Hurts

U2 - I Still Haven't Found What I'm L

Blink 182 - All The Small Things

Scorpions - Wind Of Change

Iggy Pop - The Passenger

]

I want to get only the strings from this.

Solution

Use Beautiful Soup's .strings generator, or .stripped_strings to drop the surrounding whitespace:

for string in soup.stripped_strings:
    print(repr(string))
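The loop above walks the whole document. To get one string per li, as the question asks, one option is to iterate over the tags matched by the selector and join each tag's stripped strings. This is a minimal sketch, not the original poster's code: the inline SAMPLE_HTML markup stands in for page.html (whose real markup is not shown in the question), and the li_texts helper name is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for page.html; the real page's markup is not shown in the question.
SAMPLE_HTML = """
<div class="sidebar">
  <ul>
    <li><a href="#">Inxs - Never Tear Us Apart</a></li>
    <li><a href="#">Linkin Park - Numb</a></li>
    <li> <a href="#">Nazareth - Love Hurts</a> </li>
  </ul>
</div>
"""


def li_texts(html, selector=".sidebar li"):
    """Return the text of each element matched by selector, one string per element."""
    soup = BeautifulSoup(html, "html.parser")
    # stripped_strings yields each text fragment with surrounding whitespace removed
    # and skips whitespace-only fragments; joining with a space keeps fragments
    # from nested tags apart.
    return [" ".join(tag.stripped_strings) for tag in soup.select(selector)]


if __name__ == "__main__":
    for text in li_texts(SAMPLE_HTML):
        print(text)
```

If each li holds a single text fragment, tag.get_text(strip=True) should give the same result; the join is only there to keep fragments from several nested tags from running together.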

From the docs:

If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator.

These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead.
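The difference between the two generators is easiest to see side by side. The snippet below is an illustrative sketch on a made-up li fragment (the markup is an assumption, not taken from the question's page): .strings returns every text node, including the newlines and indentation between tags, while .stripped_strings trims each one and drops those that are whitespace only.

```python
from bs4 import BeautifulSoup

# Made-up fragment with indentation and padding, for illustration only.
html = "<li>\n  <a href='#'>Scorpions</a>\n  <span> Wind Of Change </span>\n</li>"
soup = BeautifulSoup(html, "html.parser")

# Every text node under the li, whitespace included.
raw = list(soup.li.strings)

# The same nodes, trimmed, with whitespace-only nodes skipped.
clean = list(soup.li.stripped_strings)

if __name__ == "__main__":
    print([repr(s) for s in raw])
    print([repr(s) for s in clean])
```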
