Reading HTML with Python 3.4. About HTML: Python 3.5 and BeautifulSoup4, getting text from 'p' inside a div

Latest recommended article published 2024-10-08 12:39:40

weixin_39664998


Tags: python3.4, reading HTML

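I am trying to extract all the text from the div with class 'caselawcontent searchable-content'. This code only prints the HTML, not the text from the web page. How do I get the text?

The following link is in the file 'filteredcasesdoc.txt':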

http://caselaw.findlaw.com/mo-court-of-appeals/1021163.html

import requests
from bs4 import BeautifulSoup

with open('filteredcasesdoc.txt', 'r') as openfile1:
    for line in openfile1:
        rulingpage = requests.get(line).text
        soup = BeautifulSoup(rulingpage, 'html.parser')
        doctext = soup.find('div', class_='caselawcontent searchable-content')
        print(doctext)

from bs4 import BeautifulSoup
import requests

url = 'http://caselaw.findlaw.com/mo-court-of-appeals/1021163.html'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

I've used a more reliable form of .find here, passing the attribute as a {key: value} dict:

whole_section = soup.find('div', {'class': 'caselawcontent searchable-content'})

the_title = whole_section.center.h2
# e.g. Missouri Court of Appeals, Southern District, Division Two.

second_title = whole_section.center.h3.p
# e.g. STATE of Missouri, Plaintiff-Appellant v....

number_text = whole_section.center.h3.next_sibling.next_sibling
# e.g.

the_date = number_text.next_sibling.next_sibling

# authors
authors = whole_section.center.next_sibling

para = whole_section.findAll('p')[1:]
# [1:] because we don't want the first paragraph, h3.p.
# We could also use findAll('p', recursive=False), which doesn't pick up nested children.

Basically, I've dissected the whole tree.

As for the paragraphs (i.e. the main text, the variable para), you'll have to loop over them.

print(authors)
# You can add .text (e.g. print(authors.text)) to get the text without the tag,
# or write a simple function that returns only the text:

def rettext(something):
    return something.text

# Usage: print(rettext(authors))
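The loop over the paragraphs mentioned above could look roughly like this. It is only a sketch: it runs on a small inline HTML string instead of the live page, and both the sample markup and the helper name paragraph_texts are assumptions made for illustration, so the real page's structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page; the actual markup may differ.
SAMPLE_HTML = """
<div class="caselawcontent searchable-content">
  <center><h2>Missouri Court of Appeals</h2><h3><p>STATE of Missouri v. ...</p></h3></center>
  <p>First paragraph of the opinion.</p>
  <p>Second paragraph of the opinion.</p>
</div>
"""


def paragraph_texts(section):
    """Return the text of every <p> in section except the first (the h3.p title)."""
    return [p.get_text(strip=True) for p in section.find_all('p')[1:]]


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
whole_section = soup.find('div', {'class': 'caselawcontent searchable-content'})

# Loop over the main-text paragraphs and print each one without its tags.
for text in paragraph_texts(whole_section):
    print(text)
```

find_all is the current spelling of findAll; both return the same list, so the [1:] slice behaves the same way with either.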

You were both a great help! Thank you.

Try printing doctext.text. That strips out all the HTML tags for you.

import requests
from bs4 import BeautifulSoup

cases = []

with open('filteredcasesdoc.txt', 'r') as openfile1:
    for url in openfile1:
        # GET the HTML page as a string, with HTML tags
        # (strip() drops the trailing newline read from the file)
        rulingpage = requests.get(url.strip()).text
        soup = BeautifulSoup(rulingpage, 'html.parser')

        # find the part of the HTML page we want, as an HTML element
        doctext = soup.find('div', class_='caselawcontent searchable-content')

        print(doctext.text)  # now we have the element's text as a string, without tags
        cases.append(doctext.text)  # do something useful with this!
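One thing to watch for with the snippet above: soup.find returns None when the div is not on the page, and doctext.text then raises AttributeError. A minimal sketch of a guarded version follows; the helper name extract_case_text and the sample markup are assumptions for illustration, not part of the original answer. It also uses get_text with a separator so that text from adjacent tags does not run together.

```python
from bs4 import BeautifulSoup


def extract_case_text(html):
    """Return the text of the target div, or None if the div is absent."""
    soup = BeautifulSoup(html, 'html.parser')
    doctext = soup.find('div', class_='caselawcontent searchable-content')
    if doctext is None:
        return None
    # separator puts each text fragment on its own line; strip trims whitespace
    return doctext.get_text(separator='\n', strip=True)


# Hypothetical sample markup, standing in for a downloaded page.
sample = ('<div class="caselawcontent searchable-content">'
          '<h2>Title</h2><p>Body text.</p></div>')

print(extract_case_text(sample))
print(extract_case_text('<p>no such div</p>'))
```

In the loop above, the result could be appended to cases only when it is not None.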

Nice answer, Theo. I've also added some other ways of approaching this problem. You explained it in a simple way, so you have my upvote!
