How to run an HTML file in Spyder: processing local HTML files with Python 3 and BeautifulSoup4


高难饱


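The problem

While building my third WeChat mini program, "法语背单词记忆小助手" (a French vocabulary memorization helper), I needed to process a large amount of word data. To settle word definitions, example sentences, and related issues once and for all, I decided to extract the data from an mdx dictionary, turn the data for every word into a data table, and upload it to cloud development. That way my other mini program, "法语动词变位记忆小助手" (a French verb conjugation helper), can share the result.

Being lazy, I was never going to process this much data by hand (the extracted mdx has 600,000 lines; after removing the verb conjugation data, which is useless to me, 150,000 lines remain, covering more than 12,000 words). So I decided to use Python and Beautiful Soup (abbreviated as BS below) for the processing. To quote the official documentation: Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Initial text to process

The initial text looks like this. Only the entries for two words are shown as examples: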

abandonner
<h1 class="Adresse" >abandonner</h1><br /><span class="CategorieGrammaticale" >verbe transitif </span><br />
<span class="Indicateur">(déserter) </span><br />
<div class="Traductionchinois" >擅离</div>
<span class="Locution2" id="48" >abandonner son poste</span>
<div class="Traduction2chinois" >擅离职守</div Traduction2>
</td></tr></table>
<span class="Indicateur">(laisser) </span><br />
<div class="Traductionchinois" >抛弃</div>
<span class="Locution2" id="49" >abandonner un animal</span>
<div class="Traduction2chinois" >丢弃一只动物</div Traduction2>
<span class="Locution2" id="50" >partir en abandonnant femme et enfants</span>
<div class="Traduction2chinois" >抛弃妻子和孩子出走</div Traduction2>
</td></tr></table>
<span class="Indicateur">(renoncer à) </span><br />
<div class="Traductionchinois" >放弃</div>
<span class="Locution2" id="51" >abandonner ses études</span>
<div class="Traduction2chinois" >放弃自己的学业</div Traduction2>
</td></tr></table>
<span class="Indicateur">(se retirer de) </span><br />
<div class="Traductionchinois" >弃权</div>
<span class="Locution2" id="52" >il a abandonné la course</span> <span class="Traduction2chinois" >他在这次赛跑中弃权</span></td></tr></table>
<br /><br /> <h1 class="Adresse" >abandonner</h1><br /><span class="CategorieGrammaticale" >verbe intransitif </span><br />
<div class="Traductionchinois" >退出比赛</div>
<span class="Locution2" id="53" >après sa chute, le cycliste a abandonné</span> <span class="Traduction2chinois" >这个自行车运动员摔倒后就退出了比赛</span>
</zidingyi>

abat-jour
<h1 class="Adresse" >abat-jour</h1>
<br /><span class="CategorieGrammaticale" >nom masculin invariable</span><br />
<div class="Traductionchinois" >灯罩</div>
</zidingyi>

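Common regular expressions for search and replace

The rawest version of the document contains a great many useless tags that need to be deleted. If a tag is a fixed string, an ordinary search and replace handles it in bulk. But when a tag contains an id that changes in a regular way, or the text between tags varies, a regular expression is needed for the search. In my experience, the most frequently used expression boils down to this: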

<a[^>]*>(.*?)</a>

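Two examples. First, the text between a pair of tags varies irregularly, but I need to replace the tags together with all the text between them; this is what the (.*?) part of the pattern below matches. Second, a tag contains an id number, but I need to replace every such tag regardless of its id; this is what the [^>]* part matches:

(.*?)</span Traduction_py><span class="Locution2" [^>]*>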
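The same two patterns can also be tried out in Python with the standard re module before running them over the whole file. The following is only a small sketch: the sample string and the variable names are made up for illustration, and the author may well have done these replacements in a text editor instead.

```python
import re

# Hypothetical sample line; the ids and the text differ from tag to tag.
html = (
    '<a href="entry://abandon">abandon</a>'
    '<span class="Locution2" id="48" >abandonner son poste</span>'
    '<span class="Locution2" id="49" >abandonner un animal</span>'
)

# Remove every <a ...>...</a> pair together with the text inside it.
# (.*?) is non-greedy, so each match stops at the nearest </a>.
without_links = re.sub(r'<a[^>]*>(.*?)</a>', '', html)

# Strip only the opening <span class="Locution2" ...> tags, whatever their id.
# [^>]* matches any attribute text up to the closing '>'.
without_open_spans = re.sub(r'<span class="Locution2"[^>]*>', '', without_links)

print(without_links)
print(without_open_spans)
```

The non-greedy (.*?) matters here: a greedy (.*) would run from the first opening tag to the last closing tag on the line and delete everything in between.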

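Using beautifulsoup4 in Python 3

What is beautifulsoup4?

To quote the official documentation: Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Installing beautifulsoup4

From this point on we need Python. How to get started with Python quickly and easily from scratch may get its own article later; consider that a promise on record. In short, you need the following:

First, download Anaconda (just search for it; the installer needs no special configuration).

After installation, look among the installed programs for Anaconda Prompt and open it.

Enter the following command to install beautifulsoup4:

$ pip install beautifulsoup4

Then look for Anaconda Navigator among the installed programs, choose Spyder, and paste in the code from this article, adjusted to your own setup, to run it.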

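Getting started with beautifulsoup4

First we need to open the HTML file, which means telling the program where the file is. Change path to your own file path. Where does the HTML file come from? Following "Initial text to process" above, paste the text into Notepad++ and save it as an .html file to start experimenting. The lines after path open the HTML file and read its content.

path = 'D:/WORKS/larousse_original_test1.html'
# Open the file as UTF-8 and read its whole content into one string;
# the with block closes the file automatically
with open(path, 'r', encoding='utf-8') as htmlfile:
    htmlhandle = htmlfile.read()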
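If no dictionary file is at hand yet, the reading step can be tried on a throwaway file. The sketch below is an illustration only: the sample string and the file name sample.html are invented, and a temporary folder stands in for D:/WORKS. It writes a short sample as UTF-8 and reads it back with the same explicit encoding, which is what keeps the accented French letters and the Chinese text intact.

```python
import tempfile
from pathlib import Path

# Hypothetical two-line sample standing in for the dictionary export.
sample = (
    '<span class="Indicateur">(déserter) </span>\n'
    '<div class="Traductionchinois">擅离</div>\n'
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'sample.html'
    # Save as UTF-8, the same encoding that is passed to open() below.
    path.write_text(sample, encoding='utf-8')

    with open(path, 'r', encoding='utf-8') as htmlfile:
        htmlhandle = htmlfile.read()

# True when the text survived the round trip unchanged.
print(htmlhandle == sample)
```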

The next step is to call Beautiful Soup's parsing function, with lxml as the parser, and to use Python's pandas package to store the target data. Mind the capitalization of BeautifulSoup here; otherwise the import raises an error.

from bs4 import BeautifulSoup
import pandas as pd

# Parse the HTML string read from the file, using the lxml parser
soup = BeautifulSoup(htmlhandle, 'lxml')
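To see what the parsed object offers before building the full script, a shortened sample can be parsed and inspected directly. This is a sketch under two assumptions that are not taken from the article: each entry is wrapped in a complete zidingyi opening and closing pair, and the parser is Python's built-in 'html.parser' (chosen so the snippet runs without lxml installed), not the 'lxml' parser used in the article.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample; assumes a <zidingyi>...</zidingyi>
# pair around each entry.
htmlhandle = (
    '<zidingyi><h1 class="Adresse">abandonner</h1>'
    '<span class="CategorieGrammaticale">verbe transitif</span>'
    '<div class="Traductionchinois">擅离</div></zidingyi>'
    '<zidingyi><h1 class="Adresse">abat-jour</h1>'
    '<div class="Traductionchinois">灯罩</div></zidingyi>'
)

# 'html.parser' ships with Python; the article itself uses 'lxml'.
soup = BeautifulSoup(htmlhandle, 'html.parser')

entries = soup.find_all('zidingyi')
print(len(entries))                 # number of entries found
print(entries[0].h1.get_text())     # headword of the first entry
print(entries[1].find(class_='Traductionchinois').get_text())
```

Writing the import as from bs4 import beautifulsoup (all lowercase) fails with an ImportError, which is the capitalization error mentioned above.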

Create a counter, then create result, where all the data will be stored. When you later open the Excel sheet you will see columns such as 'word' and 'word_cixing', and the data grows row by row under these columns.

count = 0
# Create a one-row table, then add the six columns as empty strings
result = pd.DataFrame({}, index=[0])
result['word'] = ''             # headword
result['word_cixing'] = ''      # part of speech
result['word_jieshi_fr'] = ''   # definition (French)
result['word_jieshi_cn'] = ''   # definition (Chinese)
result['word_liju_fr'] = ''     # example sentence (French)
result['word_liju_cn'] = ''     # example sentence (Chinese)
# Row buffer for one word; .copy() keeps it independent of result
new = result.copy()
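One detail is worth knowing when preparing a row buffer such as new: in Python, assigning a DataFrame to a second name does not copy it. The sketch below uses the made-up names alias and independent to show the difference between plain assignment and .copy().

```python
import pandas as pd

result = pd.DataFrame({'word': ['']}, index=[0])

alias = result                 # same object under a second name
independent = result.copy()    # separate object with the same content

alias['word'] = 'abandonner'        # also changes result
independent['word'] = 'abat-jour'   # leaves result untouched

print(alias is result)
print(result.loc[0, 'word'])
print(independent.loc[0, 'word'])
```

With plain assignment, filling in the buffer also overwrites the row already sitting in result, so the first word can end up in the output twice.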

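Now set up a loop. In the initial HTML, I replaced the separator used in the original mdx with a custom zidingyi tag. In other words, every word is wrapped in an outer zidingyi element, and each zidingyi element holds all the content for that word.

First I use find_all(), which returns every zidingyi tag, and loop over the results. The content read on each pass is stored in item; BS's CSS selector then picks the h1 tag, which is the word itself. Next, the list returned by the selector has to be converted into a string, which is covered under "Other small details" below.

A BeautifulSoup object represents the whole content of a document. Most of the time it can be treated as a Tag object, and it supports most of the methods described in the documentation for navigating and searching the tree. get_text() then reads out all the text inside the tags; the text is stored in the 'word' field of new and appended to result, ready for the final output.

Only 'word' is shown here as an example. The other fields correspond to different classes or tags; details can be found in the official BS documentation.

for item in soup.find_all('zidingyi'):
    # The headword is the <h1> directly under <zidingyi>
    word = item.select("zidingyi > h1")
    # select() returns a list of tags; join them into one HTML string
    word = ';'.join(str(e) for e in word)
    # Re-parse that string and keep only its text
    word = BeautifulSoup(word, 'lxml').get_text()
    new['word'] = word
    count += 1
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    result = pd.concat([result, new], ignore_index=True)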
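The detour through str() and a second BeautifulSoup object is not the only way to get at the text: each element returned by select() is a Tag and has its own get_text(). The sketch below shows that variant. The helper joined_text, the sample string, the plain 'h1' selector, and the 'html.parser' parser are all illustrative assumptions, not the article's code.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with two <h1> headwords in one entry.
html = (
    '<zidingyi><h1 class="Adresse">abandonner</h1>'
    '<span class="Locution2" id="48">abandonner son poste</span>'
    '<h1 class="Adresse">abandonner</h1></zidingyi>'
)
soup = BeautifulSoup(html, 'html.parser')  # parser choice is an assumption


def joined_text(entry, selector):
    """Return the text of every tag matching selector, joined by ';'."""
    return ';'.join(tag.get_text().strip() for tag in entry.select(selector))


for item in soup.find_all('zidingyi'):
    word = joined_text(item, 'h1')
    liju = joined_text(item, '.Locution2')
    print(word, '|', liju)
```

Calling get_text() per tag also avoids parsing the same markup twice, which may matter with 150,000 lines of input.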

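Finally, all done: save all the data to an Excel sheet. (Adapt the path and the Excel file name to your own needs. pandas relies on the openpyxl package to write .xlsx files; Anaconda normally includes it.)

# Row 0 is the placeholder row created together with result, so skip it
result.iloc[1:].to_excel('D:/result.xlsx')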

Other small details

Merging a list into a string in Python 3

Use ''.join; a separator can be placed between the quotes.

list1 = ['1', '2', '3']
str1 = ''.join(list1)

If the list holds numbers or other non-string items, convert them before the join.

list1 = [1, 2, 3]
str1 = ''.join(str(e) for e in list1)

Final code (Python 3)

# -*- coding: utf-8 -*-
"""
Created on Sun Aug 4 14:13:54 2019

@author: https://xd.sh.cn
"""

from bs4 import BeautifulSoup
import pandas as pd

path = 'D:/WORKS/larousse_original_test1.html'
# Read the whole HTML file as one UTF-8 string
with open(path, 'r', encoding='utf-8') as htmlfile:
    htmlhandle = htmlfile.read()

soup = BeautifulSoup(htmlhandle, 'lxml')

count = 0
# One-row table holding the six output columns
result = pd.DataFrame({}, index=[0])
result['word'] = ''
result['word_cixing'] = ''
result['word_jieshi_fr'] = ''
result['word_jieshi_cn'] = ''
result['word_liju_fr'] = ''
result['word_liju_cn'] = ''
# Row buffer for one word; .copy() keeps it independent of result
new = result.copy()

for item in soup.find_all('zidingyi'):
    print(item)

    # Each select() returns a list of tags, joined into one HTML string
    word = item.select("zidingyi > h1")
    word = ';'.join(str(e) for e in word)
    print(word)

    word_cixing = item.select(".CategorieGrammaticale")
    word_cixing = ';'.join(str(e) for e in word_cixing)
    print(word_cixing)

    word_jieshi_fr = item.select(".Indicateur")
    word_jieshi_fr = ';'.join(str(e) for e in word_jieshi_fr)
    print(word_jieshi_fr)

    word_jieshi_cn = item.select(".Traductionchinois")
    word_jieshi_cn = ';'.join(str(e) for e in word_jieshi_cn)
    print(word_jieshi_cn)

    word_liju_fr = item.select(".Locution2")
    word_liju_fr = ';'.join(str(e) for e in word_liju_fr)
    print(word_liju_fr)

    word_liju_cn = item.select(".Traduction2chinois")
    word_liju_cn = ';'.join(str(e) for e in word_liju_cn)
    print(word_liju_cn)

    # Re-parse each HTML string and keep only its text
    word = BeautifulSoup(word, 'lxml').get_text()
    word_cixing = BeautifulSoup(word_cixing, 'lxml').get_text()
    word_jieshi_fr = BeautifulSoup(word_jieshi_fr, 'lxml').get_text()
    word_jieshi_cn = BeautifulSoup(word_jieshi_cn, 'lxml').get_text()
    word_liju_fr = BeautifulSoup(word_liju_fr, 'lxml').get_text()
    word_liju_cn = BeautifulSoup(word_liju_cn, 'lxml').get_text()

    new['word'] = word
    new['word_cixing'] = word_cixing
    new['word_jieshi_fr'] = word_jieshi_fr
    new['word_jieshi_cn'] = word_jieshi_cn
    new['word_liju_fr'] = word_liju_fr
    new['word_liju_cn'] = word_liju_cn

    count += 1
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    result = pd.concat([result, new], ignore_index=True)

# Row 0 is the placeholder row created together with result, so skip it
result.iloc[1:].to_excel('D:/result.xlsx')

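For comparison, the same extraction can be written more compactly by collecting one dictionary per word and building the table once at the end. This is a sketch of an alternative, not the author's method. The FIELDS mapping, the sample string, the plain 'h1' selector, and the 'html.parser' parser are assumptions made for illustration; only the column names and CSS classes come from the article.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical two-entry sample; assumes <zidingyi> wraps each entry.
html = (
    '<zidingyi><h1 class="Adresse">abandonner</h1>'
    '<span class="CategorieGrammaticale">verbe transitif</span>'
    '<span class="Indicateur">(déserter)</span>'
    '<div class="Traductionchinois">擅离</div>'
    '<span class="Locution2" id="48">abandonner son poste</span>'
    '<div class="Traduction2chinois">擅离职守</div></zidingyi>'
    '<zidingyi><h1 class="Adresse">abat-jour</h1>'
    '<span class="CategorieGrammaticale">nom masculin invariable</span>'
    '<div class="Traductionchinois">灯罩</div></zidingyi>'
)

# Column name -> CSS selector, using the classes from the article.
FIELDS = {
    'word': 'h1',
    'word_cixing': '.CategorieGrammaticale',
    'word_jieshi_fr': '.Indicateur',
    'word_jieshi_cn': '.Traductionchinois',
    'word_liju_fr': '.Locution2',
    'word_liju_cn': '.Traduction2chinois',
}

soup = BeautifulSoup(html, 'html.parser')  # the article uses 'lxml'

rows = []
for item in soup.find_all('zidingyi'):
    row = {}
    for column, selector in FIELDS.items():
        # Join the text of all matching tags with ';', as the article does.
        row[column] = ';'.join(t.get_text().strip() for t in item.select(selector))
    rows.append(row)

# Build the table once, with the columns in a fixed order.
result = pd.DataFrame(rows, columns=list(FIELDS))
print(result)
# result.to_excel('result.xlsx', index=False)  # needs openpyxl installed
```

Building the DataFrame once from a list avoids copying the whole table on every iteration, which tends to matter once the input grows to thousands of entries.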
