38. Batch-translating English words with Python

hello rpa

Article link: https://blog.csdn.net/lvlinjier/article/details/112853212


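Purpose:
Given a batch of English text, build an English-Chinese vocabulary list and offer it as an Excel download.

What this code does:

Takes the URL of an English article and downloads the page automatically;
Translates the English words found on the page;
Exports the translations to an Excel file

Technologies involved:

pandas: reading CSV, merging multiple datasets, writing Excel
The requests library for downloading the HTML page
BeautifulSoup for parsing the HTML page
Python regular expressions for splitting the English text into words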

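1. Load the English-Chinese dictionary file

The dictionary file comes from: https://github.com/skywind3000/ECDICT
Steps:

Download the packaged source: https://github.com/skywind3000/ECDICT/archive/master.zip
Unzip master.zip, then extract the stardict.csv file inside it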

import pandas as pd

# Note: replace the path to stardict.csv with the location of your own copy
df_dict = pd.read_csv("D:/tmp/ECDICT-master/stardict.csv")

d:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3063: DtypeWarning: Columns (11) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

df_dict.shape

(3402564, 13)

df_dict.sample(10).head()

	word	phonetic	definition	translation	pos	collins	oxford	tag	bnc	frq	exchange	detail	audio
3370509	WWDH	NaN	NaN	[网络] 淇楄壋	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
518014	chauhtan (chotan)	NaN	NaN	卓丹	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
389953	breviarist	NaN	NaN	[网络] 短笛师	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
951231	electric-vehicle	NaN	NaN	abbr. “EV”的变体；“electric car”的变体\n[网络] 电动汽车	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
91258	Albionian	æl'biәniәn	NaN	[地质]阿尔比翁期	NaN	NaN	NaN	NaN	0.0	0.0	NaN	NaN	NaN

# Drop every column except word and translation
df_dict = df_dict[["word", "translation"]]
df_dict.head()

	word	translation
0	'a	na. 一\nn. 英文字母表的第一字母；【乐】A音\nart. 冠以不定冠词主要表示类别\...
1	'A' game	[网络] 游戏；一个游戏；一局
2	'Abbāsīyah	[地名] 阿巴西耶 ( 埃 )
3	'Abd al Kūrī	[地名] 阿卜杜勒库里岛 ( 也门 )
4	'Abd al Mājid	[地名] 阿卜杜勒马吉德 ( 苏丹 )
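Since only word and translation are used later, one option is to ask read_csv for just those two columns up front instead of loading all 13 and dropping the rest. The sketch below is only an illustration: it reads a tiny in-memory CSV with a cut-down header and made-up rows in place of stardict.csv. `usecols` and `dtype` are standard `pandas.read_csv` parameters; reading the columns as strings should also sidestep the mixed-type warning shown above, though that has not been checked against the full file here.

```python
import io

import pandas as pd

# Hypothetical stand-in for stardict.csv: cut-down header, two made-up rows.
sample_csv = io.StringIO(
    "word,phonetic,definition,translation\n"
    "apple,,,n. 苹果\n"
    "book,,,n. 书\n"
)

# Load only the two columns used later, and read them as strings.
df_small = pd.read_csv(sample_csv, usecols=["word", "translation"], dtype=str)
print(df_small.shape)
```

With the real file, the same two keyword arguments would be passed alongside the path to stardict.csv.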

2. Download the web page and get its content

import requests

# A URL from the official pandas documentation
url = "https://pandas.pydata.org/docs/user_guide/indexing.html"

html_cont = requests.get(url).text

html_cont[:100]

'\n\n<!DOCTYPE html>\n\n<html xmlns="http://www.w3.org/1999/xhtml">\n  <head>\n    <meta charset="utf-8" />'

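3. Extract the body text from the HTML

That is, strip the HTML tags and keep the text.

from bs4 import BeautifulSoup
# Name the parser explicitly; otherwise BeautifulSoup guesses one and emits a warning
soup = BeautifulSoup(html_cont, "html.parser")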
html_text = soup.get_text()

html_text[:500]

'\n\n\nIndexing and selecting data — pandas 1.0.1 documentation\n\n\n\n\n\n\n\n\n\n\n\n\nMathJax.Hub.Config({"tex2jax": {"inlineMath": [["$", "$"], ["\\\\(", "\\\\)"]], "processEscapes": true, "ignoreClass": "document", "processClass": "math|output_area"}})\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nHome\n\n\nWhat\'s New in 1.0.0\n\n\nGetting started\n\n\nUser Guide\n\n\nAPI reference\n\n\nDevelopment\n\n\nRelease Notes\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nIO tools (text, CSV, HDF5, â\x80¦)\n\n\nIndexing and selecting data\n\n\nMultiIndex / advanced indexing\n\n\nMerge, join, a'
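The `â\x80¦` in the output above looks like a UTF-8 ellipsis (…) that was decoded as Latin-1, which can happen when the response encoding is guessed from the HTTP headers rather than read from the page. A minimal sketch, assuming the page really is UTF-8 (its meta tag says so): decode the raw bytes explicitly. `decode_html` is a hypothetical helper, not part of requests; with requests the bytes would come from `response.content`.

```python
def decode_html(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode downloaded bytes with an explicit encoding (hypothetical helper)."""
    # errors="replace" keeps going if a few bytes are not valid in that encoding.
    return raw.decode(encoding, errors="replace")


# Stand-in for response.content: one line of the page, encoded as UTF-8.
raw = "IO tools (text, CSV, HDF5, …)".encode("utf-8")

print(repr(raw.decode("latin-1")))  # wrong codec: the ellipsis becomes three junk characters
print(decode_html(raw))             # explicit UTF-8: the ellipsis survives
```

Setting `response.encoding = "utf-8"` before reading `response.text` is another way to get the same effect.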

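4. Tokenize the English text and clean the data

# Tokenize: split on spaces, punctuation and newlines
import re
# Raw string, so the backslashes reach the regex engine without invalid-escape warnings
word_list = re.split(r"""[ ,.\(\)/\n|\-:=\$\["']""", html_text)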
word_list[:10]

['', '', '', 'Indexing', 'and', 'selecting', 'data', '—', 'pandas', '1']

# Load the stop-word list (copied from the web; stored under the current directory)
with open("./datas/stop_words/stop_words.txt") as fin:
    stop_words=set(fin.read().split("\n"))
list(stop_words)[:10]

['',
 'itself',
 'showed',
 'throughout',
 'pointed',
 'n',
 'against',
 'name',
 'none',
 'ran']

# Data cleaning
word_list_clean = []
for word in word_list:
    word = str(word).lower().strip()
    # Skip empty strings, numbers, single-character tokens and stop words
    if not word or word.isnumeric() or len(word) <= 1 or word in stop_words:
        continue
    word_list_clean.append(word)
word_list_clean[:20]

['indexing',
 'selecting',
 'data',
 'pandas',
 'documentation',
 'mathjax',
 'hub',
 'config',
 'tex2jax',
 'inlinemath',
 '\\\\',
 '\\\\',
 ']]',
 'processescapes',
 'true',
 'ignoreclass',
 'document',
 'processclass',
 'math',
 'output_area']
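The cleaned list above still contains tokens such as '\\\\' and ']]', because the split pattern only names some of the possible separators. They do no harm later, since they have no dictionary entry, but one alternative is to extract runs of letters instead of splitting on separators. The sketch below is only an illustration: `tokenize` is a hypothetical helper, and the sample string is made up to resemble the page text. Note that this approach also breaks identifiers such as output_area into separate words, which may or may not be wanted.

```python
import re


def tokenize(text: str) -> list:
    """Return the lowercase runs of ASCII letters found in text (hypothetical helper)."""
    return [w.lower() for w in re.findall(r"[A-Za-z]+", text)]


# Made-up sample resembling the page text, including the bracket and backslash debris.
sample = 'Indexing and selecting data ["\\\\(", "\\\\)"]] output_area'
print(tokenize(sample))
```

The stop-word and length filters from the cleaning loop would still be applied to the result.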

5. Build a DataFrame from the tokens

df_words = pd.DataFrame({
    "word": word_list_clean
})
df_words.head()

	word
0	indexing
1	selecting
2	data
3	pandas
4	documentation

df_words.shape

(4915, 1)

# Count word frequencies
df_words = (
    df_words
    .groupby("word")["word"]
    .agg(count="size")
    .reset_index()
    .sort_values(by="count", ascending=False)
)
df_words.head(10)

	word	count
620	df	161
659	dtype	87
1274	true	86
593	dataframe	80
1038	pd	75
917	loc	72
970	nan	72
721	false	58
914	list	58
835	indexing	53
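The same frequency table can also be produced with `Series.value_counts`, which counts and sorts in one step. This is an alternative sketch on made-up tokens, not the code used above; `rename_axis` and `reset_index(name=...)` are there so the column names come out as word and count.

```python
import pandas as pd

words = ["df", "loc", "df", "nan", "loc", "df"]  # made-up tokens

df_counts = (
    pd.Series(words, name="word")
    .value_counts()              # counts, sorted with the most frequent first
    .rename_axis("word")         # name the index so reset_index yields a "word" column
    .reset_index(name="count")   # turn the counts into a "count" column
)
print(df_counts)
```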

6. Merge with the dictionary

df_merge = pd.merge(
    left = df_dict,
    right = df_words,
    left_on = "word",
    right_on = "word"
)

df_merge.sample(10)

	word	translation	count
658	team	n. 队, 组\nvt. 把马(牛)套在同一辆车上, 把...编成一组\nvi. 驾驶卡车, 协作	3
523	providing	conj. 以...为条件, 假如	1
394	lines	n. 台词	1
118	columns	塔器	49
136	conforms	v. 遵守( conform的第三人称单数 ); 顺应; 相一致; 相符合	1
529	python	n. 大蟒, 巨蟒\n[计] Python 程序设计语言；人生苦短，我用 Python	26
185	determine	v. 决定, 决心	1
285	forward	a. 向前的, 早的, 迅速的, 在前的, 进步的\nvt. 促进...的生长, 转寄, 运...	1
49	arguments	n. 参数	3
564	reported	a. 报告的；据报道的	1

df_merge.shape

(718, 3)
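pd.merge performs an inner join by default, so any token without a dictionary entry (df, dtype and similar identifiers, for example) simply drops out of the result. To see which words were dropped, a left join with `indicator=True` can help. The sketch below uses made-up miniature tables; the names `words_demo` and `dict_demo` and the translations in them are illustrative only.

```python
import pandas as pd

# Made-up miniature versions of df_words and df_dict.
words_demo = pd.DataFrame({"word": ["python", "df", "columns"], "count": [26, 161, 49]})
dict_demo = pd.DataFrame({"word": ["python", "columns"], "translation": ["n. 大蟒", "n. 列"]})

# how="left" keeps every word; indicator=True adds a "_merge" column saying where each row matched.
merged = pd.merge(words_demo, dict_demo, on="word", how="left", indicator=True)

# Rows marked "left_only" had no match in the dictionary.
missing = merged.loc[merged["_merge"] == "left_only", "word"].tolist()
print(missing)
```

Because the tokens were lowercased during cleaning while the dictionary keeps its original capitalization, entries such as proper nouns would presumably show up in this missing list as well.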

7. Save to Excel

df_merge.to_excel("./38. batch_chinese_english.xlsx", index=False)

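Possible future improvements:

Accept batch input from txt/excel/word/pdf files and generate the vocabulary list from them;
Package it as a web page or WeChat mini program for online access and use
Let users mark or upload "words I already know" and filter those out on every run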
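The last idea, filtering out words the user already knows, could be sketched with `Series.isin`. Everything below is illustrative: the table and the set of known words are made up, and how the known words would be stored or uploaded is left open.

```python
import pandas as pd

# Made-up vocabulary table and a made-up set of words the user already knows.
vocab = pd.DataFrame({"word": ["team", "python", "forward"], "count": [3, 26, 1]})
known_words = {"team", "forward"}

# Keep only the rows whose word is not in the known set.
to_learn = vocab[~vocab["word"].isin(known_words)].reset_index(drop=True)
print(to_learn)
```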
