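I crawled a set of papers from PMC using "diabetes" as the keyword and used them as a corpus to train the word2vec model in gensim.
Data preprocessing
The input to the Word2vec model should be a list of lists. Each inner list represents one sentence, and each element of an inner list is one word of that sentence.
Take the following passage as an example: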
I like eating apples. I also like eating bananas.
It contains two sentences, so the input data should look like this:
[
    ['I', 'like', 'eating', 'apples'],
    ['I', 'also', 'like', 'eating', 'bananas']
]
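As a minimal sketch of the target format, the snippet below turns the example passage into a list of word lists using only plain string operations. The helper name `to_nested_tokens` is hypothetical, and splitting on periods is a simplifying assumption; the pipeline described next uses NLTK for sentence splitting instead.

```python
def to_nested_tokens(text):
    """Split text into sentences on '.', then split each sentence into words."""
    sentences = []
    for raw in text.split("."):
        words = raw.split()
        if words:  # skip empty fragments, e.g. the one after the final period
            sentences.append(words)
    return sentences


example = "I like eating apples. I also like eating bananas."
print(to_nested_tokens(example))
```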
Turning a sentence into a list of words
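import re
from bs4 import BeautifulSoup

def sentence_to_word(article):
    # Strip HTML tags from the sentence
    article_text = BeautifulSoup(article, "html.parser").get_text()
    # Replace every character that is not a letter with a space
    article_text = re.sub('[^a-zA-Z]', " ", article_text)
    # Convert all words to lowercase
    article_text = article_text.lower()
    words = article_text.split()
    return words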
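One consequence of keeping only letters is worth seeing on biomedical text: digits are removed as well, so terms such as "type 2 diabetes" or "HbA1c" lose information. The sketch below applies the same regular expression and lowercasing using only the standard library (the HTML stripping step is left out). The name `clean_and_split` and the sample sentence are made up for illustration.

```python
import re


def clean_and_split(text):
    """Apply the letters-only filter and lowercasing, then split on whitespace."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    return letters_only.lower().split()


print(clean_and_split("Type 2 diabetes raises HbA1c (>6.5%)."))
```

If numeric tokens matter for the task, a pattern such as `[^a-zA-Z0-9]` would keep them. That is a design choice, not something the pipeline in this post does.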
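Splitting an article into sentences
This step uses NLTK (Natural Language Toolkit).
Before using it, run the following code once to download the Punkt tokenizer models (calling nltk.download() with no arguments opens the interactive downloader instead):
import nltk
nltk.download('punkt')
Here are the functions: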
import json_lines
import nltk

def article_to_sentence(article, tokenizer):
    # Split the article into individual sentences
    raw_sentences = tokenizer.tokenize(article.strip())
    sentences = []
    for sentence in raw_sentences:
        if len(sentence) > 0:
            sentences.append(sentence_to_word(sentence))
    return sentences

def process_data():
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    sentences = []
    counter = 0
    with open('articles.jl', 'rb') as f:
        for item in json_lines.reader(f):
            # Get the article text from the crawled record
            article = list(item.values())[0]
            # Note the difference between += and append on a list
            sentences += article_to_sentence(article, tokenizer)
            counter = counter + 1
            if counter % 500 == 0:
                print(counter)
    return sentences
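The comment about `+=` versus `append` matters because the model expects a flat list of sentences. A small sketch of the difference, using made-up sentences:

```python
sentences_extend = []
sentences_append = []

# Two tokenized sentences from one hypothetical article
article_sentences = [["insulin", "lowers", "glucose"], ["diabetes", "is", "chronic"]]

# += adds each sentence as its own element: a list of sentences
sentences_extend += article_sentences
# append adds the whole article as a single element: a list of articles
sentences_append.append(article_sentences)

print(len(sentences_extend))
print(len(sentences_append))
```

With `+=` the result stays in the list-of-word-lists shape described earlier; with `append` each element would itself be a list of sentences, which is not the shape Word2vec expects.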
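Training the model
By default, gensim's Word2Vec uses the CBOW architecture (sg=0) and trains with negative sampling (hs=0, negative=5). Pass sg=1 for skip-gram or hs=1 for hierarchical softmax.
See the official documentation for all of the model's hyperparameters: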
https://radimrehurek.com/gensim/models/word2vec.html
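import logging
from gensim.models import word2vec

def train_model(sentences):
    # Hyperparameters; adjust them to the problem at hand
    # Dimensionality of each word vector
    num_features = 300
    # Ignore words that occur fewer times than this
    min_word_count = 40
    # Number of worker threads
    num_workers = 4
    # Context window size
    context = 10
    # Threshold for randomly downsampling high-frequency words
    # (subsampling; this is separate from negative sampling)
    downsampling = 1e-3
    # Print log messages during training
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    model = word2vec.Word2Vec(
        sentences=sentences,
        workers=num_workers,
        vector_size=num_features,  # named `size` in gensim versions before 4.0
        min_count=min_word_count,
        window=context,
        sample=downsampling,
    )
    # Save the model
    model_name = "300features_40minwords_10context"
    model.save(model_name)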
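For intuition about the `sample` threshold, the sketch below computes the probability of keeping one occurrence of a word under the subsampling rule of the original word2vec C implementation, which gensim follows as far as I know. The function name `keep_probability` and the example frequencies are assumptions for illustration, not values from this corpus.

```python
import math


def keep_probability(word_fraction, sample=1e-3):
    """Probability of keeping one occurrence of a word.

    word_fraction: the word's share of all tokens in the corpus (0 < f <= 1).
    sample: the subsampling threshold (the `sample` argument of Word2Vec).
    """
    ratio = sample / word_fraction
    # Equivalent to (sqrt(f / sample) + 1) * (sample / f), capped at 1.
    # Words rarer than the threshold get a value above 1, so they are always kept.
    return min(1.0, (math.sqrt(1.0 / ratio) + 1.0) * ratio)


for fraction in (0.05, 0.01, 0.001, 0.0001):
    print(fraction, round(keep_probability(fraction), 3))
```

The more frequent a word is relative to the threshold, the more of its occurrences are dropped, which reduces the influence of very common words on training.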
Validating the model
def validate():
    model = word2vec.Word2Vec.load("300features_40minwords_10context")
    # Look up the words closest to "insulin" and "diabetes"
    print(model.wv.most_similar("insulin"))
    print(model.wv.most_similar("diabetes"))
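The similarity scores returned by `most_similar` are cosine similarities between word vectors. As a rough sketch of what that ranking does, the snippet below computes cosine similarity in plain Python over a few toy 3-dimensional vectors. The vectors are invented for illustration and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, made up for illustration only
toy_vectors = {
    "insulin": [0.9, 0.1, 0.3],
    "glucagon": [0.8, 0.2, 0.4],
    "dementia": [0.1, 0.9, 0.2],
}

query = toy_vectors["insulin"]
# Rank every other word by its similarity to the query, highest first
ranked = sorted(
    ((word, cosine_similarity(query, vec))
     for word, vec in toy_vectors.items() if word != "insulin"),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```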
Results
1. insulin
word | similarity |
---|---|
glucocorticoid | 0.5418301224708557 |
leptin | 0.5139374732971191 |
castration | 0.4962504804134369 |
gemcitabine | 0.4824090600013733 |
noninsulin | 0.48208504915237427 |
fluconazole | 0.47333139181137085 |
glucocorticoids | 0.4716144800186157 |
glucagon | 0.455077588558197 |
octreotide | 0.4390123188495636 |
androgen | 0.43807411193847656 |
2. diabetes
word | similarity |
---|---|
dm (diabetes mellitus) | 0.5663890838623047 |
diabetic | 0.5376691818237305 |
dysglycemia (abnormal blood glucose regulation) | 0.46729063987731934 |
prediabetes | 0.3938789665699005 |
niddm (non-insulin-dependent diabetes mellitus) | 0.3806121349334717 |
dementia | 0.3769600987434387 |
diabetics | 0.37228599190711975 |
gdm (gestational diabetes mellitus) | 0.3707351088523865 |
dkd (diabetic kidney disease) | 0.36553144454956055 |
nihms (unclear; possibly the NIHMS manuscript ID that appears in PMC author manuscripts) | 0.3631674349308014 |
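Overall, the results are fairly good: most of the nearest neighbors are synonyms, abbreviations, or closely related clinical terms.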