[NLP] Scraping the articles in one column of a news site with Python and computing document similarity with an LSI model

Shan10011001

Categories: Text Similarity Computation, LSI Model. Tags: natural language processing, python, database

Article link: https://blog.csdn.net/weixin_45027164/article/details/117479246

Computing document similarity with an LSI model

Scraping the page text

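import json

import jsonpath
import requests
from bs4 import BeautifulSoup

# Fetch a page and return its HTML, or an empty string if the request fails
def getHTML(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        # Let requests guess the encoding from the response body
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

# Return every <p> element on the page (the article body is taken to be its <p> tags)
def getContent(url):
    html = getHTML(url)
    soup = BeautifulSoup(html, 'html.parser')
    paras = soup.select('p')
    return paras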

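# Save one article to the database.
# `connect` and `cursor` are assumed to be an already-open DB-API connection and
# cursor; the post does not show how they are created.
def saveFile(title, link, text):
    # Bound parameters, so quotes inside the text cannot break the statement.
    # Assumes a driver with the %s parameter style (for example pymysql).
    sql = "INSERT INTO newstb (title, linkurl, content) VALUES (%s, %s, %s)"
    # Concatenate the text of every non-empty <p> element
    sentence = "".join(t.get_text() for t in text if len(t) > 0)
    data = (title, link, sentence)
    try:
        cursor.execute(sql, data)
        connect.commit()
    except Exception:
        # Undo the failed insert and carry on with the next article
        connect.rollback()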

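# All article titles
Titles = []
# All article URLs
Links = []
for i in range(1, 26):
    url = 'http:xxxxx'  # URL redacted in the original post
    response = requests.get(url)
    html_str = response.content.decode()

    # Parse the JSON string into Python objects
    jsonobj = json.loads(html_str)

    # Starting from the root, collect every Title and LinkUrl node at any depth.
    # jsonpath.jsonpath returns False when nothing matches, hence the `or []`.
    titlelist = jsonpath.jsonpath(jsonobj, '$..Title') or []
    linklist = jsonpath.jsonpath(jsonobj, '$..LinkUrl') or []

    Titles += titlelist
    Links += linklist

a = len(Titles)  # number of articles
text = [getContent(Links[i]) for i in range(a)]  # <p> elements of each article

# Save every article to the database
for i in range(a):
    saveFile(Titles[i], Links[i], text[i])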
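
The post never shows how `connect` and `cursor` are created or how the `newstb` table is defined. As a minimal, self-contained sketch of the storage step, the following uses Python's built-in `sqlite3` as a stand-in database (an assumption; the original may well use MySQL), with a hypothetical schema inferred from the column names used in the queries. The names `open_db` and `save_article` are illustrative. It also shows why bound parameters matter: a title containing quotes is stored intact.

```python
import sqlite3

# Hypothetical schema: the column names come from the INSERT and SELECT
# statements in the post; the types and the autoincrement id are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS newstb (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT,
    linkurl TEXT,
    content TEXT
)
"""

def open_db(path=":memory:"):
    """Open the database and make sure the newstb table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn

def save_article(conn, title, link, content):
    """Insert one article using bound parameters (sqlite3 uses ? placeholders)."""
    try:
        conn.execute(
            "INSERT INTO newstb (title, linkurl, content) VALUES (?, ?, ?)",
            (title, link, content),
        )
        conn.commit()
        return True
    except sqlite3.Error:
        conn.rollback()
        return False

conn = open_db()
save_article(conn, "It's a title with 'quotes'", "http://example.com/1", "body text")
rows = conn.execute("SELECT id, title FROM newstb").fetchall()
print(rows)
```

With a MySQL driver the same pattern applies, except that the placeholder is typically `%s` instead of `?`.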

Computing document similarity with the LSI model

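##
## Similarity computation with the LSI model
##
## (1) LSI is a vector transformation model: it maps text from one vector space into another.
## (2) LSI can identify patterns in the text, as well as the relationships between its words and topics.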
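
For reference, the transformation described in (1) is usually formulated as a truncated singular value decomposition of the term-document matrix. The notation below ($X$, $U_k$, $\Sigma_k$, $V_k$, $d$, $q$) is introduced here for illustration and does not appear in the original post; $k$ plays the role of `num_topics`.

```latex
X \approx U_k \Sigma_k V_k^{\top}, \qquad
\hat{d} = U_k^{\top} d, \qquad
\operatorname{sim}(q, d) = \frac{\hat{q} \cdot \hat{d}}{\lVert \hat{q} \rVert \, \lVert \hat{d} \rVert}
```

Here $X$ is the weighted term-document matrix, the columns of $U_k$ are the $k$ leading topic directions, and $\hat{d}$ and $\hat{q}$ are the $k$-dimensional representations of a document vector $d$ and a query vector $q$. Some formulations additionally scale the projection by $\Sigma_k^{-1}$.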


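import codecs

import jieba.posseg as pseg
from gensim import corpora, models, similarities

# Tokenize a document and remove stop words
def tokenization(text):
    result = []
    # Stop-word list, one word per line (re-read on every call)
    with codecs.open(r'D:\自然语言处理\作业\stopworddic.txt', 'r', encoding='utf8') as f:
        stopwords = {w.strip() for w in f}
    # Part-of-speech tags to drop: punctuation, conjunctions, particles, adverbs,
    # prepositions, time words, numerals, locatives, pronouns
    stop_flag = ['x', 'c', 'u', 'd', 'p', 't', 'uj', 'm', 'f', 'r']
    words = pseg.cut(text)
    for word, flag in words:
        if flag not in stop_flag and word not in stopwords:
            result.append(word)
    return result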

# Load and tokenize every document stored in the database
def querysql():
    slt = "select content from newstb"
    cursor.execute(slt)
    corpus = []
    for row in cursor.fetchall():
        corpus.append(tokenization(row[0]))
    print("Number of documents in the database:", len(corpus))
    return corpus

# Build a bag-of-words model: each document is treated as an unordered bag of
# words, ignoring word order and meaning
def LSImodel(corpus):
    dictionary = corpora.Dictionary(corpus)
    doc_vectors = [dictionary.doc2bow(text) for text in corpus]
    tfidf = models.TfidfModel(doc_vectors)
    tfidf_vectors = tfidf[doc_vectors]

    # Build the LSI model with 2 topics

    lsi = models.LsiModel(tfidf_vectors, id2word=dictionary, num_topics=2)
    # print_topics returns the topics and only logs them, so print the result explicitly
    print(lsi.print_topics(2))

    lsi_vector = lsi[tfidf_vectors]

    tid = int(input("Enter a document ID (1~177): "))
    slt = "select content from newstb where id=%d"
    cursor.execute(slt % tid)

    query = tokenization(cursor.fetchone()[0])
    query_bow = dictionary.doc2bow(query)
    # The LSI model was trained on TF-IDF vectors, so the query is TF-IDF weighted first
    query_lsi = lsi[tfidf[query_bow]]

    index = similarities.MatrixSimilarity(lsi_vector)
    sims = index[query_lsi]
    # (position in corpus, cosine similarity) pairs, in corpus order
    print(list(enumerate(sims)))

if __name__ == '__main__':
    LSImodel(querysql())
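
The script prints the raw list of `(index, similarity)` pairs in corpus order, which is hard to scan for 177 documents. `MatrixSimilarity` scores are cosine similarities, so a natural follow-up is to sort them and show only the closest matches. The sketch below does this in plain Python on made-up 2-dimensional topic vectors; the helper names `cosine` and `top_k` and the sample numbers are illustrative assumptions, not part of the original code.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k(sims, k=3, skip=None):
    """Return the k highest (index, score) pairs, optionally skipping one index
    (typically the query document itself)."""
    ranked = sorted(enumerate(sims), key=lambda pair: pair[1], reverse=True)
    return [(i, s) for i, s in ranked if i != skip][:k]

# Made-up topic-space vectors for four documents (as with num_topics=2)
docs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (-1.0, 0.0)]
query_index = 0
sims = [cosine(docs[query_index], d) for d in docs]

# Show the two documents closest to the query, excluding the query itself
for i, score in top_k(sims, k=2, skip=query_index):
    print(f"doc {i}: {score:.3f}")
```

In the script above, something like `top_k(sims, k=5, skip=tid - 1)` would list the five nearest documents, assuming the database ids start at 1 and follow corpus order.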
