Python: Finding Similar Documents on Google+

CODE:

#!/usr/bin/python 
# -*- coding: utf-8 -*-

'''
Created on 2014-9-10
@author: guaguastd
@name: find_similiar_document.py
'''

# Finding similar documents using cosine similarity
import json
import nltk.cluster.util

# Load in human language data from wherever you've saved it
DATA = r'E:\eclipse\Google\dFile\107033731246200681024.json'
with open(DATA) as f:
    activities = json.load(f)

# Only consider content longer than 1000 characters
# (len() on the content string counts characters, not words)
data = [post for post in activities
        if len(post['object']['content']) > 1000]

all_posts = [post['object']['content'].lower().split()
             for post in data]

# Provides tf, idf, and tf_idf abstractions for scoring
tc = nltk.TextCollection(all_posts)

# Compute a term-document matrix
td_matrix = {}
for idx, post in enumerate(all_posts):
    fdist = nltk.FreqDist(post)

    doc_title = data[idx]['title']
    url = data[idx]['url']
    td_matrix[(doc_title, url)] = {}

    # Iterating the FreqDist directly yields its terms (works on Python 2 and 3)
    for term in fdist:
        td_matrix[(doc_title, url)][term] = tc.tf_idf(term, post)

# Build vectors such that term scores are in the same positions...
distances = {}
for (title1, url1) in td_matrix.keys():
    distances[(title1, url1)] = {}
    (min_dist, most_similar) = (1.0, ('', ''))
    for (title2, url2) in td_matrix.keys():
        # Take care not to mutate the original data structures
        # since we're in a loop and need the originals multiple times
        terms1 = td_matrix[(title1, url1)].copy()
        terms2 = td_matrix[(title2, url2)].copy()

        # Fill in gaps in each map so vectors of the same length can be computed
        for term1 in terms1:
            if term1 not in terms2:
                terms2[term1] = 0

        for term2 in terms2:
            if term2 not in terms1:
                terms1[term2] = 0

        # Create vectors from term maps
        v1 = [score for (term, score) in sorted(terms1.items())]
        v2 = [score for (term, score) in sorted(terms2.items())]
        
        # Compute similarity amongst documents
        distances[(title1, url1)][(title2, url2)] = nltk.cluster.util.cosine_distance(v1, v2)

        if url1 == url2:
            continue

        if distances[(title1, url1)][(title2, url2)] < min_dist:
            (min_dist, most_similar) = (distances[(title1, url1)][(title2, url2)], (title2, url2))

    # Report similarity (1 - cosine distance) for the closest other post
    print('''Most similar to %s (%s)
\t%s (%s)
\tscore %f
''' % (title1, url1, most_similar[0], most_similar[1], 1-min_dist))
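
To make the weights stored in td_matrix easier to follow, here is a minimal pure-Python sketch of TF-IDF scoring. It assumes a plain term frequency (count divided by document length) multiplied by a natural-log inverse document frequency; the helper names tf, idf and tf_idf and the toy corpus are made up for illustration, and NLTK's own TextCollection may differ in details between versions.

```python
import math

def tf(term, doc):
    # Term frequency: share of the document's tokens equal to `term`
    return doc.count(term) / float(len(doc))

def idf(term, corpus):
    # Inverse document frequency: log(total docs / docs containing term);
    # defined as 0.0 here when the term appears nowhere (an assumption)
    matches = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / float(matches)) if matches else 0.0

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# Hypothetical toy corpus of tokenized documents
corpus = [
    'the cat sat on the mat'.split(),
    'the dog chased the cat'.split(),
    'the bird sang'.split(),
]

score_the = tf_idf('the', corpus[0], corpus)  # appears in every document
score_mat = tf_idf('mat', corpus[0], corpus)  # appears in one document only
```

A term that occurs in every document gets a weight of zero, while a term unique to one document gets a positive weight, which is why common words contribute little to the similarity scores.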

RESULT:

Most similar to Great talk by Maciej Ceglowski.  Funny, smart, and with an important message.  Just like Maciej all ... (https://plus.google.com/107033731246200681024/posts/b17bWhGfkH3)
	Journalism vs. Punditry: NPR's Kelly McEvers on Why Reporting Matters

There was a great segment on ... (https://plus.google.com/107033731246200681024/posts/NGZmQLE392X)
	score 0.056840

Most similar to Journalism vs. Punditry: NPR's Kelly McEvers on Why Reporting Matters

There was a great segment on ... (https://plus.google.com/107033731246200681024/posts/NGZmQLE392X)
	Great talk by Maciej Ceglowski.  Funny, smart, and with an important message.  Just like Maciej all ... (https://plus.google.com/107033731246200681024/posts/b17bWhGfkH3)
	score 0.056840

Most similar to The Myth of the Spoiled Child

There is an interesting op-ed in the NY Times by Alfie Cohn, who has ... (https://plus.google.com/107033731246200681024/posts/c1f9KVXsivD)
	How to Raise Moral Children

I thought this article on child-raising had a lot of good ideas in it. ... (https://plus.google.com/107033731246200681024/posts/NVZVmG1ct6C)
	score 0.064629

Most similar to Why Common Core is Like Healthcare.gov

Draw a bold line between this piece on the failure of the Common... (https://plus.google.com/107033731246200681024/posts/XebEgwjhV35)
	"We don't need new policies. We need better implementation."

Last night, I hosted Oakland City Councilor... (https://plus.google.com/107033731246200681024/posts/M1kH7bErNDm)
	score 0.067829

Most similar to "We don't need new policies. We need better implementation."

Last night, I hosted Oakland City Councilor... (https://plus.google.com/107033731246200681024/posts/M1kH7bErNDm)
	Why Common Core is Like Healthcare.gov

Draw a bold line between this piece on the failure of the Common... (https://plus.google.com/107033731246200681024/posts/XebEgwjhV35)
	score 0.067829

Most similar to +Maria Konnikova's NY Times article about the role of time and attention scarcity in the cycle of poverty... (https://plus.google.com/107033731246200681024/posts/4qHJZJU6Dtb)
	How to Raise Moral Children

I thought this article on child-raising had a lot of good ideas in it. ... (https://plus.google.com/107033731246200681024/posts/NVZVmG1ct6C)
	score 0.046450

Most similar to How to Raise Moral Children

I thought this article on child-raising had a lot of good ideas in it. ... (https://plus.google.com/107033731246200681024/posts/NVZVmG1ct6C)
	The Myth of the Spoiled Child

There is an interesting op-ed in the NY Times by Alfie Cohn, who has ... (https://plus.google.com/107033731246200681024/posts/c1f9KVXsivD)
	score 0.064629
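
Each score above is 1 minus the cosine distance between two TF-IDF vectors, so higher values mean more similar posts. As a rough illustration of that computation without NLTK, the sketch below aligns two sparse term maps on a shared vocabulary (the same gap-filling idea as in the script) and takes the cosine of the angle between them. The function name cosine_similarity and the sample term weights are hypothetical.

```python
import math

def cosine_similarity(terms1, terms2):
    """Cosine similarity between two sparse term -> score maps."""
    # Align both maps on one sorted vocabulary; missing terms score 0.0
    vocab = sorted(set(terms1) | set(terms2))
    v1 = [terms1.get(t, 0.0) for t in vocab]
    v2 = [terms2.get(t, 0.0) for t in vocab]

    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm1 * norm2)

# Hypothetical term weights for three tiny documents
doc_a = {'python': 0.5, 'nltk': 0.5}
doc_b = {'python': 0.5, 'json': 0.5}
doc_c = {'cooking': 1.0}

sim_ab = cosine_similarity(doc_a, doc_b)  # one shared term out of two
sim_ac = cosine_similarity(doc_a, doc_c)  # no shared terms
sim_aa = cosine_similarity(doc_a, doc_a)  # identical documents
```

Documents with no terms in common score 0, identical documents score 1, and partial overlap lands in between. The low scores in the output (around 0.05 to 0.07) suggest that even the closest posts share only a small part of their weighted vocabulary.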

