Python: a text summarization algorithm based on sentence detection and frequency-distribution analysis (document summary algorithm)

#!/usr/bin/python 
# -*- coding: utf-8 -*-

'''
Created on 2015-1-25
@author: beyondzhou
@name: document_summarize_algorithm.py
'''

from __future__ import print_function  # print() below works on Python 2 and 3

import json
from summary import summarize

# Download the nltk packages used in this example (run once)
# import nltk
# nltk.download('stopwords')

# Read data: a JSON list of posts, each with 'title' and 'content' keys
BLOG_DATA = r"E:\eclipse\Web\dFile\feed.json"
with open(BLOG_DATA) as f:
    blog_data = json.load(f)

for post in blog_data:

    # Add 'top_n_summary' and 'mean_scored_summary' to the post
    post.update(summarize(post['content']))

    print(post['title'])
    print('=' * len(post['title']))
    print()
    print('Top N Summary')
    print('-------------')
    print(' '.join(post['top_n_summary']))
    print()
    print('Mean Scored Summary')
    print('-------------')
    print(' '.join(post['mean_scored_summary']))
    print()

# summary.py: the module imported by the script above

import nltk
import numpy
from nltk.tokenize import sent_tokenize, word_tokenize

def summarize(txt):
    N = 100 # Number of words to consider
    TOP_SENTENCES = 5 # Number of sentences to return for a "top n" summary

    sentences = sent_tokenize(txt)
    normalized_sentences = [s.lower() for s in sentences]

    words = [w for sentence in normalized_sentences for w in
              word_tokenize(sentence)]

    fdist = nltk.FreqDist(words)

    # Keep the N most frequent words that are not stopwords.
    # most_common() is ordered by decreasing frequency; in NLTK 3,
    # items() is not guaranteed to be.
    stopwords = set(nltk.corpus.stopwords.words('english'))
    top_n_words = [w for (w, count) in fdist.most_common()
         if w not in stopwords][:N]

    # _score_sentences is not shown in this post; it must return a list of
    # (sentence_index, score) tuples
    scored_sentences = _score_sentences(normalized_sentences, top_n_words)

    # Summarization Approach 1:
    # Filter out nonsignificant sentences by using the average score plus a
    # fraction of the std dev as a filter
    avg = numpy.mean([s[1] for s in scored_sentences])
    std = numpy.std([s[1] for s in scored_sentences])
    mean_scored = [(sent_idx, score) for (sent_idx, score) in scored_sentences
                    if score > avg + 0.5 * std]

    # Summarization Approach 2:
    # Another approach would be to return only the top N ranked sentences,
    # restored to their original order in the document
    top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-TOP_SENTENCES:]
    top_n_scored = sorted(top_n_scored, key=lambda s: s[0])

    # Decorate the post object with summaries
    return dict(top_n_summary=[sentences[idx] for (idx, score) in top_n_scored],
                mean_scored_summary=[sentences[idx] for (idx, score) in mean_scored])

Sample output from the script:

Four short links: 23 January 2015
=================================

Top N Summary
-------------
16 Andreessen-Horowitz Investment Areas — I’m struck by how they’re connected: there’s a cluster around cloud development, there are two maybe three on sensors … 
 Pattern — a web mining module for the Python programming language. It has tools for data mining (Google, Twitter and Wikipedia API, a web crawler, a HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, SVM), network analysis and <canvas> visualization. Code Review — FogCreek’s code review checklist. In fields where people thought that raw talent was required, academic departments had lower percentages of women. (via WaPo )

Mean Scored Summary
-------------
16 Andreessen-Horowitz Investment Areas — I’m struck by how they’re connected: there’s a cluster around cloud development, there are two maybe three on sensors … 
 Pattern — a web mining module for the Python programming language. It has tools for data mining (Google, Twitter and Wikipedia API, a web crawler, a HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, SVM), network analysis and <canvas> visualization.

Designing on a system level
===========================

Top N Summary
-------------
I recently sat down with Andy Goodman , designer and group director of Fjord’s US studios. Goodman has been designing and managing design teams around the globe for the past 20 years. Goodman is a contributor to Designing for Emerging Technologies — our conversation covers embeddables, wearables, and predictive analytics. To kick off the conversation, I asked Goodman to define “service design”: 
 “It’s well-known that if you ask a service designer to define “service design,” you get 10 different answers. You can actually design other things that are more about the way we live and work and play.” 
 (more…)

Mean Scored Summary
-------------
I recently sat down with Andy Goodman , designer and group director of Fjord’s US studios. Goodman is a contributor to Designing for Emerging Technologies — our conversation covers embeddables, wearables, and predictive analytics. To kick off the conversation, I asked Goodman to define “service design”: 
 “It’s well-known that if you ask a service designer to define “service design,” you get 10 different answers.

Bitcoin is just the first app to use blockchain technology
==========================================================

Top N Summary
-------------
Editor’s note: Lorne Lantz is a program co-chair for our O’Reilly Radar Summit: Bitcoin & the Blockchain on January 27, 2015, in San Francisco. For more on the program and for registration information, visit the Bitcoin & the Blockchain event website . The whole time I was sitting there, I thought these were a bunch of computer geeks playing around with nerd money. Instead, thousands of computers around the world verify transactions and manage a global decentralized ledger. This innovative technology is called the blockchain, and it provides a unique pathway that allows — for the first time — many computers that don’t trust each other to achieve consensus.

Mean Scored Summary
-------------
Editor’s note: Lorne Lantz is a program co-chair for our O’Reilly Radar Summit: Bitcoin & the Blockchain on January 27, 2015, in San Francisco. Instead, thousands of computers around the world verify transactions and manage a global decentralized ledger. This innovative technology is called the blockchain, and it provides a unique pathway that allows — for the first time — many computers that don’t trust each other to achieve consensus.

Blockchain scalability
======================

Top N Summary
-------------
Author note: Vitalik Buterin contributed to this article. Editor’s note: Kieren James-Lubin is a program co-chair for our O’Reilly Radar Summit: Bitcoin & the Blockchain on January 27, 2015, in San Francisco. For more on the program and for registration information, visit the Bitcoin & the Blockchain event website . In this article, we’ll explore several meanings of “blockchain scalability” and some high-level technical solutions to the issue. backward-incompatible change) to the bitcoin protocol.

Mean Scored Summary
-------------
Author note: Vitalik Buterin contributed to this article. Editor’s note: Kieren James-Lubin is a program co-chair for our O’Reilly Radar Summit: Bitcoin & the Blockchain on January 27, 2015, in San Francisco. For more on the program and for registration information, visit the Bitcoin & the Blockchain event website . In this article, we’ll explore several meanings of “blockchain scalability” and some high-level technical solutions to the issue. backward-incompatible change) to the bitcoin protocol.

Bringing an end to synthetic biology’s semantic debate
============================================================

Top N Summary
-------------
Editor’s note: this podcast is part of our investigation into synthetic biology and bioengineering . Tim Gardner , founder of Riffyn , has recently been working with the Synthetic Biology Working Group of the European Commission Scientific Committees to define synthetic biology, assess the risk assessment methodologies, and then describe research areas. I caught up with Gardner for this Radar Podcast episode to talk about the synthetic biology landscape and issues in research and experimentation that he’s addressing at Riffyn. We’ve wrapped it all together and said, ‘It basically advances in the capabilities of genetic engineering. '” 
 (more…)

Mean Scored Summary
-------------
Editor’s note: this podcast is part of our investigation into synthetic biology and bioengineering . Tim Gardner , founder of Riffyn , has recently been working with the Synthetic Biology Working Group of the European Commission Scientific Committees to define synthetic biology, assess the risk assessment methodologies, and then describe research areas. I caught up with Gardner for this Radar Podcast episode to talk about the synthetic biology landscape and issues in research and experimentation that he’s addressing at Riffyn.

Building and deploying large-scale machine learning pipelines
=============================================================

Top N Summary
-------------
There are many algorithms with implementations that scale to large data sets (this list includes matrix factorization, SVM, logistic regression, LASSO, and many others). In fact, machine learning experts are fond of pointing out: if you can pose your problem as a simple optimization problem then you’re almost done. Data scientists have to manage and maintain complex data projects , and the analytic problems they need to tackle usually involve specialized machine learning pipelines. Decisions at one stage affect things that happen downstream, so interactions between parts of a pipeline are an area of active research. In his Strata+Hadoop World New York presentation , UC Berkeley Professor Ben Recht described new UC Berkeley AMPLab projects for building and managing large-scale machine learning pipelines.

Mean Scored Summary
-------------
There are many algorithms with implementations that scale to large data sets (this list includes matrix factorization, SVM, logistic regression, LASSO, and many others). Data scientists have to manage and maintain complex data projects , and the analytic problems they need to tackle usually involve specialized machine learning pipelines. In his Strata+Hadoop World New York presentation , UC Berkeley Professor Ben Recht described new UC Berkeley AMPLab projects for building and managing large-scale machine learning pipelines.

Four short links: 22 January 2015
=================================

Top N Summary
-------------
Microsoft HoloLens Goggles (Wired) — a media release about the next thing from the person behind Kinect. Is it a games device like Kinect? The Facebook (YouTube) — brilliant fake 1995 ad for The Facebook. Just like the GUI overlapped and largely replaced the command line, NLP is now being used by robots, the Internet of things, wearables, and especially conversational systems like Apple’s Siri, Google’s Now, Microsoft’s Cortana, Nuance’s Nina, Amazon’s Echo and others. Microservices and Testing (Martin Fowler) — testing across component boundaries, in the face of failing data stores and HTTP timeouts.

Mean Scored Summary
-------------
Microsoft HoloLens Goggles (Wired) — a media release about the next thing from the person behind Kinect. The Facebook (YouTube) — brilliant fake 1995 ad for The Facebook. Just like the GUI overlapped and largely replaced the command line, NLP is now being used by robots, the Internet of things, wearables, and especially conversational systems like Apple’s Siri, Google’s Now, Microsoft’s Cortana, Nuance’s Nina, Amazon’s Echo and others. Microservices and Testing (Martin Fowler) — testing across component boundaries, in the face of failing data stores and HTTP timeouts.

How to make a UX designer
=========================

Top N Summary
-------------
In the case of Heather Wydeven , a UX designer at The Nerdery , she came to UX via theater and then graphic design. After spending several years working in theater, Wydeven decided to channel her creative skills into a career in graphic design. She came to UX design without even realizing what UX was, but the root of her motivation was something that’s familiar to many UX designers: a recognition that things could be better and a desire to solve problems. “While I was doing graphic design,” Wydeven said, “I started to become more curious about web design and UX design specifically, though at the time I didn’t know it was called ‘UX design.’ I was using websites and being frustrated about my experiences on those websites and thinking, ‘There’s got to be a way to make these better. This has got to be somebody’s job to design these websites better than they are now.’” (more…)

Mean Scored Summary
-------------
After spending several years working in theater, Wydeven decided to channel her creative skills into a career in graphic design. She came to UX design without even realizing what UX was, but the root of her motivation was something that’s familiar to many UX designers: a recognition that things could be better and a desire to solve problems.

The 3Ps of the blockchain: platforms, programs and protocols
============================================================

Top N Summary
-------------
Although it may be early to baptize new buzz lingo like “Blockchain as a Service” (BaaS) or “Blockchain as a Platform” (BaaP), there is a burgeoning landscape of various implementations and activity in and around the blockchain’s decentralized consensus protocol technologies. I’ve already covered the blockchain’s sweet spot as a development platform in “ Understanding the blockchain ,” so it is no surprise that its landscape will be made up of platforms, protocols, and (smart) programs. Breaking-up the bitcoin-blockchain paradigm 
 In a perfect world, we would have a single blockchain and a single cryptocurrency. But that doesn’t seem to be in the cards, whether it is technically feasible or not. Although wide-scale adoption and a critical mass of users aren’t there yet, the market is signaling for a diversification of choices, some based on the bitcoin currency and its blockchain protocol, and others not.

Mean Scored Summary
-------------
Although it may be early to baptize new buzz lingo like “Blockchain as a Service” (BaaS) or “Blockchain as a Platform” (BaaP), there is a burgeoning landscape of various implementations and activity in and around the blockchain’s decentralized consensus protocol technologies. Although wide-scale adoption and a critical mass of users aren’t there yet, the market is signaling for a diversification of choices, some based on the bitcoin currency and its blockchain protocol, and others not.

Four short links: 21 January 2015
=================================

Top N Summary
-------------
PC in a Mouse — 80s = PC in a keyboard. 2000s = PC in the screen. Estimating G+ Usage (BoingBoing) — of 2.2B profiles, 6.6M have made new public posts in 2015. Medium Data — too big for one machine, but barely worth the overhead of high-volume data processing. New Hardware for the DARPA Robotics Challenge Finals (IEEE) — in the future, we’ll all have a 3.7 kwh battery and a wireless router in our heads.

Mean Scored Summary
-------------
Estimating G+ Usage (BoingBoing) — of 2.2B profiles, 6.6M have made new public posts in 2015. Medium Data — too big for one machine, but barely worth the overhead of high-volume data processing. New Hardware for the DARPA Robotics Challenge Finals (IEEE) — in the future, we’ll all have a 3.7 kwh battery and a wireless router in our heads.

The Internet of Things is really about software
===============================================

Top N Summary
-------------
Download the free report The Internet of Things (IoT) is everywhere right now. It appeared on the cover of the Harvard Business Review  in November, and observers saw it in practically every demo at CES. A few years ago, many companies might plausibly have argued that they weren’t affected by developments in software. If you dealt in physical goods, it was hard to see how software that existed strictly in the virtual realm might touch your business. If you think of the IoT as a newly developing area in software, it’s easy to draw out some characteristics of it that are analogous to things we’ve seen in web software over the last decade or so.

Mean Scored Summary
-------------
Download the free report The Internet of Things (IoT) is everywhere right now. It appeared on the cover of the Harvard Business Review  in November, and observers saw it in practically every demo at CES. A few years ago, many companies might plausibly have argued that they weren’t affected by developments in software. If you dealt in physical goods, it was hard to see how software that existed strictly in the virtual realm might touch your business.

What containers can do for you
==============================

Top N Summary
-------------
If you read any IT news these days it’s hard to miss a headline about “the container revolution.” Docker’s year-and-a-half-old engine had a monopoly on the buzz until CoreOS launched its own project, Rocket , in December. The technology behind containers can seem esoteric, but the advantages of bringing containers to your organization are more compelling than ever. And containers’ inherent portability opens up exciting new opportunities for how organizations host their applications. Containerization is having its moment and there’s never been a better time to check it out for yourself. (more…)

Mean Scored Summary
-------------
If you read any IT news these days it’s hard to miss a headline about “the container revolution.” Docker’s year-and-a-half-old engine had a monopoly on the buzz until CoreOS launched its own project, Rocket , in December.

Four short links: 20 January 2015
=================================

Top N Summary
-------------
Matt Webb Joining British Govt Data Service — working on IoT for them. Phone/Skype calls, emails, and chats are all intensely mental activities, trying to picture the person behind the signal. MIT Faculty Search — two open gigs at MIT, one around climate change and one “undefined.” Great job ad. — evaluation of these systems, especially in the academic context, is lacking. When we actually look at performance, the benefits the scalable systems bring start to look much more sketchy.

Mean Scored Summary
-------------
Matt Webb Joining British Govt Data Service — working on IoT for them. Phone/Skype calls, emails, and chats are all intensely mental activities, trying to picture the person behind the signal. MIT Faculty Search — two open gigs at MIT, one around climate change and one “undefined.” Great job ad. When we actually look at performance, the benefits the scalable systems bring start to look much more sketchy.

Striking parallels between mathematics and software engineering
===============================================================

Top N Summary
-------------
Editor’s note: Alice Zheng will be part of the team teaching Large-scale Machine Learning Day at Strata + Hadoop World in San Jose. A wonderfully lucid textbook served as my guide: Introduction to Linear Algebra , written by Gilbert Strang. I was looking at various definitions — eigen decomposition, Jordan canonical forms, matrix inversions, etc. Prior to that moment, I had approached mathematics as if it were universal truth: transcendent in its perfection, almost unknowable by mere mortals. In that moment, mathematics went from being unknowable to reasonable.

Mean Scored Summary
-------------
Editor’s note: Alice Zheng will be part of the team teaching Large-scale Machine Learning Day at Strata + Hadoop World in San Jose. A wonderfully lucid textbook served as my guide: Introduction to Linear Algebra , written by Gilbert Strang. I was looking at various definitions — eigen decomposition, Jordan canonical forms, matrix inversions, etc. In that moment, mathematics went from being unknowable to reasonable.

Four short links: 19 January 2015
=================================

Top N Summary
-------------
Visiting the water cooler is fine, but somebody who spends all day there has no right to talk of being full. Google’s AI Brain — on the subject of Google’s AI ethics committee … Q: Will you eventually release the names? AVA is now Open Source (Laura Bell) — Assessment, Visualization and Analysis of human organisational information security risk. This map of people and interconnected entities can then be tested using a unique suite of customisable, on-demand, and scheduled information security awareness tests. Deep Learning for Torch (Facebook) — Facebook AI Research open sources faster deep learning modules for Torch , a scientific computing framework with wide support for machine learning algorithms .

Mean Scored Summary
-------------
Google’s AI Brain — on the subject of Google’s AI ethics committee … Q: Will you eventually release the names? AVA is now Open Source (Laura Bell) — Assessment, Visualization and Analysis of human organisational information security risk. This map of people and interconnected entities can then be tested using a unique suite of customisable, on-demand, and scheduled information security awareness tests. Deep Learning for Torch (Facebook) — Facebook AI Research open sources faster deep learning modules for Torch , a scientific computing framework with wide support for machine learning algorithms .
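
In the sample output above, the Mean Scored Summary is often shorter than the Top N Summary. A small worked example with made-up scores illustrates why. This sketch mirrors the two selection steps in `summarize`, but uses the standard-library `statistics` module (Python 3.4+) in place of `numpy`. The name `select_sentences`, its parameters, and the scores are hypothetical.

```python
import statistics


def select_sentences(scored_sentences, top_sentences=5, std_fraction=0.5):
    """Return (mean_scored, top_n_scored), both in document order."""
    scores = [score for (_, score) in scored_sentences]

    # Approach 1: keep sentences scoring above mean + fraction * std dev.
    # pstdev is the population std dev, like numpy.std with default arguments.
    threshold = statistics.mean(scores) + std_fraction * statistics.pstdev(scores)
    mean_scored = [(idx, score) for (idx, score) in scored_sentences
                   if score > threshold]

    # Approach 2: keep the highest-scoring sentences, then restore document order
    top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-top_sentences:]
    top_n_scored = sorted(top_n_scored, key=lambda s: s[0])

    return mean_scored, top_n_scored


# Made-up (sentence_index, score) pairs
scored = [(0, 0.8), (1, 2.0), (2, 0.5), (3, 3.0), (4, 1.0), (5, 0.2)]
mean_scored, top_n_scored = select_sentences(scored, top_sentences=3)
```

Here the mean is 1.25 and the standard deviation is about 0.96, so the threshold is about 1.73. `mean_scored` keeps only sentences 1 and 3, while `top_n_scored` keeps sentences 1, 3 and 4. The threshold adapts to how the scores are spread, whereas the top-N cut returns a fixed number of sentences whenever enough are available.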
