Text summarization: TextRank and feature-based

Import BeautifulSoup and urllib libraries to fetch data from Wikipedia.

from bs4 import BeautifulSoup
from urllib.request import urlopen

Function to get data from Wikipedia

def get_only_text(url):
    # Fetch the page and return its title plus the text of all <p> tags
    page = urlopen(url)
    soup = BeautifulSoup(page, "html.parser")
    text = ' '.join(map(lambda p: p.text, soup.find_all('p')))
    print(text)
    return soup.title.text, text

Mention the Wikipedia url

url = "https://en.wikipedia.org/wiki/Natural_language_processing"

Call the function created above (it returns a tuple of the page title and the text)

text = get_only_text(url)

Count the number of characters in the text

len(".join(text))
Result:
Out[74]: 8519

Let's look at the first 1,000 characters of the text

text[:1000]
Result:
Out[72]: ('Natural language processing - Wikipedia',
'Natural language processing (NLP) is an area of computer
science and artificial intelligence concerned with the
interactions between computers and human (natural) languages,
in particular how to program computers to process and analyze
large amounts of natural language\xa0data.\n Challenges
in natural language processing frequently involve speech
recognition, natural language understanding, and natural
language generation.\n The history of natural language
processing generally started in the 1950s, although work can be
found from earlier periods.\nIn 1950, Alan Turing published
an article titled "Computing Machinery and Intelligence"
which proposed what is now
called the Turing test as a criterion of intelligence.\n
The Georgetown experiment in 1954 involved fully automatic
translation of more than sixty Russian sentences into English.
The authors claimed that within three or five years, machine
translation would be a solved problem.[2] However, real
progress was '

Import the summarize and keywords functions from gensim (gensim's summarizer is based on the TextRank algorithm; the gensim.summarization module is available in gensim versions before 4.0)

from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords

Convert the returned tuple into a single string

text = str(text)

# Summarize the text with ratio=0.1 (keep roughly 10% of the original sentences)
summarize(text, ratio=0.1)
Result:
Out[77]: 'However, part-of-speech tagging introduced the use
of hidden Markov models to natural language processing, and
increasingly, research has focused on statistical models,
which make soft, probabilistic decisions based on attaching
real-valued weights to the features making up the input data.
\nSuch models are generally more robust when given unfamiliar
input, especially input that contains errors (as is very
common for real-world data), and produce more reliable results
when integrated into a larger system comprising multiple
subtasks.\n Many of the notable early successes occurred in
the field of machine translation, due especially to work at
IBM Research, where successively more complicated statistical
models were developed.'
That's it. Generating the summary is as simple as that. If you compare this
summary with the whole article, it captures the gist reasonably well, but
there is still plenty of room for improvement.
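
Since keywords was also imported above, here is a minimal sketch of extracting keyword phrases from the same text with gensim's keywords function (again assuming a gensim version before 4.0; ratio=0.1 is an arbitrary choice for illustration).

# Extract roughly the top 10% of words/phrases ranked as keywords
print(keywords(text, ratio=0.1))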

Method 4-2 Feature-based text summarization
Feature-based text summarization methods extract features from each sentence
and use them to score and rank the sentences by importance. Position, length,
term frequency, named entities, and many other features can be used to
calculate the score, as sketched below.
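
To make the idea concrete, here is a minimal, hypothetical sketch that scores sentences with just two features, term frequency and position. It is an illustration of the general approach only, not the implementation used by any particular library.

import re
from collections import Counter

def rank_sentences(text, top_n=3):
    # Crude sentence and word tokenization, for illustration only
    sentences = re.split(r'(?<=[.!?])\s+', text)
    freq = Counter(re.findall(r'\w+', text.lower()))
    scored = []
    for i, sent in enumerate(sentences):
        tokens = re.findall(r'\w+', sent.lower())
        if not tokens:
            continue
        tf_score = sum(freq[t] for t in tokens) / len(tokens)  # term-frequency feature
        position_score = 1.0 / (i + 1)                         # earlier sentences score higher
        scored.append((tf_score + position_score, sent))
    # Return the top_n highest-scoring sentences
    return [sent for _, sent in sorted(scored, reverse=True)[:top_n]]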
Luhn's algorithm is one of the classic feature-based algorithms, and the sumy
library implements it (LuhnSummarizer) along with several other summarizers.
The example below runs sumy's LSA summarizer on the same Wikipedia page; a
sketch of swapping in LuhnSummarizer follows the output.

Install sumy

!pip install sumy

Import the packages

from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
from sumy.summarizers.luhn import LuhnSummarizer

Extracting and summarizing

LANGUAGE = "english"
SENTENCES_COUNT = 10
url = "https://en.wikipedia.org/wiki/Natural_language_processing"

# Parse the HTML page directly from the URL
parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))

# Build the LSA summarizer with a stemmer and stop words for the language
summarizer = LsaSummarizer(Stemmer(LANGUAGE))
summarizer.stop_words = get_stop_words(LANGUAGE)

# Print the top SENTENCES_COUNT sentences
for sentence in summarizer(parser.document, SENTENCES_COUNT):
    print(sentence)
Result:
[2] However, real progress was much slower, and after the
ALPAC report in 1966, which found that ten-year-long research
had failed to fulfill the expectations, funding for machine
translation was dramatically reduced.
However, there is an enormous amount of non-annotated data
available (including, among other things, the entire content of
the World Wide Web ), which can often make up for the inferior
results if the algorithm used has a low enough time complexity
to be practical, which some such as Chinese Whispers do.
Since the so-called “statistical revolution”
in the late 1980s and mid 1990s, much natural language
processing research has relied heavily on machine learning .
Increasingly, however, research has focused on statistical
models , which make soft, probabilistic decisions based on
attaching real-valued weights to each input feature.
Natural language understanding Convert chunks of text into more
formal representations such as first-order logic structures
that are easier for computer programs to manipulate.
[18] ^ Implementing an online help desk system based on
conversational agent Authors: Alisa Kongthon, Chatchawal
Sangkeettrakarn, Sarawoot Kongyoung and Choochart
Haruechaiyasak.
[ self-published source ] ^ Chomskyan linguistics encourages
the investigation of " corner cases " that stress the limits of
its theoretical models (comparable to pathological phenomena
in mathematics), typically created using thought experiments ,
rather than the systematic investigation of typical phenomena
that occur in real-world data, as is the case in corpus
linguistics .
^ Antonio Di Marco - Roberto Navigili, “Clustering and
Diversifying Web Search Results with Graph Based Word Sense
Induction” , 2013 Goldberg, Yoav (2016).
Scripts, plans, goals, and understanding: An inquiry into human
knowledge structures ^ Kishorjit, N., Vidya Raj RK., Nirmal Y.,
and Sivaji B.
^ PASCAL Recognizing Textual Entailment Challenge (RTE-7)
https://tac.nist.gov//2011/RTE/ ^ Yi, Chucai; Tian, Yingli
(2012), “Assistive Text Reading from Complex Background for
Blind Persons” , Camera-Based Document Analysis and Recognition
, Springer Berlin Heidelberg, pp.
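
As promised above, here is a minimal sketch of swapping in the LuhnSummarizer that was already imported, reusing the same parser, stemmer, and stop words. The output will naturally differ from the LSA result shown above.

summarizer = LuhnSummarizer(Stemmer(LANGUAGE))
summarizer.stop_words = get_stop_words(LANGUAGE)
for sentence in summarizer(parser.document, SENTENCES_COUNT):
    print(sentence)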
