Text summarization: TextRank and feature-based

Import BeautifulSoup and urllib libraries to fetch data from Wikipedia.

from bs4 import BeautifulSoup
from urllib.request import urlopen

Function to get data from Wikipedia

def get_only_text(url):
    # Fetch the page and return its title plus the text of all <p> tags
    page = urlopen(url)
    soup = BeautifulSoup(page, "html.parser")
    text = ' '.join(map(lambda p: p.text, soup.find_all('p')))
    print(text)
    return soup.title.text, text

Mention the Wikipedia url

url = "https://en.wikipedia.org/wiki/Natural_language_processing"

Call the function created above (it returns a tuple of the page title and the text)

text = get_only_text(url)

Count the number of characters in the text

len(".join(text))
Result:
Out[74]: 8519

Let's look at the first 1,000 characters of the text

text[:1000]
Result:
Out[72]: ('Natural language processing - Wikipedia',
'Natural language processing (NLP) is an area of computer
science and artificial intelligence concerned with the
interactions between computers and human (natural) languages,
in particular how to program computers to process and analyze
large amounts of natural language\xa0data.\n Challenges
in natural language processing frequently involve speech
recognition, natural language understanding, and natural
language generation.\n The history of natural language
processing generally started in the 1950s, although work can be
found from earlier periods.\nIn 1950, Alan Turing published
an article titled "Computing Machinery and Intelligence"
which proposed what is now
called the Turing test as a criterion of intelligence.\n
The Georgetown experiment in 1954 involved fully automatic
translation of more than sixty Russian sentences into English.
The authors claimed that within three or five years, machine
translation would be a solved problem.[2] However, real
progress was '

Import the summarize and keywords functions from gensim (gensim's summarizer is based on the TextRank algorithm; the gensim.summarization module is available in gensim versions before 4.0)

from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords

Convert the returned tuple into a single string

text = str(text)

# Summarize the text with ratio=0.1 (keep roughly 10% of the original sentences)
summarize(text, ratio=0.1)
Result:
Out[77]: 'However, part-of-speech tagging introduced the use
of hidden Markov models to natural language processing, and
increasingly, research has focused on statistical models,
which make soft, probabilistic decisions based on attaching
real-valued weights to the features making up the input data.
\nSuch models are generally more robust when given unfamiliar
input, especially input that contains errors (as is very
common for real-world data), and produce more reliable results
when integrated into a larger system comprising multiple
subtasks.\n Many of the notable early successes occurred in
the field of machine translation, due especially to work at
IBM Research, where successively more complicated statistical
models were developed.'
That's it. Generating the summary is as simple as that. If you compare this
summary with the whole article, it captures the gist reasonably well, but
there is still plenty of room for improvement.
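
Since keywords was also imported above, here is a minimal sketch of extracting keyword phrases from the same text with gensim's keywords function (again assuming a gensim version before 4.0; ratio=0.1 is an arbitrary choice for illustration).

# Extract roughly the top 10% of words/phrases ranked as keywords
print(keywords(text, ratio=0.1))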

Method 4-2 Feature-based text summarization
Feature-based text summarization methods extract features from each sentence
and use them to score and rank the sentences by importance. Position, length,
term frequency, named entities, and many other features can be used to
calculate the score, as sketched below.
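
To make the idea concrete, here is a minimal, hypothetical sketch that scores sentences with just two features, term frequency and position. It is an illustration of the general approach only, not the implementation used by any particular library.

import re
from collections import Counter

def rank_sentences(text, top_n=3):
    # Crude sentence and word tokenization, for illustration only
    sentences = re.split(r'(?<=[.!?])\s+', text)
    freq = Counter(re.findall(r'\w+', text.lower()))
    scored = []
    for i, sent in enumerate(sentences):
        tokens = re.findall(r'\w+', sent.lower())
        if not tokens:
            continue
        tf_score = sum(freq[t] for t in tokens) / len(tokens)  # term-frequency feature
        position_score = 1.0 / (i + 1)                         # earlier sentences score higher
        scored.append((tf_score + position_score, sent))
    # Return the top_n highest-scoring sentences
    return [sent for _, sent in sorted(scored, reverse=True)[:top_n]]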
Luhn's algorithm is one of the classic feature-based algorithms, and the sumy
library implements it (LuhnSummarizer) along with several other summarizers.
The example below runs sumy's LSA summarizer on the same Wikipedia page; a
sketch of swapping in LuhnSummarizer follows the output.

Install sumy

!pip install sumy

Import the packages

from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
from sumy.summarizers.luhn import LuhnSummarizer

Extracting and summarizing

LANGUAGE = "english"
SENTENCES_COUNT = 10
url = "https://en.wikipedia.org/wiki/Natural_language_processing"

# Parse the HTML page directly from the URL
parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))

# Build the LSA summarizer with a stemmer and stop words for the language
summarizer = LsaSummarizer(Stemmer(LANGUAGE))
summarizer.stop_words = get_stop_words(LANGUAGE)

# Print the top SENTENCES_COUNT sentences
for sentence in summarizer(parser.document, SENTENCES_COUNT):
    print(sentence)
Result:
[2] However, real progress was much slower, and after the
ALPAC report in 1966, which found that ten-year-long research
had failed to fulfill the expectations, funding for machine
translation was dramatically reduced.
However, there is an enormous amount of non-annotated data
available (including, among other things, the entire content of
the World Wide Web ), which can often make up for the inferior
results if the algorithm used has a low enough time complexity
to be practical, which some such as Chinese Whispers do.
Since the so-called “statistical revolution”
in the late 1980s and mid 1990s, much natural language
processing research has relied heavily on machine learning .
Increasingly, however, research has focused on statistical
models , which make soft, probabilistic decisions based on
attaching real-valued weights to each input feature.
Natural language understanding Convert chunks of text into more
formal representations such as first-order logic structures
that are easier for computer programs to manipulate.
[18] ^ Implementing an online help desk system based on
conversational agent Authors: Alisa Kongthon, Chatchawal
Sangkeettrakarn, Sarawoot Kongyoung and Choochart
Haruechaiyasak.
[ self-published source ] ^ Chomskyan linguistics encourages
the investigation of " corner cases " that stress the limits of
its theoretical models (comparable to pathological phenomena
in mathematics), typically created using thought experiments ,
rather than the systematic investigation of typical phenomena
that occur in real-world data, as is the case in corpus
linguistics .
^ Antonio Di Marco - Roberto Navigili, “Clustering and
Diversifying Web Search Results with Graph Based Word Sense
Induction” , 2013 Goldberg, Yoav (2016).
Scripts, plans, goals, and understanding: An inquiry into human
knowledge structures ^ Kishorjit, N., Vidya Raj RK., Nirmal Y.,
and Sivaji B.
^ PASCAL Recognizing Textual Entailment Challenge (RTE-7)
https://tac.nist.gov//2011/RTE/ ^ Yi, Chucai; Tian, Yingli
(2012), “Assistive Text Reading from Complex Background for
Blind Persons” , Camera-Based Document Analysis and Recognition
, Springer Berlin Heidelberg, pp.
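
As promised above, here is a minimal sketch of swapping in the LuhnSummarizer that was already imported, reusing the same parser, stemmer, and stop words. The output will naturally differ from the LSA result shown above.

summarizer = LuhnSummarizer(Stemmer(LANGUAGE))
summarizer.stop_words = get_stop_words(LANGUAGE)
for sentence in summarizer(parser.document, SENTENCES_COUNT):
    print(sentence)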
