Text&Vision
📑 Text Intelligence
💯Lab_info
- NLTK - Natural Language Tool Kit
pip3 install numpy
pip3 install pandas
pip3 install nltk
import nltk
nltk.download()
# test NLTK - Basics
from nltk import sent_tokenize, word_tokenize, pos_tag
text = "Text and Vision Intelligence is a course that deals with interpreting texts and images computationally. This has become increasingly important in the last decade due to the large amount of texts and images online as well as offline."
print(sent_tokenize(text))
print(word_tokenize(text))
print(pos_tag(word_tokenize(text)))
❔ What is text intelligence?
The theory and practice of computationally extracting and comprehending knowledge from natural texts (which are human generated, so the information in them is unstructured).
It means extracting information for intelligence from free text.
🏷 Related names of text intelligence
Data Science (includes text processing)
Natural Language Processing (NLP)
Text Mining
Natural Language Engineering
Text Processing
🏷 Disciplines of Text Processing
Linguistics : How words, phrases, and sentences are formed.
Psycholinguistics : How people understand and communicate using a human language.
Computational linguistics : Deals with models and computational aspects of NLP.
Artificial intelligence : issues related to knowledge representation and reasoning.
NL Engineering : implementation of large, realistic systems.
❓ Why is it a difficult problem?
Free text contains unstructured information
High degree of ambiguity in naturally occurring texts.
Meaning derived from context
Context can be external to the text being processed.
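❓ Why do we need text intelligence?
Structured information is organized and hence easy to comprehend,
e.g. databases, spreadsheets and XML files.
Free text lacks this organization, so it has to be processed before it can be used in the same way.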
❓ Why Text Processing?
Human society took more than 300,000 years to create 12 exabytes of data (1 exabyte is 1 billion gigabytes).
We are expected to double that in the next 3 years!
Text processing is needed to take advantage of the vast amount of information encoded in natural languages, online as well as offline.
It is needed even to interface with the vast amount of organized information in databases.
It is needed to be able to communicate with machines using natural language.
🏷 Applications of NLP
Text-based applications:
finding documents on certain topics (document categorization)
information retrieval; search for keywords or concepts.
(free) information extraction; relevant to a topic.
text comprehension
translation from a language to another
summarization
knowledge management
Dialogue-based applications:
human-machine communication
question-answering
tutoring systems
problem solving
Speech processing:
Voice to text and vice versa conversions
🏷Levels of language processing
Phonetic
Morphological Knowledge
Syntactic Knowledge
Semantic Knowledge
Pragmatic Knowledge
Discourse Knowledge
🌟Inference in Discourse Processing
There are several possible ways to interpret an utterance in context
We need to find the most likely interpretation
Discourse model provides a computational framework for this search
🌟Some Models of Discourse Structure
Investigation of lexical ▶️connectivity patterns as the reflection of discourse structure
Specification of a small set of ▶️rhetorical relations among discourse segments
Adaptation of the notion of ▶️grammar
Examination of ▶️intentions and relations among them as the foundation of discourse structure
🌟State of the art in NLP Research
ACL - Association for Computational Linguistics
AAAI - every year / IJCAI - every second year
MUC - Message Understanding Conference
DUC - Document Understanding Conference
SIGIR - Special Interest Group on Information Retrieval
📑 POS Tagging
💯Lab_info
- POS-tags
import nltk
#####################################
tokens = nltk.word_tokenize("AUT is in New Zealand")
postags = nltk.pos_tag(tokens)
print(postags)
#####################################
The code will give you a result similar to one shown below.
[('AUT', 'NNP'), ('is', 'VBZ'), ('in', 'IN'), ('New', 'NNP'), ('Zealand', 'NNP')]
#####################################
Instantiate the following taggers from NLTK.
a.Unigram tagger
b.TnT tagger
c.Perceptron tagger
d.CRF tagger
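To make the first item concrete, here is a minimal pure-Python sketch of the idea behind a unigram tagger: each word gets the tag it carried most often in the training data. The toy training sentences and the helper names `train_unigram_tagger` and `tag` are made up for illustration. In the lab itself you would use NLTK's own tagger classes (for example `nltk.UnigramTagger`) trained on a tagged corpus.

```python
from collections import Counter, defaultdict

# Toy, hand-made training data (hypothetical; a real lab would use a tagged corpus).
train_sents = [
    [("the", "DT"), ("talk", "NN"), ("was", "VBD"), ("boring", "JJ")],
    [("you", "PRP"), ("should", "MD"), ("talk", "VB"), ("more", "RBR")],
    [("the", "DT"), ("talk", "NN"), ("ended", "VBD")],
]

def train_unigram_tagger(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag_ in sent:
            counts[word.lower()][tag_] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag(model, tokens, default=None):
    """Tag tokens by dictionary lookup; unseen words get the default tag."""
    return [(tok, model.get(tok.lower(), default)) for tok in tokens]

model = train_unigram_tagger(train_sents)
print(tag(model, ["The", "talk", "was", "long"]))
```

Because the lookup ignores context, "talk" is always tagged with its most frequent tag here, which is exactly the weakness that context-aware taggers try to address.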
import nltk
######################Simple Tagging###############################################################################################
text = nltk.word_tokenize("The city of Auckland is in New Zealand which is in the Pacific")
print(nltk.pos_tag(text))
# Many words can function in different roles, such as run,live and talk.
text = nltk.word_tokenize("The talk was boring")
print(nltk.pos_tag(text))
text = nltk.word_tokenize("You should talk more in class")
print(nltk.pos_tag(text))
############################################################################################################################
# The tags for tokens are computed from the context in which they appear.
# The text.similar() method takes a word w, finds all contexts w1 w w2, then finds all words w' that appear in the same context, i.e. w1 w' w2.
# You can allocate the same tag to w'.
text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
text.similar('woman')
print("-------------------------------------------------------------")
text.similar('bought')
print("-------------------------------------------------------------")
text.similar('over')
print("-------------------------------------------------------------")
text.similar('the')
###########################################################################################################################
#Representing tagged tokens
tagged_token = nltk.tag.str2tuple('fly/NN')
# tagged_token
# ('fly', 'NN')
print(tagged_token[0])
print(tagged_token[1])
#Reading tagged corpora
print(nltk.corpus.nps_chat.tagged_words())
print(nltk.corpus.conll2000.tagged_words())
print(nltk.corpus.treebank.tagged_words())
#Tagged corpora for several other languages are also available
print(nltk.corpus.sinica_treebank.tagged_words())
print(nltk.corpus.indian.tagged_words())
print(nltk.corpus.mac_morpho.tagged_words())
print(nltk.corpus.conll2002.tagged_words())
print(nltk.corpus.cess_cat.tagged_words())
# ##########################################################################################################################f
#
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
print(tag_fd.most_common())
tag_fd.plot(cumulative=False)
#
# ##########################################################################################################################f
#Lets see what parts of speech frequently occur before a noun
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
word_tag_pairs = nltk.bigrams(brown_news_tagged)
noun_preceders = [a[1] for (a, b) in word_tag_pairs if b[1] == 'NOUN']
print(noun_preceders)
fdist = nltk.FreqDist(noun_preceders)
print([tag for (tag, _) in fdist.most_common()])
# ##########################################################################################################################
#
#Explore the corpora
#Let's see which words most often follow the word "often". Verbs are the most frequent and nouns never even appear.
from nltk.corpus import brown
brown_learned_text = brown.words(categories='learned')
print(sorted(set(b for (a, b) in nltk.bigrams(brown_learned_text) if a == 'often')))
#Probably better to see the POS that follows the word "often"
brown_lrnd_tagged = brown.tagged_words(categories='learned', tagset='universal')
tags = [b[1] for (a, b) in nltk.bigrams(brown_lrnd_tagged) if a[0] == 'often']
fd = nltk.FreqDist(tags)
fd.tabulate()
#
# ##########################################################################################################################
#Look at trigram context, get all "verb TO verb" trigrams.
from nltk.corpus import brown
def process(sentence):
    for (w1,t1), (w2,t2), (w3,t3) in nltk.trigrams(sentence):
        if (t1.startswith('V') and t2 == 'TO' and t3.startswith('V')):
            print(w1, w2, w3)
for tagged_sent in brown.tagged_sents():
    process(tagged_sent)
# ##########################################################################################################################
#
#Lets look at words that are hardest to tag, ie, they are most ambiguous.
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
data = nltk.ConditionalFreqDist((word.lower(), tag) for (word, tag) in brown_news_tagged)
for word in sorted(data.conditions()):
    if len(data[word]) > 2:
        tags = [tag for (tag, _) in data[word].most_common()]
        print(word, ' '.join(tags))
#
# ##########################################################################################################################
#
# Indexed lists versus dictionaries
# Indexed list - a lookup table where the key is an index number and the entry is a string. E.g. a document represented as a list of words.
# Dictionary - again a table, but the lookup is done using a string and you get back a value, which can be a number or another string. E.g. a frequency distribution table.
# Eg of a dictionary
pos = {}
pos['ideas'] = 'N'
pos['sleep'] = 'V'
pos['furiously'] = 'ADV'
print(pos)
print(pos['ideas'])
print("\nUseful calls to know for dictionary iterations")
print(list(pos))
print(sorted(pos))
for word in sorted(pos):
    print(word + ":" + pos[word])
print(list(pos.keys()))
print(list(pos.values()))
print(list(pos.items()))
for key, val in sorted(pos.items()):
    print(key + ":", val)
##########################################################################################################################
#Lets use some datasets from NLTK
from collections import defaultdict
counts = defaultdict(int)
from nltk.corpus import brown
for (word, tag) in brown.tagged_words(categories='news', tagset='universal'):
    counts[tag] += 1
print(counts['NOUN'])
print(sorted(counts))
from operator import itemgetter
print(sorted(counts.items(), key=itemgetter(1), reverse=True))
##########################################################################################################################
#A handy trick to extract an element from a tuple
from nltk.corpus import brown
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
print(tags)
⚛️ POS tags
[Image unavailable in the source.]
⚛️ Syntactic tags
[Image unavailable in the source.]
😮 9 traditional word classes of parts of speech
Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction
💣Example
N noun chair, bandwidth, pacing
V verb study, debate, munch
ADJ adjective purple, tall, ridiculous
ADV adverb unfortunately, slowly
P preposition of, by, to, for, at
PRO pronoun I, me, mine, he his, her
DET determiner the, a, an,that, those
🏷Evaluation Metrics-1
TP - True Positives: items the machine labeled positive that the human also labeled positive.
FP - False Positives: items the machine labeled positive that the human labeled negative.
FN - False Negatives: items the machine labeled negative that the human labeled positive.
TN - True Negatives: items the machine labeled negative that the human also labeled negative.
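The four counts above are usually combined into precision, recall, F1 and accuracy. The sketch below uses the standard textbook definitions. The function names and the example counts are invented for illustration and do not come from the course material.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions; each ratio falls back to 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def accuracy(tp, fp, fn, tn):
    """Share of all decisions on which machine and human agree."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0

# Hypothetical counts for a tagger evaluated against human labels.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
acc = accuracy(tp=8, fp=2, fn=4, tn=86)
print(round(p, 3), round(r, 3), round(f1, 3), round(acc, 3))
```

With many true negatives, accuracy can look high even when recall is modest, which is one reason precision and recall are reported separately.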
🏷Evaluation Metrics-2
[Image unavailable in the source.]
📑 Vector Space Model/Similarity Computations
💯Lab_info
"""Convert words to vectors that can be used with classifiers"""
from sklearn.feature_extraction.text import CountVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]
# create the transform
vectorizer = CountVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
# encode document
vector = vectorizer.transform(text)
# summarize encoded vector
print(vector.shape)
print(vector.toarray())
print(vectorizer.vocabulary_)
#Try another sentence
text2 = ["the quick puppy"]
vector = vectorizer.transform(text2)
print(vector.toarray())
"""A plain bag-of-words (BOW) vector only counts tokens, so words that are common in every document (e.g. "the") dominate it.
TF-IDF instead weights each count by how rare the term is across documents.
It is calculated as:
Term frequency (normalized) = (Number of occurrences of a word) / (Total words in the document) : normalizes for the size of the document.
IDF(word) = log((Total number of documents) / (Number of documents containing the word)) : reduces the impact of words that are common across documents, e.g. "the".
TF-IDF is the product of the two."""
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog.",
"The dog.",
"The fox"]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# encode document
vector = vectorizer.transform([text[2]])
# summarize encoded vector
print(vector.shape)
print(vector.toarray())
""" Extracting n grams from text """
import nltk
text = nltk.word_tokenize("The quick brown fox jumped on the dog")
def find_bigrams(input_list):
    bigram_list = []
    for i in range(len(input_list)-1):
        bigram_list.append((input_list[i], input_list[i+1]))
    return bigram_list
#get individual items from the bigram
bigrams = find_bigrams(text)
print(bigrams)
print(bigrams[0][0])
print(bigrams[0][1])
#Now write a function to generate trigrams.
"""using the nltk ngrams function"""
from nltk import ngrams
sentence = 'The quick brown fox jumped over the dog.'
n = 6
sixgrams = ngrams(sentence.split(), n)
# Collect into a list; a separate name avoids shadowing the imported ngrams function.
ngram_list = []
for grams in sixgrams:
    ngram_list.append(grams)
print(ngram_list)
##########################################################################################################################
# Distance metrics
from nltk.metrics import *
s1 = "John went to town on a bike"
s2 = "Peter went to town in a bus"
print("Edit Distance same string: ", edit_distance(s1, s1))
print("Edit Distance: ", edit_distance(s1, s2))
# set(s1) is the set of characters in s1, so the next three compare character sets, not words.
print("Binary Distance: ", binary_distance(set(s1), set(s2)))
print("Jaccard Distance: ", jaccard_distance(set(s1), set(s2)))
print("Masi Distance: ", masi_distance(set(s1), set(s2)))
##########################################################################################################################
🏷BOW - Bag of words model
Vector representation does not consider the ordering of words in a document
"The dog bit the man" and "The man bit the dog" would have the same representation.
This is called the bag of words model.
We will see later that there are models that recover the positional information
However the BOW model is surprisingly effective in most situations.
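The word-order point can be checked with a few lines of plain Python. This is only a sketch: `bag_of_words` is a hypothetical helper that lowercases and splits on whitespace, which is much cruder than a real tokenizer.

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercased, whitespace-separated tokens; word order is discarded."""
    return Counter(text.lower().split())

a = bag_of_words("The dog bit the man")
b = bag_of_words("The man bit the dog")
print(a)
print(a == b)
```

Both sentences produce the same counts, so any classifier that sees only these vectors cannot tell who bit whom.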
🏷TF: Term Frequency
TF measures how frequently a term occurs in a document. Documents differ in length, so a term may appear many more times in a long document than in a short one. The term frequency is therefore often divided by the document length (i.e. the total number of terms in the document) as a way of normalization:
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
🏷IDF: Inverse Document Frequency
IDF measures how important a term is. When computing TF, all terms are considered equally important. However, certain terms such as “is”, “of”, and “that” may appear many times but have little importance. We therefore weigh down the frequent terms and scale up the rare ones by computing the following:
IDF(t) = log((Total number of documents) / (Number of documents containing term t))
🏷Tf-idf weighting scheme
The tf‐idf weight of a term is the product of its tf weight and its idf weight.
tf-idf = log(1+ tf) * log(N/df)
tf-idf = tf * log(N/df) - alternative
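As a worked example of the two weighting formulas above, the following sketch computes them by hand on a three-document toy corpus. It assumes the natural logarithm and raw counts for tf. The names `docs`, `df` and `tf_idf` are illustrative. scikit-learn's `TfidfVectorizer` applies a smoothed and normalized variant by default, so its numbers will differ from these.

```python
import math

# Toy corpus, already tokenized and lowercased.
docs = [
    "the quick brown fox jumped over the lazy dog".split(),
    "the dog".split(),
    "the fox".split(),
]
N = len(docs)

def df(term):
    """Number of documents that contain the term."""
    return sum(1 for doc in docs if term in doc)

def tf_idf(term, doc, log_tf=False):
    """tf * log(N/df), or log(1 + tf) * log(N/df) when log_tf is True."""
    tf = doc.count(term)
    if tf == 0:
        return 0.0  # the term is absent from this document
    tf_weight = math.log(1 + tf) if log_tf else tf
    return tf_weight * math.log(N / df(term))

for term in ["the", "fox", "quick"]:
    print(term, round(tf_idf(term, docs[0]), 3))
```

"the" occurs in every document, so log(N/df) is log(1) = 0 and its weight vanishes even though it is the most frequent token in the first document.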
🏷Euclidean Distance
It is the usual default distance measure.
Given two documents da and db represented by their term vectors ta and tb, the Euclidean distance between the two documents is defined as:
D(ta, tb) = sqrt( Σ_t (w(t,a) − w(t,b))² ), where w(t,a) is the weight of term t in document da.
🏷Cosine Distance
Cosine similarity is defined as the cosine of the angle between two vectors; the cosine distance is 1 minus this similarity.
cos(ta, tb) = (ta · tb) / (|ta| × |tb|)
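A small sketch of the Euclidean and cosine measures on plain Python lists follows. The vectors `ta` and `tb` are made-up term-count vectors over the same vocabulary, and the function names are illustrative.

```python
import math

def euclidean_distance(ta, tb):
    """Square root of the summed squared differences of term weights."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ta, tb)))

def cosine_similarity(ta, tb):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(ta, tb))
    norm = math.sqrt(sum(a * a for a in ta)) * math.sqrt(sum(b * b for b in tb))
    return dot / norm if norm else 0.0

# tb is ta with every count doubled, e.g. the same text repeated twice.
ta = [1, 1, 0, 2]
tb = [2, 2, 0, 4]
print(euclidean_distance(ta, tb))
print(cosine_similarity(ta, tb))
print(1 - cosine_similarity(ta, tb))  # cosine distance
```

The Euclidean distance is non-zero because the documents differ in length, while the cosine similarity is (up to rounding) 1 because the two vectors point in the same direction.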
🏷Jaccard Coefficient
Compares the sum weight of shared terms to the sum weight of terms that are present in either of the two documents but are not shared.
J(ta, tb) = (ta · tb) / (|ta|² + |tb|² − ta · tb)
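One common way to compute the Jaccard coefficient on weighted term vectors is the extended (Tanimoto) form sketched below. This is an assumption about which variant is meant. With 0/1 vectors it reduces to the familiar set form, shared terms divided by all distinct terms. The function name and example vectors are illustrative.

```python
def jaccard_coefficient(ta, tb):
    """Extended Jaccard: dot / (|ta|^2 + |tb|^2 - dot)."""
    dot = sum(a * b for a, b in zip(ta, tb))
    denom = sum(a * a for a in ta) + sum(b * b for b in tb) - dot
    return dot / denom if denom else 0.0  # convention for two all-zero vectors

# Binary vectors over a 4-term vocabulary: two shared terms, four distinct terms.
ta = [1, 1, 0, 1]
tb = [1, 0, 1, 1]
print(jaccard_coefficient(ta, tb))
print(jaccard_coefficient(ta, ta))
```

A value of 1 means the two vectors are identical and 0 means they share no terms.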
🏷Levenshtein Distance
Also known as the edit distance.
It is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into another.
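The definition can be turned into a short dynamic-programming routine. This is a standard textbook formulation written as a sketch, not the NLTK implementation; `levenshtein` is an illustrative name.

```python
def levenshtein(s, t):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))
print(levenshtein("bike", "bus"))
```

Only two rows of the table are kept at a time, so memory grows with the length of the second string rather than with the product of both lengths.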
🏷Hamming Distance
The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols differ.
I.e. the minimum number of substitutions required to change one string into the other.
Or (originally) the minimum number of errors that could have transformed one string into the other.
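A minimal sketch of the Hamming distance follows. The function name is illustrative, and raising an error for unequal lengths is a design choice of this sketch, made because the measure is only defined for equal-length strings.

```python
def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(s, t))

print(hamming("karolin", "kathrin"))
print(hamming("1011101", "1001001"))
```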
📑 Information Extraction
Methods:
Named Entity Recognition
Relation Detection and Classification
Event Processing
Temporal Processing
Author/source detection
Main concept/theme detection and tracking
Specific information tracking
💯Lab_info
NER using HMM Learner
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree
import nltk
def get_continuous_chunks(text):
    """Return the distinct named entities in text, merging adjacent entity chunks."""
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []
    for node in chunked:
        if isinstance(node, Tree):
            # Named-entity subtree: collect its tokens.
            current_chunk.append(" ".join(token for token, pos in node.leaves()))
        elif current_chunk:
            # A non-entity token ends the current run of entity chunks.
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []
    # Flush an entity that ends the text.
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)
    return continuous_chunk

txt = "Jacinda Ardern is the Prime Minister of New Zealand but Roenzo isn't."
print(get_continuous_chunks(txt))

# Print each entity together with its label.
for sent in nltk.sent_tokenize(txt):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))
🌟Named Entity Recognition(NER)
- Rule Based NER (1)
Create regular expressions to extract:
Telephone numbers
Capitalized names
E.g. blocks of digits separated by hyphens:
RegEx = (\d+\-)+\d+
matches valid phone numbers like 0210-126-1125 and 09-816-225
incorrectly extracts social security numbers such as 123-45-6789
fails to identify numbers like 800.865.1125 and (800)865-CARE
Improved RegEx = (\d{3}[-.\ ()]){1,2}[\dA-Z]{4}
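The two patterns can be tried out with Python's `re` module. The sketch below uses `search`, so it reports the first matching substring rather than requiring the whole string to match. The sample list and the helper `first_match` are illustrative.

```python
import re

simple = re.compile(r"(\d+\-)+\d+")
improved = re.compile(r"(\d{3}[-.\ ()]){1,2}[\dA-Z]{4}")

samples = ["0210-126-1125", "123-45-6789", "800.865.1125", "(800)865-CARE"]

def first_match(pattern, text):
    """Return the first substring matched by pattern, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None

for s in samples:
    print(s, "| simple:", first_match(simple, s), "| improved:", first_match(improved, s))
```

Running this suggests the improved pattern is still only an approximation: it rejects the social security number and accepts the dotted and lettered forms, but it picks up just part of 0210-126-1125, because it insists on groups of exactly three digits.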
- Rule Based NER (2)
Regular expressions provide a flexible way to match strings of text, such as particular characters, words, or patterns of characters
Perl RegEx (similar to grep regex and python)
\w (word char) any alpha-numeric
\d (digit char) any digit
\s (space char) any whitespace
. (wildcard) anything (single character)
\b word boundary
^ beginning of string
$ end of string
? 0 or 1 occurrences
+ 1 or more occurrences
specific range of number of occurrences: {min,max}.
A{1,5} One to five A’s.
A{5,} Five or more A’s
A{5} Exactly five A’s
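The character classes and quantifiers listed above behave the same way in Python's `re` module. A quick sketch, where `matches` is an illustrative helper that requires the whole string to match:

```python
import re

def matches(pattern, text):
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(r"A{1,5}", "AAA"))          # one to five A's
print(matches(r"A{1,5}", "AAAAAA"))       # six A's is too many
print(matches(r"A{5,}", "AAAAAAA"))       # five or more A's
print(matches(r"A{5}", "AAAA"))           # exactly five A's required
print(matches(r"colou?r", "color"))       # ? makes the u optional
print(matches(r"\d+\s\w+", "42 apples"))  # digits, one whitespace, word characters
```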
- Rule Based NER (3)
Create rules to extract locations
Capitalized word + {city, center, river} indicates location
Ex. New York city
Hudson river
Capitalized word + {street, boulevard, avenue} indicates location
Ex. Fifth avenue
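The two location rules can be sketched as a single regular expression: one or more capitalized words followed by one of the trigger words. The pattern and the example sentence are illustrative and make no attempt to handle the harder cases.

```python
import re

# One or more capitalized words, then a location trigger word.
location = re.compile(
    r"\b(?:[A-Z][a-z]+\s)+(?:city|center|river|street|boulevard|avenue)\b"
)

text = "She walked from Fifth avenue to the Hudson river in New York city."
found = [m.group(0) for m in location.finditer(text)]
print(found)
```

A rule like this also fires on any capitalized sentence-initial word that happens to precede a trigger word (for instance "The river"), which is one reason capitalization alone is not a reliable signal.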
- Rule Based NER (4)
Use context patterns
[PERSON] earned [MONEY]
Ex. Frank earned $20
[PERSON] joined [ORGANIZATION]
Ex. Sam joined IBM
[PERSON],[JOBTITLE]
Ex. Mary, the teacher
Still not so simple:
[PERSON|ORGANIZATION|ANIMAL] fly to [LOCATION|PERSON|EVENT]
Ex. Jerry flew to Japan
Sarah flies to the party
Delta flies to Europe
bird flies to trees
bee flies to the wood
❓Why would simple things not work?
- Capitalization is a strong indicator for capturing proper names, but it can be tricky
first word of a sentence is capitalized
sometimes titles in web pages are all capitalized
nested named entities contain non-capital words
University of Southern California is Organization
all nouns in German are capitalized
Tweets/Micro-blogs have “loose” capitalization
- No lexicon contains all existing proper names.
- New proper names constantly emerge
movie titles
books
singers
restaurants
etc.
💠Learning System
- Supervised learning
labeled training examples
methods: Hidden Markov Models, k-Nearest Neighbors, Decision Trees, AdaBoost, SVM, NN…
example: NE recognition, POS tagging, Parsing
- Unsupervised learning
labels must be automatically discovered
method: clustering
example: NE disambiguation, text classification
- Semi-supervised learning
small percentage of training examples are labeled, the rest is unlabeled
methods: bootstrapping, active learning, co-training, self-training
example: NE recognition, POS tagging, Parsing, …
❗️Two stage NER - NEI and NEC
NEI
: Identify named entities using BIO tags
B beginning of an entity
I continues the entity
O word outside the entity
NEC
: Classify into a predefined set of categories
Person names
Organizations (companies, governmental organizations, etc.)
Locations (cities, countries, etc.)
Miscellaneous (movie titles, sport events, etc.)
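For the first stage (NEI), the B/I/O tags can be decoded back into entity spans with a small loop. The tokens and tags below are hand-written for illustration and are not the output of a trained tagger. `bio_to_spans` is a hypothetical helper. The second stage (NEC) would then assign each span to one of the categories listed above.

```python
def bio_to_spans(tokens, tags):
    """Group tokens into entity strings according to their B/I/O tags."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":                 # a new entity starts here
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:   # continue the open entity
            current.append(token)
        else:                          # "O", or a stray "I" with no open entity
            if current:
                spans.append(" ".join(current))
            current = []
    if current:                        # entity that ends the sentence
        spans.append(" ".join(current))
    return spans

tokens = ["Jacinda", "Ardern", "is", "the", "Prime", "Minister", "of", "New", "Zealand"]
tags = ["B", "I", "O", "O", "O", "O", "O", "B", "I"]
print(bio_to_spans(tokens, tags))
```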
🍬Decision Trees
- The classifier has a tree structure, where each node is either:
a leaf node which indicates the value of the target attribute (class) of examples
OR
a decision node which specifies some test to be carried out on a single attribute-value, with one branch and sub-tree for each possible outcome of the test
- An instance xp is classified by starting at the root of the tree and moving through it until a leaf node is reached, which provides the classification of the instance
🤒Building Decision Trees
- Select which attribute to test at each node in the tree.
- The goal is to select the attribute that is most useful for classifying examples.
- Top-down, greedy search through the space of possible decision trees. It picks the best attribute and never looks back to reconsider earlier choices.
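The notes do not say how "most useful" is measured. One common choice, assumed here, is information gain as used by ID3: pick the attribute whose split reduces the entropy of the class labels the most. The toy rows and attribute names below are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical examples: is a token part of a name?
rows = [
    {"capitalized": True, "first_in_sentence": True, "is_name": "yes"},
    {"capitalized": True, "first_in_sentence": False, "is_name": "yes"},
    {"capitalized": False, "first_in_sentence": True, "is_name": "no"},
    {"capitalized": False, "first_in_sentence": False, "is_name": "no"},
]

attributes = ["capitalized", "first_in_sentence"]
gains = {a: information_gain(rows, a, "is_name") for a in attributes}
best = max(attributes, key=gains.get)
print(gains, best)
```

A greedy tree builder would test `best` at the root, split the rows on it, and repeat the same selection inside each branch without revisiting the earlier choice.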
📑 Formal Grammar CFG/Dependency Parsing
💯Lab_info
# #Collect all nouns and their modifiers
# import spacy
# nlp = spacy.load('en_core_web_sm')
# doc = nlp("Wall Street Journal just published an interesting piece on crypto currencies")
# for chunk in doc.noun_chunks:
# print(chunk.text, chunk.label_, chunk.root.text)
""" Dep parsing example"""
import spacy
"""
You will need to install the following particular version of spacy.
pip3 install nltk
pip install spacy==2.3.5
pip install
You will also need to install en_core_web_sm using the following.
pip3 install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz
"""
nlp = spacy.load('en_core_web_sm')
# doc = nlp('John ate icecream and Peter ate apple')
# doc = nlp('Wall Street Journal just published an interesting piece on crypto currencies')
doc = nlp('A man with a knife and a boy hit the dazed shopkeeper on the head yesterday.')
for token in doc:
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))
🍊Dependency Grammars
- In CFG-style phrase-structure grammars the main focus is on constituents.
- But it turns out you can get a lot done with just binary relations among the words in an utterance.
- In a dependency grammar framework, a parse is a tree where
the nodes stand for the words in an utterance
The links between the words represent dependency relations between pairs of words.
Relations may be typed (labeled), or not.
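A dependency parse of this kind can be stored as one head index and one relation label per word. The sketch below uses a hand-annotated parse of a three-word sentence. It is an illustration of the data structure, not the output of a parser, and the labels follow the common nsubj/dobj naming.

```python
# Hand-annotated parse of "John ate icecream" (illustrative, not parser output).
words = ["John", "ate", "icecream"]
heads = [1, -1, 1]                 # index of each word's head; -1 marks the root
labels = ["nsubj", "ROOT", "dobj"]

def arcs(words, heads, labels):
    """Return (head word, relation, dependent word) triples."""
    result = []
    for i, (head, label) in enumerate(zip(heads, labels)):
        head_word = "ROOT" if head == -1 else words[head]
        result.append((head_word, label, words[i]))
    return result

for head_word, label, dependent in arcs(words, heads, labels):
    print(f"{head_word} --{label}--> {dependent}")
```

Every word has exactly one head and exactly one word is the root, which is what makes the structure a tree over the words themselves rather than over phrases.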
- Dependency Relations
[Images unavailable in the source.]
🆚Dependency parsing vs. CFG parsing
- CFG
[Image unavailable in the source.]
- Dependency
[Image unavailable in the source.]
💛Dependency Parsing
Each linguistic word is connected via a directed link.
The parse tree captures the (unidirectional) relationship between words and phrases.
[Image unavailable in the source.]
😒A typical Information Extraction task
[Image unavailable in the source.]
😎Summary
Context-free grammars can be used to model various facts about the syntax of a language.
When paired with parsers, such grammars constitute a critical component in many applications.
Constituency is a key phenomenon easily captured with CFG rules.
Dependency parsing is based on words and their binary relations and is easier to do than CFG parsing.
It carries less information, but that is sufficient for most applications.