Web Mining Review Notes 3: Web Content Mining

3. Web Content Mining

3.1 Introduction to Sentiment Analysis / Opinion Mining

Detection of stances and opinions towards people, companies, and products/services has a tremendous business value: Improving products and services, targeted advertising, revealing trends in election campaigns, …

Sentiment analysis or opinion mining is the computational study of people’s opinions, appraisals, attitudes, and emotions towards entities, individuals, issues, events, topics, and their attributes (aspects).

A general sentiment analysis framework aims to answer

  1. Who is the opinion holder? -> Opinion holder
  2. Towards whom or what is opinion/sentiment expressed? -> Target
  3. What is the polarity and intensity of the opinion?
  4. Is an opinion associated with a time-span?


3.2 Constructing Sentiment Lexicons

Sentiment clues (opinion words, sentiment-bearing words) – words and phrases used to express some desired or undesired state
Positive clues: good, amazing, beautiful
Negative clues: bad, awful, terrible, poor

Sentiment clues are often domain-dependent => Separate sentiment lexicons need to be constructed for different domains
Example: Quiet speaker phone vs. quiet car engine

3.2.1 Automated acquisition of sentiment lexicons

Automated acquisition of sentiment lexicon is most often semi-supervised (or weakly supervised)

  1. Start from a small seed lexicon of sentiment words
  2. Iteratively augment the lexicon based on links between words already in the lexicon and words in the large general lexicon or large corpus
  3. Stop when there are no more reliable candidate words to be added to the lexicon

Approaches for constructing sentiment lexicons are either Dictionary-based or Corpus-based

Often there is a final step of manual cleansing of automatically derived sentiment lexicons

3.2.1.1 Dictionary-Based Sentiment Lexicon Acquisition

Bootstrapping using a small seed sentiment lexicon, e.g., 10 positive and 10 negative sentiment words.
Idea: exploit semantic links between words in the general lexicon, e.g., synonymy and antonymy links in WordNet. The procedure is typically iterative.
Additional information can be used to build better lists: WordNet glosses or machine learning (classification based on concept definitions).
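Below is a minimal sketch of this dictionary-based bootstrapping idea, assuming NLTK’s WordNet interface; the seed words and iteration count are illustrative choices, not part of any specific published procedure.

```python
from nltk.corpus import wordnet as wn  # requires a one-time nltk.download("wordnet")

def expand_lexicon(seed_pos, seed_neg, iterations=3):
    """Grow positive/negative lexicons via WordNet synonymy/antonymy links."""
    pos, neg = set(seed_pos), set(seed_neg)
    for _ in range(iterations):
        new_pos, new_neg = set(), set()
        for words, same, opposite in [(pos, new_pos, new_neg), (neg, new_neg, new_pos)]:
            for word in words:
                for synset in wn.synsets(word):
                    for lemma in synset.lemmas():
                        same.add(lemma.name())           # synonyms keep the polarity
                        for ant in lemma.antonyms():     # antonyms flip the polarity
                            opposite.add(ant.name())
        pos |= new_pos - neg   # crude conflict handling: keep earlier labels
        neg |= new_neg - pos
    return pos, neg

positive, negative = expand_lexicon({"good", "amazing"}, {"bad", "awful"})
```

As noted above, a final manual cleansing step of the automatically derived lexicon is usually still needed.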

Cons:

  1. Limited Coverage: they may miss out on nuanced or domain-specific sentiments.
  2. Lack of Context Understanding: These approaches often treat words in isolation without considering their context.
  3. Difficulty Handling Negations and Modifiers: Sentiment analysis dictionaries may struggle with handling negations (e.g., “not good”) or modifiers (e.g., “very good”).
  4. Limited Adaptability: Dictionary-based approaches may not easily adapt to new domains or languages without significant manual effort to update or create new sentiment lexicons.
  5. Vulnerability to Ambiguity: Some words may have multiple meanings or sentiments depending on the context, making it challenging for dictionary-based approaches to accurately capture their sentiment.
  6. Difficulty with Sarcasm and Irony: Sentiment dictionaries may struggle to detect sarcasm, irony, or other forms of figurative language, which can lead to misinterpretations of sentiment.

SentiWordNet is a general sentiment lexicon derived from WordNet. It contains automated annotations of all WordNet synsets with sentiment scores.

3.2.1.2 Corpus-Based Sentiment Lexicon Acquisition

Methodologically, corpus-based induction of sentiment lexicons resembles the dictionary-based approach: semi-supervised learning from small initial seed sets and graph-based propagation of positive and negative sentiment.
Difference:
The graph for label propagation is computed from word co-occurrences in a large corpus.
The resulting lexicon is specific to the domain of the corpus.

Some (simple) approaches:
(1) Sentiment consistency, conjunction of adjectives (Hatzivassiloglou & McKeown, 1997)
Adjectives conjoined by “and” have the same polarity; adjectives conjoined by “but” do not.

Step 1: Label seed set of 1336 adjectives
Step 2: Expand seed set to conjoined adjectives (look in the corpus)
Step 3: Supervised classifier assigns “polarity similarity” to word pair
Step 4: Clustering for partitioning the graph into two

(2) Pointwise mutual information (PMI) of candidate words with seed set words (Turney & Littman, 2002)

Step 1: Extract a phrasal lexicon from reviews
Step 2: Learn polarity of each phrase
Step 3: Rate a review by the average polarity of its phrases

(figures: worked illustrations of Steps 1–3 and the PMI-based polarity computation)
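The PMI score used to learn phrase polarity in Step 2 follows the standard definition:

$$
\mathrm{PMI}(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}
$$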

(3) PMI-induced graph with PageRank label propagation and supervised learning (Glavaš and Šnajder, 2012)

3.3 Sentiment Classification

The goal is to classify an opinionated portion of text (e.g., product review) as expressing (dominantly) positive or negative sentiment.

Assumption: entire text portion addresses a single entity (Holds for product reviews but not for social media posts)

Capturing the overall sentiment expressed toward the entity; sentiment toward specific aspects of the entity is ignored.

Methodological approaches:

  1. Supervised learning (i.e., supervised text classification; dominantly)
  2. Unsupervised learning

3.3.1 Supervised sentiment classification

Typically formulated as a ternary (Positive, Negative, Neutral) text classification task
Training and testing data – typically product reviews

Classification:

  • Feature-design algorithms
    The usual suspects: logistic regression, SVM, …
    Features
    • Bag of words, POS tags, opinion clues and phrases (from dictionary)
    • Negations (change opinion orientation) and syntactic dependencies
  • Semantic representation-based algorithms
    • CNNs, RNNs, Autoencoders, Recursive NN (for sentiment classification)
    • Raw text input (word or character embeddings), no need for manually designed features
3.3.1.1 Logistic Regression

Intro to logistic regression

The linear combination of features and coefficients isn’t a probability, it’s just a real number z -> use a function of z that goes from 0 to 1: the logistic (sigmoid) function.
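In the standard formulation, with weights $\mathbf{w}$, bias $b$, and feature vector $\mathbf{x}$:

$$
z = \mathbf{w} \cdot \mathbf{x} + b, \qquad \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad P(y = 1 \mid x) = \sigma(\mathbf{w} \cdot \mathbf{x} + b)
$$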

Two phases of logistic regression
Training: we learn weights w using stochastic gradient descent and cross-entropy loss.
Test: Given a test example x we compute p(y|x) using learned weights w, and return whichever label (y = 1 or y = 0) has higher probability.

computing probabilities

using the output of the sigmoid as a classifier
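A minimal sketch of such a classifier with bag-of-words features, using scikit-learn; the toy reviews and labels below are made up purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data (illustrative only): 1 = positive, 0 = negative
reviews = ["great phone, amazing battery", "terrible screen, awful support",
           "beautiful design and good value", "poor quality, very disappointed"]
labels = [1, 0, 1, 0]

# Bag-of-words features fed into a logistic regression classifier
clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(reviews, labels)

print(clf.predict(["amazing value"]))        # label with the higher probability
print(clf.predict_proba(["amazing value"]))  # [P(y=0|x), P(y=1|x)]
```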

Feature Design
The key question is how to come up with good (useful) features

Two approaches:

  • Use your intuition (insight, linguistic/domain expertise), and design a small set of good features that you think should work
  • Throw in everything you can (the “kitchen sink” approach), and then maybe prune later

You will often want to see which features work and which don’t:

  • Ablation study – turn off some features, retrain the model and see how the performance changes
  • Feature selection – use a method to select the best features. This can also improve the performance (especially in a “kitchen sink” approach)

One of the great advantages of deep learning for NLP is the absence of feature engineering

Text Classification in logistic regression: summary

3.3.1.2 Multinomial Logistic Regression

more than 2 classes
Idea: compute the probability distribution over k classes from the linear combination of (class-specific) weights and input features
For this, we need first to define a generalization of the sigmoid for multiple classes, where the output (i.e., the total probability mass) over all classes must sum up to 1

The Softmax Function
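In the usual notation, for $k$ classes with class-specific weights $\mathbf{w}_c$ and biases $b_c$:

$$
\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}, \qquad
P(y = c \mid x) = \frac{e^{\mathbf{w}_c \cdot \mathbf{x} + b_c}}{\sum_{j=1}^{k} e^{\mathbf{w}_j \cdot \mathbf{x} + b_j}}
$$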

softmax in multinomial logistic regression

features in binary vs multinomial logistic regression

3.3.2 Unsupervised Sentiment Classification

If user ratings are not available, we need manual labelling for supervised machine learning methods -> tedious, expensive, and time-consuming

A typical unsupervised approach to sentiment classification:
Step 1: Extract candidate phrases (e.g., matching predefined POS patterns)
Step 2: For each word/phrase, compute some association score (e.g., pointwise mutual information) with sentiment lexicon entries, on a large corpus
- Association scores (e.g., PMI) with positive seed words
- Association scores (e.g., PMI) with negative seed words
Step 3: The sentiment orientation of each phrase is computed as:
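In the Turney & Littman style of computation, with “excellent” and “poor” as representative positive and negative seed words:

$$
\mathrm{SO}(\text{phrase}) = \mathrm{PMI}(\text{phrase}, \text{“excellent”}) - \mathrm{PMI}(\text{phrase}, \text{“poor”})
$$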

Step 4: The sentiment of the document is determined by summing or averaging the sentiment orientations of phrases it contains
Example

3.4 Sarcasm Detection

Non-transparent expressions of sentiment cause most errors in sentiment analysis and opinion mining: Irony and sarcasm being most salient
Sarcasm is a sharp, bitter, or cutting expression or remark; a bitter gibe or taunt
Sarcasm is notoriously difficult to detect in text, even for humans!

Computational approaches focus mostly on specific types of sarcasm, e.g., sarcasm as a contrast between a negative situation and positive sentiment: “Oh how I love being ignored.”

Bootstrapping rule-based algorithm that automatically learns positive sentiment phrases and negative situation phrases:

  1. Start with (1) a single positive sentiment word (love) and (2) a set of tweets with hashtag #sarcasm or #sarcastic
  2. Negative situation candidates – n-grams (1-3) that directly follow positive sentiment phrases and fulfill pre-defined POS patterns
  3. Positive sentiment candidates – n-grams (1-3) near the negative situation phrases that satisfy POS patterns
  4. Candidates are scored based on ratio of frequencies in sarcastic (with hashtags) vs. non-sarcastic tweets

3.5 Hate Speech Detection

Hate speech (HS) is commonly defined as any communication that disparages a person or a group on the basis of some characteristic such as race, color, ethnicity, gender, sexual orientation, nationality, religion, or other.
Expressions that:
(i) incite discrimination or violence due to racial hatred, xenophobia, sexual orientation and other types of intolerance;
(ii) foster hostility through prejudice and intolerance

One of the major issues is the intrinsic complexity of defining HS and the widespread vagueness in the use of related terms (such as abusive, toxic, dangerous, offensive, or aggressive language), which often overlap and are prone to strongly subjective interpretations.

Lexicons for hate speech / offensive language - HurtLex

Typically addressed as a supervised text classification task, either binary or multi-label.
specific hate speech detection

3.6 Named Entity Recognition

Information extraction (IE) is the automatic identification of selected types of entities, relations, or events in free text
Traditionally, IE tasks are the following:

  • Named entity recognition and classification (NERC)
  • Coreference resolution
  • Relation extraction
  • Event extraction

The following tasks loosely belong to IE: Keywords/keyphrase extraction, Terminology extraction, Collocation extraction

3.6.1 Supervised Named Entity Recognition

A named entity is a real-world object, such as a person, location, organization, product, etc., that can be denoted with a proper name. It can be abstract or have a physical existence.

Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

Named Entity Recognition (NER) is considered a sequence labeling task in natural language processing (NLP). Sequence labeling tasks involve assigning a label to each token (word or subword) in a sequence of text.

We need: a corpus manually annotated with named entities
Annotations are done according to an annotation standard; the most renowned annotation standard is MUC-7 (Chinchor & Robinson, 1997).
MUC-7 named entity types

  • Entity names (ENAMEX) – Person, Organization, Location
  • Temporal expressions (TIMEX) – Date, Time
  • Quantities (NUMEX) – Monetary value, Percentage

Annotation of named entities is not particularly demanding

  • No need to hire experts (e.g., linguists)
  • Virtually any native speaker can annotate (after training)
3.6.1.1 B-I-O annotation scheme

B – Begins a named entity (i.e., first NE token)
I – Inside a named entity (i.e., second and subsequent NE tokens)
O – Outside of a named entity (i.e., token is not part of any NE)
B-I-O annotation scheme
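A short illustration of the scheme on one sentence (the PER/LOC labels are the usual ENAMEX-style types):

```python
tokens = ["Barack", "Obama", "visited", "New",   "York"]
labels = ["B-PER",  "I-PER", "O",       "B-LOC", "I-LOC"]  # one BIO tag per token
```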

3.6.1.2 Supervised approaches to NER

1. Token-level classification

  • Naive Bayes, SVM, Logistic regression, Feed-forward NN
  • Cannot use labels from both token sides as features

2. Sequence labelling

  • Hidden Markov Models (HMM), Conditional Random Fields (CRF) - Require manual feature design
  • Recurrent (or gated convolutional) neural networks
    Word embeddings as input, no feature design
    State-of-the-art results

Common features (for feature-based learning algorithms):

  • Linguistic features: word, lemma, POS-tag, sentence start, capitalization, …
  • Gazetteer features: is gazetteer entry, starts gazetteer entry, inside of a gazetteer entry (for all gazetteers)
3.6.1.3 NER - Document Level

Sequence models predict BIO labels at the sentence level. Thus, it’s possible to have different labels for the same named entity at the document level. Enforcing document-level consistency improves NER performance.

3.6.2 Rule-Based Named Entity Recognition

Large number of extraction patterns / rules. Each pattern detects some type of named entities. Unfortunately, most rules have exceptions… => We can add additional rules to handle exceptions.
E.g. Gazetteers: word lists for each of the NER categories
Problem: Gazetteers are always incomplete
Generally, too many rules, difficult to maintain, etc.

3.7 Evaluation

Comparing system-predicted Named Entities (NEs) with gold-annotated NEs, in terms of precision, recall, and F-score.

  1. Lenient (aka MUC) evaluation
    • System NE and gold NE need to be of the same type and overlap in token spans in order to count as a match (i.e., true positive)
  2. Strict (aka Exact) evaluation
    • System NE and gold NE need to be of the same type and have exactly the same token span in order to count as a match (i.e., true positive)

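With TP, FP, and FN counted over matched and unmatched entities, the standard definitions are:

$$
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2PR}{P + R}
$$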

3.7.1 Scenario 1: Surface string and entity type match

Surface String Match: This criterion assesses whether the text span of the identified NE (surface string) matches that of the gold standard NE. For example, if the system identifies “New York” as a location, it is considered a surface string match if the gold standard annotation also identifies “New York” as a location.

Entity Type Match: This criterion evaluates whether the type or category assigned to the identified NE by the system matches the type assigned in the gold standard annotation. Each NE is typically categorized into predefined types such as person names, etc. A match occurs when the system assigns the same type to an NE as the gold standard annotation. For instance, if the system identifies “New York” as a location and the gold standard annotation also labels it as a location, it is considered an entity type match.
Surface string and entity type match

3.7.2 Scenario 2: System hypothesized an entity

System hypothesized an entity

3.7.3 Scenario 3: System misses an entity

System misses an entity

Note that, considering only these 3 scenarios and discarding every other possible scenario, we have a simple classification evaluation that can be measured in terms of false negatives, true positives, and false positives, from which precision, recall, and F1-score can be computed for each named-entity type. But of course we are discarding partial matches, and other scenarios where the NER system gets the named-entity surface string correct but the type wrong; we might also want to evaluate these scenarios at the full-entity level.

3.7.4 Scenario 4: System assigns the wrong entity type

System assigns the wrong entity type

3.7.5 Scenario 5: System gets the boundaries of the surface string wrong

System gets the boundaries of the surface string wrong

3.7.6 Scenario 6: System gets the boundaries and entity type wrong

System gets the boundaries and entity type wrong

NER Tasks

  • Precision is the percentage of Named Entities found by the learning system that are correct.
  • Recall is the percentage of Named Entities present in the corpus that are found by the system.
  • A named entity is correct only if it is an exact match of the corresponding entity in the data file.

So, basically, this evaluation only considers scenarios 1-3; the other described scenarios are not taken into account.

3.8 RNNs

Recurrent Neural Networks: a network that contains a cycle within its network connections, meaning that the value of some unit is directly, or indirectly, dependent on its own earlier outputs as an input. It explicitly takes the sequential nature of the input into account.

General RNN model

3.8.1 Elman RNN

3.8.1.1 Overview

The goal is to learn a representation of a sequence by maintaining a hidden state vector that acts as a form of memory (or context), encoding the sequence seen so far
The hidden layer includes a recurrent connection as part of its input
The hidden state vector is computed from both a current input vector and the previous hidden state vector.

Input vector from the current time step and the hidden state vector from the previous time step are mapped to the hidden state vector of the current time step.

Hidden-to-hidden and input-to-hidden weights are shared across the different time steps
Weights are adjusted so that the RNN is learning how to incorporate incoming information and maintain a state representation summarizing the input seen so far.
RNN does not have any way of knowing which time step it is on: RNN is “only” learning how to transition from one time step to another and maintain a state representation that will minimize its loss.

Elman (1990) or “Simple” RNN

Unrolling the simple RNN

Forward inference (mapping a sequence of inputs to a sequence of outputs) requires an inference algorithm that proceeds from the start of the sequence to the end. The matrices U, V and W are shared across time, while new values for h and y are calculated with each time step.
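Written out in this notation (with U as the hidden-to-hidden, W as the input-to-hidden, and V as the hidden-to-output weights; biases omitted), one step of forward inference is:

$$
\mathbf{h}_t = g(\mathbf{U}\mathbf{h}_{t-1} + \mathbf{W}\mathbf{x}_t), \qquad
\mathbf{y}_t = \mathrm{softmax}(\mathbf{V}\mathbf{h}_t)
$$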

3.8.1.2 Problems

The Problem with Vanilla RNNs (or Elman/Simple RNNs)
(1) The inability to retain information for long-range predictions:

  • at each time step we simply update the hidden state vector, regardless of whether it makes sense
  • no control over which values are retained and which are discarded in the hidden state - entirely determined by the input; no way to decide if the update is optional or not

(2) Vanishing & exploding gradients

3.8.1.3 Vanishing & exploding gradients

(figures: vanishing gradient intuition)

Simple (Elman) architecture suffers from a problem known as vanishing gradients. Error signals from later steps in the sequence diminish quickly in the backpropagation algorithm. Thus, the updates for early inputs that come from errors in later steps are very small.

Solution: Gated architectures

  • Do not update the whole state at every step
  • Gate vectors define which parts of the new state are taken from the previous state and which from the current input
  • Ex.: Long short-term memory (LSTM), Gated Recurrent Unit (GRU)
3.8.1.4 LSTM

Gated architectures
gate mechanism

Example

Memory cell: the internal state serves as a memory.
Gates: when to reset the memory? When to let the input in? When to let the output out?
LSTM

Gates: common design pattern
All gates consist of a feed-forward layer, a sigmoid activation function, and a pointwise multiplication with the layer being gated

Sigmoid as the activation function pushes its outputs to either 0 or 1.

Combined with a pointwise multiplication, it acts as a sort of binary mask:

  • Values in the layer being gated that align with values near 1 in the mask are passed through nearly unchanged
  • Values that correspond to values near 0 in the mask are essentially erased

Forget Gate
forget gate

Input/add Gate
Input Gate

Output Gate
Output Gate

Together
together
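Putting the gates together, the standard LSTM update (bias terms omitted, $\odot$ denoting pointwise multiplication) is:

$$
\begin{aligned}
\mathbf{f}_t &= \sigma(\mathbf{U}_f\mathbf{h}_{t-1} + \mathbf{W}_f\mathbf{x}_t) &&\text{forget gate}\\
\mathbf{i}_t &= \sigma(\mathbf{U}_i\mathbf{h}_{t-1} + \mathbf{W}_i\mathbf{x}_t) &&\text{input/add gate}\\
\mathbf{o}_t &= \sigma(\mathbf{U}_o\mathbf{h}_{t-1} + \mathbf{W}_o\mathbf{x}_t) &&\text{output gate}\\
\mathbf{g}_t &= \tanh(\mathbf{U}_g\mathbf{h}_{t-1} + \mathbf{W}_g\mathbf{x}_t) &&\text{candidate update}\\
\mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \mathbf{g}_t &&\text{memory cell}\\
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t) &&\text{hidden state}
\end{aligned}
$$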

3.8.2 An RNN Language Model

A RNN Language Model

RNN Advantages:

  • Can process any length input
  • Computation for step t can (in theory) use information from many steps back
  • Model size doesn’t increase for longer input
  • Same weights applied on every timestep, so there is symmetry in how inputs are processed

RNN Disadvantages:

  • Recurrent computation is slow
  • In practice, difficult to access information from many steps back

3.8.3 Generating with an RNN LM

Also known as autoregressive generation or causal LM generation

Step 1: Sample a word in the output from the softmax distribution that results from using the beginning of sentence marker, <s>, as the first input

Step 2: Use the word embedding for that first word as the input to the network at the next time step, and then sample the next word in the same fashion

Step 3: Continue generating until the end of sentence marker, </s>, is sampled or a fixed length limit is reached
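A minimal sketch of this autoregressive loop, assuming a hypothetical `model` object with `initial_hidden()`, `embed()`, and `step()` methods (the names are illustrative, not a real library API):

```python
import numpy as np

def generate(model, vocab, max_len=50):
    """Autoregressive generation: each sampled word becomes the next input.

    vocab: list of words aligned with the model's softmax output."""
    word, hidden = "<s>", model.initial_hidden()     # start from the <s> marker
    output = []
    for _ in range(max_len):
        x = model.embed(word)                        # embedding of the previous word
        hidden, probs = model.step(x, hidden)        # new state + softmax over vocabulary
        word = np.random.choice(vocab, p=probs)      # sample the next word
        if word == "</s>":                           # stop at the end-of-sentence marker
            break
        output.append(word)
    return " ".join(output)
```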

Using a language model as generator through sampling
Example: unigram case
unigram case

Generating with a RNN LM

3.8.4 Common RNN architectures used in NLP

Common RNN architectures used in NLP

(1) Language Modeling
Language Modeling

(2) Sequence Labeling
Sequence Labeling

(3) Sequence Classification
Sequence Classification

3.8.5 Bidirectional RNNs

(figures: bidirectional RNN architecture)
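Typically, the forward and backward hidden states at each time step are concatenated into a single representation:

$$
\mathbf{h}_t = \left[\overrightarrow{\mathbf{h}}_t \,;\, \overleftarrow{\mathbf{h}}_t\right]
$$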

3.8.6 Multi-layer RNNs

Multi-layer RNNs
