501 -- Introduction to Unsupervised Methods

◼ Purpose: To discover unknown relationships

◼ To discover patterns in the data that perhaps you hadn’t previously suspected

◼ Ways of finding relationships and patterns that can be used to build predictive models

Cluster analysis

• Find groups with similar characteristics. Natural grouping

• K-means and hierarchical clustering

Association rule mining

• Find elements or properties that tend to occur together

Why Unsupervised Learning?

  • Unsupervised machine learning finds all kinds of unknown patterns in data.
  • Unsupervised methods help you to find features which can be useful for categorization.
  • It can take place in real time, so the input data can be analyzed and labeled in the presence of learners.
  • It is easier to get unlabeled data from a computer than labeled data, which needs manual intervention.

◼ The goal is to group the observations in your data into clusters such that every data point in a cluster is more similar to the other points in its own cluster than it is to points in other clusters

◼ The method of identifying similar groups of data in a dataset

Example: A company that offers guided tours might want to cluster its clients by behavior and tastes: which countries they like to visit, and whether they prefer adventure tours, luxury tours, or educational tours.

Clustering Real Life Examples

Customer segmentation - understanding different customer groups around which to build marketing or other business strategies; differentiating groups of customers based on shared attributes

Biology - genetics, for example clustering DNA patterns to analyse evolutionary relationships.

Medical imaging - for distinguishing between different kinds of tissues

Recommender systems, which involve grouping together users with similar viewing patterns in order to recommend similar content - giving you better Amazon purchase suggestions or Netflix movie matches.

Anomaly detection, including fraud detection or detecting defective mechanical parts (i.e., predictive maintenance).

Methods of Cluster Analysis

Hierarchical clustering

• finds nested groups of clusters

• example: plant taxonomy

k-means

• a quick and popular way of finding clusters in quantitative data.

How to cluster: 

Euclidean Distance 

edist <- function(x, y) sqrt(sum((x - y)^2))  # sqrt((x[1]-y[1])^2 + (x[2]-y[2])^2 + ...)

In two dimensions, the distance between points (x, y) and (a, b) is sqrt((x - a)^2 + (y - b)^2)

The Euclidean distance between two points is the length of the straight-line path connecting them.

Used for real-valued (numeric) data

 Hamming Distance

hdist <- function(x, y) sum(x != y)  # (x[1] != y[1]) + (x[2] != y[2]) + ...

• A metric for comparing two binary data strings.

• When comparing two binary strings of equal length, the Hamming distance is the number of bit positions in which the two strings differ.

• Mostly used for categorical data (male/female or small/medium/large)

For example, if a column had the categories ‘red,’ ‘green,’ and ‘blue,’ you might one-hot encode each example as a bit string with one bit per category: red = [1, 0, 0], green = [0, 1, 0], blue = [0, 0, 1]

The Hamming distance is named after Richard Wesley Hamming. In information theory, the Hamming distance between two equal-length strings is the number of positions at which the corresponding characters differ. In other words, it is the number of substitutions required to change one string into the other. For example:

The Hamming distance between 1011101 and 1001001 is 2.

The Hamming distance between 2143896 and 2233796 is 3.

The Hamming distance between "toned" and "roses" is 3.
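As a quick check of these examples, here is a minimal R sketch of a character-level Hamming distance (the helper name hdist_str is ours, not from the slides):

# Hamming distance between two equal-length strings, compared character by character
hdist_str <- function(a, b) sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])

hdist_str("1011101", "1001001")  # 2
hdist_str("2143896", "2233796")  # 3
hdist_str("toned",   "roses")    # 3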

Manhattan Distance

mdist <- function(x, y) sum(abs(x - y))  # abs(x[1]-y[1]) + abs(x[2]-y[2]) + ...

Put simply, it is the total of the absolute differences between the corresponding coordinates of the two points.

Cosine Similarity

◼ Finds the normalized dot product of the two vectors.

◼ By determining the cosine similarity, we would effectively try to find the cosine of the angle between the two objects.
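A minimal R sketch of this calculation (the function name cos_sim is ours):

# Cosine similarity: dot product divided by the product of the vector lengths
cos_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

cos_sim(c(1, 2, 3), c(2, 4, 6))  # 1: vectors point in the same direction
cos_sim(c(1, 0), c(0, 1))        # 0: orthogonal vectors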

 

Hierarchical Clustering hclust()

◼ Problem: To assess whether a given cluster is “real”— does the cluster represent actual structure in the data, or is it an artifact of the clustering algorithm? Clusters of “other” tend to be made up of data points that have no real relationship to each other; they just don’t fit anywhere else. 

◼ Purpose: To evaluate the stability of a given cluster, i.e., to assess whether it represents true structure in the data

◼ Rule of thumb:

◼ Clusters with a stability value less than 0.6 should be considered unstable.

◼ Values between 0.6 and 0.75 indicate that the cluster is measuring a pattern in the data, but there isn’t high certainty about which points should be clustered together.

◼ Clusters with stability values above about 0.85 can be considered highly stable (they’re likely to be real clusters)
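The stability values quoted above correspond to the bootstrap stability measure reported by clusterboot() in the fpc package. A minimal sketch, assuming a numeric data frame df (hypothetical) and five clusters:

library(fpc)

# Hierarchical clustering with Ward's method on scaled numeric data
d <- dist(scale(df), method = "euclidean")
fit <- hclust(d, method = "ward.D2")
groups <- cutree(fit, k = 5)   # cut the dendrogram into 5 clusters

# Bootstrap resampling to assess how stable each cluster is
cb <- clusterboot(scale(df), B = 100,
                  clustermethod = hclustCBI,
                  method = "ward.D2", k = 5)
cb$bootmean   # per-cluster stability, read against the 0.6 / 0.75 / 0.85 thresholds above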

Picking the Number of Clusters – find k 

◼ Solution: Compute the clustering algorithm (e.g., k-means clustering) for several different values of k.

◼ For instance, vary k from 1 to 10 clusters

◼ For each k, calculate the total within-cluster sum of squares (WSS), then plot WSS against k and look for an “elbow” in the curve.

◼ Define the cluster’s centroid as the point that is the mean value of all the points in the cluster.

• The total WSS will decrease as the number of clusters increases, because each cluster will be smaller and tighter.

• The decrease in WSS will slow down for k beyond the optimal number of clusters.

• The graph will flatten out beyond the optimal k, so the optimal k will be at the “elbow” of the graph.
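A minimal sketch of the elbow heuristic in R, again assuming a numeric data frame df (hypothetical):

# Total within-cluster sum of squares (WSS) for k = 1..10
set.seed(123)
wss <- sapply(1:10, function(k) kmeans(scale(df), centers = k, nstart = 25)$tot.withinss)

plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
# The "elbow" of this curve suggests the optimal k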

Picking the Number of Clusters - 2

Calinski-Harabasz index – estimating the number of clusters, based on an observations/variables-matrix

  • Total sum of squares (TSS) = the total squared distance of all the data points from the dataset’s grand centroid
  • WSS(k) = the total WSS of a clustering with k clusters
  • Between sum of squares BSS(k) = TSS - WSS(k)
  • Within-cluster variance W = WSS(k)/(n-k), where n is the number of data points
  • Between-cluster variance B = BSS(k)/(k-1)
  • A good clustering has a small WSS(k) and a large BSS(k), so the index CH(k) = B/W = (BSS(k)/(k-1)) / (WSS(k)/(n-k)) should be large: pick the k that maximizes CH(k).
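A minimal sketch computing CH(k) from these definitions with k-means, assuming a numeric data frame df (hypothetical):

# Calinski-Harabasz index for a k-means clustering with k clusters
ch_index <- function(data, k) {
  km <- kmeans(data, centers = k, nstart = 25)
  n <- nrow(data)
  tss <- sum(scale(data, scale = FALSE)^2)  # squared distances from the grand centroid
  wss <- km$tot.withinss                    # WSS(k)
  bss <- tss - wss                          # BSS(k)
  (bss / (k - 1)) / (wss / (n - k))         # between-cluster variance / within-cluster variance
}

set.seed(123)
ch <- sapply(2:10, function(k) ch_index(scale(df), k))
plot(2:10, ch, type = "b", xlab = "k", ylab = "Calinski-Harabasz index")
# The k that maximizes the CH index is the suggested number of clusters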

Optimal k: WSS vs CH

◼ You see that the CH criterion is maximized at k=2, with another local maximum at k=5.

◼ If you squint your eyes, the WSS plot has an elbow at k=2.

K-means

◼ K-means is a centroid-based clustering algorithm in which we calculate distances in order to assign each point to the cluster with the nearest centroid

◼ Fairly ad hoc and has the major disadvantage that you must pick k in advance.

◼ Plus side: easy to implement (one reason it’s so popular) and can be faster than hierarchical clustering on large datasets.

◼ Uses only numerical attributes
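A minimal k-means sketch in R, using the built-in iris measurements as a stand-in dataset:

set.seed(123)
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)

km$centers                       # cluster centroids (in scaled units)
table(km$cluster, iris$Species)  # compare the clusters with the known species labels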

Clustering takeaways

◼ The goal of clustering is to discover or draw out similarities among subsets of your data.

◼ In a good clustering, points in the same cluster should be more similar (nearer) to each other than they are to points in other clusters.

◼ Like visualization, it’s more iterative and interactive, and less automated than supervised methods.

Different clustering algorithms will give different results. You should consider different approaches, with different numbers of clusters.

There are many heuristics for estimating the best number of clusters. Again, you should consider the results from different heuristics and explore various numbers of clusters.

Association Rules

Market Basket Analysis-Association Rules

◼ Purpose: Used to find objects which occur together. The unit of togetherness when mining association rules is called a transaction

Input: the simple point-of-sale transaction data

Output: Most frequent affinities among items

◼ Example: according to the transaction data…

“Customers who bought a laptop computer and virus protection software also bought an extended service plan 70 percent of the time.”

◼ How do you use such a pattern/knowledge?

    ◼ Put the items next to each other

    ◼ Promote the items as a package

    ◼ Place items far apart from each other!

◼ When you want to determine which products tend to be purchased together

   ◼ Rules:

  1. Look for all the itemsets (subsets of items appearing together in transactions) that occur in at least a minimum fraction of the transactions.

  2. Turn those itemsets into rules.

S: Support: how frequently an itemset appears in all the transactions, i.e., how often X and Y occur together: support(X => Y) = P(X and Y)

C: Confidence: how often Y occurs in the transactions that contain X, i.e., the conditional probability of Y given X: confidence(X => Y) = support(X and Y) / support(X)

L: Lift: how much more often X and Y occur together than expected if they were independent: lift(X => Y) = confidence(X => Y) / support(Y); a lift well above 1 suggests a genuine association

 Association Rules in R

Package: arules for association rule mining.

Apriori is an algorithm for discovering itemsets (groups of items) that occur frequently in a transaction database
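A minimal sketch with arules, using the Groceries transaction dataset that ships with the package:

library(arules)

data("Groceries")   # point-of-sale transaction data bundled with arules

# Mine rules with minimum support of 1% and minimum confidence of 50%
rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5))

# Show the five rules with the highest lift
inspect(head(sort(rules, by = "lift"), 5))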

Association rule takeaways

◼ The goal of association rule mining is to find relationships in the data: items or attributes that tend to occur together

◼ A good rule “if X, then Y” should occur more often than you’d expect to observe by chance. You can use lift or Fisher’s exact test to check if this is true.

◼ When a large number of different possible items can be in a basket (in our example, thousands of different books), most events will be rare (have low support).

◼ Association rule mining is often interactive, as there can be many rules to sort and sift through

Topic Modelling

◼ Definition: A form of unsupervised learning that can be applied to unstructured data. In the case of text documents, it identifies words or phrases that have a similar meaning and groups them into ‘topics’ using statistical techniques.

◼ Automatically extract topics from text documents

◼ It discovers a repeating group of statistically significant tokens or words in a corpus. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body.

◼ Articles can then be labelled with topic tags (e.g. politics, economy, sports, ...)

◼ It works by counting words and grouping similar word patterns to infer topics within unstructured data

Topic Modelling Algorithms

  • Latent Semantic Analysis (LSA)
  • Latent Dirichlet Allocation (LDA)
  • Probabilistic Latent Semantic Analysis (PLSA)
  • Correlated Topic Model (CTM)
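A minimal LDA sketch in R using the topicmodels package (not named in the slides) and its bundled AssociatedPress document-term matrix:

library(topicmodels)

data("AssociatedPress")   # a document-term matrix bundled with topicmodels

# Fit an LDA model with 4 topics; k is chosen by the analyst
lda <- LDA(AssociatedPress, k = 4, control = list(seed = 123))

terms(lda, 5)       # the 5 most probable words in each topic
topics(lda)[1:10]   # the most likely topic for the first 10 documents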

Model Evaluation

Coherence score

◼ The coherence score of an LDA model measures the degree of semantic similarity between words in each topic. All else equal, a higher coherence score is better, as it indicates a higher degree of likeness in the meaning of the words within each topic.

Interpret the Topic

When applied to natural language, topic modeling requires interpretation of the identified topics —  this is where judgment plays a role. The goal is to ensure that the topics and their allocations make sense for the context and purpose of the modeling exercise.

 

 Sentiment Analysis

◼ Sentiment → belief, view, opinion, conviction

◼ Sentiment analysis → opinion mining, subjectivity analysis, and appraisal extraction

◼ Performed by analyzing data about the opinions of many people, using a variety of automated tools.

Explicit versus Implicit sentiment

◼ Sentiment polarity

Positive versus Negative,

… versus Neutral?

Explicit - the text directly expresses an opinion, e.g. “It’s a good day”

Implicit - the text implies an opinion, e.g. “The handle breaks too easily”

R Packages for Sentiment Analysis 

◼ tidytext - bing, afinn and nrc - https://cran.r-project.org/web/packages/tidytext/tidytext.pdf

◼ syuzhet - syuzhet, bing, afinn and nrc - https://cran.r-project.org/web/packages/syuzhet/vignettes/syuzhet-vignette.html

◼ SentimentAnalysis - CRAN https://www.rdocumentation.org/packages/SentimentAnalysis/versions/1.3-4

◼ sentimentr - https://github.com/trinker/sentimentr

◼ RSentiment - https://cran.r-project.org/web/packages/RSentiment/RSentiment.pdf
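A minimal sketch with the syuzhet package, scoring the two example sentences from above with its bundled lexicons:

library(syuzhet)

sentences <- c("It's a good day",
               "The handle breaks too easily")

# Lexicon-based polarity scores: positive values indicate positive sentiment
get_sentiment(sentences, method = "bing")
get_sentiment(sentences, method = "afinn")

# Emotion and polarity categories from the NRC lexicon
get_nrc_sentiment(sentences)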

 Why is sentiment analysis so important?

◼ Businesses today are heavily dependent on data. The majority of this data, however, is unstructured text coming from sources such as emails, chats, social media, surveys, articles, and documents.

◼ The micro-blogging content coming from Twitter and Facebook poses serious challenges, not only because of the amount of data involved, but also because of the kind of language used in it to express sentiments: short forms, memes, and emoticons.

◼ Sifting through huge volumes of this text data is difficult as well as time-consuming. It also requires a great deal of expertise and resources to analyze it all.

◼ Sentiment Analysis enables companies to make sense out of data by being able to automate this entire process!

 

 
