Introduction to Latent Dirichlet Allocation

By Edwin Chen
http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/

Introduction

Suppose you have the following set of sentences:

  • I like to eat broccoli and bananas.
  • I ate a banana and spinach smoothie for breakfast.
  • Chinchillas and kittens are cute.
  • My sister adopted a kitten yesterday.
  • Look at this cute hamster munching on a piece of broccoli.

What is LDA? It is a method for automatically discovering the topics that these sentences contain. For example, given these sentences and asked for 2 topics, LDA might produce something like:

  • Sentences 1 and 2: 100% Topic A
  • Sentences 3 and 4: 100% Topic B
  • Sentence 5: 60% Topic A, 40% Topic B
  • Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret Topic A to be about food)
  • Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret Topic B to be about cute animals)
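
In practice, output of this kind is what you would get from an off-the-shelf LDA implementation. As a rough illustration (not part of the original post), here is a minimal sketch using the gensim library; the naive tokenization and the `passes` value are just illustrative, and on such a tiny corpus the learned topics may not split as cleanly as the idealized result above:

```python
# A rough sketch (not from the original post) of running LDA on the five
# example sentences with gensim. Tokenization is naive and stop words are
# kept, so the learned topics will be noisier than the idealized output above.
from gensim import corpora, models

texts = [
    "i like to eat broccoli and bananas".split(),
    "i ate a banana and spinach smoothie for breakfast".split(),
    "chinchillas and kittens are cute".split(),
    "my sister adopted a kitten yesterday".split(),
    "look at this cute hamster munching on a piece of broccoli".split(),
]

dictionary = corpora.Dictionary(texts)             # word <-> integer id mapping
corpus = [dictionary.doc2bow(t) for t in texts]    # bag-of-words counts per sentence
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=50)

for i, bow in enumerate(corpus):
    print("sentence", i + 1, lda.get_document_topics(bow))  # topic mixture per sentence
print(lda.print_topics(num_words=5))               # top words per topic
```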

The question, of course, is: how does LDA perform this discovery?

LDA Model

In more detail, LDA represents documents as mixtures of topics, where each topic is a probability distribution over words. It assumes that documents are produced in the following fashion: when writing each document, you

  • Decide on the number of words N the document will have (say, according to a Poisson distribution).
  • Choose a topic mixture for the document (according to a Dirichlet distribution over a fixed set of K topics). For example, given the food and cute animals topics above, you might choose the document to consist of 1/3 food and 2/3 cute animals.
  • Generate each word w_i in the document by:
    • First picking a topic (according to the multinomial distribution you sampled above; for example, you might pick the food topic with 1/3 probability and the cute animals topic with 2/3 probability).
    • Then using the topic to generate the word itself (according to the topic's multinomial distribution; for example, if we picked the food topic, we might generate the word "broccoli" with 30% probability, "bananas" with 15% probability, and so on).

Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection.
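
Written in the standard notation (which the post itself does not use), this generative story is usually summarized as follows; here alpha and beta are the Dirichlet hyperparameters, which the text mentions only implicitly:

```latex
% Standard LDA generative process, for K topics and documents d = 1..D
N_d \sim \operatorname{Poisson}(\xi)                      % number of words in document d
\phi_t \sim \operatorname{Dirichlet}(\beta)               % word distribution of topic t
\theta_d \sim \operatorname{Dirichlet}(\alpha)            % topic mixture of document d
z_{d,i} \sim \operatorname{Multinomial}(\theta_d)         % topic of the i-th word in document d
w_{d,i} \sim \operatorname{Multinomial}(\phi_{z_{d,i}})   % the word itself, drawn from that topic
```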

Example

Let’s work through an example. According to the above process, when generating some particular document D, you might

  • Pick 5 to be the number of words in D.
  • Decide that D will be 1/2 about food and 1/2 about cute animals.
  • Pick the first word to come from the food topic, which then gives you the word “broccoli”.
  • Pick the second word to come from the cute animals topic, which gives you “panda”.
  • Pick the third word to come from the cute animals topic, giving you “adorable”.
  • Pick the fourth word to come from the food topic, giving you “cherries”.
  • Pick the fifth word to come from the food topic, giving you “eating”.

So the document generated under the LDA model will be “broccoli panda adorable cherries eating” (note that LDA is a bag-of-words model).
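
To make the mechanics concrete, here is a minimal sketch (not from the original post) of that generative process in Python; the vocabulary, topic-word probabilities, and mean document length are all made up for illustration, and each run will of course produce a different small document than the one above:

```python
# A minimal sketch of the LDA generative process described above.
# All probabilities below are illustrative, not learned from data.
import numpy as np

rng = np.random.default_rng()

vocab = ["broccoli", "bananas", "breakfast", "munching", "eating", "cherries",
         "chinchillas", "kittens", "cute", "hamster", "panda", "adorable"]
topics = np.array([
    [.30, .15, .10, .10, .15, .20, 0, 0, 0, 0, 0, 0],   # "food" topic
    [0, 0, 0, 0, 0, 0, .20, .20, .20, .15, .15, .10],   # "cute animals" topic
])

def generate_document(mean_length=5, alpha=1.0):
    N = rng.poisson(mean_length)                          # 1. choose the number of words
    theta = rng.dirichlet(alpha * np.ones(len(topics)))   # 2. choose a topic mixture
    words = []
    for _ in range(N):                                    # 3. generate each word:
        t = rng.choice(len(topics), p=theta)              #    pick a topic ...
        w = rng.choice(len(vocab), p=topics[t])           #    ... then a word from that topic
        words.append(vocab[w])
    return theta, words

print(generate_document())
```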

Learning

So now suppose you have a set of documents. You’ve chosen some fixed number of K topics to discover, and want to use LDA to learn the topic representation of each document and the words associated to each topic. How do you do this? One way (known as collapsed Gibbs sampling) is the following:

  • Go through each document, and randomly assign each word in the document to one of the K topics.
  • Notice that this random assignment already gives you both topic representations of all the documents and word distributions of all the topics (albeit not very good ones).
  • So to improve on them, for each document d…
    • Go through each word w in d…
      • And for each topic t, compute two things: 1) p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t, and 2) p(word w | topic t) = the proportion of assignments to topic t over all documents that come from this word w. Reassign w a new topic, where we choose topic t with probability p(topic t | document d) * p(word w | topic t) (according to our generative model, this is essentially the probability that topic t generated word w, so it makes sense that we resample the current word’s topic with this probability). (Also, I’m glossing over a couple of things here, in particular the use of priors/pseudocounts in these probabilities; see the sketch after this list.)
      • In other words, in this step, we’re assuming that all topic assignments except for the current word in question are correct, and then updating the assignment of the current word using our model of how documents are generated.
  • After repeating the previous step a large number of times, you’ll eventually reach a roughly steady state where your assignments are pretty good. So use these assignments to estimate the topic mixtures of each document (by counting the proportion of words assigned to each topic within that document) and the words associated to each topic (by counting the proportion of words assigned to each topic overall).
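
Below is a minimal sketch (not from the original post) of this collapsed Gibbs sampler in Python. It fills in the priors/pseudocounts glossed over above with the usual symmetric Dirichlet hyperparameters alpha and beta (the values here are illustrative), so the resampling probability for a word becomes proportional to (n_dt + alpha) * (n_tw + beta) / (n_t + V * beta):

```python
# A minimal collapsed Gibbs sampler sketch for LDA. Documents are lists of
# integer word ids in [0, V); alpha and beta are illustrative pseudocount priors.
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=500, seed=0):
    rng = np.random.default_rng(seed)
    ndt = np.zeros((len(docs), K))   # words in document d assigned to topic t
    ntw = np.zeros((K, V))           # word w assigned to topic t, over all documents
    nt = np.zeros(K)                 # total words assigned to topic t
    # 1. Randomly assign each word in each document to one of the K topics.
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndt[d, t] += 1; ntw[t, w] += 1; nt[t] += 1
    # 2. Repeatedly resample each word's topic, holding all other assignments fixed.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                ndt[d, t] -= 1; ntw[t, w] -= 1; nt[t] -= 1   # remove current assignment
                # p(topic t | doc d) * p(word w | topic t), with pseudocounts:
                # (n_dt + alpha) * (n_tw + beta) / (n_t + V * beta)
                p = (ndt[d] + alpha) * (ntw[:, w] + beta) / (nt + V * beta)
                t = rng.choice(K, p=p / p.sum())
                z[d][i] = t
                ndt[d, t] += 1; ntw[t, w] += 1; nt[t] += 1
    # 3. Estimate topic mixtures per document and word distributions per topic.
    theta = (ndt + alpha) / (ndt + alpha).sum(axis=1, keepdims=True)
    phi = (ntw + beta) / (ntw + beta).sum(axis=1, keepdims=True)
    return theta, phi
```

Here theta[d] is the estimated topic mixture of document d and phi[t] the word distribution of topic t, computed from the final assignments exactly as in the last step above.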

Layman’s Explanation

In case the discussion above was a little eye-glazing, here’s another way to look at LDA in a different domain.

Suppose you’ve just moved to a new city. You’re a hipster and an anime fan, so you want to know where the other hipsters and anime geeks tend to hang out. Of course, as a hipster, you know you can’t just ask, so what do you do?

Here’s the scenario: you scope out a bunch of different establishments (documents) across town, making note of the people (words) hanging out in each of them (e.g., Alice hangs out at the mall and at the park, Bob hangs out at the movie theater and the park, and so on). Crucially, you don’t know the typical interest groups (topics) of each establishment, nor do you know the different interests of each person.

So you pick some number K of categories to learn (i.e., you want to learn the K most important kinds of categories people fall into), and start by making a guess as to why you see people where you do. For example, you initially guess that Alice is at the mall because people with interests in X like to hang out there; when you see her at the park, you guess it’s because her friends with interests in Y like to hang out there; when you see Bob at the movie theater, you randomly guess it’s because the Z people in this city really like to watch movies; and so on.

Of course, your random guesses are very likely to be incorrect (they’re random guesses, after all!), so you want to improve on them. One way of doing so is to:

  • Pick a place and a person (e.g., Alice at the mall).
  • Why is Alice likely to be at the mall? Probably because other people at the mall with the same interests sent her a message telling her to come.
  • In other words, the more people with interests in X there are at the mall and the stronger Alice is associated with interest X (at all the other places she goes to), the more likely it is that Alice is at the mall because of interest X.
  • So make a new guess as to why Alice is at the mall, choosing an interest with some probability according to how likely you think it is.

Go through each place and person over and over again. Your guesses keep getting better and better (after all, if you notice that lots of geeks hang out at the bookstore, and you suspect that Alice is pretty geeky herself, then it’s a good bet that Alice is at the bookstore because her geek friends told her to go there; and now that you have a better idea of why Alice is probably at the bookstore, you can use this knowledge in turn to improve your guesses as to why everyone else is where they are), and eventually you can stop updating. Then take a snapshot (or multiple snapshots) of your guesses, and use it to get all the information you want:

  • For each category, you can count the people assigned to that category to figure out what people have this particular interest. By looking at the people themselves, you can interpret the category as well (e.g., if category X contains lots of tall people wearing jerseys and carrying around basketballs, you might interpret X as the “basketball players” group).
  • For each place P and interest category C, you can compute the proportions of people at P because of C (under the current set of assignments), and these give you a representation of P. For example, you might learn that the people who hang out at Barnes & Noble consist of 10% hipsters, 50% anime fans, 10% jocks, and 30% college students.

Real-World Example

Finally, I applied LDA to a set of Sarah Palin’s emails a little while ago (see here for the blog post, or here for an app that allows you to browse through the emails by the LDA-learned categories), so let’s give a brief recap. Here are some of the topics that the algorithm learned:

Trig/Family/Inspiration: family, web, mail, god, son, from, congratulations, children, life, child, down, trig, baby, birth, love, you, syndrome, very, special, bless, old, husband, years, thank, best, …

Wildlife/BP Corrosion: game, fish, moose, wildlife, hunting, bears, polar, bear, subsistence, management, area, board, hunt, wolves, control, department, year, use, wolf, habitat, hunters, caribou, program, denby, fishing, …

Energy/Fuel/Oil/Mining: energy, fuel, costs, oil, alaskans, prices, cost, nome, now, high, being, home, public, power, mine, crisis, price, resource, need, community, fairbanks, rebate, use, mining, villages, …

Gas: gas, oil, pipeline, agia, project, natural, north, producers, companies, tax, company, energy, development, slope, production, resources, line, gasline, transcanada, said, billion, plan, administration, million, industry, …

Education/Waste: school, waste, education, students, schools, million, read, email, market, policy, student, year, high, news, states, program, first, report, business, management, bulletin, information, reports, 2008, quarter, …

Presidential Campaign/Elections: mail, web, from, thank, you, box, mccain, sarah, very, good, great, john, hope, president, sincerely, wasilla, work, keep, make, add, family, republican, support, doing, p.o, …

Here’s an example of an email which fell 99% into the Trig/Family/Inspiration category (particularly representative words are highlighted in blue):
