Munich Datageeks e.V.
DaDaDa 2016 - Bayesian Semantics: Verb Sense Induction
Daniel W. Peterson on stage for this talk


Felix Reuthlinger

Unsupervised verb sense induction using Bayesian nonparametric clustering (the Pólya Urn model) with Gibbs sampling. Applied to verb instances from a billion web pages, it automatically discovered 437 senses of "enter" from syntactic contexts—no labeled data required.

Abstract

This talk presents research on verb sense induction—the unsupervised discovery of distinct verb meanings from unlabeled text corpora using Bayesian nonparametric clustering methods. The approach employs the Pólya Urn model (mathematically equivalent to the Chinese Restaurant Process) to automatically cluster verb instances based on their syntactic contexts without requiring labeled training data or predetermined sense inventories. The model treats each sentence containing a verb as a data point characterized by slot-word pairs (subject-verb-object dependencies) and uses Gibbs sampling to iteratively assign instances to sense clusters. Each cluster represents a distinct verb sense characterized by a multinomial distribution over slot-word pairs, drawn from a Dirichlet prior. The nonparametric nature of the model allows it to discover an arbitrary number of senses while maintaining a conservative "rich-get-richer" bias against excessive cluster proliferation. Applied to verb instances extracted from a billion web pages, the system discovered 437 distinct senses of the verb "enter," including clearly differentiated meanings such as entering agreements/contracts, joining organizations, and physical entry through doors or into rooms. The results demonstrate both the power of unsupervised semantic learning and its limitations—the model produces fine-grained over-clustering (likely more senses than humans would distinguish) and some noise inherent to MCMC sampling methods. The work represents a simplified variant of Latent Dirichlet Allocation (LDA) adapted for word sense disambiguation, showing how topic modeling machinery applies to semantic clustering when slot-word pairs serve as vocabulary and senses function as topics.

About the speaker

Daniel W. Peterson is a PhD researcher specializing in natural language processing and computational semantics, with particular focus on verb sense disambiguation and unsupervised machine learning methods. His doctoral work explores the application of Bayesian nonparametric models to semantic problems in NLP, combining theoretical foundations in probabilistic modeling with practical applications to large-scale web text analysis. While emphasizing that the research may not have direct business applications, he presents the work as an educational opportunity to learn about fundamental techniques in unsupervised semantic clustering that underpin various natural language understanding tasks.

Transcript summary

A PhD researcher presents work on verb sense induction—automatically discovering different meanings of verbs from unlabeled corpus data using Bayesian nonparametric clustering. Through the Pólya Urn model (equivalent to the Chinese Restaurant Process), the system learns to distinguish senses like "enter an agreement" versus "enter a room" versus "enter a competition" without supervised examples. Applied to a billion web pages, the method discovered 437 distinct senses of "enter," demonstrating both the power and limitations of unsupervised semantic clustering.

Opening and Context

"I actually have seen some of you before... a lot of you guys are enjoying yourselves." The speaker notes this work is from their PhD—"not actually direct business application, but still hopefully something you'll be able to learn from." The talk is about verb semantics, and in particular, verb sense induction.

What Are Verb Senses?

The speaker provides examples with the verb "enter" in different sentences. "Ron enters a book in the South" is qualitatively a different kind of "enter" than "John enters an essay in a competition," or than entering a classroom or port. These different senses of "enter" can be labeled with colors, one per distinct meaning.

Example Senses Explained

The speaker goes through the examples: entering a book (publishing/introducing it), entering an organization (normally to join), entering a competition (submitting), entering a room (walking into). "We've got these identified senses. What's really cool about this is that if we were given this set of sentences, we would all be able to generate this idea—these are the different senses there are. We'd pretty much say these are the same, these are such-and-such, this is basically its own thing."

The Task: Sense Induction

The goal: label the senses from a corpus with limited prior knowledge and no labeled examples whatsoever, with an unknown number of senses. "We're going to cluster. We've got sentences, we're going to cluster them, and those clusters are going to be the senses of the verb." It's straightforward to do, and "it's going to help us whatever we're going to do with semantics later. Trying to understand the meaning of a sentence, we need to understand the meaning of a verb pretty well, and making these kinds of distinctions turned out to be really useful."

The Clustering Approach

Let's talk about how we do clustering. "You may have heard of a Chinese Restaurant Process or Dirichlet Process or Stick-Breaking Process—these are all kind of equivalent mathematical constructions." The speaker's favorite formulation is the Pólya Urn, and the reason: "it makes way more sense than the Chinese restaurant."

The Pólya Urn Model

"We're going to add colored balls to an urn. We've got an urn, we're going to add things to it, and the color is going to be the cluster assignment, just like it was on the slide before." First, we add a new item—each item is a sentence in the corpus. We're going to have to select what color or what cluster it belongs to, and we're first going to select the color based on the state of the urn as it is. "So we draw from the urn, note the color, and replace that ball. Using that color, we add a new instance for clustering."

The Process Explained

At the nth step, we're adding the nth ball—there are already n-1 balls in the urn. "We're going to draw one of them, note the color, put it back, and then add a new ball of that color." Each time drawing with replacement, then adding a new ball. The urn starts off with some number α (alpha) of black balls already inside there. "Whenever we see a black ball, it's a special thing that says 'oh, use a color we've never seen before.'" So in fact, the urn starts off with α, and at the nth step, there are n-1 + α balls.

Questions About Alpha

An audience member asks about alpha. "Alpha does not define how many clusters you have at the end—it defines how likely, exactly how likely, at each step are you to choose a new cluster." The probability of selecting color k is proportional to the count of how many times we've already seen that color (n_k) in the urn if it's already there, otherwise it's proportional to α. "To turn these proportional probabilities into proper probabilities, we need to divide by the number of balls in the urn plus α. So you can see that the chance of accumulating a new color decreases over time as the urn is more and more full, because this fixed α is not growing—it's not a fixed proportion, it's a fixed number of balls."
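The proportional probabilities described here fit in a few lines of Python. A minimal sketch — the function and variable names are mine, not from the talk:

```python
def crp_probs(counts, alpha=1.0):
    """Pólya Urn / CRP choice probabilities.

    counts[k] is the number of balls of color k already in the urn.
    P(existing color k) = n_k / (n + alpha); P(new color) = alpha / (n + alpha),
    where n = sum(counts). The last entry returned is P(new color).
    """
    denom = sum(counts) + alpha
    return [c / denom for c in counts] + [alpha / denom]

# 3 red balls, 1 green ball, alpha = 1:
p = crp_probs([3, 1], alpha=1.0)  # [0.6, 0.2, 0.2]
```

Note how the fixed α loses influence as the urn fills: with 100 colored balls already placed and α = 1, the new-color probability drops to 1/101.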

Clarification Through Dialogue

An audience member asks for clarification. "So there are α black balls initially?" "Yes." "Then you begin to put random balls into it?" "You're going to put a new ball in." "And you draw?" "Yes, in order to choose what color of ball to use, we first draw from the urn, note the color of the one we drew, and then put it back." "So initially there's only black balls?" "Exactly. The first time we're going to put something in, we're going to draw—we know for sure a black ball because there's only black balls in there. We'll choose a new color—say it's red—we put red in the urn. Now there's one red ball and α black balls."

Subsequent Steps

The next step: "We'll either draw a red ball, in which case the next ball we put in will also be red—there'll be two red balls and still α black balls—or we'll draw black and we'll have a new cluster, say green, put that in there." This clustering process will keep adding over time. "We can calculate the chances that we're going to choose a particular color ball, because that is basically chosen by sampling from the urn with replacement. So if you draw a red ball, you're going to put the red ball inside again." "Yes."
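The whole draw-replace-add loop is easy to simulate. A sketch under my own naming (not the speaker's code), which makes the slow growth in the number of colors visible:

```python
import random

def polya_urn(n_items, alpha=1.0, seed=0):
    """Simulate the urn: draw a ball with replacement; a colored ball means
    'copy that color', a black ball (probability alpha / (n + alpha)) means
    'start a new color'. Returns the color (cluster) of each added ball."""
    rng = random.Random(seed)
    counts = []                       # counts[k] = balls of color k so far
    assignments = []
    for _ in range(n_items):
        r = rng.uniform(0, sum(counts) + alpha)
        if r < alpha or not counts:   # drew one of the alpha black balls
            counts.append(1)          # open a brand-new color
            assignments.append(len(counts) - 1)
        else:                         # drew a colored ball: find which one
            r -= alpha
            k = 0
            while r >= counts[k]:
                r -= counts[k]
                k += 1
            counts[k] += 1
            assignments.append(k)
    return assignments

colors = polya_urn(1000, alpha=1.0, seed=42)
# With alpha = 1, a thousand draws typically yield only a handful of colors
# (the expected count grows like log n), not a thousand.
```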

Alpha as Hyperparameter

Another question: "So α is a total amount of black balls which is known at the start?" "Known at the start, yeah. This is a hyperparameter. It's almost always set to be one. There's α black balls in the urn."

The Chinese Restaurant Process Analogy

"This is also called a Chinese Restaurant Process because there's another way you can have this analogy." Imagine a Chinese restaurant where the host is going to seat you sometimes with strangers. "There exist restaurants like this—they're less common, I think, but apparently they were normal." The host is going to seat new customers one at a time and choose the tables that are the most popular so far, with a fixed probability of seating you at a new table. "It's the exact same mathematical description, and how you seat people at tables. But it requires you to imagine an infinite number of tables of infinite capacity in some Chinese restaurant with a very weird host."

Why Pólya Urn Is Better

"So yeah, I like Pólya Urn, but theoretically there's no restriction on α. It could be 0.1—it's proportional." The key insight: "What's really nice about this is that it's infinite in some sense. There is no upper bound on the number of clusters you can have. As we add new data, we can always choose to use a new cluster."

The Rich-Get-Richer Effect

"But much more importantly, it's also conservative. It has a rich-get-richer effect where if you have an urn with a billion red balls in it, you are a million times more likely to use red than you are to start a new cluster." With α equal to one, "you're not going to use too many clusters to describe your data, and that's a terrific prior. What's a prior that keeps the number of clusters down? This is a way to say 'I don't know how many clusters there are exactly—there's no upper bound, but there's not too many.' This is the key."

Bayesian Clustering with Bayes Rule

That fits into the framework: "We're Bayesian clustering—we're actually going to use Bayes Rule." This is standard: "We have conditional probabilities. The probability we're going to choose a cluster k given some data x and the urn u is going to be proportional to the probability of choosing that color in the urn, times the probability of generating x from that cluster."

Computing the Posterior

"This is the Bayesian framework. We need to compute this for all possible k, which there are some number of, and there's also the possibility of a new one." The first factor is exactly given by the proportional Chinese Restaurant Process or Pólya Urn scheme we had before. "So we've introduced a prior on the clustering—we'll use a small number of clusters, arbitrarily large but still conservatively small number of clusters."
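The two-factor posterior can be written down directly. This is an illustrative sketch (names are mine); `likelihood` stands in for whatever cluster-similarity score the model defines:

```python
def cluster_posterior(x, sizes, alpha, likelihood):
    """P(k | x, urn) for each existing cluster, plus the 'new cluster' option.

    sizes[k] is the number of instances already in cluster k (the urn prior);
    likelihood(x, k) is P(x | cluster k), with k == len(sizes) meaning an
    empty, brand-new cluster. Returns a normalized probability list whose
    last entry is the new-cluster option."""
    weights = [n_k * likelihood(x, k) for k, n_k in enumerate(sizes)]
    weights.append(alpha * likelihood(x, len(sizes)))
    z = sum(weights)
    return [w / z for w in weights]

# With a likelihood that treats every cluster equally, the posterior
# reduces to the bare Pólya Urn prior:
p = cluster_posterior("x", sizes=[3, 1], alpha=1.0, likelihood=lambda x, k: 1.0)
# p == [0.6, 0.2, 0.2]
```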

The Likelihood: Data Similarity

The second factor is where "we're writing out our intuitions about the data. We want like items to be paired together. If you're sitting at a table at a Chinese restaurant, you want to be sitting with your friends. So we need to describe how much alike a data point is to the rest of the cluster that we have in some given state."

Gibbs Sampling

"We're going to do this with Gibbs sampling because it gives us a really nice way to iteratively update our assignments." Basically, "it's like a random walker. You're going to end up trying to find a good cluster assignment, but the problem is that there are a lot—a lot—of possible clusterings of any given dataset."

The Combinatorial Explosion

"Let's say we're trying to cluster even a thousand items into up to an infinite number of clusters. In the worst case, we use a thousand clusters—one item per cluster. We could also put everything in a single cluster—that's a kind of degenerate clustering. Every possible partition in between is some clustering we could use, and each one has some likelihood of being generated by this process."

Leveraging Terrible Clusterings

"But if we define our data right, most of those clusterings are going to be just terrible where you've got stuff that's absolutely not looking like it should be grouped together with no sense, no rhyme, no reason. And this is actually the part we're going to leverage: because most possible clusterings are terrible, we only need to find a clustering that looks reasonably likely under this first part—the probability of the clustering given the data and our prior idea."

Why Gibbs Sampling Works

"If we can draw from the overall posterior distribution, we're good." Gibbs sampling: "We update one variable at a time, and it basically is like a random walker who takes one step. Your random walker is such that every step he makes is locally sensible according to some probability distribution, and over a long enough time, he will walk to a place that's globally sensible."

Convergence Properties

"It converges to a stationary distribution—it doesn't matter where we started, we're going to end up with something with relatively high probability. So we don't need to worry about where we started. We can initialize this entirely randomly, just use Gibbs sampling, and we'll get good results." "You guys have seen Gibbs sampling before? Good."
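One collapsed Gibbs sweep for this kind of clustering can be sketched as: remove an instance, score every cluster (plus a fresh one) by prior times likelihood, and resample its assignment. This is my illustration, not the speaker's code; the toy likelihood simply rewards clusters that already contain the same token:

```python
import random

def gibbs_sweep(data, assign, alpha, likelihood, rng):
    """One collapsed-Gibbs pass over all instances: each instance is
    removed from its cluster and reassigned by urn-prior x likelihood."""
    for i, x in enumerate(data):
        assign[i] = None                      # remove item i from its cluster
        sizes = {}
        for a in assign:
            if a is not None:
                sizes[a] = sizes.get(a, 0) + 1
        labels = sorted(sizes)
        weights = [sizes[k] * likelihood(x, k, assign, data) for k in labels]
        new_label = max(labels, default=-1) + 1   # the "black ball" option
        labels.append(new_label)
        weights.append(alpha * likelihood(x, new_label, assign, data))
        assign[i] = rng.choices(labels, weights=weights)[0]
    return assign

# Toy demo: two obvious groups; the likelihood is the smoothed fraction of
# the cluster's members that match the instance (a stand-in for shared
# slot-word pairs).
def lik(x, k, assign, data):
    members = [data[j] for j, a in enumerate(assign) if a == k]
    return (sum(1 for m in members if m == x) + 0.1) / (len(members) + 0.2)

rng = random.Random(0)
data = ["a", "a", "a", "b", "b", "b"]
assign = list(range(len(data)))               # start: every item alone
for _ in range(25):
    assign = gibbs_sweep(data, assign, alpha=0.5, likelihood=lik, rng=rng)
```

After enough sweeps the two token groups tend to collapse into their own clusters, though as the talk notes, the walk samples likely clusterings rather than maximizing anything.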

The Verb Sense Mixture Model

Let's put this together into a verb sense mixture model. "I'm going to say we've got dependency-parsed sentences. We know the verb, we also know the slots—subject, direct object, preposition, whatever parts came out of it. We're going to assume that shared context words suggest the same sense. So if two sentences have the same object, they're likely to be in the same sense. If they have the same subject, they're likely to be in the same sense."

Cluster Characteristics

"Each cluster has a higher likelihood to generate its frequent slot-filler arguments. So we can describe a cluster by the kinds of vocabulary—slot-word pairs—that it generates." Each cluster k has a multinomial distribution over slot-word pairs, and "these distributions are drawn from a Dirichlet prior with a small uniform parameter β (beta)."

Relation to Topic Models

"This looks like topic modeling." An audience member asks about slots. "Slot is a dependency arc slot, so it's a syntactic slot. It could be subject, direct object, prepositional object." They're just going to be the direct syntactic dependencies of the verb. "We're only using verbs because [verbs are] pretty much the better [unit for sense distinctions]."

Treating Slot-Word Pairs

"We're going to treat each of those slot-word pairs as an independent vocabulary item, and we're going to treat a cluster as having a multinomial distribution over those vocabulary items, drawn from a Dirichlet prior with parameter β (beta)." Updating the probability of drawing an instance: "It's going to be the probability of w (the slot-word pairs inside the instance)—that count of that vocabulary item inside the cluster plus β, and then normalize." This is the exact same way "we would see if a word has some probability of being drawn from a topic."
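That smoothed estimate is essentially one line. A sketch with my own names, where `vocab_size` is the number of distinct slot-word pairs:

```python
def word_prob(word, cluster_counts, beta, vocab_size):
    """P(slot-word pair | cluster): (count of the pair in the cluster + beta),
    normalized by (cluster total + beta * vocab_size) — the same form as a
    word's probability under a topic in LDA-style models."""
    total = sum(cluster_counts.values())
    return (cluster_counts.get(word, 0) + beta) / (total + beta * vocab_size)

# A cluster that has seen "dobj:door" 8 times and "nsubj:man" twice:
counts = {"dobj:door": 8, "nsubj:man": 2}
p_seen = word_prob("dobj:door", counts, beta=0.5, vocab_size=4)    # 8.5 / 12
p_unseen = word_prob("dobj:cat", counts, beta=0.5, vocab_size=4)   # 0.5 / 12
```

The β smoothing keeps unseen slot-word pairs at a small nonzero probability, so a new instance can still join a cluster it only partially matches.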

The Dirichlet Process Mixture Model

"This is our Dirichlet Process mixture model. We've got some Dirichlet Process G with parameter α (which is always 1), and then we have n verb instances. Each instance has its slot-word pairs w, which are drawn from the cluster-specific multinomial θ, which is in turn drawn from the Dirichlet prior up there."

Comparison to LDA

"So this looks kind of like LDA except without the extra layer of nesting that's required for us to do document-specific distributions. This is basically a simplified version of LDA with a nonparametric prior. It's a nice straightforward application of principles that you've seen in other places, I hope."

Speed-Up Techniques

A few things they've done to speed things up: "When I say we want things that are alike to be together, we went ahead and in some sensible order initially combined things. So all verbs that have the same exact slot-word argument—if they have the same direct object, they are in the same cluster to start. If they have the same subject, they are in the same cluster to start." This gave them some structure initially. "I'm going to say direct object takes precedence in the initial assignments over the same subject."
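The initialization heuristic—same direct object first, then same subject—might look like this; the slot names ("dobj", "nsubj") are my stand-ins for whatever the dependency parser actually emits:

```python
from collections import defaultdict

def initial_clusters(instances):
    """Group instances that share a direct object; instances without one
    fall back to their subject. Direct object takes precedence, matching
    the talk's initial-assignment ordering. Instances are slot -> word dicts."""
    clusters = defaultdict(list)
    for i, inst in enumerate(instances):
        if "dobj" in inst:
            key = ("dobj", inst["dobj"])
        else:
            key = ("nsubj", inst.get("nsubj"))
        clusters[key].append(i)
    return dict(clusters)

instances = [
    {"nsubj": "john", "dobj": "room"},
    {"nsubj": "mary", "dobj": "room"},
    {"nsubj": "mary", "dobj": "agreement"},
    {"nsubj": "mary"},                # no direct object -> grouped by subject
]
groups = initial_clusters(instances)
# {("dobj", "room"): [0, 1], ("dobj", "agreement"): [2], ("nsubj", "mary"): [3]}
```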

Discarding Low-Count Clusters

"We've also discarded the initial clusters that didn't gather enough counts into them to get started. So that initial clustering helped reduce the number of initial clusters and kind of gave us a head start getting deep into the problem."

Results: 437 Senses of "Enter"

"We got some senses, ran this over a billion web pages, and these are senses of 'enter'—437 of them. I will show you some that we generated through this." The way to read the results: "This is a gloss of the vocabulary inside the cluster—slot-word pairs—with the slot label on the left, the word on each line, and the counts."

Sense 1: Entering Agreements

"N-subj name" means "there are 60,000 instances in this cluster that had n-subj name." Name is just shorthand for named entity—"we didn't do named entity types, but that would be good work to do." This is a cluster where "you are joining some sort of agreement or negotiation. So this is a kind of 'enter into a contract' sense or event, which is clean and looks nice and also is a sensible way to see there's an 'enter' sense."

Sense 2: Joining Organizations

"Sense two: we've got again people (those coming late), members joining schools, college, or land. This one is probably a more joining sense—it's very clean."

Sense 3: Entering Through Physical Openings

"The next one, sense three: we've got people entering houses, buildings. They're coming through the door, the window." This is the physical entry through openings.

Sense 4: Entering Rooms

"The next sense: the same people are coming in—the room, the office. So you can see this is a categorization, but this is still people entering rooms, people entering. It's the same sense of 'enter,' but we've got slightly different phrasings."

Evaluation and Takeaways

"Takeaways are that the senses are fairly fine-grained—a lot of them kind of duplicate one another. There are 437 senses in total that we automatically induced. I don't believe there are 437 different kinds of entering, so we've got some duplication." There's also a little bit of noise "because Gibbs sampling is a random process. We're trying to end up in a region that's relatively likely, not come up with a maximum."

No Local Optima Problem

"The fact we're not ever trying to maximize means we can't get stuck in local optima, but it also means we don't end up in quite as nice a place."

Closing

"So that's the talk. I've got plenty of time for questions."

Key Technical Insights

This presentation demonstrates several important concepts in unsupervised natural language processing and Bayesian machine learning. First, the Pólya Urn / Chinese Restaurant Process provides an elegant solution to the fundamental problem of not knowing how many clusters exist—the nonparametric prior allows infinite possible clusters while the rich-get-richer dynamic keeps the count conservatively small. This balance between flexibility and parsimony is crucial for real-world clustering problems.

Second, Gibbs sampling's random walk property solves the combinatorial explosion problem. With thousands of items and potentially thousands of clusters, enumerating all possible partitions is intractable (the number of ways to partition n items is the Bell number B_n, which grows faster than exponential). But because most partitions are terrible given the data, a random walker that takes locally sensible steps will naturally drift toward high-probability regions without exhaustive search.
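The Bell numbers mentioned here are easy to compute via the Bell triangle, which makes the faster-than-exponential growth concrete:

```python
def bell_numbers(n):
    """Return [B_0, B_1, ..., B_n] via the Bell triangle: each row starts
    with the previous row's last entry, and each subsequent entry adds the
    entry directly above. B_k is the first entry of row k."""
    row = [1]
    bells = [1]                       # B_0
    for _ in range(n):
        nxt = [row[-1]]
        for x in row:
            nxt.append(nxt[-1] + x)
        row = nxt
        bells.append(row[0])
    return bells

# Even ten items already admit B_10 = 115,975 partitions; enumerating all
# partitions of a thousand items is hopeless.
b = bell_numbers(10)  # [1, 1, 2, 5, 15, 52, 203, 877, 4140, 21147, 115975]
```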

Third, the simplified LDA-style model shows how topic modeling machinery applies to word sense disambiguation. By treating slot-word pairs (subject-person, object-house, through-door) as vocabulary and senses as topics, the model learns which argument patterns co-occur. The key assumption—that sentences sharing arguments likely share senses—provides the clustering signal that allows unsupervised learning.

Fourth, the initialization strategy of grouping sentences by shared arguments (same direct object → same initial cluster) demonstrates practical engineering for MCMC methods. While Gibbs sampling theoretically converges from any initialization, good initialization dramatically speeds convergence in practice. The speaker's choice to prioritize direct objects over subjects reflects linguistic intuition about argument salience.

Fifth, the fine-grained over-clustering problem (437 senses of "enter") reveals a fundamental tension in unsupervised learning: the model optimizes likelihood, not human-interpretable sense granularity. Without supervision about how coarse or fine senses should be, the model splits based on distributional differences (entering rooms vs. entering buildings) even when humans might consider these the same sense. This highlights the gap between statistical clustering and semantic categories.

Sixth, the comparison to supervised methods is implicit but important: this approach requires no labeled training data, no sense inventories, and automatically adapts to corpus-specific usage patterns. The trade-off is over-splitting and noise, but for low-resource languages or domain-specific text where sense-labeled data doesn't exist, unsupervised induction provides the only practical path to sense-aware NLP systems.

Finally, the Bayesian framework's transparency allows interpreting what the model learned: the slot-word distributions for each cluster provide explicit descriptions ("this sense involves people as subjects, agreements as objects") rather than opaque feature vectors. This interpretability helps both debugging and downstream applications—you can inspect why the model made particular sense distinctions and whether they're useful for your task.