A Practical Exploration System for Search Advertising
Parikshit Shah (Yahoo Research);Ming Yang (Yahoo);Sachidanand Alle (Yahoo);Adwait Ratnaparkhi (Yahoo Research);Ben Shahshahani (Yahoo Research);Rohit Chandra (Yahoo)
In this paper, we describe an exploration system that was implemented by the search-advertising team of a prominent web-portal to address the cold ads problem. The cold ads problem refers to the situation where, when new ads are injected into the system by advertisers, the system is unable to assign an accurate quality to the ad (in our case, the click probability). As a consequence, the advertiser may suffer from low impression volumes for these cold ads, and the overall system may perform sub-optimally if the click probabilities for new ads are not learnt rapidly. We designed a new exploration system that was adapted to search advertising and the serving constraints of the system. In this paper, we define the problem, discuss the design details of the exploration system, new evaluation criteria, and present the performance metrics that were observed by us.
The Maximum Entropy Framework
To explain what maximum entropy is, it will be simplest to quote from Manning and Schutze* (p. 589):
Maximum entropy modeling is a framework for integrating information from many heterogeneous information sources for classification. The data for a classification problem is described as a (potentially large) number of features. These features can be quite complex and allow the experimenter to make use of prior knowledge about what types of informations are expected to be important for classification. Each feature corresponds to a constraint on the model. We then compute the maximum entropy model, the model with the maximum entropy of all the models that satisfy the constraints. This term may seem perverse, since we have spent most of the book trying to minimize the (cross) entropy of models, but the idea is that we do not want to go beyond the data. If we chose a model with less entropy, we would add `information' constraints to the model that are not justified by the empirical evidence available to us. Choosing the maximum entropy model is motivated by the desire to preserve as much uncertainty as possible.
So that gives a rough idea of what the maximum entropy framework is. Don't assume anything about your probability distribution other than what you have observed.
On the engineering level, using maxent is an excellent way of creating programs which perform very difficult classification tasks very well. For example, precision and recall figures for programs using maxent models have reached (or are) the state of the art on tasks like part of speech tagging, sentence detection, prepositional phrase attachment, and named entity recognition. On the engineering level, an added benefit is that the person creating a maxent model only needs to inform the training procedure of the event space, and need not worry about independence between features.
While the authors of this implementation of maximum entropy are generally interested using maxent models in natural language processing, the framework is certainly quite general and useful for a much wider variety of fields. In fact, maximum entropy modeling was originally developed for statistical physics.
For a very in-depth discussion of how maxent can be used in natural language processing, try reading Adwait Ratnaparkhi's dissertation. Also, check out Berger, Della Pietra, and Della Pietra's paper A Maximum Entropy Approach to Natural Language Processing, which provides an excellent introduction and discussion of the framework.
*Foundations of statistical natural language processing . Christopher D. Manning, Hinrich Schutze.
Cambridge, Mass. : MIT Press, c1999.
We have tried to make the opennlp.maxent implementation easy to use. To create a model, one needs (of course) the training data, and then implementations of two interfaces in the opennlp.maxent package, EventStream and ContextGenerator. These have fairly simple specifications, and example implementations can be found in the OpenNLP Grok Library preprocessing components. More details are given in the opennlp.maxent HOWTO.
We have also set in place some interfaces and code to make it easier to automate the training and evaluation process (the Evalable interface and the TrainEval class). It is not necessary to use this functionality, but if you do you'll find it much easier to see how well your models are doing. The opennlp.grok.preprocess.namefind package is an example of a maximum entropy component which uses this functionality.
We have managed to use several techniques to reduce the size of the models when writing them to disk, which also means that reading in a model for use is much quicker than with less compact encodings of the model. This was especially important to us since we use many maxent models in the Grok library, and we wanted the start up time and the physical size of the library to be as minimal as possible. As of version 1.2.0, maxent has an io package which greatly simplifies the process of loading and saving models in different formats.
The opennlp.maxent package was originally built by Jason Baldridge, Tom Morton, and Gann Bierner. We owe a big thanks to Adwait Ratnaparkhi for his work on maximum entropy models for natural language processing applications. His introduction to maxent for NLP and dissertation are what really made opennlp.maxent and our Grok maxent components (POS tagger, end of sentence detector, tokenizer, name finder) possible!
Eric Friedman has been steadily improving the efficiency and design of the package since version 1.2.0.