At EMNLP, a prestigious international conference on Natural Language Processing (NLP), a team of Pryon researchers (Steven Rennie, Etienne Marcheret, Neil Mallinar, David Nahamoo, and Vaibhava Goel) presented cutting-edge research on customizing natural language question-answering models. Dr. Goel, who heads research at Pryon, stated, “Our approach alleviates the need for laborious human annotation of data when the question-answering models are customized for specific domains and use cases.”

Let’s take a step back and see how this differs from the old way of building and customizing AI models. AI models rely on massive amounts of data, specifically annotated data, to “learn” the tasks they are built to perform. Annotated data, though, is immensely expensive because it takes a huge number of person-hours to comb through troves of data and tag it manually. In fact, by some estimates, about 80% of the time spent on AI projects goes to aggregating, cleaning, labeling, and augmenting data for machine learning (ML) models.

Automated Annotation

Not only is annotated data expensive, but you need different data for each use case. For example, let’s say you have an AI model designed to take a question from a user, figure out what it means, find the answer, and then present it to the user. If the model has learned from retail data, it will have a hard time understanding the documents and data of a law firm. Each application therefore needs its own customized annotated data.

Since producing annotated and customized data is too laborious and expensive to do manually, AI researchers have been exploring methods of automatically generating annotated data for a target domain using generative models trained on pre-existing (potentially mismatched) source data. This is called unsupervised adaptation via generative self-training. This kind of generative customization for question-answering systems is a very active area of exploration in the research community. While effective approaches based on intuitive heuristics have been proposed, the main drawback is that they lack mathematical grounding.

Grounding Unsupervised Customization

At Pryon, we started by applying a mathematical lens to what others have been doing to create machine-generated custom data and carry out unsupervised customization. With this lens, we were able to understand the process more rigorously and therefore improve upon it.

In essence, we were able to formulate a mathematical link between two things: semi-supervised self-training, a general approach for improving a model’s performance using its own predictions, and maximizing the probability of the client data that we want to customize a model for. This connection allowed us to extend and generalize existing approaches to model customization.
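
To give a feel for what such a link can look like, here is a generic sketch. It is not the derivation from the paper, and the notation is our assumption for illustration only: c is a passage of client (target) text, q and a are a question and answer about it, p_θ is the model being customized, and r is any distribution over question-answer pairs for that passage. By Jensen's inequality, the log-probability of the client text is bounded from below:

```latex
\log p_\theta(c)
  \;=\; \log \sum_{q,a} p_\theta(q, a, c)
  \;\ge\; \sum_{q,a} r(q, a \mid c)\,
          \log \frac{p_\theta(q, a, c)}{r(q, a \mid c)}
```

If r is taken to be the model's own predictions from a previous round, then raising the right-hand side with respect to θ amounts to training on question-answer pairs the model generated itself. Seen this way, self-training becomes a way of pushing up a lower bound on the probability of the client data.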

In the paper, “Unsupervised Adaptation of Question Answering Systems via Generative Self-Training,” we show that the current approach to creating customized data and updating the answer selection model is only a portion of what is possible. Under our semi-supervised framework, the machine gets progressively better at creating annotated data as it iterates between pre-training on synthetic target data and fine-tuning on available ground-truth data. In fact, by the third iteration in one scenario, an answer selection model trained only on our auto-generated Q&A samples for the target data outperformed one trained on the manually annotated source data. In other words, the generated samples were good enough to train a model on their own.
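
For readers who think in code, the iteration described above can be sketched in a few lines of Python. This is a minimal illustration of the control flow only, not the implementation from the paper: ToyModel, toy_generate, toy_train, and self_train are hypothetical names, and the "model" here merely records what it was trained on instead of learning anything.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# (passage, question, answer): a hypothetical record layout for this sketch.
QAExample = Tuple[str, str, str]


@dataclass
class ToyModel:
    """Stand-in for a real QA model; it only logs its training stages."""
    history: List[Tuple[str, int]] = field(default_factory=list)


def toy_generate(model: ToyModel, passage: str) -> List[QAExample]:
    """Placeholder generator: one synthetic Q&A pair per sentence."""
    sentences = [s.strip() for s in passage.split(".") if s.strip()]
    return [
        (passage, f"What does the text say about '{s.split()[0]}'?", s)
        for s in sentences
    ]


def toy_train(model: ToyModel, data: List[QAExample], stage: str) -> ToyModel:
    """Placeholder training step: returns a new model with the stage logged."""
    return ToyModel(history=model.history + [(stage, len(data))])


def self_train(
    model: ToyModel,
    target_passages: List[str],
    labeled_source: List[QAExample],
    generate: Callable[[ToyModel, str], List[QAExample]],
    train: Callable[[ToyModel, List[QAExample], str], ToyModel],
    rounds: int = 3,
) -> ToyModel:
    for _ in range(rounds):
        # 1. The current model writes Q&A pairs for the unlabeled target text.
        synthetic = [ex for p in target_passages for ex in generate(model, p)]
        # 2. Pre-train on the synthetic target-domain pairs.
        model = train(model, synthetic, "pretrain_synthetic")
        # 3. Fine-tune on the available human-annotated (ground-truth) data.
        model = train(model, labeled_source, "finetune_ground_truth")
    return model


# Toy data: unlabeled "law firm" passages and one labeled "retail" example.
passages = ["Leases renew yearly. Tenants pay deposits.", "Courts hear appeals."]
labeled = [("Shoes ship free.", "What ships free?", "Shoes")]

final = self_train(ToyModel(), passages, labeled, toy_generate, toy_train)
print(final.history)
```

In a real system, the generator and the training step would be neural models, and the point of looping is that each round's generator benefits from the previous round's training, so the synthetic pairs improve over time.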

These developments mark significant steps in our journey to automate the creation of accurate and effective AI models for new domains by completely eliminating the need for expensive, time-consuming human labor to generate domain-specific annotated data.
