Learning for information extraction: from named entity recognition and disambiguation to relation extraction
Abstract Information Extraction, the task of locating textual mentions of specific types of entities and their relationships, aims at representing the information contained in text documents in a structured format that is more amenable to applications in data mining, question answering, or the semantic web. The goal of our research is to design information extraction models that obtain improved performance by exploiting types of evidence that have not been explored in previous approaches. Since designing an extraction system through introspection by a domain expert is a laborious and time-consuming process, the focus of this thesis will be on methods that automatically induce an extraction model by training on a dataset of manually labeled examples. Named Entity Recognition is an information extraction task that is concerned with finding textual mentions of entities that belong to a predefined set of categories. We approach this task as a phrase classification problem, in which candidate phrases from the same document are collectively classified. Global correlations between candidate entities are captured in a model built using the expressive framework of Relational Markov Networks. Additionally, we propose a novel tractable approach to phrase classification for named entity recognition based on a special Junction Tree representation. Classifying entity mentions into a predefined set of categories achieves only a partial disambiguation of the names. This is further refined in the task of Named Entity Disambiguation, where names need to be linked to their actual denotations. In our research, we use Wikipedia as a repository of named entities and propose a ranking approach to disambiguation that exploits learned correlations between words from the name context and categories from the Wikipedia taxonomy. Relation Extraction refers to finding relevant relationships between entities mentioned in text documents.
Our approaches to this information extraction task differ in the type and the amount of supervision required. We first propose two relation extraction methods that are trained on documents in which sentences are manually annotated for the required relationships. In the first method, the extraction patterns correspond to sequences of words and word classes anchored at two entity names occurring in the same sentence. These are used as implicit features in a generalized subsequence kernel, with weights computed through training of Support Vector Machines. In the second approach, the implicit extraction features are focused on the shortest path between the two entities in the word-word dependency graph of the sentence. Finally, in a significant departure from previous learning approaches to relation extraction, we propose reducing the amount of required supervision to only a handful of pairs of entities known to exhibit or not exhibit the desired relationship. Each pair is associated with a bag of sentences extracted automatically from a very large corpus. We extend the subsequence kernel to handle this weaker form of supervision, and describe a method for weighting features in order to focus on those correlated with the target relation rather than with the individual entities. The resulting Multiple Instance Learning approach offers a competitive alternative to previous relation extraction methods, at a significantly reduced cost in human supervision.
Added by wikilit team Added on initial load  +
Collected data time dimension Cross-sectional  +
Conclusion The complexity of the resulting graphical model allows only for approximate inference. Motivated by the superior accuracy of exact inference methods, we have also presented a second approach to named entity recognition, cast as phrase-based classification with local correlations, in which exact inference can be efficiently achieved in time that is linear in the number of candidate entities. Compared to token classification approaches, our phrase classification models can easily incorporate phrase-based features. The classification of textual occurrences of entity names into predefined categories, as done in named entity recognition, results only in a partial disambiguation of the names. We have therefore presented an approach to named entity disambiguation that tries to fully disambiguate proper names by linking them to the appropriate entries in Wikipedia, a large online encyclopedia. We have modeled disambiguation as a ranking problem, and showed that improved accuracy is obtained by exploiting learned correlations between context words and categories in the Wikipedia taxonomy. Overall, the research described in this thesis has contributed learning models that leverage useful new types of evidence in order to obtain improved extraction performance. In the long term, we see the proposed methods as a useful step towards building an integrated information extraction model that is robust, accurate, and not overly demanding in terms of human supervision.
Conference location United States, Texas +
Data source Experiment responses  + , Wikipedia pages  +
Google scholar url http://scholar.google.com/scholar?ie=UTF-8&q=%22Learning%2Bfor%2Binformation%2Bextraction%3A%2Bfrom%2Bnamed%2Bentity%2Brecognition%2Band%2Bdisambiguation%2Bto%2Brelation%2Bextraction%22  +
Has author Razvan Constantin Bunescu + , Raymond J. Mooney +
Has domain Computer science +
Has topic Information extraction +
Peer reviewed Yes  +
Publication type Thesis  +
Published in The University of Texas at Austin +
Research design Experiment  +
Research questions The goal of this thesis is to derive information extraction models with improved performance by exploiting types of evidence that have not been used in previous systems. Since designing an IE system through introspection by a domain expert is a laborious and time-consuming process, the focus of this thesis will be on methods that automatically induce the extraction model by training on a dataset of manually labeled examples. Because every model is designed such that its parameters can be learned through training on supervised data, we also address efficiency-related issues, such as the time complexity of the algorithms used for inference and training. The advantages of our proposed models are empirically validated through experimental evaluations in which the new method is compared against previously proposed methods. The contributions of this thesis are outlined below.
Revid 10,846  +
Theories Undetermined
Theory type Design and action  +
Title Learning for information extraction: from named entity recognition and disambiguation to relation extraction
Unit of analysis Article  +
Url http://dl.acm.org/citation.cfm?id=1354680  +
Wikipedia coverage Sample data  +
Wikipedia data extraction Dump  +
Wikipedia language English  +
Wikipedia page type Article  +
Year 2007  +
Creation date 15 March 2012 20:29:26  +
Categories Information extraction  + , Computer science  + , Publications with missing comments  + , Publications  +
Modification date 30 January 2014 20:29:22  +