Learning for information extraction: from named entity recognition and disambiguation to relation extraction

From WikiLit
Publication
Learning for information extraction: from named entity recognition and disambiguation to relation extraction
Authors: Razvan Constantin Bunescu, Raymond J. Mooney
Citation: The University of Texas at Austin, 2007. United States, Texas.
Publication type: Thesis
Peer-reviewed: Yes
Link(s): http://dl.acm.org/citation.cfm?id=1354680
Added by Wikilit team: Added on initial load
Learning for information extraction: from named entity recognition and disambiguation to relation extraction is a thesis by Razvan Constantin Bunescu and Raymond J. Mooney.


Abstract

Information Extraction, the task of locating textual mentions of specific types of entities and their relationships, aims at representing the information contained in text documents in a structured format that is more amenable to applications in data mining, question answering, or the semantic web. The goal of our research is to design information extraction models that obtain improved performance by exploiting types of evidence that have not been explored in previous approaches. Since designing an extraction system through introspection by a domain expert is a laborious and time consuming process, the focus of this thesis will be on methods that automatically induce an extraction model by training on a dataset of manually labeled examples.

Named Entity Recognition is an information extraction task that is concerned with finding textual mentions of entities that belong to a predefined set of categories. We approach this task as a phrase classification problem, in which candidate phrases from the same document are collectively classified. Global correlations between candidate entities are captured in a model built using the expressive framework of Relational Markov Networks. Additionally, we propose a novel tractable approach to phrase classification for named entity recognition based on a special Junction Tree representation.

Classifying entity mentions into a predefined set of categories achieves only a partial disambiguation of the names. This is further refined in the task of Named Entity Disambiguation, where names need to be linked to their actual denotations. In our research, we use Wikipedia as a repository of named entities and propose a ranking approach to disambiguation that exploits learned correlations between words from the name context and categories from the Wikipedia taxonomy.

Relation Extraction refers to finding relevant relationships between entities mentioned in text documents. Our approaches to this information extraction task differ in the type and the amount of supervision required. We first propose two relation extraction methods that are trained on documents in which sentences are manually annotated for the required relationships. In the first method, the extraction patterns correspond to sequences of words and word classes anchored at two entity names occurring in the same sentence. These are used as implicit features in a generalized subsequence kernel, with weights computed through training of Support Vector Machines. In the second approach, the implicit extraction features are focused on the shortest path between the two entities in the word-word dependency graph of the sentence. Finally, in a significant departure from previous learning approaches to relation extraction, we propose reducing the amount of required supervision to only a handful of pairs of entities known to exhibit or not exhibit the desired relationship. Each pair is associated with a bag of sentences extracted automatically from a very large corpus. We extend the subsequence kernel to handle this weaker form of supervision, and describe a method for weighting features in order to focus on those correlated with the target relation rather than with the individual entities. The resulting Multiple Instance Learning approach offers a competitive alternative to previous relation extraction methods, at a significantly reduced cost in human supervision.
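
As a rough illustration of the second relation extraction method mentioned in the abstract, the sketch below finds the shortest path between two entity tokens in a word-word dependency graph. It is only a minimal sketch under stated assumptions: the function name `shortest_path`, the hand-written `EDGES` list, and the example sentence are hypothetical and are not taken from the thesis, and the graph is treated as undirected and searched breadth-first. How the thesis turns such a path into kernel features is not shown here.

```python
from collections import deque


def shortest_path(edges, source, target):
    """Return the shortest undirected path between two tokens, or None."""
    # Build an undirected adjacency map from (head, dependent) pairs.
    graph = {}
    for head, dependent in edges:
        graph.setdefault(head, set()).add(dependent)
        graph.setdefault(dependent, set()).add(head)
    if source not in graph or target not in graph:
        return None

    # Breadth-first search, remembering each token's predecessor.
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the predecessor links back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in sorted(graph[node]):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical (head, dependent) edges for the toy sentence
# "Protesters seized several stations"; not data from the thesis.
EDGES = [
    ("seized", "Protesters"),
    ("seized", "stations"),
    ("stations", "several"),
]

print(shortest_path(EDGES, "Protesters", "stations"))
```

In this toy sentence the path between the two entity tokens passes through the verb and skips the modifier "several", which is the kind of focus on relation-bearing words that a shortest-path representation is meant to give.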

Research questions

"The goal of this thesis is to derive information extraction models with improved performance by exploiting types of evidence that have not been used in previous systems. Since designing an IE system through introspection by a domain expert is a laborious and time consuming process, the focus of this thesis will be on methods that automatically induce the extraction model by training on a dataset of manually labeled examples. Because every model is designed such that its parameters can be learned through training on supervised data, we also address efficiency related issues, such as the time complexity of the algorithms used for inference and training. The advantages of our proposed models are empirically validated through experimental evaluations in which the new method is compared against previously proposed methods. The contributions of this thesis are outlined below."

Research details

Topics: Information extraction
Domains: Computer science
Theory type: Design and action
Wikipedia coverage: Sample data
Theories: "Undetermined"
Research design: Experiment
Data source: Experiment responses, Wikipedia pages
Collected data time dimension: Cross-sectional
Unit of analysis: Article
Wikipedia data extraction: Dump
Wikipedia page type: Article
Wikipedia language: English

Conclusion

"The complexity of the resulting graphical model allows only for approximate inference. Motivated by the superior accuracy of exact inference methods, we have also presented a second approach to named entity recognition, cast as phrase based classification with local correlations, in which exact inference can be efficiently achieved in time that is linear in the number of candidate entities. Compared to token classification approaches, our phrase classification models can easily incorporate phrase based features. The classification of textual occurrences of entity names into predefined categories, as done in named entity recognition, results only in a partial disambiguation of the names. We have therefore presented an approach to named entity disambiguation that tries to fully disambiguate proper names by linking them to the appropriate entries in Wikipedia, a large online encyclopedia. We have modeled disambiguation as a ranking problem, and showed that improved accuracy is obtained by exploiting learned correlations between context words and categories in the Wikipedia taxonomy.

Overall, the research described in this thesis has contributed with learning models that leverage useful new types of evidence in order to obtain improved extraction performance. On a long term scale, we see the proposed methods as a useful step towards building an integrated information extraction model that is robust, accurate, and not overly demanding in terms of human supervision."
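
The idea of treating disambiguation as a ranking problem over Wikipedia entries, as described in the conclusion, can be sketched roughly as follows. This is an illustrative sketch only, not the thesis's model: the candidate titles, categories, and weights are made-up toy values (the thesis learns such correlations from data), and the names `score` and `disambiguate` are hypothetical.

```python
def score(context_words, categories, weights):
    """Sum the weights of all (context word, category) pairs that co-occur."""
    return sum(
        weights.get((word, category), 0.0)
        for word in set(context_words)
        for category in categories
    )


def disambiguate(context_words, candidates, weights):
    """Return the candidate entry whose categories best match the context."""
    return max(
        candidates,
        key=lambda title: score(context_words, candidates[title], weights),
    )


# Toy candidate entries and their categories; illustrative only,
# not real Wikipedia data.
CANDIDATES = {
    "Jaguar (animal)": {"Felines"},
    "Jaguar (car)": {"Car manufacturers"},
}

# Hand-set word-category weights standing in for learned correlations.
WEIGHTS = {
    ("engine", "Car manufacturers"): 1.2,
    ("speed", "Car manufacturers"): 0.5,
    ("jungle", "Felines"): 1.5,
    ("speed", "Felines"): 0.4,
}

print(disambiguate(["the", "engine", "speed"], CANDIDATES, WEIGHTS))
print(disambiguate(["jungle", "speed"], CANDIDATES, WEIGHTS))
```

Scoring every candidate and keeping the top one makes the ranking view explicit: the ambiguous word "speed" contributes to both entries, and the more specific context words decide the outcome.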

Comments


Further notes

Facts about "Learning for information extraction: from named entity recognition and disambiguation to relation extraction"
Added by wikilit team: Added on initial load
Collected data time dimension: Cross-sectional
Conference location: United States, Texas
Data source: Experiment responses, Wikipedia pages
Google scholar url: http://scholar.google.com/scholar?ie=UTF-8&q=%22Learning%2Bfor%2Binformation%2Bextraction%3A%2Bfrom%2Bnamed%2Bentity%2Brecognition%2Band%2Bdisambiguation%2Bto%2Brelation%2Bextraction%22
Has author: Razvan Constantin Bunescu, Raymond J. Mooney
Has domain: Computer science
Has topic: Information extraction
Peer reviewed: Yes
Publication type: Thesis
Published in: The University of Texas at Austin
Research design: Experiment
Revid: 10,846
Theories: Undetermined
Theory type: Design and action
Title: Learning for information extraction: from named entity recognition and disambiguation to relation extraction
Unit of analysis: Article
Url: http://dl.acm.org/citation.cfm?id=1354680
Wikipedia coverage: Sample data
Wikipedia data extraction: Dump
Wikipedia language: English
Wikipedia page type: Article
Year: 2007