Automatising the learning of lexical patterns: an application to the enrichment of WordNet by extracting semantic relationships from Wikipedia

Authors: Maria Ruiz-Casado, Enrique Alfonseca, Pablo Castells
Citation: Data and Knowledge Engineering 61 (3): 484-499. 2007.
Publication type: Journal article
Peer-reviewed: Yes
DOI: 10.1016/j.datak.2006.06.011
Automatising the learning of lexical patterns: an application to the enrichment of WordNet by extracting semantic relationships from Wikipedia is a publication by Maria Ruiz-Casado, Enrique Alfonseca, and Pablo Castells.


Abstract

This paper describes an automatic approach to identify lexical patterns that represent semantic relationships between concepts in an on-line encyclopedia. Next, these patterns can be applied to extend existing ontologies or semantic networks with new relations. The experiments have been performed with the Simple English Wikipedia and WordNet 1.7. A new algorithm has been devised for automatically generalising the lexical patterns found in the encyclopedia entries. We have found general patterns for the hyperonymy, hyponymy, holonymy and meronymy relations and, using them, we have extracted more than 2600 new relationships that did not appear in WordNet originally. The precision of these relationships depends on the degree of generality chosen for the patterns and the type of relation, being around 60-70% for the best combinations proposed.
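To make the idea of applying lexical patterns concrete, here is a minimal sketch of how surface patterns of this kind could be matched against encyclopedia sentences to propose relation triples. The patterns, relation labels and example sentences below are illustrative only; the paper learns and generalises its patterns automatically rather than writing them by hand.

```python
import re

# Illustrative, hand-written surface patterns; the paper learns and
# generalises its patterns automatically from Wikipedia sentences.
PATTERNS = {
    "is-a":    re.compile(r"[Aa]n? (?P<source>\w+) is an? (?P<target>\w+)"),
    "part-of": re.compile(r"[Aa]n? (?P<source>\w+) is part of an? (?P<target>\w+)"),
}

def extract_relations(sentence):
    """Return (relation, source, target) triples for every pattern that matches."""
    triples = []
    for relation, pattern in PATTERNS.items():
        match = pattern.search(sentence)
        if match:
            triples.append((relation, match.group("source"), match.group("target")))
    return triples

for s in ["A dog is a mammal.", "A wheel is part of a car."]:
    print(s, "->", extract_relations(s))
# A dog is a mammal. -> [('is-a', 'dog', 'mammal')]
# A wheel is part of a car. -> [('part-of', 'wheel', 'car')]
```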

Research questions

"n this paper, we present a procedure for automatically enriching an existing lexical semantic network with new relationships extracted from on-line encyclopedic information. The approach followed is mainly based in the use of lexical patterns that model each type of relationship and natural language processing resources. The semantic network chosen is WordNet [10], given that it is currently used in many applications, although the procedure is general enough to be used with other ontologies. The encyclopedia used is the Wikipedia, a collaborative web-based resource which is being constantly updated by its users"

Research details

Topics: Other natural language processing topics
Domains: Computer science
Theory type: Design and action
Wikipedia coverage: Sample data
Theories: "Undetermined"
Research design: Experiment
Data source: Experiment responses, Websites, Wikipedia pages
Collected data time dimension: Cross-sectional
Unit of analysis: Article
Wikipedia data extraction: Live Wikipedia
Wikipedia page type: Article
Wikipedia language: English

Conclusion

"The algorithm has been evaluated with the whole Simple English Wikipedia entries, as available on September 27, 2005. Each of the entries was disambiguated using the procedure described in [63]. An evaluation of 360 entries, performed by two human judges, indicates that the precision of the disambiguation is 92% (87% for polysemous words). The high figure should not come as a surprise, given that, as can be expected, it is an easier problem to disambiguate the title of an encyclopedia entry (for which there exist much relevant data) than a word inside unrestricted text.

The next step consisted in extracting, from each Wikipedia entry e, a list of sentences containing references to other entries f which are related with e inside WordNet. This resulted in 485 sentences for hyponymy, 213 for hyperonymy, 562 for holonymy and 509 for meronymy. When analysing these patterns, however, we found that, both for hyperonymy and meronymy, most of the sentences extracted only contained the name of the entry f (the target of the relationship) with no contextual information around it. The reason was unveiled by examining the web pages:

• In the case of hyponyms and holonyms, it is very common to express the relationship with natural language, with expressions such as A dog is a mammal, or A wheel is part of a car.

• On the other hand, when describing hyperonyms and meronyms, their hyponyms and holonyms are usually expressed with enumerations, which tend to be formatted as HTML bullet lists. Therefore, the sentence splitter chunks each hyponym and each holonym as belonging to a separate sentence.

All the results in these experiments have been evaluated by hand by two judges. The total inter-judge agreement reached 95%. In order to unify the criteria, in the doubtful cases, similar relations were looked up inside WordNet, and the judges tried to apply the same criteria as shown by those examples. The cases in which the judges disagree have not been taken into consideration for calculating the accuracy."
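The sentence-selection step described at the start of the conclusion can be sketched roughly as follows, assuming NLTK's WordNet interface as a stand-in for WordNet 1.7 and plain string matching of entry names; the function names are hypothetical, and the paper's own pipeline additionally disambiguates entry titles before this step.

```python
from nltk.corpus import wordnet as wn  # assumes NLTK and its WordNet data are installed

def related_lemmas(word):
    """Lemmas connected to `word` in WordNet by the four relations studied
    (hyperonymy, hyponymy, holonymy, meronymy)."""
    related = set()
    for synset in wn.synsets(word):
        neighbours = (synset.hypernyms() + synset.hyponyms()
                      + synset.part_holonyms() + synset.part_meronyms())
        for neighbour in neighbours:
            related.update(l.replace("_", " ") for l in neighbour.lemma_names())
    return related

def candidate_sentences(entry_title, sentences, linked_entries):
    """Keep the sentences of a Wikipedia entry that mention another entry
    already related to it in WordNet; pattern learning starts from these."""
    related = {r.lower() for r in related_lemmas(entry_title)}
    wanted = {e for e in linked_entries if e.lower() in related}
    return [s for s in sentences
            if any(target.lower() in s.lower() for target in wanted)]
```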

Comments

"Wikipedia pages - Websites (WordNet)"


Further notes