Extracting key terms from noisy and multitheme documents
Abstract We present a novel method for key term extraction from text documents. In our method, a document is modeled as a graph of semantic relationships between the terms of that document. We exploit the following remarkable feature of the graph: terms related to the main topics of the document tend to bunch up into densely interconnected subgraphs, or communities, while non-important terms fall into weakly interconnected communities or even become isolated vertices. We apply graph community detection techniques to partition the graph into thematically cohesive groups of terms, and introduce a criterion function to select the groups that contain key terms, discarding groups with unimportant terms. To weight terms and determine semantic relatedness between them, we exploit information extracted from Wikipedia. This approach gives us two advantages. First, it allows effective processing of multi-theme documents. Second, it is good at filtering out noisy information in a document, such as navigational bars or headers in web pages. Evaluations show that the method outperforms existing methods, producing key terms with higher precision and recall. Additional experiments on web pages show that our method is substantially more effective on noisy and multi-theme documents than existing methods.
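The group-selection step the abstract describes can be sketched with a toy criterion: score each detected term community by the density of pairwise semantic relatedness inside it, multiplied by the average term weight. This is an illustrative stand-in, not the paper's actual criterion function, and all term names and weights below are hypothetical:

```python
def rank_communities(communities, term_weight, relatedness):
    """Rank term communities so that dense, heavily weighted groups come first.

    Hypothetical criterion for illustration only; the paper defines its own.
    term_weight maps term -> importance weight, relatedness maps a sorted
    term pair -> semantic relatedness in [0, 1].
    """
    def score(comm):
        terms = sorted(comm)
        pairs = [(a, b) for i, a in enumerate(terms) for b in terms[i + 1:]]
        # Internal cohesion: total relatedness of all pairs in the group.
        density = sum(relatedness.get((a, b), 0.0) for a, b in pairs)
        avg_weight = sum(term_weight[t] for t in terms) / len(terms)
        return density * avg_weight

    return sorted(communities, key=score, reverse=True)
```

Under such a criterion, a community of related topical terms outranks a community of boilerplate terms ("banner", "login") even when both are internally connected, because the latter carries low term weights.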
Added by wikilit team Added on initial load
Collected data time dimension Cross-sectional
Comments "[The proposed] method for extracting key terms from a text document...does not require any training, as it works upon the Wikipedia-based knowledge base." p. 669
Conclusion One of the advantages of our method is that it does not require any training, as it works upon a Wikipedia-based knowledge base. An important and novel feature of our method is that it produces groups of key terms, where each group contains key terms related to one of the main topics of the document. Thus, our method implicitly identifies the main document topics, and further categorization and clustering of the document can greatly benefit from that. From an implementation viewpoint, the novel feature of our method is that, for the first time, an algorithm for detecting the community structure of a network is applied to analyze a semantic graph of terms extracted from a document. Our experimental results show that our method produces high-quality key terms comparable to those produced by state-of-the-art systems in the area. Evaluation showed that our method produces key terms with 67.7% recall and 46.1% precision, which we consider significantly high. We also conducted experiments on multi-theme and noisy web pages, with performance figures significantly higher than those of competing methods. This allows us to conclude that a promising application of our method is improving content-targeted advertising systems, which have to deal with such web pages.
Data source Experiment responses, Archival records, Wikipedia pages
Doi 10.1145/1526709.1526798
Google scholar url
Has author Maria Grineva, Maxim Grinev, Dmitry Lizorkin
Has domain Computer science
Has topic Information extraction, Semantic relatedness
Peer reviewed Yes
Publication type Conference paper
Published in WWW '09 Proceedings of the 18th international conference on World wide web
Research design Experiment
Research questions In this paper, we propose a new approach to key term extraction that differs in two respects. First, instead of using statistics gathered from a training set, which might be stale or domain-specific, we use semantic information derived from a universal knowledge base (namely, Wikipedia), which provides up-to-date and extensive coverage of a broad range of domains. Second, our method utilizes more information from within the processed document by identifying and analyzing semantic relatedness between terms in the document.
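The Wikipedia-derived semantic relatedness mentioned here can be illustrated with the widely used Milne-Witten link-based measure, which compares the sets of articles linking to two Wikipedia pages. Whether the paper uses exactly this formula is not stated in this record, so treat the function below as one plausible instantiation with toy inputs:

```python
import math

def link_relatedness(links_a, links_b, total_articles):
    """Milne-Witten style link-based relatedness between two Wikipedia articles.

    links_a / links_b: sets of article ids that link TO each article;
    total_articles: total number of articles in Wikipedia.
    Returns a score in [0, 1]; 1.0 means identical in-link sets.
    """
    common = links_a & links_b
    if not common:
        return 0.0
    big = max(len(links_a), len(links_b))
    small = min(len(links_a), len(links_b))
    # Normalized distance over in-link sets (a Normalized Google Distance
    # adaptation); small distance means high relatedness.
    dist = (math.log(big) - math.log(len(common))) / (
        math.log(total_articles) - math.log(small)
    )
    return max(0.0, 1.0 - dist)
```

Terms whose Wikipedia articles share many in-links score close to 1, while terms with disjoint in-link sets score 0, which is what makes the measure usable as an edge weight in the document's term graph.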
Revid 10,766
Theories The edges to be removed are identified using the graph-theoretic measure of betweenness, which assigns a number to each edge that is large if the edge lies "between" many pairs of nodes.
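Removing the edges with the highest betweenness until the graph falls apart is the core of the Girvan-Newman community detection algorithm that this description matches. A minimal, stdlib-only sketch on a toy term graph (the node names are illustrative, not from the paper) might look like:

```python
from collections import deque, defaultdict

def edge_betweenness(adj):
    """Edge betweenness for an unweighted, undirected graph (Brandes-style).

    adj maps node -> set of neighbours. Returns {frozenset({u, v}): score}.
    """
    bet = defaultdict(float)
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate shortest-path dependencies onto the edges.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                contrib = sigma[v] / sigma[w] * (1 + delta[w])
                bet[frozenset((v, w))] += contrib
                delta[v] += contrib
    # Each undirected edge is counted from both endpoints, so halve.
    return {e: b / 2 for e, b in bet.items()}

def components(adj):
    """Connected components of the graph, as a list of node sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, queue = {s}, deque([s])
        seen.add(s)
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    comp.add(w)
                    queue.append(w)
        comps.append(comp)
    return comps

def girvan_newman_split(adj):
    """Remove the highest-betweenness edge until the graph disconnects."""
    adj = {v: set(ns) for v, ns in adj.items()}
    while len(components(adj)) == 1:
        bet = edge_betweenness(adj)
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Two tightly knit "topic" communities joined by a single weak bridge.
graph = {
    "python": {"java", "code"}, "java": {"python", "code"},
    "code": {"python", "java", "login"},
    "login": {"code", "banner", "menu"},
    "banner": {"login", "menu"}, "menu": {"login", "banner"},
}
```

On this graph the bridge between "code" and "login" lies on every shortest path between the two triangles, so it has the highest betweenness and is removed first, separating the topical terms from the navigational ones, which is exactly the behaviour the abstract relies on.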
Theory type Design and action
Title Extracting key terms from noisy and multitheme documents
Unit of analysis Article
Url
Wikipedia coverage Main topic
Wikipedia data extraction Dump
Wikipedia language Not specified
Wikipedia page type Article
Year 2009
Creation date 15 March 2012 20:27:56
Categories Information extraction, Semantic relatedness, Computer science, Publications
Modification date 30 January 2014 20:27:31