Extracting key terms from noisy and multitheme documents
|Authors:||Maria Grineva, Maxim Grinev, Dmitry Lizorkin|
|Citation:||WWW '09: Proceedings of the 18th International Conference on World Wide Web, 2009.|
|Publication type:||Conference paper|
|Added by Wikilit team:||Added on initial load|
We present a novel method for key term extraction from text documents. In our method, a document is modeled as a graph of semantic relationships between the terms of that document. We exploit the following remarkable feature of the graph: terms related to the main topics of the document tend to bunch up into densely interconnected subgraphs, or communities, while unimportant terms fall into weakly interconnected communities or even become isolated vertices. We apply graph community detection techniques to partition the graph into thematically cohesive groups of terms. We introduce a criterion function to select the groups that contain key terms, discarding groups of unimportant terms. To weight terms and determine semantic relatedness between them, we exploit information extracted from Wikipedia. This approach gives us two advantages. First, it allows effective processing of multi-theme documents. Second, it is good at filtering out noise in the document, such as navigation bars or headers in web pages. Evaluations show that the method outperforms existing methods, producing key terms with higher precision and recall. Additional experiments on web pages show that our method is substantially more effective on noisy and multi-theme documents than existing methods.
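The pipeline described in the abstract — build a term graph from within-document relatedness, then let important terms cluster into dense groups while noise terms stay isolated — can be illustrated with a deliberately simplified sketch. This is not the authors' algorithm: it uses sentence co-occurrence instead of Wikipedia-based relatedness, a fixed edge-weight threshold, and connected components instead of true community detection. All names and thresholds below are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def term_graph(sentences, min_weight=2):
    """Build a term co-occurrence graph: terms are vertices; an edge links
    two terms that co-occur in at least `min_weight` sentences (a crude
    stand-in for Wikipedia-based semantic relatedness)."""
    weight = defaultdict(int)
    for terms in sentences:
        for a, b in combinations(sorted(set(terms)), 2):
            weight[(a, b)] += 1
    graph = defaultdict(set)
    for (a, b), w in weight.items():
        if w >= min_weight:  # weak edges are dropped, isolating noise terms
            graph[a].add(b)
            graph[b].add(a)
    return graph

def components(graph):
    """Connected components of the thresholded graph -- a simplified
    stand-in for the community detection step."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            group.add(v)
            stack.extend(graph[v] - seen)
        groups.append(group)
    return groups

sentences = [
    ["wikipedia", "semantic", "relatedness"],
    ["wikipedia", "semantic", "graph"],
    ["semantic", "graph", "community"],
    ["navigation", "bar"],  # noise: this pair co-occurs only once
]
groups = components(term_graph(sentences))
# the topical terms survive the threshold and form one dense group;
# the noise pair never clears it and is filtered out entirely
```

On real documents the edge weights would come from a Wikipedia-derived relatedness measure, and a proper community detection algorithm would split multi-theme graphs into one group per topic.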
"In this paper, we propose a new approach to key terms extraction that is different in two points. First, instead of using statistics gathered from a training set, which might be stale or domain-specific, we use semantic information derived from a universal knowledge base (namely, Wikipedia), which provides up-to-date and extensive coverage for a broad range of domains. Second, our method utilizes more information from within the processed document by identifying and analyzing semantic relatedness between terms in the document."
|Topics:||Information extraction, Semantic relatedness|
|Theory type:||Design and action|
|Wikipedia coverage:||Main topic|
|Theories:||"The edges removed being identified using the graph-theoretic measure of betweenness, which assigns a number to each edge that is large if the edge lies 'between' many pairs of nodes."|
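The quoted idea — cut the edges that lie "between" many pairs of nodes — can be sketched with an approximate edge-betweenness computation. This is a simplification for illustration only: it counts a single BFS shortest path per node pair rather than all shortest paths (as exact betweenness would), and all names below are illustrative.

```python
from collections import deque

def bfs_path(graph, src, dst):
    """One shortest path from src to dst in an unweighted graph, via BFS."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        if v == dst:
            path = []
            while v is not None:
                path.append(v)
                v = prev[v]
            return path[::-1]
        for w in graph[v]:
            if w not in prev:
                prev[w] = v
                queue.append(w)
    return None  # dst unreachable from src

def edge_betweenness(graph):
    """Approximate edge betweenness: for every node pair, count how often
    each edge lies on one BFS shortest path between them."""
    score = {}
    nodes = sorted(graph)
    for i, s in enumerate(nodes):
        for t in nodes[i + 1:]:
            path = bfs_path(graph, s, t)
            if path is None:
                continue
            for a, b in zip(path, path[1:]):
                edge = tuple(sorted((a, b)))
                score[edge] = score.get(edge, 0) + 1
    return score

# Two triangles joined by one bridge: every cross-triangle path must use
# the bridge, so it has the highest betweenness and is the first edge cut.
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
scores = edge_betweenness(graph)
bridge = max(scores, key=scores.get)
```

Repeatedly removing the highest-betweenness edge and recomputing is the essence of divisive community detection: the inter-community "bridges" are cut first, and the dense topical groups fall apart into separate components.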
|Data source:||Experiment responses, Archival records, Wikipedia pages|
|Collected data time dimension:||Cross-sectional|
|Unit of analysis:||Article|
|Wikipedia data extraction:||Dump|
|Wikipedia page type:||Article|
|Wikipedia language:||Not specified|
"One of the advantages of our method is that it does not require any training, as it works upon the Wikipedia-based knowledge base. The important and novel feature of our method is that it produces groups of key terms, while each group contains key terms related to one of the main topics of the document. Thus, our method implicitly identifies main document topics, and further categorization and clustering of this document can greatly benefit from that. From implementation viewpoint the novel feature of our method is that, for the first time, an algorithm for detecting community structure of a network is applied to analyze a semantic graph of terms extracted from a document. Our experimental results show that our method produces high-quality key terms comparable to the ones produced by state-of-the-art systems developed in the area. Evaluation proved that our method produces key terms with 67.7% recall and 46.1% precision, that we consider being significantly high. We also conducted experiments for multi-theme and noisy web pages with performance figures significantly higher than competitive methods. It allows us to conclude that a promising application of our method is to improve content-targeted advertising systems, which have to deal with such web pages."
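The quote above says the method produces groups of key terms and uses a criterion function to keep topical groups while discarding unimportant ones. The paper's actual criterion is not given in this record, so the scoring rule below is a hypothetical illustration: a group scores high when its terms are individually important and densely related to each other. All names, weights, and relatedness values are invented for the example.

```python
from itertools import combinations

def group_score(group, weight, relatedness):
    """Hypothetical criterion: average pairwise relatedness inside the
    group, scaled by the summed importance of its terms.
    `weight` maps term -> importance (e.g. a Wikipedia-derived weight);
    `relatedness` maps a frozenset pair of terms -> relatedness in [0, 1]."""
    terms = sorted(group)
    if len(terms) < 2:
        return 0.0
    density = sum(relatedness.get(frozenset(pair), 0.0)
                  for pair in combinations(terms, 2))
    pairs = len(terms) * (len(terms) - 1) / 2
    return (density / pairs) * sum(weight.get(t, 0.0) for t in terms)

weight = {"wikipedia": 0.9, "semantic": 0.8, "graph": 0.7,
          "footer": 0.1, "login": 0.1}
relatedness = {
    frozenset({"wikipedia", "semantic"}): 0.8,
    frozenset({"semantic", "graph"}): 0.7,
    frozenset({"wikipedia", "graph"}): 0.6,
    frozenset({"footer", "login"}): 0.1,
}
topic = group_score({"wikipedia", "semantic", "graph"}, weight, relatedness)
noise = group_score({"footer", "login"}, weight, relatedness)
# the dense, high-weight topical group outscores the noise group,
# so only the topical group would be kept as key terms
```

Under such a criterion each surviving group corresponds to one main topic of the document, which is what lets the method implicitly identify document topics.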
"[The proposed] method for extracting key terms from a text document...does not require any training, as it works upon the Wikipedia-based knowledge base." (p. 669)