Exploiting external/domain knowledge to enhance traditional text mining using graph-based methods
Abstract Finding the best way to utilize external/domain knowledge to enhance traditional text mining has been a challenging task. The difficulty centers on the lack of means in representing a document with external/domain knowledge integrated. Graphs are powerful and versatile tools, useful in various subfields of science and engineering for their simple illustration of complicated problems. However, the graph-based approach on knowledge representation and discovery remains relatively unexplored. In this thesis, I propose a graph-based text mining system to incorporate semantic knowledge, document section knowledge, document linkage knowledge, and document category knowledge into the tasks of text clustering and topic analysis. I design a novel term-level graph knowledge representation and a graph-based clustering algorithm to incorporate semantic and document section knowledge for biomedical literature clustering and topic analysis. I present a Markov Random Field (MRF) with a Relaxation Labeling (RL) algorithm to incorporate document linkage knowledge. I evaluate different types of linkage among documents, including explicit linkage such as hyperlink and citation link, implicit linkage such as coauthor link and co-citation link, and pseudo linkage such as similarity link. I develop a novel semantic-based method to integrate Wikipedia concepts and categories as external knowledge into traditional document clustering. In order to support these new approaches, I develop two automated algorithms to extract multiword phrases and ontological concepts, respectively. The evaluations of news collection, web dataset, and biomedical literature prove the effectiveness of the proposed methods. In the experiment of document clustering, the proposed term-level graph-based method not only outperforms the baseline k-means algorithm in all configurations but also is superior in terms of efficiency. The MRF-based algorithm significantly improves spherical k-means and model-based k-means clustering on the datasets containing explicit or implicit linkage; the Wikipedia knowledge-based clustering also improves the document-content-only-based clustering. On the task of topic analysis, the proposed graph presentation, subgraph detection, and graph ranking algorithm can effectively identify corpus-level topic terms and cluster-level topic terms.
Added by wikilit team Added on initial load  +
Collected data time dimension Cross-sectional  +
Conclusion In this chapter, we present a general framework for leveraging Wikipedia concept and category information to improve text clustering performance. Based on two different mapping techniques, exact-match and relatedness-match, we are able to create a Wikipedia concept vector and a Wikipedia category vector for each document in a collection. The concept vector and category vector provide background knowledge about a document. They are linearly combined with a text word vector to measure document similarity. The proposed framework is tested with two clustering approaches (agglomerative and partitional clustering) on three datasets: 20 Newsgroups (20NG), LATimes, and TDT2. In order to comprehensively evaluate the effect of Wikipedia concept and category information on clustering performance, we experiment with seven different clustering schemes: Word, Concept, Category, WordConcept, WordCategory, ConceptCategory, and WordConceptCategory. Based on the empirical results, we can draw the following conclusions: (1) Category information is most useful for improving clustering results. In both agglomerative clustering and partitional clustering, combining category information with document content information generates the best results in most cases. Compared to the baseline scheme, it significantly improves clustering performance for all three datasets when using the agglomerative clustering approach, and for the 20 Newsgroups dataset when using partitional clustering. (2) Clustering based on all three document vectors (word vector, concept vector, and category vector) also gets significantly better results than the baseline. However, it does not outperform clustering based only on the word vector and category vector. (3) Concept information is not as useful as category information for improving clustering performance, due to the noise it contains and the sense ambiguity problem.
(4) The effect of category and concept information on k-means clustering is not as significant as it is on agglomerative clustering. But, in most cases, WordCategory-based clustering still achieves the best performance among all clustering schemes. (5) The effect of the two mapping schemes depends on the dataset, quality metric, and clustering approach. Based on the results of partitional clustering, exact-match is more effective than relatedness-match for the LATimes and TDT2 datasets, while the opposite holds for 20 Newsgroups. We believe that our findings can be extended to other applications based on document similarity measurement, such as information retrieval and text classification. For future work, we will further improve our concept mapping techniques, such as introducing sense disambiguation functions into the concept mapping process. Moreover, we will explore how to utilize the link structure among Wikipedia concepts for document clustering.
Question 1: Where and how can we get domain knowledge through a graph representation, and how can we effectively utilize this knowledge? In this chapter, we attained external Wikipedia knowledge through two semantic mapping schemes: exact-match and relatedness-match. We developed a metric to integrate the attained Wikipedia concept and category information into the text clustering process. Question 3: How can we utilize Wikipedia to improve traditional text clustering? We presented a framework to extract knowledge from Wikipedia (exact-match and relatedness-match) and to enrich the representation of original documents (a similarity metric to combine documents' content and Wikipedia knowledge). Question 4: Do the proposed graph-based algorithms improve traditional text clustering? From the experimental results in section 5.3, the hierarchical clustering using Wikipedia concept and category information (WordCategory and WordConceptCategory) significantly outperforms the one without Wikipedia knowledge at the p < 0.05 level. For partitional clustering, we also observe a performance increase.
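The linear combination of word, concept, and category vectors mentioned in the conclusion can be sketched roughly as below. This is only an illustration under stated assumptions, not the thesis's implementation: the use of cosine similarity, the weight names `alpha` and `beta`, their default values, the dict-based sparse vectors, and the toy documents are all hypothetical choices made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {feature: weight} dicts."""
    dot = sum(weight * v.get(feature, 0.0) for feature, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as sharing nothing
    return dot / (norm_u * norm_v)


def combined_similarity(doc_a, doc_b, alpha=0.5, beta=0.5):
    """Word similarity plus weighted concept and category similarities.

    alpha and beta are assumed tuning weights (hypothetical names).
    Setting alpha=0 and beta=0 gives a word-only baseline; alpha=0 with
    beta>0 corresponds to a word-plus-category scheme.
    """
    return (
        cosine(doc_a["word"], doc_b["word"])
        + alpha * cosine(doc_a["concept"], doc_b["concept"])
        + beta * cosine(doc_a["category"], doc_b["category"])
    )


# Toy documents (invented for illustration): different Wikipedia concepts,
# but the same Wikipedia category.
doc_a = {
    "word": {"graph": 1.0, "cluster": 1.0},
    "concept": {"Graph theory": 1.0},
    "category": {"Mathematics": 1.0},
}
doc_b = {
    "word": {"graph": 1.0, "mining": 1.0},
    "concept": {"Data mining": 1.0},
    "category": {"Mathematics": 1.0},
}

word_only = combined_similarity(doc_a, doc_b, alpha=0.0, beta=0.0)
word_category = combined_similarity(doc_a, doc_b, alpha=0.0, beta=0.5)
```

In this toy case the shared category raises the similarity of two documents whose words overlap only partially, which is the kind of effect the conclusion attributes to category information. A similarity of this form could be passed to either an agglomerative or a partitional clustering routine.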
Conference location United States, Pennsylvania +
Data source Archival records  + , Experiment responses  + , Wikipedia pages  +
Google scholar url http://scholar.google.com/scholar?ie=UTF-8&q=%22Exploiting%2Bexternal%2Fdomain%2Bknowledge%2Bto%2Benhance%2Btraditional%2Btext%2Bmining%2Busing%2Bgraph-based%2Bmethods%22  +
Has author Xiaodan Zhang +
Has domain Computer science +
Has topic Data mining +
Peer reviewed Yes  +
Publication type Thesis  +
Published in Drexel University +
Research design Experiment  +
Research questions In this thesis, I propose a graph-based text mining system to incorporate semantic knowledge, document section knowledge, document linkage knowledge, and document category knowledge into the tasks of text clustering and topic analysis. I design a novel term-level graph knowledge representation and a graph-based clustering algorithm to incorporate semantic and document section knowledge for biomedical literature clustering and topic analysis. I present a Markov Random Field (MRF) with a Relaxation Labeling (RL) algorithm to incorporate document linkage knowledge. I evaluate different types of linkage among documents, including explicit linkage such as hyperlink and citation link, implicit linkage such as coauthor link and co-citation link, and pseudo linkage such as similarity link. I develop a novel semantic-based method to integrate Wikipedia concepts and categories as external knowledge into traditional document clustering. In order to support these new approaches, I develop two automated algorithms to extract multiword phrases and ontological concepts, respectively. The evaluations of news collection, web dataset, and biomedical literature prove the effectiveness of the proposed methods.
Revid 10,760  +
Theories graph theory
Theory type Design and action  +
Title Exploiting external/domain knowledge to enhance traditional text mining using graph-based methods
Unit of analysis Article  +
Url http://proquest.umi.com/pqdweb?did=1818331311&Fmt=7&clientId=10306&RQT=309&VName=PQD  +
Wikipedia coverage Sample data  +
Wikipedia data extraction Live Wikipedia  +
Wikipedia language English  +
Wikipedia page type Article  +
Year 2009  +
Creation date 15 March 2012 20:26:34  +
Categories Data mining  + , Computer science  + , Publications with missing comments  + , Publications  +
Modification date 30 January 2014 20:26:12  +