Exploiting external/domain knowledge to enhance traditional text mining using graph-based methods

Authors: Xiaodan Zhang
Citation: Drexel University. 2009. United States, Pennsylvania.
Publication type: Thesis
Peer-reviewed: Yes
Added by Wikilit team: Added on initial load
Exploiting external/domain knowledge to enhance traditional text mining using graph-based methods is a publication by Xiaodan Zhang.


Abstract

Finding the best way to utilize external/domain knowledge to enhance traditional text mining has been a challenging task. The difficulty centers on the lack of means in representing a document with external/domain knowledge integrated. Graphs are powerful and versatile tools, useful in various subfields of science and engineering for their simple illustration of complicated problems. However, the graph-based approach on knowledge representation and discovery remains relatively unexplored. In this thesis, I propose a graph-based text mining system to incorporate semantic knowledge, document section knowledge, document linkage knowledge, and document category knowledge into the tasks of text clustering and topic analysis. I design a novel term-level graph knowledge representation and a graph-based clustering algorithm to incorporate semantic and document section knowledge for biomedical literature clustering and topic analysis. I present a Markov Random Field (MRF) with a Relaxation Labeling (RL) algorithm to incorporate document linkage knowledge. I evaluate different types of linkage among documents, including explicit linkage such as hyperlink and citation link, implicit linkage such as coauthor link and co-citation link, and pseudo linkage such as similarity link. I develop a novel semantic-based method to integrate Wikipedia concepts and categories as external knowledge into traditional document clustering. In order to support these new approaches, I develop two automated algorithms to extract multiword phrases and ontological concepts, respectively. The evaluations of news collection, web dataset, and biomedical literature prove the effectiveness of the proposed methods. In the experiment of document clustering, the proposed term-level graph-based method not only outperforms the baseline k-means algorithm in all configurations but also is superior in terms of efficiency. The MRF-based algorithm significantly improves spherical k-means and model-based k-means clustering on the datasets containing explicit or implicit linkage; the Wikipedia knowledge-based clustering also improves the document-content-only-based clustering. On the task of topic analysis, the proposed graph presentation, sub graph detection, and graph ranking algorithm can effectively identify corpus-level topic terms and cluster-level topic terms.
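
The abstract mentions a term-level graph and a graph ranking step for finding topic terms, but this record does not say how the graph is built or ranked. The following is only a minimal sketch of the general idea under stated assumptions: edges are document-level term co-occurrence counts, and ranking is a weighted PageRank computed by power iteration. The function names `build_term_graph` and `rank_terms` and the parameters `damping` and `iterations` are hypothetical and are not taken from the thesis.

```python
from collections import defaultdict
from itertools import combinations


def build_term_graph(documents):
    """Build an undirected co-occurrence graph (an assumed construction).

    Terms are nodes; an edge weight counts the documents in which
    the two terms appear together.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for tokens in documents:
        for a, b in combinations(sorted(set(tokens)), 2):
            graph[a][b] += 1.0
            graph[b][a] += 1.0
    return graph


def rank_terms(graph, damping=0.85, iterations=50):
    """Rank terms with weighted PageRank via power iteration.

    Returns (term, score) pairs sorted from highest to lowest score.
    """
    nodes = list(graph)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_score = {}
        for n in nodes:
            # Each neighbour m passes on its score in proportion to
            # the weight of the edge (m, n) among all of m's edges.
            incoming = sum(
                score[m] * graph[m][n] / sum(graph[m].values())
                for m in graph[n]
            )
            new_score[n] = (1.0 - damping) / len(nodes) + damping * incoming
        score = new_score
    return sorted(score.items(), key=lambda item: item[1], reverse=True)
```

Calling `rank_terms(build_term_graph(docs))` on a list of tokenized documents returns candidate topic terms in ranked order. Terms that never co-occur with another term have no edges and are left out of the ranking.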

Research questions

"In this thesis, I propose a graph-based text mining system to incorporate semantic knowledge, document section knowledge, document linkage knowledge, and document category knowledge into the tasks of text clustering and topic analysis. I design a novel term-level graph knowledge representation and a graph-based clustering algorithm to incorporate semantic and document section knowledge for biomedical literature clustering and topic analysis. I present a Markov Random Field (MRF) with a Relaxation Labeling (RL) algorithm to incorporate document linkage knowledge. I evaluate different types of linkage among documents, including explicit linkage such as hyperlink and citation link, implicit linkage such as coauthor link and co-citation link, and pseudo linkage such as similarity link. I develop a novel semantic-based method to integrate Wikipedia concepts and categories as external knowledge into traditional document clustering. In order to support these new approaches, I develop two automated algorithms to extract multiword phrases and ontological concepts, respectively. The evaluations of news collection, web dataset, and biomedical literature prove the effectiveness of the proposed methods."

Research details

Topics: Data mining
Domains: Computer science
Theory type: Design and action
Wikipedia coverage: Sample data
Theories: "graph theory"
Research design: Experiment
Data source: Archival records, Experiment responses, Wikipedia pages
Collected data time dimension: Cross-sectional
Unit of analysis: Article
Wikipedia data extraction: Live Wikipedia
Wikipedia page type: Article
Wikipedia language: English

Conclusion

"In this chapter, we present a general framework for leveraging Wikipedia concept and category information to improve text clustering performance. Based on two different mapping techniques, exact-match and relatedness-match, we are able to create a Wikipedia concept vector and a Wikipedia category vector for each document in a collection. The concept vector and category vector provide background knowledge about a document. They are linearly combined with a text word vector to measure document similarity. The proposed framework is tested with two clustering approaches (agglomerative and partitional clustering) on three datasets: 20NG, LATimes, and TDT2. In order to comprehensively evaluate the effect of Wikipedia concept and category information on clustering performance, we experiment with seven different clustering schemes—Concept, Category, WordConcept, WordCategory, ConceptCategory, and WordConceptCategory. Based on the empirical results, we can draw the following conclusions: (1) Category information is most useful for improving clustering results. In both agglomerative clustering and partitional clustering, combining category information with document content information generates the best results in most cases. Compared to the baseline scheme, it can significantly improve clustering performance for all three datasets when using agglomerative clustering approach, and for dataset 20 Newsgroup when using partitional clustering. (2) Clustering based on all three document vectors (word vector, concept vector, and category vector) also gets significantly better results than the baseline. However, it does not outperform clustering based only on the word vector and category vector. (3) Concept information is not as useful as category information for improving clustering performance, due to the noise information it contains and the sense ambiguity problem. 
(4) The effect of category and concept information on k-means clustering is not as significant as it is on agglomerative clustering. But, in most cases, WordCategory-based clustering still achieves the best performance among all clustering schemes. (5) The effect of the two mapping schemes depends on the dataset, quality metric, and clustering approach. Based on the results of partitional clustering, exact-match is more effective than relatedness-match for dataset LATimes and TDT2 but the contrary for 20 Newsgroups. We believe that our findings can be extended to other applications based on document similarity measurement, such as information retrieval and text classification. For future work, we will further improve our concept mapping techniques, such as introducing sense disambiguation functions into the concept mapping process. Moreover, we will explore how to utilize the link structure among Wikipedia concepts for document clustering. Question 1: Where and how can we get domain knowledge through a graph representation, and how can we effectively utilize this knowledge? In this chapter, we attained external Wikipedia knowledge through two semantic mapping schemes: exact-match and relatedness-match. We developed a metric to integrate the attained Wikipedia concept and category information into the text clustering process. Question 3: How can we utilize Wikipedia to improve traditional text clustering? We presented a framework to extract knowledge from Wikipedia (exact-match and relatedness-match) and to enrich the representation of original documents (a similarity metric to combine documents' content and Wikipedia knowledge). Question 4: Do the proposed graph-based algorithms improve traditional text clustering? From the experimental results in section 5.3, the hierarchical clustering using Wikipedia concept and category information (wordcategory and word_concet_category) significantly outperforms the one without Wikipedia knowledge at P<0.05 level. 
For partitional clustering, we also observe a performance increase."
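
The conclusion states that the concept and category vectors are "linearly combined with a text word vector to measure document similarity", without giving the formula in this record. One plausible reading, sketched below as an assumption, is a weighted sum of per-vector cosine similarities. The weights `alpha` and `beta`, the dict-based sparse vectors, and the function names are all hypothetical and do not come from the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector shares nothing with any other vector
    return dot / (norm_u * norm_v)


def combined_similarity(doc_a, doc_b, alpha=0.5, beta=0.5):
    """Word similarity plus weighted concept and category similarities.

    alpha and beta are illustrative weights (assumptions). Setting
    alpha=0 gives a word-plus-category scheme, and alpha=beta=0 gives
    the word-only baseline.
    """
    return (
        cosine(doc_a["word"], doc_b["word"])
        + alpha * cosine(doc_a["concept"], doc_b["concept"])
        + beta * cosine(doc_a["category"], doc_b["category"])
    )
```

With this form, two documents that share no words can still score above zero when they map to the same Wikipedia category, which is the kind of background knowledge the conclusion credits for the improved clustering.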

Comments


Further notes

Facts about "Exploiting external/domain knowledge to enhance traditional text mining using graph-based methods"
Added by wikilit team: Added on initial load
Collected data time dimension: Cross-sectional
Conference location: United States, Pennsylvania
Data source: Archival records, Experiment responses, Wikipedia pages
Google scholar url: http://scholar.google.com/scholar?ie=UTF-8&q=%22Exploiting%2Bexternal%2Fdomain%2Bknowledge%2Bto%2Benhance%2Btraditional%2Btext%2Bmining%2Busing%2Bgraph-based%2Bmethods%22
Has author: Xiaodan Zhang
Has domain: Computer science
Has topic: Data mining
Peer reviewed: Yes
Publication type: Thesis
Published in: Drexel University
Research design: Experiment
Revid: 10,760
Theories: graph theory
Theory type: Design and action
Title: Exploiting external/domain knowledge to enhance traditional text mining using graph-based methods
Unit of analysis: Article
Url: http://proquest.umi.com/pqdweb?did=1818331311&Fmt=7&clientId=10306&RQT=309&VName=PQD
Wikipedia coverage: Sample data
Wikipedia data extraction: Live Wikipedia
Wikipedia language: English
Wikipedia page type: Article
Year: 2009