Keywords in the mist: automated keyword extraction for very large documents and back of the book indexing

From WikiLit
Revision as of 17:48, January 29, 2014 by Mehdi (changed Collected data time dimension and Wikipedia data extraction)

Publication
Keywords in the mist: automated keyword extraction for very large documents and back of the book indexing
Authors: Andras Csomai
Citation: University of North Texas. 2008. United States, Texas.
Publication type: Thesis
Peer-reviewed: Yes
Database(s):
DOI:
Link(s): http://proquest.umi.com/pqdweb?did=1597616811&Fmt=7&clientId=10306&RQT=309&VName=PQD
Added by Wikilit team: Added on initial load
Keywords in the mist: automated keyword extraction for very large documents and back of the book indexing is a publication by Andras Csomai.


Abstract

This research addresses the problem of automatic keyphrase extraction from large documents and back of the book indexing. The potential benefits of automating this process are far-reaching, from improving information retrieval in digital libraries to saving countless man-hours by helping professional indexers create back of the book indexes. The dissertation introduces a new methodology to evaluate automated systems, which allows for a detailed, comparative analysis of several techniques for keyphrase extraction. We introduce and evaluate both supervised and unsupervised techniques, designed to balance the resource requirements of an automated system and the best achievable performance. Additionally, a number of novel features are proposed, including a statistical informativeness measure based on chi statistics; an encyclopedic feature that taps into the vast knowledge base of Wikipedia to establish the likelihood of a phrase referring to an informative concept; and a linguistic feature based on sophisticated semantic analysis of the text using current theories of discourse comprehension. The resulting keyphrase extraction system is shown to outperform the current state of the art in supervised keyphrase extraction by a large margin. Moreover, a fully automated back of the book indexing system based on the keyphrase extraction system is shown to produce back of the book indexes closely resembling those created by human experts.
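As a rough illustration of what a chi-square-style informativeness measure can look like, the sketch below scores a phrase by comparing its frequency in one document against its frequency in a background corpus, using the standard 2x2 Pearson chi-square statistic. This is an assumption made for illustration: the abstract does not give the thesis's actual formulation, and the function name `chi_square_informativeness` and all counts are hypothetical.

```python
def chi_square_informativeness(phrase_in_doc, doc_size, phrase_in_bg, bg_size):
    """Pearson chi-square for a 2x2 table (phrase vs. other tokens,
    document vs. background corpus). Hypothetical helper, not the
    thesis's own measure."""
    a = phrase_in_doc                # phrase occurrences in the document
    b = doc_size - phrase_in_doc     # all other tokens in the document
    c = phrase_in_bg                 # phrase occurrences in the background
    d = bg_size - phrase_in_bg       # all other tokens in the background
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # Degenerate table (an empty row or column): no evidence either way.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


if __name__ == "__main__":
    # Made-up counts: a phrase frequent in the document but rare in the
    # background scores higher than one that is equally common in both.
    print(chi_square_informativeness(50, 100, 0, 100))
    print(chi_square_informativeness(10, 100, 10, 100))
```

A higher score means the phrase's frequency in the document departs more strongly from its background frequency, which is one common way to approximate how informative a phrase is.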

Research questions

"What are the challenges associated with keyphrase extraction on very long documents? What is the role played by different information theoretic and linguistic features? What is the role played by supervision in the quality of keyphrase extraction and back of the book indexing? Can techniques for keyphrase extraction be used to develop a fully automated back of the book indexing system?"

Research details

Topics: Information extraction
Domains: Computer science
Theory type: Design and action
Wikipedia coverage: Sample data
Theories: "In Chapter 5, I present the construction-integration theory of human comprehension…"
Research design: Statistical analysis
Data source:
Collected data time dimension: Cross-sectional
Unit of analysis: N/A
Wikipedia data extraction: Dump
Wikipedia page type: Article
Wikipedia language: Not specified

Conclusion

Comments


Further notes

Facts about "Keywords in the mist: automated keyword extraction for very large documents and back of the book indexing"
Added by wikilit team: Added on initial load
Collected data time dimension: Cross-sectional
Conference location: United States, Texas
Google scholar url: http://scholar.google.com/scholar?ie=UTF-8&q=%22Keywords%2Bin%2Bthe%2Bmist%3A%2Bautomated%2Bkeyword%2Bextraction%2Bfor%2Bvery%2Blarge%2Bdocuments%2Band%2Bback%2Bof%2Bthe%2Bbook%2Bindexing%22
Has author: Andras Csomai
Has domain: Computer science
Has topic: Information extraction
Peer reviewed: Yes
Publication type: Thesis
Published in: University of North Texas
Research design: Statistical analysis
Revid: 10,614
Theories: In Chapter 5, I present the construction-integration theory of human comprehension…
Theory type: Design and action
Title: Keywords in the mist: automated keyword extraction for very large documents and back of the book indexing
Unit of analysis: N/A
Url: http://proquest.umi.com/pqdweb?did=1597616811&Fmt=7&clientId=10306&RQT=309&VName=PQD
Wikipedia coverage: Sample data
Wikipedia data extraction: Dump
Wikipedia language: Not specified
Wikipedia page type: Article
Year: 2008