Synonym set extraction from the biomedical literature by lexical pattern discovery

From WikiLit
Revision as of 04:53, May 11, 2012 by Richlw (Talk | contribs) (Text replace - "N. Collier" to "Nigel Collier")

Publication
Synonym set extraction from the biomedical literature by lexical pattern discovery
Authors: John McCrae, Nigel Collier
Citation: BMC bioinformatics 9 (1): 159. 2008.
Publication type: Journal article
Peer-reviewed: Yes
Database(s):
Google Scholar cites: Not available
Link(s): http://www.biomedcentral.com/1471-2105/9/159
Added by Wikilit team: Added on initial load
Synonym set extraction from the biomedical literature by lexical pattern discovery is a publication by John McCrae and Nigel Collier.


Abstract

Background

Although there are a large number of thesauri for the biomedical domain, many of them lack coverage of terms and their variant forms. Automatic thesaurus construction based on patterns was first suggested by Hearst [1], but it is still not clear how to automatically construct such patterns for different semantic relations and domains. In particular, it is not certain which patterns are useful for capturing synonymy. The assumption of extant resources such as parsers is also a limiting factor for many languages, so it is desirable to find patterns that do not use syntactic analysis. Finally, to give a more consistent and applicable result, it is desirable to use these patterns to form synonym sets in a sound way.

Results

We present a method that automatically generates regular expression patterns by expanding seed patterns in a heuristic search and then develops a feature vector based on the occurrence of term pairs in each developed pattern. This allows for a binary classification of term pairs as synonymous or non-synonymous. We then model this result as a probability graph to find synonym sets, which is equivalent to the well-studied problem of finding an optimal set cover. We achieved 73.2% precision and 29.7% recall with our method, outperforming hand-made resources such as MeSH and Wikipedia.

Conclusion

We conclude that automatic methods can play a practical role in developing new thesauri or expanding existing ones, and that this can be done with only a small amount of training data and no need for resources such as parsers. We also conclude that accuracy can be improved by grouping terms into synonym sets.
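
As a rough illustration of the grouping step mentioned in the abstract, the sketch below turns pairwise synonymy probabilities into synonym sets. It is not the paper's method: the abstract describes a probability graph solved as a set cover problem, whereas this example simply thresholds the probabilities and takes connected components. The function name group_synonyms, the threshold of 0.5, and the sample probabilities are all assumptions made for illustration.

```python
def group_synonyms(pair_probs, threshold=0.5):
    """Group terms into sets using connected components of a thresholded graph.

    pair_probs maps (term_a, term_b) tuples to an assumed probability that
    the two terms are synonymous. This is a simplified stand-in, not the
    set-cover formulation described in the abstract.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), prob in pair_probs.items():
        root_a, root_b = find(a), find(b)
        # Merge the two groups only when the pair is likely synonymous.
        if prob >= threshold and root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for term in list(parent):
        groups.setdefault(find(term), set()).add(term)
    # Sort for a deterministic result.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Made-up probabilities, for illustration only.
example_probs = {
    ("influenza", "flu"): 0.9,
    ("flu", "grippe"): 0.8,
    ("influenza", "fever"): 0.1,
    ("fever", "pyrexia"): 0.7,
}
synsets = group_synonyms(example_probs)
```

With these sample values, "influenza", "flu" and "grippe" end up in one set and "fever" and "pyrexia" in another. Connected components merge terms transitively, so a single spurious high-probability pair can join two unrelated sets; a formulation that weighs all pairs together, as the abstract suggests, is one way to limit that.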

Research questions

"We present a method that automatically generates regular expression patterns by expanding seed patterns in a heuristic search and then develops a feature vector based on the occurrence of term pairs in each developed pattern. This allows for a binary classifications of term pairs as synonymous or non-synonymous."
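
The idea in the quotation above, a feature vector built from how often a term pair occurs in each pattern, can be sketched as follows. The patterns here are hand-written and hypothetical; the paper discovers its patterns automatically by expanding seed patterns, and that search is not reproduced. The linear decision rule in classify, its weights, and its threshold are likewise assumptions standing in for a trained classifier.

```python
import re

# Hypothetical patterns for illustration only; they are not taken from the paper.
PATTERN_TEMPLATES = [
    r"{a} \(also known as {b}\)",
    r"{a}, also called {b}",
    r"{a} \({b}\)",
    r"{a} or {b}",
]


def feature_vector(term_a, term_b, sentences, templates=PATTERN_TEMPLATES):
    """Return one count per pattern: the number of sentences it matches
    once the term pair has been substituted into the template."""
    vector = []
    for template in templates:
        regex = re.compile(
            template.format(a=re.escape(term_a), b=re.escape(term_b)),
            re.IGNORECASE,
        )
        vector.append(sum(1 for s in sentences if regex.search(s)))
    return vector


def classify(vector, weights, threshold=1.0):
    """Toy linear rule: True means the pair is labelled synonymous."""
    score = sum(w * x for w, x in zip(weights, vector))
    return score >= threshold


# Made-up sentences and weights, for illustration only.
sentences = [
    "Influenza (also known as flu) is a viral infection.",
    "Patients with influenza or flu were included.",
    "Influenza, also called flu, spreads in winter.",
]
weights = [1.0, 1.0, 0.5, 0.2]

flu_vector = feature_vector("influenza", "flu", sentences)
fever_vector = feature_vector("influenza", "fever", sentences)
```

For the sample sentences, flu_vector is [1, 1, 0, 1] and fever_vector is all zeros, so classify labels only the first pair as synonymous. Each pattern on its own is weak evidence (the "or" pattern also matches non-synonyms), which is why the counts are combined rather than used individually.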

Research details

Topics: Ontology building
Domains: Health, Information science
Theory type: Design and action
Wikipedia coverage: Case
Theories: "Undetermined"
Research design: Mathematical modeling
Data source:
Collected data time dimension: Cross-sectional
Unit of analysis: Website
Wikipedia data extraction: Live Wikipedia
Wikipedia page type: Multiple
Wikipedia language: Not specified

Conclusion

"We conclude that for domains with a large amount of specific vocabulary most of the resources we studied perform worse than the automatic method we have developed here. Also given the amount of effort required to manually construct a resource, automatic thesaurus construction may prove more useful in many situations, either to aid construction or in replacement of manual construction. More importantly we have shown that we can easily automatically find patterns and we do not require any prior knowledge of the language's grammar in order to do this. Even though the patterns we generated were weak by themselves we showed that by statistically combining them we can get a much stronger result. We have also shown that we do not need to know a large number of synsets to develop an accurate classifier; this implies most importantly that this method can be used quickly on a different language. We tested our method on only a limited domain but we feel it would likely generalize well to other domains. Our novel synset grouping method not only converted the result to something more applicable, but also improved on the results for both a strict definition of synonymy, and a more relaxed definition."

Comments

"We conclude that for domains with a large amount of specific vocabulary most of the resources we studied perform worse than the automatic method we have developed here."


Further notes

Facts about "Synonym set extraction from the biomedical literature by lexical pattern discovery"
Added by wikilit team: Added on initial load
Collected data time dimension: Cross-sectional
Google scholar url: http://scholar.google.com/scholar?ie=UTF-8&q=%22Synonym%2Bset%2Bextraction%2Bfrom%2Bthe%2Bbiomedical%2Bliterature%2Bby%2Blexical%2Bpattern%2Bdiscovery%22
Has author: John McCrae, Nigel Collier
Has domain: Health, Information science
Has topic: Ontology building
Issue: 1
Pages: 159
Peer reviewed: Yes
Publication type: Journal article
Published in: BMC bioinformatics
Research design: Mathematical modeling
Revid: 5,267
Theories: Undetermined
Theory type: Design and action
Title: Synonym set extraction from the biomedical literature by lexical pattern discovery
Unit of analysis: Website
Url: http://www.biomedcentral.com/1471-2105/9/159
Volume: 9
Wikipedia coverage: Case
Wikipedia data extraction: Live Wikipedia
Wikipedia language: Not specified
Wikipedia page type: Multiple
Year: 2008