Accuracy estimate and optimization techniques for SimRank computation
Publication (help) | |
---|---|
Accuracy estimate and optimization techniques for SimRank computation | |
Authors: | Dmitry Lizorkin, Pavel Velikhov, Maxim Grinev, Denis Turdakov [edit item] |
Citation: | VLDB Journal 19 (1): 45-66. 2010. |
Publication type: | Journal article |
Peer-reviewed: | Yes |
Database(s): | |
DOI: | 10.1007/s00778-009-0168-8. |
Google Scholar cites: | Citations |
Link(s): | Paper link |
Added by Wikilit team: | Added on initial load |
Search | |
Article: | Google Scholar BASE PubMed |
Other scholarly wikis: | AcaWiki Brede Wiki WikiPapers |
Web search: | Bing Google Yahoo! — Google PDF |
Other: | |
Services | |
Format: | BibTeX |
Contents
[edit] Abstract
The measure of similarity between objects is a very useful tool in many areas of computer science, including information retrieval. SimRank is a simple and intuitive measure of this kind, based on a graph-theoretic model. SimRank is typically computed iteratively, in the spirit of PageRank. However, existing work on SimRank lacks accuracy estimation of iterative computation and has discouraging time complexity. In this paper, we present a technique to estimate the accuracy of computing SimRank iteratively. This technique provides a way to find out the number of iterations required to achieve a desired accuracy when computing SimRank. We also present optimization techniques that improve the computational complexity of the iterative algorithm from O(n4) in the worst case to min(O(nl), O(n3/ log2n)), with n denoting the number of objects, and l denoting the number object-to-object relationships. We also introduce a threshold sieving heuristic and its accuracy estimation that further improves the efficiency of the method. As a practical illustration of our techniques, we computed SimRank scores on a subset of English Wikipedia corpus, consisting of the complete set of articles and category links.
[edit] Research questions
"The measure of similarity between objects is a very useful tool in many areas of computer science, including informa- tion retrieval. SimRank is a simple and intuitive measure of this kind, based on graph-theoretic model.
In this paper we present a technique to estimate the ac- curacy of computing SimRank iteratively. This technique provides a way to find out the number of iterations required to achieve a desired accuracy when computing SimRank. We also present optimization techniques that improve the com- putational complexity of the iterative algorithm from O(n4) to O(n3) in the worst case. We also introduce a threshold sieving heuristic and its accuracy estimation that further improves the eciency of the method."
Research details
Topics: | Ranking and clustering systems [edit item] |
Domains: | Computer science [edit item] |
Theory type: | Design and action [edit item] |
Wikipedia coverage: | Sample data [edit item] |
Theories: | "Undetermined" [edit item] |
Research design: | Mathematical modeling [edit item] |
Data source: | Wikipedia pages [edit item] |
Collected data time dimension: | Cross-sectional [edit item] |
Unit of analysis: | Article [edit item] |
Wikipedia data extraction: | Live Wikipedia [edit item] |
Wikipedia page type: | Article [edit item] |
Wikipedia language: | English [edit item] |
[edit] Conclusion
"A precise accuracy estimate for SimRank iterative com- putation is established. The estimate reveals that SimRank computation parameters suggested in the original SimRank proposal implied a relatively low accuracy, and the choice for different parameter values is suggested. The accuracy estimate allows a-priori finding out the correct number of iterations required for achieving a desired accuracy. The number of iterations turns out to be independent of input graph characteristics, the fact to benefit scalability.
Experimental results show a 50 times speedup achieved by the optimization techniques for a graph with 10K nodes, and relative improvement in computation time further increases for larger graphs. The experience in computing SimRank scores over the English Wikipedia corpus exhibits practical viability of the approach for relatively large data corpora. We believe that the results presented in the paper would facilitate a wider application of SimRank to computer science techniques, as this similarity measure definitely deserves."
[edit] Comments
Further notes[edit]
Abstract | The measure of similarity between objects … The measure of similarity between objects is a very useful tool in many areas of computer science, including information retrieval. SimRank is a simple and intuitive measure of this kind, based on a graph-theoretic model. SimRank is typically computed iteratively, in the spirit of PageRank. However, existing work on SimRank lacks accuracy estimation of iterative computation and has discouraging time complexity. In this paper, we present a technique to estimate the accuracy of computing SimRank iteratively. This technique provides a way to find out the number of iterations required to achieve a desired accuracy when computing SimRank. We also present optimization techniques that improve the computational complexity of the iterative algorithm from O(n4) in the worst case to min(O(nl), O(n3/ log2n)), with n denoting the number of objects, and l denoting the number object-to-object relationships. We also introduce a threshold sieving heuristic and its accuracy estimation that further improves the efficiency of the method. As a practical illustration of our techniques, we computed SimRank scores on a subset of English Wikipedia corpus, consisting of the complete set of articles and category links.mplete set of articles and category links. |
Added by wikilit team | Added on initial load + |
Collected data time dimension | Cross-sectional + |
Conclusion | A precise accuracy estimate for SimRank it … A precise accuracy estimate for SimRank iterative com-
putation is established. The estimate reveals that SimRank computation parameters suggested in the original SimRank proposal implied a relatively low accuracy, and the choice for different parameter values is suggested. The accuracy estimate allows a-priori finding out the correct number of iterations required for achieving a desired accuracy. The number of iterations turns out to be independent of input graph characteristics, the fact to benefit scalability. Experimental results show a 50 times speedup achieved by the optimization techniques for a graph with 10K nodes, and relative improvement in computation time further increases for larger graphs. The experience in computing SimRank scores over the English Wikipedia corpus exhibits practical viability of the approach for relatively large data corpora. We believe that the results presented in the paper would facilitate a wider application of SimRank to computer science techniques, as this similarity measure definitely deserves.is similarity measure definitely deserves. |
Data source | Wikipedia pages + |
Doi | 10.1007/s00778-009-0168-8 + |
Google scholar url | http://scholar.google.com/scholar?ie=UTF-8&q=%22Accuracy%2Bestimate%2Band%2Boptimization%2Btechniques%2Bfor%2BSimRank%2Bcomputation%22 + |
Has author | Dmitry Lizorkin +, Pavel Velikhov +, Maxim Grinev + and Denis Turdakov + |
Has domain | Computer science + |
Has topic | Ranking and clustering systems + |
Issue | 1 + |
Pages | 45-66 + |
Peer reviewed | Yes + |
Publication type | Journal article + |
Published in | VLDB Journal + |
Research design | Mathematical modeling + |
Research questions | The measure of similarity between objects … The measure of similarity between objects is a very useful
tool in many areas of computer science, including informa- tion retrieval. SimRank is a simple and intuitive measure of this kind, based on graph-theoretic model. In this paper we present a technique to estimate the ac- curacy of computing SimRank iteratively. This technique provides a way to find out the number of iterations required to achieve a desired accuracy when computing SimRank. We also present optimization techniques that improve the com- putational complexity of the iterative algorithm from O(n4) to O(n3) in the worst case. We also introduce a threshold sieving heuristic and its accuracy estimation that further improves the eciency of the method.urther improves the eciency of the method. |
Revid | 10,646 + |
Theories | Undetermined |
Theory type | Design and action + |
Title | Accuracy estimate and optimization techniques for SimRank computation |
Unit of analysis | Article + |
Url | http://dx.doi.org/10.1007/s00778-009-0168-8 + |
Volume | 19 + |
Wikipedia coverage | Sample data + |
Wikipedia data extraction | Live Wikipedia + |
Wikipedia language | English + |
Wikipedia page type | Article + |
Year | 2010 + |