On the problem of wiki texts indexing
Abstract: A new type of document called a "wiki page" is winning the Internet. This is expressed not only in an increase in the number of Internet pages of this type, but also in the popularity of wiki projects (in particular, Wikipedia); therefore, the problem of parsing wiki texts is becoming more and more topical. A new method for indexing Wikipedia texts in three languages (Russian, English, and German) is proposed and implemented. The architecture of the indexing system, including the software components GATE and Lemmatizer, is considered. The rules for converting wiki texts into texts in a natural language are described. Index bases for the Russian Wikipedia and Simple English Wikipedia are constructed. The validity of Zipf's laws is tested for the Russian Wikipedia and Simple English Wikipedia.
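The Zipf's-law test mentioned in the abstract can be illustrated with a minimal sketch (not the paper's actual code): rank words by frequency and fit the slope of log(frequency) against log(rank) by least squares; Zipf's law predicts a slope close to −1. The corpus below is a toy example, not Wikipedia data.

```python
from collections import Counter
import math

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) vs. log(rank).

    Zipf's law predicts a slope near -1 for natural-language text.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Toy corpus whose frequencies roughly follow 1/rank: word i appears 12 // i times.
corpus = []
for i, word in enumerate(["the", "of", "and", "to", "in", "a"], start=1):
    corpus += [word] * (12 // i)

print(zipf_slope(corpus))  # close to -1, as Zipf's law predicts
```

On a real corpus one would tokenize and lemmatize first, then apply the same rank–frequency fit.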
Added by wikilit team: Added on initial load
Collected data time dimension: Longitudinal
Conclusion: Not only does the Internet grow, but Wikipedia grows as well; according to recent data [40], the encyclopedia grows and improves in the following directions: the number of languages in which Wikipedia is maintained; the number of active participants (over time the number of participants grows, but the relative number of highly active participants, i.e., those who make more than 100 corrections a month, decreases [40]); the list of thematic directions (every new group of participants, and each language group, has its own interests); the total number of articles and, in large Wikipedias, the depth of development (formally, the size of an article and the number of corrections); the connectedness of pages (i.e., the number of internal links, interwikis, and categories); and the "embedding" of Wikipedia in the Internet web through an increasing number of external links. Search engines and wiki resources cooperate more and more closely. On the one hand, because of the large number of hyperlinks in wiki texts, and in view of the specific features of algorithms based on link analysis (e.g., PageRank [3]), search engines assign a high rating to wiki texts, i.e., place them at high positions in search results [1]. On the other hand, search within wiki sites is performed both by the database search built into MediaWiki and by specialized Wikipedia search systems: Wikia Search, Lucene search, FUTEF, and, in the Russian Wikipedia, Qwika. Note that the future of search engines will probably be based on distributed search using P2P applications [41]. Text indexing has been and will remain an important task of search engines. In this paper, we considered the architecture and implementation of WikIDF, a software system for indexing wiki texts. During indexing, a list of lemmas and the frequencies of their occurrence are calculated by the GATE system, the morphological analyzer Lemmatizer, and the RussianPOSTagger module that joins them.
With the use of the WikIDF system, index DBs for the Russian Wikipedia and Simple English Wikipedia were designed. The parameters of the source DBs of the two Wikipedias, the Russian Wikipedia and Simple English Wikipedia, were presented. The temporal characteristics of indexing the DBs were presented, and the quantitative properties of the designed index databases were described. A faster growth of Simple English Wikipedia was detected: over five months (September 2007 to February 2008), the rate of growth of the number of articles in Simple English Wikipedia was greater by 14%, and faster by 7%, than in the Russian Wikipedia.
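The indexing step described in the conclusion, building per-document lemma lists with occurrence frequencies, can be sketched as follows. This is a minimal illustration assuming already-lemmatized input, not the WikIDF implementation (which relies on GATE, Lemmatizer, and RussianPOSTagger for the lemmatization itself); the document ids and lemmas are invented for the example.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Build a lemma -> {doc_id: term frequency} index plus IDF scores.

    `docs` maps document ids to lists of lemmas; lemmatization is assumed
    to have been done upstream by an external tool.
    """
    index = defaultdict(dict)
    for doc_id, lemmas in docs.items():
        for lemma, tf in Counter(lemmas).items():
            index[lemma][doc_id] = tf
    n = len(docs)
    # Standard IDF: log(N / document frequency) for each lemma.
    idf = {lemma: math.log(n / len(postings)) for lemma, postings in index.items()}
    return index, idf

# Hypothetical pre-lemmatized mini-corpus.
docs = {
    "Wiki": ["wiki", "page", "wiki", "text"],
    "Search": ["search", "engine", "page"],
}
index, idf = build_index(docs)
print(index["wiki"])  # {'Wiki': 2} -- term frequency of "wiki" per document
```

A real index over a Wikipedia dump would store these postings in a database rather than in memory, but the lemma/frequency structure is the same.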
Data source: Experiment responses, Wikipedia pages
DOI: 10.1134/S1064230709040157
Google scholar url: http://scholar.google.com/scholar?ie=UTF-8&q=%22On%2Bthe%2Bproblem%2Bof%2Bwiki%2Btexts%2Bindexing%22
Has author: Andrew A. Krizhanovsky, Alexander V. Smirnov
Has domain: Computer science
Has topic: Other information retrieval topics
Issue: 4
Pages: 616-624
Peer reviewed: Yes
Publication type: Journal article
Published in: Journal of Computer and Systems Sciences International
Research design: Experiment
Research questions: A new type of document called a "wiki page" is winning the Internet. This is expressed not only in an increase in the number of Internet pages of this type, but also in the popularity of wiki projects (in particular, Wikipedia); therefore, the problem of parsing wiki texts is becoming more and more topical. A new method for indexing Wikipedia texts in three languages (Russian, English, and German) is proposed and implemented. The architecture of the indexing system, including the software components GATE and Lemmatizer, is considered. The rules for converting wiki texts into texts in a natural language are described. Index bases for the Russian Wikipedia and Simple English Wikipedia are constructed. The validity of Zipf's laws is tested for the Russian Wikipedia and Simple English Wikipedia.
Revid: 10,891
Theories: Undetermined
Theory type: Design and action
Title: On the problem of wiki texts indexing
Unit of analysis: N/A
Url: http://dx.doi.org/10.1134/S1064230709040157
Volume: 48
Wikipedia coverage: Sample data
Wikipedia data extraction: Live Wikipedia
Wikipedia language: English, German, Russian
Wikipedia page type: Article
Year: 2009
Creation date: 15 March 2012 20:29:55
Categories: Other information retrieval topics, Computer science, Publications with missing comments, Publications
Modification date: 30 January 2014 20:30:12