On the problem of wiki texts indexing

Authors: Andrew A. Krizhanovsky, Alexander V. Smirnov
Citation: Journal of Computer and Systems Sciences International 48 (4): 616-624. 2009.
Publication type: Journal article
Peer-reviewed: Yes
DOI: 10.1134/S1064230709040157.
Added by Wikilit team: Added on initial load
On the problem of wiki texts indexing is a publication by Andrew A. Krizhanovsky and Alexander V. Smirnov.


Abstract

A new type of documents called a "wiki page" is winning the Internet. This is expressed not only in an increase of the number of Internet pages of this type, but also in the popularity of Wiki projects (in particular, Wikipedia); therefore the problem of parsing in Wiki texts is becoming more and more topical. A new method for indexing Wikipedia texts in three languages: Russian, English, and German, is proposed and implemented. The architecture of the indexing system, including the software components GATE and Lemmatizer, is considered. The rules of converting Wiki texts into texts in a natural language are described. Index bases for the Russian Wikipedia and Simple English Wikipedia are constructed. The validity of Zipf's laws is tested for the Russian Wikipedia and Simple English Wikipedia.
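
As an illustration of what testing Zipf's law on an index involves, here is a minimal sketch in Python. It assumes the simplest form of the law, in which a word's frequency is roughly inversely proportional to its frequency rank, so a log-log fit of frequency against rank should give a slope near -1. The function name `zipf_slope` and the synthetic corpus are hypothetical; this is not the procedure used in the paper.

```python
import math
from collections import Counter

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank)."""
    # Frequencies sorted from most to least frequent; rank 1 is the top word.
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic corpus (an assumption for illustration): the word of rank r
# occurs about 120 / r times, which is what an ideal Zipf distribution predicts.
tokens = []
for rank in range(1, 21):
    tokens.extend([f"w{rank}"] * round(120 / rank))

slope = zipf_slope(tokens)
print(f"fitted slope: {slope:.2f}")
```

On real lemma frequencies the slope would be estimated the same way; how far it departs from -1 indicates how well the simple form of the law holds.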

Research questions

"A new type of documents called a “wiki page” is winning the Internet. This is expressed not only in an increase of the number of Internet pages of this type, but also in the popularity of Wiki projects (in particular, Wikipedia); therefore the problem of parsing in Wiki texts is becoming more and more topical. A new method for indexing Wikipedia texts in three languages: Russian, English, and German, is proposed and implemented. The architecture of the indexing system, including the software components GATE and Lemmatizer, is considered. The rules of converting Wiki texts into texts in a natural language are described. Index bases for the Russian Wikipedia and Simple English Wikipedia are constructed. The validity of Zipf’s laws is tested for the Russian Wikipedia and Simple English Wikipedia."

Research details

Topics: Other information retrieval topics
Domains: Computer science
Theory type: Design and action
Wikipedia coverage: Sample data
Theories: "Undetermined"
Research design: Experiment
Data source: Experiment responses, Wikipedia pages
Collected data time dimension: Longitudinal
Unit of analysis: N/A
Wikipedia data extraction: Live Wikipedia
Wikipedia page type: Article
Wikipedia language: English, German, Russian

Conclusion

"Not only the Internet grows, but also Wikipedia grows, and by recent data [40] the encyclopedia increases and is improved in the following these directions: the number of languages in which Wikipedia is maintained; the number of active participants (with time the number of participants grows, but the relative number of high activity participants, i.e., those who do more than 100 corrections a month, reduces [40]); the list of thematic directions (every new group of participants, and each language group has its own interests); the entire number of articles, and in large Wikipedias, the depth of development (formally, this is the size of an article and the number of corrections); connectedness of pages (i.e., the number of internal links, interwikis, categories); “embedding” of Wikipedia in the Internet web by increasing the number of external links. Search systems and wiki resources are cooperated more and more closely. On the one hand, because of a large number of hyperlinks in wiki texts and in view of the specific features of algorithms based on analysis of links (e.g., PageRank [3]), search engines assign a high rating to wiki texts, i.e., put them at the high positions as a result of search [1]. On the other hand, the search within wiki sites is performed both by using search over DB built in MediaWiki and by specialized Wikipedia search systems: Wikia Search, Lucene search, FUTEF, and in Russian Wikipedia, by Qwika. Note that the future of search engines will probably be based on distributed search using P2P applications [41]. Text indexing was and will be the important task of search engines. In this paper, we considered the architecture and implementation of a software system for indexing wiki texts WikIDF. In indexing, a list of lemmas and the frequencies of their occurrence are calculated by the GATE system, the morphological analyzer Lemmatizer, and the module RussianPOSTagger joining them.
With the use of the WikIDF system, index DBs for Russian Wikipedia and Simple English Wikipedia were designed. The parameters of the source DBs of two Wikipedias were presented: Russian Wikipedia and Simple English Wikipedia. The temporal characteristics of indexing DB were presented, and the quantitative properties of the designed index databases were described. A faster growth of English Wikipedia was detected, namely for five months (September 2007 to February 2008); in Simple English Wikipedia, the rate of growth of the number of articles was greater by 14% and by 7% faster, than in Russian Wikipedia."
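
The indexing step described in the conclusion (wiki text converted to plain text, words reduced to lemmas, lemma frequencies counted) can be sketched in Python as follows. Everything here is an assumption made for illustration: the three markup rules are a tiny subset chosen by hand and are not the paper's conversion rules, the dictionary `TOY_LEMMAS` stands in for a real morphological analyzer such as Lemmatizer, and the function names are hypothetical. The sketch does not use GATE or reproduce WikIDF.

```python
import re
from collections import Counter

# Hypothetical stand-in for a morphological analyzer: word form -> lemma.
TOY_LEMMAS = {"pages": "page", "links": "link", "wikis": "wiki"}

def wiki_to_text(wikitext):
    """Strip a few common wiki markup constructs (illustrative rules only)."""
    text = re.sub(r"\{\{[^{}]*\}\}", " ", wikitext)                # drop simple templates
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)  # keep the link label
    text = re.sub(r"'{2,}", "", text)                              # bold / italic quotes
    return text

def lemmas(text):
    """Lowercase, split into ASCII words, and map each word to its lemma."""
    words = re.findall(r"[a-z]+", text.lower())
    return [TOY_LEMMAS.get(w, w) for w in words]

def build_index(pages):
    """Return (lemma frequencies per page, number of pages containing each lemma)."""
    tf = {title: Counter(lemmas(wiki_to_text(body))) for title, body in pages.items()}
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())
    return tf, df

# Two made-up pages in wiki markup.
pages = {
    "Wiki": "A '''wiki''' has [[Web page|pages]] and [[link]]s. {{stub}}",
    "Link": "Links connect pages of wikis.",
}
tf, df = build_index(pages)
print(tf["Wiki"])
print(df["page"], df["connect"])
```

Keeping per-page frequencies separate from the per-lemma page counts is one common layout for an index of this kind, since both quantities are needed for TF-IDF style weighting.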

