Wikipedia - a quantitative analysis
Publication | |
---|---|
Wikipedia: a quantitative analysis | |
Authors: | Felipe Ortega |
Citation: | Universidad Rey Juan Carlos : 228. 2009 April. Madrid, Spain. |
Publication type: | Thesis |
Peer-reviewed: | Yes |
Added by Wikilit team: | Yes |
Abstract
In this doctoral thesis, we undertake a quantitative analysis of the top-ten language editions of Wikipedia from different perspectives. Our main goal has been to trace the evolution over time of key descriptive and organizational parameters of Wikipedia and its community of authors. The analysis has focused on logged authors (those editors who created a personal account to participate in the project). The metrics include the monthly evolution of general metrics (number of revisions, active editors, active pages), the distribution of pages and their length, and the evolution of participation in discussion pages. We also present a detailed analysis of the inner social structure and stratification of the Wikipedia community of logged authors, fitting appropriate distributions to the most relevant metrics. We also examine the inequality level of contributions from logged authors, showing that there exists a core of very active authors who undertake most of the editorial work. Regarding articles, the inequality analysis also shows that there exists a small group of popular articles, though the distribution of revisions is not as skewed as in the previous case. The analysis continues with an in-depth demographic study of the community of authors, focusing on the evolution of the core of very active contributors (applying a statistical technique known as survival analysis). We also explore some basic metrics to analyze the quality of Wikipedia articles and the trustworthiness level of individual authors. This work concludes with an extended analysis of the evolution of the most influential parameters and metrics previously presented. Based on these metrics, we infer important conclusions about the future sustainability of Wikipedia. According to these results, the Wikipedia community of authors has ceased to grow, remaining stable from Summer 2006 until the end of 2007.
As a result, the monthly number of revisions has remained stable over the same period, restricting the number of articles that can be reviewed by the community. On the other hand, while the number of revisions in talk pages has also stabilized over the same period, the number of active talk pages grows at a steady rate in all versions. This suggests that the community of authors is shifting its focus to broaden the coverage of discussion pages, which has a direct impact on the final quality of content, as previous research has shown. Regarding the inner social structure of the Wikipedia community of logged authors, we find Pareto-like distributions that fit all relevant metrics pertaining to authors (number of revisions per author, number of different articles edited per author), while measurements on articles (number of revisions per article, number of different authors per article) follow lognormal shapes. The analysis of the inequality level of revisions performed by authors and of revisions received by articles shows highly unequal distributions. The results of our survival analysis on Wikipedia authors present very high mortality percentages among young authors, revealing an endemic difficulty in all Wikipedias in keeping young editors collaborating with the project for a long period of time. Likewise, our survival analysis shows that the mean lifetime of Wikipedia authors in the core (until they abandon the group of top editors) lies between 200 and 400 days for all versions, while the median value is lower than 120 days in all cases. Moreover, the analysis of the monthly number of births and deaths in the community of logged authors reveals that the shift in the monthly trend of active authors is caused by a higher number of deaths from Summer 2006 onwards in all versions, surpassing the monthly number of births from then on.
The analysis of the inequality level of contributions over time, and the evolution of additional key features identified in this thesis, reveals a worrying trend towards a progressive increase of the effort spent by core authors as time elapses. This trend may eventually cause these authors to reach their upper limit in the number of revisions they can perform each month, thus starting a decreasing trend in the number of monthly revisions and an overall recession of the content creation and reviewing process in Wikipedia. To prevent this probable future scenario, the monthly number of new editors should be increased again, perhaps through the adoption of specific policies and campaigns for attracting new editors to Wikipedia and for recovering former top contributors. Finally, another important contribution for the research community is WikiXRay, the software tool we have developed to perform the statistical analyses included in this thesis. This tool completely automates the process of retrieving the database dumps from the Wikimedia public repositories, processing them to obtain key metrics and descriptive parameters, and loading them into a local database, ready to be used in empirical analyses. As far as we know, this is the first research work implementing a comparative analysis, from a quantitative point of view, of the top-ten language editions of Wikipedia, presenting results from many different scientific perspectives. Therefore, we expect that this contribution will help the scientific community to enhance their understanding of the rich, complex and fascinating working mechanisms and behavioral patterns of the Wikipedia project and its community of authors. Likewise, we hope that WikiXRay will facilitate the hard task of developing empirical analyses on any language version of the encyclopedia, boosting in this way the number of comparative studies like this one in many other scientific disciplines.
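The survival analysis mentioned in the abstract can be illustrated with a small Kaplan-Meier estimator. This is only a sketch under stated assumptions: the thesis itself works in GNU R, and the function names, the input layout (days spent in the core plus a flag saying whether the author was actually seen leaving) and the sample numbers below are all hypothetical.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct time with at least one death.

    durations: days each author stayed in the core (hypothetical input).
    observed:  True if the author was seen leaving the core (a "death"),
               False if still in the core when the data ends (censored).
    """
    deaths = Counter(t for t, d in zip(durations, observed) if d)
    exits = Counter(durations)  # every author leaves the risk set at their own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        if deaths[t]:
            survival *= 1.0 - deaths[t] / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


def median_lifetime(curve):
    """First time at which the survival estimate drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # the curve never reaches 0.5 (too many censored authors)


# Made-up example data, not taken from the thesis.
durations = [30, 45, 45, 90, 120, 200, 400, 400]
observed = [True, True, False, True, True, True, False, True]
curve = kaplan_meier(durations, observed)
print(median_lifetime(curve))
```

Treating authors who are still active at the end of the dump as censored, rather than as deaths, keeps the estimate from understating lifetimes for recent cohorts.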
Research questions
"1. How does the community of authors in the top ten Wikipedias evolve over time?
2. What is the distribution of content and pages in the top ten Wikipedias?
3. How does the coordination among authors in the top ten Wikipedias evolve over time?
4. Which are the key parameters defining the social structure and stratification of Wikipedia authors?
5. What is the average lifetime of Wikipedia volunteer authors in the project?
6. Can we identify basic quantitative metrics to describe the reputation of Wikipedia authors and the quality of Wikipedia articles?
7. Is it possible to infer, based on previous history data, any sustainability conditions affecting the top ten Wikipedias in due course?"
Research details
Topics: | Featured articles, Reliability, Size of Wikipedia, Data mining, Information extraction, Community building, Contributor engagement, Quality improvement processes, Participation trends |
Domains: | Computer science |
Theory type: | Design and action |
Wikipedia coverage: | Main topic |
Theories: | "We will also make use of the muhaz R package [80], written originally by Kenneth Hess and ported to R by R. Gentleman. [55], [30], [64] and [120] provide good introductions to the theory of survival analysis and practical examples using R. We present here a very brief introduction to the basic theory behind survival analysis, just to provide a minimum background framework to understand the results and conclusions that we will draw in Chapter 4 of this thesis work. On the other side, the extremely low ratio found in the Polish language version sets out interesting theories about the source of efforts in this language version. The combination of a very active cohort of bots, together with the very low ratio of talk pages, indicates that the Polish language version is not following the same organizational pattern found in other language editions. Such a low ratio of talk pages points out the little effort undertaken on coordination actions and discussion about article contents in the Polish version. The 3 distinct types of theoretical distributions found in our data are:
• Pareto distribution: This distribution follows a straight line all along the entire range of its CCDF plot, when we use a logarithmic scale for both axes. The mathematical properties and other interesting characteristics of this distribution can be found in [74]. The slope of the line α is the characteristic parameter of this distribution. Sometimes, the empirical data only follows the straight line shape from a minimal value, which is usually known as xmin. The methodology presented by [25] and followed in this thesis work produces the M.L.E. of both parameters, along with the maximum distance from the fitted line to the empirical data.
• Upper truncated Pareto distribution: Similar to the previous one, it follows a straight line along its lower values, but it suddenly drops off from a certain upper limit value. The algorithm presented in [4] is applied in the VGAM library of GNU R to find the M.L.E. of the lower, upper limits and the slope of the distribution, using generalized linear models.
• Lognormal distribution: The lognormal distribution presents a characteristic curved shaped all along its CCDF curve, without any straight line throughout its range. The fitdistr function, included in the MASS library provides a good tool to fit this family of theoretical distributions to empirical data. The logarithmic mean and standard deviation are the characteristic parameters defining this distribution." |
Research design: | Statistical analysis |
Data source: | Archival records, Interview responses, Wikipedia pages |
Collected data time dimension: | Longitudinal |
Unit of analysis: | Article, Language, User |
Wikipedia data extraction: | Dump |
Wikipedia page type: | Article, Article:talk |
Wikipedia language: | Multiple |
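The Theories entry above describes reading a Pareto tail off a log-log CCDF plot and fitting lognormal parameters. A minimal sketch of those ideas in plain Python follows. It is not the thesis's method: the thesis uses GNU R (VGAM, MASS fitdistr) and also estimates xmin, whereas this sketch takes xmin as given and uses the textbook closed-form maximum-likelihood estimators. All function names and sample values are hypothetical.

```python
import math
import statistics


def empirical_ccdf(values):
    """Return [(x, P(X >= x))] for each distinct x, in ascending order."""
    n = len(values)
    ordered = sorted(values)
    points = []
    for i, x in enumerate(ordered):
        if i == 0 or x != ordered[i - 1]:
            points.append((x, (n - i) / n))
    return points


def pareto_alpha_mle(values, xmin):
    """Closed-form MLE of the Pareto tail index for a fixed xmin.

    On log-log axes the fitted CCDF is a straight line of slope -alpha.
    """
    tail = [x for x in values if x >= xmin]
    log_sum = sum(math.log(x / xmin) for x in tail)
    if log_sum == 0:
        raise ValueError("need at least one value strictly above xmin")
    return len(tail) / log_sum


def lognormal_mle(values):
    """MLE of the logarithmic mean and standard deviation."""
    logs = [math.log(x) for x in values]
    return statistics.fmean(logs), statistics.pstdev(logs)


# Made-up sample, chosen only so the arithmetic is easy to follow.
sample = [math.exp(k) for k in (0, 1, 2)]
alpha = pareto_alpha_mle(sample, xmin=1.0)  # 3 / (0 + 1 + 2)
mu, sigma = lognormal_mle(sample)
print(alpha, mu, sigma)
print(empirical_ccdf([1, 1, 2, 4]))
```

Comparing the empirical CCDF points against the fitted line is what distinguishes the three shapes: a straight line throughout suggests a Pareto, a straight line that drops off sharply suggests an upper truncated Pareto, and a curve throughout suggests a lognormal.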
Conclusion
"The joint conclusion that we can extract from the numeric summaries in these tables is that the production of quality content in Wikipedia presents a strong correlation between both a high number of authors and a large number of different revisions. In other words, Wikipedia needs to sustain, and increase as much as possible the number of different authors and the number of revisions received in case the project wants to ensure maintaining a process that is able to create top quality content. Our quantitative analysis of FAs and author reputation in Wikipedia leave some interesting conclusions. First and foremost, there exist common quantitative patterns in FAs of the top ten language editions of Wikipedia, according to their total number of articles. FAs in these language editions are longer, present a higher number of different authors and longer revisions than non-FAs. They are also older articles (in average), according to our definition of age, and they have a much lower average recentness value. Finally, FAs presents higher average rating values, computed following Stein and Hess proposal. The main conclusion that we can infer from the overall results of our quantitative analysis is that there exists a severe risk on the capacity of the top-ten Wikipedias, to maintain their current activity level in due course. According to our graphs and numbers, the inequality level of the contributions from logged authors is becoming more and more biased towards the core of very active authors. At the same time, the monthly Gini coefficients show that the inequality level of contributions from logged authors has remained stable over time, at the cost of demanding more and more contributions from active authors to alleviate this deficit of monthly revisions. Furthermore, we have seen that the distribution of the total number of revisions per author follows an upper truncated Pareto distribution. 
While more core authors begin to reach the upper limit of their human contribution capacity, we will see a point in the future of this language versions in which the steady-state of the monthly Gini coefficient will start to decrease. This situation would not pose a problem in itself, unless for the fact that we have demonstrated that the most significant part of the content creation effort in Wikipedia is not undertaken by casual, passing-by authors, but by members of the core of very active contributors. On top of that, the lack of new core members seriously threaten the scalability of the top-ten language versions regarding the quality of their content. We have demonstrated in the analysis previously presented that the eldest, top-active contributors are responsible for the majority of revisions in FAs, as well. Since the number of core authors has reached a steady-state (due to the leverage in the total number of active authors per month), the group of authors providing the primary source of effort in the revision of quality articles has stalled. Without new core members, the number of different articles who would potentially become FAs can not expand, since we do not have enough revisors for that content. Since the total number of quality articles generated so far in the top-ten language editions is fairly low, we can conclude that this approach will not contribute to dynamize the creation of quality content in Wikipedia in due course. It is true that Wikipedia has succeeded to compete with other traditional encyclopaedias, namely Britannica, but if we do not have a clear strategy for making the creation of quality content in Wikipedia more agile, the project will not ever evolve from its current character of “good starting point to look for a quick introduction of a new topic, from which we can jump to more serious information sources”. 
To conclude this section, it would be disappointing to avoid offering some insights about possible solutions for the top-ten Wikipedias to improve their current trend. Nevertheless, some of the knowledge needed to formulate such recommendations could be perfectly a matter for a doctoral thesis on its own, namely the causes driving Wikipedia authors to eventually join the core of very active users. Since we have not answered such questions, we can simply settle for enumerating direct countermeasures to alleviate these findings. In the first place, incrementing the number of core authors should become a priority for the project, and as a first step, Wikipedia should focus increasing the number of monthly active authors. Indeed, donations campaigns are necessary to aid in the financial support of the project, but attracting new contributors or recovering older ones should be an equally important goal, given the current situation."
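The monthly Gini coefficients referred to in the conclusion can be computed along the following lines. This is a sketch, assuming a per-month list of revision counts for each active logged author; the function names, the month keys and the numbers are hypothetical and do not come from the thesis or from WikiXRay.

```python
def gini(contributions):
    """Gini coefficient of non-negative contribution counts.

    0.0 means every author contributed equally; values near 1.0 mean
    a few authors account for almost all revisions.
    """
    xs = sorted(contributions)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the sorted counts (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def monthly_gini(revisions_by_month):
    """Map each month to the Gini coefficient of its per-author counts."""
    return {
        month: gini(counts)
        for month, counts in sorted(revisions_by_month.items())
    }


# Made-up counts: an even month and a month dominated by one core author.
example = {
    "2006-06": [5, 5, 5, 5],
    "2007-06": [1, 1, 1, 17],
}
monthly = monthly_gini(example)
print(monthly)
```

A coefficient that stays flat while the total number of revisions also stays flat is compatible with the conclusion's reading that core authors are absorbing a growing share of the work; the coefficient alone does not show who is doing it.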
Comments
"The production of quality content in Wikipedia presents a strong correlation between both a high number of authors and a large number of different revisions" (p. 160)
Further notes
Added by wikilit team | Yes
Collected data time dimension | Longitudinal
Comments | "The production of quality content in Wikipedia presents a strong correlation between both a high number of authors and a large number of different revisions" P.160 |
Conference location | Madrid, Spain
Data source | Archival records, Interview responses and Wikipedia pages
Google scholar url | http://scholar.google.com/scholar?ie=UTF-8&q=%22Wikipedia%3A%2Ba%2Bquantitative%2Banalysis%22
Has author | Felipe Ortega
Has domain | Computer science
Has topic | Featured articles, Reliability, Size of Wikipedia, Data mining, Information extraction, Community building, Contributor engagement, Quality improvement processes and Participation trends
Month | April
Pages | 228
Peer reviewed | Yes
Publication type | Thesis
Published in | Universidad Rey Juan Carlos
Research design | Statistical analysis
Revid | 11,076
Theory type | Design and action
Title | Wikipedia: a quantitative analysis |
Unit of analysis | Article, Language and User
Url | http://www.mendeley.com/research/wikipedia-a-quantitative-analysis/
Wikipedia coverage | Main topic
Wikipedia data extraction | Dump
Wikipedia language | Multiple
Wikipedia page type | Article and Article:talk
Year | 2009