In this article, we presented a network-theoretical approach to language classification. Our study is a first attempt to classify languages by means of the topological characteristics of social ontologies generated in these languages. We have tested two related hypotheses: a variant of the Sapir-Whorf Hypothesis (SWH) and a variant of Nisbett’s Hypothesis on differences between Western and Eastern cultures. In this way, we gained access to structural analyses of linguistic networks, using Wikipedia-based social ontologies as a new resource for language classification.
In support of the SWH, we successfully classified languages into three genealogical groups. We also outperformed corresponding baselines of random classification. Concerning Nisbett’s variant of the SWH, we obtained a similar result by separating Western and Eastern languages. As predicted by Nisbett, the classification worsened when the corpus of Eastern languages was extended by Sundic languages. In any event, enlarging the number of classes may worsen our results as well, as we observed in our experiments. Obviously, the results obtained could have been biased by the number of classes and related factors such as the size of the language families, the validity of the underlying corpora and the independence of the data sources. Thus, we aim to examine these factors in further studies to substantiate our findings. Additionally, future work will address the construction of more elaborate baselines and check the extensibility of our approach to other kinds of social ontologies. Further, we plan to build more expressive graph models in conjunction with topological indices that separate the classes more clearly in order to get better classification results. We also want to extend sensitivity analyses such as the one based on Konstantinova’s index of degeneration to get classifiers that can be reliably transferred to other areas of linguistic networks. Finally, we will conduct larger classification experiments to extend the range of language families covered by our approach.
This article presents an approach to automatic language classification based on complex network theory ([Newman, 2003], [Ferrer i Cancho et al., 2004] and [Mehler, 2008a]). It explores the topologies of social ontologies as part of Wikipedia to obtain a new data source for genealogical classification. In so doing, the article tests a variant of the Sapir-Whorf Hypothesis (SWH) by means of a network-theoretical approach. It tackles the question of whether structural similarities of social ontologies correspond to family resemblances of the underlying languages.
Generally speaking, the SWH states that language structure imprints on cognitive structure ([Whorf, 1956] and [Lucy, 1992]). If this principle of linguistic relativity is true, then the usage of similar languages should result in similar conceptual structures. Therefore, conversely, conceptual structures should be indicative of family resemblances of the languages in which they are manifested. According to our network-theoretical approach, we additionally hypothesize that these resemblances can be deduced from topological similarities of conceptual structures.
The present study combines the domain-centered with the structure-centered approach. On the one hand, our method can be regarded as domain-centered since we refer to encyclopedic domains as the data source of language classification. At the same time, we overcome the restriction of traditional domain-centered approaches to small ranges of terms. The reason is that social ontologies cover, in principle, the complete range of encyclopedic knowledge and its terminological manifestation. Additionally, we circumvent the problematic introspection of many domain-centered approaches as we access social ontologies directly without any subjective mediation. Consequently, we depart from domain-centered approaches in two respects. The first is that we do not compare the terms of different ontologies directly, nor do we directly compare the referents of these terms in the corresponding domains. Rather, we follow a strict network-theoretical approach as outlined in Section 1.
In this section we present our approach to characterizing social ontologies by topological indices of their graph model. As explained in the last section, we capture both the network- and tree-like structures of social ontologies in a single model. This is done by taking fingerprints of GNAGs by means of four classes of topological indices:
Class 1. Network Theoretical (NT) measures: We utilize the apparatus of scale-free networks (Newman, 2003). In a pilot study (see Section 6.1), we test the hypothesis that languages can be classified into families based on topological indices of dependency networks as introduced by Ferrer i Cancho et al. (2004). In line with this approach, we test whether the same indices indicate the membership of social ontologies in language families. We test this for the cluster coefficients Cws (Watts and Strogatz, 1998), Cbr (Bollobás and Riordan, 2003) and their weighted counterparts Cw(k) and (Serrano et al., 2006). Further, we consider the diameter δ together with the average geodesic distance L, the average degree, Newman’s assortativity index (Newman, 2003) and the expected L and Cws of the random and regular graphs of equal order and size. All in all, we consider 12 indices in Class 1; see Mehler (2008b) for a thorough exemplification of these indices in the context of linguistic networks.
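To make some of the Class 1 indices concrete, the following minimal sketch computes the average degree, the Watts-Strogatz cluster coefficient Cws, the average geodesic distance L and the diameter δ for a small toy graph. It is an illustration only, not the implementation used in this study: the function names, the adjacency-dictionary representation and the toy graph are assumptions, and the graph is taken to be undirected and connected.

```python
from collections import deque


def bfs_distances(adj, source):
    """Geodesic (shortest-path) distances from source to all reachable vertices."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def cluster_coefficient_ws(adj):
    """Mean local cluster coefficient in the sense of Watts and Strogatz (1998)."""
    total = 0.0
    for u, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # vertices with fewer than two neighbours contribute 0
        # count each linked pair of neighbours exactly once
        links = sum(1 for v in nbrs for w in nbrs if v < w and w in adj[v])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def network_indices(adj):
    """Hypothetical helper: a small fingerprint of an undirected, connected graph."""
    n = len(adj)
    # distances over all ordered pairs of distinct vertices
    all_d = [d for u in adj for v, d in bfs_distances(adj, u).items() if v != u]
    return {
        "avg_degree": sum(len(nbrs) for nbrs in adj.values()) / n,
        "C_ws": cluster_coefficient_ws(adj),
        "L": sum(all_d) / len(all_d),
        "diameter": max(all_d),
    }


# Toy graph (an assumption): triangle a-b-c with a pendant vertex d attached to c
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
indices = network_indices(toy)
```

For this toy graph the sketch yields an average degree of 2, Cws = 7/12, L = 4/3 and δ = 2. A dictionary of this kind, computed per ontology, is one simple way to picture the kind of numerical fingerprint on which a classifier could operate.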
Class 2. Information Theoretical (IT) measures: In addition, we investigate a range of measures that have been invented in order to describe the information content of graphs and processes of information flow based on them (see Harary, 1969, for a first introduction to this topic). This relates to so-called measures of graph entropy (Dehmer and Mowshowitz, submitted for publication). The idea behind this approach is more closely related to Nisbett’s Hypothesis, which states that information content tells us something about the shareability (Freyd, 1983) of knowledge systems. Therefore, we direct our attention to this class of topological indices. Further, a pre-study has shown that compactness and centrality measures are informative about differences between linguistic networks such as wiki graphs (Mehler, 2008b). This includes the compactness measure of hypertext theory (Botafogo et al., 1992) as well as graph-related centrality measures such as graph, degree and closeness centrality, which have been successfully applied in NLP (Feldman and Sanger, 2007). As centrality measures are primarily based on the notion of geodesic distance, they relate to graph entropy measures, so that we collectively refer to this group as Information Theoretical (IT) measures. All in all, we experiment with 45 indices in Class 2 as further described in Section 5.2.1.
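As a rough illustration of the kinds of indices grouped in Class 2, the sketch below computes a compactness value following the usual statement of Botafogo et al. (1992), the closeness centrality of a vertex, and a simple Shannon entropy of the degree distribution. The entropy variant is chosen here only for brevity and is an assumption; it is not claimed to be one of the 45 indices used in this study. Function names, the conversion constant K = n for unreachable pairs and the toy graph are likewise assumptions.

```python
import math
from collections import Counter, deque


def bfs_distances(adj, source):
    """Geodesic distances from source to all reachable vertices."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def compactness(adj):
    """Compactness Cp = (Max - sum of converted distances) / (Max - Min).

    Unreachable pairs are assigned the conversion constant K (assumed K = n).
    Requires at least two vertices.
    """
    n = len(adj)
    k_const = n
    total = 0
    for u in adj:
        dist = bfs_distances(adj, u)
        total += sum(dist.get(v, k_const) for v in adj if v != u)
    max_sum = (n * n - n) * k_const  # every pair unreachable
    min_sum = n * n - n              # every pair at distance 1
    return (max_sum - total) / (max_sum - min_sum)


def closeness_centrality(adj, u):
    """Number of reachable vertices divided by the sum of distances to them."""
    dist = bfs_distances(adj, u)
    total = sum(d for v, d in dist.items() if v != u)
    return (len(dist) - 1) / total if total else 0.0


def degree_entropy(adj):
    """Shannon entropy (in bits) of the degree distribution; illustrative variant."""
    counts = Counter(len(nbrs) for nbrs in adj.values())
    n = len(adj)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Toy graph (an assumption): triangle a-b-c with a pendant vertex d attached to c,
# stored with symmetric adjacency so that every link can be followed both ways
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
cp = compactness(toy)
closeness_c = closeness_centrality(toy, "c")
closeness_d = closeness_centrality(toy, "d")
entropy = degree_entropy(toy)
```

On the toy graph, the hub vertex c reaches every other vertex in one step and therefore has closeness 1, whereas the pendant vertex d has closeness 0.6. The compactness value is 8/9, close to the maximum of 1 that a completely connected graph would attain, and the degree entropy is 1.5 bits.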
Class 3. GNAG-based measures: We additionally utilize a range of measures that have been developed in order to capture the topological specifics of social ontologies in contrast to terminological and formal ontologies (Mehler, 2010). This class of measures is sensitive to the kernel hierarchical structure of GNAGs and, therefore, goes beyond network-theoretical indices (of Class 1). We experiment with 52 indices in Class 3 as described in Section 5.2.2.