PERFECT PHYLOGENETIC NETWORKS: A NEW METHODOLOGY FOR RECONSTRUCTING THE EVOLUTIONARY HISTORY OF NATURAL LANGUAGES
LUAY NAKHLEH, Rice University
DON RINGE, University of Pennsylvania
TANDY WARNOW, University of Texas
In this article we extend the model of language evolution exemplified in Ringe et al. 2002, which recovers phylogenetic trees optimized according to a criterion of weighted maximum compatibility, to include cases in which languages remain in contact and trade linguistic material as they evolve. We describe our analysis of an Indo-European (IE) dataset (originally assembled by Ringe and Taylor) based on this new model. Our study shows that this new model fits the IE family well and suggests that the early evolution of IE involved only limited contact between distinct lineages. Furthermore, the candidate histories we obtain appear to be consistent with archaeological findings, which suggests that this method may be of practical use. The case at hand provides no opportunity to explore the problem of conflict between network optimization criteria; that problem must be left to future research.*
1. INTRODUCTION. Languages differentiate and divide into new languages by a process roughly similar to biological speciation:1 communities separate (typically geographically), the language changes differently in each of the new communities, and in time people from separate communities can no longer understand each other.2 While this is not the only way in which languages change, it is this process that is referred to when we say, for example, ‘French and Italian are both descendants of Latin’. The evolution of families of related languages can be modeled mathematically as a rooted tree in which internal nodes represent ancestral languages at the points in time at which they began to diversify and the leaves represent attested languages. Reconstructing this process for various language families is a major endeavor within historical linguistics, but it is also of interest to archaeologists, human geneticists, and physical anthropologists, for example, because an accurate reconstruction of how particular families of languages have evolved can help answer questions about human migrations, the times at which new technologies were first developed, when ancient people began to use horses, and so on (see e.g. White & O’Connell 1982, Mallory 1989, Roberts et al. 1990).3
* This work was supported in part by the David and Lucile Packard Foundation (Warnow) and by the National Science Foundation with grants EIA 01-21680 (Warnow), BCS 03-12830 (Warnow), SBR-9512092 (Warnow and Ringe), and BCS 03-12911 (Ringe). Warnow would like to acknowledge and thank the Radcliffe Institute for Advanced Study, the Program in Evolutionary Dynamics at Harvard University, and the Institute for Cellular and Molecular Biology at the University of Texas at Austin for their support during the time this work was done. The authors would like to thank Ann Taylor for help in putting the dataset together and James Clackson, Joe Eska, and Craig Melchert for expert advice regarding data of particular languages. The software used to construct perfect phylogenetic networks was written by Luay Nakhleh but used earlier code (for perfect phylogeny reconstruction) developed by Alexander Michailov and optimized by Alex Garthwaite.
1 We take this opportunity to point out that the similarity between biological and linguistic speciation has nothing whatsoever to do with nineteenth-century ideas about the ‘organic’ nature of language. The micro-level processes of biological descent and linguistic descent are actually quite different, but they give rise to similar large-scale patterns, and the similarities are topological—that is, mathematical (see Hoenigswald 1960:144–60, 1987, Ruvolo 1987).
2 We are well aware that whether one is confronted with ‘the same language’ or ‘different languages’ is a complex matter. However, it seems difficult to dispute that two speakers who cannot understand one another at all are ‘speaking different languages’; we therefore adduce that situation as the paradigm case. What matters for cladistics is that, given enough divergence with too little effective contact, a single language will eventually become two or more different languages by any reasonable criterion.
Various researchers (e.g. Gleason 1959, Dobson 1969, 1974, Embleton 1986) have noted that if speech communities do not remain in effective contact as their languages diverge, a tree is a reasonable model for the evolutionary history of their language family, and that this tree (called a PHYLOGENY or EVOLUTIONARY TREE) can be inferred from shared unusual innovations in language structure (changes in inflection, regular sound changes, and the replacement of lexemes for basic meanings). Such techniques established the major subfamilies within Indo-European (IE) decades ago but have not been sufficient to resolve the family’s evolution fully; major questions, such as whether all of the non-Anatolian branches of the family constitute a clade (the ‘Indo-Hittite hypothesis’) or whether Greek and Armenian are sisters, continue to be debated. More recently, techniques for using multistate characters have been devised which suggest that the vast majority of linguistic characters,4 provided that they are correctly chosen and coded, should be COMPATIBLE on the true tree (see Ringe et al. 2002:70–78 with references); in other words, each character should evolve without backmutation or parallel evolution.5 This condition is also expressed by saying that the tree is a PERFECT PHYLOGENY, that is, a phylogenetic tree that is fully compatible with all of the data.
(See §2 for an extended discussion of those requirements.) A collaboration between linguist Don Ringe and computer scientist Tandy Warnow led to a computational technique to solve the ‘perfect phylogeny’ problem (determining whether a perfect phylogeny exists for a given dataset); that technique was subsequently used to analyze an IE dataset compiled by Don Ringe and Ann Taylor (see the references under all three authors in the bibliography). Their initial test of the methodology largely supported the claim that a perfect phylogeny should exist, but not entirely. The Germanic subfamily especially seemed to exhibit nontreelike behavior, evidently acquiring some of its characteristics from its neighbors rather than (only) from its direct ancestors.6
3 Readers who have not been trained in historical linguistics also need to understand that recognition of language families is different from and independent of the reconstruction of phylogenetic trees, and that the recognition of cognates—words and affixes inherited by two or more related languages from any common ancestor—also does not depend on prior knowledge of the true tree. Cognates are recognized by the regular correspondences between their sounds that are the direct result of regular sound changes; see especially Hoenigswald 1960 for discussion of this fundamental point. Cognates cannot be reliably recognized by mere similarity. Language families are recognized by a density of putative cognates too great to be attributed to mere chance resemblance; see Ringe 1999 for some of the problems involved.
4 A character is a linguistic parameter in which languages can agree or differ; languages are assigned the same state of the character if they agree, but different states if they differ. Characters of interest in linguistic phylogeny are highly specific, since general characters (such as word order) typically reveal much less about shared linguistic history. For instance, the basic meaning ‘hand’ can be chosen as a character; IE languages that exhibit cognates of English hand will all be assigned a single state of that character, languages that exhibit cognates of French main a second state, and so on. Among phonological developments, across-the-board merger of Proto-Indo-European (PIE) *m and *mbh can be chosen as a character; the two Tocharian languages (which share that merger) will then be assigned one state, while all the other IE languages in our database (which did not undergo the merger) will be assigned another state. On the coding of characters see further Ringe et al. 2002:71–76.
5 Backmutation is the reappearance at some point in the phylogenetic tree of a state that has already appeared at some earlier point in the same line of descent but was subsequently lost; in other words, a sequence of states a → b (→ c ...) → a in a single line of descent is backmutation. Parallel development is the appearance of the same state independently in different lines of descent.
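The two definitions above can be checked mechanically once the history of a character is written down. The following is a minimal sketch in Python; the function names and the input formats (a list of states along one line of descent, and a list of (branch, new state) change events) are assumptions made for illustration and are not taken from the article or its software.

```python
from collections import Counter
from itertools import groupby


def has_backmutation(lineage_states):
    """True if a state reappears in one line of descent after being replaced.

    `lineage_states` lists the character's state at successive points along a
    single line of descent, oldest first (e.g. ["a", "b", "c", "a"]).
    """
    # Collapse runs of unchanged states so that only actual changes remain.
    runs = [state for state, _ in groupby(lineage_states)]
    # After collapsing, any state seen twice must have been lost and regained.
    return len(runs) != len(set(runs))


def parallel_states(innovations):
    """States that arise in more than one independent change event.

    `innovations` is an iterable of (branch, new_state) pairs, one pair per
    change event on the tree (a hypothetical input format).
    """
    counts = Counter(new_state for _, new_state in innovations)
    return {state for state, n in counts.items() if n > 1}


print(has_backmutation(["a", "b", "c", "a"]))  # True: a -> b -> c -> a
print(has_backmutation(["a", "a", "b", "b"]))  # False: a single change a -> b
print(parallel_states([("branch1", "x"), ("branch2", "x"), ("branch3", "y")]))  # {'x'}
```

Collapsing unchanged runs first means that a state persisting over several generations is not mistaken for a reappearance.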
6 We wish to emphasize that this appears to be an ineluctable conclusion of Ringe et al. 2002; we see no grounds for questioning it and do not revisit the problem here. Interested readers are referred to Ringe et al. 2002, especially pp. 85–92. Since the best tree found in that earlier work also figures largely in this
LANGUAGE, VOLUME 81, NUMBER 2 (2005)
Consequently, though their methodology seemed promising and offered potential answers to many of the controversial problems in the evolution of IE (cf. Jasanoff 1997, Winter 1998, Ringe 2000 with references), it is clearly necessary to extend their model to address the problem of how characters evolve when diverging language communities remain in significant contact. For such cases trees are not an appropriate model of evolution; NETWORKS are needed instead to model the evolutionary history of the family.
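To make the tree-based condition concrete, the sketch below tests whether a single character is compatible with a given rooted tree, that is, whether it could have evolved on that tree without backmutation or parallel development. It relies on a standard equivalent formulation: for each state, take the smallest subtree connecting the leaves that show that state; the character is compatible exactly when no two of these subtrees share a node. The tree encoding (a child-to-parent dictionary), the node names, and the function names are assumptions for illustration. This is a brute-force check of the condition on a known tree, not the algorithm used in the authors' software to search for a perfect phylogeny.

```python
from collections import Counter
from itertools import combinations


def path_to_root(node, parent):
    """Nodes from `node` up to the root, following child -> parent links."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def spanning_nodes(leaves, parent):
    """Nodes of the smallest subtree that connects all the given leaves."""
    paths = [path_to_root(leaf, parent) for leaf in leaves]
    visits = Counter(node for path in paths for node in path)
    nodes = set()
    for path in paths:
        for node in path:
            nodes.add(node)
            # The first node lying on every root path is the leaves' lowest
            # common ancestor; nothing above it belongs to the subtree.
            if visits[node] == len(paths):
                break
    return nodes


def is_compatible(leaf_states, parent):
    """True if the per-state subtrees are pairwise node-disjoint."""
    by_state = {}
    for leaf, state in leaf_states.items():
        by_state.setdefault(state, []).append(leaf)
    subtrees = [spanning_nodes(leaves, parent) for leaves in by_state.values()]
    return all(a.isdisjoint(b) for a, b in combinations(subtrees, 2))


# Toy rooted tree ((A,B),(C,D)) with internal nodes X, Y and root R.
parent = {"A": "X", "B": "X", "C": "Y", "D": "Y", "X": "R", "Y": "R"}
print(is_compatible({"A": 1, "B": 1, "C": 2, "D": 2}, parent))  # True
print(is_compatible({"A": 1, "C": 1, "B": 2, "D": 2}, parent))  # False
```

In the second call, both states need the path through X, R, and Y, so at least one state would have to arise twice or be lost and regained. A tree on which every character in the dataset passes this test is a perfect phylogeny in the sense used above.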
In this article we show how to extend the perfect phylogeny approach to the case in which the language family requires a network model (that is, an underlying tree with additional ‘contact’ edges; see Fig. 3 for an example) instead of a tree model, and we test this approach on the same IE dataset analyzed by Ringe, Warnow, and Taylor. Our analysis finds several networks with a very small number of contact edges that are plausible with respect to what is known about the early linguistic geography of the IE family. The study thus leads us to conjecture that the IE family, though it did not evolve by means of clean speciation, exhibits a pattern of initial diversification that is close to treelike: the vast majority of characters evolve down the ‘genetic’ tree, and the evolution of the rest can be accounted for by positing limited borrowing between languages. It also suggests that this extended model of character evolution is plausible and that the tools we have developed may be helpful in reconstructing evolutionary histories for other datasets that are similarly close to treelike in their evolution.
The rest of this article is organized as follows. We review the model of Ringe and Warnow, and then present our extension to the case of network evolution. We next describe the data we use to represent the IE family, and then turn to our computational analysis of the data which results in the candidate networks we then consider. Comparing the candidate networks in the light of known IE history produces a set of five feasible solutions, leading to a detailed discussion of the best network that we find.
We conclude with a discussion of the implications of this work for future research in IE and general historical linguistics. Notes on the formal mathematical model of language evolution on networks and the computational approach are given in Appendix A. The full set of our coded data, together with a list of characters omitted and the reasons for their omission, is made available in an online appendix at http://www.cs.rice.edu/~nakhleh/CPHL; a selection is given in Appendix B.
2. INFERRING EVOLUTIONARY TREES. An evolutionary tree, or phylogeny, for a language family S describes the evolution of the languages in S from their most recent common ancestor. Different types of data can be used as input to methods of tree reconstruction; QUALITATIVE CHARACTER data, which reflect specific observable discrete characteristics of the languages under study, are one such type of data. Qualitative characters for languages can encode phonological, morphological, and lexical evidence, as described immediately below. Current approaches for subgrouping used in historical linguistics explicitly select characters that appear to have evolved without backmutation or parallel development; because of this, our analysis is based on a subset of the characters (eliminating those with clear parallel development, in particular). We have also found it advisable to eliminate characters that are POLYMORPHIC (those for which at least one language exhibits more than one state) because models of linguistic evolution involving polymorphic characters that are (at least provisionally) accepted as linguistically realistic have not yet been established.
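As an illustration of how qualitative characters might be stored and how polymorphic ones can be screened out, here is a minimal sketch. The data layout (a dictionary from character name to a per-language set of state labels), the choice of four languages, and the `toy_polymorphic` character are assumptions invented for this example; they are not the authors' dataset or coding. The ‘hand’ entry follows the usual grouping of English hand with German Hand and of French main with Spanish mano.

```python
# character name -> {language: set of state labels shown by that language}
characters = {
    # Lexical character 'hand': cognates of English "hand" share state 1,
    # cognates of French "main" share state 2.
    "hand": {"English": {1}, "German": {1}, "French": {2}, "Spanish": {2}},
    # Invented character in which one language shows two states at once.
    "toy_polymorphic": {"English": {1, 2}, "German": {1}, "French": {2}, "Spanish": {2}},
}


def is_polymorphic(coding):
    """True if at least one language exhibits more than one state."""
    return any(len(states) > 1 for states in coding.values())


def drop_polymorphic(chars):
    """Keep only characters for which every language shows exactly one state."""
    return {name: coding for name, coding in chars.items() if not is_polymorphic(coding)}


kept = drop_polymorphic(characters)
print(sorted(kept))  # ['hand']
```

Storing each language's value as a set makes polymorphism explicit in the data, so the filter is a one-line test and does not rely on a special marker value.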
Experience shows that it is easy to construct a comparative dataset using only qualitative characters that evolve without backmutation—that is, characters that never change from a given state to a second state (and potentially to a third, etc.) and then back to the given state (see Ringe et al. 2002:70). The relative absence of backmutation in linguistic data is partly the result of known properties of linguistic systems and language change and partly the result of probabilistic factors. Backmutation in phonological
characters is easy to avoid: since phonemic mergers are irreversible (Hoenigswald 1960: