«This study uses a generalized additive mixed-effects regression model to predict lexical differences in Tuscan dialects with respect to standard ...»
LEXICAL DIFFERENCES BETWEEN TUSCAN DIALECTS AND STANDARD
ITALIAN: ACCOUNTING FOR GEOGRAPHIC AND SOCIODEMOGRAPHIC
VARIATION USING GENERALIZED ADDITIVE MIXED MODELING
MARTIJN WIELING
University of Groningen and University of Tübingen
SIMONETTA MONTEMAGNI
Istituto di Linguistica Computazionale ‘Antonio Zampolli’, CNR
JOHN NERBONNE
R. HARALD BAAYEN
1. Introduction. In spite of their different origins and histories, it is nowadays a widely acknowledged fact that traditional dialectology (to be understood here as dialect geography) and sociolinguistics (or urban dialectology) can be seen as two streams of a unique and coherent discipline: modern dialectology (Chambers & Trudgill 1998).
Chambers and Trudgill (1998:187–88) describe the convergence of these two historically separated disciplines as follows:
For all their differences, dialectology and sociolinguistics converge at their deepest point. Both are dialectologies, so to speak. They share their essential subject matter. Both fix the attention on language in communities. Prototypically, one has been centrally concerned with rural communities and the other with urban centres, but these are accidental differences, not essential ones and certainly not axiomatic.
… A decade or two ago, it might have been possible to think that the common subject matter of dialectology and sociolinguistics counted for next to nothing. Now we know it counts for everything.
The research reported in this article was carried out in the framework of the Short Term Mobility program of international exchanges funded by the National Research Council (CNR, Italy). This research has also benefited from the Rubicon grant awarded to Martijn Wieling by the Netherlands Organisation for Scientific Research (NWO) and the Alexander von Humboldt Professorship awarded to R. Harald Baayen.
In practice, however, dialectology and sociolinguistics remain separate fields when considering the methods and techniques used for analyzing language variation and change.
Sociolinguistics—whose basic goal consists of identifying the social factors underlying the use of different variants of linguistic variables—has adopted a quantitative approach to data analysis since its inception (e.g. Labov 1966). Over time, different methods for the analysis of linguistic variation were developed, capable of modeling the joint effect of an increasing number of factors related to the social background of speakers (including age, gender, socioeconomic status, etc.) and linguistic features.
Early studies focused on simple relationships between the value of a linguistic variable and the value of a social variable (see e.g. Labov 1966, 1972); later work adopted more advanced statistical methods.
Since the 1970s, the most common method in sociolinguistic research has been logistic regression (Cedergren & Sankoff 1974), and more recently, mixed-effects regression models have been applied to sociolinguistic data (Johnson 2009, Tagliamonte & Baayen 2012, Wieling et al. 2011).
Traditional dialectology shows a different pattern. Beginning with its origin in the second half of the nineteenth century, it typically relied on the subjective analysis of categorical maps charting the distribution of the different variants of a linguistic variable across a region. Only later, during the last forty years, have quantitative methods been applied to the analysis of dialect variation. This quantitative approach to the study of dialects is known as dialectometry (Goebl 1984, 2006, Nerbonne et al. 1996, Nerbonne 2003, Nerbonne & Kleiweg 2007, Séguy 1973). Dialectometric methods focus mostly on identifying the most important dialectal groups (i.e. in terms of geography) using an aggregate analysis of the linguistic data. The aggregate analysis is based on computing the distance (or similarity) between every pair of locations in the data set based on the complete set of linguistic variables and analyzing the resulting linguistic distance (or similarity) matrix using multivariate statistics to identify aggregate geographical patterns of linguistic variation.
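The aggregate step described above can be illustrated with a minimal sketch. It assumes categorical data and uses the share of linguistic variables on which two locations differ as the distance; the location names, variant labels, and function names are hypothetical, and actual dialectometric studies may use other measures (e.g. string edit distances for pronunciation data).

```python
from itertools import combinations

# Hypothetical toy data: for each location, the variant attested
# for each of four linguistic variables (same order everywhere).
variants = {
    "A": ["x1", "y1", "z1", "w1"],
    "B": ["x1", "y2", "z1", "w1"],
    "C": ["x2", "y2", "z2", "w1"],
}


def aggregate_distance(a, b):
    """Share of linguistic variables on which two locations differ."""
    if len(a) != len(b):
        raise ValueError("locations must cover the same variables")
    return sum(u != v for u, v in zip(a, b)) / len(a)


def distance_matrix(data):
    """Symmetric location-by-location distance matrix as nested dicts."""
    names = sorted(data)
    matrix = {n: {m: 0.0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        d = aggregate_distance(data[n], data[m])
        matrix[n][m] = matrix[m][n] = d
    return matrix


matrix = distance_matrix(variants)
```

The resulting matrix is the kind of object that would then be passed to a multivariate technique such as clustering. Because every variable is collapsed into a single number per pair of locations, the contribution of any individual variable is no longer visible in it.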
While viewing dialect differences at an aggregate level arguably provides a more comprehensive and objective view than the analysis of a small number of subjectively selected features (Nerbonne 2009), the aggregate approach has never fully convinced linguists of its usefulness because it fails to identify the linguistic basis of the identified groups (see e.g. Loporcaro 2009). By initially aggregating the values of numerous linguistic variables, traditional dialectometric analyses offer no direct method for testing whether and to what extent an individual linguistic variable contributes to observed patterns of variation. Recent developments in dialectometric research have tried to reduce the gap between models of linguistic variation that are based on quantitative analyses and more traditional analyses that are based on specific linguistic features. Wieling and Nerbonne (2010, 2011) proposed a new dialectometric method, the spectral partitioning of bipartite graphs, to cluster linguistic varieties and simultaneously determine the underlying linguistic basis. Originally applied to Dutch dialects, this method was also successfully tested on English (Wieling et al. 2013, Wieling et al. 2014) and Tuscan (Montemagni et al. 2012) dialects. Unfortunately, these methods still disregard social factors, and only take into account the influence of geography.
While some attempts have been made, social and spatial analyses of language are still far from being integrated. Britain (2002) reports that, on the one hand, sociolinguistics fails to incorporate the notion of spatiality in its research. On the other hand, dialectometry mainly focuses on dialect geography and generally disregards social factors. The few exceptions indeed ‘prove’ the proverbial rule. Montemagni and colleagues (2013) and Valls and colleagues (2013) included in their dialectometric analyses social factors concerning the difference between age classes or urban versus rural communities. Unfortunately, the effect of these social factors was evaluated by simply comparing maps visually, as opposed to statistically testing the differences. Another relevant aspect on which the sociolinguistic and dialectometric perspectives do not coincide concerns the role of individual features, which are central in sociolinguistics, but are typically and programmatically disregarded in dialectometry. These issues demonstrate that there is an increasing need for statistical methods capable of accounting for both the geographic and sociodemographic variation, as well as for the impact and role of individual linguistic features.
The present study is methodologically ambitious for its attempt to combine dialectometric and sociolinguistic perspectives along the lines depicted above. The statistical analysis methods we employ enable the incorporation of candidate explanatory variables based on social, geographical, and linguistic factors, making it a good technique to facilitate the intellectual merger of dialectology and sociolinguistics (Wieling 2012).
The starting point is the Wieling et al. 2011 study, which proposed a novel method using a generalized additive model in combination with a mixed-effects regression approach to simultaneously account for the effects of geographical, social, and linguistic variables. A basic generalized additive model was used to represent the global geographical pattern, which was employed in a second step as a predictor in a linear mixed-effects regression model. This model predicted word-pronunciation distances from the standard language to 424 Dutch dialects, and it turned out that the geographical location of the communities, several location-related predictors (i.e. community size and average community age), and word-related factors (i.e. word frequency and category) were all significant predictors. While the Wieling et al. 2011 study includes social, lexical, and geographical information, a drawback of that study is that only a single speaker per location was considered, limiting the potential influence of speaker-related variables.
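The two-step structure just described can be written schematically as follows. This is only a sketch: the symbols are introduced here for illustration, and the random intercepts for word ($b_w$) and location ($b_l$) are an assumption about the random-effects structure, not a restatement of the 2011 model specification.

```latex
% Step 1: a generalized additive model captures the global geographical pattern
% with a smooth function f of longitude and latitude.
% Step 2: the fitted surface \hat{f} enters a linear mixed-effects model as an
% ordinary predictor, next to location-related (x_l) and word-related (z_w) predictors.
\begin{align}
  d_{wl} &= \beta_0 + f(\mathrm{lon}_l, \mathrm{lat}_l) + \varepsilon_{wl} \\
  d_{wl} &= \alpha_0 + \alpha_1 \hat{f}(\mathrm{lon}_l, \mathrm{lat}_l)
            + \boldsymbol{\gamma}^{\top}\mathbf{x}_l
            + \boldsymbol{\delta}^{\top}\mathbf{z}_w
            + b_w + b_l + \varepsilon_{wl}
\end{align}
```

Here $d_{wl}$ stands for the pronunciation distance from the standard language for word $w$ at location $l$. In this form the geographical surface is estimated separately from the other predictors, rather than jointly with them.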
In this article, we present an extended analytical framework that was tested on an interesting case study: Tuscan lexical variation with respect to standard Italian. There are three clear and important differences with respect to the Wieling et al. 2011 study. First, since the software available for generalized additive mixed-effects regression modeling has improved significantly since that time, we are able to advance on that approach by constructing a single generalized additive mixed-effects regression model. This is especially beneficial since we are now in a position to better assess the effect of concept frequency, a variable that has largely been ignored in dialectological studies but is highly relevant as it ‘may affect the rate at which new words arise and become adopted in populations of speakers’ (Pagel et al. 2007:717). Second, in this study we focus on lexical variation rather than variation in pronunciation. We therefore do not try to predict dialect distances, but rather a binary value indicating whether the lexicalization of a concept is different (1) or equal (0) with respect to standard Italian. A benefit of this approach is that it is more in line with standard sociolinguistic practice, which also focuses on binary distinctions. Third, since we take into account multiple speakers per location, we are in an improved position to investigate the contribution of speaker-related variables such as age and gender.
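The binary coding of the dependent variable can be sketched as below. The concept labels, forms, speaker identifiers, and function name are hypothetical placeholders rather than data from the atlas, and the lowercase normalization is an assumption made only for this illustration.

```python
# Hypothetical placeholder data: the standard Italian lexicalizations
# accepted for each concept, and the forms attested by speakers.
standard_forms = {
    "concept_1": {"form_a"},
    "concept_2": {"form_b", "form_c"},
}

responses = [
    # (speaker, concept, attested form)
    ("speaker_1", "concept_1", "form_a"),
    ("speaker_1", "concept_2", "form_x"),
    ("speaker_2", "concept_1", "form_y"),
    ("speaker_2", "concept_2", "Form_C"),
]


def differs_from_standard(concept, form):
    """Return 1 if the attested form is not a standard lexicalization, else 0."""
    normalized = form.strip().lower()
    return 0 if normalized in standard_forms[concept] else 1


# One binary observation per speaker-concept pair.
coded = [(s, c, differs_from_standard(c, f)) for s, c, f in responses]

# Overall share of non-standard lexicalizations in the toy data.
share_nonstandard = sum(y for _, _, y in coded) / len(coded)
```

Each row of `coded` corresponds to one observation of the kind a logistic-type regression would take as its response, with speaker, concept, and location information available as predictors or grouping factors.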
LANGUAGE, VOLUME 90, NUMBER 3 (2014)
The Tuscan dialect case study we use to investigate the potential of this new method (integrating social, geographical, and lexical factors) is a challenging one. In Italy a complex relationship exists between the standard language and dialects due to the history of this language and the circumstances under which Italy achieved political unification in 1861, much later than most European countries. In Tuscany, a region with a special status among Italian dialects, the situation is even more complex, since standard Italian is based on Tuscan, and in particular on the Florentine variety, which achieved national and international prestige from the fourteenth century onward as a literary language and only later (after the Italian Unification, and mainly in the twentieth century) as a spoken language. Standard Italian, however, has never been identical to genuine Tuscan and is perhaps best described as an ‘abstraction’ increasingly used for general communication purposes. The aim of this study, therefore, is to investigate this particular relationship between Italian and Tuscan dialects. We focus on lexical variation in Tuscan dialects compared to standard Italian with the goal of defining the impact, role, and interaction of a wide range of factors (i.e. social, lexical, and geographical) in determining lexical choice by Tuscan dialect speakers. The study is based on a large set of dialect data consisting of the lexicalizations of 170 concepts attested by 2,060 speakers in 213 Tuscan varieties drawn from the Atlante lessicale Toscano (‘Lexical atlas of Tuscany’; Giacomelli et al. 2000).
After discussing the special relationship between standard Italian and the Tuscan dialects in the next section, we describe the Tuscan dialect data set, followed by a more in-depth explanation of the generalized additive modeling procedure, our results, and the implications of our findings.
2. Tuscan dialects and standard Italian. As pointed out by Berruto (2005), Italy’s dialetti do not correspond to the same type of entity as, for example, the English dialects. Following the Coserian distinction among primary, secondary, and tertiary dialects (Coseriu 1980), the Italian dialects are to be understood as primary dialects (i.e.