[Colloquium] Linguistic Universals and Diversity: Usage Data as Evidence and Explanation

Usage data are central to functionally and cognitively oriented typology. They allow us to compare languages more precisely while avoiding data reduction. They can also help us discover new cross-linguistic generalizations and test, rethink and explain existing ones. In my talk, I will present two case studies that demonstrate the importance of usage data. In the first, I debunk the old myth that the given/new distinction strongly influences word order in flexible languages. Using spontaneous corpus data from German, Russian and Tagalog, modelled with the help of deep and shallow machine learning, I show that these languages are much closer to fixed-order languages, such as English and Mandarin Chinese, than one would expect. The second case study concerns differential object marking. The cross-linguistic generalizations that we find in the literature and test with the help of cross-linguistic databases are explained by the fact that human, definite and given arguments are less expected to occur as objects than as transitive subjects. It is therefore communicatively efficient to mark them formally when they do occur as objects. Unlike some related corpus frequencies, these discourse preferences are robust across languages and text types, which makes them a plausible explanation for the cross-linguistic generalizations.