Language topic

Forostar

Ancient Mariner
Because I wasn't sure if the following deserves its own topic, I thought we could have a general topic on anything language related (we used to play language "games" of sorts in other topics, and naturally all that could be done here as well).

I'd like to start with this:
- - - - - - -
The weirdest languages (source)

We’re in the business of natural language processing with lots of different languages. In the last six months, we’ve worked on (big breath): English, Portuguese (Brazilian and from Portugal), Spanish, Italian, French, Russian, German, Turkish, Arabic, Japanese, Greek, Mandarin Chinese, Persian, Polish, Dutch, Swedish, Serbian, Romanian, Korean, Hungarian, Bulgarian, Hindi, Croatian, Czech, Ukrainian, Finnish, Hebrew, Urdu, Catalan, Slovak, Indonesian, Malay, Vietnamese, Bengali, Thai, and a bit on Latvian, Estonian, Lithuanian, Kurdish, Yoruba, Amharic, Zulu, Hausa, Kazakh, Sindhi, Punjabi, Tagalog, Cebuano, Danish, and Navajo.

Natural language processing (NLP) is about finding patterns in language—for example, taking heaps of unstructured text and automatically pulling out its structure. The open secret about NLP is that it’s very English-centric. English is far and away the language that linguists have worked on the most and it’s also the language that has the most available resources for computer science projects (and more data is almost always better in computer science). So one of the best ways to test an NLP system is to try languages other than English. The better that a system can deal with diverse data, the more confident that you can be in its ability to handle unseen data.

To this end, we might choose to define “weirdness” in terms of English. But that’s a pretty irritating definition. Let’s try to do something different.


A global method for linguistic outliers
The World Atlas of Language Structures evaluates 2,676 different languages in terms of a bunch of different language features. These features include word order, types of sounds, ways of doing negation, and a lot of other things—192 different language features in total.

So rather than take an English-centric view of the world, WALS allows us to take a worldwide view. That is, we evaluate each language in terms of how unusual it is for each feature. For example, English word order is subject-verb-object (SVO). There are 1,377 languages that are coded for word order in WALS and 35.5% of them have SVO word order. Meanwhile only 8.7% of languages start with a verb, like Welsh, Hawaiian and Majang, so cross-linguistically, starting with a verb is unusual. For what it’s worth, 41.0% of the world’s languages are actually SOV order. (Aside: I’ve done some work with Hawaiian and Majang and that’s how I learned that verbs are a big commitment for me. I’m just not ready for verbs when I open my mouth.)

The data in WALS is fairly sparse, so we restrict ourselves to the 165 features that have at least 100 languages in them (at this stage we also knock out languages that have fewer than 10 of these—dropping us down to 1,693 languages).
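The sparsity filter described above can be sketched in a few lines of Python. This is only an illustration, not the author's code: the `data` layout (language name mapped to a dict of feature values), the function name `filter_sparse`, and the toy feature names are all assumptions, and the toy thresholds are far smaller than the 100 and 10 used in the article.

```python
from collections import Counter


def filter_sparse(data, min_langs_per_feature=100, min_features_per_lang=10):
    """Drop rarely coded features, then drop languages with too few remaining features.

    `data` is assumed to map language name -> {feature id: value}.
    """
    # Count how many languages are coded for each feature.
    coverage = Counter(f for feats in data.values() for f in feats)
    kept_features = {f for f, n in coverage.items() if n >= min_langs_per_feature}

    filtered = {}
    for lang, feats in data.items():
        kept = {f: v for f, v in feats.items() if f in kept_features}
        # Keep the language only if enough well-covered features remain.
        if len(kept) >= min_features_per_lang:
            filtered[lang] = kept
    return filtered


# Toy data with made-up features; thresholds lowered so the effect is visible.
toy = {
    "A": {"f1": 1, "f2": 1, "f3": 1},
    "B": {"f1": 1, "f2": 2},
    "C": {"f1": 2},
}
print(filter_sparse(toy, min_langs_per_feature=2, min_features_per_lang=2))
# {'A': {'f1': 1, 'f2': 1}, 'B': {'f1': 1, 'f2': 2}}
```

Here `f3` is dropped because only one language is coded for it, and language `C` is dropped because it has only one feature left.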

Now, one problem is that if you just stop there you have a huge amount of collinearity. Part of this is just the nature of the features listed in WALS—there’s one for overall subject/object/verb order and then separate ones for object/verb and subject/verb. Ideally, we’d like to judge weirdness based on unrelated features. We can focus in on features that aren’t strongly correlated with each other (between two correlated features, we pick the one that has more languages coded for it). We end up with 21 features in total.
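One way to picture the pruning step, as a rough sketch only: the article does not say how correlation was measured or what threshold was used, so the snippet below assumes the correlated pairs have already been identified and only shows the "keep the feature with more languages coded" rule. The function name `prune_correlated`, the feature names, and the counts are hypothetical.

```python
def prune_correlated(coverage, correlated_pairs):
    """Greedily keep features, preferring those coded for more languages.

    coverage: {feature: number of languages coded for it}
    correlated_pairs: iterable of (feature_a, feature_b) judged strongly correlated
    """
    related = {}
    for a, b in correlated_pairs:
        related.setdefault(a, set()).add(b)
        related.setdefault(b, set()).add(a)

    kept = []
    # Visit features from best to worst coverage (ties broken by name).
    for feature in sorted(coverage, key=lambda f: (-coverage[f], f)):
        # Skip a feature if a better-covered correlated feature was already kept.
        if not related.get(feature, set()) & set(kept):
            kept.append(feature)
    return kept


# Made-up coverage counts and a single made-up correlated pair.
coverage = {"F1": 300, "F2": 500, "F3": 200}
print(prune_correlated(coverage, [("F1", "F2")]))
# ['F2', 'F3']
```

`F1` loses out to `F2` because the two are marked as correlated and `F2` has more languages coded for it.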

For each value that a language has, we calculate the relative frequency of that value for all the other languages that are coded for it. So if we had included the feature for the order of subject, object, and verb, then English would’ve gotten a value of 0.355 (we actually normalized these values according to the overall entropy for each feature, so it wasn’t exactly 0.355, but you get the idea). The Weirdness Index is then an average across the 21 unique structural features. But because different features have different numbers of values and we want to reduce skewing, we actually take the harmonic mean (and because we want bigger numbers = more weird, we actually subtract the mean from one). In this blog post, I’ll only report languages that have a value filled in for at least two-thirds of features (239 languages).
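To make the calculation concrete, here is a minimal Python sketch of the index as described in the paragraph above: relative frequency of each value among the other languages, harmonic mean across features, subtracted from one. It is not the author's implementation. The entropy normalization is left out because the article gives no details, and the `data` layout, function names, and toy languages are assumptions.

```python
from collections import Counter
from statistics import harmonic_mean


def value_frequencies(data, feature, exclude):
    """Relative frequency of each value of `feature` among languages other than `exclude`."""
    counts = Counter(
        feats[feature]
        for lang, feats in data.items()
        if lang != exclude and feature in feats
    )
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}


def weirdness(data, language):
    """One minus the harmonic mean of how common the language's values are elsewhere."""
    freqs = []
    for feature, value in data[language].items():
        rel = value_frequencies(data, feature, exclude=language)
        if not rel:
            continue  # no other language is coded for this feature
        # A value shared with no other language counts as frequency 0.
        freqs.append(rel.get(value, 0.0))
    if not freqs:
        raise ValueError("no comparable features for " + language)
    return 1.0 - harmonic_mean(freqs)


# Toy data: five made-up languages, two features.
toy = {
    "A": {"order": "SOV", "question": "particle"},
    "B": {"order": "SOV", "question": "particle"},
    "C": {"order": "SOV", "question": "particle"},
    "D": {"order": "SVO", "question": "inversion"},
    "E": {"order": "SVO", "question": "inversion"},
}
print(weirdness(toy, "A"))  # 0.5
print(weirdness(toy, "D"))  # 0.75
```

Language `D` scores higher because each of its values is shared by only one of the four other languages, while each of `A`'s values is shared by two.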


The outlier (weirdest) languages
The language that is most different from the majority of all other languages in the world is a verb-initial tonal language spoken by 6,000 people in Oaxaca, Mexico, known as Chalcatongo Mixtec (aka San Miguel el Grande Mixtec). Number two is spoken in Siberia by 22,000 people: Nenets (that’s where we get the word parka from). Number three is Choctaw, spoken by about 10,000 people, mostly in Oklahoma.

But here’s the rub—some of the weirdest languages in the world are ones you’ve heard of: German, Dutch, Norwegian, Czech, Spanish, and Mandarin. And actually English is #33 in the Language Weirdness Index.

The 25 weirdest languages of the world. In North America: Chalcatongo Mixtec, Choctaw, Mesa Grande Diegueño, Kutenai, and Zoque; in South America: Paumarí and Trumai; in Australia/Oceania: Pitjantjatjara and Lavukaleve; in Africa: Harar Oromo, Iraqw, Kongo, Mumuye, Ju|’hoan, and Khoekhoe; in Asia: Nenets, Eastern Armenian, Abkhaz, Ladakhi, and Mandarin; and in Europe: German, Dutch, Norwegian, Czech, and Spanish.

By the way, how awesome of a name is “Pitjantjatjara”? (Also: can you guess which one of the internal syllables is silent?)


Questions and pronouns: two example features
This is odd. Is this odd? One of the features that distinguishes languages is how they ask yes/no questions. The vast majority of languages have a special question particle that they tack on somewhere (like the ka at the end of a Japanese question). Of 954 languages coded for this in WALS, 584 of them have question particles. The word order switching that we do in English only happens in 1.4% of the languages. That’s 13 languages total and most of them come from Europe: German, Czech, Dutch, Swedish, Norwegian, Frisian, English, Danish, and Spanish.

But there is an even more unusual way to deal with yes/no questions and that’s what Chalcatongo Mixtec does: which is to do nothing at all. It is the only language surveyed that does not have a particle, a change of word order, a change of intonation… There is absolutely no difference between an interrogative yes/no question and a simple statement. I have spent part of the day imagining a game show in this language.

Another thing languages have to deal with is what to do with simple subjects like I, they, or it. These are called pronominal subjects (something like The minister prevaricated has a nominal subject). The most common way to do this is to just tack the information about the subject on to the verb—437 out of 711 languages do this, like Spanish, Italian, and Portuguese. But Dutch, German, and Norwegian—like English—prefer having special subject pronouns that are normally/obligatorily present. But this is only done by 82 of the 711 languages coded in WALS. Kutenai (100 speakers in British Columbia, Canada) and Mumuye (400,000 speakers in Nigeria) do something even more unusual: they have something like subject pronouns but these go in different positions in the syntax than where full noun phrases go. And even more unusual than this is Chalcatongo Mixtec again: they combine several strategies so they have both subject markers that they add to verbs and they have pronoun words, too. But these pronoun words appear in a different spot from where a full noun phrase would show up.


The 5 least weird languages in the world
Now if I asked you to consider these languages, how weird would you say they were? Lithuanian, Indonesian, Turkish, Basque, and Cantonese. Surprise! They are really low on the Weirdness Index. They don’t seem typical to linguists and language learners but for these 21 features they stick with the crowd. Notice that we get isolates (like Basque) distributed throughout levels of Weirdness. Basque is “typical” but Kutenai, another isolate, is one of the weirdest of all languages. Even more surprising is that Mandarin Chinese is in the top 25 weirdest and Cantonese is in the bottom 10. This has to do with the fact that they have different sounds: Mandarin, unlike Cantonese, has uvular continuants and has some limits on “velar nasals” (like English, Mandarin can have a sound like the ng at the end of song but it can’t have that sound at the beginning of words; worldwide it’s rare to have that particular restriction).

At the very very bottom of the Weirdness Index there are two languages you’ve heard of and three you may not have: Hungarian, normally renowned as a linguistic oddball, comes out as totally typical on these dimensions. (I got to live in Budapest last summer and I swear that Hungarian does have weirdnesses, it just hides them in other places.) Chamorro (a language of Guam spoken by 95,000 people), Ainu (just a handful of speakers left in Japan, it is nearly extinct), and Purépecha (55,000 speakers, mostly in Mexico) are all very normal. But the very most super-typical, non-deviant language of them all, with a Weirdness Index of only 0.087, is Hindi, which has only a single weird feature.

Part of this is to say that some of the languages you take for granted as being normal (like English, Spanish, or German) consistently do things differently than most of the other languages in the world. It reminds me of one of the basic questions in psychology: to what extent can we generalize from research studies based on university students who are, as Joseph Henrich and his colleagues argue, Western Educated Industrialized Rich and Democratic. In other words: sometimes the input is WEIRD and you need to ask yourself how that changes things.


You’re weird
Even though the methods here don’t define things in terms of English, they still smuggle in some cultural-specificity. That is, the linguists who developed and annotated the features were mostly speakers of European languages. What features might a person from Papua New Guinea or Ethiopia or the Amazon have come up with instead? And of course, WALS doesn’t have any data at all on about 4,000 languages. And the languages that it has the most data for are not truly random.

Despite this, English still ranks as highly unusual (it comes in as #33 with an index value of 0.756). That English-speaking brain you’ve been using to read this? It’s wired weird.

- Tyler Schnoebelen (@TSchnoebelen)


Appendix: The tops and bottoms
Here are the values for the top and bottom 10 languages (see http://idibon.com/the-weirdest-languages/ for a better presentation).
Rank, Language, Weirdness Index
1, Mixtec (Chalcatongo), 0.972
2, Nenets, 0.935
3, Choctaw, 0.924
4, Diegueño (Mesa Grande), 0.920
5, Oromo (Harar), 0.919
6, Kutenai, 0.908
7, Iraqw, 0.900
8, Kongo, 0.883
9, Armenian (Eastern), 0.861
10, German, 0.858
230, Basque, 0.189
231, Bororo, 0.153
232, Quechua (Imbabura), 0.151
233, Usan, 0.151
234, Cantonese, 0.143
235, Hungarian, 0.132
236, Chamorro, 0.128
237, Ainu, 0.128
238, Purépecha, 0.100
239, Hindi, 0.087

Update: Here is the full list, with the 21 weirdness features and all of the languages that had values for at least one of them (don’t trust those values, of course). Weirdness_index_values_full_list
- - - - - - -

I can identify a lot with the first comment under the article ...

I love this article! I had been wondering about this. Being Dutch myself, and teaching Dutch to immigrants, I had been wondering also if the mistakes that are still made typically by immigrants who have learned Dutch thoroughly and have been here for a long time, define some of the weirdness of Dutch (mixing up the definite articles 'de' and 'het' and the use of the elusive word "er").

... but I also recommend reading the other comments. Very interesting!
 
....... We end up with 21 features in total.

For each value that a language has, we calculate the relative frequency of that value for all the other languages that are coded for it. So if we had included the feature for the order of subject, object, and verb, then English would’ve gotten a value of 0.355 (we actually normalized these values according to the overall entropy for each feature, so it wasn’t exactly 0.355, but you get the idea). The Weirdness Index is then an average across the 21 unique structural features. .......
Here's the full list of features, from the spreadsheet:
83A: Order of Object and Verb
87A: Order of Adjective and Noun
143A: Order of Negative Morpheme and Verb
143G: Minor morphological means of signaling negation
69A: Position of Tense-Aspect Affixes
116A: Polar Questions
57A: Position of Pronominal Possessive Affixes
101A: Expression of Pronominal Subjects
6A: Uvular Consonants
71A: The Prohibitive
129A: Hand and Arm
130A: Finger and Hand
44A: Gender Distinctions in Independent Personal Pronouns
14A: Fixed Stress Locations
9A: The Velar Nasal
72A: Imperative-Hortative Systems
111A: Nonperiphrastic Causative Constructions
64A: Nominal and Verbal Conjunction
124A: 'Want' Complement Subjects
117A: Predicative Possession
19A: Presence of Uncommon Consonants
 
Turkish is known to be a very regular language. In fact, there's only one irregular verb in the whole language. It's no surprise Turkish is ranked low in the weirdness count. How difficult it is for English speakers to learn, however, is another story. It's one of the toughest languages that use the Latin alphabet for an English speaker. Alphabet differences always play a big role, so obviously Arabic, Mandarin, Cantonese, Japanese, Hebrew etc. are difficult to learn for an English speaker. But from what I've seen and read so far, among the Latin alphabet languages, only Polish and Finnish are as hard as Turkish to learn.

One of the trickiest parts of our language is the extensive use of agglutination. A common example is this: Muvaffakiyetsizleştiriverebileceklerimizdenmişcesine. This is pronounced as one word in Turkish and means "as if you were one of those whom we could make unsuccessful". Here's another example I've found on Wikipedia:

ev - house
evler - houses
evin - your house
eviniz - your house (plural)
evim - my house
evimde - at my house
evlerinizin - of your houses
evlerinizden - from your houses
evlerinizdendi - it was from your houses
evlerinizdenmiş - it's said to be from your houses
evinizdeyim - I'm at your house
evinizdeymişim - I was apparently at your house
evinizde miyim? - am I at your house?

And an example of word formation by agglutination:

yat - lie down
yatık - leaning
yatak - bed
yatay - horizontal
yatkın - inclined to
yatırmak - to lay down
yatırım - investment

The other tricky part is obviously the pronunciation.
 
How many languages do you speak?

Fluent: English and German. Semi-fluent: French and Persian. Basic knowledge: Kurdish (Kurmancî and Soranî).
Old languages (in descending order of expertise): Old Persian, Bactrian, Middle Persian, Latin, Young Avestan.
I also got a (very) basic introduction to Manichaean Parthian, Sogdian, Khotanese Saka and Khwarezmian.

Planning to improve on some of those (especially Latin, French and Young Avestan). Taking a Pashto course next semester. If I find the time, I'd like to gather some basic knowledge of Sanskrit, ancient Greek, Russian and Old (Gathic) Avestan within the next year or so.
 
I can speak Serbian (duh!), Montenegrin, Bosnian, Croatian and English fluently :) I studied German for years but haven't used it in a while. I suppose I could understand people if they talked slower. I also know a lot of Spanish, but I never studied it.
 
I forgot Pig Latin ... I am all over that. I am also praying no one answers Klingon Ka plaa!
 
I had a discussion recently with someone about accents, which is interesting. For example, I was born near Chicago and have lived equal parts there, California, and Texas (with a short stop in Hawaii). When I am with family, I tend to talk more Chicagoan, around Texas with more of a drawl. But I noticed when I was in Germany my Chicagoan yeaaah became a German Ja! not intentionally, but perhaps trying to fit into my surroundings.
 
I'm sure there's going to be someone who says Elfish <_<.
As for me... English, and the odd word/phrase of Spanish/French left over from school that I would recognise if spoken, but couldn't give you if you asked for it.
 
Yes! :D The Montenegrin language even got new letters like two years ago :D

I like that. Maybe Montenegrin should also introduce the injunctive. That should draw some attention to it, because every language is cool when it goes Sanskrit. I also like that it has an aorist, maybe they should put that in the foreground, and ditch the imperfect and future II. Having seven tenses is so 50 BC. Let's talk about the cases: What about something unconventional, like merging genitive and dative and getting rid of the vocative? It would make the language a lot more likeable.
 