
How Corpora Changed Linguistics

Oscar Torres

A long time ago in a German forest far, far away, two brothers who by chance happened to share the surname “Grimm” noticed something odd. They had roamed far and wide across their Northern European homelands, and even a bit further, to collect stories. They were surprised by how similar some of the stories were (they did study Germanic and Norse cultures, after all), but one of these brothers, whom we’ll call Jacob (or Jakie to his friends), noticed a stranger pattern still.

The old chap realized that whenever a German word ended in a strong S sound, as in “Fuß”, the same word would have a T in Swedish (“fot”) and English (“foot”). The same happened with other consonants: he found correspondences between German fricatives (sounds like F or S, where the air is squeezed out through a narrow gap) and Swedish or English plosives (sounds like P or T, where the air is stopped completely and then released in a burst).

The interesting thing is that he could only spot the pattern because he had inadvertently built a comparative corpus of folklore in Germanic languages and had spent enough time reading the stories and writing them down. Today, we’ll be talking about why collections of texts like the Brothers Grimm’s are key to our modern understanding of linguistics and will only continue to grow in importance.

  • So, what exactly is a corpus?

A corpus (“corpora” in the plural) is essentially a set of deliberately selected language samples, be they written or spoken, that linguists use to study patterns or test theories. Corpora are not exactly new: historical linguistics has long used collections of manuscripts to understand how languages arrived at their current state, noting little changes in spelling over the centuries. However, the understanding that we have of them nowadays has only existed since the 1960s, coinciding with the rise of computers and statistical analysis, and later the internet.

In the olden days, linguists like our dear Jakie had to do their research through traditional “horizontal” or “line-by-line” reading. This meant long, arduous hours of comparing texts and keeping a tally of each time he noticed that a German F matched a Swedish P. With statistical software, however, linguists in the 1960s and 1970s could read “vertically” to find patterns.

Imagine a spreadsheet with thousands of text samples, one in each row. To find hundreds of matches, all they had to do was press a button, and the program would “align” every instance of the word “apple” in the different languages. They could then see that High German varieties spell it something like “Apfel”, while other Germanic languages such as English, Dutch, and Swedish have something like “apple”, “appel”, or “äpple”. Jakie sure would have loved to have this back in the 19th century.
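To make the idea of “vertical” reading a bit more concrete, here is a minimal sketch of a keyword-in-context view in Python. It is only an illustration, not the software linguists actually used at the time: the three sample sentences (the same proverb in German, English, and Swedish), the `KEYWORDS` lookup, and the function names `kwic` and `vertical_view` are all made up for this example.

```python
import re

# Toy samples (the same proverb in three languages); a real corpus
# would hold thousands of rows like these.
SAMPLES = {
    "German": "Der Apfel fällt nicht weit vom Stamm.",
    "English": "The apple does not fall far from the tree.",
    "Swedish": "Äpplet faller inte långt från trädet.",
}

# Hypothetical lookup of the search word in each language.
KEYWORDS = {"German": "Apfel", "English": "apple", "Swedish": "äpple"}


def kwic(text, keyword, width=15):
    """Return one line per hit, with the keyword starting at a fixed column."""
    lines = []
    for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so every hit lines up vertically.
        lines.append(f"{left:>{width}}[{match.group()}]{right}")
    return lines


def vertical_view(samples, keywords, width=15):
    """Stack the hits from every language so they can be read downwards."""
    rows = []
    for language, text in samples.items():
        for line in kwic(text, keywords[language], width):
            rows.append(f"{language:<8}{line}")
    return rows


if __name__ == "__main__":
    print("\n".join(vertical_view(SAMPLES, KEYWORDS)))
```

Because the left context is padded to a fixed width, the bracketed keyword lands in the same column on every row, which is what lets the eye run down the page instead of across it.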

Corpora are often used to compare languages, but they can also be used to track the changes in a language as they occur by taking recordings every year, to help create standardized guides for learners of a language by finding the most common (or “normative”) vocabulary, and so much more.
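As a rough illustration of the “most common vocabulary” use case, a frequency list takes only a few lines of Python. This is a sketch under simple assumptions: the sample sentence is invented, the function name `word_frequencies` is hypothetical, and splitting on `\w+` is a very crude stand-in for the tokenization a real corpus tool would do.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each lowercased word form occurs in the text."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)


# Invented sample; a learner's guide would be built from a far larger corpus.
sample = "the cat sat on the mat and the dog sat too"
top_words = word_frequencies(sample).most_common(2)
```

Run over a large and balanced corpus instead of one sentence, the top of such a list is a reasonable starting point for deciding which words a learner should meet first.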

  • The corpus revolution

Let’s say we are a group of linguists prior to the 1960s who want to test a theory, for example that all adverbs in English can be made from an adjective plus the suffix “-ly”. If we have enough funding, we could probably buy a couple dozen general or purpose-specific dictionaries and study them long and hard over the course of a few months. We might reach a conclusion, but we wouldn’t have nearly enough real-world data to back it up with statistically significant evidence.

But what if we have access to computers that we can just pour all the data into and have them churn out results? Obviously, that would be quite a bit more efficient in terms of time and money. On top of that, soon after PCs arrived, the internet made corpora not only easier to work with but also easier to access and compile, now that large databases and digitized versions of documents and texts were available to a previously unimaginable extent.
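Here is a hedged sketch of what “churning out results” could look like for the “-ly” theory. The word list and its part-of-speech tags are a tiny hand-made sample, not real corpus data, and the name `check_ly_hypothesis` is invented for this example. Checking whether a word ends in “ly” is also only a rough proxy for “adjective plus -ly”.

```python
# Tiny hand-tagged sample (word, part of speech); purely illustrative.
TAGGED_SAMPLE = [
    ("quickly", "ADV"), ("very", "ADV"), ("often", "ADV"),
    ("slowly", "ADV"), ("well", "ADV"), ("fast", "ADV"),
    ("soon", "ADV"), ("happily", "ADV"),
    ("friendly", "ADJ"), ("family", "NOUN"),
]


def check_ly_hypothesis(tagged_words):
    """Return the share of adverbs ending in -ly and the counterexamples."""
    adverbs = [word for word, tag in tagged_words if tag == "ADV"]
    if not adverbs:
        raise ValueError("the sample contains no adverbs")
    with_ly = [word for word in adverbs if word.endswith("ly")]
    counterexamples = sorted(set(adverbs) - set(with_ly))
    return len(with_ly) / len(adverbs), counterexamples


share, counterexamples = check_ly_hypothesis(TAGGED_SAMPLE)
```

Even this toy sample turns up adverbs such as “very” and “well” that the theory cannot account for, and words such as “friendly” show that the ending alone does not make an adverb. A real test would run the same kind of count over millions of tagged words.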

Millions or billions of words in a single corpus, and soon perhaps trillions, can be accessed in seconds. These databases kept growing while linguists reached a moment of “crisis”, as the previous methods proved inadequate for studying data at this scale. Earlier researchers’ main obstacle was acquiring enough data to demonstrate theories or draw conclusions, but researchers in the 1990s had to develop new methods in all subfields and paradigms of linguistics to cope with the sheer magnitude of the data. All of this led to changes in the quality criteria for evidence, the emergence of patterns that nobody had thought of before, the possibility of testing theories that had previously just been assumed, and the questioning of the philosophy around knowledge itself.

  • Final word

Linguists who did research while the field was changing have collectively paved the way for us newcomers by developing tools to work with corpora and syllabuses that include training in them, not to mention that they had to adapt all their existing methods “on the fly” to be able to keep doing research.

The main point I take away from all this is that it is nearly impossible to carry out research in linguistics without proper training in what corpora are and how they are used. Corpora are a resource that a student arriving in the field takes for granted, and the fact that there is still much more to explore and understand about their possibilities leaves many doors open for a dissertation or thesis.

To conclude, the pioneers of the past century left us with so many questions, and now it’s our turn to answer them. What does it mean to be “right”? How many matches do I need to say that something is significant? In what contexts? When do I stop adding data? What samples can I use that won’t skew my results? These and many other questions are issues that we are still dealing with nowadays, and will probably keep dealing with for years to come.
