Collocates data


The Corpus of Contemporary American English (COCA) is the most widely-used corpus in the world. In March 2020 it was updated for the last time (with data up through Dec 2019), and the collocates data from the corpus was updated in April 2020. The following are the major changes and improvements in the collocates data.
 
Feature | Previous | COCA 2020
Corpus: size | 400 million words | More than twice as large, at one billion words
Corpus: how up to date | Texts from 1990 - ~2012 | The most recent texts are from Dec 2019. There are 20 million words each year from 1990-2019 (+ about 240 million words from blogs and other websites from 2013). So there are about 600 million new words of data since the previous data was released in 2012.
Corpus: genres | Spoken, fiction, magazine, newspaper, academic | Same five genres as before (with about 120-130 million words per genre), plus three new genres:
-- Blog posts and other web pages (120-130 million words for each of these two genres). So much of what we consume nowadays comes from the web, and these genres include many words that don't occur much elsewhere (e.g. ebook, webpage, browsing, password, template, meme, snarky, off-topic, downloadable, open-source, updated, (to) monetize, upgrade, debunk, archive, pirate).
-- TV and movie subtitles (130 million words). This is by far the most informal language we've ever had in COCA. Many studies (e.g. A, B, and C) show that the data from subtitles agrees with native speaker intuitions about their language even better than the data from actual everyday conversation (such as that in the BNC). Until now, COCA didn't really have this highly informal language.
Data | 4.3 million node / collocate pairs for the top 60,000 lemmas | 13.5 million node / collocate pairs for the top 60,000 lemmas. Because the new corpus is much larger, there are many more node / collocate pairs with the minimum frequency, especially for lower-frequency words.
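
To illustrate why a larger corpus yields more node / collocate pairs that reach a minimum frequency, here is a minimal sketch in Python. It is not the procedure used to build the COCA collocates data: the function name `collocate_pairs`, the window of four words on each side, and the threshold of two occurrences are all hypothetical choices made for this example, and it counts raw word forms rather than lemmas.

```python
from collections import Counter


def collocate_pairs(tokens, window=4, min_freq=2):
    """Count (node, collocate) pairs within +/- `window` tokens of each node
    and keep only the pairs that occur at least `min_freq` times.

    `window` and `min_freq` are illustrative values, not COCA's settings.
    """
    counts = Counter()
    for i, node in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own collocate
                counts[(node, tokens[j])] += 1
    return {pair: n for pair, n in counts.items() if n >= min_freq}


# Toy data: the "large" corpus is just the small one repeated three times,
# standing in for a corpus with more text of the same kind.
small = "strong tea with strong coffee and weak tea".split()
large = small * 3

pairs_small = collocate_pairs(small)
pairs_large = collocate_pairs(large)

# Pairs seen only once in the small corpus fall below the threshold there,
# but reach it once more text is available.
print(len(pairs_small), len(pairs_large))
```

In this toy data the pair ("with", "tea") occurs only once in the small sample and is filtered out, but it clears the threshold in the larger one. The same effect, at scale, is why lower-frequency words gain the most pairs when the corpus grows.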