Word frequency lists and dictionary

The n-grams are intended primarily for use in (computational) linguistics, for language modeling and processing. Apart from the Google n-grams sets, we are not aware of any publicly accessible n-gram dataset drawn from a corpus as large as the Corpus of Contemporary American English (COCA). For most people, the COCA n-grams data is probably more usable than the Google data, because it is small enough to fit and run on something other than a high-end workstation or a supercomputer. In addition, the COCA n-grams provide lemma and part-of-speech information, while the Google n-grams are just strings of words.

Feel free to take a look at a sample of the n-grams data. It contains nearly 200,000 3-grams for 400 different words, where each n-gram appears at least ten times in the corpus. Of course, this is just a tiny fraction of the full n-grams set that is available for purchase, which has all 3-grams (including those that occur just once) for all words. The full dataset of 150,000,000 n-grams is $195 academic / $395 commercial.
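
For readers new to n-grams, the following is a minimal Python sketch of how 3-grams annotated with lemma and part of speech could be counted and then filtered by a minimum frequency, as in the sample described above. It is only an illustration: the `(word, lemma, pos)` tuple layout, the tag names, and the function names `count_ngrams` and `filter_by_frequency` are assumptions made for this example and do not reflect the actual format of the COCA files.

```python
from collections import Counter


def count_ngrams(tokens, n=3):
    """Count every run of n consecutive tokens in the sequence."""
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


def filter_by_frequency(counts, min_freq):
    """Keep only the n-grams that occur at least min_freq times."""
    return {gram: freq for gram, freq in counts.items() if freq >= min_freq}


# Toy input: each token is a hypothetical (word, lemma, part-of-speech) triple.
tokens = [
    ("the", "the", "DET"), ("cats", "cat", "NOUN"), ("sat", "sit", "VERB"),
    ("the", "the", "DET"), ("cats", "cat", "NOUN"), ("sat", "sit", "VERB"),
    ("down", "down", "ADV"),
]

counts = count_ngrams(tokens, n=3)

# The sample described above uses a threshold of ten; this toy text uses two.
frequent = filter_by_frequency(counts, min_freq=2)

for gram, freq in frequent.items():
    words = " ".join(word for word, _lemma, _pos in gram)
    lemmas = " ".join(lemma for _word, lemma, _pos in gram)
    tags = " ".join(pos for _word, _lemma, pos in gram)
    print(f"{freq}\t{words}\t{lemmas}\t{tags}")
```

Keeping the lemma and part-of-speech tag alongside each word is what allows queries such as "all 3-grams containing any form of *sit*", which plain word strings cannot answer directly.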