Collocates data

The following are the columns in the collocates files (see sample with 1.34 million node/collocate pairs; 1/10 of the total).

ID The rank ID of the lemma, 1-60,000.

There are no entries for most of the 30-40 most frequent words (e.g. "the", "of", "with"), but all other words in the top 60,000 are included in the collocates data. This makes sense, at least from a simplistic psycholinguistic point of view. When most people hear "bread", they think of {loaf, slice, crumb, butter, eat}, etc. But it is not at all clear what they might think of when they hear "the", "of", or "with".


The "node word". Note that this is lemmatized, so that decide = {decide, decides, decided, deciding, etc.}


The part of speech of the node word


The collocate (lemmatized).


The part of speech of the collocate. (This is the first letter of the codes shown here.)


The Mutual Information score (see

There are minimum values in terms of collocate frequency and Mutual Information (MI) score for inclusion in the list:
     ID 1-200: MI > 1.6  //  ID 201-1000: MI > 2.0  //  ID > 1000: MI > 2.4
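
The MI cutoffs above can be sketched as a small lookup. This is only an illustration: the function names `min_mi_for_rank` and `passes_mi_threshold` are hypothetical, and the sketch covers the MI cutoff only, since the minimum collocate frequency is not given here.

```python
def min_mi_for_rank(rank_id: int) -> float:
    """Return the MI value a pair must exceed, given the rank ID of the node word.

    Cutoffs follow the list above:
    ID 1-200 -> 1.6, ID 201-1000 -> 2.0, ID > 1000 -> 2.4.
    """
    if rank_id < 1:
        raise ValueError("rank ID starts at 1")
    if rank_id <= 200:
        return 1.6
    if rank_id <= 1000:
        return 2.0
    return 2.4


def passes_mi_threshold(rank_id: int, mi: float) -> bool:
    """True if the MI score is strictly above the cutoff for this rank ID."""
    return mi > min_mi_for_rank(rank_id)
```

For example, an MI of 2.1 would be enough for a node word with ID 150 or ID 500, but not for one with ID 5000.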


The frequency of the node / collocate pair.

[% coll < node]

The percentage of the tokens for this pair (node word and collocate) where the collocate precedes the node word. This can be useful for distinguishing, for example, subjects (which typically precede the verb) from objects (which typically follow it).
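
As a rough illustration of how such a percentage could be derived, the sketch below computes it from two counts. The function name `pct_collocate_before` and its inputs (`before`, `after`) are hypothetical. It also assumes the value is on a 0-100 scale, which is not confirmed here.

```python
def pct_collocate_before(before: int, after: int) -> float:
    """Percentage of tokens of a pair in which the collocate precedes the node word.

    before: tokens where the collocate comes before the node word
    after:  tokens where the collocate comes after the node word
    """
    total = before + after
    if total == 0:
        raise ValueError("the pair has no tokens")
    return 100.0 * before / total
```

With 30 tokens where the collocate precedes the node word and 10 where it follows, the result is 75.0. For a verb node, a value that high would be more typical of a subject-like collocate than of an object.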