The following are the columns in the
collocates files (see the sample, which contains 1.34 million
node/collocate pairs, 1/10 of the total).
ID |
The rank ID of the lemma, 1-60,000. |
There are no entries for most of the 30-40
most frequent words (e.g. "the", "of", "with"), but all other words in
the top 60,000 are included in the collocates data. This makes sense, at
least from a simplistic psycholinguistic point of view. When most people
hear bread, they think {loaf, slice, crumb, butter, eat},
etc. But it's not clear at all what they might think of when they hear
the or of or with. |
lemma |
The "node word". Note that this is
lemmatized, so that decide = {decide, decides, decided,
deciding, etc.} |
lemPoS |
The part of speech of the node word. |
coll |
The collocate (lemmatized). |
collPoS |
The part of speech of the collocate. (This
is the first letter of the codes shown
here.) |
MI |
The Mutual Information score (see
https://www.english-corpora.org/mutualInformation.asp). |
There are minimum values in terms of
collocate frequency and Mutual Information (MI) score for inclusion in
the list:
ID 1-200: MI > 1.6 // ID 201-1000: MI > 2.0 // ID > 1000: MI > 2.4 |
freq |
The frequency of the node / collocate pair. |
[% coll < node] |
The percentage of the tokens for this pair
(node word and collocate) where the collocate precedes the node word.
This can be useful to distinguish, for example, subjects (which
typically precede verbs) and objects (which follow the verb). |
|
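To make the column descriptions concrete, here is a minimal Python sketch of how one row might be represented, how the MI cutoffs listed above apply to each ID range, and how the [% coll < node] value splits a pair's frequency into "collocate before" and "collocate after" counts. The class and function names are hypothetical, the example row uses made-up values, and the sketch assumes the percentage is stored on a 0-100 scale. The minimum collocate frequency is not stated above, so only the MI cutoff is checked.

```python
from dataclasses import dataclass


@dataclass
class CollocateRow:
    """One node/collocate pair; field names mirror the columns described above."""
    id: int                 # rank ID of the node lemma (1-60,000)
    lemma: str              # the node word (lemmatized)
    lem_pos: str            # part of speech of the node word
    coll: str               # the collocate (lemmatized)
    coll_pos: str           # part of speech of the collocate
    mi: float               # Mutual Information score
    freq: int               # frequency of the node/collocate pair
    pct_coll_before: float  # "[% coll < node]", assumed to be on a 0-100 scale


def min_mi(rank_id: int) -> float:
    """Return the MI cutoff for a node lemma's rank ID, per the ranges above."""
    if rank_id <= 200:
        return 1.6
    if rank_id <= 1000:
        return 2.0
    return 2.4


def passes_mi_threshold(row: CollocateRow) -> bool:
    """True if the row's MI score is strictly above the cutoff for its ID range."""
    return row.mi > min_mi(row.id)


def split_by_position(row: CollocateRow) -> tuple[int, int]:
    """Estimate (tokens with collocate before node, tokens with collocate after)."""
    before = round(row.freq * row.pct_coll_before / 100)
    return before, row.freq - before


# Hypothetical row; these values are illustrative, not taken from the data.
example = CollocateRow(
    id=1500, lemma="bread", lem_pos="n", coll="slice", coll_pos="n",
    mi=5.0, freq=200, pct_coll_before=75.0,
)
before, after = split_by_position(example)
print(passes_mi_threshold(example), before, after)
```

For a verb node, a high "before" share for a noun collocate would suggest the noun is typically a subject, and a low share would suggest an object, which is the use of this column described above.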