|
In addition to frequency lists for English, we also have what we
believe are the most accurate frequency lists for Spanish and
Portuguese. Both frequency lists contain the top 20,000 lemmas /
words in the language. The Spanish data is based on the 20 million
words from the 1900s in the 100-million-word Corpus del Español, and
the Portuguese data is based on the 20 million words from the 1900s
in the 45-million-word Corpus do Português. In both cases, these are
the only corpora that are 1) large, 2) balanced across genres
(spoken, fiction, newspaper, academic), and 3) accurately tagged for
part of speech and lemma (which is necessary to create a frequency
dictionary).
The data is available in a number of
formats. Click on the links for sample files. To order, see the
information at the end of this page.
| Type of data | Explanation | Samples | Price, Acad (1) | Price, Com (2) |
|---|---|---|---|---|
| Word/lemma | The top 20,000 words (grouped by lemma, so salir = salgo, salimos, salieran, etc.). You can also obtain the frequency of each individual word form (for salir: salgo, salieran, etc.) of each lemma, and the frequency of the lemma in each of the major genres in the corpus (spoken, fiction, newspaper, and academic). | 20,000 lemma list: Spanish, Portuguese; By word form; By genre | $100; add $75 (by word form), $75 (by genre) | $200; add $150 (by word form), $150 (by genre) |
| N-grams | The frequency of all two-word (2-gram), three-word (3-gram), or other n-gram strings. With these lists, you can quickly and easily find the frequency of combinations of words across the corpus, without having to use the corpus interface. In addition, you can specify the words for which you want n-grams (e.g. the top 20,000 lemmas, all NOUN + de + NOUN sequences, or all words in a customized 30,000-word list that you send to us). | Spanish: 2-grams, 3-grams; Portuguese: 2-grams, 3-grams | $100 | $200 |
| Synonyms | For Spanish: 320,000 synonyms for 29,000 headwords. For Portuguese: 470,000 synonyms for 31,000 headwords. | Spanish, Portuguese | $75 | $150 |
| Other data | If there is other data that you could use (without having access to the full text), please let us know. Examples might be the frequency of each word or phrase in a 30,000 word/phrase list, or the frequency of all synonyms for the top 10,000 lemmas in the corpus. | | | |
Note: 1 = Academic license, 2 = Commercial license
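
As a rough illustration of how lemma-grouped frequencies, word-form frequencies, and n-gram counts relate to one another, the sketch below computes all three from a tiny invented sample. The token list, the function names, and the (word form, lemma) pair layout are assumptions made for this example and do not reflect the format of the actual data files. The n-gram counts here also ignore sentence boundaries.

```python
from collections import Counter

# Hypothetical toy sample: (word form, lemma) pairs, such as a corpus
# tagged for lemma might provide. Not taken from the real data.
tagged_tokens = [
    ("salgo", "salir"), ("de", "de"), ("casa", "casa"),
    ("salimos", "salir"), ("de", "de"), ("casa", "casa"),
    ("salieran", "salir"), ("de", "de"), ("aquí", "aquí"),
]


def lemma_frequencies(tokens):
    """Count tokens grouped by lemma (salgo, salimos, salieran -> salir)."""
    return Counter(lemma for _form, lemma in tokens)


def form_frequencies(tokens, lemma):
    """Count the individual word forms that belong to one lemma."""
    return Counter(form for form, lem in tokens if lem == lemma)


def ngram_frequencies(tokens, n):
    """Count every run of n consecutive word forms in the token list."""
    forms = [form for form, _lemma in tokens]
    return Counter(tuple(forms[i:i + n]) for i in range(len(forms) - n + 1))


if __name__ == "__main__":
    print(lemma_frequencies(tagged_tokens).most_common(3))
    print(form_frequencies(tagged_tokens, "salir"))
    print(ngram_frequencies(tagged_tokens, 2).most_common(3))
```

Grouping by lemma is what depends on accurate tagging: without the lemma attached to each token, salgo, salimos, and salieran would be counted as three unrelated words.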
To order data: To initiate the order, send us an email
indicating what data you need. We will then send you a short
non-disclosure agreement (NDA), which specifies that you will not
give the data to anyone outside of your organization, and we will
also send a request for payment from PayPal. Once you send back the
NDA (as an email attachment) and submit the payment on PayPal (you
can use a credit card), we will send the data, nearly always within
24 hours.
|