BASICS Toolkit

Lemmatizer

The lemmatizer script inserts lemma information for Middle English verbs in the PPCME2, PCMEP, or PLAEME. Versions of these corpora with lemmatized verbs are required to take advantage of the qMaker tool. The script draws on an inventory of form-lemma correspondences, which can be queried with the Lemma search tool.

For licensing reasons, a lemmatized version of the PPCME2 cannot be made available here. Nor can the functionality of the lemmatization script be offered directly within the present web application, as this would entail that each user upload their entire copy of the corpus, then download a lemmatized version.

Instead, the lemmatizer script itself is made available for download, so that interested owners of a copy of the PPCME2 can perform the lemmatization themselves. A download link as well as a user guide are given below.

Should you notice lemmatization errors, have lemma suggestions for unlemmatized forms, or wish to disambiguate a form marked with multiple lemmas, we would be very grateful if you could notify us via our feedback form.

When using the lemmatizer script for your research, please cite the following publications:

Percillier, M. (2016). Verb lemmatization and semantic verb classes in a Middle English corpus. Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016), 209–214. Retrieved from https://www.linguistics.rub.de/konvens16/pub/26_konvensproc.pdf
Percillier, M. (2018). A Toolkit for lemmatising, analysing, and visualising Middle English Data. In A. U. Frank, C. Ivanovic, F. Mambrini, M. Passarotti, & C. Sporleder (Eds.), Proceedings of the Second Workshop on Corpus-Based Research in the Humanities CRH-2 (pp. 153–160). Retrieved from https://www.oeaw.ac.at/fileadmin/subsites/academiaecorpora/PDF/CRH2.pdf

When using the lemmatizer script on the PCMEP or PLAEME corpora, please also cite the following publication:

Percillier, M., & Trips, C. (2020). Lemmatising Verbs in Middle English Corpora: The Benefit of Enriching the Penn-Helsinki Parsed Corpus of Middle English 2 (PPCME2), the Parsed Corpus of Middle English Poetry (PCMEP), and A Parsed Linguistic Atlas of Early Middle English (PLAEME). Proceedings of the 12th Language Resources and Evaluation Conference, 7170–7178. Marseille, France: European Language Resources Association. Retrieved from https://www.aclweb.org/anthology/2020.lrec-1.886

The lemmatizer script can be downloaded here.

Please be advised that the lemmatizer script has only been tested on macOS. In principle, it should also work under Windows and Linux.

User guide

Contents of the download

00readme-v16.md
A text file formatted in Markdown containing usage instructions as well as information on the version history of the script
00readme-v16.pdf
A PDF version of 00readme-v16.md
insertLemmas.py
The Python script that inserts verb lemma information into the corpus
lemmas.csv
An inventory of form-lemma correspondences
frenchMEDverbs.csv
A list of Middle English verbs borrowed from French
textDates.csv
An inventory of text dates for the PPCME2

Requirements

Python 3
NB: Python 2 is no longer supported
Directory containing the .psd files of the PPCME2, PCMEP, or PLAEME

Setup

If necessary, install Python
NB: Windows users may have to add Python to the Environment Variables
Place the following items in the same directory:
- insertLemmas.py
- lemmas.csv
- frenchMEDverbs.csv
- textDates.csv (required for the PPCME2 only)
- and the directory containing the .psd files of the PPCME2, PCMEP, or PLAEME
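Before running the script, the setup can be checked programmatically. The following is a minimal sketch and not part of the download: the helper name `missing_setup_items` and its parameters are hypothetical, and it only tests that the items listed above sit side by side in one directory. It also assumes that the .psd files lie directly inside the corpus directory rather than in subdirectories.

```python
from pathlib import Path

# File names taken from the setup list above.
REQUIRED_FILES = ["insertLemmas.py", "lemmas.csv", "frenchMEDverbs.csv"]
PPCME2_ONLY_FILES = ["textDates.csv"]


def missing_setup_items(base_dir, corpus_dir_name, ppcme2=True):
    """Return the names of setup items that are absent from base_dir.

    Hypothetical helper for illustration; it is not part of the lemmatizer.
    """
    base = Path(base_dir)
    wanted = REQUIRED_FILES + (PPCME2_ONLY_FILES if ppcme2 else [])
    missing = [name for name in wanted if not (base / name).is_file()]

    corpus = base / corpus_dir_name
    # Assumption: the .psd files are stored at the top level of the corpus directory.
    if not corpus.is_dir() or not any(corpus.glob("*.psd")):
        missing.append(corpus_dir_name)
    return missing
```

Calling `missing_setup_items(".", "corpusDir")` from the working directory returns an empty list when everything is in place. Pass `ppcme2=False` for the PCMEP or PLAEME, where textDates.csv is not needed.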

Running the script

Open a command-line interface window and change the current directory to the one containing insertLemmas.py, lemmas.csv, frenchMEDverbs.csv, textDates.csv (PPCME2 only), and the directory containing the .psd files of your corpus:
```
cd path/to/directory
```
(on macOS/Linux)
```
cd C:\path\to\directory
```
(on Windows)
NB: You may be able to simply drag the folder into the command-line window after typing "cd "
Execute the Python script with one of the following commands. Which one works depends on your operating system, on how Python was installed, and on how many versions of Python are installed:
```
python3 insertLemmas.py corpusDir
python insertLemmas.py corpusDir
py insertLemmas.py corpusDir
```
NB: "corpusDir" stands for the name of the directory containing the .psd corpus files
When lemmatizing the PLAEME, this should be indicated via the "-p" or "--plaeme" flag, e.g.:
```
python insertLemmas.py -p corpusDir
python insertLemmas.py --plaeme corpusDir
```
Additional options for the PPCME2 are described in the "Read Me" files.

Output

A directory, named after the input directory with the suffix "_Lemmatised"
Inside this directory, the corpus files, named after the input files, with a ".lemma.psd" ending
In these files, all verb forms with a V* part-of-speech tag contain the following information:
- lemma suggestion, marked by "@l"
- MED number of the lemma suggestion, marked by "@m"
- French etymology, marked by "@e=french" when applicable, and "@e=nonfrench" when not
- "@w=doubt" when raters marked that they had doubts about their lemma suggestion(s)
- NB: verb forms can have more than one lemma suggestion
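To confirm that a run produced the expected output, the result files can be listed from Python. This is a minimal sketch under stated assumptions: the helper name `lemmatised_files` is hypothetical, and the output directory is assumed to be created next to the input directory. Only the "_Lemmatised" suffix and the ".lemma.psd" ending are taken from the description above.

```python
from pathlib import Path


def lemmatised_files(corpus_dir):
    """List the names of the .lemma.psd files found in the output directory.

    Hypothetical helper: it relies only on the naming described above
    (input directory name + "_Lemmatised", files ending in ".lemma.psd").
    """
    source = Path(corpus_dir)
    # Assumption: the output directory sits next to the input directory.
    output_dir = source.with_name(source.name + "_Lemmatised")
    if not output_dir.is_dir():
        return []
    return sorted(p.name for p in output_dir.iterdir()
                  if p.name.endswith(".lemma.psd"))
```

An empty list means that either the output directory does not exist yet or it contains no lemmatized files.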

We have developed a second annotation script that inserts lemma and animacy information for lexical noun forms. It targets nouns that occur within prepositional phrases headed by "to" in the vicinity of verbs in the PPCME2, the PCMEP, or the PLAEME. This helps determine the status of such prepositional phrases as prepositional objects or adverbials. Once identified, these noun forms are also annotated when they occur outside of such prepositional phrases.

The animacy annotation script can be downloaded here.

User guide

Contents of the download

00readme-v4.md
A text file containing usage instructions as well as information on the version history of the script
00readme-v4.pdf
A PDF version of 00readme-v4.md
insert-animacy.py
The Python script that inserts animacy information into the corpus
toPP-animacy.csv
A table listing the forms identified and the information to be inserted into the corpus

Running the script and output

The script works in much the same way as the lemmatizing script described above. Each piece of inserted information is demarcated by @ characters and specified by an attribute, such as:

l: the lemma in its Present-Day English form and/or as specified in the (Parsed) Linguistic Atlas of Early Middle English, whereby multiple entries are separated by a pipe (|) character
a: the animacy status, which may be "animate", "inanimate", "ambiguous" for homonyms that could not be distinguished, and "panimate" for lexemes that are inanimate in their primary sense but have a high propensity to refer to animate entities via metonymy or metaphor. Such cases can be included in a query for animate entities with: *@a=animate@*|*@a=panimate@*
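The annotations can also be read outside of a corpus query tool. The sketch below is illustrative only: the exact layout of an annotated form is an assumption inferred from the query pattern above (a sequence of "@attribute=value" pieces closed by a final "@"), the function names are hypothetical, and the example form is invented rather than taken from a corpus file.

```python
import re

# Assumption: each annotation has the shape "@<attribute>=<value>", and the
# sequence is closed by a final "@", as the query pattern above suggests.
ATTRIBUTE = re.compile(r"@(\w+)=([^@]*)")


def parse_annotations(leaf):
    """Map each attribute to a list of values, split on the pipe character.

    The description above only mentions pipe-separated values for "l";
    splitting every attribute the same way is a simplification.
    """
    return {key: value.split("|") for key, value in ATTRIBUTE.findall(leaf)}


def refers_to_animate(leaf):
    """True for "animate" and "panimate", mirroring the query shown above."""
    values = parse_annotations(leaf).get("a", [])
    return any(v in ("animate", "panimate") for v in values)


# Invented example form, not taken from any corpus file.
example = "kyng@l=king@a=animate@"
print(parse_annotations(example))
print(refers_to_animate(example))
```

Matching on exact values keeps "inanimate" from being mistaken for "animate", which a plain substring test would not guarantee.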

The following older versions of the scripts are also available: