Pedagogy

Introducing YiDraCor: a TEI-Encoded Corpus of Yiddish Drama

Jonah Lubin

On Sunday, October 29th, DraCor released its Yiddish corpus (YiDraCor). DraCor (short for “drama corpora”) is an ongoing project developed and maintained by Prof. Frank Fischer of the Freie Universität Berlin and Prof. Peer Trilcke of the Universität Potsdam. It is a corpus of plays, split into 21 (primarily linguistic) subcorpora and encoded in TEI. TEI (Text Encoding Initiative) is a markup language, not dissimilar to HTML, which has been developed to encode text. In the case of DraCor, this means that the TEI-encoded plays in its corpus are highly machine-readable: not only is each play a full text, but speakers, stage directions, scenes, acts, and so on are all encoded, such that the texts are ready out of the box for computational literary study. As a sample of what this data can be used for, DraCor provides visualizations for each of its plays. See below the network for Joseph Lateiner’s Yudale der blinder:

Each character in the play is represented by a vertex, and if two characters appear in the same scene or act, they are connected with an edge. Building off the work of Yarkho and Sapogov, DraCor also provides visualizations for speech distributions. A sample of Yarkho’s analysis, wherein he compares speech distributions between Classicist and Romantic tragedies (from 1928, Digital Humanities way avant-la-lettre), can be found here.
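To make the encoding and the network concrete, here is a minimal Python sketch. It assumes that scenes are marked as `<div type="scene">` and that each speech is an `<sp>` element whose `who` attribute points to a character ID, which is a common TEI convention for drama. The miniature play is invented for illustration and is not taken from YiDraCor, the function names are hypothetical, and this is not DraCor's own implementation.

```python
import itertools
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented miniature play (placeholder IDs and lines, NOT from YiDraCor).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div type="scene">
      <stage>A room. Evening.</stage>
      <sp who="#a"><speaker>A</speaker><p>First line.</p></sp>
      <sp who="#b"><speaker>B</speaker><p>Second line.</p></sp>
    </div>
    <div type="scene">
      <sp who="#b"><speaker>B</speaker><p>Third line.</p></sp>
      <sp who="#c"><speaker>C</speaker><p>Fourth line.</p></sp>
      <sp who="#b"><speaker>B</speaker><p>Fifth line.</p></sp>
    </div>
  </body></text>
</TEI>"""


def scene_casts(tei_xml):
    """Return one set of speaker IDs per <div type="scene">."""
    root = ET.fromstring(tei_xml)
    casts = []
    for scene in root.iterfind(".//tei:div[@type='scene']", TEI_NS):
        speakers = set()
        for sp in scene.iterfind(".//tei:sp", TEI_NS):
            # @who may list several speakers, e.g. who="#a #b".
            for ref in sp.get("who", "").split():
                speakers.add(ref.lstrip("#"))
        casts.append(speakers)
    return casts


def cooccurrence_edges(casts):
    """Map each pair of characters to the number of scenes they share."""
    edges = {}
    for cast in casts:
        for pair in itertools.combinations(sorted(cast), 2):
            edges[pair] = edges.get(pair, 0) + 1
    return edges


def speech_counts(tei_xml):
    """Count speeches per speaker ID, a crude speech distribution."""
    root = ET.fromstring(tei_xml)
    counts = {}
    for sp in root.iterfind(".//tei:sp", TEI_NS):
        for ref in sp.get("who", "").split():
            name = ref.lstrip("#")
            counts[name] = counts.get(name, 0) + 1
    return counts


if __name__ == "__main__":
    print(cooccurrence_edges(scene_casts(SAMPLE_TEI)))
    print(speech_counts(SAMPLE_TEI))
```

In this toy play, A and C never share a scene, so they are not connected, while B is linked to both. Counting scenes per pair, rather than only recording that a pair met, leaves room for weighted edges.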

But DraCor is not primarily a website: it is a programmable corpus of drama, which revolves around its API. An API (Application Programming Interface) is a service that allows, in this context, particular types of data to be queried and received from a database. The DraCor API can be used to query and download specified data from the corpus at will. For example, Prof. Frank Fischer and I recently used the API to download all of the stage directions from DraCor’s German corpus. We then took this mass of stage directions and wrote a Python program that extracted and labeled all of the props found within them. The DraCor website, with all its knobs and bobs, is meant to serve as an exemplary use case of its powerful API, which can be used to slice, dice, and serve the corpus according to your taste.
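A request of this kind might look roughly like the following Python sketch. The base URL and the `stage-directions` endpoint path are assumptions based on my reading of DraCor's public API documentation and should be checked there before use. The helper names are hypothetical, and this is not the program we wrote for the props study.

```python
import urllib.parse
import urllib.request

# Assumption: base URL of the DraCor API; verify against the API documentation.
API_BASE = "https://dracor.org/api/v1"


def stage_directions_url(corpus, play):
    """Build the (assumed) endpoint URL for one play's stage directions."""
    corpus_part = urllib.parse.quote(corpus, safe="")
    play_part = urllib.parse.quote(play, safe="")
    return f"{API_BASE}/corpora/{corpus_part}/plays/{play_part}/stage-directions"


def fetch_text(url, timeout=30):
    """Download a URL and decode the response body as UTF-8 text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def split_directions(raw):
    """Split a plain-text response into non-empty, stripped lines."""
    return [line.strip() for line in raw.splitlines() if line.strip()]


if __name__ == "__main__":
    # "some-play" is a placeholder slug, not a real identifier.
    url = stage_directions_url("ger", "some-play")
    print(url)
    # Uncomment to hit the network once a real play slug is filled in:
    # directions = split_directions(fetch_text(url))
    # print(len(directions), "stage directions")
```

Keeping URL construction separate from the download makes it easy to loop over every play in a corpus and to test the pieces without a network connection.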

Every year, DraCor is featured in the work of a growing number of researchers. So far in 2024, DraCor has been used by Julia Jennifer Beine to sketch the scheming slave in Roman comedy, by Dîlan Canan Çakir to investigate the German one-act play, and by José Luis Losada Palenzuela to analyze text reuse in the Spanish theater of the Siglo de Oro. For more uses, see here.

Work on the Yiddish corpus began at a workshop hosted by the Freie Universität Berlin. Leading the charge for the Yiddish corpus were Ruthie Abeliovich (Tel Aviv University) and Sinai Rusinek (University of Haifa). Abeliovich brought along the plays of Joseph Lateiner, a shundy Yiddish playwright who was born in 1853 in Iași, and died in 1935 in New York City. Lateiner’s strange and fascinating dramatic work is at the center of Abeliovich’s DYBBUK project, the goal of which is to uncover and explore non-canonical theater in all its aspects, from its manuscripts to its stage-hands.

We used the workshop to set up a workflow for converting PDFs first into OCR’d full texts, and then into fully featured TEI. The OCR model, hosted on Transkribus, was trained by Rusinek on plays from the DYBBUK corpus. The next step was to use ezdrama, a Markdown-like markup language designed to encode plays, to convert the texts into TEI.

This was, more or less, the workflow we then used to produce our initial corpus of three plays. Of particular difficulty were the OCR and proofreading of the heavily pointed orthography used to print Lateiner’s drama. Though it was no easy feat, we are proud to report that the editions of the plays on DraCor maintain their original orthographies, with all their many, many idiosyncrasies. Maintaining the orthography of a single edition, which is the general policy of DraCor, leads to a broader diversity of orthography in the general corpus of digitized Yiddish, which provides useful data for understanding the development of Yiddish writing systems.

It is also useful for producing ground-truth to train Yiddish OCR on a variety of orthographies.

About a year after that fateful workshop at the FU, we have three TEI-encoded Yiddish plays ready to present. Two of these plays are by Joseph Lateiner. The first is a comic one-acter, and the second is a dramedy in four acts. The third play is a one-acter by Jane Rose (born 1880, Minsk; died 1927, Cleveland). The work of Jane Rose was first brought to my attention by Alona Bach at the Yiddish Book Center, where she shared her translation of Rose’s Nit mit alemen.

As the basis of our corpus, we selected these three understudied plays from non-canonical authors. We decided to have two plays from the same author in order to examine how stylometry (statistical authorship attribution based on linguistic attributes of the text) would work on this data. After Ruthie Abeliovich brought his plays to the workshop, Lateiner seemed like a perfect choice. Rose’s work was selected as a stylistic and linguistic counterpoint to Lateiner’s. It is important to us that the corpus represents a broad range of Yiddish drama, and we hope to maintain a high standard of linguistic and authorial diversity.
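For readers new to stylometry, here is a toy Python sketch of the underlying idea: represent each text by the relative frequencies of the corpus's most frequent words, then measure how far apart two such profiles are. The tokens are invented placeholders rather than Yiddish data, the function names are hypothetical, and the distance is a plain mean absolute difference. Established measures such as Burrows's Delta standardize the frequencies first, so treat this as an illustration and not as the method we will use.

```python
from collections import Counter


def most_frequent_words(texts, n):
    """Return the n most frequent words across all tokenized texts."""
    counts = Counter()
    for tokens in texts.values():
        counts.update(tokens)
    return [word for word, _ in counts.most_common(n)]


def relative_frequencies(tokens, vocabulary):
    """Relative frequency of each vocabulary word in one text."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total for word in vocabulary]


def profile_distance(p, q):
    """Mean absolute difference between two frequency profiles."""
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)


if __name__ == "__main__":
    # Placeholder "texts": single letters stand in for word tokens.
    texts = {
        "x": "a a b c".split(),
        "y": "a b b c".split(),
        "z": "c c c a".split(),
    }
    vocabulary = most_frequent_words(texts, 3)
    profiles = {
        name: relative_frequencies(tokens, vocabulary)
        for name, tokens in texts.items()
    }
    print(profile_distance(profiles["x"], profiles["y"]))
    print(profile_distance(profiles["x"], profiles["z"]))
```

With two plays by one author and one by another, the hope is that the same-author pair lands closer together than either does to the third play.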

The one-act plays, Der man untern tish (1908) by Lateiner and Engshaft (1918) by Rose, both deal with (the fear of) a wife’s infidelity and the intrusion of men into the home. But the ways the plays treat this theme, and the social institutions that occasion it, are very different. Lateiner’s play is basically a farce, with songs and physical (if distasteful) comedy. It is set in Central/Eastern Europe, and the infidelity is occasioned by the intrusion of a shabbos oyrekh. Engshaft, on the other hand, is a short drama, set in New York, with very English-heavy Yiddish. The masculine intruder in this case is the dreaded boarder.

Our full-length drama is Lateiner’s Yudale der blinder. It is the story of a blind young man (Yudale), left in the charge of his wealthy uncle (Valdman). Yudale received a substantial inheritance from his father, but as long as he remains blind and unmarried, Valdman gets to keep it. It is thus in Valdman’s best interest to keep Yudale both blind and unmarried – but the German eye-doctor Professor Edelman, and Yudale’s poor, golden-hearted love Dvoyre might have something to say about that… There’s drama, there’s comedy, there’s singing and dancing. You’ll laugh, you’ll cry…

We are very excited to be able to present these plays to you, and hope that many computational analyses of Yiddish drama will be hot on their heels. If you are interested in contributing to the project, whether that be by alerting us to typos, recommending us a play, sending us a full-text, or encoding, we would love to hear from you!

Please send any questions, suggestions, or generous offers of assistance to jonah_lubin [at] g.harvard.edu

MLA STYLE
Lubin, Jonah. “Introducing YiDraCor: a TEI-Encoded Corpus of Yiddish Drama.” In geveb, October 2024: https://ingeveb.org/pedagogy/yidracor.
CHICAGO STYLE
Lubin, Jonah. “Introducing YiDraCor: a TEI-Encoded Corpus of Yiddish Drama.” In geveb (October 2024): Accessed Mar 26, 2025.

ABOUT THE AUTHOR

Jonah Lubin

Jonah Lubin is a PhD student in Comparative Literature at Harvard University.