The DYBBUK Model: A New Yiddish Handwriting Text Recognition Tool

Ruthie Abeliovich

The Digital Humanities zeitgeist has afforded scholars and students of Yiddish a variety of tools that render the field accessible: digitized books, periodicals, and manuscripts; oral history archives; online classes and cultural productions; and language-learning tools, from the mobile application Duolingo to a Yiddish iteration of the new word game Wordle. And yet research into historical records of Yiddish remains daunting, as anyone who takes up the challenge is bound to stumble over the task of deciphering handwritten Yiddish manuscripts. Decoding handwritten Yiddish requires many skills: one needs to become intimately acquainted not only with the style and idiosyncrasies of the handwriting, but also with the language itself, including the specific written dialect, local knowledge, foreign-language influences on the writer, and the cultural context of the document being deciphered.

Whether formal or personal in content, handwritten documents enable scholars to connect with the writing process and its corporeality, and to engage with the scribbles, spelling errors, and edits in the text. Yet interpreting and transcribing Yiddish handwriting can be highly time-consuming and ultimately frustrating. To access such records, one must struggle with jagged handwriting and disjointed textual fragments that may intimidate people and discourage them from going down these ‘rabbit holes’ and looking into the past.

The core research team of the DYBBUK project, a five-year research initiative funded by the European Research Council (ERC), realized the need for a digital solution for deciphering Yiddish handwriting. The project sets out to study the forms, themes, and practices of the popular Yiddish theater at the turn of the twentieth century, focusing on the unexplored corpus of two of the most prolific Yiddish theater makers: “Professor” Moyshe Hurwitz and Joseph Lateiner. As Nahma Sandrow writes in her pathbreaking Vagabond Stars: A World History of Yiddish Theater, “In the archives of the YIVO Institute for Jewish Research in New York there are hundreds of Yiddish plays, most of them copied by hand into ruled notebooks and scrawled over with line changes.” They “are hard to make out,” Sandrow points out (Vagabond Stars [Syracuse University Press, 1996], 109). In our project, we aim to restore this neglected yet influential corpus of theater and music and make it available for our appreciation and delight. The Yiddish popular theater played a pivotal role in transporting cultural styles, ideas, and products at a time of mass migration in Europe and beyond; its theatrical practices and modes of expression influenced popular show business as we know it today. To understand the complexity and scope of the Yiddish popular theater, the project will study the images, sounds, messages, performances, and theater practices it involved. Focusing on this understudied corpus, the DYBBUK project offers a new understanding of the evolution of modern Jewish theater as a process that developed through, and was enriched by, a dynamic dialogue with the Shund.

Within this framework, we have focused on creating an automatic recognition model able to read and decipher Yiddish handwriting of various sorts, shapes, and curves. For this task we chose Transkribus, a cutting-edge digital platform for the transcription, automatic text detection, and enrichment of handwritten archival documents. Adopting Transkribus as a digital working environment makes it possible to decipher archival documents as full text, and can thus facilitate and boost research into hidden treasure troves rich in Jewish cultural heritage.

The platform was conceived in the context of two EU projects: tranScriptorium (2013–2015) and READ (Recognition and Enrichment of Archival Documents, 2016–2019). Since July 1, 2019, the platform has been managed and further developed by the READ-COOP. Access to Transkribus automatic text detection models is available through a credit system, which offers 500 free credits per user as well as various packages for more extensive use. All other functions of Transkribus (layout analysis, transcription and correction, data upload and download, accuracy check, annotation, and search) remain free of charge. According to Wikipedia, in June 2020 the platform boasted more than 37,000 registered users, who constitute a vibrant and supportive community on social media platforms.

Training a Yiddish-language handwriting model required our project team to transcribe dozens of handwritten Yiddish pages. In order to create a generic language model that could be useful for a wide variety of documents, we included a large corpus of various types of handwriting in the model. At the beginning of January 2022, we held an online workshop that introduced to the public the first handwriting text recognition (HTR) model for deciphering Yiddish handwriting. (A recording of the workshop is available.) The workshop was open to scholars, students, and Yiddish-lovers who wished to explore new working methods on the basis of their own archival materials and family letters. We envisioned this workshop as the inauguration of a Yiddish Transkribus community that will strive to improve Yiddish handwriting recognition while focusing on individual materials. The more we train the platform to recognize and decipher Yiddish handwriting, the better the model will become. The ninety-minute gathering was divided into two sessions. During the first, Sinai Rusinek, our Digital Humanities expert, presented the Transkribus interface and explained how to work with its different features and what to expect (and not to expect) from this AI-powered software. In the second part of the workshop, we split into two Zoom breakout rooms, led by the core team of the project, Miriam Trinh and Oren Cohen-Roman, in order to gain hands-on experience with the interface.

The workflow of automatic text recognition begins with preparing the images and uploading them to the digital platform. Before the computer can read and decipher the writing, we must conduct a layout analysis of the page we are working on. The layout analysis describes the various text areas on the page and specifies the order in which the computer will read them. Once the page layout has been described and corrected, we can run text recognition on the manuscript. At this stage, the user needs to go over the output and correct any recognition errors. Following the correction, the text can be downloaded for personal use.
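The reading-order part of the layout analysis can be illustrated with a small sketch. It is not how Transkribus itself works; the `Region` structure, the pixel coordinates, and the `row_tolerance` value are all assumptions made for illustration. The sketch orders the text areas of a page from top to bottom and, because Hebrew-script text runs right to left, from right to left within each row.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A rectangular text area on a page image (hypothetical structure)."""
    label: str
    x: int       # left edge, in pixels
    y: int       # top edge, in pixels
    width: int
    height: int


def reading_order(regions, row_tolerance=40):
    """Order regions top to bottom, and right to left within each row."""
    # Sort by top edge so that rows can be grouped in a single pass.
    by_top = sorted(regions, key=lambda r: r.y)
    rows = []
    for region in by_top:
        # A region joins the current row if its top edge lies close to
        # the top edge of the first region in that row.
        if rows and abs(region.y - rows[-1][0].y) <= row_tolerance:
            rows[-1].append(region)
        else:
            rows.append([region])
    ordered = []
    for row in rows:
        # Right-to-left script: the region with the rightmost edge comes first.
        ordered.extend(sorted(row, key=lambda r: r.x + r.width, reverse=True))
    return ordered


# Illustrative page: a title above two columns of text.
page = [
    Region("left column", x=50, y=210, width=400, height=900),
    Region("title", x=50, y=40, width=900, height=120),
    Region("right column", x=500, y=200, width=450, height=900),
]
order = [r.label for r in reading_order(page)]
print(order)
```

On a real manuscript page with marginal notes or insertions, a simple geometric rule like this will often be wrong, which is one reason the layout has to be checked and corrected by hand before recognition is run.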

The real wonder happens after a significant portion of the text has been corrected: we can then train the computer to improve its reading skills. Transkribus already has a model, named dijest, that can decipher printed materials in Hebrew script, including Yiddish in various orthographies. Employing dijest enables users to transform materials into searchable and annotatable texts.
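How well a model reads can be quantified by comparing its output with the corrected transcription. One measure commonly used in handwriting recognition is the character error rate: the number of character insertions, deletions, and substitutions needed to turn the recognized text into the corrected text, divided by the length of the corrected text. The following is a minimal sketch of that calculation; the function names and the sample strings are illustrative assumptions, and it is not presented as the platform's own accuracy check.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def character_error_rate(recognized: str, corrected: str) -> float:
    """Edit distance divided by the length of the corrected text."""
    if not corrected:
        raise ValueError("corrected transcription must not be empty")
    return edit_distance(recognized, corrected) / len(corrected)


# Illustrative strings: the model has misread one letter (bet as kaf).
corrected = "דער דיבוק"
recognized = "דער דיכוק"
cer = character_error_rate(recognized, corrected)
print(f"{cer:.3f}")
```

A lower value means fewer corrections are needed. Computing the rate on pages the model has not been trained on gives a fairer picture of how it will perform on new handwriting.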

Digital handwriting recognition tools revolutionize the accessibility of knowledge and knowledge production. They have the potential to smooth out some of the technical obstacles in our work and, at the same time, invite us to ponder the cultural meanings embedded in the use of an automated Yiddish-language reading model. Our scholarly intervention should not be understood only as a technical operation; it also has a symbolic dimension. A language known for its ability to absorb other languages, integrating elements from Slavic, German, and Hebrew (loshn koydesh), Yiddish becomes, through the handwriting recognition tool, a model for the language itself: an archive made up of the embodied practices of its past writers and speakers. The Transkribus platform demonstrates the ways in which Yiddish is intrinsically shaped by the technology through which it is mediated. It thus reflects the technology of the language, whereby the very conditions of the language are bound up with its form.

This language model also presents us with a radical twist in the modern Yiddish cultural plot: it is based on popular Yiddish plays that were subjected to harsh criticism and disparaged as valueless “shund” (literally, trash). It is a delicious irony that those plays, considered by elite Jewish intellectual circles to be worthless works lacking any artistic or cultural value, make up the “ground truth” of our handwriting recognition model, demonstrating, once again, how theater furnishes us with a surprising platform for engaging with new technologies.

Abeliovich, Ruthie. “The DYBBUK Model: A New Yiddish Handwriting Text Recognition Tool.” In geveb, February 2022.


Ruthie Abeliovich

Ruthie Abeliovich is assistant professor of Theatre and Performance Studies at the University of Haifa.