Announcing hebrewCorpus

We are pleased to announce hebrewCorpus, a new corpus that is freely available online. The corpus offers a variety of searchable Hebrew texts. Sources include the Tanach, the Mishnah, nine Israeli newspapers, some early and modern fiction, movie subtitles, transcribed interviews from the Corpus of Spoken Israeli Hebrew, academic journals, sessions of the Knesset, Wikipedia, and a few others. Together, these texts total over 150 million words.

The texts are not tagged for part of speech, since the morphological ambiguity of Hebrew makes tagging problematic. Instead, the program offers part-of-speech filters that try to predict a word's part of speech from its structure and affixes. The program also supports regular expressions, which greatly enhance the searchability of the texts. Detailed instructions and a tutorial for the corpus are provided on the site.
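
As a rough illustration of the kind of affix-based pattern a regular-expression search can express, here is a minimal Python sketch that finds words beginning with a given prefix, such as the definite article ה. It is not the corpus's own code or query syntax; the function name find_prefixed_words and the sample sentence are hypothetical, and the site's tutorial remains the reference for actual searches.

```python
import re

# Hebrew letters occupy the Unicode range U+05D0 to U+05EA (final forms included).
HEBREW_LETTER = "[\u05D0-\u05EA]"


def find_prefixed_words(text, prefix):
    """Return the words in `text` that start with `prefix` (hypothetical helper).

    The lookbehind requires that no Hebrew letter precede the prefix, so the
    prefix must sit at the start of a word rather than in the middle of one.
    """
    pattern = re.compile(
        rf"(?<!{HEBREW_LETTER}){re.escape(prefix)}{HEBREW_LETTER}+"
    )
    return pattern.findall(text)


# Made-up sample sentence, not taken from the corpus.
sample = "הילד קרא ספר והילדה כתבה מכתב"
matches = find_prefixed_words(sample, "ה")
print(matches)
```

A pattern like this only looks at the shape of a word, so it will also match words whose first letter happens to be ה without being a prefix. This is the same ambiguity that makes tagging Hebrew difficult.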

We invite all Hebrew teachers, students, and scholars interested in using a search tool to study Hebrew to explore this resource. If you know of anyone else who would be interested, please refer them to the site.

To begin using the corpus, go to http://hebrewcorpus.nmelrc.org. Click "register for free" and enter your name and e-mail address to begin searching. You can also log in as a guest, but this is not recommended, since many users may be logged in as guest at the same time.

There is also a mailing list that provides updates and tips on the corpus; please contact Justin Parry at ootkaman@yahoo.com to be added to it.

This corpus was developed with funding from the National Middle East Language Resource Center (NMELRC).