Gale-Cengage Learning and 18thConnect have joined forces to undertake a major initiative benefitting scholars and improving the digital archive for future generations.
Gale’s ECCO catalog, Eighteenth-Century Collections Online, contains page images for 182,000 texts, some of them as lengthy as Clarissa. Because the process of creating such a set of images has taken decades of work, some of the page images are not readable enough to be transformed into typed texts by computer programs designed for this work. 18thConnect.org is a community of scholars and an open-source online finding aid hosted by the University of Virginia. It has received a grant from the Mellon Foundation as well as NEH support from the NCSA and I-CHASS (the National Center for Supercomputing Applications and the Institute for Computing in the Humanities, Arts, and Social Sciences at the University of Illinois) in order to develop a new, open-source software program that, after being trained on the ECCO catalog, will itself be available for public use.
18thConnect will re-run the ECCO page images through this new program in order to generate, if possible, cleaner text than Gale has been able to produce commercially. Next, 18thConnect will provide a window for users (anyone who wishes to register with an email address) to correct the typing of these texts. That these texts are correctly typed is, we believe, crucial for searching, data-mining, and making them findable and comprehensible to future generations. Users who wish to correct whole texts will receive, in compensation for their work, access to the fully typed text. 18thConnect and NINES hold workshops each summer to demonstrate to scholars how to build library-quality scholarly editions from plain typed texts. Once properly constructed, these editions can be submitted to 18thConnect for peer review. The editorial board at 18thConnect is composed of top scholars in the field, and acceptance letters are designed to indicate the value of these editions to Promotion and Tenure Committees. Further, library-quality scholarly editions are eligible to become MLA Electronic Scholarly Editions. Positively reviewed editions are first accepted into the 18thConnect online finding aid. If a scholar’s edition has been accepted (positively peer-reviewed), Gale Cengage may choose to publish the edition along with the page images as a print-on-demand edition.
In addition, because of this mutually beneficial collaboration between Gale Cengage Learning and 18thConnect.org, the ECCO catalog is now completely searchable on the freely available 18thConnect site. Everyone may search through the bibliographical information in the Gale catalog. If you or your institution subscribes to Gale, clicking on the link returned with an entry will take you into the ECCO text collection and directly to that particular text. (You have to be on a work computer at a subscribing institution or on a proxy server.) If you do not subscribe, you may do one of two things: find out the holding libraries for that text through the ESTC (English Short Title Catalogue), also online at 18thConnect.org, or simply click on “correct” in order to put the text in your correction queue. Once corrected, the typed text belongs to you to do with as you wish, and again, we recommend creating a scholarly edition. All the work of correction will go back into the ECCO collection to improve this valuable resource for posterity.
Text Creation Partnership
It was precisely the desire to improve for posterity the ECCO collection (Eighteenth-Century Collections Online, owned by Gale Cengage) that motivated the Text Creation Partnership at the University of Michigan to undertake the TCP-ECCO project. The TCP has produced hand-typed transcriptions of page images of texts in the EEBO collection (Early English Books Online, owned by ProQuest) because documents printed before 1700 are generally impossible to type mechanically. “Mechanical typing” requires having an OCR program (Optical Character Recognition Program) run through the page images and “read” them, turning the lines of print visible in the image file or PDF into typed letters. When you use the “find” function to search a PDF file, it is actually a typed version of the image that you are searching, and that version has been generated by OCR. But standard OCR engines will not work on page images of texts that were printed before about 1820, and no OCR engine at all will work on texts printed before about 1720.
The Text Creation Partnership is devoted to making electronically searchable those early modern documents for which we have only digital images. However, it does not work with OCR. Instead, this group pays a company for typing by hand so that page images are double- or triple-keyed (typed two or three times, and compared) in order to achieve 99.995% accuracy. Many institutions have contributed money to TCP-EEBO, and, as a result, as many as 40,000 documents will be full-text searchable in EEBO sometime next year. In the case of the TCP’s work with ECCO, however, only 2,229 texts have been keyed, and institutions are no longer contributing the funds necessary to support their work.
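The comparison step in double-keying can be illustrated with a short Python sketch. This is not the keying company's actual procedure, which the text does not describe; it only shows the general idea that two independent typings of the same page are lined up and every point of disagreement is flagged for a human to resolve. The function name `keying_disagreements` and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def keying_disagreements(first: str, second: str) -> list[tuple[str, str]]:
    """Return (text_in_first, text_in_second) pairs where two keyings differ."""
    # autojunk=False keeps the matcher from skipping frequent characters
    # on long pages, so every character takes part in the alignment.
    matcher = SequenceMatcher(None, first, second, autojunk=False)
    return [
        (first[i1:i2], second[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical sample: the second typist misreads a long s as an f.
keying_a = "The History of a Young Lady"
keying_b = "The Hiftory of a Young Lady"

for a_text, b_text in keying_disagreements(keying_a, keying_b):
    print(f"disagreement: {a_text!r} vs {b_text!r}")
```

Where the two keyings agree, the text is accepted; only the flagged spans need a second look. For scale, 99.995% accuracy corresponds to roughly one wrong character in every 20,000.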
18thConnect is currently developing a program of OCR plus crowd-sourced correction that will produce typed versions of page images that are as accurate as if they had been typed by hand in the first place. The TCP is vigorously supporting 18thConnect by giving us access to their typed texts for searching and for research. We are using them to train our OCR engine, and those texts are also now completely searchable here, online, in the 18thConnect finding aid: click on ECCO in the search menu on the right-hand side, and then click on “Full-text only,” a radio button at the bottom of the same search menu. You will see 2,188 returns, and then you can narrow your search further by putting in words or authors or anything to be found in the bibliographic information or anywhere throughout the text. Searching 18thConnect, you will see how high the quality of those 2,188 typed texts is, and the difference accuracy makes in being able to do good research. (See also the screencast available in “What Is 18thConnect?”)
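One common way to judge how close OCR output comes to a hand-keyed text is character accuracy based on edit distance. The Python sketch below is an illustration under that assumption, not a description of how 18thConnect evaluates its own engine; the function names `edit_distance` and `character_accuracy` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(ocr_text: str, keyed_text: str) -> float:
    """Share of the hand-keyed text that the OCR output reproduces correctly."""
    if not keyed_text:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, keyed_text)
    return max(0.0, 1 - errors / len(keyed_text))
```

With a measure like this, a hand-keyed text serves as the reference, and each OCR run or round of crowd-sourced correction can be scored against it.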