Your institution’s library has access to a number of Gale’s primary source collections, and they can be found by clicking on the Available Text links on the Home Page or in the Learning Center. Most of Gale’s archival collections are text mineable, with the exception of those that are primarily manuscript-based or have specific rights restrictions that prevent text mining at this time:
Chatham House Archive
Early Arabic Printed Books
The Financial Times Historical Archive
State Papers Online
National Geographic Magazine Archive
Handwritten texts (including Arabic texts) are difficult to render as plain text because of the limitations of handwritten text recognition. While OCR engines are continuously improving their ability to recognize a wide variety of character sets, the variability of handwriting remains challenging for most platforms today. Even so, Gale has employed a number of new technologies to derive OCR from manuscript collections like the Crime and Punishment module of Nineteenth Century Collections Online (NCCO) and will continue to advance the state of manuscript OCR in the future.
At present, users are limited to 10,000 documents per content set. The limit was determined through consultation with our source library partners, researchers, beta testers, and programmers. It allows us to monitor the performance of the analysis pipeline and make changes to both hardware and software to respond to computational needs in the future.
Gale Digital Scholar Lab includes a variety of tools that support well-known qualitative and quantitative text analysis methods. Four of these tools are open source and are widely recognized and used in academia today; the remaining two are built in a similar fashion to their open-source equivalents or use open-source components in the analysis process. Providing these tools along with millions of pages of primary source content and accompanying OCR text lets users move quickly from corpus creation to text analysis in one platform.
Gale Digital Scholar Lab includes the following tools:
Yes. The Clean feature of Gale Digital Scholar Lab lets you strip out blank spaces, punctuation, special characters, and more to ensure cleaner, more accurate analytical output. It’s designed to work seamlessly with the included analysis tools, and it can also clean content sets before you download them locally. Cleaning is a critical part of the preparation for any text analysis. Gale Digital Scholar Lab offers cleaning as a separate feature, so you can ensure that all documents in a given content set are prepared in precisely the same way. You decide how the documents are altered and can make adjustments according to your individual research needs.
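For readers who want to see what this kind of cleaning looks like on text they have downloaded, here is a minimal Python sketch. It is not the Lab's Clean feature or its implementation; the function names `clean_text` and `clean_content_set` and their options are hypothetical, chosen only for illustration. It shows the general idea of applying one fixed set of cleaning rules to every document so that all of them are prepared in the same way.

```python
import re


def clean_text(text, remove_punctuation=True, collapse_whitespace=True):
    """Illustrative cleaning step (hypothetical; not Gale's implementation)."""
    if remove_punctuation:
        # Replace anything that is neither a word character nor whitespace
        # (punctuation and special characters) with a space, so that
        # neighboring words do not run together.
        text = re.sub(r"[^\w\s]", " ", text)
    if collapse_whitespace:
        # Collapse runs of spaces, tabs, and newlines into a single space
        # and trim the ends.
        text = re.sub(r"\s+", " ", text).strip()
    return text


def clean_content_set(documents, **options):
    """Apply the same cleaning options to every document in a list."""
    return [clean_text(doc, **options) for doc in documents]


if __name__ == "__main__":
    sample = ["Hello,   world!  ", "An   OCR'd page; with  debris***"]
    for cleaned in clean_content_set(sample):
        print(cleaned)
```

Passing the same options to every document, as `clean_content_set` does, mirrors the point above: the value of cleaning a content set as a unit is that the preparation is consistent across documents.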
The majority of Gale Primary Sources and Archives Unbound collections can be analyzed within Gale Digital Scholar Lab (please see the exclusion list earlier in this FAQ). While the mission of the platform is to provide access to OCR text of your institution’s Gale Primary Source collections, we will also support the ability to analyze non-Gale texts with the Digital Scholar Lab. We will continue to explore possibilities to extend our content reach to include outside collections that customers frequently request.
Users can upload plain text files (.txt) and text in spreadsheets (.csv) by navigating to the Upload feature on the Build page of Gale Digital Scholar Lab. They can select one or more files from their computer to upload, then apply metadata, manage the files, and add them to a Content Set.
Watch this 7-minute tutorial video to learn more.
Users are the only ones who can access their uploaded documents, and they retain control over them. Once a document has been uploaded to the Lab, users can edit its text, apply metadata, and add it to a Content Set. Users can also delete their documents from the Gale Digital Scholar Lab environment at any time. Note that deleted documents will no longer be available for inclusion in content sets or analysis. They will also be removed from any content set that currently contains them and will no longer be viewable in past analyses.