Textual Data & Analytics

This guide offers an overview of methods, tools, and sources of data for textual analytics.

Check permissions before using a text resource

Whenever you start gathering text files for textual analysis, you must carefully consider whether you have the legal right to use the resource for that purpose. Using a copyrighted resource for text analysis can be a violation of copyright, so it is best practice to use items that are in the public domain or that expressly state that the item is available for text and data mining.

  • Textual analysis often falls under fair use, which is an exception to copyright. However, fair use does not apply if you share the corpus (which is what happens when you share or export a corpus in Voyant and other tools).
  • In the United States, works published in 1928 or earlier are in the public domain, so these items are typically safe for textual analysis.
  • "Some materials you may want to use may have technological protection measures (e.g., digital locks). Breaking these protection measures may be a violation of copyright law or other statutory laws." (Copyright Implications in Text Data Mining, Resources for Text and Data Mining Guide by Emory Libraries)
  • Materials in library databases may not be available for text analysis, depending on the license agreement the library has negotiated with the database publisher. Check the database's Terms and Conditions page to see whether data mining is allowed.

Finding Textual Data

Most practitioners of textual data analytics will agree: finding textual data that is relevant to your research question, readily available in digital format, and not restricted by copyright or licensing is in many ways the hardest part of textual analytics. The sources below are good places to start, but if you're not finding what you are looking for, don't hesitate to reach out to a librarian for further consultation.

Tips:

  • Transcription vs. optical character recognition: If the text you are looking for exists in a transcribed version, it will almost certainly be of higher quality than text generated by optical character recognition (OCR). Choose transcriptions whenever they are available.
  • Some sources will allow you to retrieve texts in bulk if you know what you want. Talk to a librarian if you're unsure how to pursue this option.
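
If you happen to have both a transcription and an OCR version of the same passage, a rough comparison can show how noisy the OCR text is before you commit to it. The sketch below uses only Python's standard-library difflib module. The function name similarity and the sample strings are illustrative assumptions, not part of any tool named in this guide, and the score is a general string-similarity ratio rather than a formal character error rate.

```python
from difflib import SequenceMatcher


def similarity(reference: str, candidate: str) -> float:
    """Return a similarity ratio between 0 and 1 (1.0 means identical)."""
    # Collapse runs of whitespace so that line-wrapping differences
    # between the two versions are not counted as errors.
    ref = " ".join(reference.split())
    cand = " ".join(candidate.split())
    # autojunk=False turns off difflib's heuristic for long sequences,
    # which can otherwise ignore very frequent characters.
    return SequenceMatcher(None, ref, cand, autojunk=False).ratio()


# Hypothetical sample: a clean transcription and a noisy OCR reading of it.
transcription = "It was the best of times, it was the worst of times"
ocr_output = "It was the hest of tirnes, it was the worst of times"

score = similarity(transcription, ocr_output)
print(f"similarity: {score:.3f}")
```

A score close to 1.0 suggests the OCR text is close to the transcription; a noticeably lower score is a sign that the OCR version may need cleanup, or that the transcription is the safer choice.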

Databases with Text analysis tools

Open data/shared corpora

Library Subscription Databases