last updated: 1/28/2013

Text & Data Mining

Text and data mining of academic databases is becoming an increasingly popular way to conduct research. It can allow scholars to make connections not previously discovered, or to find solutions more quickly and efficiently. When practiced without due diligence into existing legal restrictions, however, such research has also gotten some researchers into trouble for alleged copyright and contract violations.


For IU researchers interested in accessing the Libraries’ digital journals, databases, special collections (specifically, HathiTrust), and other subscription content for the purposes of text or data mining, here are some things you should know before you start.

  1. In order to text mine journal articles or databases available through the Libraries, you’ll need a librarian’s help. Many publishers and vendors, such as Thomson Reuters, EBSCO, and ProQuest, restrict automated data scraping and large-scale access to their journal articles and databases. They believe these restrictions are necessary to protect their copyright, lest end-users abuse their subscription privileges by downloading articles and sharing them with others who have not paid for a subscription to the journal. Librarians can help you contact publishers in order to gain access. Email Lori Duggan (Head, Electronic Resources Unit) at lbadger@indiana.edu to learn more about what restrictions might exist, and how you can get permission to text mine.
  2. Text and data mining is negotiated with publishers on a case-by-case basis. Only a handful of IUB researchers have approached the Libraries for help getting access for text- or data-mining research. Because demand is so low, we have no standing arrangements and handle each request individually. This means that it may take weeks or months to work out terms with vendors. Be sure to build this extra time into your research schedule.
  3. Negotiations with publishers are mostly handled by the researcher. Librarians help you make initial contact with vendors, but most of the responsibility for negotiating access falls to you, the researcher. Sometimes it may take weekly phone calls or emails to make progress. Keep in mind that getting access will take time, as described above, and will also require persistence on your part.
  4. It might cost you money (and resources). Publishers are unlikely to write services into their default subscription contracts if they can charge extra for them, and data-mining packages are no different. None of our current vendor contracts includes text or data mining, and one-time access can cost upwards of $10,000. Writing expected costs into grant proposals is a good way to ensure access.

    On a related note, you will need to have your own data storage provisions and tools. Check out RFS for short-term data storage, and SDA for long-term tape storage. (Both are free options supported by UITS.) Many data analysis tools are available free of charge to IU researchers via IUware.
  5. For certain types of text-mining research, Open Access journals and repositories can be good alternatives to subscription journals. Publishers such as Hindawi, PLOS, and BioMed Central welcome text mining and reuse of their content, as do some institutional and subject repositories like PubMed Central (though some have argued that PMC’s text-mining capabilities are less than stellar).

    It’s worth noting that both Elsevier and Gale have expressed public support for text-mining, but they negotiate it on a case-by-case basis.

It’s possible to make text-mining access the “default” for IU researchers, but we need to build critical mass first. As described above, none of our current subscription contracts allows text mining, and few researchers have expressed a need for access. However, if enough researchers reach out to us, we may consider negotiating default text- and data-mining access in future contracts. Send an email to Lori Duggan (lbadger@indiana.edu) showing your support and we’ll keep you in mind.


