Research has found that by putting your research data online, you’ll become up to 30% more highly cited than if you kept your data hidden.
Data are second only to journal articles in terms of importance to scholarly communication and publishing; they are the rocks from which diamonds are refined. And as a scholar, chances are you’ve got some kind of data lying around on your hard drive or server, whether your specialty is in the humanities, social sciences, or STEM fields. Data are ubiquitous.
Yet a lot of research and scholarly data never see the light of day. It used to be difficult to make data available to others, so scholars didn’t do so unless it was required by journals or funder mandates.
But research has found that by making your data available online, you’ll become up to 30% more highly cited than if you kept your data hidden. Openly available data also lead to replication and reproducibility studies, and are important to the quality of research overall. And advancements in technology have made it easier than ever to make your data open access and preserve it.
In this week’s challenge, we’ll share several easy ways to make your data available online: open repositories (ORs) like Figshare, Zenodo and the Open Science Framework; disciplinary repositories (DRs) like Dryad and ICPSR; and institutional repositories (IRs) like AUrora.
A common way for many researchers to share their data over the years has been to submit it as a supplementary file to a journal article. But publishers are beginning to encourage scientists to deposit their data to repositories instead.
Publishers recognize that repositories of all persuasions are fantastic places to post your research data. That’s because of two standard features for most repositories: high-quality preservation options and persistent identifiers for your data.
Preservation is a no-brainer – if you’re entrusting your data to a repository, you want to know that it will be around until you decide to remove it.
Persistent identifiers are important because they allow your data to be found if the URL for your data changes or if it’s transferred to another repository when your repository is shuttered, and so on. And with persistent identifiers like DOIs, it’s easy to track citations, shares, mentions, and other reuse and discussion of your data on the Web.
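To make the idea of a persistent identifier concrete, here is a minimal Python sketch that turns a DOI, however it was written down, into a link through the public https://doi.org/ resolver. The function name `doi_to_url`, the loose validation pattern, and the example DOI are illustrative assumptions, not part of any repository's tooling.

```python
import re
from urllib.parse import quote

# Loose shape check for a DOI: "10.", a numeric registrant code, "/", then a suffix.
# This is an assumption for illustration, not a full validator.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Common ways a DOI is written down in citations and web pages.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(doi: str) -> str:
    """Return a resolver link for a bare or prefixed DOI."""
    cleaned = doi.strip()
    # Strip a known prefix so every input form ends up as a bare DOI.
    for prefix in KNOWN_PREFIXES:
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    if not DOI_PATTERN.match(cleaned):
        raise ValueError(f"not a recognizable DOI: {doi!r}")
    # Percent-encode unusual characters but keep the slash separators.
    return "https://doi.org/" + quote(cleaned, safe="/")


if __name__ == "__main__":
    # Hypothetical DOI used only for illustration.
    print(doi_to_url("doi:10.1234/example.dataset.5678"))
```

Because the link goes through the resolver rather than pointing at a repository page directly, it can keep working even if the dataset's landing page moves.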
There are several different types of repository that can host your data depending upon your institution and discipline. Let’s dig into the different types of repositories and what each does best.
Once you create a project in OSF, the platform allows you to add files, write a wiki, assign collaborators/contributors, assign a license, and register a DOI. Projects can be public or private. But the structure you can add to a project is one thing that makes OSF such a powerful tool. You can have different components for a project (and components inside those components), e.g. methods, data, and analysis. Each component can have different attributes too – wikis, privacy settings, collaborators/contributors, and file storage. There are two options for adding your data to an OSF project or component: internal OSF storage, or a linked third-party service like Google Drive, Dropbox, or GitHub. If you link one of your cloud services to an OSF project or component, you can specify exactly which files or folders you want to give access to. All components (including OSF storage files) have unique, persistent URLs. OSF also tracks versions of your files, and its analytics tools show how people are interacting with your data. For anyone who is concerned with dependability, OSF has a preservation fund that, at this time, will keep all hosted projects and data up for at least 50 years.
Though there is no limit to the number of files you can upload to OSF’s internal storage, individual files have to be 5 GB or less. If your dataset is governed by HIPAA or FERPA, make sure that you are following the requirements of your grant or institution before making data public on OSF.
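Before uploading a folder of data, it can help to check for files over the per-file limit. The sketch below is one way to do that in Python. It assumes the 5 GB limit means 5 × 1000³ bytes (the exact byte count is an assumption), and `oversized_files` is a hypothetical helper name.

```python
from pathlib import Path

# Per-file limit quoted above, read here as 5 * 1000**3 bytes (an assumption).
MAX_FILE_BYTES = 5 * 1000**3


def oversized_files(folder, limit=MAX_FILE_BYTES):
    """Return (relative path, size in bytes) for every file larger than limit."""
    root = Path(folder)
    too_big = []
    # rglob("*") walks the folder and all of its subfolders.
    for path in sorted(root.rglob("*")):
        if path.is_file():
            size = path.stat().st_size
            if size > limit:
                too_big.append((path.relative_to(root).as_posix(), size))
    return too_big


if __name__ == "__main__":
    # Check the current working directory; replace "." with your data folder.
    for name, size in oversized_files("."):
        print(f"{name}: {size / 1000**3:.2f} GB exceeds the limit")
```

Any file this reports would need to be split, compressed, or kept in a linked storage service instead.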
Open repositories like Figshare and Zenodo are repositories that anyone can use, regardless of institutional affiliation, to preserve any type of scholarly output they want. Here are specific advantages and disadvantages of these two open repositories.
Figshare offers free deposits for open data up to 5 GB in file size. It issues persistent identifiers called DOIs for datasets. Users can “version” their data simply by uploading updated files, and can easily embed Figshare datasets in other websites and blogs by copying and pasting a short embed code. Other users can comment on datasets and download citation files to their reference managers for later use.
Figshare offers preservation backed by CLOCKSS, a highly trusted, community-governed archive used by repositories around the world. And you get basic information about the number of views and shares on social media your dataset has received to date.
Zenodo also offers free data deposits and issues DOIs for your datasets. Much like Figshare, the non-profit makes citation information for datasets available in BibTeX, EndNote, and a variety of other library and reference manager formats. Users can add highly detailed metadata for their files – much more than Figshare currently allows – which can aid in discoverability. Other Zenodo users can comment on your files. And best of all, Zenodo makes it easy to sign up with your ORCID identifier or GitHub account. (If you don’t have a GitHub account yet, no worries! We’ll cover GitHub next week.)
Both repositories have open APIs, making them very interoperable with other systems, and they are both user-friendly and fun to use.
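As a rough illustration of what an open API makes possible, the Python sketch below searches Zenodo's public records endpoint and lists titles and DOIs. The endpoint `https://zenodo.org/api/records` and its `q` and `size` parameters come from Zenodo's REST API. The response layout used here (`hits.hits[]`, each with a `metadata.title` and a top-level `doi`) is an assumption to check against the current API documentation, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ZENODO_RECORDS_URL = "https://zenodo.org/api/records"


def build_search_url(query, size=5):
    """Build a search URL for Zenodo's public records endpoint."""
    return ZENODO_RECORDS_URL + "?" + urlencode({"q": query, "size": size})


def summarize_hits(payload):
    """Pull (title, DOI) pairs out of a decoded search response.

    Assumes records sit under hits.hits, each with metadata.title and doi.
    Missing fields fall back to empty strings instead of raising.
    """
    records = payload.get("hits", {}).get("hits", [])
    return [
        (record.get("metadata", {}).get("title", ""), record.get("doi", ""))
        for record in records
    ]


def search_zenodo(query, size=5):
    """Run the search over the network and return (title, DOI) pairs."""
    with urlopen(build_search_url(query, size), timeout=30) as response:
        return summarize_hits(json.load(response))


if __name__ == "__main__":
    # Requires network access; the query string is only an example.
    for title, doi in search_zenodo("soil moisture"):
        print(f"{doi}  {title}")
```

Building the URL and parsing the response are kept separate so each piece can be checked without touching the network.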
For some, Figshare’s funding model is a serious drawback; it’s a for-profit company funded by Digital Science, whose parent company, Macmillan Publishing, is the keeper of the Nature Publishing Group empire.
Zenodo’s preservation plan is less robust than Figshare’s, and currently Zenodo can only host files 5 GB or less in size. Zenodo also lacks public page view and download statistics, meaning that you can’t track the popularity or reuse of the data you submit to the archive.
Disciplinary repositories offer a way to share specialized research data with relevant communities. They offer many of the same features as IRs and ORs, but often with special features for disciplinary data.
Disciplinary repositories like KNB and ICPSR often allow users to use subject-specific metadata schemas that enhance discoverability. They are focal points for their disciplines, meaning that your data will more likely be seen by those who understand it. Repositories like those in the DataONE network are interoperable with the software that you and other researchers already use to collect and analyze data, making it super easy to deposit data as part of your regular workflow. Depending on the repository, they might offer DOIs for data you’ve deposited.
The Open Access Directory and the Re3Data guide maintain lists of open and disciplinary data repositories.
Not all disciplinary repositories allow you to deposit large datasets. Some do not offer DOIs. And occasionally, grant-funded subject repositories that don’t have sustainable business models shut down after their funding runs out.
In some disciplines, entire repositories exist just for data of particular formats. Some examples include the RCSB’s Protein Data Bank for 3D shapes of proteins, nucleic acids, and complex assemblies; GenBank for DNA sequences; and EMDataBank for 3D electron microscopy density maps, atomic models, and associated metadata.
If there’s a repository for the datatype you work with, often your best bet is to deposit it there. By virtue of being a hub for disciplinary data, datatype repositories are frequented by others in your field who are doing similar research – an ideal audience of those you’d want to see and reuse your data. Datatype repositories often offer highly specific metadata and search options, making it easy for others in your field to find your data.
Datatype repositories cater to a very small subset of data formats, and can sometimes lack links to the publications and other datasets that give them much-needed context. Some datatype repositories are inactive, having been abandoned after their funding ran out, or because of a lack of use, or for a host of other reasons. Be careful to check whether the datatype repository you’re interested in using is regularly updated.
Institutional repositories are platforms where a university’s community can preserve their research data and other scholarly outputs. AUrora is the institutional repository that serves faculty members, students, staff, and other stakeholders at Auburn University.
AUrora is free to use if you’re affiliated with AU, and it allows for the addition of both basic and complex data descriptions. Additionally, if you work with your librarian, we may be able to mint DOIs for your data.
By virtue of being backed by AU and administered by librarians, AUrora offers a degree of trust that money can’t buy. Librarians have been stewards of the scholarly record since the times of the ancient Library of Alexandria, and both AU and AUrora will likely be around long after the Googles of the world have been shuttered.
What AUrora offers in trust, it lacks a bit in flexibility and control. Of course, if you’re not affiliated with AU, you won’t be able to deposit your data in AUrora. And while AUrora can accept nearly any file type, it does not necessarily offer users a way to read those files. At the time of this writing, AUrora also doesn’t support versioning, so you’re unable to edit your files and accompanying descriptive information.
Most importantly, AUrora is not able to store very complex or very large data files. If you’re working with Excel data, AUrora can certainly handle that! And it can handle lots of other filetypes too, of course. But it’s not built for terabytes of data.
Finally, AUrora uses a very general metadata standard, Dublin Core, so it doesn’t support domain or datatype-specific metadata fields and controlled vocabularies.
In addition to some of the drawbacks addressed above, the biggest limitation to the idea of making your data openly available is that not everyone can do it! If you work with sensitive data – defined by ANDS as “data that can be used to identify an individual, species, object, or location that introduces a risk of discrimination, harm, or unwanted attention”– you often can’t share your data openly online.
That said, some repositories like ICPSR do index sensitive data, making it available only to registered users. The availability of a metadata record alone can sometimes be enough to cite sensitive data, so it’s possible that you can still get cited, even if your data aren’t openly available. But we don’t recommend keeping your data behind a login or other barrier if you don’t have to.
Unsure if your data are “sensitive”? Check out Purdue University Library’s guide on sensitive data, which can help you identify sensitive data and any applicable laws and regulations.
For this week’s homework, you’re going to get some of your data online, and you’re going to make an appointment with the Research Data Specialist at AU Libraries.
Explore data hosted on the Open Science Framework, Figshare, and Zenodo, then choose and sign up for an account on the platform of your choice (we recommend OSF, specifically because we offer OSF workshops and have librarians who specialize in OSF training). Deposit at least one dataset to the service. It can be a copy of supplementary data you’ve posted alongside a journal article, raw data, or data from a dead-end project you’ve never published.
Be sure to add as much descriptive information as possible during the deposit. It will make your data useful to those who look at your data, and good description will also make it more “Googleable”– each one of these repositories is well-indexed by search engines. All digital data should also include a README.txt file that helps others better understand the data. Your README file should contain:
There are thousands of repositories where you could possibly deposit data from your field. Ask a trusted colleague for a recommendation or check out the Re3Data guide for a comprehensive list of subject repositories.
Once you’ve found one that suits your needs, register for it and deposit a dataset or two.
Ask a colleague or your advisor what the best repositories are for the data formats you tend to create. Sign up for each that you think will be the most relevant to your work, explore some of the other datasets on the site, and deposit a dataset or two of your own. And just like you did for the previous two deposits, make sure you add great descriptive information and a README file, which can help others understand your data.
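If you deposit often, a small script can keep your README files consistent. The Python sketch below writes a plain-text README from a handful of fields. The field names (title, creator, contact, license, description, file notes) and the example values are illustrative assumptions rather than a required standard, so adapt them to whatever your repository or discipline expects.

```python
from pathlib import Path


def build_readme(title, creator, contact, license_name, description, files):
    """Return README text; `files` maps each file name to a one-line note."""
    lines = [
        f"Title: {title}",
        f"Creator: {creator}",
        f"Contact: {contact}",
        f"License: {license_name}",
        "",
        "Description:",
        description,
        "",
        "Files:",
    ]
    # One indented line per data file so readers know what each one holds.
    for name, note in files.items():
        lines.append(f"  {name} - {note}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    text = build_readme(
        title="Example survey responses",
        creator="A. Researcher",
        contact="researcher@example.edu",
        license_name="CC BY 4.0",
        description="Anonymized responses from a pilot survey.",
        files={"responses.csv": "one row per respondent"},
    )
    Path("README.txt").write_text(text, encoding="utf-8")
```

Keeping the README as plain text means it stays readable no matter which repository hosts the data.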
Liaison librarians at AU Libraries can connect you with additional resources, including discipline-specific repositories where you can deposit and share your data. Even better, AU Libraries employs a full-time research data specialist, Ali Krzton, who is dedicated to all things data! Make an appointment with Ali to consult about data repositories, data management plans, data formats, data storage, workflows for input/uploads, and more!
Got an idea of what repository works best for you? Great! Next time you’ve got a dataset that you want to share with the world, do it!