by Suzanne Grubb
Like many of you, I don’t work with datasets that can’t fit neatly into an Excel workbook.
My inner geek loves reading about the world of machine learning and predictive analytics – as well as articles in Information Outlook or the several sessions at the Annual Conference covering the amazing work done by our colleagues curating and leveraging massive datasets. But it feels like all Big Data discussions in the library world are falling into one of three categories:
- “You Can Make Data Products, Too” how-to/cheerleading pieces, encouraging information professionals to build their skills as data workers,
- “Your Organization Should Be Using Big Data” high-level strategy pieces for managers, or
- “Not Actually Big Data” pieces written by folks who have confused datasets-that-are-large with “big data.”
I work for a niche digital library that doesn’t have the resources for in-house data-wrangling or a mission that warrants a big data strategy. And I suspect there are many others out there like me who are struggling with how to find our place in the big data ecosystem.
For those of us who are falling through the cracks in this Big Data conversation, I’m hoping to put together a field guide to some of the additional players – beyond researchers and data scientists – who help shed some light on the ongoing evolution of the big data landscape.
In May, the White House released two reports (the US Open Data Action Plan and Big Data: Seizing Opportunities, Preserving Values) articulating the policy agenda for big data. Of particular interest to librarians: the reports specified a commitment to strengthening privacy protections, supporting innovation in education, supporting digital literacy, and improving public access to government datasets – including expansion of Project Open Data, Data.gov, and open source tools to use data.gov.
Groups like EPIC and the Electronic Frontier Foundation frequently discuss the implications of operating in a world where automated data collection is treated as a given, as well as the challenges of de-identifying and anonymizing datasets. In the coming years users will expect (and organizations may be required to provide) greater control over personal data. Even if we don’t collect and store user data within our own organizations, we need to be aware of how this information might be passed through to data brokers and other third-party vendors – and prepared to act in the event of a breach.
There are several initiatives looking to use big data to disrupt the way we think about research. Alternative methods for measuring the impact of a publication, like Altmetrics, take into account social sharing and other data to quickly shed light on trending topics and the public reach of research, while standards like FundRef and ORCID are setting the stage for a wealth of readily available big data tools that will enhance our ability to visualize and explore the research ecosystem.
Data Publishers and Data Curators
There has been incredible growth in the number (and geographic distribution) of open access data repositories over the past decade. Fortunately, there is a growing number of organizations curating and cataloging these datasets (like Databib.org, DataCite, Quandl and many others). In order to help users locate and evaluate datasets, it’s important to understand the considerations involved in publishing and using data.
Citizen Hackers and Startups
The emergence of a hackathon culture – with nationally sponsored and grassroots, local events that connect data scientists and coders with causes and problems – alongside a proliferation of data-product startups means that there has also been an explosion of free apps to help the non-data scientists among us in visualizing and manipulating data. Even if you do not want to host a hackathon, there are a huge number of projects (for example: codefordc.org/projects.html) that provide highly useful functionality for specialized datasets in all subject areas.
While I hope this post provides some food for thought, this is certainly not a complete list. What else would you include in your version of a Big Data field guide? I hope you’ll post your thoughts and additions in the comments.
Suzanne Grubb is a digital librarian/instructional designer and all-purpose info-geek, currently building a Clinical Research Education Library for a DC-based association.