
From the Roadmap for Action report

What Do We Mean by Data and Data Infrastructure?

We talk about two types of data. The first is Research Data, which refers to the data academic institutions generate through their research activities. The second is Grey Data, which refers to the vast amount of data produced by universities outside of core research activities.


Research Data and “Grey Data”

In this document, we talk about two types of data. The first is Research Data, which refers to the data academic institutions generate through their research activities. The second is Grey Data, which refers to the vast amount of data produced by universities outside of core research activities, and which tends to focus on the individuals belonging to their communities (primarily students, but also faculty and staff). This includes data from applications, student records, ID cards, surveys, sensors, surveillance video, internet and network usage, and more. The term “Grey Data” was coined by Dr. Christine Borgman at the University of California, Los Angeles, who points out that the boundaries between Research Data and Grey Data are increasingly blurry, and that it is necessary to consider both when discussing solutions to the issues posed by the rise of data and data analytics in academic institutions. As a result, this document addresses both types of data.

It is also critical to underscore upfront that we are not opposing the use of data and data analytics in academic institutions. Data collection and analysis are key elements of research, teaching, and learning.1 We do acknowledge that some voices argue that the unbalanced use of machine learning can shift the focus within the academic community away from basic science and toward technology, or away from theory and toward producing large data sets.2 However, a world without data is also a world where biases can and do play a large role, so limiting the use of data is not the solution.

Our goal is to ensure that academic institutions retain control over the use of their data and of the analytics applied to it. It is also vital that both are used in ways consistent with the goals of the academic community, and that academic institutions are properly equipped to deal with the risks and implications posed by the use of data.

Of course, in many ways, this phenomenon mirrors the rise of data capture and usage in society at large, and it poses similar challenges. What is different is the declining opportunity for individuals within academic institutions to actively opt out of data collection. Individuals working for corporations, depending on where they live, have limited or no expectation of digital privacy at work. Academic institutions, on the other hand, at least in many Western countries, have always protected academic freedom, including the right to conduct research and search for information without prying eyes. These concepts are now at risk.

Metrics and algorithms

It is important to distinguish upfront between metrics, which refer to what is being measured, and algorithms, which refer to how it is being measured. These two categories are often interrelated, but they pose different issues and therefore should be addressed separately.

Metrics should be controlled by academic institutions. It is their responsibility, and theirs alone, to ensure that evaluation is performed on the basis of multiple factors that align with the institution’s mission and values. This document does not advocate for academic institutions to choose any specific metric, but rather argues that they should deliberately choose which metrics are used, instead of simply relying on those sold by commercial vendors. Developing metrics may be complex, resource-intensive, and unique to each institution’s context. However, sharing best practices across like-minded institutions may facilitate the establishment of a number of metrics that could become de facto standards.

Algorithms, on the other hand, do not necessarily need to be controlled by each academic institution, but they must be carefully understood and monitored. It is critical that algorithms be as transparent as possible, so that they can be fully analyzed and held accountable. So long as an algorithm remains a “black box,” an institution is powerless to understand whether it contains biases that are incompatible with its values, or flaws that could lead to costly mistakes.


  1. SPARC staunchly advocates for Open Data as a way to accelerate the research process, but our advocacy for Open Data policies is independent of our work in this document. 

  2. https://medium.com/berkman-klein-center/from-technical-debt-to-intellectual-debt-in-ai-e05ac56a502c 

About the authors


Claudio Aspesi

A respected market analyst with over a decade of experience covering the academic publishing market and leadership roles at Sanford C. Bernstein and McKinsey.

Scholarly Publishing and Academic Resources Coalition

SPARC is a non-profit advocacy organization that supports systems for research and education that are open by default and equitable by design.