
ACM FAccT 2020 Craft Session (3): Experimenting with flows of work – How to create modes of working towards epistemic justice?

Drawn workflow of algorithmic legal analysis

This is part three of a blog post series reflecting on a workshop, held at the FAccT conference 2020 in Barcelona, about machine learning and epistemic justice. If you are interested in the workshop concept and the theory behind it, as well as what a workflow is and why we worked with them, read our first two posts here and here.

In this post, work group facilitators briefly report on some of what came out of the Translation cartographies part of the workshop. This part focused on identifying and mapping terms which were important for facilitating cross-disciplinary understanding, and we introduce here some of the workflows that participants generated. How did participants go about re-writing the “default” technical workflow? What new modes of working were proposed? What conditions (practical, conceptual, contextual) were necessary in order to enable the implementation of the workflow proposed by the group? Where were some of the frictions (contested terms, methods, understandings) that created disagreement or discontinuities in the discussion? Spoiler alert: the exercise of redrawing a technical workflow and identifying the terms and conditions necessary for the new workflows to work proved to be an ambitious feat to complete within our allocated time. We knew that; the question was whether it was experienced as problematic. The jury’s still out, but we’ve already decided to build in at least some more breathing space in the next edition.

Outside ethics: cooperating around the AI black-box

Image of a re-worked technical workflow of one of the participant groups on a whiteboard. The image portrays the prototypical workflow, post-its with added structures to the workflow, and a new section on ethics.
Workflow 1

The participants at this table remarked on the flatness and generality of the standard workflow. From the beginning they wondered to what extent they could steer away from the workflow, testing different maneuvers to work with it. Should people position themselves somewhere “in the workflow”? It seemed that people’s disciplinary backgrounds shaped how they positioned themselves. A computer scientist at the table responded that he would locate himself “in the optimization part”. A sociologist noted that she would consider the organisational embeddedness of AI and how it responds to broader societal and policy problems.

As most people could imagine interventions during the design and use of machine learning algorithms, they attached sticky notes at the beginning and the end of the workflow. These included “feedback loops for user testing”, “emphasis on labelling data and how labels emerge”, “internal ethics reviews”, and “establishing and using checklists of ethical requirements”, among others. Interestingly, one person suggested locating ethics outside of the workflow – which was not contested but supported by proposals to establish “ethics offices”. These interventions – and the resulting workflow – reproduce the notion of AI as a black box where only input and output can be accounted for (and thus intervened upon).

Tough Love

Reworked workflow on a whiteboard by one of the participant groups, with additional stages on context, structure and impact assessment added to the prototypical workflow.
Workflow 2

From the get-go, this group of participants performed their inquisitive exchanges in a cooperative, problem solving spirit. Scheduling time to reflect (which we’d done but minimally) seems especially valuable in order for ‘things’ to come out in such a setting. An early note of a participant read “how much would I have to explain to make this point .. what is my stake in ending up with an outcome that I am behind?”

Eventually, no term was left unturned before adding it to their workflow redesign in what seemed a clear effort to uncover tacit disciplinary normativity, and arrive at interdisciplinary trust. The participant choir’s smooth tuning of their social-political-legal voices in the stage of ‘problem formulation’ contrasted with their lengthy quarantining of the CS contribution of ‘solvability,’ while they investigated it for negative performative (‘solutionist’) potential. Shared glossaries are loved for a reason (and ours did not include the term).

Around 18.30, a 5-step workflow was born under the name TAF ❤ (a.k.a. “tough love”), challenging the parent (venue) acronym. Although stages were progressively sorted (from problem, design and ex ante impact assessments to briefing of deployment and advisory teams, and ex-post impact assessment), strategic relations drew across boundaries and blue feedback loops attested to dynamic relations. The pre-final stage was headed by a big green post-it with a red exclamation mark, and the text “the goal gets redesigned in every step.”

Making the work flow and adjust: building feedback loops

Prototypical workflow on a whiteboard re-worked with post-it notes listing additional considerations such as user profiles etc.
Workflow 3

This text was written by creators of the workflow as an annotation, and presented here without edits:

Our model is a non-linear model that is more dynamic than a default linear workflow. Our workflow is inspired by feedback loops. The major loop is to go from model readjustment back to task definition. We have different levels on which interdisciplinary work is conducted. Workflows are divided into research and other processes. Connections are not given between the different levels but they can interact. Workflows can be interdependent and interacting. For instance, one level is to question and define task definitions and see how this task definition. An important aspect of our workflow(s) is humility. This means that different actors are aware of different actors. Awareness is an act of engagement, taking into account different perspective and that the own perspective is not king. Another important concept that might be related to humility is bias. As researchers we are aware of our biases but also carry biases and embed them into research. Bias may be a contested term that can be positive or negative in different contexts. The disciplinary upbringing is a form of bias that we should be aware of. Workflow(s) need actors and humanization which means we have different faces involved in the research process. Especially to realize epistemic justice, we need to take different communities into account.  


ACM FAccT 2020 Craft Session (2): Finding common ground, charting workflows

Workshop participants drawing their workflows

This is part two of a blogpost series reflecting on a workshop, held at the FAccT conference 2020 in Barcelona, about machine learning and epistemic justice. If you are interested in the workshop concept and the theory behind it, read our first article here. This post reflects on the workshop method. Written by Danny Lämmerhirt, Aviva de Groot, Goda Klumbyte, Phillip Lücking and Evelyn Wan.

How can computer scientists, data scientists, as well as scholars from humanities and social sciences hear one another, acknowledge and appreciate each other’s ways of reasoning about algorithms? If our goal is to strive for epistemic justice – that is, to improve our capacities for fair and inclusive knowledge making – what form could a workshop take to respond to this?

To address these questions, our FAccT (formerly known as FAT*) 2020 workshop revolved around two main parts, “Charting methodological workflows” and “Translation Cartographies”. At the beginning we formed four groups consisting of participants with diverse disciplinary backgrounds. Each group was informed about facilitator guidelines that address respectful participation and diversity in voices. We promoted awareness of these aims because they easily (and unintentionally) suffer from group dynamics, and knowledge production suffers as a consequence. We also offered a shared glossary – brief descriptions of key concepts that acted as a common reference point, to which people were encouraged to add their own takes (and lemmas). In this post we will discuss some insights that came up during the first part, charting the workflows.

We started off with participants presenting and discussing their discipline’s usual workflows among themselves. By “workflow” we mean a formalized way of building, approaching or analyzing a particular object (subject of investigation). The term itself is more frequently associated with industrial and organisational fields, and often scholars from the humanities and social sciences would not think about the way they do their work as adhering to a “workflow”. Nonetheless, a method, a specific approach, or a research routine can be regarded as a workflow. In computing, for example, design methods, practices of writing, editing, filing and distributing software products can be seen as workflows, too. We deliberately played with this concept to highlight that we all, regardless of the discipline, do have “formats” that we adhere to in our work and especially where these are tacit we need to bring them to the fore in order to open ourselves up to reflection and critique.

  • Drawn workflow of algorithmic legal analysis
  • Drawn workflow of critical technical practice analysis
  • Drawn workflow of an unidentified research process

Above: Some sample workflows from different disciplines

This was for us a way to invite everyone to summarise their discipline’s workflow in any representation they like (a graphic, a comic, a text), and to share that amongst each other, as an alternative “self-intro”. It was an essential part for building common ground in such a short time – understanding the way the other works, the kind of questions different people zoom in on or bring into attention. It offered the possibility to connect with one another, despite everyone having very different jobs and output formats in their lines of work. For example, one participant working in policy first-hand observed the iterative cycles that take place also in coding. The question then became, how to build upon these first conversations and proceed from there.

After the groups shared their workflows, we presented a “default” technical workflow for building an algorithmic system. This served as a blueprint and a basis to start the debate about how we could design workflows otherwise. In other words, the second part of the exercise was to start debating how this “prototypical” technical workflow could be changed towards ideals of epistemic justice. Drawing from their own experience with algorithmic systems and on their own discipline-specific perspectives, the participants in various groups annotated, drew on and redrew the technical workflow in order to make it more interdisciplinary and more oriented towards epistemic justice (we will share examples and more details of the resulting workflows in the next post).

Standard machine learning application design workflow diagram
“Default” technical workflow that participants were asked to change towards epistemic justice, based on incorporating their own knowledge and disciplinary perspectives

A shared knowledge space requires shared objects upon which ideas and assumptions can be worked on and tested against: concrete examples, real world applications, case studies. For our workshop, we picked the prototypical machine learning design workflow as our shared object, which the different groups tackled in various ways. As predicted, the workflow was quickly understood as being far too simplistic, but the responses differed: some groups rejected it altogether, while others had difficulty deviating from it. How to represent the rich and messy realities of people’s different practices on paper? We were interested in the kind of examples and personal experiences that would come to inform the changes participants made to the standard workflow before them, and asked them to pay particular attention to instances of ‘conflict’ or difficulties.

Some chose to abandon the presented model immediately, using it merely as a springboard for discussion. Some noticed how quickly they moved from modelling the workflow to “modelling the world”, and how trying to capture all possible dynamics would render the model humongous and useless. In one group, participants decided to use a case study to test their ideas. Immediately, the case changed the model, showing how powerful a clear case can be for influencing algorithmic design, and for establishing common ground. Perhaps a case of scientifically produced knowledge that was uncovered as bad science could be used to drive the discussion. Critical theory is rife with examples, and the workshop setting could perhaps try to tease out to what extent ‘workflows’ (in our broad understanding) were at play in allowing wrongful authority to be established. These interactions bring out the broader questions our workshop attempted to tackle: to what extent can one plan epistemically just collaboration from an abstract workflow? Are general models of working together even useful, or should such models always be case-specific? Can interdisciplinarity be formalized into a method, or should it adhere in more flexible ways to predispositions, normative concepts such as epistemic responsibility and justice?

Technical sciences tend to be familiar with working with abstract workflow models, and management studies offer multiple flowcharts detailing effective process management. Methods-turned-adjectives-turned-models such as “agile”, “lean” and others offer ways to steer processes of work without burdening them with too rigid a structure. However, one of the biggest advantages of interdisciplinary work could be exactly the non-smoothness – the necessary pauses in the process in order to explain concepts, to address concerns and to find collaborative forms of research. It calls for sensibilities to pay attention to epistemic, social and political power dynamics that are inevitably alive in knowledge making. Further, the value that the disciplines within social sciences and the humanities have to offer comes not from formalized work processes but from their interpretative power and ability to examine context, history, background and positionality, all of which are case-specific.

Rather than nailing down how interdisciplinary collaboration is best organised, and how to deal with e.g. formalization versus interpretative flexibility and case specificity, we intended to come up with insights to better inform, and deal with the ‘how to’ questions so many teams are struggling with. We depart from the assumption that some middle ground is needed to productively discuss these questions, and aim to come up with tactics to ‘prep’ the deliberative space to do so. Hybrid spaces could be constructed where all “sides” of the sciences come forth with the best they have to offer. Would such hybrid workflows flow? This is the experiment that is still to be done.


Lost in translation? Invitation to address the challenges of interdisciplinary cooperation in the FAT community

This blogpost was written by Aviva de Groot, Danny Lämmerhirt, Evelyn Wan, Goda Klumbyte, Mara Paun, Phillip Lücking, and Shazade Jameson.

Introduction: The short of it

The rapid deployment of complex computational, data-intense infrastructures profoundly influences our human environments: private and public, social, commercial, and institutional, and on a global scale. Adverse effects have been more and less predictable, and challenging to reveal. Calls for creating a more responsible practice are heard from many sides. Accountability, fairness, and transparency are much used terms, and the FAT conference series has aimed precisely at unpacking what this means for AI development.

This has not been easy, nor can it be. The highly interdisciplinary effort that needs to be made poses specific challenges. There are gaps in understanding between those who design systems of AI/ML and those who critique them, and among the latter. These gaps can be defined in multiple ways: methodological, epistemological, linguistic, and cultural. How can we hack our systematized research and design patterns towards new, communal methodologies?

This workshop, which will take place in Barcelona on the 29th of January as part of FAT* 2020, responds to these challenges. In a 3-hour effort, we will translate and thereby ‘explode’ common workflow patterns in AI design to a multidisciplinary setting. In bridging gaps (in) between existing criticisms of machine learning and the practice of design principles and mechanisms, we aim to build a common ground for computer scientists, practitioners, and researchers from social sciences and humanities to work together. The method we use aims to identify actionable points for our respective work, and a more fundamental appreciation of what it means to combine our methods.

More concretely

The fact that knowledge of AI/ML has been concentrated in the hands of too few is an injustice much addressed within the broader FAT community. Two things erase the potential for crafting otherwise: a lack of diversity in the workforce, and a one-dimensional technical perspective from which ML technology is perceived, designed and deployed (i.e. crafted), with design logics that revolve around terms like “optimization”, “efficiency”, “fixing” and performance scores. Opening up the ML community, but also embedding its research in more diverse, multi/interdisciplinary settings is called for, and loudly.

Crafting otherwise requires us to examine both the contents and methods of working involved in research on these techniques and their employment. We see such work as ‘epistemic practices’ that are inevitably value-laden as they are tied up with the history, challenges and traditions of all connected disciplinary fields. Terms like fairness, optimal, or causal have different meanings in different fields, but there are tougher challenges. Our methods for knowledge production differ greatly, between technical and non-technical, quantitative and qualitative perspectives. All of these compete for a voice in the public discourse, the place where many hope to see the ‘informed, democratic debate’ take place. We propose the term ‘epistemic justice’ in order to probe reflexively into the basis from which we operate, whether as critical scholars or as designers. Opening up our methods for others to understand means honestly translating, and that entails some soul searching. What informs our methods of working? What assumptions do we take on? Whose voices do we afford authority, and why? What do our different disciplines identify as key principles or key procedures that need a fundamental place in the design process? When do we call for NOT designing anything? How do we see various disciplines come together to articulate a larger shared vision? In short, how do we do our “epistemic best” in a multidisciplinary setting?

We are a group of scholars with a background in law, science and technology studies, media studies, computer science, and gender studies. Our shared fascination and puzzlement with these questions prompted us to organize a workshop during the ACM FAT* conference as part of the call for sessions to critique and rethink accountability, fairness and transparency.

Our workshop is part of the ongoing effort to cultivate more reflexive epistemic practices in the interdisciplinary research setting of FAT*.

This 3-hour workshop will be structured as follows:

  1. Introduction/ short presentations by facilitators
  2. Charting methodological workflows: Participants will document, share, and discuss their usual workflows to analyse algorithmic systems. During that exercise, participants will compare their workflows across disciplines, and compare their experiences with a prototypical AI design workflow.
    Our goal for this exercise is to make different disciplinary workflows visible and to develop a critical design workflow for AI which enables epistemic justice based on different disciplinary experiences. This will be accompanied by brief presentations of these new critical workflows, explaining their logic, their advantages, the type of questions or tasks they would be most useful for, and their limitations.
  3. Translation cartographies: Groups will be invited to reflect on their own interdisciplinary process of critiquing algorithmic models. In this exercise we will surface and discuss the terms and concepts that could assist or inhibit collaborative AI design. As disciplines draw attention to different problems and questions, and frame their entry point to AI in different terms, we will map both the necessary terms as well as contested terms that are important for collaboration. Finally, participants will develop a glossary to accompany the new hybrid workflow.
  4. In a closing plenary we will reflect on what you have learned in the workshop, the thoughts our session has provoked, and how we imagine putting the ideas from the workshop to use.

How can you get engaged?

Are you interested in multi-disciplinary work around the design and use of AI systems and planning to attend ACM FAT*? Then join us at the workshop during the ACM FAT* conference in Barcelona on January 29 (note: the workshop will be limited to 30 participants)!

More details can be found at the FAT* 2020 website as well as the University of Kassel website. Please note that we will provide more background in the session itself. With the documents we are currently preparing to hand out, you will also find a glossary of key concepts and further reading.

In order to help this workshop be fruitful to all, we encourage you to share with us your experiences in advance on this pad.

  • Tell us how you would describe a standard workflow in your discipline. In other words, what are the standard steps that one takes in your discipline to approach and perform a research or design task?
  • Do you have experiences with design processes and formats to facilitate interdisciplinarity around ML? Or would you like to share your experiences of how interdisciplinarity has worked for you?

You will soon be able to access the abstract of our conference proceeding at this DOI. We have also prepared a reading list for anyone interested in the topic (here) and for those unable to attend the conference. We would like to build a collective resource that many people can draw from. If you know relevant reading lists or literature, feel free to suggest these in the document.

Authors’ note:

Aviva de Groot is a PhD candidate at the Tilburg Institute for Law, Technology, and Society (TILT) at Tilburg University. Her thesis ‘“Care to Explain?” Articulating legal demands to explain AI-infused decisions, responsibly’ addresses explainability concerns through the lens of epistemic justice. How can modern day decision makers in and on the loop of these processes maintain a responsible relation with decision subjects?

Danny Lämmerhirt is a PhD candidate at the Locating Media Graduate School at University of Siegen. His dissertation project draws from STS, economic sociology, and technography to explore the role of devices in organising bottom-up health data cooperatives and their collective data practices to valorize data.

Goda Klumbyte is a PhD candidate and research associate at the Gender/Diversity in Informatics Systems group at the University of Kassel. Her dissertation focuses on knowledge production in and through machine learning systems from feminist and post/de-colonial perspectives.

Phillip Lücking is a PhD candidate and research associate at the Gender/Diversity in Informatics Systems group at the University of Kassel. His research interests encompass contemporary topics of computer science such as machine learning and robotics in relation to their societal impacts, with a special interest in how modern digital technology can be utilized for social good.

Dr. Evelyn Wan is a postdoctoral researcher at the Tilburg Institute for Law, Technology, and Society (TILT) at Tilburg University, and an affiliated researcher at the Institute for Cultural Inquiry at Utrecht University. Her work on the politics of digital culture and algorithmic governance straddles media and performance studies, gender and postcolonial theory, and legal and policy research.

Mara Paun is a PhD candidate at the Tilburg Institute for Law, Technology, and Society (TILT) at Tilburg University in the ERC-funded project “Understanding information for legal protection of people against information-induced harms”.

Shazade Jameson is a PhD candidate at the Tilburg Institute for Law, Technology, and Society (TILT) at Tilburg University on the ERC-funded “Global Data Justice” project.


How open is government data in Africa?

This blogpost was originally published by Open Knowledge International on

Today, we are pleased to announce the results of Open Knowledge International’s Africa Open Data Index. This regional version of our Global Open Data Index collected baseline data on open data publication in 30 African countries to provide input for the second Africa Data Revolution Report.

This project mapped out to what extent African public institutions make key datasets available as open data online. Beyond scrutinising data availability, digitisation degree, and openness of national datasets, we considered the broader landscape of actors involved in the production of government data such as private actors.

The key datasets considered are:

  • Administrative records: budgets, procurement information, company registers
  • Legislative data: national law
  • Statistical data: core economic statistics, health, gender, educational and environmental statistics
  • Infrastructural data
  • Agricultural data
  • Election results
  • Geographic information and land ownership

Key datasets and methodology were developed in collaboration with the United Nations Development Programme (UNDP), the International Development Research Centre (IDRC), as well as the World Wide Web Foundation. We focused on national key datasets such as

  1. data describing processes of government bodies at the highest administrative level (e.g. federal government budgets);
  2. data produced by sub-national actors but collected by a national agency (e.g. certain statistical information).

We also recorded whether data was available at sub-national levels or from private companies, but did not assign scores to these datasets. You can find the detailed methodology here.

Screenshot of the Africa Open Data Index Interface

Understanding who produces government data

Many government agencies produce at least parts of the key datasets we assessed. Some key datasets, such as environmental data, are rarely produced. For instance, air pollution and water quality data are sometimes produced in individual administrative zones, but not at the national level. Some initiatives, such as REDD+ or the Congo Basin Forest Atlases, assist in producing data on deforestation with the support of the World Resources Institute (WRI) and USAID.

Multiple search strategies may be required to identify agencies producing and publishing official records. Some agencies develop public databases, search interfaces and other dedicated infrastructure to facilitate search and retrieval. Statistical yearbooks are another useful access point to several information groups, including economic and social statistics as well as figures on environmental degradation or market figures. In several cases it was necessary to consult third-party literature, such as the World Bank’s Land Governance Assessment Framework (LGAF) and reports issued by the Extractive Industries Transparency Initiative (EITI), to identify which public institutions hold the remit to collect data.

Sometimes, private companies provide data infrastructure to aggregate and host data centrally. For instance, the company Trimble develops data portals for the extractives sector in 15 countries in Africa. These data portals are used to publish data on mining concessions, including geographic boundaries, the size of territory, concession types, licensees, or contract start and duration.

Procuring data infrastructure from private organisations

While Trimble’s portals are a useful central access point, their terms of use do not comply with open licensing requirements. This points to a larger concern regarding appropriate licensing schemes and how they can be integrated into the procurement process. We propose that multi-stakeholder initiatives such as the Extractive Industries Transparency Initiative (EITI) and national multi-stakeholder groups define appropriate terms of use when procuring services, if possible using standard open licences, in order to ensure an appropriate degree of openness and public access and to prevent lock-in.

An alternative information aggregator that uses open licence terms is the African Legal Information Institute (AfricanLII), which gathers national legal code from several African countries. It is a programme of the Democratic Governance and Rights Unit at the Department of Public Law at the University of Cape Town.

Sometimes stark differences in what data gets published

To test what data gets published online, we defined crucial data points to be included in every key data category (see here). If at least one of these data points was found online, we considered the data category for assessment. This means that the datasets we assessed can differ in completeness across countries. Figure 1 shows how often each data point is provided across our sample of 30 countries.
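The inclusion rule and the per-data-point percentages described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names, the data-point labels and the placeholder countries are hypothetical and are not taken from the Index methodology or its data.

```python
from typing import Dict, Set


def is_assessed(found_points: Set[str], crucial_points: Set[str]) -> bool:
    """A data category is assessed if at least one crucial data point was found online."""
    return len(found_points & crucial_points) > 0


def availability_share(findings: Dict[str, Set[str]], point: str) -> float:
    """Percentage of countries in which a given data point was found (100 = all countries)."""
    if not findings:
        raise ValueError("findings must contain at least one country")
    countries_with_point = sum(1 for points in findings.values() if point in points)
    return 100.0 * countries_with_point / len(findings)


# Hypothetical crucial data points for a budget category (assumed labels).
BUDGET_POINTS = {"total_budget", "spending_by_department"}

# Hypothetical findings for three placeholder countries (not real results).
budget_findings = {
    "Country A": {"total_budget", "spending_by_department"},
    "Country B": {"total_budget"},
    "Country C": set(),
}

for country, found in budget_findings.items():
    print(country, "assessed:", is_assessed(found, BUDGET_POINTS))

print("total_budget availability:",
      round(availability_share(budget_findings, "total_budget"), 1), "%")
```

Because one data point is enough for inclusion, two assessed categories can differ a lot in completeness, which is why the per-data-point view is reported separately.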

Budget and procurement data most often contain the relevant data points we have assessed. Several key statistical indicators are provided fairly commonly, too. Agricultural data, environmental data and land ownership data are least commonly provided. For a more thorough analysis we recommend reading the Africa Data Revolution Report, pages 16-22.

Figure 1: Percentages of data points found across key datasets. Percentage relative to the total number of countries (100% = data point available in 30 countries). Source: Africa Data Revolution Report, pp. 19-20.

One third of the data is provided in a timely manner

To assess timely publication, our research considered whether governments publish data at a particular update frequency. Figure 2 shows a clear difference in timely data provision across different data types. The y-axis indicates the percentage of countries publishing updated information. A score of 100 would indicate that all 30 countries in the sample publish a data category in a timely fashion.

Figure 2: Percentage of updated datasets, per data category.

We found significant differences across individual data categories and countries. Roughly three out of four countries update their budget data (80% of all countries), national laws (73% of all countries) and procurement information (70% of all countries) in a timely manner. Approximately half of all countries publish updated election records (50% of all countries), or keep their company registers up to date (47% of all countries). All other data categories are published in a timely manner only by a fraction of the assessed countries. For instance, the majority of countries do not provide updated statistical information.

We strongly advise interpreting these findings as trends rather than as representative measures of timely data publication. There are several reasons for this. In some data categories, we included considerably more, and more diverse, data points. For instance, the agricultural data category includes not only statistics on crop yields but also short-term weather forecasts. If one of these data types was not provided in a timely manner, the data category was considered not to be updated. Furthermore, if a country did not provide timestamps and metadata, we did not consider the data to be updated, as we were unable to prove the opposite.
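The strictness of this rule is easier to see when written out. The following is a minimal sketch, assuming a simple record per data type; the class, field and function names are hypothetical, and the actual scoring in the Index may have differed in its details.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataType:
    """One data type within a category, e.g. crop yields (hypothetical structure)."""
    name: str
    has_timestamp: bool  # timestamps/metadata were published alongside the data
    is_timely: bool      # published within the expected update frequency


def category_is_updated(data_types: List[DataType]) -> bool:
    """A category counts as updated only if every data type is timely
    and carries a timestamp; missing timestamps count as not updated."""
    if not data_types:
        return False
    return all(d.has_timestamp and d.is_timely for d in data_types)


# Hypothetical agricultural category: one late data type fails the whole category.
agriculture = [
    DataType("crop_yields", has_timestamp=True, is_timely=True),
    DataType("weather_forecasts", has_timestamp=True, is_timely=False),
]
print(category_is_updated(agriculture))  # False

# Timely data without timestamps is also treated as not updated.
budget = [DataType("budget", has_timestamp=False, is_timely=True)]
print(category_is_updated(budget))  # False
```

Under such an all-or-nothing rule, categories that bundle many diverse data types are more likely to score as not updated, which is one reason to read the figures as trends.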

Open licensing and machine-readability

Only 6% of all data (28 out of 420 datasets assessed) is openly licensed in compliance with the criteria laid out by the Open Definition. Open licence terms are used by statistical offices in Botswana, Senegal, Rwanda, and Somalia, as well as by open data portals in Côte d’Ivoire, Eritrea, Kenya and Mauritius. Usually, websites provide copyright notices but do not apply licence terms dedicated to the website’s data. In rare cases we found a Creative Commons Attribution (CC-BY) licence being used. More common are bespoke terms that are compliant with the Open Definition.

14.5% of all data (61 out of 420 datasets assessed) is provided in at least one machine-readable format. Most data, however, is provided in printed reports, digitised as PDFs, or embedded on websites in HTML. Importantly, some types of data, such as land records, may still be in the process of digitisation. If we found that governments hold paper-based records, we tested whether our researchers could request the data. If they could not, we excluded the data from our assessment.
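For readers who want to check the shares quoted in this section against the raw counts, here is a minimal Python sketch. The counts (28, 61 and 420) come from the text; the function name `share` and the rounding to one decimal place are assumptions made for illustration.

```python
TOTAL_DATASETS = 420  # datasets assessed, as stated in the text


def share(count: int, total: int = TOTAL_DATASETS) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


openly_licensed = share(28)   # datasets under an Open Definition compliant licence
machine_readable = share(61)  # datasets in at least one machine-readable format
print(openly_licensed, machine_readable)  # prints: 6.7 14.5
```

At one decimal place the openly licensed share comes to about 6.7%, so the 6% quoted above appears to be rounded down.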


The following recommendations are excerpts from the Africa Data Revolution Report 2018. A comprehensive list of recommendations can be found in the report itself.

On the basis of our findings we recommend that public institutions:

  • Communicate clearly on their agency websites what data they are collecting about different government activities.
  • Clarify which data has authoritative status in case multiple versions exist: Metadata must be available clarifying provenance and authoritative status of data. This is important in cases where multiple entities collect data, or whenever governments gather data with the help of international organisations, bilateral donors, foreign governments, or others.
  • Make data permanently accessible and findable: Data should be made available at a permanent internet location and in a stable data format for as long as possible. Avoid broken links and provide links to the data whenever you publish data elsewhere (for example via a statistical agency). Add metadata to ensure that data can be understood by citizens and found via search engines.
  • When procuring data, define a set of terms of use to ensure the appropriate degree of openness: Private vendors may want to license data under proprietary terms, which may limit data accessibility. Research found that many data-intense projects in development contexts use haphazard, proprietary licence terms, which may prevent the public from accessing data, increase the complexity of terms of use, and raise the cost of data access.
  • Provide data in machine-readable formats: Ensure that data is processable. Raw data must be published in machine-readable formats that are user friendly.
  • Use standard open licences: Use CC0 for public domain dedication or standardized open licences, preferably CC BY 4.0. They can be reused by anyone, which helps ensure compatibility with other datasets. Clarify if data falls under the scope of copyright, or similar rights. If information is in the public domain, apply legally non-binding notices to your data. If you opt for a custom open licence, ensure compatibility with the Open Definition. It is strongly recommended to submit the licence for approval under the Open Definition.
  • Avoid confusion around licence terms: Attach the licence clearly to the information to which it applies. Clearly separate a website’s terms and conditions from the terms of open licences. Maintain stable links to licences so that users can access licence terms at all times.

What to do next?

We have gathered all raw data in a summary spreadsheet. Browse the results and use the links we provide to reach a dataset of interest directly.

If you are interested in specific country assessments, please find here our research diaries.

The Open Data Survey tool, powering this project as well as our Global Open Data Index is open to be reused. If you are interested in setting up a regional or national version, get in touch with us at


We would like to thank our partners at the United Nations Development Programme, the International Development Research Centre and the Web Foundation for support, as well as the experts at Local Development Research Institute (LDRI), the Communauté Afrique Francophone pour les Données Ouvertes (CAFDO) and the Access to Knowledge for Development Center (A2K4D) at the American University, Cairo for advising on the methodology and their support throughout the research process. Furthermore, we would like to thank our 30 country researchers, as well as our expert reviewers Codrina Maria Ilie, Jennifer Walker, and Oscar Montiel.

Open data governance and open governance: interplay or disconnect?

This piece was written by Ana Brandusescu, Carlos Iglesias, Danny Lämmerhirt, Stefaan Verhulst (in alphabetical order). It was originally published via Open Knowledge International on

The presence of open data often gets listed as an essential requirement toward “open governance”. For instance, an open data strategy is reviewed as a key component of many action plans submitted to the Open Government Partnership. Yet little time is spent on assessing how open data itself is governed, or how it embraces open governance. For example, not much is known on whether the principles and practices that guide the opening up of government – such as transparency, accountability, user-centrism, ‘demand-driven’ design thinking – also guide decision-making on how to release open data.

At the same time, data governance has become more complex and open data decision-makers face heightened concerns with regards to privacy and data protection. The recent implementation of the EU’s General Data Protection Regulation (GDPR) has generated an increased awareness worldwide of the need to prevent and mitigate the risks of personal data disclosures, and that has also affected the open data community. Before opening up data, concerns of data breaches, the abuse of personal information, and the potential of malicious inference from publicly available data may have to be taken into account. In turn, questions of how to sustain existing open data programs, user-centrism, and publishing with purpose gain prominence.

To better understand the practices and challenges of open data governance, we have outlined a research agenda in an earlier blog post. Since then, and perhaps as a result, governance has emerged as an important topic for the open data community. The audience attending the 5th International Open Data Conference (IODC) in Buenos Aires deemed governance of open data to be the most important discussion topic. For instance, discussions around the Open Data Charter principles during and prior to the IODC acknowledged the role of an integrated governance approach to data handling, sharing, and publication. Some conclude that the open data movement has brought about better governance, skills, and technologies of public information management, which are of enormous long-term value for government. But what does open data governance look like?

To expand our earlier exploration and broaden the community that considers open data governance, we convened a workshop at the Open Data Research Symposium 2018. Bringing together open data professionals, civil servants, and researchers, we focused on:

  • What is open data governance?
  • When can we speak of “good” open data governance, and
  • How can the research community help open data decision-makers toward “good” open data governance?

In this session, open data governance was defined as the interplay of rules, standards, tools, principles, processes and decisions that influence what government data is opened up, how and by whom. We then explored multiple layers that can influence open data governance.

In the following, we illustrate possible questions to start mapping the layers of open data governance. As they reflect the experiences of session participants, we see them as starting points for fresh ethnographic and descriptive research on the daily practices of open data governance in governments.

Figure: Schema of an open data governance model

The management layer

Governments may decide about the release of data on various levels. Studying the management side of data governance could look at decision-making methods and devices. For instance, one might analyze how governments gauge public interest in their datasets: through data request mechanisms, user research, or participatory workshops? What routine procedures do governments put in place to interact with other governments and the public? For instance, how do governments design routine processes for handling open data requests? How are disputes over open data release settled? How do governments enable the public to address non-publication? One might also study cost-benefit calculations and similar methodologies to evaluate data, and how they inform governments about what data counts as crucial and is expected to bring returns and societal benefits.

Understanding open data governance would also require studying how open data creation, cleaning, and publication are themselves managed. Governments may choose to organise open data publication and maintenance in house, or seek collaborative approaches known from data communities like OpenStreetMap.

Another key component is funding and sustainability. Funding might influence management on multiple layers – from funding capacity building, to investing in staff innovations and alternative business models for government agencies that generate revenue from high value datasets. What do these budget and sustainability models look like? How are open data initiatives currently funded, under what terms, for how long, by whom and for what? And how do governments reconcile the publication of high value datasets with the need to provide income for public government bodies? These questions gain importance as governments move towards assessing and publishing high value datasets.

Open governance and management: To what extent is management guided by open governance? For instance, how participatory, transparent, and accountable are decision-making processes and devices? How do governments currently make space for more open governance in their management processes? Do governments practice more collaborative data management with communities, for example to maintain, update, verify government data?   

The Legal and Policy layer

The interplay between legal and policy frameworks: Open data policies operate among other legal and policy frameworks, which can complement, enable, or limit the scope of open data. New frameworks such as GDPR, but also existing right to information and freedom of expression frameworks prompt the question of how the legal environment influences the behavior and daily decision-making around open data. To address such questions, one could study the discourse and interplay between open data policies as well as tangential policies like smart city or digitalization policies.

Implementation of law and policies: Furthermore, how are open data frameworks designed to guide the implementation of open data? How do they address governmental devolution? Open data governance needs to stretch across all government levels to unlock data from all government levels. What approaches are experimented with to coordinate the implementation of policies across jurisdictions and government branches? To what agencies do open data policies apply, and how do they enable or constrain choices around open data? What agencies define and move forward open data, and how does this influence adoption and sustainability of open data initiatives?

Open governance of law and policy: Besides studying the interaction of privacy protection, right to information, and open data policies, how could open data benefit from policies enabling open governance and civic participation? Do governments develop more integrated strategies for open governance and open data, and if so, what policies and legal mechanisms are in place? How do these laws and policies enable other aspects of open data governance, including more participatory management and more substantive and legally supported citizen participation?

The Technical and Standards layer

Governments may have different technical standards in place for data processing and publication, from producing data to quality assurance processes. Some research has looked into the ways data standards for open data alter the way governments process information. Others have argued that the development of data standards reflects how governments envisage citizens, primarily catering to tech-literate audiences.

(Data) standards do not only represent, but intervene in the way governments work. Therefore, they could substantially alter the ways government publishes information. Understood this way, how do standards enable resilience against change, particularly when facing shifting political leadership?

On the other hand, most government data systems are not designed for open data. Too often, governments are struggling to transform huge volumes of government data into open data using manual methods. Legacy IT systems that have not been built to support open data create additional challenges to developing technical infrastructure, but there is no single global solution to data infrastructure. How, then, could governments transform their technical infrastructure to allow them to publish open data efficiently?

Open governance and the technical / standards layer: If standards can be understood as bridge-building devices, or tools for cooperation, how could open governance inform the creation of technical standards? Do governments experiment with open standards, and if so, what standards are developed, to what end, using what governance approach?

The capacity layer

Staff innovations may play an important role in open data governance. What is the role of chief data officers in improving open data governance? Could the usual informal networks of open data curators within government and a few open data champions make open data a success on their own? What role do these innovations play in making decisions about open data and personal data protection? Could governments rely solely on senior government officials to execute open data strategies? Who else is involved in the decision-making around open data release? What are the incentives and disincentives for officials to increase data sharing? As one session participant mentioned: “I have never experienced that a civil servant got promoted for sharing data”. This raises the question of whether and how governments currently use performance metrics that support opening up data. What other models could help reward data sharing and publication? In an environment of decreased public funding, are there opportunities for governments to integrate open data publication in existing engagement channels with the public?

Open governance and capacity: Open governance may require capacities in government, but could also contribute new capacities. This can apply to staff, but also resources such as time or infrastructure. How do governments provide and draw capacity from open governance approaches, and what could be learnt for other open data governance approaches?  

Next steps

With this map of data governance aspects as a starting point, we would like to conduct empirical research to explore how open data governance is practised. A growing body of ethnographic research suggests that tech innovations such as algorithmic decision-making, open data, or smart city initiatives are ‘multiples’ — meaning that they can be practiced in many ways by different people, arising in various contexts.

With such an understanding, we would like to develop empirical case studies to elicit how open data governance is practised. Our proposed research approach includes the following steps:

  • Universe mapping: Identifying public sector officials and civil servants involved in deciding how data gets managed, shared and published openly (this helps to get closer to the actual decision-makers, and to learn from them).
  • Describing how and on what basis (legal, organisational & bureaucratic, technological, financial, etc.) people make decisions on what gets published and why.
  • Observe and describe different approaches to do open data governance, looking at enabling and limiting factors of opening up data.
  • Describe gaps and areas of improvement with regards to open data governance, as well as best practices.

This may surface how open data governance becomes salient for governments, under what circumstances and why. If you are a government official, or civil servant working with (open) data, and would like to share your experiences, we would like to hear from you!  

What data counts in Europe? Towards a public debate on Europe’s high value data and the PSI Directive

This post was jointly written by Danny Lämmerhirt, Pierre Chrzanowski, and Sander van der Waal. It was originally published via Open Knowledge International on

January 22 will mark a crucial moment for the future of open data in Europe. That day, the final trilogue between the European Commission, Parliament, and Council is planned to decide on the ratification of the updated PSI Directive. Among other things, the European institutions will decide what counts as ‘high value’ data. What essential information should be made available to the public and how those data infrastructures should be funded and managed are critical questions for the future of the EU.

As we will discuss below, there are many ways one might envision the collective ‘value’ of those data. This is a democratic question and we should not be satisfied with an ill-defined and overly broad proposal. We therefore propose to organise a public debate to collectively define what counts as high value data in Europe.

What does the PSI Directive say about high value datasets?

The European Commission provides several hints in the current revision of the PSI Directive on how it envisions high value datasets. They are determined by one of the following ‘value indicators’:

  • The potential to generate significant social, economic, or environmental benefits,
  • The potential to generate innovative services,
  • The number of users, in particular SMEs,  
  • The revenues they may help generate,  
  • The data’s potential for being combined with other datasets,
  • The expected impact on the competitive situation of public undertakings.

Given the strategic role of open data for Europe’s Digital Single Market, these indicators are not surprising. But as we will discuss below, there are several challenges defining them. Also, there are different ways of understanding the importance of data.

The annex of the PSI Directive also includes a list of preliminary high value data, drawing primarily from the key datasets defined by Open Knowledge International’s (OKI’s) Global Open Data Index, as well as the G8 Open Data Charter Technical Annex. See the proposed list in the table below.

List of categories and high-value datasets:

1. Geospatial Data: Postcodes, national and local maps (cadastral, topographic, marine, administrative boundaries).
2. Earth observation and environment: Space and in situ data (monitoring of the weather and of the quality of land and water, seismicity, energy consumption, the energy performance of buildings and emission levels).
3. Meteorological data: Weather forecasts, rain, wind and atmospheric pressure.
4. Statistics: National, regional and local statistical data with main demographic and economic indicators (gross domestic product, age, unemployment, income, education).
5. Companies: Company and business registers (list of registered companies, ownership and management data, registration identifiers).
6. Transport data: Public transport timetables of all modes of transport, information on public works and the state of the transport network including traffic information.

According to the proposal, regardless of who provides them, these datasets shall be available for free, machine-readable and accessible for download, and where appropriate, via APIs. The conditions for re-use shall be compatible with open standard licences.

Towards a public debate on high value datasets at EU level

There have been attempts by EU Member States to define what constitutes high-value data at national level, with different results. In Denmark, basic data has been defined as the five core types of information that public authorities use in their day-to-day case processing and should release. In France, the law for a Digital Republic aims to make available reference datasets that have the greatest economic and social impact. Estonia relies on the X-Road infrastructure to connect core public information systems, but most of the data remains restricted.

Now is the time for a shared definition of what constitutes high-value datasets at EU level. This implies an agreement on how we should define them. However, as it stands, there are several issues with the value indicators that the European Commission proposes.

For example, how does one define the data’s potential for innovative services? How can revenue gains be confidently attributed to the use of open data? How does one assess and compare the social, economic, and environmental benefits of opening up data? Anyone designing these indicators must be very cautious, as metrics to compare social, economic, and environmental benefits may come with methodological biases. Research found, for example, that comparing economic and environmental benefits can unfairly favour data of economic value at the expense of fuzzier social benefits, as economic benefits are often more easily quantifiable and definable by default.

One form of debating high value datasets could be to discuss what data gets currently published by governments and why. For instance, with their Global Open Data Index, Open Knowledge International has long advocated for the publication of disaggregated, transactional spending figures. Another example is OKI’s Open Data For Tax Justice initiative which wanted to influence the requirements for multinational companies to report their activities in each country (so-called ‘Country-By-Country-Reporting’), and influence a standard for publicly accessible key data.  

A public debate on high value data should critically examine the European Commission’s considerations regarding the distortion of competition. What market dynamics are engendered by opening up data? To what extent do existing markets rely on scarce and closed information? Does closed data bring about market failure, as some argue (Zinnbauer 2018)? Could it otherwise hamper fair price mechanisms (for a discussion of these dynamics in open access publishing, see Lawson, Gray and Mauri 2015)? How would open data change existing market dynamics? What actors claim that opening data could cause market distortion, and whose interests do they represent?

Lastly, the European Commission does not yet consider cases of government agencies  generating revenue from selling particularly valuable data. The Dutch national company register has for a long time been such a case, as has the German Weather Service. Beyond considering competition, a public debate around high value data should take into account how marginal cost recovery regimes currently work.

What we want to achieve

For these reasons, we want to organise a public discussion to collectively define

  1. What should count as a high value dataset, and based on what criteria,
  2. What information high value datasets should include,
  3. What the conditions for access and re-use should be.

The PSI Directive will set the baseline for open data policies across the EU. We are therefore at a critical moment to define what European societies value as key public information. What is at stake is not only a question of economic impact, but the question of how to democratise European institutions, and the role the public can play in determining what data should be opened.

How you can participate

  1. We will use the Open Knowledge forum as main channel for coordination, exchange of information and debate. To join the debate, please add your thoughts to this thread or feel free to start a new discussion for specific topics.
  2. We gather proposals for high value datasets in this spreadsheet. Please feel free to use it as a discussion document, where we can crowdsource alternative ways of valuing data.
  3. We use the PSI Directive Data Census to assess the openness of high value datasets.

We also welcome references to scientific papers, blog posts, and other material discussing the issue of high-value datasets. Once we have gathered suggestions for high value datasets, we would like to assess how open the proposed datasets are. This will help to provide European countries with a diagnosis of the openness of key data.

Advancing Sustainability Together: Launching new report on citizen-generated data and its relevance for the SDGs

This post was originally published via Open Knowledge International on

We are pleased to announce the launch of our latest report Advancing Sustainability Together? Citizen-Generated Data and the Sustainable Development Goals. The research is the result of a collaboration with King’s College London, Public Data Lab and the Global Partnership for Sustainable Development Data.

Citizen-generated data (CGD) expands what gets measured, how, and for what purpose. As the collection and engagement with CGD increases in relevance and visibility, public institutions can learn from existing initiatives about what CGD initiatives do, how they enable different forms of sense-making and how this may further progress around the Sustainable Development Goals.

Our report, as well as a guide for governments (find the designed version here, as well as a living document here), is intended to help start conversations around the different approaches to doing and organising CGD. When CGD becomes good enough depends on the purpose it is used for, but also on how CGD is situated in relation to other data.

As our work aims to be illustrative rather than comprehensive, we started with a list of over 230 projects that were associated with the term “citizen-generated data” on Google Search, using an approach known as “search as research” (Rogers, 2013). Starting from this list, we developed case studies on a range of prominent CGD examples.

The report identifies several benefits CGD can bring for implementing and monitoring the SDGs, underlining the importance for public institutions to further support these initiatives.

Figure: Illustration of tasks underpinning CGD initiatives and their workflows

Key findings:

  • Dealing with data is usually much more than ‘just producing’ data. CGD initiatives open up new types of relationships between individuals, civil society and public institutions. This includes local development and educational programmes, community outreach, and collaborative strategies for monitoring, auditing, planning and decision-making.
  • Generating data takes many shapes, from collecting new data in the field, to compiling, annotating, and structuring existing data to enable new ways of seeing things through data. Accessing and working with existing (government) data is often an important enabling condition for CGD initiatives to start in the first place.
  • CGD initiatives can help gather data in regions otherwise not reachable. Some CGD approaches may provide updated and detailed data at lower costs and faster than official data collections.
  • Beyond filling data gaps, official measurements can be expanded, complemented, or cross-verified. This includes pattern and trend identification and the creation of baseline indicators for further research. CGD can help governments detect anomalies, test the accuracy of existing monitoring processes, understand the context around phenomena, and initiate their own follow-up data collections.
  • CGD can inform several actions to achieve the SDGs. Beyond education, community engagement and community-based problem solving, this includes baseline research, planning and strategy development, allocation and coordination of public and private programs, as well as improvement to public services.
  • CGD must be ‘good enough’ for different (and varying) purposes. Governments already develop pragmatic ways to negotiate and assess the usefulness of data for a specific task. CGD may be particularly useful when agencies have a clear remit or responsibility to manage a problem.  
  • Data quality can be comparable to official data collections, provided tasks are sufficiently easy to conduct, tool quality is high enough, and sufficient training, resources and quality assurance are provided.

You can find the full report as well as a summary report here. If you are interested in learning more about citizen-generated data, and how to engage with it, we have prepared a guide for everyone interested in engaging with CGD.

In addition to our report we have gathered a list of more than 200 organisations, programs, and projects working on CGD. This list is open for everyone to contribute further examples of CGD. We have also made our raw dataset of Google search results for “citizen generated data” accessible on figshare.

If you are interested in reading more about the academic discourse around CGD and related fields, or would like to share your own work, we have prepared a Zotero group with relevant literature here.

This report was funded in part by a grant from the United States Department of State. The opinions, findings and conclusions stated herein are those of the author[s] and do not necessarily reflect those of the United States Department of State.

New research to map the diversity of citizen-generated data for sustainable development

We are excited to announce a new research project around citizen-generated data and the UN data revolution. This research will be led by Open Knowledge International in partnership with King’s College London and the Public Data Lab to develop a vocabulary for governments to navigate the landscape of citizen-generated data.

This research elaborates on past work which explored how to democratise the data revolution, how citizen and civil society data can be used to advocate for changes in official data collection, and how citizen-generated data can be organised to monitor and advance sustainability. It is funded by the United Nations Foundation and commissioned by the Task Team on Citizen Generated Data which is hosted by the Global Partnership for Sustainable Development Data (GPSDD).

Our research seeks to develop a working vocabulary of different citizen-generated data methodologies. This vocabulary shall highlight clear distinction criteria between different methods, but also point out different ways of thinking about citizen-generated data. We hope that such a vocabulary can help governments and international organisations attend to the benefits and pitfalls of citizen-generated data in a more nuanced way and will help them engage with citizen-generated data more strategically.

Why this research matters

The past decades have seen the rise of many citizen-generated data projects. A plethora of concepts and initiatives use citizen-generated data for many goals, ranging from citizen science, citizen sensing and environmental monitoring to participatory mapping, community-based monitoring and community policing. In these initiatives citizens may play very different roles (from being assigned the role of mere sensors, to being enabled to shape what data gets collected). Initiatives may differ in the media and technologies used to collect data, in the ways stakeholders are engaged with partners from government or business, or in how activities are governed to align interests between these parties.

Likewise, different actors articulate the concerns and benefits of citizen-generated data (CGD) in different ways. Scientific and statistical communities may be concerned about the data quality and interoperability of citizen-generated data, whereas a community centred around the monitoring of the Sustainable Development Goals (SDGs) may be more concerned with issues of scalability and the potential of CGD to fill gaps in official data sets. Legal communities may consider liability issues for government administrations when using unofficial data, whilst CSOs and international development organisations may want to know what resources and capacities are needed to support citizen-generated data and how to organise and plan projects.

In our work we will address a range of questions including: What citizen-generated data methodologies work well, and for what purposes? What is the role of citizens in generating data, and what can data “generation” look like? How are participation and use of citizen data organised? What collaborative models between official data producers/users and citizen-generated data projects exist? Can citizen-generated data be used alongside or incorporated into statistical monitoring purposes, and if so, under what circumstances? And in what ways could citizen-generated data contribute to regulatory decision-making or other administrative tasks of government?

In our research we will:

  • Map existing literature, online content and examples of projects, practices and methods associated with the term “citizen generated data”;
  • Use this mapping to solicit input and ideas on other kinds of citizen-generated data initiatives, as well as other relevant literatures and practices, from researchers, practitioners and others;
  • Gather suggestions from literature, researchers and practitioners about which aspects of citizen-generated data to attend to, and why;
  • Undertake fresh empirical research around a selection of citizen-generated data projects in order to explore these different perspectives.
Visual representation of the Bushwick Neighbourhood, geo-locating qualitative stories in the map (left image), and patterns of land usage (right image) (Source: North West Bushwick Community project)

Next steps

In the spirit of participatory and open research, we invite governments, civil society organisations and academia to share examples of citizen-generated data methodologies, the benefits of using citizen-generated data and issues we may want to look into as part of our research.

If you’re interested in following or contributing to the project, you can find out more on our forum.

Europe’s proposed PSI Directive: A good baseline for future open data policies?

This blogpost was originally published by Open Knowledge International on

Some weeks ago, the European Commission proposed an update of the PSI Directive**. The PSI Directive regulates the reuse of public sector information (including administrative government data), and has important consequences for the development of Europe’s open data policies. Like every legislative proposal, the PSI Directive proposal is open for public feedback until July 13. In this blog post Open Knowledge International presents what we think are necessary improvements to make the PSI Directive fit for Europe’s Digital Single Market.   

In a guest blogpost, Ton Zijlstra outlined the changes to the PSI Directive. Another blog post by Ton Zijlstra and Katleen Janssen explains the historical background and puts the changes into context. Whilst the proposal makes improvements, we think it is a missed opportunity: it does not support the creation of a Digital Single Market and can pose risks for open data. In what follows, we recommend changes for the European Parliament and the European Council to consider. We also discuss actions civil society may take to engage with the directive in the future, and explain the reasoning behind our recommendations.

Recommendations to improve the PSI Directive

Based on our assessment, we urge the European Parliament and the Council to amend the proposed PSI Directive to ensure the following:

  • When defining high-value datasets, the PSI Directive should not rule out data generated under market conditions. A stronger requirement must be added to Article 13 to make assessments of economic costs transparent, and weigh them against broader societal benefits.
  • The public must have access to the methods, meeting notes, and consultations used to define high-value data. Article 13 must ensure that the public will be able to participate in this definition process to gather multiple viewpoints and limit the risks of biased value assessments.
  • Beyond tracking proposals for high-value datasets in the EU’s Interinstitutional Register of Delegated Acts, the public should be able to suggest new delegated acts for high-value datasets.
  • The PSI Directive must make clear what “standard open licences” are, by referencing the Open Definition, and explicitly recommending the adoption of Open Definition compliant licences (from Creative Commons and Open Data Commons) when developing new open data policies. The directive should give preference to public domain dedication and attribution licences in accordance with the LAPSI 2.0 licensing guidelines.
  • Governments of EU member states that already have policies on specific licences in use should be required to add legal compatibility tests with other open licences to these policies. We suggest following the recommendations outlined in the LAPSI 2.0 resources to run such compatibility tests.
  • High-value datasets must be reusable with the least restrictions possible, subject at most to requirements that preserve provenance and openness. Currently the European Commission risks creating use silos if governments are allowed to add “any restrictions on re-use” to the use terms of high-value datasets.
  • Publicly funded undertakings should only be able to charge marginal costs.
  • Public undertakings, publicly funded research facilities and non-executive government branches should be required to publish data referenced in the PSI Directive.

Our recommendations do not impose unworkable requirements or a disproportionately high administrative burden, but are essential to realise the goals of the PSI Directive with regard to:

  1. Increasing the amount of public sector data available to the public for re-use,
  2. Harmonising the conditions for non-discrimination, and re-use in the European market,
  3. Ensuring fair competition and easy access to markets based on public sector information,
  4. Enhancing cross-border innovation, and an internal market where Union-wide services can be created to support the European data economy.

Our recommendations, explained: What would the proposed PSI Directive mean for the future of open data?

Publication of high-value data

The European Commission proposes to define a list of “high value datasets” that shall be published under the terms of the PSI Directive. This includes publishing datasets in machine-readable formats, under standard open licences, and in many cases free of charge, except when high-value datasets are collected by public undertakings in environments where free access to data would distort competition. “High value datasets” are defined as documents that bring socio-economic benefits, “notably because of their suitability for the creation of value-added services and applications, and the number of potential beneficiaries of the value-added services and applications based on these datasets”. The EC also makes reference to existing high value datasets, such as the list of key data defined by the G8 Open Data Charter.

Identifying high-value data poses at least three problems:

  1. High-value datasets may be unusable in a Digital Single Market: The EC may “define other applicable modalities”, such as “any conditions for re-use”. There is a risk that a list of EU-wide high value datasets also includes use restrictions violating the Open Definition. Given that a list of high value datasets will be transposed by all member states, adding “any conditions” may significantly hinder the reusability of datasets and the ability to combine them.
  2. Defining the value of data is not straightforward. Recent papers from Oxford University, Open Data Watch and the Global Partnership for Sustainable Development Data demonstrate disagreement about what data’s “value” is. What counts as high value data should not only be based on quantitative indicators such as growth indicators, numbers of apps or numbers of beneficiaries, but should also use qualitative assessments and expert judgement from multiple disciplines.
  3. Public deliberation and participation are key to defining high value data and to avoiding biased value assessments. Impact assessments and cost-benefit calculations come with their own methodological biases, and can unfairly favour data with economic value at the expense of fuzzier social benefits. Currently, the PSI Directive does not treat data created under market conditions as high value data if doing so would distort market conditions. We recommend that the PSI Directive adds a stronger requirement to weigh economic costs against societal benefits, drawing from multiple assessment methods (see point 2). The criteria, methods, and processes to determine high value must be transparent and accessible to the broader public, so that the public can negotiate benefits and the viewpoints of many stakeholders are reflected.

Expansion of scope

The new PSI Directive takes into account data from “public undertakings”. This includes services in the general interest entrusted to entities outside of the public sector, over which government maintains a high degree of control. The PSI Directive also includes data from non-executive government branches (i.e. from legislative and judiciary branches of governments), as well as data from publicly funded research. Opportunities and challenges include:

  • None of the data holders which are planned to be included in the PSI Directive are obliged to publish data. It is at their discretion to publish data. Only if they choose to publish data must they follow the guidelines of the proposed PSI Directive.
  • The PSI Directive wants to keep administrative costs low. All of the above-mentioned data sectors are exempt from data access requests.
  • In summary, the proposed PSI Directive leaves too much space for individual choice to publish data and has no “teeth”. To accelerate the publication of general interest data, the PSI Directive should oblige data holders to publish data. Waiting several years to make the publication of this data mandatory, as happened with the first version of the PSI Directive, risks significantly hampering the availability of key data that is important for accelerating growth in Europe’s data economy.
  • For research data in particular, only data that is already published should fall under the new directive. Even though the PSI Directive will require member states to develop open access policies, the implementation thereof should be built upon the EU’s recommendations for open access.

Legal incompatibilities may jeopardise the Digital Single Market

Most notably, the proposed PSI Directive does not address problems around licensing, which are a major impediment for Europe’s Digital Single Market. Europe’s data economy can only benefit from open data if licence terms are standardised. Standardised terms allow datasets from different member states to be combined without legal issues, and make it possible to create cross-country applications and spark innovation. Europe’s licensing ecosystem is a patchwork of many (possibly conflicting) terms, creating use silos and legal uncertainty.

But the current proposal not only speaks vaguely about standard open licences and makes national policies responsible for adding “less restrictive terms than those outlined in the PSI Directive”. It also contradicts its aim to smooth the Digital Single Market by encouraging the creation of bespoke licences, suggesting that governments may add new licence terms with regard to real-time data publication. Currently the PSI Directive would allow the European Commission to add “any conditions for re-use” to high-value datasets, thereby encouraging legal incompatibilities (see Article 13 (4.a)). We strongly recommend that the PSI Directive draws on the EU co-funded LAPSI 2.0 recommendations to understand licence incompatibilities and ensure a compatible open licence ecosystem.

I’d like to thank Pierre Chrzanowski, Mika Honkanen, Susanna Ånäs, and Sander van der Waal for their thoughtful comments while writing this blogpost.

** Its official name is Directive 2003/98/EC on the reuse of public sector information.

The Measurement Guide is out now!

This blogpost was written by Ana Brandusescu (Web Foundation) and Danny Lämmerhirt (Open Knowledge International), co-chairs of the Measurement and Accountability Working Group of the Open Data Charter. It was originally published by the Open Data Charter on 

We are pleased to announce the launch of our Open Data Charter Measurement Guide. The guide is a collaborative effort of the Charter’s Measurement and Accountability Working Group (MAWG). It analyses the Open Data Charter principles and how they are assessed based on current open government data measurement tools. Governments, civil society, journalists, and researchers may use it to better understand how they can measure open data activities according to the Charter principles.

What you can find in the Guide

  • An executive summary for people who want to quickly understand what measurement tools exist and for what principles.
  • An analysis of how each Charter principle is measured, including a comparison of indicators that are currently used to measure each Charter principle and its commitments. This analysis is based on the open data indicators used by the five largest measurement tools: the Web Foundation’s Open Data Barometer, Open Knowledge International’s Global Open Data Index, Open Data Watch’s Open Data Inventory, OECD’s OURdata Index, and the European Open Data Maturity Assessment. For each principle, we also highlight case studies of how Charter adopters have practically implemented the commitments of that principle.
  • Comprehensive indicator tables that show how each Charter principle commitment can be measured. These tables are especially helpful when used to compare how different indices approach the same commitment, and where gaps exist. Here, you can see an example of the indicator tables for Principle 1.
  • A methodology section that details how the Working Group conducted the analysis, mapping existing measurement indices against Charter commitments.
  • A recommended list of resources for anyone who wants to read more about measurement and policy.

The Measurement Guide is available online in the form of a Gitbook and in a printable PDF version. If you are interested in using the indicators to measure open data, visit our indicator tables for each principle, or find the guide’s raw data here.

Do you have comments or questions? Share your feedback with the community using the hashtag #OpenDataMetrics or get in touch with our working group at