Blog

Using ObservableHQ notebooks for gathering and transforming data in digital research

We’ve recently been experimenting with ObservableHQ notebooks for gathering and transforming data in the context of digital research. This post walks through a few examples of notebooks from recent Public Data Lab projects.

In one project we wanted to use the CrowdTangle “Links” API to fetch data about how certain web pages were shared online and across different platforms. After gaining access to the relevant endpoints, there are different ways to call the API and retrieve data: using a general-purpose API client such as Postman, or writing custom scripts (for example in Python or JavaScript).
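
As an illustration of the custom-script route, here is a minimal sketch of calling the “Links” endpoint from JavaScript. The parameter names follow CrowdTangle’s public API documentation, but the token, example URL and response handling below are placeholders rather than code from our notebooks:

```javascript
// Minimal sketch of calling the CrowdTangle "Links" endpoint from a custom
// script. Parameter names follow CrowdTangle's public API documentation;
// the token and example URL are placeholders.
const API_TOKEN = "YOUR_CROWDTANGLE_TOKEN"; // placeholder, not a real token

async function fetchLinkShares(pageUrl) {
  const params = new URLSearchParams({
    token: API_TOKEN,
    link: pageUrl,                   // the web page to look up
    platforms: "facebook,instagram", // platforms to search
    count: "100",                    // number of posts per response
  });
  const response = await fetch(`https://api.crowdtangle.com/links?${params}`);
  if (!response.ok) throw new Error(`Request failed: ${response.status}`);
  const body = await response.json();
  return body.result.posts; // posts that shared the given link
}

fetchLinkShares("https://example.com/article").then((posts) =>
  console.log(`${posts.length} posts found`)
);
```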

Code notebooks are a third option that lies somewhere between these two. Designed for programmers, notebooks allow for iterative manipulation of and experimentation with code, whilst keeping track of the creative process by commenting on the thinking behind each step.

Notebooks allow us both to write and run custom scripts and to create simple interfaces for those who may not code. We can therefore use them to help researchers, students and external collaborators collect data, making it easier to call APIs, set parameters, or perform manipulations.

ObservableHQ is one solution for writing programming notebooks: it runs in the browser and is oriented towards data and visualisations (“We believe thinking with data is an essential skill for the future”). Hence, we thought it could be a good starting point for what we wanted to do.

Screen capture of a notebook

The first notebook that we produced allows researchers to call the CrowdTangle APIs in a simplified way: it exposes the call parameters and, alongside them, provides explanations and warnings about how to set them. For instance, it turns the selection of platforms into checkboxes and the interval between calls into a slider (with a warning about rate limits). Entering dates and other parameters is made easier in similar ways.
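
Within ObservableHQ, controls like these are typically written as “viewof” cells using the built-in Inputs library. A sketch of the two controls described above (the labels and default values are our own, not copied from the notebook):

```javascript
// Observable notebook cells (not standalone JavaScript): each "viewof" cell
// exposes the selected value to other cells in the notebook.
viewof platforms = Inputs.checkbox(
  ["Facebook", "Instagram", "Reddit"],
  { label: "Platforms to query", value: ["Facebook"] }
)

viewof interval = Inputs.range([1, 60], {
  label: "Seconds between calls (keep high to respect rate limits)",
  step: 1,
  value: 10
})
```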

Examples of input fields

Data can be browsed in tabular form or downloaded as a CSV or JSON.
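
Both can be sketched with Observable’s standard library, assuming a `posts` array produced by earlier cells and `d3` (for `csvFormat`) available in the notebook:

```javascript
// Observable cells: browse results and offer CSV/JSON downloads.
Inputs.table(posts) // interactive, sortable table of the fetched posts

DOM.download(
  () => new Blob([d3.csvFormat(posts)], { type: "text/csv" }),
  "posts.csv",
  "Download CSV"
)

DOM.download(
  () => new Blob([JSON.stringify(posts, null, 2)], { type: "application/json" }),
  "posts.json",
  "Download JSON"
)
```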

Examples of data

Notebooks can be used for many diverse tasks. For instance, we produced a notebook that extracts hashtags from a list of posts and formats the data so that it can be used in Table2Net; another extracts URLs and domain names from texts; and a third is dedicated to expanding shortened URLs.
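
To give a flavour of the transformations involved, here is a plain-JavaScript sketch of hashtag, URL and domain extraction (the regular expressions are simplified for illustration, not taken from the notebooks themselves):

```javascript
// Simplified extraction helpers of the kind used in such notebooks.
const extractHashtags = (text) =>
  (text.match(/#[\p{L}\p{N}_]+/gu) ?? []).map((tag) => tag.toLowerCase());

const extractUrls = (text) => text.match(/https?:\/\/[^\s"'<>]+/g) ?? [];

const extractDomain = (url) => new URL(url).hostname;

extractHashtags("Climate action now #COP26 #climate");
// => ["#cop26", "#climate"]
extractDomain("https://publicdatalab.org/projects/");
// => "publicdatalab.org"
```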

The last example was a case where it was necessary to implement a back-end service: ObservableHQ notebooks run as front-end JavaScript in the browser, so certain operations, such as following the redirect chains of shortened URLs across domains, are tricky or impossible. This is one of their main limitations.
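
For illustration, a minimal sketch of what such a back-end service might look like in Node.js (18+, which ships with fetch); the endpoint shape and port are hypothetical, not a description of the service we actually run:

```javascript
// Hypothetical Node.js back-end for expanding shortened URLs. Browsers
// hide cross-origin redirect chains, so this step has to run server-side.
import http from "node:http";

async function expandUrl(shortUrl) {
  // fetch follows redirects by default; response.url is the final URL
  const response = await fetch(shortUrl, { method: "HEAD", redirect: "follow" });
  return response.url;
}

http
  .createServer(async (req, res) => {
    const target = new URL(req.url, "http://localhost").searchParams.get("url");
    const expanded = await expandUrl(target);
    res.setHeader("Access-Control-Allow-Origin", "*"); // let notebooks call it
    res.end(JSON.stringify({ expanded }));
  })
  .listen(3000); // e.g. GET http://localhost:3000/?url=https://bit.ly/...
```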

However, there are also many advantages. Notebooks are very flexible and easy to transform and adjust: we can start gathering and exploring data and, after a couple of iterations, decide how best to structure it. We can add and remove parts of the interface almost instantly, and we can embed functions (“cells”) from other notebooks, such as an emoji loading bar. The possibility of reusing or modifying an entire notebook, or just part of it, is very useful for building on the work of other researchers and quickly bootstrapping new tools as we need them.
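
Reuse works through Observable’s import syntax, which pulls named cells from any published notebook; the notebook paths below are placeholders rather than real notebooks:

```javascript
// Observable cells: import cells from other published notebooks.
import { progressBar } from "@some-user/emoji-loading-bar"

// A "with" clause swaps a local cell in for one the imported notebook expects:
import { table } with { posts as data } from "@some-user/posts-table"
```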

Notebooks are particularly useful as part of exploratory research approaches where you are iteratively refining research questions and seeing what is possible as you adjust various settings (e.g. the structure of the data, the parameters of the APIs).

An unusual loading bar, which can be imported into your notebook.

So far these notebooks have been used in the context of investigations with journalists on the Infodemic project, as well as in ongoing research and collaborations around DeSmog’s Climate Disinformation Database (including a prize-winning undergraduate thesis on this topic).

As per the working principles of the Public Data Lab, all of these notebooks are open source (MIT license) and you are most welcome to use, transform and adjust them in your own work. If you use them for a project or piece of research, or if you’re also using code notebooks for digital research, we’d love to hear from you!

Here’s a full list of the notebooks mentioned in this post:

“Data critique and platform dependencies: How to study social media data?”, Digital Methods Winter School and Data Sprint 2022

Applications are now open for the Digital Methods Winter School and Data Sprint 2022, which is on the theme of “Data critique and platform dependencies: How to study social media data?”.

This will take place on 10-14 January 2022 at the University of Amsterdam.

More details and registration links are available here and an excerpt on this year’s theme and the format is copied below.

The Digital Methods Initiative (DMI), Amsterdam, is holding its annual Winter School on ‘Social media data critique’. The format is that of a (social media and web) data sprint, with tutorials as well as hands-on work for telling stories with data. There is also a programme of keynote speakers. It is intended for advanced Master’s students, PhD candidates and motivated scholars who would like to work on (and complete) a digital methods project in an intensive workshop setting. For a preview of what the event is like, you can view short video clips from previous editions of the School.

Data critique and platform dependencies: How to study social media data?

Source criticism is the scholarly activity traditionally concerned with provenance and reliability. When considering the state of social media data provision, such criticism would be aimed at what platforms allow researchers to do (such as accessing an API) and not to do (such as scraping). It would also consider whether the data returned from querying is ‘good’, meaning complete or representative. How do social media platforms fare when considering these principles? How to audit or otherwise scrutinise social media platforms’ data supply?

Recently Facebook has come under renewed criticism for its data supply through the publication of its ‘transparency’ report, Widely Viewed Content. It is a list of web URLs and Facebook posts that receive the greatest ‘reach’ on the platform when appearing on users’ News Feeds. Its publication comes on the heels of Facebook’s well-catalogued ‘fake news problem’, first reported in 2016, as well as a well-publicised Twitter feed that lists the most engaged-with posts on Facebook (using CrowdTangle data). In both instances those contributions, together with additional scholarly work, have shown that dubious information and extreme right-wing content are disproportionately interacted with. Facebook’s transparency report, which has been called ‘transparency theater’, purports to demonstrate that this is not the case. How to check the data? For now, “all anybody has is the company’s word for it.”

For Facebook, as well as a variety of other platforms, there are no public archives. Facebook’s data sharing model is one of an industry-academic ‘partnership’. The Social Science One project, launched when Facebook ended access to its Pages API, offers big data — “57 million URLs, more than 1.7 trillion rows, and nearly 40 trillion cell values, describing URLs shared more than 100 times publicly on Facebook (between 1/1/2017 and 2/28/2021).” To obtain the data (if one can handle it) requires writing a research proposal and, if accepted, compliance with Facebook’s ‘onboarding’, a non-negotiable research data agreement. Ultimately, the data is accessed (not downloaded) in a Facebook research environment, “the Facebook Open Research Tool (FORT) … behind a VPN that does not have access to the Internet”. There are also “regular meetings Facebook holds with researchers”. A data access ethnography project, not so unlike the one written about trying to work with Twitter’s archive at the Library of Congress, may be a worthwhile undertaking.

Other projects would evaluate ‘repurposing’ marketing data, as Robert Putnam’s ‘Bowling Alone’ project did, and as is a more general digital methods approach. Comparing multiple marketing data outputs may be of interest, as may crossing those with CrowdTangle’s outputs. Facepager, one of the last pieces of software (after Netvizz and Netlytic) to still have access to Facebook’s Graph API, reports that “access permissions are under heavy reconstruction”. Its usage requires further scrutiny. There is also a difference between the user view and the developer view (and between ethnographic and computational approaches), which is also worth exploring. ‘Interface methods’ may be useful here. These and other considerations for developing social media data criticism are topics of interest for this year’s Winter School theme.

At the Winter School there are the usual social media tool tutorials (and the occasional tool requiem), but also continued attention to thinking through and proposing how to work with social media data. There are also empirical and conceptual projects that participants work on. Projects from the past Summer and Winter Schools include: Detecting Conspiratorial Hermeneutics via Words & Images, Mapping the Dutchophone Fringe on Telegram, Greenwashing, in_authenticity & protest, Searching constructive/authentic posts in media comment sections: NU.nl/The Guardian, Mapping deepfakes with digital methods and visual analytics, “Go back to plebbit”: Mapping the platform antagonism between 4chan and Reddit, Profiling Bolsobots Networks, Infodemic everywhere, Post-Trump Information Ecology, Streams of Conspirational Folklore, and FilterTube: Investigating echo chambers, filter bubbles and polarization on YouTube.

Organisers: Lucia Bainotti, Richard Rogers and Guillen Torres, Media Studies, University of Amsterdam. Application information at https://www.digitalmethods.net.

New special issue of the bilingual journal Diseña on Visual Methods for Online Images

A new special issue of the bilingual journal Diseña has just been released. The issue, edited by Gabriele Colombo and Sabine Niederer, explores the realm of online images as a site for visual research and design.

While methods for visual analysis gain urgency in an image-saturated society, this special issue explores visual ways to study online images. The proposition we make is to stay as close to the material as possible. How to approach the visual with the visual? What type of images may one design to make sense of, reshape, and reanimate online image collections? The special issue also touches upon the role that algorithmic tools, including machine vision, can play in such research efforts. Which kinds of collaborations between humans and machines can we envision to better grasp and critically interrogate the dynamics of today’s digital visual culture?

The articles (available both in English and in Spanish) touch on the diversity of formats and uses of online images, focusing on methods for their collection and visual interpretation. Other themes touched on in this issue are image-machine co-creation processes and their ethics, participatory actions for image production and analysis, and feminist approaches to digital visual work.

Further information about the issue can be found in our introduction. Below is the complete list of contributions (with links) and authors (some from the Public Data Lab).

Editorial: Against Subject Datafication through Anti-Oppressive Data Practices – Renato Bernasconi

Diseña 19 | Visual Methods for Online Images: Collection, Circulation, and Machine Co-Creation – Gabriele Colombo, Sabine Niederer

The Potentials of Google Vision API-based Networks to Study Natively Digital Images – Janna Joceli Omena, Elena Pilipets, Beatrice Gobbo, Jason Chao

Developing Online Images. From Visual Traces to Public Voices – Donato Ricci, Calibro, Duncan Evennou, Benoît Verjat

Google Images, Climate Change, and the Disappearance of Humans – Warren Pearce, Carlo De Gaetano

Data-Driven Curated Video Catalogs: Republishing Video Footage – Gabriele Colombo, Federica Bardelli

Creating AI Art Responsibly: A Field Guide for Artists – Claire R. Leibowicz, Emily Saltz, Lia Coleman

Feminist Data Practices: Conversations with Catherine D’Ignazio, Lauren Klein, and Maya Livio – Catherine D’Ignazio, Lauren Klein, Maya Livio, Sabine Niederer, Gabriele Colombo

Decolonizing the Imagination in Times of Crisis. Gestures for Speculative Thinking-Feeling: Interview with Martin Savransky – Martin Savransky, Martín Tironi



Investigating infodemic – researchers, students and journalists work together to explore the online circulation of COVID-19 misinformation and conspiracies

Over the past year researchers and students at institutions associated with the Public Data Lab have contributed to a series of collaborative digital investigations into the online circulation of COVID-19 misinformation and conspiracies.

Researchers and students contributed to a series of “engaged research led teaching” projects developed with journalists, media organisations and non-governmental organisations around the world.

These were undertaken in association with the Arts and Humanities Research Council funded project Infodemic: Combatting COVID-19 Conspiracy Theories, which explores how digital methods grounded in social and cultural research may facilitate understanding of what the WHO has described as an “infodemic” of misleading, fabricated, conspiratorial and other problematic material related to the COVID-19 pandemic.

These projects led to and contributed to a number of stories, investigations and publications.


The new Public Data Lab logo

With the launch of the new PDL website, we thought it was the perfect opportunity to freshen up the identity of the laboratory, highlighting its craft and conveying the spirit behind it. We designed the new identity to be flexible and approachable, while maintaining a simple coherence that is reflected in the new colored version of the Public Data Lab logo.

A flexible approach

The core principle of the website’s new aesthetic is to highlight the uniqueness of each project and endeavour, and their call to gather different disciplines, approaches and people to explore specific areas of research.

Each project can be represented with a specific color that belongs to a warm and rich color palette inspired by historical Japanese and Western prints. These colors come together in the new logo, which showcases three of them alongside the new “mango yellow” that ties all the colors in the palette together.

All the projects come together in the network diagram that showcases the interconnections between projects, people and affiliations that make up the Public Data Lab. In the network, each project retains its color, making it recognizable even from this bird’s-eye view.

The same approach is also applied outside of the Public Data Lab website, to mini-websites related to other activities, like “Infodemic”, a blog that gathers research insights about the current misinformation “infodemic”. The GitHub template can be used in any PDL project that requires a small website to document collateral research. You can find documentation and a guide on how to use this theme directly on GitHub.

The new logo

The new Public Data Lab logo builds on the laboratory’s original logo, which is already used in many projects. It is consistent in sizing and spacing, but introduces a new font family designed to be clear and legible: the Inter font superfamily.

Inter is an open source font “specially designed for user interfaces with focus on high legibility of small-to-medium sized text on computer screens” and is currently an active project on GitHub.

The new Public Data Lab logotype incorporates the new palette to highlight the 3D shape of the cube.

New edition of Data Journalism Handbook now open access with Amsterdam University Press

This blog is cross-posted from lilianabounegru.org. Further details can be found in this thread.

Today The Data Journalism Handbook: Towards a Critical Data Practice (which I co-edited with Jonathan Gray) is published by Amsterdam University Press. It is published as part of a new book series on Digital Studies, which is also being launched today. You can find the book, including an open access version, here: http://bit.ly/data-journalism-handbook-2

The book provides a wide-ranging collection of perspectives on how data journalism is done around the world. It is published a decade after the first edition (available in 14 languages) began life as a collaborative draft at the Mozilla Festival 2011 in London.

Book sprint at MozFest 2011 for first edition of Data Journalism Handbook.

The new edition, with 54 chapters from 74 leading researchers and practitioners of data journalism, gives a “behind the scenes” look at the social lives of datasets, data infrastructures, and data stories in newsrooms, media organizations, startups, civil society organizations and beyond.

The book includes chapters by leading researchers around the world and from practitioners at organisations including Al Jazeera, BBC, BuzzFeed News, Der Spiegel, eldiario.es, The Engine Room, Global Witness, Google News Lab, Guardian, the International Consortium of Investigative Journalists (ICIJ), La Nacion, NOS, OjoPúblico, Rappler, United Nations Development Programme and the Washington Post.

An online preview of various chapters from the book was launched in collaboration with the European Journalism Centre and the Google News Initiative and can be found here.

The book draws on over a decade of professional and academic experience engaging with the field of data journalism, including through my role as Data Journalism Programme Lead at the European Journalism Centre; my research on data journalism with the Digital Methods Initiative; my PhD research on “news devices” at the universities of Groningen and Ghent; and my research, teaching and collaborations around data journalism at the Department of Digital Humanities at King’s College London.

Further background about the book can be found in our introduction. Following is the full table of contents and some quotes about the book. We’ll be organising various activities around the book in coming months, which you can follow with the #ddjbook hashtag on Twitter.

If you adopt the book for a class, we’d love to hear from you so we can keep track of how it is being used (and also update this list of data journalism courses and programmes around the world) and inform future activities in this area. Hope you enjoy it!


New website and blog for the Public Data Lab

Welcome to the revamped website and new blog for the Public Data Lab, courtesy of Andrea Benedetti at DensityDesign Lab in Milan. ✨

What’s new?

  • 🛠 In order to enable more people to post more easily about various projects and activities, we’re now using WordPress as the backend for the site (along with static site templates and materials for use by different lab projects).
  • 👩🏻‍💻 We have added a people page so we can highlight a much wider group of people, groups and collaborators who we work with at the Public Data Lab.
  • 🌱 We’ve added an updated projects page which includes more of what we’ve been up to than the previous site did, along with a little updating network diagram to show who has been working on what and the different clusters of our activities 🙂
  • 📝 We’ll be using the blog to post short notes and updates on our various projects and activities across the Public Data Lab and its associated research centres, communities and institutions.
  • 🏮 We have lightly revised our mission statement to better reflect what we do (in light of activities over the past few years).

As always you can follow our activities on Twitter at @PublicDataLab and also get in touch if you’re interested in contributing to or collaborating with the lab.

If you spot anything that should be added or amended on the new website, please let us know or open an issue on GitHub.