Data: Who Owns It and Why It Matters in a World of AI
Imagine your daily interactions with the digital world. Each time you swipe your credit card, click on a news article, take a health walk with your tracker, or order your next meal delivery, you’re contributing to a vast and intricate digital tapestry. This tapestry, woven from countless interactions like yours, represents the data framework of our society.
Data is so interwoven with daily life that we rarely notice how much of it we generate every second. Yet understanding how this data is used, who controls it, and what it is applied to is critical in a world increasingly driven by AI.
Where is the World’s Data?
Public Data
Open-source or public data, made accessible by entities like government agencies and research institutions, is free for use, sharing, and repurposing. Typically devoid of personal data, these datasets span various topics from demographics to health, serving as key resources for fields like data science and machine learning. However, despite its value, open-source data comprises only a small fraction of the global data landscape.
Private Data
Private data comprises the lion's share of global data. It represents the individual interactions you have with digital platforms. These could be your search patterns on Google, photos shared on Instagram, or purchases made on Amazon.
When we discuss private data, we also need to consider proprietary data. Proprietary data refers to commoditized information owned and safeguarded by corporations, businesses, and individuals. This data includes elements like consumer patterns, supply chain analytics, and unique algorithms.
Controversy arises when companies treat personal data as proprietary, often utilizing, selling, or buying user data without clear consent, which we will discuss further below.
Understanding the Influence of AI
Artificial Intelligence (AI), a major consumer of data, learns and evolves based on the information it accesses. But like any learner, its comprehension of the world depends on the quality of its teachers, in this case, the datasets. If the data it learns from is biased, incomplete, or unrepresentative, AI’s understanding will be flawed, influencing how useful it is to society.
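One rough way to see whether a dataset is unrepresentative is to compare how often each group appears in it against the share you would expect. The sketch below is only an illustration: the function name, the toy labels, and the 50/50 reference shares are all assumptions made up for this example, not figures from any real dataset.

```python
from collections import Counter


def representation_gap(labels, reference_shares):
    """Return observed share minus expected share for each group.

    A positive value means the group is over-represented in the data,
    a negative value means it is under-represented.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    return {
        group: counts.get(group, 0) / total - expected
        for group, expected in reference_shares.items()
    }


# Hypothetical toy data: labels attached to ten training images of "CEO".
ceo_labels = ["male"] * 9 + ["female"] * 1

# Hypothetical reference shares, chosen purely for illustration.
reference = {"male": 0.5, "female": 0.5}

gaps = representation_gap(ceo_labels, reference)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.2f}")
```

A model trained on data with gaps like these has little chance of learning anything other than the skew it was shown.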
Tools like ChatGPT and Stable Diffusion have attracted recent attention for their seemingly advanced capabilities, but their potential is constrained by the public data used to train them. Proprietary data can supplement some of this, yet the resulting datasets still represent a narrow slice of the global data landscape, and of reality.
A recent study on AI image models reveals what happens when AI knows only public data: it tends to depict domestic workers as women of color and CEOs as white men. Don’t even try to generate an image of a “young girl” without an NSFW filter in place.
The Value of Private Data
Access to private data could be transformative for the advancement of AI. It represents a reservoir of information that could sharpen AI’s understanding of the world, making it more grounded in reality.
However, accessing private data is increasingly difficult. Buying and selling it raises both ethical and practical problems. Data farming, the practice of harvesting data through surveys and other collection methods, typically yields poor-quality data because of skewed demographics and fraud.
The other route, sourcing personal data from the companies and data brokers that purport to own it, has attracted the ire of regulators and sparked serious public debate about user privacy. In the post-Facebook scandal era, both regulators and the general public are more sensitive to how data is handled and misused.
For the ambitious data scientist or machine learning engineer, good-quality data is increasingly expensive and difficult to find.
Participatory AI
A potential solution to this impasse lies in a model of participatory AI, where individuals are active contributors and owners in the creation of AI systems. This model starts with data ownership. As the creator of my data, I own it and have the right to determine its sale or use.
With ownership as the foundation, individuals would be empowered to contribute their data to specific AI projects aligned with their interests and values. In return, contributors could receive a share of the profits generated by the AI, aligning economic incentives as well.
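To make the profit-sharing idea concrete, here is a minimal sketch of one possible payout rule: splitting a project's revenue in proportion to how many records each person contributed. This rule is an assumption for illustration only. A real scheme might weight by data quality or usage instead, and the names and numbers below are hypothetical.

```python
def split_revenue(contributions, revenue_cents):
    """Split revenue (in cents) in proportion to records contributed.

    `contributions` maps a contributor name to a record count.
    Pro-rata by record count is an illustrative assumption.
    """
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("no contributions to split against")

    # Integer arithmetic avoids fractional cents.
    payouts = {
        name: revenue_cents * records // total
        for name, records in contributions.items()
    }

    # Rounding down can leave a few cents over; assign them to the
    # largest contributor so the payouts always sum to the revenue.
    remainder = revenue_cents - sum(payouts.values())
    largest = max(contributions, key=contributions.get)
    payouts[largest] += remainder
    return payouts


# Hypothetical contributors and a hypothetical $100.00 of revenue.
contributions = {"alice": 600, "bob": 300, "carol": 100}
payouts = split_revenue(contributions, 10_000)
for name, cents in payouts.items():
    print(f"{name}: ${cents / 100:.2f}")
```

Whatever the exact rule, the point is that the payout follows from a contribution the individual chose to make, rather than from data taken without their knowledge.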
The change to participatory AI represents a significant shift in power dynamics. It’s more than profit-sharing; it transforms individuals from mere data cows, ready to be farmed and harvested, to informed participants in the trajectory of AI. This model would allow individuals to actively choose how their data is used and how AI models are purposed, also making them active partners in AI advancement.
Building Better AI Together
Bringing people closer to AI is a necessary step towards building better technology. Creating AI systems that are built by collective intelligence means that they work better in the real world. By aligning individual incentives with technological advancement, participatory AI could be as transformative for the technologist as it is for the individual.
The impasse between individual data rights and the advancement of technology will shape not only the future of AI but also the very fabric of our society. Shifting to a new paradigm in which individuals are active participants in the creation of technology will not be easy. Yet, it’s an essential transition. AI’s social and economic potential is too valuable to remain constrained by a data governance problem.