Artificial Intelligence Faces Data Shortage Challenges

As artificial intelligence (AI) continues to gain popularity, researchers have raised concerns about a potential shortage of training data, the raw material that capable AI systems are built on. This scarcity could hinder the growth of AI models, particularly large language models, and significantly affect the course of the AI revolution. But why would data run short when there is so much information on the web? And is there a way to address the risk?
The Importance of High-Quality Data for AI
High-quality data is crucial for training powerful and accurate AI algorithms. For example, the algorithm that powers ChatGPT was initially trained on a dataset of 570 gigabytes of text, equivalent to roughly 300 billion words. Similarly, Stable Diffusion, the algorithm behind many AI image generators, was trained on the LAION-5B dataset of 5.8 billion image-text pairs. Training on too little data produces inaccurate or low-quality outputs.
The quality of training data matters just as much as the quantity. Data such as social media posts or blurry photographs is easy to come by, but it is poorly suited to training high-performing AI models. Text scraped from social media platforms can contain bias, prejudice, disinformation or even illegal content, all of which an AI system may then reproduce. When Microsoft trained its AI bot on Twitter content, for instance, it learned to produce racist and misogynistic outputs. To ensure better quality, AI developers prefer content from books, online articles, scientific papers, Wikipedia, and curated web sources. The Google Assistant, for example, was trained on 11,000 romance novels taken from the self-publishing site Smashwords to improve its conversational abilities.
Do We Have Enough Data?
The AI industry has been training its systems on ever-larger datasets, but recent research indicates that the stock of online data is growing more slowly than the datasets used to train AI. According to a paper published last year, if current AI training trends continue, high-quality text data could be exhausted before 2026. Low-quality language data may run out somewhere between 2030 and 2050, and low-quality image data between 2030 and 2060.
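To make the logic behind such projections concrete, here is a minimal sketch of how one might estimate the year in which demand for training data overtakes the remaining stock. All of the numbers and growth rates in it are illustrative assumptions of mine, not figures from the cited paper.

```python
# Illustrative "demand outgrows supply" model for training data.
# The stock size, demand size and growth rates are made-up assumptions
# for demonstration; they are NOT the figures from the research the
# article refers to.

def projected_exhaustion_year(stock, stock_growth,
                              demand, demand_growth,
                              start_year=2023, horizon=40):
    """Return the first year in which annual training demand exceeds
    the remaining stock of unused high-quality data, or None."""
    remaining = stock
    for year in range(start_year, start_year + horizon):
        if demand > remaining:
            return year
        remaining = remaining * (1 + stock_growth) - demand
        demand *= 1 + demand_growth
    return None


# Hypothetical inputs: 5e14 words of usable stock growing 7% a year,
# 5e11 words consumed by training in the first year, demand doubling
# annually.
print(projected_exhaustion_year(5e14, 0.07, 5e11, 1.0))
```

The research itself models data stocks and dataset growth far more carefully; the point of the sketch is only that once demand grows faster than the stock of fresh data, a crossover year becomes unavoidable.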
Given that AI has the potential to contribute up to $15.7 trillion to the global economy by 2030, as estimated by PwC, the prospect of running out of usable data could impede its development.
Addressing the Risk
While these concerns might worry some AI enthusiasts, there are ways to mitigate the risk of a data shortage. One is for AI developers to improve their algorithms so that they get more out of the data that already exists. In the coming years, it is likely that high-performing AI systems will be trainable on less data, and possibly with less computational power, which would also reduce AI's environmental footprint.
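The article does not name specific techniques, but one common way to squeeze more out of a fixed dataset is data augmentation: creating additional training examples by transforming the ones already collected. The NumPy sketch below is a minimal illustration of that idea for images; the function name and the particular transformations are my own choices, not anything prescribed by the article.

```python
import numpy as np

def augment(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return a randomly transformed copy of an (H, W, C) image array.

    Flips and small crops keep the label unchanged, so every stored
    image can be shown to a model in many slightly different forms,
    stretching a limited dataset further.
    """
    out = image
    if rng.random() < 0.5:                    # random horizontal flip
        out = out[:, ::-1, :]
    h, w, _ = out.shape
    dy = int(rng.integers(0, h // 10 + 1))    # crop away up to ~10% per side
    dx = int(rng.integers(0, w // 10 + 1))
    return out[dy:h - dy, dx:w - dx, :]


rng = np.random.default_rng(0)
fake_image = rng.random((64, 64, 3))          # stand-in for a real photo
print(augment(fake_image, rng).shape)         # cropped shape varies per draw
```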
Another option is to use AI itself to generate synthetic data. In other words, developers can generate the data they need, tailored to a particular AI model. Several projects already use synthetic content obtained from data-generating services such as Mostly AI, and this practice is expected to become more common.
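As a toy illustration of the idea (not a depiction of how Mostly AI or any other service actually works), the sketch below fits simple per-column statistics to a handful of "real" records and then samples new, artificial rows from them. Real synthetic-data tools rely on far more sophisticated generative models, but the principle of supplementing scarce real records with generated ones is the same.

```python
import random
import statistics

# A tiny stand-in for a real dataset: (age, income) records.
real_rows = [(34, 52_000), (29, 48_500), (45, 91_000), (38, 67_250), (52, 103_400)]

def fit_column_stats(rows):
    """Estimate mean and standard deviation for every numeric column."""
    columns = list(zip(*rows))
    return [(statistics.mean(c), statistics.stdev(c)) for c in columns]

def sample_synthetic(stats, n, seed=0):
    """Draw n artificial rows from independent normal distributions.

    This ignores correlations between columns; it is only meant to show
    the concept of generating extra records rather than collecting them.
    """
    rng = random.Random(seed)
    return [tuple(round(rng.gauss(mu, sigma), 1) for mu, sigma in stats)
            for _ in range(n)]

synthetic_rows = sample_synthetic(fit_column_stats(real_rows), n=3)
print(synthetic_rows)
```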
Furthermore, developers are exploring content sources beyond the free online space, such as partnering with large publishers or leveraging offline repositories. The wealth of pre-internet texts, once digitized, could serve as a valuable source of data for AI projects. News Corp, one of the prominent news content owners currently negotiating content deals with AI developers, is an example of this trend. Requiring AI companies to pay for training data could address the issue of unauthorized data usage, benefit content creators, and restore a balance of power between creatives and AI enterprises.
This article was republished from The Conversation under a Creative Commons license. Read the original article.
Frequently asked questions (FAQ):
1. Why is high-quality data important for artificial intelligence?
High-quality data is essential for training powerful and accurate AI algorithms. Training an algorithm on insufficient data produces inaccurate or low-quality results.
2. Do we have enough data for artificial intelligence?
Although the AI industry trains its systems on ever-larger datasets, research shows that the growth of online data available for training is not keeping pace with the growth of AI systems. According to one study, if current trends continue, we will probably run out of high-quality text data before 2026. Low-quality language data is expected to be exhausted between 2030 and 2050, while low-quality image data could run out between 2030 and 2060.
3. How can the data shortage be addressed?
One approach is to improve AI algorithms so that existing data is used more efficiently. Synthetic data generated with AI can also be used. It is also important to look for content beyond the free online space, for example through partnerships with large publishers or the use of offline archives. Such measures can keep data use under the control of its authors and ensure fair compensation for their work.