
The Impact of Irregular Training Data on Artificial Intelligence Models


The field of artificial intelligence has seen significant advancements in recent years, with generative systems capable of creating images and text that mimic human creativity. However, there is a growing concern regarding the source and legality of the training data used for these models. Companies such as OpenAI, Meta, and Stability AI have faced legal battles due to their reliance on copyrighted and pirated works in training their machine learning models. This raises questions about the ethics and potential biases embedded in these models.

One of the main challenges in training artificial intelligence models is the acquisition of large datasets. Development teams strive to collect vast amounts of data from the internet, but this process raises concerns about copyright infringement and privacy violations. The lack of corporate transparency further complicates the identification of the exact sources of training data. Nevertheless, experts in the field have provided some general insights into the origins of training data for artificial intelligence.

Where does the training data for artificial intelligence come from?

To build large generative models, developers heavily rely on publicly available data from the internet. However, there is no single repository where one can download the entire internet. Instead, developers use automated tools called web crawlers and web scrapers to collect and extract data from various online sources. Web crawlers travel from link to link, indexing information in a database, while web scrapers download and extract specific information.
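The crawler/scraper distinction above can be illustrated with a toy example: a crawler collects the outgoing links of a page (to decide where to go next), while a scraper extracts specific content from it. This is a minimal sketch using Python's standard library, not any company's actual pipeline; the page content and URLs are invented.

```python
from html.parser import HTMLParser

class LinkIndexer(HTMLParser):
    """Toy 'crawler' step: collect the outgoing links of one page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Every <a href="..."> is a candidate for the crawl frontier.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Invented page a crawler might encounter.
PAGE = """
<html><body>
  <a href="https://example.org/about">About</a>
  <p>Some article text a scraper would extract and store.</p>
  <a href="https://example.org/blog">Blog</a>
</body></html>
"""

indexer = LinkIndexer()
indexer.feed(PAGE)
print(indexer.links)  # the links a crawler would follow next
```

A real crawler repeats this step over a queue of URLs, fetching each page over HTTP and deduplicating links; a scraper instead targets specific elements (article bodies, image URLs, captions) and writes them to a dataset.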

Companies with significant resources, like Google, have their own web crawling tools for this purpose. Other companies turn to existing resources such as Common Crawl, an open repository of web crawl data, or LAION, whose datasets contain links to images and their descriptions. These resources allow developers to access a wide range of data, including blogs, personal websites, company pages, online marketplaces, voter registration databases, government websites, Wikipedia, Reddit, research repositories, media outlets, and academic institutions. Additionally, there are compilations of pirated content and web archives that may contain data no longer available on the original websites.
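Common Crawl exposes its archives through a public CDX index API, which lets anyone look up which captures exist for a given domain before downloading anything. The sketch below only constructs the query URL; the crawl identifier shown is just an example of the monthly crawl naming scheme, not a recommendation of a specific snapshot.

```python
from urllib.parse import urlencode

def commoncrawl_index_query(domain, crawl_id="CC-MAIN-2023-50"):
    """Build a query URL for Common Crawl's CDX index API.

    crawl_id names one monthly crawl; the value here is only an
    example of the naming scheme. Fetching the URL (not done here)
    returns one JSON record per captured page under the domain.
    """
    base = f"https://index.commoncrawl.org/{crawl_id}-index"
    params = urlencode({"url": f"{domain}/*", "output": "json"})
    return f"{base}?{params}"

url = commoncrawl_index_query("example.com")
print(url)
```

This is how downstream dataset builders typically locate raw page captures at scale; the actual page bodies are then retrieved separately from Common Crawl's WARC archives.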

While web crawlers and scrapers cannot access data behind login pages, they can retrieve anything visible in search results or on websites that do not require authentication. This includes publicly visible pages on platforms such as LinkedIn. Some crawlers can even bypass paywalls, posing as paid subscribers to access content. News websites behind paywalls have been common sources of data for training AI models, according to a joint analysis by The Washington Post and the Allen Institute.
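Whether a compliant crawler touches a given URL at all is conventionally governed by the site's robots.txt file, which can close off sections such as members-only areas while leaving public pages open. Here is a minimal sketch using Python's standard library; the rules and URLs are invented, and nothing stops a non-compliant scraper from ignoring the file entirely.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the site closes its members-only area
# to all crawlers but leaves the rest of the site open.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

public_ok = rp.can_fetch("*", "https://example.com/blog/post")
members_ok = rp.can_fetch("*", "https://example.com/members/data")
print(public_ok)   # True: public page, crawling permitted
print(members_ok)  # False: disallowed members-only path
```

Note that robots.txt is a voluntary convention, which is part of why the line between "publicly accessible" and "fair to collect" remains contested.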

Furthermore, web scrapers have been known to collect surprising amounts of personal data of unknown origin. Medical images and private diagnostic records have been found in AI training datasets, raising concerns about privacy and consent. Unfortunately, the exact pathways of how these images end up in training datasets remain unclear. The involvement of third-party data sources and the lack of robust privacy settings contribute to such data leaks.

In addition to web scraping, companies may deliberately include other sources, including their own internal data, in the model training process. OpenAI, for example, customizes its models based on user interactions with its chatbots. Meta has acknowledged training its latest AI models partially on public posts from Facebook and Instagram. This trend of incorporating user-generated content into training data is expected to continue, as companies like Twitter plan to leverage their users' content. Amazon also intends to use voice data from customer interactions with Alexa to train its models.

The need for transparency and ethical considerations

Despite the importance of training data, many companies have become increasingly secretive about the details of their datasets. The release of new versions or models does not always come with comprehensive information regarding the sources of data. This lack of transparency raises concerns about potential biases and the overall reliability of these AI systems.

One of the significant challenges associated with irregular training data is the potential replication of sensitive personal information and copyrighted material by AI models. The repetition of such content not only poses legal implications but also raises ethical concerns. Additionally, the lack of understanding regarding the origins of training data makes it difficult to identify and rectify biases embedded in these models.

Frequently Asked Questions:

Q: What are web crawlers and web scrapers?
A: Web crawlers are automated tools that navigate the internet, indexing information in a database, while web scrapers extract specific information from websites.

Q: Can web scrapers access data behind login pages?
A: Generally, web scrapers cannot access data behind login pages; they can, however, retrieve data visible in search results or on websites that do not require authentication.

Q: Are there legal concerns associated with the use of irregular training data?
A: Yes, the use of copyrighted and pirated works as training data raises legal implications and potential copyright infringement.

Q: What are some potential sources of training data for AI models?
A: Sources of training data can include public websites, social media posts, online databases, government websites, research repositories, media outlets, and academic institutions.
