A recent New York Times report examines the methods OpenAI used to train its GPT-4 language model, including the controversial use of YouTube videos.
OpenAI's use of YouTube videos
According to the New York Times, OpenAI transcribed more than a million hours of YouTube videos to develop GPT-4. This practice, the newspaper asserts, could violate YouTube's rules governing how third parties may use content from the platform.
Google's reactions to OpenAI's actions
According to the same report, some Google employees (Google owns YouTube) were aware of OpenAI's activities but did not intervene. This suggests either tolerance or a conflict of interest: Google also reportedly uses YouTube videos to train its own AI models, though the company says it does so only with creators' consent.
Possible legal and ethical consequences
OpenAI's data-collection methods, including the use of its Whisper speech-recognition tool to transcribe videos, raise questions about respect for copyright and fair-use principles.
The practice, while widespread among tech giants, could lead to legal complications and a broader debate over the ethics of using such data to train AI.
Towards a shortage of data for AI?
The report also highlights the possibility of a future shortage of usable training data for AI, pushing companies toward alternatives that may prove limiting.