In addition to the trove of books, the Institutional Data Initiative is working with the Boston Public Library to scan millions of newspaper articles that are now in the public domain, and says it is open to forming similar collaborations in the future. Exactly how the books data set will be released has not been determined. The Institutional Data Initiative has asked Google to work together on public distribution, and the company has pledged its support.
Regardless of how the IDI data set is released, it will join a number of similar projects, startups, and initiatives that promise to give companies access to substantial, high-quality AI training materials without the risk of running into copyright problems. Companies like Calliope Networks and ProRata have emerged to issue licenses and design compensation schemes meant to get creators and rights holders paid for providing AI training data.
There are also other new public domain projects. Last spring, French AI startup Pleias launched its own public domain data set, Common Corpus, containing approximately 3 to 4 million books and periodical collections, according to project coordinator Pierre-Carl Langlais. Backed by the French Ministry of Culture, Common Corpus has been downloaded more than 60,000 times this month alone on the open source AI platform Hugging Face. Last week, Pleias announced that it will release its first set of large language models trained on this data set, which Langlais says constitute the first models “trained exclusively on open data and complying with the [EU] AI Act.”
Efforts are also underway to create similar image data sets. AI startup Spawning released its own this summer, called Source.Plus, which contains public domain images from Wikimedia Commons as well as a variety of museums and archives. Several important cultural institutions, such as the Metropolitan Museum of Art, have long made their own archives accessible to the public as independent projects.
Ed Newton-Rex, a former Stability AI executive who now runs a nonprofit that certifies ethically trained AI tools, says the rise of these data sets shows there’s no need to steal copyrighted materials to build quality, high-performance AI models. OpenAI previously told UK lawmakers it would be “impossible” to create products like ChatGPT without using copyrighted works. “Large public domain data sets like these further break down the ‘necessity defense’ that some AI companies use to justify mining copyrighted works to train their models,” says Newton-Rex.
But he still has reservations about whether IDI and similar projects will really change the training status quo. “These data sets will only have a positive impact if they are used, probably in conjunction with the licensing of other data, to replace copyrighted work. If they are simply added to the mix, as part of a data set that also includes the unlicensed life’s work of the world’s creators, they will overwhelmingly benefit AI companies,” he says.