Rare Book Monthly

Articles - July - 2025 Issue

Vast Amounts of New Data from Books Being Made Available to AI Chatbot Programs like ChatGPT

A large source of additional information has been opened for AI (artificial intelligence) chatbot programs such as ChatGPT or Meta's Llama. These are the online programs that answer just about any question you ask in seconds. They run on a type of software known as a “Large Language Model,” which takes in vast amounts of data, uses it to familiarize itself with manners of speech so that it can make sense of this vast store of information, and then pulls out what it needs to answer your question. It is utterly amazing what they do, but they can't do it all by themselves. They know nothing but what they are fed, and if they are to respond from the knowledge of vast amounts of information, that information must come from somewhere.

Much of it comes from the internet, which means they must be smart enough to separate the wheat from the chaff, and “chaff” is an overly polite word for a lot of what is out there. In other words, they also need more reputable sources of information, and books and other publications are an important one. However, many (but not all) authors and publishers are not pleased that their work is being used without payment. Authors, deservedly, get royalties for their work in books, but not when that work is copied and used by AI. They have sued to stop this practice, citing copyright law, as these works are copyrighted.

All of this is in the courts, and how it will be resolved is as yet unknown. However, a new source has recently emerged: books in libraries. Harvard University announced that it is making its vast dataset of books from its library available to AI models at no cost. Most of this was created almost two decades ago as part of the Google Books project, in which Google scanned and digitized millions of books at various libraries. Harvard compiled this and more as part of its Institutional Data Initiative at the Harvard Law School Library. Harvard has files for 386 million pages from almost one million books. It is now making them available for services like ChatGPT to learn from and find answers to your questions.

This will be helpful, particularly for understanding historic material, but there is one very major drawback. It is safe to use these books without risk of being sued because they are out of copyright. In the U.S., copyright on books of this era runs for 95 years from publication. Therefore, none of these books is less than 95 years old. This will not be much good for providing medical advice, even if it sometimes feels like this must be where RFK Jr. gets his medical recommendations. You want the latest opinions for medical diagnoses and the same for other scientific knowledge. Good luck fixing your computer or car with advice that predates 1930, unless you have a Model T. Of course, these programs already have a lot of later information in place (some of which they are being sued to remove). It just means that these 386 million new pages won't add much to the answers you seek for these sorts of questions.

It should be noted that some of the information Harvard is providing is more recent, since it is not subject to copyright. One example is case law. Court opinions are available for anyone to read – they need to be for legal experts to understand the law. This recent case law is being provided to the AI models that want to add it.

Update: A few days ago, the first court decision came down in a case of authors suing a chatbot company for copyright violation. The authors lost.


Posted On: 2025-07-09 14:41
User Name: hjrobin

No links in this discussion to the actual data. How un-bibliographic!

