Rare Book Monthly

Articles - July - 2025 Issue

Vast Amounts of New Data from Books Being Made Available to AI Chatbot Programs like ChatGPT

A large new source of information has been opened to AI (artificial intelligence) chatbot programs like ChatGPT or Meta's Llama. These are the online programs that answer just about any question you ask in seconds. They run on a type of software known as a “large language model,” which takes in vast amounts of data, uses it to learn patterns of speech so it can make sense of that information, and then pulls out what it needs to answer your question. It is utterly amazing what they do, but they can't do it all by themselves. They know nothing but what they are fed, and if they are to draw on vast amounts of information, that information must come from somewhere.

Much of it comes from the internet, which means they must be smart enough to separate the wheat from the chaff, and “chaff” is an overly polite word for a lot of what is out there. In other words, they also need more reputable sources of information, and books and other publications are an important one. However, many (but not all) authors and publishers are not pleased with their work being used without payment. Authors, deservedly, get royalties when their books are sold, but not when their work is copied and used by AI. Some have sued to stop the practice, citing copyright law, as these works are copyrighted.

All of this is in the courts, and how it will be resolved is as yet unknown. However, a new source has recently emerged: books in libraries. Harvard University announced that it is making its vast dataset of books from its library available to AI models at no cost. Most of the dataset was created almost two decades ago as part of the Google Books project, in which Google scanned and digitized millions of books at various libraries. Harvard compiled this and more as part of its Institutional Data Initiative at the Harvard Law School Library. Harvard has files for 386 million pages from almost one million books. It is now making them available for services like ChatGPT to learn from and to find answers to your questions.

This will be helpful, particularly for understanding historic material, but there is one major drawback. These books are safe to use without risk of being sued only because they are out of copyright. In the United States, copyright on older published works runs 95 years. Therefore, none of these books is less than 95 years old. This will not be much good for providing medical advice, even if it sometimes feels like this must be where RFK Jr. gets his medical recommendations. You want the latest opinions for medical diagnoses, and the same goes for other scientific knowledge. Good luck fixing your computer or car with advice that predates 1930, unless you have a Model T. Of course, these programs already have a lot of later information in place (some of which they are being sued to remove). It just means that these 386 million new pages won't add much to the answers you seek for these sorts of questions.

It should be noted that some of the information Harvard is providing is more recent, since it is not subject to copyright. One example is case law. Court opinions are available for anyone to read, as they must be for legal experts to understand the law. This recent case law is being provided to the AI models that want to add it.

Update: A few days ago, the first court decision came down in a case of authors suing a chatbot developer for copyright infringement. The authors lost.
