
Publishers vs Big Tech 🟢

Why HarperCollins is paying authors $2,500, and how the NYT-OpenAI battle exposes the ever-growing tension between publishers and AI companies.


NEW LAUNCHES

The latest features, products & partnerships in AI

GOVERNMENT

Law, regulation, defense & politics

AI MODELS

Deployment, research, training & infrastructure

IMPLEMENTATION

Announcements, strategies & case studies

OTHER INTERESTING STORIES

Compelling stories beyond the usual categories

What’s happening in AI right now

The scramble for training data

As training data becomes the new oil powering AI development (Meta even floated acquiring Simon & Schuster for its catalog), traditional publishers are moving quickly to monetize their archives.

The publisher's gambit

HarperCollins made waves this week by offering authors $2,500 for three-year licenses to use their nonfiction books in AI training - a move that could set industry standards but has raised eyebrows over valuation. For context, this represents one of the first structured licensing programs from a major publisher, though the modest compensation - averaging less than $1,000 per year - has sparked debate about fair rates for AI training content.

Meanwhile, the Copyright Clearance Center has launched the industry's first collective licensing solution for AI content use through a modified annual blanket license. This framework operates globally and aims to standardize how businesses can legally access training data while ensuring creator compensation.

Hollywood's hidden data mine

Behind the scenes, tech giants including Apple, Anthropic, Meta and Nvidia have been training on an enormous dataset of subtitles from over 53,000 movies and 85,000 TV episodes - largely without explicit authorization. This widespread use of entertainment industry content helps explain the growing pushback from creative professionals concerned about their intellectual property being used without compensation. Runway struck a pioneering deal with Lionsgate to train AI video models on the studio's catalog - though this comes as California enacts new restrictions on AI replicas in film and while Runway faces litigation over unauthorized training data use.

The transparency battle

OpenAI's legal clash with The New York Times has exposed the challenges of auditing AI training data. The Times estimates needing $800,000 worth of API credits to properly inspect OpenAI's models - far above OpenAI's proposed $15,000 cap. This dispute highlights the broader tension between AI companies' desire to protect proprietary technology and growing demands for transparency.

The data wall ahead

The scramble for training data comes amid signs that AI progress may be hitting limitations. Researchers project that high-quality public text data suitable for AI training could be fully utilized between 2026-2032. OpenAI's latest Orion model showed only marginal gains over GPT-4, suggesting diminishing returns from traditional training approaches.

New defensive tools emerge

Content creators are developing innovative protection methods. A new "copyright trap" tool lets creators embed unique gibberish sentences into their text that serve as digital watermarks. These traps are designed to be easily detectable in AI outputs, providing evidence of unauthorized training use. While the tool works best with smaller AI models that tend to memorize training data, it represents a creative approach to a complex challenge. Researchers have made the implementation code publicly available on GitHub, though they acknowledge limitations including the potential impact on text readability.
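The trap idea can be sketched in a few lines. This is a minimal illustration, not the researchers' published implementation: the word list, function names, and naive substring detection are all assumptions made for the example. The core mechanic is the same, though: generate a unique, highly improbable sentence, embed it in the text, and later check whether a model reproduces it verbatim.

```python
import random
import secrets

# Illustrative vocabulary for building improbable "trap" sentences.
# A real tool would use sequences tuned to be memorable to models
# yet vanishingly unlikely to occur by chance.
WORDS = ["quartz", "meridian", "umber", "lattice", "sonder",
         "verdant", "obelisk", "tessellate", "murmur", "cobalt"]

def make_trap(num_words: int = 8, seed: int = None) -> str:
    """Generate a unique gibberish sentence to serve as a watermark."""
    rng = random.Random(seed if seed is not None else secrets.randbits(64))
    return " ".join(rng.choice(WORDS) for _ in range(num_words)) + "."

def embed_trap(document: str, trap: str) -> str:
    """Append the trap; a real tool would scatter copies throughout."""
    return document + "\n" + trap

def detect_trap(model_output: str, trap: str) -> bool:
    """Naive detection: does the trap sequence appear verbatim?"""
    return trap.lower() in model_output.lower()

trap = make_trap(seed=42)
doc = embed_trap("Original article text.", trap)
assert detect_trap(doc, trap)            # present in the watermarked source
assert not detect_trap("Unrelated text.", trap)
```

In practice detection is statistical rather than an exact substring match, since larger models paraphrase rather than regurgitate, which is why the technique works best against smaller, memorization-prone models.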

The core tension reflects an industry racing to accumulate training data while grappling with thorny questions of ownership, compensation and transparency. Publishers and creators are pushing back against unauthorized use while exploring new licensing models. This battle over content rights and training data may define the next phase of AI development.

We publish daily research, playbooks, and deep industry data breakdowns. Learn More Here

One of the holy grails of AI is building systems that can reason like humans, by improving the ability of AI systems, Large Language Models in particular, to break down complex problems and apply logical steps.

Bagel’s research team has been exploring this problem, analyzing LLM building techniques, especially fine-tuning, to help Large Language Models evolve from pattern-recognizing prediction agents into true cognitive agents.

AI generated art

A look at the art and projects being created with AI.

Midjourney Prompt: A photograph of an orange car with red flowers in front, parked outside white houses on the island of Lanzarote, against a blue sky on a sunny day. The image exhibits minimalism, natural colors, and natural lighting, with high-resolution photography that is insanely detailed and intricate, resulting in a hyper-realistic style in the manner of David LaChapelle.

How'd you like today's issue?

Have any feedback to help us improve? We'd love to hear it!
