Aug 2024 · 4 min read

Exploring LongWriter: Ultra-Long Text Generation from LLMs

In the world of large language models, processing vast amounts of text is routine. But generating lengthy outputs has always been a significant challenge. Ask your favorite model to write an in-depth 10,000-word article, and you will likely receive a fraction of the content you need.

A study from Tsinghua University titled "LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs" presents a practical approach to overcoming this limitation.

The Core Problem

Most LLMs can understand and process huge chunks of text, up to 100,000 tokens, but they fall short when producing anything longer than about 2,000 words of coherent output. The issue is not with the model's capacity. It is with the training data. These models simply are not given enough long examples during supervised fine-tuning. They are not used to thinking long-form.

AgentWrite

The researchers developed AgentWrite, an agent-based pipeline that breaks down the task of writing long-form content into manageable chunks:

  1. Planning. Like a good author starting with an outline, AgentWrite creates a detailed writing plan. It breaks the task into subtasks, each with a clear objective and a specific word count. A 10,000-word article on the Roman Empire might be split into 15 paragraphs, each covering a different aspect.
  2. Sequential generation. With the plan set, the model writes each paragraph one by one, ensuring every section connects with the previous. Like building a structure piece by piece, the final product is a coherent narrative rather than disconnected thoughts.
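The two steps above can be sketched as a simple plan-then-write loop. This is a minimal illustration, not the paper's actual implementation: the `llm` callable stands in for any chat-completion API, and the prompt wording is an assumption.

```python
from typing import Callable

def agent_write(instruction: str, llm: Callable[[str], str], num_sections: int = 15) -> str:
    """Plan-then-write loop in the spirit of AgentWrite (simplified sketch)."""
    # 1. Planning: ask the model for an outline, one section per line,
    #    each with a goal and a target word count.
    plan_prompt = (
        f"Break the following task into {num_sections} sections. "
        "For each, give a one-line goal and a word count.\n\n" + instruction
    )
    plan = [line for line in llm(plan_prompt).splitlines() if line.strip()]

    # 2. Sequential generation: write each section in order, feeding the
    #    text written so far back in so each section connects to the last.
    written: list[str] = []
    for step in plan:
        section_prompt = (
            f"Task: {instruction}\n\n"
            f"Text so far:\n{''.join(written)}\n\n"
            f"Write only this section next: {step}"
        )
        written.append(llm(section_prompt).strip() + "\n\n")
    return "".join(written).strip()
```

Because each section prompt carries the accumulated text, the model can pick up where the previous section left off rather than writing disconnected fragments.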

Under the Hood

LongWriter is not just about training on longer texts. The work also examines how the model processes and retains information across longer sequences, refining attention and memory handling so that output quality degrades less as the text grows, a problem that typically plagues extended generation.

The AgentWrite pipeline interacts with the model at every step of generation, adjusting the prompt and the context it feeds back in so that each new section stays coherent and relevant to everything written before it.

LongWriter-6k: The Dataset

The real breakthrough is LongWriter-6k, a dataset specifically designed for training on ultra-long outputs. It contains over 6,000 examples of texts ranging from 2,000 to 32,000 words. Think of it as a training regimen designed to build the model's stamina for long-form writing.
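Building a dataset like this implies filtering for genuinely long responses. The helper below is a hypothetical sketch of that idea, keeping only examples whose outputs fall in the 2,000 to 32,000-word range; the function name and data shape are illustrative, not the paper's actual pipeline.

```python
def build_long_output_subset(pairs, min_words=2_000, max_words=32_000):
    """Keep only (instruction, response) pairs whose response length
    falls in the long-form range, recording the word count for each."""
    kept = []
    for instruction, response in pairs:
        n = len(response.split())  # crude whitespace word count
        if min_words <= n <= max_words:
            kept.append({"instruction": instruction, "response": response, "words": n})
    return kept
```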

Benchmarks

To validate these capabilities, the researchers developed LongBench-Write, a benchmark specifically for evaluating long-form generation. They tested models of various sizes, including their 9B parameter model, and compared performance against much larger proprietary models.

The result: their 9B model did not just hold its own. It outperformed its bigger counterparts, demonstrating that training data and methodology matter more than raw parameter count.
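One way to quantify how well a model "holds its own" on long-form tasks is to score how closely its output length matches the requested length. The metric below is purely illustrative, an assumption for the sake of the example, not LongBench-Write's actual scoring formula.

```python
def length_score(required: int, produced: int) -> float:
    """Score in [0, 1]: 1.0 when the output hits the requested word count,
    decaying linearly as the relative error grows. Illustrative metric only."""
    if required <= 0:
        raise ValueError("required length must be positive")
    return max(0.0, 1.0 - abs(produced - required) / required)
```

Under a metric like this, a model that stops at 2,000 words of a requested 10,000 scores poorly no matter how fluent those 2,000 words are, which is exactly the failure mode the paper targets.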

Takeaway

This research points to a future where LLMs can generate extensive, high-quality text on demand. Whether for academic papers, detailed reports, or creative writing, the possibilities expand significantly. The code and models are available for everyone to experiment with.

By addressing the inherent limitations in how current models are trained, the team has opened up new ground for ultra-long text generation. The combination of agentic planning and purpose-built datasets is a pattern worth paying attention to.
