AI Doesn’t Write Very Well and Isn’t Getting Any Better at It. So Why is Everyone so Freaked Out?
AI and Writing Part I
I think LLMs are, on balance, great for writers and writing, but I think they don’t write very well. This series of posts will attempt to unravel any apparent paradoxes in that statement.
I enjoy a good moral panic as much as the next guy. But both the evidence and my professional, if anecdotal, experience have forced me to the conclusion that the creative hysteria over AI is overblown. Especially when it comes to writing. Let’s start with the evidence.
To prove its creations’ intelligence, OpenAI gives its models standardized tests and publishes the results. And, when measured this way, GPT-4 looks like an exponential improvement over GPT-3.5. Just look at the jump in its score on the Bar Exam or the GRE Quantitative Exam. That’s impressive. So impressive, in fact, that you can overlook that a few key tests didn’t change at all.
GPT-3.5 and GPT-4 not only fail the AP Language and AP Literature exams (2/5 on both), but their scores didn’t improve AT ALL from version 3.5 to 4. You can see the same lack of improvement on the GRE Verbal Exam. Iterations and more training don’t seem to help with these tasks. So what’s going on?
I’ve never taken the GRE, but I have gotten perfect scores on both the AP English exams. And a perfect score on the verbal portion of the SAT. So I can venture some guesses. The SAT is a test where every question has a correct answer. The AP English exams are tests where the most important questions have more than one right answer. And some answers can be said to be *righter* than others. They are a true test of writing as well as of thinking about writing at a very high level.
There is an element of musical performance to these kinds of tests. There isn’t a *best* way to play a piece of music, and certainly not a *best* way to play a jazz solo. There are only better and worse ways. Once you reach a certain threshold, technical proficiency is the price of entry. Things like taste and depth of feeling make the difference.
So an evaluation at this level is not about seeing if someone knows the tools, as much as seeing if they can express an idea in a way that moves us. Fiction or non-fiction, that’s what great writing does. And in the greatest of writing the ideas that drive the work are inseparable from the style it’s written in.
What I’ve Seen and Tried
GPT-4 is trained on my work. When I asked it to “describe the signature style of Patrick E. McLean,” here’s what it produced:
Patrick E. McLean is an author known for his engaging storytelling and a style that often features a blend of humor, insightful observations, and well-crafted prose. While I cannot provide real-time or the most current information, as of my last update in April 2023, I can highlight some characteristic elements of his style based on his works up to that date:
Wit and Humor: McLean's writing often includes a sharp, witty sense of humor. He has a knack for infusing his narratives with comedic elements that are both entertaining and thought-provoking.
Clear Prose: He tends to write in a clear, accessible style that is free of unnecessary complexity, making his stories easy to read and understand while still being rich in meaning and layered in their messaging.
Philosophical Underpinnings: Many of his stories, while entertaining on the surface, also delve into deeper philosophical questions about life, morality, and the human condition, inviting readers to think critically about the world around them.
Character Development: McLean has a talent for creating memorable characters with distinct voices and personalities. He often explores their development and growth over the course of a story, making his characters relatable and engaging.
Narrative Structure: His stories typically feature strong narrative structures with a clear beginning, middle, and end. He is adept at pacing his stories in a way that keeps readers interested from start to finish.
Genre Blending: McLean is known to blend genres effectively, often mixing elements of fantasy, science fiction, and real-world scenarios in ways that create unique and compelling worlds.
Please note that the description provided here is based on his style as understood up to 2023, and authors can evolve over time. For the most current insights into Patrick E. McLean's style or recent works, you may want to look for his latest publications or reviews.
I certainly aspire to live up to that description. I think the model is probably indexing things that have been written about my work, rather than doing any kind of literary analysis. Which, if we can believe the AP tests, it sucks at. But let’s allow that GPT-4 has an ‘understanding’ of what I do. Now let’s see if it can do it.
Can you give me 200 words on why AI doesn’t write very well in the style of Patrick E. McLean?
Let's be honest, an AI trying to write like me is like a robot trying to do stand-up comedy. It's got the timing down, sure, but it doesn't know why it's funny. I put a little bit of soul into every sentence—a little heart, a little pain, a little of the human condition. That's something you can't code into silicon, no matter how smart your programmers are.
You see, writing isn't just about stringing fancy words together. It's about stringing up a piece of yourself along with them. Every joke, every insight, every piece of wisdom—that's all hard-won from years of living, loving, losing, and the occasional embarrassing trip over the dog. An AI? It doesn't live. It doesn't feel the sting of heartbreak or the belly laugh of a good joke with an old friend. It doesn't pull from a well of experiences; it pulls from a database. Impressive? Sure. But it's the difference between reading about a sunset and actually watching the sky catch fire as the sun dips below the horizon.
So, while an AI can crank out words that might look like mine on the surface, it can't infuse them with the spirit of a well-lived life. It can't write with the heart of a human being who's been through the wringer and came out with a story to tell. And that, my friends, is the secret ingredient.
Ugh. “Living, loving, losing”? What kind of basic bitch bullshit is that? And what kind of critique is “can’t infuse them with the spirit of a well-lived life”? This is the Cargo Cult school of writing — putting words together in a way that is statistically likely to create meaningful sentences, without any understanding of the underlying meaning. This is a powerful and useful tool; it just isn’t *writing.*
Clearing Up the Confusion
Here’s an insight that I don’t think GPT-4 can have on its own. Every formal definition of writing we have is about the production of marks, whether on a clay tablet, with pen and paper, or on a screen. So it’s awkward to talk about the process of writing apart from the production of text. But the production of text isn’t the important part of writing. The important part is the production of meaning, first in the writer, then, perhaps, in a reader.
For fiction they say, “No surprise in the writer, no surprise in the reader.” Which is my pithiest critique of why LLMs can’t write fiction. (Note: this is different from saying a human can’t use LLMs to write fiction. I’m getting to that.)
For essays like this, it’s important to remember that the word essay means ‘to try’. I am attempting to articulate something at the limit of my understanding. Even if I fail in this attempt, through the attempt I will understand it better. And even a failed attempt can have beauty and power.
What will make this essay work or fail has little to do with stylistic flourishes and everything to do with whether the attempt succeeds. That’s what brings you along in the writing. We share curiosity, concern, and a sense of fumbling through a nest of complicated issues while we face something profound and important for the first time with no guidelines or roadmap.
Dear Reader, you and I are on an intellectual adventure. And, try as you might, you can’t get an LLM to take an intellectual adventure on its own. It simply can’t produce new knowledge in this manner.
How LLMs Work
I’m a writer who knows a little Linear Algebra, not an AI expert. So take this explanation with a grain of salt:
At the heart of every LLM is a large collection of vector representations of words. Words that are like one another are grouped together in vector space. But this vector space has far more than three dimensions. Words can be next to each other in one dimension yet conceptually very far apart in another. Along an axis representing ‘petness,’ the words dog and cat are very close together. But imagine another axis for ‘animals that get along,’ on which they are very far apart.
For a great example of how extra axes can lead to new understanding, consider models of politics. There is the traditional left-right model (one axis), but you can also think of a two- or three-axis model.
You can see how two axes give you more meaning. And maybe three axes give you even more. I’m not convinced there’s a meaningful distinction between Libertarianism-Totalitarianism and Individualism-Collectivism. But whether you agree or not, you can see that, as we keep adding axes, at some point another set of distinctions isn’t going to add any more information.
This is why consultants love two-axis conceptual models. (SWOT analysis, Gartner Magic Quadrant anyone?) They’re just complicated enough to make people who use them feel smart, without having to do the hard (or often impossible) work to truly understand the conceptual space.
The remarkable thing about LLMs is that no human has to (or can) specify the axes for these vector embeddings. We compute them and a superhuman level of understanding emerges from the mere categorization of relationships of words in a large enough set of texts. If the implication of the last sentence doesn’t blow your mind, you didn’t understand it.
But I think this strength is exactly why LLMs can’t produce great or even good writing. Or, indeed, write at all in the most profound, hardest-to-define sense of the word.
The Faustian Bargain Baked into the Algorithm
For all its complexity, I think a large language model is inherently a kind of conceptual or semantic average.
I ran the idea by GPT-4 and it agreed.
Vector Space and Semantics:
Words and phrases are represented as vectors in a multi-dimensional space, and the "meaning" of a word is determined by its position relative to other words in this space. These positions are determined by the aforementioned weights and are fine-tuned based on the vast amounts of text the model is exposed to during training.
Weighted Average and Word Meanings:
If we stretch the analogy, you could think of a language model's output as a kind of "weighted average" where the "average" is not just of word meanings in isolation, but also of their use in context, their relationships with other words, the structure of language, and the semantics and syntax rules learned during training. The "weights" in this average are the learned importance of different aspects of language, which are dynamically adjusted based on the input the model receives.
Concepts and Relationships:
Regarding what concepts belong together, the model learns associations between words and phrases through the contexts in which they appear. Words that frequently appear in similar contexts will have similar vector representations. So, the model "understands" concepts belonging together by how often and how closely they are used in the same context across the training data.
In summary, while an LLM does not compute a weighted average in the traditional sense, the concept of a weighted average can be metaphorically applied to understand how it processes language and generates responses. The model takes in all the contextual clues and outputs what it deems the most likely continuation based on the weighted patterns it has learned.
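The “weighted average” metaphor can be sketched in miniature. This is a toy illustration of picking the most likely next word from scores, not how any real model is implemented; the words and numbers are invented, and a real model computes these scores from billions of learned weights.

```python
import math

# Invented raw scores ("logits") for possible next words after
# "I threw the ball to the ___". Illustration only.
logits = {"dog": 2.0, "cat": 1.5, "truck": -1.0}

def softmax(scores):
    """Turn raw scores into a probability distribution that sums to 1."""
    exps = {word: math.exp(s) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

probs = softmax(logits)

# The output is, in effect, a probability-weighted blend of every
# continuation the model has seen in similar contexts.
best = max(probs, key=probs.get)
print(best, round(probs[best], 2))
```

The model doesn’t know why “dog” beats “truck” here; it only knows which continuation the weighted patterns favor.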
Since it’s been trained on the entire internet, its default is the average of everything that’s been written on the internet. And that’s not a good average.
“But,” you might protest, “it’s brilliant. Look at the answer it gave to your question!” The answer is great, and very helpful. But I narrowed the conceptual space. I asked the question. And it’s only as good as the questions I ask. Can I prompt it to improve the quality of its written response? Of course, but that’s just editing, not AI.
Let’s Ask “Somebody” Smarter
I wonder what Tyler Cowen would say about these questions. Here’s what happens when I ask GPT-4 to rewrite the answer as Tyler Cowen.
Consider the linguistic marketplace, where words and phrases are much like economic agents in a vast, multidimensional trade space. In this space, the currency is meaning, and it fluctuates. The value—or "meaning"—of these agents is not fixed but is instead a function of their position relative to others, much like the relative prices of goods and services in response to supply and demand.
In training language models, we encounter a form of price discovery. Weights, akin to market signals, adjust the positioning of words within this semantic trade space, refining their "meaning" based on a deluge of textual data, a veritable ocean of human thought and expression. This process is analogous to the way market prices are refined by the myriad transactions that occur every day.
If one dares to extend this market analogy, the output of a language model could be viewed as a kind of "semantic pricing mechanism," where the "price" is a composite of not only individual word meanings but also their contextual relationships, the structure of language, and the grammatical and syntactical norms that have been internalized during the model's "market research."
As for the aggregation of concepts, our language model is akin to a market analyst, discerning patterns and associations between words much as a seasoned trader might notice trends and correlations in the market. Words that cozy up to each other in various textual neighborhoods are like correlated assets, their value in meaning-space linked by the frequency and proximity of their contextual interactions.
In sum, while a language model does not engage in weighted averaging in the econometric sense, this metaphor serves to elucidate how it processes language and formulates responses. It surveys the vast landscape of linguistic exchanges and, in a sense, "predicts market movements," offering the most plausible continuations of the discourse based on the patterns it has observed and internalized.
It’s utter gibberish. While I can't predict what Tyler would say on this subject, I believe it would be brilliant. That's the point of truly great writing and insight: across some dimension it is unpredictable.
I suspect he might have something interesting to say about what the implications of Straussian writing would have on training LLMs. Leo Strauss thought that for most of history, writers were afraid to write what they really thought, so great texts are written in a way that tried to be intelligible to the smartest people yet palatable to the censors. Would an LLM miss this? Could an experiment with an LLM prove this relationship in a text?
I asked GPT what Tyler might say about the Straussian implications of training LLMs and it spit out more useless gibberish. Stuff that read like a precocious graduate student trying to show off the facts he had at his disposal about the subject, in an attempt to distract you from the fact that he had nothing meaningful to say.
Where Does All This Leave Us?
AI is not fundamentally different from any other writing tool. If you use it to write about a subject or an idea without having a deep understanding of what you are after, you wind up with crap. As I’ll explain in later posts, this is not a bug, it’s a feature. So are hallucinations. But the point seems incontestable to me. Give a Mont Blanc pen to a hack, you get dreck. Give the same hack a word processor, you get dreck faster. Give that hack GPT-4 and you get a frightening amount of dreck at a potentially internet-wrecking speed.
You can’t blame the tool for the craftsperson’s output. While LLMs are an amazing tool for writing and creativity that I believe every writer should enthusiastically make full use of, they’re just tools. And, as far as I can see, they’re not on a path to be anything else anytime soon.
Here’s why I have a high degree of confidence about this. Tyler Cowen has written an impressive number of words that are freely available on the internet. On that score, I’ve written something on the order of a million words myself, but mine are all quite unpredictable and fictional. Tyler writes on a much wider range of subjects, almost always applying a logically consistent (if complicated) set of tools (economics) to them. If any writer were going to be successfully modeled by an LLM from a human-scale amount of input, I think Tyler would be it. Yet it falls pretty short.
In March, Tyler used GPT-4 to interview Jonathan Swift, who died in 1745. You can judge the results for yourself, but what they certainly aren’t is fresh thoughts from Jonathan Swift. Tyler is having a conversation with an impressive search engine. It’s close, maybe 80% or even 90% convincing. But the nature of the material makes it easier for the AI for two reasons:
1. We don’t want ‘new’ Swift.
2. We can write off any weirdness as an anachronism.
We don’t really know what people were like back then. So when the AI output gets weird, it’s easy to read it as authenticity. But try this type of interview with a living person and I think all the flaws become much easier to spot.
Another curious reason unedited AI writing is easy to spot: it is almost impossible to get it to write in anything other than the expository mode. It always sounds like a student trying to show off how much he or she knows. Which is, for my money, the worst kind of writing.
We could posit this as a new kind of Turing test: a machine that writes well enough to fool a reader into believing the text comes not just from a person, but from a SPECIFIC person. And even mediocre writing requires sustaining that illusion over many words.
But I’m getting out over my skis…
In the next few posts, I’m going to write about how writers can use AI, why they should use it and the ethical questions that are raised in the process. Excited as I am, I am not an unreserved cheerleader for these tools. No technology is without its downside. As the man said, “There are no solutions, only tradeoffs.”