Literature Is not Data: Against Digital Humanities

'Big data is coming for your books.'

By Stephen Marche
October 28, 2012

Data banks are the Encyclopedia of tomorrow. They transcend the capacity of each of their users. They are "nature" for postmodern man.
— Jean-François Lyotard, The Postmodern Condition: A Report on Knowledge

BIG DATA IS COMING for your books. It’s already come for everything else. All human endeavor has by now generated its own monadic mass of data, and through these vast accumulations of ciphers the robots now endlessly scour for significance much the way cockroaches scour for nutrition in the enormous bat dung piles hiding in Bornean caves. The recent Automate This, a smart book with a stupid title, offers a fascinatingly general look at the new algorithmic culture: 60 percent of trades on the stock market today take place with virtually no human oversight. Artificial intelligence has already changed health care and pop music, baseball, electoral politics, and several aspects of the law. And now, as an afterthought to an afterthought, the algorithms have arrived at literature, like an army which, having conquered Italy, turns its attention to San Marino.

The story of how literature became data in the first place is a story of several, related intellectual failures.

In 2002, on a Friday, Larry Page began to end the book as we know it. Using the 20 percent of his time that Google then allotted to its engineers for personal projects, Page and Vice-President Marissa Mayer developed a machine for turning books into data. The original was a crude plywood affair with simple clamps, a metronome, a scanner, and a blade for cutting the books into sheets. The process took 40 minutes. The first refinement Page developed was a means of digitizing books without cutting off their spines — a gesture of tender-hearted sentimentality towards print. The great disbinding was to be metaphorical rather than literal. A team of Page-supervised engineers developed an infrared camera that took into account the curvature of pages around the spine. They resurrected a long dormant piece of Optical Character Recognition software from Hewlett-Packard and released it to the open-source community for improvements. They then crowd-sourced textual correction at a minimal cost through a brilliant program called reCAPTCHA, which employs an anti-bot service to get users to read and type in words the Optical Character Recognition software can’t recognize. (A miracle of cleverness: everyone who has entered a security identification has also, without knowing it, aided the perfection of the world’s texts.) Soon after, the world’s five largest libraries signed on as partners. And, more or less just like that, literature became data.

Or rather, it had the potential to be data. Page and his team subsequently ran into a problem too knotty even for their ever-untangling minds: the literary world. The legal case brought by the Authors Guild and the Association of American Publishers against Google was a revelation, as important, if not as celebrated, as the obscenity trial of D.H. Lawrence’s Lady Chatterley’s Lover. In the face of the openness and honest labor of engineers, the priestly class closed ranks. Instead of accepting the gift of digitization, the possibility of bringing the wealth of the tradition to the widest possible public for free, literary people immediately set about doing what they do best: vapid, internecine squabbling. The librarians stepped in. Authors wanted to be heard. The situation soon became untenable.

Google’s mistake was listening to all this chatter, respecting it, and actually trying to broker a settlement, which was naturally impossible, like trying to negotiate with a flock of sparrows. In hindsight, perhaps, Google should have followed the law for “fair use” of copyright, come to agreements with the world’s major libraries to provide the Book Search to public institutions in perpetuity, and stepped aside. Then again, they did nothing wrong.

The problem lay at the feet of literary institutions and their inherent fearfulness. In the popular imagination, writers and professors are liberals, hedonists, bohemians. Nothing could be further from the truth. They are, in fact, profoundly, deeply, organically conservative. The birth of print saw the emergence of many of the same Luddite tendencies recognizable today. The great German scholar Trithemius’s “In Praise of Scribes” has become the clichéd example of early scribal resistance to print, but his arguments were not ridiculous: printed books were much less beautiful than handmade ones; copying out a text allowed the scribes to identify and stop the reproduction of errors. The process of writing out a text produced a spiritually powerful condition. “In Praise of Scribes” tells the story of one scribe who had to be disinterred after years of scribal work; his colleagues find the three fingers of his writing hand “incorruptible.” Nonetheless “In Praise of Scribes” was printed, not copied. Trithemius went along with the changing world even as he claimed to despise it.

Google Books, in its way, represents an even more profound shift than the printing press, because it ends the relationship to the codex which began much earlier, in the fourth century. Binding together texts into portable libraries was one of the original Christian acts. For the Romans, texts were isolated events contained in scrolls. The ferocious squeamishness of hundreds of librarians and writers and scholars who resist this disbinding of literature today isn’t mere self-interest. The end of the book is a kind of sacrilege to them, and they’re not wrong. Cutting open the book is literally a return to the forms and modes of paganism.

There are, of course, many hundreds of digital projects on the go at the moment. For instance, Early English Books Online has existed for a decade. That wonderful database in its own way demonstrates how digitization leads to the decline of the sacred. Before EEBO arrived, every English scholar of the Renaissance had to spend time at the Bodleian library in Oxford; that’s where one found one’s material. But actually finding the material was only a part of the process of attending the Bodleian, where connections were made at the mother university in the land of the mother tongue. Professors were relics; they had snuffboxes and passed them to the right after dinner, because port is passed left. EEBO ended all that, because the merely practical reason for attending the Bodleian was no longer justifiable when the texts were all available online.

Stylometry, the analysis of definable patterns in literary styles, has also been a mode of desacralization. Among its biggest achievements was adding Middleton’s name to Shakespeare’s under the title of Timon of Athens. The most sacred idea in literary history — the pure and lonely genius of Shakespeare conjuring his work out of a mythy mind — was one of the method’s most prominent casualties.

At the advent of print, the humanities emerged, under the aegis of Erasmus and others, to negotiate the spread of the classical tradition out of the monasteries into private hands. Today, with the advent of the Internet, Google’s self-described project is to make the world’s information “universally accessible and useful.” Academia could have done what humanists have done throughout history and tried to add to Google’s mandate: make the texts legible and available. They could have tried to bring out the contemporary relevance that only historical context, knowledge of literary tradition, and scholarly standards can provide. But this ancient task was anathema, for the simple reason that it would have involved honest work. Much easier to remain in the safe irrelevance of mass publication in the old mode, what Kingsley Amis called “the pseudo-light it threw on non-problems.” For at least 50 years, humanities departments have been in the business of creating problems rather than solving them.

All in all, it’s fair to say that the conversion of literature into data could not have gone much worse, which does not bode well for the second, oncoming phase, where we decide what to do with the literary data we now have.

The phrase “digital humanities” produces instant titillation and an equally instant sense of fading comedy. William Pannapacker, blogging for The Chronicle of Higher Education, had a good view of the initial euphoria: “Amid all the doom and gloom of the 2009 MLA Convention, one field seems to be alive and well: the digital humanities. More than that: Among all the contending sub-fields, the digital humanities seem like the first ‘next big thing’ in a long time.” That’s what the digital humanities is: yet another next big thing. It’s a phrase with a wide array of meanings. It can mean nothing more than being vaguely in touch with technological reality — being an English professor who is aware of the existence of Twitter, for example — or understanding that there are better ways of disseminating academic studies than bound academic journals languishing on unvisited shelves. There are niche fields within digital humanities which are obviously valid, too, such as readings in avant-garde digital fiction or the analysis of how the development of word processing has affected contemporary writing practice. These are growing fields, important even, but necessarily minor.

“Distant reading,” a phrase Stanford’s Franco Moretti coined over a decade ago, is the most promising path, at least on the surface. Data mining is potentially transformative, more for its shift in attitude than for any actual insight it has generated. Some of its lexicographical generalizations have been remarkably astute as philology, establishing scalable n-grams of word sequences over time. The problem comes when these generalizations are applied to literary questions proper. One recent paper in the Proceedings of the National Academy of Sciences, “Quantitative Patterns of Stylistic Influence in the Evolution of Literature,” contended that contemporary writers have less familiarity with the classics than writers of previous eras. It was immediately pilloried by other scholars and in the press.

And it’s easy to see why. Even a relatively casual examination of the fundamental assumptions underlying the argument reveals the mushiness of the words beneath the hard equations. What is a “classic”? What is “influence”? Are similarities of language the most fundamental way of establishing the similarities between authors? The problem with “distant reading” is, naturally, the distance involved.

But there is a deeper problem with the digital humanities in general, a fundamental assumption that runs through all aspects of the methodology and which has not been adequately assessed in its nascent theory. Literature cannot meaningfully be treated as data. The problem is essential rather than superficial: literature is not data. Literature is the opposite of data.

Data precedes written literature. The first Sumerian examples of written language are recordings of beer and barley orders. But The Epic of Gilgamesh, the first story, is the story of “the man who saw the deep,” a hero who has contact with the ineffable. The very first work of surviving literature is on the subject of what can’t be processed as information, what transcends data.

The first problem is that literature is terminally incomplete. You can record every baseball statistic. You can record every trade over the course of a year. You can work out the trillions of permutations and combinations available on a chessboard. You can even establish a complete database for all of the legislation and case law in the world. But you cannot know even most of literature, even English literature. Huge swaths of the tradition are absent or in ruins. Among the first Anglo-Saxon poems, from the eighth century, is “The Ruin,” a powerful testament to the brokenness inherent in civilization. Its opening lines:

The masonry is wondrous; fates broke it
The courtyard pavements were smashed; the work of giants is decaying.

The poem comes from the Exeter Book of Anglo-Saxon poetry and several key lines have been destroyed by damp. So, one of the original poems in the English lyric tradition contains, in its very physical existence, a comment on the fragility of the codex as a mode of transmission. The original poem about a ruin is itself a ruin.

Literature is haunted by such oblivion, by incipient decay. The information we have about the past is, in almost every case, fragmentary. There are always masses of data which are simply missing or which cannot be untangled. The most obvious and relevant example is Shakespeare. There are nine different versions of Richard III; there are three versions of Hamlet, each with missing sections or added sections. There are missing plays. Cardenio. Love’s Labour’s Won. They no longer exist. So even the work of Shakespeare, which has been scrupulously attended to by generations of scholars, cannot be completely described. Literature is irredeemably broken and messy. Its brokenness and its messiness are part of its humanness.

The experience of the mystery of language is the original literary sensation. The exuberance of ancient literature — whether it is in the simple, inscrutable lyrics of Sappho or Oedipus’s tragic misunderstanding of the oracles — contains a furiously distressed joy that words mean so much more than they mean. Take any meaningful line in literature and the same fugitive release from the status of information is there. Take my favorite line of Shakespeare’s, from Macbeth: “Light thickens, and the crows make wing to the rooky wood.” What is the difference between a crow and a rook? Nothing. What does it mean that light thickens? Who knows? The lines, as data, are more or less nonsense. And yet they illuminate their moment radiantly.

The reality of context is even more problematic. Distant reading is new exactly insofar as it separates itself from the traditional modes of comprehending the tradition. The pleasure of big data, and the algorithmic analysis of it, is its democratic spirit. One heartening aspect of reading dozens of digital humanities journals is how many of their authors, though terminally misguided in my view, retain the spirit of the engineer: open-minded, clear about the limitations of the data and the methodology, and frank about what they think they are accomplishing. That attitude is unspeakably refreshing, brushing away the entire apparatus of professorial self-importance. Unfortunately, they can only do so by treating all literature as if it were the same. The algorithmic analysis of novels and of newspaper articles is necessarily at the limit of reductivism. The process of turning literature into data removes distinction itself. It removes taste. It removes all the refinement from criticism. It removes the history of the reception of works. To the Lighthouse is just another novel in its pile of novels.

Borges predicted this phenomenon, as he seems to have predicted so many of our current predicaments. His story “Pierre Menard, Author of the Quixote,” practically reads like a polemic against the premise of the digital humanities. Pierre Menard decides to write Don Quixote — not a revamped Don Quixote or a transcription of the original, but the same exact text of Don Quixote as if it were written by a contemporary author. Borges writes:

It is a revelation to compare Menard’s Don Quixote with Cervantes’s. The latter, for example, wrote (part one, chapter nine): “truth, whose mother is history, rival of time, depository of deeds, witness of the past, exemplar and adviser to the present, and the future’s counselor.” Written in the seventeenth century, written by the “lay genius” Cervantes, this enumeration is a mere rhetorical praise of history.

Menard, on the other hand, writes: “truth, whose mother is history, rival of time, depository of deeds, witness of the past, exemplar and adviser to the present, and the future’s counselor.” History, the mother of truth: the idea is astounding. Menard, a contemporary of William James, does not define history as an inquiry into reality but as its origin. Historical truth, for him, is not what has happened; it is what we judge to have happened. The final phrases — exemplar and adviser to the present, and the future’s counselor — are brazenly pragmatic.

The data are exactly identical; their meanings are completely separate.

The implications of literature as resistance to data extend well beyond the mostly irrelevant little preserve of literature and literary analysis. Algorithms are inherently fascistic, because they give the comforting illusion of an alterity to human affairs. “You don’t like this music? The algorithms have worked it out” is not so far from “You don’t like this law? It works objectively.” Algorithms have replaced laws of human nature, the vital distinction being that nobody can read them. They describe human meanings but are meaningless.

Which is why algorithms, exactly like fascism, work perfectly, with a sense of seemingly unstoppable inevitability, right up until the point they don’t. During the Flash Crash of May 6, 2010, the Dow Jones lost nine percent of its value in five minutes. More recently, Knight Capital lost 440 million dollars at a rate of about 10 million dollars a minute due to what it called “a rogue algorithm.” Algorithms cannot, of course, be rogue. But rogue is the term we have invented for algorithms that don’t do what they’re supposed to, which is as much as to say that their creators don’t comprehend what they’re doing. Before that 440 million dollar loss, Knight Capital had used science to identify a functional law of the marketplace. They had engineered an end to the fundamental human condition of risk. They had not, 45 minutes later. As Borges also wrote, “There is no exercise of the intellect which is not, in the final analysis, useless.” This same futility, it should be remembered, haunts mathematical modeling as much as literary contextualization.

Meaning is mushy. Meaning falls apart. Meaning is often ugly, stewed out of weakness and failure. It is as human as the body, full of crevices and prey to diseases. It requires courage and a certain comfort with impurity to live with. Retreat from the smoothness of technology is not an available option, even if it were desirable. The disbinding of the papers has already occurred, a splendid fluttering of the world’s texts to the winds. We will have to gather them all together somehow. But the possibility of a complete, instantly accessible, professionally verified and explicated, free global library is more than just a dream. Through the perfection of our smooth machines, we will soon be able to read anything, anywhere, at any time.

Insight remains handmade.

Stephen Marche is a novelist and an essayist.
