In Defense of Data: Responses to Stephen Marche's "Literature is not Data"




by Holger S. Syme & Scott Selisker

Two rebuttals to 'Literature is not Data: Against Digital Humanities'

November 5th, 2012

 
The Digital Inhumanities?

Scott Selisker

 

STEPHEN MARCHE'S RECENT ESSAY in the Los Angeles Review of Books, “Literature is Not Data: Against Digital Humanities,” has garnered a fair amount of negative attention over the past few days. In a lively comment thread and across many conversations on Twitter, there has been ample talk of straw men, overly defensive digital humanists, Marche’s unaccountable nostalgia, as well as the meaning of the word “data.” (Natalia Cecire gets it right relative to Marche’s title: “Almost nothing in the world IS data. ‘Data’ is an abstraction we use to make certain kinds of inquiry possible.”) In playing the digital humanities alternately as a straw man and a bogeyman, though, Marche gives only vague hints about what it actually is.

Marche’s most consistent theme is the notion that Google’s digitization of millions of books, the universal accessibility of literary archives, and the machine-assisted crunching of literary datasets (or “distant reading”) are all a sort of algorithmic and inhuman development in the study of literature. Quantitative analysis is “necessarily at the limit of reductivism,” and, considered only as data, “To the Lighthouse is just another novel in [the database’s] pile of novels.” The notion seems to be that a professor somewhere will feed all of Virginia Woolf’s books into a machine and forget they’re good books. (For the record: Virginia Woolf is in no danger on this count.) Marche imagines that, in turning to the digital humanities, literature scholars, and maybe even readers, will lose sight of literature’s “humanness.” The threat of the digital itself, then, prompts Marche’s catalog of literature’s “human” qualities: its messiness and fragmentation, the irreducibility of its content to mere data, the insights it contains that cannot be machine-read.

It’s Marche’s hyperbolic comparisons between “algorithms” and “fascism” that suggest the essay is a brand of all-too-familiar doomsaying about technology. In talking about the “humanness” of literature, he may as well be talking about “humanness” per se: camaraderie, human contact, meaning, “messiness,” and insight. The engine behind the essay seems to be an appeal to the vague fears we share about how “cockroach”-like machines will threaten the things we value, from the experience of reading to our life savings. In ascribing a kind of fascism to the algorithm, Marche echoes dystopian writers like George Orwell and Anthony Burgess, who have posited technology as the totalitarian instrumentalizer of all things, as the opposite of the human, humane, and democratic. Only in this sense can it be tragic, as Marche suggests, that an algorithm will “treat all literature as if it were the same,” even if only for a few processor cycles. At other moments, however, Marche acknowledges the utopian aspects and “democratic spirit” of freely available digital texts through Google Books. The ungainly mix of utopian and dystopian thinking about technology in contemporary culture is something Marche’s essay conveys quite well, if perhaps unwittingly.

With Marche’s critical sights set on the digital humanities’ destruction of all things sacred, one would expect he’d offer more thorough consideration of the topic. His single example is a paper written not by literary scholars, but by computer scientists: “Quantitative Patterns of Stylistic Influence in the Evolution of Literature.” Moreover, his tone shares much in common with a similar popular article about the same study, which ran in Salon this spring. Most other stories about the emergence of the digital humanities have also mentioned the National Endowment for the Humanities’ establishment of an Office of Digital Humanities, and Patricia Cohen’s profile-raising New York Times Humanities 2.0 series, which covered the work by historians Dan Cohen, Fred Gibbs and others around the Google Books-affiliated “n-grams” tool via literary and historical studies. (To Patricia Cohen’s great credit, that series includes plenty of examples of the kinds of sophisticated thinking that must accompany algorithmic criticism — thinking that library-related and interdisciplinary digital humanities centers around the country have been fostering for the past decade.)

The digital humanities have been described — and derided — as “the next big thing,” but it would be hasty to conclude that they will radically alter the overall values or missions of literary scholarship. I like to think digital technologies will have, and have had already, their biggest impact in changing the ways scholars communicate their work to the public. It’s easier than ever to deliver useful content to readers, both on engaging online multimedia archives and on websites — like this one — that combine scholarly and popular writing to put avid readers in touch with buzzing literary and cultural scenes. But in their own corners, the new methodologies of distant reading and data mining are also changing how we understand the limits of what we can know about culture in the digital age.

The digital humanities scholars to whom Marche alludes (but never names) have been working with large datasets, and coming up with corresponding new questions about literary, cultural, and linguistic history. Rest assured, this new set of questions has not included “what does literature mean?” or “what is the ineffable thing that Gilgamesh is about?” — precisely because those are questions that scholars and critics don’t need computers to keep trying to answer. In Douglas Adams’ The Hitchhiker’s Guide to the Galaxy, a computer is constructed that’s powerful enough to provide the answer to “life, the universe, and everything,” but the questions digital humanists have put to computers are far more specific. The digital humanities don’t sound the klaxon of doom or, for that matter, heavenly trumpets of triumph. They don’t threaten the individuality of literary works, but rather help us return to those literary works with more information at hand.

Take a few examples of the questions asked by recent and forthcoming projects in the digital humanities:

  1. How do the word-choices of poets (with their often more expansive, “literary” vocabularies) shift across time relative to those of other writers?
     
  2. Why are the titles of eighteenth century novels longer than those of nineteenth century novels? If we look at exactly how and when the titles shrink, can we hypothesize what historical events, technological changes, and marketplace thresholds might have played a role in that shift?
     
  3. What can the timeline of the many reprintings of a particular short story in American nineteenth century newspapers tell us about the circulation of information and the public sphere in the railroad age?
     
  4. How did the early twentieth century publication of literary concordances — that is, tabulated lists of the individual words that authors like Chaucer and Shakespeare used — affect the quantity of scholarship produced about those authors? (Surprisingly, it didn’t.)
     

This is a tiny sampling of the variety of distant-reading projects to which Marche gestures. I would hope this list allays any possible fears that literary studies is aiming to replace literature with computer code, or to trade literary complexity for mindless formulae. Instead, these projects are merely thinking in creative ways about literary-historical problems that couldn’t be easily addressed without computers. As Marche acknowledges, there are numerous other niches within the big tent that is the digital humanities, which include new strategies for mapping information and networks, new online archives to make works of literature more accessible and understandable, new experiments in the forms of scholarship, and new experiments in humanities and literary communication. 

As in the sciences, some of this research will come to nothing; a percentage of it will change how we account for particular works of literature; some may change how we understand the sweep of literary history. But there are no monsters, or fascists, under any of these beds. None of these questions is going to endanger the ways that literature spurs all of us to think. They will not encroach on the experience of students in classrooms, adults curled up on couches, and literature professors in their offices. The machine-driven projects of distant reading will humbly supplement — usually by just a little, and perhaps one day by a great deal — what we know about literature, just as historical data and biographical data have done all along. We’ll have a good idea exactly how many detective novels were written in each year of the nineteenth century, just as we already know what five pounds sterling could buy you in Arthur Conan Doyle’s London. As a new and interdisciplinary set of concerns, the digital humanities still has much to learn from computer scientists, from geographers and statisticians, and from the social scientists who have thought carefully about data visualization. But making practical sense of new methodologies and new kinds of information has always been one of the strong suits of humanities research.

Data, machines, and algorithms are coming, but Marche misses what we need most: technologically astute writing and scholarship that understands the uses and limitations of data and the unquantifiable values found in literature. As he points out, algorithmic stock trades, the monetization of more and more of our activities, drone warfare, and more, pose challenges to our values, in both economic and ethical terms. In the U.S. in particular, we live amidst a bombardment of images and information from a heavily lobbied political system and politically partisan scientific studies and news outlets. As technology and communication advance more rapidly than ever, we face not only information overload but also the challenge of dealing with conflicting kinds of knowledge and expertise. How do we adjudicate between them? How do we think clearheadedly about what we value, and why? I don’t think we’ll stop turning to the humanities, or to literature, for answers.
 

¤
 

Imaginary Targets

Holger Schott Syme

 
STEPHEN MARCHE'S “Literature is not Data” launches an attack on authors and academics. Or on digital humanists. Or on algorithms (which are, saith Marche, fascist). Or something. It’s a very strange, very ill-informed, very incoherent essay, and demands a more in-depth response from someone who is more immersed in current Digital Humanities practices than a mere dabbler such as myself. As a theatre and book historian, I spend a fair bit of time shuffling and reshuffling spreadsheets of sixteenth-century box office and publication data and have become used to making my arguments via pie charts and line diagrams; I’ve also edited Shakespeare for an online format. But I don’t truly consider myself a Digital Humanist. (In the interest of full disclosure, I should also mention that I’ve taken Stephen Marche to task for the sloppiness of his research before, in a long review of his book How Shakespeare Changed Everything.)

All that said, I feel compelled to address a couple of characteristic blunders in Marche’s article.

First of all, there is the weird narrative of Google Books he spins. In that story, “the openness and honest labor of engineers” comes face to face with the “closed ranks” of the “priestly class:” poor old Google just wants to make all the books it’s digitized freely available, or at least searchable, while “literary people” selfishly reject “the gift of digitization.” If Marche is to be believed, the conflict over Google Books was fought between a benign team of practically-minded innovators and a coterie of “writers and professors,” who, far from being “liberals, hedonists, bohemians,” are “in fact, profoundly, deeply, organically conservative.” He mentions, but then quickly ignores, that the legal case against Google was brought not just by the Authors Guild, but also the Association of American Publishers. Corporations, in Marche’s story, are good: they solve problems. Writers and thinkers, on the other hand, are bad: they squabble, and they “create problems rather than solving them.” Publishers, somehow, don’t appear to have a dog in this fight at all.

This is nonsense. It’s similarly nonsense to claim that “professors” were especially active in fighting Google’s noble mission to democratize knowledge. Very few academics make any money at all from their publications, as Marche must know. I recall some colleagues reacting with trepidation to the prospect of their books becoming available in full on Google, but that, of course, never happened; I would think that the vast majority of us are quite happy to have our work more widely accessible than it is when contained in the pages of $100-plus volumes and locked away in university libraries. The authors who objected most strenuously to Google’s project were those who stood to lose royalties — and, of course, their publishers.

Most revealingly, however, Marche claims that a digitized library of the world’s books was Google’s idea. It wasn’t. I wouldn’t presume to offer an authoritative alternative history, but I will point out that the Internet Archive, now containing almost 3 million public-domain texts, started in 1996, well before Google got on the bandwagon. Nor is it true that “the world’s five largest libraries signed on as partners” in the Google Books project. They didn’t, and they haven’t. Some very large libraries were among the initial partners (Harvard and the New York Public Library, currently ranked 3rd and 4th in the US). But none of the major National Libraries have joined in, and the project remains extremely Anglocentric in focus.

Marche thinks Google should never have bothered to engage with authors, because that way, well, lies madness (and squabbling). Instead, he proposes, “[i]n hindsight, perhaps, Google should have followed the law for ‘fair use’ of copyright, come to agreements with the world’s major libraries to provide the Book Search to public institutions in perpetuity, and stepped aside.” Sounds good. Except that what “fair use” means in this context is far from a settled legal issue. It was the question at the heart of the lawsuit, and a question left open when Google and the other parties in the suit came to a settlement in 2009. That settlement was rejected by a judge in 2011, and the case is currently pending. (Robert Darnton has written perceptively on the reasons the settlement failed.)

However, none of this has stopped Google from digitizing books: the collection is steadily growing. Nor has it made conducting full-text searches harder. Google won’t display copyrighted material from books whose publishers have not signed an agreement, but the text is still being searched. And in any case, this only concerns material still in copyright. Pre-1923 texts are fully and freely available.

All of which is to say, I have no idea exactly why Marche thinks Google Books has been a “failure” — or why he claims that scholars have simply refused to engage with the kind of work Google is doing:

Academia could have done what humanists have done throughout history and tried to add to Google’s mandate: make the texts legible and available. They could have tried to bring out the contemporary relevance that only historical context, knowledge of literary tradition, and scholarly standards can provide. But this ancient task was anathema, for the simple reason that it would have involved honest work. Much easier to remain in the safe irrelevance of mass publication in the old mode, what Kingsley Amis called “the pseudo-light it threw on non-problems.”

The central sneer here appears to be that academics don’t like “honest work” and prefer “mass publication in the old mode” — a mode that apparently does not involve making legible texts available. I honestly have no idea what Marche is talking about. The past 20 years have seen an astonishing wealth of academic, not-for-profit undertakings that make online texts available in reliable versions all over the place all the time — independently of or in cooperation with business enterprises such as Google’s. That Marche would locate the true scholarly spirit so emphatically inside the hallowed halls of Googleplex speaks volumes. Just as the commercial interests of publishers are absent from his narrative of the Google Books law case, Google’s own commercial agenda is rendered invisible in his portrayal of the company as a heroic fighter for the free and uninhibited circulation of the world’s knowledge. And since such a hero needs an antagonist, Marche dreams one up in the guise of the terminally lazy, obstinately conservative academic profoundly opposed to the notion of sharing intellectual wealth.

Even as he seems to describe an anti-innovative attitude as a hallmark of the modern academic, however, Marche also has quite a bit to say about the supposed impact of the digital revolution on academic research. His prime example is EEBO (Early English Books Online). He does not seem to be aware that EEBO is an expensive subscription service, nor does he seem to realize that the vast majority of the books it contains are simply digitized from microfilms that were available long before the World Wide Web changed everything. He also doesn’t seem to know that EEBO’s full-text search capabilities entirely rely on the efforts of participants in the not-for-profit Text Creation Partnership (TCP), an undertaking funded by over 150 libraries worldwide, whose efforts will become freely available in a few years. (In other words, EEBO is commercial and expensive, and it limits access to its collection; TCP is academic and will soon be free and open-access.) But never mind that. Far more baffling is how Marche imagines Renaissance scholars worked in the bad old days:

Before EEBO arrived, every English scholar of the Renaissance had to spend time at the Bodleian library in Oxford; that’s where one found one’s material. But actually finding the material was only a part of the process of attending the Bodleian, where connections were made at the mother university in the land of the mother tongue. Professors were relics; they had snuffboxes and passed them to the right after dinner, because port is passed left. EEBO ended all that, because the merely practical reason for attending the Bodleian was no longer justifiable when the texts were all available online.

No British Library in Stephen Marche’s world; no Huntington, no Houghton, no Beinecke, no Folger, no Newberry, no Library of Congress; no Cambridge University Library, no National Library of Scotland. Renaissance scholars all flocked to the Bod — and now, one supposes, the Bod stands empty, while we all stare at our screens. I’m glad Stephen Marche was treated to snuff in the hall at whatever college he was staying at in Oxford — I never have been, though I can report that professors still eat dinner there, and still pass the port. Some of them may fairly be considered relics, though, I expect, no more or fewer of them than in the pre-EEBO days. And the Bodleian remains busy, as do all the other excellent and well-stocked research libraries I’ve mentioned.

It is certainly true that things have changed. Scholars fortunate enough to work at institutions with an EEBO subscription can read far more materials at home, just as those whose libraries owned full runs of the old STC microfilms could. But that hasn’t spelled the end of research trips to archives. What is true is that there is greater interest in manuscript work now than there has been for a long time, and there is doubtless a connection between that shift in focus and the wider availability of digital versions of printed texts. Cynically, one might suggest that scholars need to justify taking research trips somehow, and looking at manuscripts, or at individual copies of works, is a great way of doing that. More idealistically, one might argue that services such as EEBO have freed up more time for archival exploits that were simply not manageable for most scholars before. Either way, the scene Marche describes still plays out, all around the world, not just in Oxford. (Though without the snuff.) Same as it ever was.

Elsewhere, Marche similarly muddles his accounts of history and of the present: “Stylometry, the analysis of definable patterns in literary styles, has also been a mode of desacralization.” Fair enough, I suppose. But stylometry has nothing to do with Google Books. Or, for that matter, with the internet per se (as I imagine the fifteenth century stylometrist Lorenzo Valla would point out, if he still could). Marche’s single example of the triumph of stylometry — the addition of Middleton’s name to the title page of Timon of Athens — has its basis in R. V. Holdsworth’s unpublished 1982 PhD thesis. In published form, the most prominent summary of the arguments can be found in Brian Vickers’s Shakespeare as Co-Author, which appeared in 2002: two years before Google Books put a single digitized volume online.

Ill-informed or not, this long opening salvo to Marche’s essay must come as a surprise, given his ostensible purpose of arguing against the Digital Humanities. He spends over half his article singing the praises of Google Books, highlighting the virtues of EEBO and of the new internetified science of Stylometry, and castigating crusty, old, sluggish scholars for refusing to do their bit to make the media revolution happen. Sounds like a grand defence of DH to me, or at least of a heavily corporatized version of DH.

But then Marche switches from one imaginary target to another. If the majority of academics are loathsome in their retrograde attachment to paper and their unwillingness to share their knowledge, their digitally open-minded colleagues are vile in their refusal to acknowledge the special status of the literary: “Literature cannot meaningfully be treated as data. The problem is essential rather than superficial: literature is not data. Literature is the opposite of data.”

To which one obvious answer is: well, duh. And another obvious response may be, “Well, only if you don’t understand what ‘data’ means.”

On one level, Marche is clearly right, though his insight is hardly original: “The experience of the mystery of language is the original literary sensation. The exuberance of ancient literature — whether it is in the simple, inscrutable lyrics of Sappho or Oedipus’s tragic misunderstanding of the oracles — contains a furiously distressed joy that words mean so much more than they mean.” As so often in Marche, it’s all expressed in too absolute terms — too, if you will, exuberantly — but the ideas are anything but new. Or controversial.

I don’t know, to be honest, what text-mining DHers would have to say in response. I doubt anyone has yet come up with software that can explain how great literature works. (I hope no one has.) And if anyone ever were to develop a program that can deliver the ultimate analysis of any text we feed it, our jobs as teachers of literature would probably be over. But so would the jobs of literary authors (since a machine that can decode literary mystery would presumably also be capable of mimicking the creative act: once a mystery is dispelled, it becomes reproducible, as medieval guild members knew all too well). And as far as I know, no one is actively trying to destroy literature through electronic demystification.

Marche writes as if all literary scholarship were engaged in acts of critical interpretation — more specifically, in acts of close reading. As he must know well enough, given his academic background, that isn’t true now and never has been the case. Criticism is one kind of literary scholarship, but it’s only part of the larger enterprise; and I suspect it’s the part DH is least good at. Literary history, on the other hand, is far more likely to benefit from the broad-based, distant view data-rich approaches can offer — although Marche, bizarrely, thinks that “the process of turning literature into data removes […] the history of the reception of works.” He’s right that a data-centric approach is less likely to be influenced by “taste” or “refinement,” but for my money, that’s a good thing. History dictated by taste is history written by the winners. And that’s bad history.

“Meaning is mushy,” Marche writes, not inaptly. But whereas the meaning of a line of poetry may emerge more clearly, or more richly, simply through contemplation and critical engagement, the meaning or the shape of a historical development is just as likely to become apparent through a process of accumulating more data — of stepping back and seeing the development in the broadest possible context, the kind of context data analysis can provide with a clarity and a neutrality likely lost in a critical endeavor propelled by questions of taste and a desire for refinement.

I can’t be the only one who’s finding it difficult to reconcile Marche in Matthew-Arnold-mode with Marche in Google-Books-acolyte-mode.

Finally, Marche seems to think, rather puzzlingly, that “data” somehow implies “completeness.” “Literature is terminally incomplete,” he notes. What he means, as far as I can tell, is that not every literary text ever written has survived, though he quickly moves from this discussion of literature’s archival fragmentation to the (unrelated) challenge of the fragmentation of meaning in the literary text. He appears to concede that this problem of partial transmission does not afflict literature alone (“The information we have about the past is, in almost every case, fragmentary”), so that it is presumably not literature alone but all human existence that is “haunted by such oblivion, by incipient decay.”

But it’s unclear why any of this should matter. No data set is ever complete. Marche’s counterexamples are baseball statistics and case law. He doesn’t seem to be aware that in both those cases, we’re dealing with flawed and incomplete sets of data. Baseball stats have become ever more detailed and fine-grained in recent years, and many of the analyses now possible (of pitching data in particular) cannot be undertaken for historical figures, as the numbers aren’t available. And the idea that it’s possible to “establish a complete database for all of the legislation and case law in the world” is just preposterous. Like any other human activity, law cases are subject to transcription and transmission, to conventional editing and pruning, and to archival loss. There is nothing special about literature’s transmission challenges. Working with incomplete and unreliable data sets is an entirely familiar and common experience for analysts of all kinds.

It’s thus not news for anyone in the field that “there are always masses of data which are simply missing or which cannot be untangled,” though some of us may be surprised to learn that “the most obvious and relevant example is Shakespeare.” Why would he be? Obvious, perhaps — but relevant? How? To whom? And in what sense? What’s clear is that Marche himself finds the Shakespearean data set confusing, so let me clarify: “There are nine different versions of Richard III; there are three versions of Hamlet, each with missing sections or added sections,” writes Marche. Well, no, there aren’t. There are eight quartos of Richard III, though they don’t differ much (if at all) from edition to edition after the third quarto. And then there are four reprints in folio form, but the second through fourth folio aren’t usually considered to have independent authority. So that’s either four different versions or 13. Hamlet exists in three different texts, two in quarto, one in folio; the second quarto was reprinted three times, but there are later quartos from the second half of the seventeenth century, five in total. So Hamlet, counting by the method Marche seems to use, half-heartedly, for Richard III, exists in somewhere between nine and 14 “versions.”

If anything, we have too much literary data in these cases. What we don’t have is enough non-literary data. The problem is not the indeterminacy of the literary work, or its incomplete transmission as such. It’s the absence of metadata: the supplementary information that would elucidate the status and the genesis of all these texts. The challenge, that is to say, is not literary: it’s historical. And the mystery is not the mystery of language, but that of commercial publishing practices, playhouse conventions, censorship decisions, archival and collecting decisions, and so on. Algorithms won’t be the ultimate solution to those challenges and mysteries. But who ever said they would be?

¤
