KCL • CCH • Minor programme • AV1000 • Text-analysis
In the preceding discussion on "Method in text-analysis", markup was identified as a kind of algorithmically governed activity done essentially without the computer but with it in mind. As you already know from experience with HTML, markup in our sense is done with a computational "metalanguage", i.e. an artificial language used to mark features of a natural language text. These features are those that fully automatic methods cannot reliably identify—indeed, almost anything beyond the word-level, including many features we have no difficulty whatever in identifying, such as chapter divisions and proper names.
HTML is a metalanguage used mostly to indicate how a browser should format a text, e.g. that a long string of characters surrounded by <p> … </p> should be represented as a paragraph, or that another string of characters surrounded by <h1> … </h1> should be treated as a heading of a certain size. There are, however, reasons other than formatting to mark up a text, and so other metalanguages intended for the purpose. We will look briefly at one of these here.
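The formatting tags just mentioned can be seen in a minimal HTML fragment; the heading and paragraph text here are invented purely for illustration:

```html
<!-- A minimal HTML fragment: the tags tell the browser how to
     format the text, not what the text means. -->
<h1>Chapter 1</h1>
<p>It was a dark and stormy night.</p>
```

Note that nothing in the markup says that "Chapter 1" is a chapter title; it says only that the string is to be displayed as a large heading.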
Markup compensates for the limitations of "natural language processing" by providing a way for a knowledgeable person to make implicit entities explicit. At the same time, it imposes a rigorous discipline: the encoder, as we saw earlier, must attempt to think algorithmically if the results are to be useful. In consequence he or she must mark all instances of the same kind of thing in exactly the same way: instances that are not marked identically cannot be processed alike. In other words, encoding obeys the two fundamental constraints of computing: complete explicitness and absolute consistency.
Inevitably these constraints, especially when applied to features other than formatting, come into severe conflict with the demands of imaginative language. (Quite ordinary language, as used in daily life, can be highly imaginative, so the problem is not confined to literature, although poetry does put the greatest strain on markup.) From this conflict—from the failure of any markup scheme to capture significant features of imaginative language—arises a powerful tool for interpretation. It poses the question, how do we know something is there if we cannot say exactly what it is or exactly where? How do we know what we know about the text?
This is a typical kind of question in the humanities: it has no final answer, rather it is a means by which we gain knowledge of the text. In other words, encoding is essentially a way of doing scholarship.
In a sense markup is not new. From very early on in the development of written language, graphical devices and words have been used to tell us about the text we are reading. Word-separators are an example. In classical Roman inscriptions, marks were sometimes used for the purpose when confusion might otherwise result, but spaces between words did not become conventional until later, in the Middle Ages, when they were introduced in manuscripts, perhaps to assist the then-new practice of silent reading. Paragraphing, which began with interlinear or marginal graphics in manuscripts, indicates that a block of text is to be considered a significant unit. Punctuation marks sometimes indicate only a pause, sometimes confer a particular status or meaning on the punctuated words. Chapter titles may be indicated as such in many ways, e.g. by blank space, the word "chapter", or graphics of various kinds.
All such devices are instances of metatext, i.e. text, textual symbols or other graphical devices used to say something about the text we read. Furthermore metatext is just one kind of "paratext", as Gérard Genette has called those devices and conventions which form part of the complex mediation between the text, the author, the publisher and reader. Before we even begin reading a book, its paratext tells us many things about it and so shapes our subsequent reading—if, partly on the basis of that paratext, we decide to read it.
Although the basic notion of metatext is not new, its implementation in markup creates new conditions for work by imposing those two computational constraints: total explicitness and absolute consistency. If the paratext is to be computationally tractable, if we want it to figure into our analysis, it must be rendered as markup, explicitly and consistently. As in all other cases of markup, this necessarily means some degree of interpretation—from almost none (e.g. that a block of text preceded and followed by blank lines is a paragraph) to a significant amount (e.g. that the design on the cover of a poetry magazine figures into how we read the poetry inside). In other words, again, encoding provides a means for the scholar to express his or her interpretation of the text.
In text-analysis one is for example often concerned not just with the immediate linguistic environment of the target word but also with the structure of the text in which that word occurs, especially if the analysis is literary or historical. If the corpus you are analysing is a novel, for example, you will likely want to know the chapter number of each occurrence; if it is a play, the act, scene and line numbers, perhaps also the speaker of the lines. Furthermore, you may want to specify in your query which part of the text to search, e.g. the word "blood" only when spoken by Macbeth, or the word "exit" only when it is not part of a stage direction. Since in general it is impossible automatically to extract such information from an unprepared text, text-analysis will often require that the text be prepared by manual insertion of metalinguistic tags that unambiguously denote this structural information.
Textual structure, as suggested by these examples, may involve simply a translation of the conventions of a printed original, but it may also be significantly interpretative. There may be several competing structures one wishes to take account of. The boundary between one part of a text and another may be ambiguous.
The simplest kind of tag specifies a textual location where its contents apply. Following is an example in "COCOA" markup, an old but still used scheme:
  <act 3>            the variable "act" is set to the value "3"
  <speaker Hamlet>   the variable "speaker" is set to the value "Hamlet"
  <source Guardian>  the variable "source" is set to the value "Guardian"
Thus all words following the tag <speaker Hamlet> are marked as belonging to that character—until another such tag with the same variable name is encountered, e.g. <speaker Polonius>, after which all words are marked as belonging to him, and so forth.
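The scoping rule just described is simple enough to sketch in a few lines of code: read through the text, update a table of current variable values whenever a tag is met, and attribute every word to the values currently in force. The following Python sketch is illustrative only — the tag names, the sample text, and the query are invented — but it also shows the kind of query mentioned above, e.g. finding words only when spoken by a particular character:

```python
import re

def cocoa_words(text):
    """Yield (word, settings) pairs from a COCOA-marked text.

    A tag such as <speaker Hamlet> sets the variable "speaker" to
    "Hamlet" for every following word, until another <speaker ...>
    tag replaces that value.
    """
    settings = {}                      # variables currently in force
    for token in re.findall(r'<[^>]*>|\w+', text):
        if token.startswith('<'):      # a COCOA tag: update one variable
            name, _, value = token[1:-1].partition(' ')
            settings[name] = value
        else:                          # an ordinary word of the text
            yield token, dict(settings)

# A tiny invented sample in COCOA style:
sample = """<act 3>
<speaker Hamlet> To be or not to be
<speaker Polonius> Though this be madness"""

# Query: words spoken by Hamlet only.
hamlet_words = [w for w, s in cocoa_words(sample)
                if s.get('speaker') == 'Hamlet']
print(hamlet_words)   # prints ['To', 'be', 'or', 'not', 'to', 'be']
```

The essential point is that the tag itself carries no scope: the scope is supplied by the convention that a variable keeps its value until reassigned, which the program must implement.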
HTML is actually a derivative of the much more sophisticated Standard Generalized Markup Language (SGML), which has recently evolved into the Extensible Markup Language (XML); both metalanguages are introduced in the second year of the Humanities with Applied Computing Programme. (SGML and XML have rapidly become standard in the commercial world, with consequent high demand for practitioners.) The metalanguages derived from SGML share features more sophisticated than those of simple COCOA, especially element attributes, enclosure of the affected text between opening and closing tags, and the so-called "Document Type Definition" (DTD), which defines the allowable features of a particular kind of document.
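Those features can be seen in a small XML fragment in the style of the Text Encoding Initiative (TEI); the particular speech and line number here are chosen for illustration only. Unlike a COCOA tag, which merely sets a variable at a point in the text, the XML elements enclose exactly the stretch of text they describe, and attributes such as who qualify them:

```xml
<!-- Illustrative fragment in TEI style: the <sp> element encloses
     the whole speech; the who attribute names the speaker; <l>
     encloses a single verse line, numbered by its n attribute. -->
<sp who="Hamlet">
  <l n="56">To be, or not to be, that is the question:</l>
</sp>
```

Because the speech is explicitly delimited at both ends, there is no ambiguity about where the speaker's words stop, and a DTD can require, for example, that every <sp> element carry a who attribute.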
Here, however, we are concerned with the simplest common features of all markup systems. All tags of whatever kind are unambiguously denoted; the commonest way of doing this is to enclose them in brackets or other characters that are not used in the text for any other purpose. Whatever the tag expresses, it does so completely explicitly; indeed, that is its function: to translate implicit, computationally invisible or ambiguous meaning into explicit declarative statements.
When texts are marked up for the purposes of literary study (or any other kind that focuses on how the text says what it says), the metatext tends to be highly interpretative. Two consequences follow from such work, provided of course that it has been well done.
revised November 2007