AV1000
Fundamentals of the digital humanities
Method in text-analysis: An introduction

  1. Methodological background
    1. Kinds of text-analysis
    2. Application to unseen or poorly known texts
  2. Prior knowledge
    1. Genre
    2. Rhetoric and vocabulary
    3. Social or psychological circumstances
    4. Historical circumstances
    5. Nature of the artefact
  3. Steps in the analysis
    1. High-frequency words
    2. Collocations
    3. Concording

I. Methodological background

The following is an attempt briefly to sketch a methodology for elementary text-analysis, with particular emphasis on how to approach a text one does not know well. It is essentially an abstraction of the practice illustrated in the exercises of the following pages under this topic. Here no particular tool for the activity is presumed, nor are particularly sophisticated tools in view. All of what follows can be done with conceptually quite simple ones, such as Monoconc.

A. Kinds of text-analysis

Throughout, “text-analysis” should be taken to mean “the analysis of text with the aid of algorithmic techniques”.

An algorithm may be defined as a step-by-step procedure capable of being run on a computer—i.e., an unambiguous and completely stated description of what the computer is to do. It can be expressed by a computer program but need not be; often the specifics of how an algorithm is implemented in a particular programming language would obscure the essentials. Text-analytic methods cover a spectrum between the completely algorithmic and the exploratory: in exploratory work we do not have a specific goal or procedure to follow but instead we look for leads. Most work mixes approaches from various points in the spectrum: we may make a word frequency list by algorithmic methods, but the results always need to be interpreted and investigated further, usually by much less algorithmic means.
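As an illustration of the fully algorithmic end of the spectrum, the following is a minimal sketch of a word frequency list in Python. The tokenising rule (lower-cased runs of letters, with internal apostrophes allowed) and the sample sentence are assumptions made for the example, not features of any particular tool.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into word-forms.

    Assumption: a word-form is a run of letters, optionally with
    internal apostrophes (so "don't" stays in one piece).
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def frequency_list(text):
    """Return (word-form, count) pairs, most frequent first, ties alphabetical."""
    counts = Counter(tokenize(text))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Invented sample sentence, for illustration only.
sample = "I know that you know what I know."
for word_form, count in frequency_list(sample):
    print(word_form, count)
```

Every step here is unambiguous and mechanical; deciding what the resulting counts mean is the part that remains exploratory.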

Text-analysis may be divided into the following kinds, usually practiced at different places along the algorithmic–exploratory spectrum:

B. Application to unseen or poorly known texts

There are two reasons why one might legitimately use text-analytic techniques on a text one does not know well. First, corpora of use in the humanities are approaching, and some have already passed, the point at which a human being could read through their contents in a lifetime, especially considering when that person might begin his or her reading; furthermore, some of these, such as the non-literary collections meant for historical or linguistic purposes, are not intended for normal reading. Second, and more importantly, text-analysis is fundamentally different from manual methods and so reveals aspects of even well-known texts that one is unlikely to have considered before. To the degree to which these texts are made new by the change in perspective, understanding will be aided by text-analytic techniques.

The first reason, that corpora tend to be too large, can be put in more positive terms: a good command of these techniques will make it practical for the ignorant but intelligent person to profit from materials outside his or her own field. Thus interdisciplinary research tends to be fostered.

II. Prior knowledge

We assume, then, application of the first kind of analysis, concording, with some use of frequency lists, to unseen texts.

Nevertheless the place to begin is with whatever you know about the given body of text (known as the corpus). It is unlikely that you will know absolutely nothing at all about it, but in any case read around in it briefly, picking up what you can. Consider

  1. Genre. What kind of a text do you have? Novelistic, poetic, bureaucratic, legal? Was it originally written, or was it delivered orally? What are the formal features you would expect such a text to have, and which of them can you spot when you look at it? In the Stephen material, for example, we have the spontaneous secular sermons of a hippie “guru”, stream-of-consciousness, orally delivered. Stephen is talking to an audience in a highly personal style and mode.
  2. Rhetoric and vocabulary. Genre will tend to define a particular way of speaking or writing and to shape the vocabulary, including how frequently particular words appear. In the case of Stephen, personal pronouns are quite frequent—he is talking directly to the people in his audience (hence “you”) and centrally about a way of life centred on awareness (hence “know”).
  3. Social or psychological circumstances. Familiarity with the social circumstances surrounding the creation of the text may be relevant; so also the known or suspected psychology of the author or speaker. Note that Stephen's sermons, however nonsensical they may seem to subsequent generations, were hugely popular, uncomfortable to attend (people sat on the floor), raptly attended and meticulously transcribed. Evidently they had meaning to those who listened. Therefore our search for patterns of meaning in the Stephen corpus is not in the least mistaken.
  4. Historical circumstances. The more you know about the historical circumstances under which the text was produced the better. In the Stephen corpus, for example, it is crucial to understand how widely the now rather odd sounding language of Stephen's hippie subculture was accepted and spoken. As just noted, his talks apparently communicated a great deal to his audience. Hence we may conclude that the usages are richly dialectical. Awareness of his historically (and to a certain extent, regionally) defined vocabulary will give you hints as to where you might begin in a search for interesting terms.
  5. Nature of the artefact. The physical object from which the text has been taken, usually a printed book, may be relevant. Stephen's book, Monday Night Class, gives several indications of its time and subculture of origin; likewise, the photographs included in it and on the back cover reinforce the historical fact of the seriousness with which his words were taken. These, again, give reason to press forward with the analysis, and the clearly religious character of the assemblies to which he spoke direct you to the corresponding language.

In other words, the seemingly disembodied electronic text has several contexts essential to a full understanding of it. The more of that understanding you can have the better, though because the focus here is on technique, the point is not to dwell on acquiring knowledge of the contexts, only to get what you can quickly.

III. Steps in the analysis

The methodology outlined here is like a fishing expedition: you go at the text with a quiet, open mind, having little or no idea what you are going to catch. If you are after something in particular, then of course it is a different kind of activity. Even in a focused enquiry, however, software allows you to ask certain kinds of questions so easily and get answers back so quickly that curiosity is given a much freer rein; you can afford to play, ask even apparently improbable questions, and so raise the chances that you will be surprised by an important result you had little reason to expect. Thus a certain amount of fishing is recommended even for the focused questioner.

  1. High-frequency words. A quite crude but useful technique is to look through a list of the most frequent word-forms for anything that is unusual or particularly characteristic of the text in question. Frequency of word-forms is only roughly related to what a text says, but it is related, and so is useful to work with.

    Two examples spring to mind from both the Simpson and Stephen corpora: the verb “know” and the first-person singular pronoun “I”. (Note, in the comparison study outlined in Corpus analysis of meaning, how so little information says so much about both, how it draws a contrastive parallel between the two men.)

    There are of course severe limitations on what you can do with a frequency list, especially if you are interested in words (dictionary headwords, such as “know” or “I”) rather than word-forms (such as “knows” or “knew”, or “me” or “we”), and much more if you are focused on ideas (such as cognition or the self) rather than words. If your interest is in words, you need to find all the inflected forms of the word and combine their frequencies. If it is in ideas, you need to find all the relevant synonyms and combine the frequencies of all their inflected forms; even then, since ideas are only tangentially related to words, the result would be incomplete. Very often, however, the raw frequency list will prove useful enough.
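Combining the frequencies of inflected forms under a headword can be sketched as follows. The counts are invented for illustration and are not taken from the Stephen or Simpson corpora; the mapping from headword to forms is likewise a hand-made assumption that you would have to compile yourself.

```python
from collections import Counter

# Hypothetical raw counts of word-forms, invented for illustration only.
form_counts = Counter({
    "know": 40, "knows": 5, "knew": 7, "known": 3,
    "i": 90, "me": 12, "my": 20,
})

# Hand-made mapping from headword to its inflected forms (an assumption).
headwords = {
    "know": ["know", "knows", "knew", "known", "knowing"],
    "I": ["i", "me", "my", "mine"],
}

def headword_frequencies(form_counts, headwords):
    """Sum the counts of every listed form under its headword.

    A Counter returns 0 for forms that never occur, so forms absent
    from the text (here "knowing" and "mine") simply add nothing.
    """
    return {
        head: sum(form_counts[form] for form in forms)
        for head, forms in headwords.items()
    }

for head, total in headword_frequencies(form_counts, headwords).items():
    print(head, total)
```

The same function would serve for an idea rather than a word if each key were given the pooled forms of all its synonyms, with the caveat already noted that the result is incomplete.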

  2. Collocations. A somewhat more sophisticated tool for relating word-forms to meaning generates information on what words tend to be found together, either contiguously, such as “I didn't know that”, or within a specified proximity or span, e.g. “black” within 5 words of “bag”. The idea here is that repeated collocations are more reliable indicators of meaning than repetitions of single word-forms. See Sinclair 1991 (chapter 8) for a full discussion.
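Counting collocates within a span can be sketched as below. This is a minimal illustration rather than the method of any particular program: the function name `span_collocates`, the tokenising rule and the sample text are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-cased word-forms; internal apostrophes are kept (an assumption)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def span_collocates(tokens, target, span=5):
    """Count the word-forms found within `span` tokens on either side
    of each occurrence of `target`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == target:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts

# Invented sample text, for illustration only.
text = "The black bag was heavy. He left the bag near a black door."
collocates = span_collocates(tokenize(text), "bag", span=5)
for word_form, count in collocates.most_common(3):
    print(word_form, count)
```

Note that a word-form lying within the span of two different occurrences of the target is counted once for each, and that this sketch ignores sentence boundaries.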

    The program Monoconc and others will generate lists of collocations ordered by frequency so that you can identify recurring phrases and associations of words quickly. Note that if you wish to study collocations over a wider span than the program permits, you can do this by following these steps:

    1. Set the concordance “window” (the number of characters shown on either side of the target word) to a sufficiently large number;
    2. Run a concordance of the word for which you wish to study the collocates;
    3. Save the concordance as a text-file;
    4. Use that file as input to the program and generate from it a frequency listing.

    This listing will thus give you the frequencies of the collocates of your target word.
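The four steps above can be imitated in a few lines of Python. This is a sketch under stated assumptions, not a description of how Monoconc works internally: the window is measured in characters, the file name `concordance.txt` is arbitrary, and the sample text is invented.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path

def concordance_lines(text, target, window=20):
    """Steps 1 and 2: for each occurrence of `target`, return the text
    within `window` characters on either side of it."""
    lines = []
    pattern = r"\b" + re.escape(target) + r"\b"
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        lines.append(text[start:end].replace("\n", " "))
    return lines

def frequency_counts(text):
    """Step 4: count lower-cased word-forms in the given text."""
    return Counter(re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower()))

# Invented sample text, for illustration only.
text = "I don't know why. You know, I don't even know what to say."
lines = concordance_lines(text, "know", window=20)

with tempfile.TemporaryDirectory() as folder:
    # Step 3: save the concordance as a text file, then read it back as input.
    path = Path(folder) / "concordance.txt"
    path.write_text("\n".join(lines), encoding="utf-8")
    collocate_counts = frequency_counts(path.read_text(encoding="utf-8"))

for word_form, count in collocate_counts.most_common(5):
    print(word_form, count)
```

Two side effects of this workaround are worth bearing in mind when reading the listing: a character window can cut words in half at its edges, producing fragments, and where the windows of neighbouring occurrences overlap the same words are counted more than once. The target word itself also appears in the counts.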

    A government document, for example, will tend to have quite high frequencies of standard phrases; for a literary work, even two occurrences of a phrase may be highly significant. The Monoconc-style listing, of collocations within a span, is of course less bound to literal repetition—it will include together, for example, instances of the collocation of “don't” and “know” in the phrases “I don't know” and “I don't even know”.

    Classicists will be interested in the collocation tools implemented by the Perseus Project; see in particular the Greek and Latin context search tools.

  3. Concording. The essential idea behind the concordance, especially the KWIC (keyword-in-context) format, is to direct your attention to the immediate linguistic environment of the specified word. Hence when you find a potentially interesting word, often the next step is to run a concordance on it, then look down the concordance listing to see what patterns you can spot. With Monoconc, generating collocation statistics will often immediately follow.

    A KWIC is made considerably more useful by the ability to sort an on-screen listing according to the words to the left and right of the target word; Monoconc offers this ability, and the same can be done with other concordance software. Such sorting tends to bring out the patterns, since repetitions are grouped together.
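A KWIC listing with left and right sorting can be sketched as follows. The context here is measured in word-forms rather than characters, and the function names and sample text are assumptions made for the example.

```python
import re

def kwic(text, target, width=3):
    """Return (left, keyword, right) triples, with up to `width`
    word-forms of context on either side of each occurrence."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    rows = []
    for i, token in enumerate(tokens):
        if token == target:
            rows.append((tokens[max(0, i - width):i], token, tokens[i + 1:i + 1 + width]))
    return rows

def sort_by_right(rows):
    """Order rows by the words following the keyword, nearest first."""
    return sorted(rows, key=lambda row: row[2])

def sort_by_left(rows):
    """Order rows by the words preceding the keyword, nearest first."""
    return sorted(rows, key=lambda row: row[0][::-1])

def show(rows):
    """Print the rows with the keyword aligned in a column."""
    for left, keyword, right in rows:
        print(f"{' '.join(left):>25}  [{keyword}]  {' '.join(right)}")

# Invented sample text, for illustration only.
text = "I don't know why. You know what I mean. I don't know what to say. We know why."
rows = kwic(text, "know")
show(sort_by_right(rows))
print()
show(sort_by_left(rows))
```

Sorting on the left context reverses each list of preceding words so that the word nearest the keyword is compared first; this is what groups the two instances of “I don't know” together.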

    Since current KWIC software deals only with word-forms rather than words, you will often also need to concord the inflected forms. In English many of these can be caught by use of the appropriate wildcards, but not all. Irregular forms have to be listed individually: examples are go, went and gone, and I, me, my, mine, we, us, our(s), all forms of the first-person personal pronoun.
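The reach and the limits of wildcards can be seen in a small sketch. It uses shell-style patterns from Python's standard `fnmatch` module, in which `*` stands for any run of characters; the wildcard syntax of a given concordance program may differ, and the list of word-forms is invented for illustration.

```python
from fnmatch import fnmatchcase

# Invented list of word-forms, for illustration only.
word_forms = ["go", "goes", "going", "gone", "went", "gold",
              "know", "knows", "knew"]

def match_wildcard(pattern, forms):
    """Return the word-forms matching a shell-style wildcard pattern."""
    return [form for form in forms if fnmatchcase(form, pattern)]

wildcard_hits = match_wildcard("go*", word_forms)
print(wildcard_hits)

# The wildcard both over-matches ("gold") and under-matches ("went"),
# so an explicit, hand-made list of forms is the safer basis.
forms_of_go = {"go", "goes", "going", "gone", "went"}
explicit_hits = [form for form in word_forms if form in forms_of_go]
print(explicit_hits)
```

In practice a wildcard is a quick first pass, to be checked against the frequency list and supplemented by hand with the irregular forms.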

    Synonyms, of course, are entirely your task to identify, but doing so is made considerably easier than it might be by the tendency in many writers and speakers to emphasise an idea by using a number of synonyms together or near each other. Thus the text can itself help you to build a reasonable list for further concording. Compiling such a list is a recursive activity: in the beginning, a new synonym will tend to turn up others; when the law of diminishing returns asserts itself, it is time to stop. The result we may call a “fixed vocabulary”, to which can be added the contiguous collocations you have identified. All together these represent a translation of an idea, as it were, into data.

    A fixed vocabulary can then be used to turn up passages in the text for study—as is commonly done in “content analysis”. If you know the text well, then a very interesting further question to ask is, when does this vocabulary not identify passages in which the targeted idea clearly or arguably occurs? Why does it not? Some very interesting findings can result from pursuit of this question.
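Using a fixed vocabulary to retrieve passages, and to list the passages it fails to retrieve, can be sketched as follows. The division into passages at blank lines, the vocabulary, the phrase and the sample text are all assumptions made for the example; phrases are matched as simple substrings, which is cruder than a real tool would be.

```python
import re

def split_passages(text):
    """Treat each blank-line-separated paragraph as one passage (an assumption)."""
    return [part.strip() for part in re.split(r"\n\s*\n", text) if part.strip()]

def find_passages(passages, vocabulary, phrases=()):
    """Return (passage number, matched items) for every passage containing
    at least one word-form or contiguous phrase from the fixed vocabulary."""
    hits = []
    for number, passage in enumerate(passages, start=1):
        lowered = passage.lower()
        tokens = set(re.findall(r"[a-z]+(?:'[a-z]+)*", lowered))
        matched = sorted(tokens & set(vocabulary))
        matched += [phrase for phrase in phrases if phrase in lowered]
        if matched:
            hits.append((number, matched))
    return hits

# Invented sample text and vocabulary, for illustration only.
text = (
    "You know what you are aware of.\n\n"
    "The weather was cold that night.\n\n"
    "I don't know, but I understand a little."
)
passages = split_passages(text)
hits = find_passages(passages, {"know", "aware", "understand"}, phrases=("i don't know",))
hit_numbers = {number for number, _ in hits}
missed = [n for n in range(1, len(passages) + 1) if n not in hit_numbers]

print(hits)
print("not retrieved:", missed)
```

The list of passages not retrieved is the starting point for the further question: reading through them, do any express the targeted idea without using the vocabulary?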

revised November 2007