

AV1000
Fundamentals of the digital humanities
Corpus analysis of meaning

  1. Introduction
  2. The data
  3. Examining the Simpson corpus
    1. Refining the question
    2. Basic terminology
    3. Identifying and pursuing a research question
    4. Denials of knowledge
    5. Assertions of knowledge
    6. Things known
  4. Comparative study against the Stephen corpus

I. Introduction

The following exercise involves analysis of language in a transcribed text of colloquial spoken American English. The fundamental question of the exercise is, what are the basic preoccupations of this text and how are they manifested in detail? Reading it, even in a cursory way, will supply an answer to the first part of the question, but we are interested in the microscopic level of discourse, at which the details of how language is used shape and perhaps correct our impressions.

The text comprises the court depositions given by O. J. Simpson over a nine-day period in 1996 in the civil lawsuit brought against him by the families of Nicole Brown Simpson and Ronald Goldman, whom he was accused of murdering. The text of this corpus was originally published on the Internet by Jack Walraven as part of The Simpson Trial Transcripts, now shut down (at http://simpson.walraven.org/ when it was live). For additional material on Simpson, see the O. J. Simpson Main Page (CNN).

The corpus has the characteristics of a legal proceeding: rapid exchange of questions and answers; studied formality; pointed concern with facts and physical evidence, leading to repeated focusing on certain objects and events; and so forth. You should look for these in your analysis with the concordancer, keeping in mind the kind of vocabulary that such characteristics would tend to involve. Any knowledge you have of the Simpson affair may prove useful but is not necessary.

II. The data

The data are provided in two forms:

  1. Two plain-text files: ojsimpson1.txt (1MB) and ojsimpson2.txt (1.1MB)
  2. One zipped file: oj.zip (.62MB).

If you download the latter, you will need WinZip or an equivalent to unpack it. Once you have the text files you may proceed.

III. Examining the Simpson corpus

A. Refining the question

[Figure: highest-frequency words in the Simpson corpus]

The question with which we begin—“what are the basic preoccupations of this text and how are they manifested in detail?”—is too broad for a computational analysis. We need to identify a relevant empirical feature of the text that software can help us to find, something that is correlated to a particular word-form or definable set of these. One standard way of proceeding is to look at the simplest kind of textual statistics, i.e. word frequencies, and ask, “What is unusual here?”

Look back to the work you did with the frequency list in The basics of concording; the list is reproduced here on the right. It in fact shows the most frequently occurring words in the Simpson text. Recall the “tentative observations”, also reproduced here; note how well they fit the circumstances of O. J. Simpson's trial:

The point, then, is not that the frequency listing would surprise even someone vaguely familiar with the man, his trial and the crime, but rather that it confirms and particularizes that familiarity. Conclusions drawn from the unseen list are reliable. But can we go further?

B. Basic terminology

For our purposes a few terms from linguistics are useful to have in our active vocabulary. These define basic lexical categories:

C. Identifying and pursuing a research question

Load both files of the Simpson corpus into Monoconc. Generate a frequency list, in frequency order. You should see more or less what is shown above.
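
For readers who want to check the frequency list outside the concordancer, the step can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the tokenizer, the function names and the invented sample text are illustrative only and are not part of Monoconc or of the course materials. A concordancer may split words differently, so counts can differ slightly.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word-forms.

    Assumption: a word is a run of letters, optionally with internal
    straight apostrophes, so that "don't" stays a single token.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def frequency_list(text, top=20):
    """Return the `top` most frequent word-forms as (word, count) pairs."""
    return Counter(tokenize(text)).most_common(top)

# Invented sample standing in for the corpus; to use the real data, read
# the contents of ojsimpson1.txt and ojsimpson2.txt into one string instead.
sample = "Q: Do you know? A: I don't know. I don't recall. I know I did."

for word, count in frequency_list(sample, top=5):
    print(f"{count:5d}  {word}")
```

Scanning the printed list from the top for the first open-class word then mirrors the manual step that follows.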

Scan this list for the most frequent open-class word. What is it?

Correct—it is know. Confirm to yourself why its prominence makes perfect sense within the context of a legal proceeding: the trial centrally concerns what O. J. Simpson knows and doesn't know. Refer to the discussion of prior knowledge in Method in text-analysis. Consider especially the following:

In full confidence, then, we pursue the question of how O. J. Simpson expresses knowledge in the given text.

D. Denials of knowledge

  1. Generate a concordance for “know”. Scan quickly down through the list of matches, sorting to the immediate left and right as needed. What recurrent features in the immediate environment of the target word do you notice? To confirm your impressions, select Frequency, then Collocate frequency data. You should see the following:

    [Figure: collocates of 'know']

    Each horizontal line in the layout of this display follows exactly the layout of a KWIC concordance line, except that the target word, know, is omitted. Thus “I” occurs 1490 times as the second word to the left of know, “don't” 1435 times as the first word to the left, “Q” 518 times as the first word to the right (this is an abbreviation for “Question”), and “was” 179 times as the second word to the right.

    Notice that the most frequently occurring word immediately to the left of know is “don't”. Note further that if we also take account of related words in the same column—the past tense form “didn't”, the negative adverb “not”, the variant form “wouldn't” (as in “I wouldn't know”) and “doesn't”—we have well over 1600 apparent assertions or statements of not knowing, without going any further down the lists. (We might also wish to include conditionals, such as “would” and “whether”, perhaps also “even”.)

  2. What related denials might we find, i.e. what synonyms for not knowing? One way of finding out is to produce a concordance for don't. Do this, then sort the display in order of the first word to the right—i.e. the action that the speaker is negating. Look through the instances for all words related to knowing or being aware and write them down, for example “believe”, “recall” (there are others).

    For each of the words of knowing in your list, browse the entry in the online Oxford English Dictionary or use Wordnet. Develop a sense of how the “semantic field” of each of these words overlaps yet differs from the rest. Ask yourself, for example, what is the difference between saying:

  3. With the differences firmly in mind, now generate a concordance for all of the variants of “know” and concordances for each of the words you consider to be synonyms. To do this you need to recall how to use wildcards. You also need to know that several concordances may be kept on screen simultaneously so that you can go from one to the next and compare them.
  4. Now look through concordances for each of the entries in your grouped vocabulary; generate a table of collocates for each. Has the picture changed at all? Do assertions or statements of not knowing still predominate?
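
The collocate table used in the steps above can be approximated as follows. This is a hedged sketch, not a description of how Monoconc computes its display: the function names, the list of negators and the sample text are assumptions made for illustration. Each offset (2 left, 1 left, 1 right, 2 right) gets its own frequency count, and tokens are counted across speaker turns, which is why a label such as “Q” can turn up as a right-hand collocate.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase word-forms, keeping internal apostrophes (don't, didn't)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def collocates(tokens, target, span=2):
    """Count the words at each offset around every occurrence of `target`.

    Returns a dict mapping offsets (-span..-1 and 1..span) to Counters.
    """
    table = {off: Counter() for off in range(-span, span + 1) if off != 0}
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        for off in table:
            j = i + off
            if 0 <= j < len(tokens):
                table[off][tokens[j]] += 1
    return table

# Invented sample; replace with the text of the corpus files.
sample = "A: I don't know. Q: Did you know? A: I don't know. I didn't know that."
table = collocates(tokenize(sample), "know")

# Hypothetical list of negators, following the forms named in the text.
negators = {"don't", "didn't", "not", "wouldn't", "doesn't"}
denials = sum(n for word, n in table[-1].items() if word in negators)

print("1st left:", table[-1].most_common(3))
print("apparent denials:", denials)
```

Running the same function with a synonym such as "recall" or "believe" as the target gives the comparable table for each entry in your grouped vocabulary.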

E. Assertions of knowledge

Another way of approaching the same question is to ask, what does the witness assert that he knows? Is he actually trying to establish what he knows and so attempting to be helpful, or do his assertions have another purpose?

Go back to the concordances, starting with know. Sort by the word to the left, then scroll down to find occurrences of “I know”. Look through the results, covering at least three dozen examples. For each, click on the concordance line so that the context will appear in its window. Read around the occurrence, noting whether the assertion of knowledge is qualified, and if so, how—discarding those that are simple negations, such as “not that I know of”. Make a list of the ways in which the knowledge is said to be partial, uncertain or is otherwise undercut. Consider the following examples:
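
The sorting step described here can be imitated with a small keyword-in-context routine. The sketch below is illustrative only: the names `kwic` and `sort_left`, the context width and the sample exchange are assumptions, and the output is a rough stand-in for the concordancer's display, not a replacement for reading the surrounding text.

```python
import re

def tokenize(text):
    """Split into word-forms, preserving case and internal apostrophes."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)

def kwic(tokens, target, width=4):
    """Return (left, keyword, right) tuples for each match of `target`.

    Matching is case-insensitive; `left` and `right` are lists holding
    up to `width` tokens on each side of the keyword.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines

def sort_left(lines):
    """Sort concordance lines by the first word to the left of the keyword."""
    return sorted(lines, key=lambda line: line[0][-1].lower() if line[0] else "")

# Invented sample exchange.
sample = "A: I don't recall. I do know he called. Q: You know that? A: As far as I know."
lines = sort_left(kwic(tokenize(sample), "know"))

for left, keyword, right in lines:
    print(f"{' '.join(left):>25}  {keyword}  {' '.join(right)}")

# Keep only the lines in which "I" immediately precedes the keyword.
i_know = [line for line in lines if line[0] and line[0][-1].lower() == "i"]
```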

In the above, note the different positions in which the qualification occurs: sometimes before the assertion of knowledge (“A: I don't recall. I do know...”), sometimes after (“A: I know I did, but I don't recall who....”), sometimes before and after (“.... A: I don't believe--of the interior of his room I don't believe so. I know--I don't believe so. Not to my direction, I don't believe so....”). In some expressions the qualification is simultaneous with the assertion of knowledge (“as far as I know”). What are the various effects of first asserting knowledge, then denying or qualifying it? Of denying, then asserting? Of vacillating?

Note also the kinds of qualification: sometimes direct denial (“I don't know.”) but often hedged by an appeal to loss of memory (“A: I don't recall.”) or by referral to a mental process weaker than knowledge (“A: I don't believe--of the interior of his room I don't believe so. I know--I don't believe so.”). What are the different effects of these qualifications in the context of your examples? What other kinds are there, and what are their effects?

Emphatic assertions may be other than they seem. Look in the concordance for the emphatic do know (in which the auxiliary verb “do” provides the emphasis). How is the emphasis used to qualify what the witness is saying? Generate other concordances on words such as honest, honesty, honestly, truth, truthful, truthfully; when these are used by the witness, how are they used?

F. Things known

Another approach to the same basic question is to examine the contexts in which the objects most relevant to the investigation are under discussion, then to ask, “What does the witness know about these objects, what does he reveal in discussing them, how does he use language when discussing them?”

Run a concordance for one or more of the words knife, blood, flashlight, bag, jacuzzi, gate, key and their variants (thus knives, flashlights, flashlight's, bags, bag's, baggage and so forth). Examine the context for at least three dozen examples of your chosen word(s). How is the witness's knowledge of these objects qualified, or how is knowledge of them used to qualify his knowledge of other things? How does usage of the words characterise the exchanges between questioners and witness? How do these sorts of exchanges characterise the kind of text you are analysing? To answer these questions you will need to read around in the text on either side of each instance you select.
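
Collecting the variants of a word is a wildcard search. As a hedged illustration, the sketch below uses the shell-style patterns of Python's standard `fnmatch` module, in which `*` matches any run of characters and `[fv]` matches one character from the set. These patterns are an assumption chosen for the example and may not match the wildcard syntax of your concordancer; the function name and sample text are likewise invented.

```python
import re
from collections import Counter
from fnmatch import fnmatchcase

def tokenize(text):
    """Lowercase word-forms, keeping internal apostrophes (bag's)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def match_variants(tokens, patterns):
    """Count every word-form that matches at least one wildcard pattern."""
    return Counter(
        tok for tok in tokens
        if any(fnmatchcase(tok, pattern) for pattern in patterns)
    )

# Invented sample; replace with the text of the corpus files.
sample = ("Q: Did you own a knife? A: I owned several knives. "
          "The bag was with my baggage, not the bags.")
hits = match_variants(tokenize(sample), ["kni[fv]e*", "bag*"])

for word, count in sorted(hits.items()):
    print(f"{count:5d}  {word}")
```

Wildcards tend to over-match (a pattern such as `bag*` would also catch unrelated words beginning with those letters), so the list of matched forms should be checked by eye before the concordance lines are read.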

IV. Comparative study against the Stephen corpus

A potentially quite effective way of understanding a text by the methods outlined above is to repeat the analysis on another text and compare the results. For this purpose we will return to the Stephen corpus. If you don't have a copy to hand, obtain one here.

Build a concordance again, this time on the Stephen corpus. Sort the wordlist by frequency and look at the most frequent words. Note that the word know also occurs relatively high among the content words in this corpus. Since the corpus is also in spoken American English, although of a very different kind with a very different focus, it should make for an effective comparison.

Generate a concordance for know in the Stephen corpus. Scan the results, sorting on the first word to the left and right, and produce a listing of collocates. This listing will confirm that the pronoun you collocates very frequently with know, and thus that the expression you know predominates in uses of know (about 70% of those uses).
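
A proportion of this kind is simply the count of the pair divided by the count of the target word. The following sketch shows the arithmetic on an invented sample; the function name `left_share` and the sample sentences are assumptions for illustration, and the figure it prints says nothing about the Stephen corpus itself.

```python
import re

def tokenize(text):
    """Lowercase word-forms, keeping internal apostrophes."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def left_share(tokens, target, left_word):
    """Fraction of occurrences of `target` immediately preceded by `left_word`."""
    total = sum(1 for tok in tokens if tok == target)
    paired = sum(
        1 for first, second in zip(tokens, tokens[1:])
        if first == left_word and second == target
    )
    return paired / total if total else 0.0

# Invented sample, not taken from either corpus.
sample = "You know, it was, like, wild. I know that. You know what I mean? We did, you know."
share = left_share(tokenize(sample), "know", "you")

print(f"'you know' accounts for {share:.0%} of the uses of 'know'")
```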

Look at the phrase you know and at the contexts. What is the predominant use? Roughly speaking, you should find the following two kinds, shown here by examples:

In the first kind you know functions as a “discourse marker”; it seems to have very little to do with what the audience knows or might be presumed to know. In the second kind, something in particular is said to be known, “you know that...”. How is the first kind marked? How would you try to produce a list only (or mostly) of the first kind? of the second?

Now return to the Simpson corpus. Generate a concordance for you know. How has the balance between the kinds changed? On the basis of this comparison would you say that you know is a discourse marker particularly characteristic of one of these corpora, and if so, which?
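
Because the two corpora differ in size, raw counts of you know are not directly comparable; a rate per fixed number of words is one common way to put them on the same footing. The sketch below is a minimal illustration under stated assumptions: the function name `phrase_rate`, the per-1,000-word scale and both sample texts are invented, and a rate alone does not separate discourse-marker uses from the other kind, which still requires reading the concordance lines.

```python
import re

def tokenize(text):
    """Lowercase word-forms, keeping internal apostrophes."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def phrase_rate(tokens, phrase, per=1000):
    """Occurrences of `phrase` (a list of tokens) per `per` tokens of text."""
    n = len(phrase)
    hits = sum(
        1 for i in range(len(tokens) - n + 1)
        if tokens[i:i + n] == phrase
    )
    return per * hits / len(tokens) if tokens else 0.0

# Invented samples standing in for the two corpora.
stephen_sample = "You know, it was, you know, really something."
simpson_sample = "Q: Do you know the time? A: I don't know."

stephen_rate = phrase_rate(tokenize(stephen_sample), ["you", "know"])
simpson_rate = phrase_rate(tokenize(simpson_sample), ["you", "know"])

print(f"first sample:  {stephen_rate:.1f} per 1,000 words")
print(f"second sample: {simpson_rate:.1f} per 1,000 words")
```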

Your knowledge of the period during which Stephen flourished may suggest to you that like was used as another such discourse marker and that it is not so much used that way now. Compare the two corpora for like. When it occurs as a discourse marker, how is this function indicated in the language? When it occurs as a term of comparison? What are the differences in usages of like in the two corpora?

Finally, use the words believe and remember in a comparison of the corpora. What do they show about the kinds of mental processes of concern, first to Stephen, then to O. J. Simpson?

revised November 2007