AV1000
Fundamentals of the digital humanities
The relational model

  1. Kinds of tabular data
  2. Overview of relational database management
    1. Definition
    2. Current relational packages
  3. Typical research problems requiring database management
    1. Publishing history
    2. Archaeology
    3. Literary criticism

I. Kinds of tabular data

Given the distinction between discursive and tabular data, we need to differentiate two kinds of tabular data: two-dimensional and multidimensional. The former is managed by so-called flat-file software, e.g. bibliographic programs and spreadsheets, which is called flat-file because it manages two-dimensional, thus “flat”, data structures. Multidimensional data (so called because a spatial representation would need three or more dimensions) are best handled with relational database management, which is the topic here.

II. Overview of relational database management

A. Definition

A relational database management system (DBMS) is a software package that allows you to interlink two or more tables automatically, rather than working with one table at a time or managing the interlinking manually by moving from one table to another. The relational model and the software that implements it are intended specifically for kinds of data too complex for flat-file techniques to be effective. The basic idea is perhaps best illustrated through examples. Three typical research problems follow.

The relational database model has been defined by its inventor, E. F. Codd, according to whose criteria most so-called relational managers fail to qualify. (See his many books and articles on the subject.) Here, however, we will take “relational” to apply to any software that allows two or more tables to be related explicitly. For a very good treatment of relational database management, see Daniel I. Greenstein, A Historian's Guide to Computing (esp. Chapter 3).

B. Current relational packages

Under our loose definition, many software packages now available qualify as relational. Some well-known and widely used ones are:

  1. Access (Windows), currently very popular and the software used in this course.
  2. Ingres and Oracle (multiple platforms), two systems often preferred for very large database applications.
  3. FileMaker Pro (Mac and Windows).
  4. MySQL (multiple platforms).

III. Typical research problems requiring database management

Whether you use a flat-file or relational manager depends on the nature of your data and how you conceptualize it. The following examples are meant to suggest how you might think about your data to determine which approach is better. As promised, they also help clarify the nature of relational database management.

A. Publishing history

Suppose you were studying the book trade in 18th-century London from the actual evidence of the published artifacts, publishing records, and so forth. Such a massive amount of data would clearly recommend some form of database management. Flat-file or relational?

You could use a flat-file program, but the resulting database would inevitably favour whatever physical artifact or aspect of the trade you first decided to focus on: the books themselves, the publishing houses, the transactions, etc. If, for example, you began with the book as the basic item, then for each book published by a particular firm you would be forced either to re-enter all the data about that firm which you wished to have on hand, or to keep that information elsewhere, put a code into your database for the firm, and relate the two sources by hand. Repetition of data would be tiresome, would consume storage space, and would inevitably lead to variations of all sorts, which would then make matching records hard to find. Data for the favoured artifact or aspect would get full representation in the database; everything else would be abbreviated or made more difficult to access. Potentially the most serious consequence is that you would find it very difficult to take any approach to your data other than the initial one.
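The matching problem can be seen in miniature in the following Python sketch. It is only an illustration: the records, field names, and firm spellings are all invented, and a real flat-file program would of course hold far more fields.

```python
# Hypothetical flat-file records: one row per book, with the firm's details
# re-entered for every book. All names and values are invented.
flat_books = [
    {"title": "Book A", "firm": "J. Smith & Co.", "firm_address": "Fleet Street"},
    {"title": "Book B", "firm": "J Smith and Co", "firm_address": "Fleet St."},
    {"title": "Book C", "firm": "J. Smith & Co.", "firm_address": "Fleet Street"},
]


def books_by_firm(records, firm_name):
    """Return the titles whose firm field matches firm_name exactly."""
    return [r["title"] for r in records if r["firm"] == firm_name]


# Book B belongs to the same firm, but its name was typed differently,
# so an exact match silently misses it.
matches = books_by_firm(flat_books, "J. Smith & Co.")
distinct_spellings = {r["firm"] for r in flat_books}
print(matches)
print(distinct_spellings)
```

Here one firm appears under two spellings, and a search on either spelling finds only some of its books. Holding the firm's details once, in a table of their own, removes the opportunity for this kind of drift.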

In the relational approach, you would first block out the basic units of your data in terms of their representation in the tabular format of the DBMS. Thus you might, for example, have a table for the physical books, another for publishing houses, others for printers, binderies, paper-makers, book-sellers, advertisers, and so forth. As with the flat-file approach, for each table you would then decide on the fields for each record. In each field representing the data in another table, for example the field for publisher in the book table, you would enter a unique code that would then be repeated exactly in the table with the corresponding data. The relational manager would then allow you automatically to link the tables, and so flexibly to combine data in two or more of them. You would then easily be able to get answers to queries like, “Give me all instances in which any work by an author who lived outside of London was published in the city by a firm that went bankrupt within that year or the next and used paper made by the Wookey Hole mill.” An improbable query in substance, perhaps, but not in form.
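One way to picture this arrangement is the following minimal sketch. It uses Python's built-in sqlite3 module, not the Access package used in this course, and the table names, field names, codes, and data are all hypothetical. Each field that stands for data in another table holds a code repeated exactly in that table, and the query at the end follows the form of the one quoted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table per basic unit; the *_code fields in book repeat the codes
# used as keys in the other tables. The schema is an assumption.
conn.executescript("""
CREATE TABLE author      (code TEXT PRIMARY KEY, name TEXT, residence TEXT);
CREATE TABLE publisher   (code TEXT PRIMARY KEY, name TEXT, city TEXT,
                          bankrupt_year INTEGER);
CREATE TABLE paper_maker (code TEXT PRIMARY KEY, name TEXT);
CREATE TABLE book (
    id             INTEGER PRIMARY KEY,
    title          TEXT,
    year           INTEGER,
    author_code    TEXT REFERENCES author(code),
    publisher_code TEXT REFERENCES publisher(code),
    paper_code     TEXT REFERENCES paper_maker(code)
);
""")

# Invented sample data.
conn.executemany("INSERT INTO author VALUES (?, ?, ?)",
                 [("A1", "Author One", "Bath"), ("A2", "Author Two", "London")])
conn.executemany("INSERT INTO publisher VALUES (?, ?, ?, ?)",
                 [("P1", "Firm One", "London", 1751),
                  ("P2", "Firm Two", "London", None)])
conn.executemany("INSERT INTO paper_maker VALUES (?, ?)",
                 [("M1", "Wookey Hole"), ("M2", "Other Mill")])
conn.executemany("INSERT INTO book VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "Work A", 1750, "A1", "P1", "M1"),
    (2, "Work B", 1750, "A2", "P1", "M1"),
    (3, "Work C", 1745, "A1", "P1", "M1"),
    (4, "Work D", 1750, "A1", "P2", "M1"),
    (5, "Work E", 1751, "A1", "P1", "M2"),
])

# The DBMS links the four tables through the shared codes.
titles = conn.execute("""
    SELECT book.title
    FROM book
    JOIN author      ON book.author_code    = author.code
    JOIN publisher   ON book.publisher_code = publisher.code
    JOIN paper_maker ON book.paper_code     = paper_maker.code
    WHERE author.residence <> 'London'
      AND publisher.city = 'London'
      AND publisher.bankrupt_year BETWEEN book.year AND book.year + 1
      AND paper_maker.name = 'Wookey Hole'
    ORDER BY book.title
""").fetchall()
print(titles)
```

In this invented data only “Work A” satisfies every condition. Each of the other four works fails exactly one: the author's residence, the date of the bankruptcy, the absence of a bankruptcy, or the mill. Note that the firm's details are entered once only, however many books it published.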

Most importantly for some approaches to the project, you could easily reorient the data primarily to address what originally seemed an ancillary aspect, e.g. binderies of the 18th century rather than the finished books.

B. Archaeology

Suppose you were in charge of a series of archaeological excavations attesting to a variety of cultures; one question you had in mind, but by no means the only one, was trade between these historically overlapping cultures. Flat-file or relational?

A flat-file approach would encounter essentially the same kinds of problems as before: repetition of data, with the potential for variance and error; and inflexibility of the result. Again the research involves not only a large amount of data but also numerous, complex interrelationships among them, with no guarantee that the primary interests of the individual doing the research will be identical to what others want to know, or indeed the same as this individual will wish to pursue later. Since digging necessarily destroys aspects of the evidence it uncovers, the excavator will benefit everyone by using a multidimensional tool in which to record as much of what is found as possible, in a way that will not obscure or obliterate answers to potential questions.

The relational approach is clearly better for archaeological data, which in any excavation are of many different kinds and come in relatively small bits that need to be sorted and resorted many times to yield their secrets. Archaeological data also conform to a clear hierarchy of regions, sites, strata, buildings or locations, etc., which makes it relatively easy to determine what tables will be required. The challenge is likely to be on the level of detail: for example, what aspects of objects are recorded, which ones in fields of their own, and what categories are used? Considerable knowledge and imaginative grasp of the research field are needed.
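As a rough sketch of how such a hierarchy might translate into tables, consider the following, again in Python with the built-in sqlite3 module. The tables, fields, and data are hypothetical, and a real excavation record would need many more of each. Every table points to the level above it, and the same objects are then regrouped in two different ways.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each level of the hierarchy is a table that refers to the level above it.
# The schema is an assumption made for illustration.
conn.executescript("""
CREATE TABLE region   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE site     (id INTEGER PRIMARY KEY, name TEXT,
                       region_id INTEGER REFERENCES region(id));
CREATE TABLE stratum  (id INTEGER PRIMARY KEY, label TEXT,
                       site_id INTEGER REFERENCES site(id));
CREATE TABLE artifact (id INTEGER PRIMARY KEY, material TEXT, culture TEXT,
                       stratum_id INTEGER REFERENCES stratum(id));
""")

# Invented sample data.
conn.execute("INSERT INTO region VALUES (1, 'Region North')")
conn.executemany("INSERT INTO site VALUES (?, ?, ?)",
                 [(1, "Site A", 1), (2, "Site B", 1)])
conn.executemany("INSERT INTO stratum VALUES (?, ?, ?)",
                 [(1, "I", 1), (2, "II", 1), (3, "I", 2)])
conn.executemany("INSERT INTO artifact VALUES (?, ?, ?, ?)", [
    (1, "ceramic", "Culture X", 1),
    (2, "bronze",  "Culture Y", 1),
    (3, "ceramic", "Culture Y", 3),
    (4, "ceramic", "Culture X", 2),
])

# First grouping: how many distinct cultures are attested at each site?
cultures_per_site = conn.execute("""
    SELECT site.name, COUNT(DISTINCT artifact.culture)
    FROM artifact
    JOIN stratum ON artifact.stratum_id = stratum.id
    JOIN site    ON stratum.site_id     = site.id
    GROUP BY site.id, site.name
    ORDER BY site.name
""").fetchall()

# Second grouping of the same records: objects counted by material.
by_material = conn.execute("""
    SELECT material, COUNT(*) FROM artifact
    GROUP BY material ORDER BY material
""").fetchall()

print(cultures_per_site)
print(by_material)
```

The choice of `material` and `culture` as the only fields for an object is exactly the kind of decision about detail that the sketch leaves open; deciding what deserves a field of its own remains the researcher's task.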

C. Literary criticism

Suppose that you were attempting to understand patterns of personification in a medieval text: what is personified, under what conditions does a personification occur, and can these be defined in terms of the immediate linguistic environment as well as broader context? Suppose that hundreds of cases were involved and consistency of criteria both crucial and difficult to achieve. Once again some kind of database is called for. Flat-file or relational?

The basic unit for these data is the instance of or candidate for personification, e.g. “O woods! Has anyone ever been loved more cruelly than I?” or “Love commands my heart”. With close knowledge of the text, you would set certain conditions on personification, such as direct address or attribution of speech; use of a verb normally only attributed to humans; familial relationship; possessions not usually associated with the entity; and so forth. You would also, however, have to make some distinction between “strong” and “weak” personifying agents, the number of these in any given case, parallel entities, and other contextual factors. To get an idea of how to weigh these factors, you would likely need to examine many instances, and to sort your data again and again by different criteria: order of occurrence in the text, particular kinds of personifying agents (e.g. all cases that use possessions, all those with quoted speech, all those involving verbs), number of these agents, reasons for accepting or rejecting.

The mechanical requirements implied here can all be satisfied nicely with a flat-file manager, in fact by a spreadsheet program, since only small bits of text have to be recorded in any one field. No field requires a table of its own unless you wish significantly to enlarge the scope of the research, in which case the single table of the spreadsheet could be imported into a relational manager (although perhaps with a certain amount of reworking). So, although the intellectual problems involved here are difficult and complex, the structure of the data is not. Hence the flat-file approach.
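To suggest how little machinery the repeated sorting requires, here is a minimal Python sketch of a single flat table. The field names, line numbers, and judgements are all invented; each personifying agent is simply a yes/no column, as it might be in a spreadsheet.

```python
# One flat table: each record is a candidate personification.
# All fields and values are hypothetical.
AGENTS = ("address", "human_verb", "speech", "possessions")

cases = [
    {"line": 120, "entity": "woods", "address": True, "human_verb": False,
     "speech": False, "possessions": False, "accepted": True},
    {"line": 45, "entity": "Love", "address": False, "human_verb": True,
     "speech": False, "possessions": False, "accepted": True},
    {"line": 300, "entity": "Fortune", "address": False, "human_verb": True,
     "speech": True, "possessions": True, "accepted": True},
    {"line": 210, "entity": "river", "address": False, "human_verb": False,
     "speech": False, "possessions": False, "accepted": False},
]


def agent_count(case):
    """Number of personifying agents recorded for one case."""
    return sum(case[a] for a in AGENTS)


# The same table sorted and filtered by different criteria.
by_occurrence = [c["entity"] for c in sorted(cases, key=lambda c: c["line"])]
by_agent_count = [c["entity"]
                  for c in sorted(cases, key=agent_count, reverse=True)]
with_speech = [c["entity"] for c in cases if c["speech"]]
rejected = [c["entity"] for c in cases if not c["accepted"]]

print(by_occurrence)
print(by_agent_count)
print(with_speech)
print(rejected)
```

Every operation here works on one table only; nothing has to be linked to anything else, which is why a spreadsheet's own sort and filter commands would do the same job.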

How might you reasonably discover that a flat-file approach is sufficient here? By going for the simplest explanation: beginning with a spreadsheet, you find that you do not need to go any further. If you did, a modern relational DBMS would be able to accept as input the output of a modern spreadsheet program.
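The hand-over from spreadsheet to relational manager might look something like the following sketch. It assumes that the spreadsheet has been saved as comma-separated text (CSV), and it uses Python's built-in csv and sqlite3 modules in place of any particular spreadsheet or DBMS; the column names and rows are invented.

```python
import csv
import io
import sqlite3

# Stand-in for a file exported from a spreadsheet; the content is invented.
csv_text = """line,entity,accepted
45,Love,yes
120,woods,yes
210,river,no
"""

# Read the exported rows; each row becomes a dictionary keyed by column name.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Load them into a single table of a relational database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE personification (line INTEGER, entity TEXT, accepted TEXT)"
)
conn.executemany(
    "INSERT INTO personification VALUES (:line, :entity, :accepted)", rows
)

accepted = conn.execute(
    "SELECT entity FROM personification WHERE accepted = 'yes' ORDER BY line"
).fetchall()
print(accepted)
```

At this point the data are no more relational than they were in the spreadsheet: there is still one table. The reworking would begin only if some field came to need a table of its own.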

revised February 2008