

AV1000
Fundamentals of the digital humanities
Introduction to tabular data analysis

“Science in general… does not consist in collecting what we already know and arranging it in this or that kind of pattern. It consists in fastening upon something we do not know, and trying to discover it. Playing patience with things we already know may be a useful means toward this end, but it is not the end itself. It is at best only the means. It is scientifically valuable only in so far as the new arrangement gives us the answer to a question we have already decided to ask. That is why all science begins from the knowledge of our own ignorance, not our ignorance of everything, but our ignorance of some definite thing….” (R.G. Collingwood, The Idea of History, rev. edn., Oxford, 1993, p. 9)

  1. Tabular vs discursive data
  2. What is a database?
  3. Matching data to software
    1. Retrieval software
    2. Concording and text-analysis software
    3. Flat-file databases
    4. Relational database software

I. Tabular vs discursive data

As noted in the introduction to the topic on the course homepage, verbal data can be divided into two kinds: discursive, i.e. “running” or sequential text, intended primarily for reading, and the object of text-analytic techniques; and tabular. Tabular data is either found that way, in relatively small, disjunct segments (e.g. bibliographic information, archaeological data), or it results from construing a source as comprising such segments. Tabular data may also consist of classificatory labels and bits of factual information abstracted from non-verbal objects, e.g. buildings and other architectural monuments.

The primary computational form for tabular data is the database.

II. What is a database?

The term “database” is loosely used to mean “any large collection of information” (OED s.v. 1 transf.), e.g. research notes in a wordprocessed document, a collection of files containing the text of a novel, or some bibliographic records. Used in this way the term is not very useful, since it can be applied to just about anything. Much better is the narrower, more technical definition, “A structured collection of tabular data held in computer storage; esp. one that incorporates software to make it accessible in a variety of ways” (OED). This software is known as a “database management system” (DBMS).

Structure of a database management system

The structure of a DBMS is represented in the diagram to the right. Note particularly the intermediating layer of software between the user and the data. This software allows the researcher to retrieve the same data in many different arrangements and so discover otherwise hidden patterns. Furthermore, a given database can thus prove useful to researchers with widely divergent interests.

For the sake of simplicity, we consider here only textual data, although image- and sound-files can be attached to fields in a DBMS.

III. Matching data to software

Design and construction of a database requires a significant effort, and DBMS software is typically among the hardest to learn. The first question one should ask, therefore, is whether one actually needs to make such an effort; perhaps other, simpler techniques will do the job. The following is intended as a review of the analytic approaches to textual data that you have already used, in order better to identify the situations in which a DBMS is the best or only choice.

A. Retrieval software

Text-retrieval software searches for strings of characters and usually permits wildcards and operators meant to restrict the conditions under which the specified string is retrieved. Such software is, for example, found in Web search engines such as Google. You may also encounter basic searching utilities for use on standalone machines, e.g. the “find” utility built into Windows.
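The idea of wildcards and operators can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works: the sample documents and the function names `search` and `search_all` are invented for the purpose, and the standard-library `fnmatch` module is used to turn a wildcard pattern into a regular expression.

```python
import fnmatch
import re

# Hypothetical sample documents; in practice these would be files or Web pages.
documents = {
    "notes.txt": "The cathedral was rebuilt by Wren after the fire.",
    "letter.txt": "Hawksmoor built several churches in London.",
    "memo.txt": "Funding for the rebuilding was approved.",
}

def search(pattern, docs):
    """Return names of documents containing a word that matches a wildcard pattern.

    '*' stands for any run of characters and '?' for a single character.
    """
    regex = re.compile(fnmatch.translate(pattern), re.IGNORECASE)
    hits = []
    for name, text in docs.items():
        words = re.findall(r"\w+", text)
        if any(regex.match(word) for word in words):
            hits.append(name)
    return hits

def search_all(patterns, docs):
    """AND operator: keep only documents that match every pattern."""
    result = set(docs)
    for pattern in patterns:
        result &= set(search(pattern, docs))
    return sorted(result)

print(search("rebuil*", documents))                # ['notes.txt', 'memo.txt']
print(search_all(["buil*", "london"], documents))  # ['letter.txt']
```

Note that the search knows nothing about the documents beyond their characters: it cannot tell that "Wren" is an architect or that "London" is a place.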

In general, text-retrieval software is most clearly indicated for large amounts of unstructured or variously structured data, such as Web pages or collections of papers. It is the least demanding kind of software but often the least helpful, since it relies on no information about the text beyond the character strings themselves.

B. Concording and text-analysis software

Concording and text-analysis software includes the functions of the previous category but is intended primarily for the analysis of the literary and linguistic features of running or narrative text, i.e. text considered as a continuous sequence of words. This is the kind of analysis commonly done in literary studies, corpus linguistics, history, anthropology, sociology, psychology—i.e. when the language in which something is said or written is significant. Such software serves analysis by producing the retrieved text in one or more helpful formats (such as KWIC) and, especially, by allowing for metalinguistic tagging that the researcher introduces into the text in order to denote implicit textual structures and similar phenomena. Such software is often not the best choice for dealing with essentially discontinuous text, however—i.e. the kind that occurs in relatively small, independent chunks, such as lists of items. For that kind, one may need a database.
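A keyword-in-context (KWIC) display can be sketched as follows. This is a minimal illustration under simplifying assumptions (words are split on non-word characters, and the context is a fixed number of words); the function name `kwic` and its `width` parameter are invented here, and the sample sentence is adapted from the Collingwood epigraph. Real concordance packages offer far more, including the tagging described above.

```python
import re

def kwic(text, keyword, width=3):
    """Return keyword-in-context lines: `width` words on each side of each hit."""
    words = re.findall(r"\w+", text)
    lines = []
    for i, word in enumerate(words):
        if word.lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            # Right-align the left context so the keywords line up in a column.
            lines.append(f"{left:>25} [{word}] {right}")
    return lines

sample = ("Science begins from the knowledge of our own ignorance, "
          "not our ignorance of everything, but our ignorance of some definite thing.")

for line in kwic(sample, "ignorance"):
    print(line)
```

Each occurrence of the keyword is printed on its own line with its neighbouring words, which makes recurring patterns of usage easy to see at a glance.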

C. Flat-file databases

“Flat-file software” is based on the model of the table. It requires that the data be put into a highly structured tabular format of rows and columns. It thus particularly suits data that text-analysis software is not well adapted to handle: data that occur naturally in small chunks, as for example in bibliographic management, where each record of information consists of short fields (for author, title, publisher, and so forth). Other than bibliographic management, the commonest sort of flat-file software is the spreadsheet, which can handle text as well as numbers and so comes in handy for managing notes and many kinds of tabular analysis involving either numbers or text. Flat-file software runs into trouble either when the textual data is too discursive (i.e. comes in very large chunks) or when the interrelationships among the chunks are complex. In the latter situation, a relational database program is required.
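A flat file of this kind can be sketched with Python's standard `csv` module. The field names (`author`, `title`, `place`, `year`) are an assumption made for the example, and the second and third records are placeholders, not real publications; only the first is taken from the epigraph above.

```python
import csv
import io

# One row per record, one column per field. An in-memory string stands in
# for a file on disk; the last two records are hypothetical placeholders.
flat_file = io.StringIO(
    "author,title,place,year\n"
    "R.G. Collingwood,The Idea of History,Oxford,1993\n"
    "Author A,Example Title One,London,2001\n"
    "Author B,Example Title Two,Oxford,1987\n"
)

records = list(csv.DictReader(flat_file))

# Because every record has the same short fields, the table can be
# re-sorted or filtered on any column.
by_year = [r["title"] for r in sorted(records, key=lambda r: int(r["year"]))]
from_oxford = [r["author"] for r in records if r["place"] == "Oxford"]

print(by_year)
print(from_oxford)
```

The same operations are what a spreadsheet performs when a column is sorted or filtered.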

D. Relational database software

Relational database design is the most powerful technique currently available for managing the complex kinds of datasets common to academic research and major businesses. Relational design begins with tabular data, like the flat-file program, but avoids the major restrictions of flat-file management. For example, if you were to collect data on the architecture of London, you would most likely find that a single table would be radically inadequate for two reasons:

  1. In a database of buildings, it would force you to repeat information common to two or more entries, for example the name of the architect, his or her birth and death dates, style of work, and so forth. Such repetition makes the database unusually subject to errors and inconsistencies, and much larger than it needs to be.
  2. More seriously, it would force you to commit your design to one particular view of the data. If, for example, you were interested primarily in building materials, then because of the difficulty of entering information on each architect every time he or she was involved, the resulting database would tend not to be very useful to someone interested in architects.
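The first problem can be made concrete with a small sketch. The buildings, architects and dates below are invented placeholders, and the helper `inconsistent_architects` is hypothetical; the point is only that a fact stored on several rows can be corrected on one row and left wrong on the others.

```python
# Hypothetical single-table design: the architect's details are repeated
# on every row for a building he or she designed.
buildings = [
    {"building": "Building A", "architect": "Architect X", "born": 1700, "material": "stone"},
    {"building": "Building B", "architect": "Architect X", "born": 1700, "material": "brick"},
    {"building": "Building C", "architect": "Architect Y", "born": 1750, "material": "brick"},
]

# A correction applied to one row only leaves the table contradicting itself.
buildings[0]["born"] = 1701

def inconsistent_architects(rows):
    """Return the architects recorded with more than one birth year."""
    seen = {}
    for row in rows:
        seen.setdefault(row["architect"], set()).add(row["born"])
    return sorted(name for name, years in seen.items() if len(years) > 1)

print(inconsistent_architects(buildings))  # ['Architect X']
```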

In the relational design, the data are divided into separate tables (e.g. one for buildings, one for architects, one for building materials). When the user makes a query, the component tables are related by software according to his or her instructions so that only the parts of the data required by the question are brought together and put in an order that will best reveal whatever pattern he or she is looking for. Thus, for example, a single entry for a given architect can be referenced by the entries for each building for which he or she was responsible. No repetition of the information is required, so nothing in the nature of the model forces you to commit your design to one particular view. It can thus serve a wide variety of users.
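The following sketch shows one way such a design might look, using the SQLite engine that ships with Python. The table and column names, and the placeholder data, are assumptions made for illustration rather than a recommended schema. Each architect and each material is stored once; the buildings table refers to them by identifier, and the tables are brought together only when a query asks for it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary in-memory database
conn.executescript("""
CREATE TABLE architects (id INTEGER PRIMARY KEY, name TEXT, born INTEGER);
CREATE TABLE materials  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE buildings  (
    id INTEGER PRIMARY KEY,
    name TEXT,
    architect_id INTEGER REFERENCES architects(id),
    material_id  INTEGER REFERENCES materials(id)
);
-- Hypothetical placeholder data.
INSERT INTO architects VALUES (1, 'Architect X', 1700), (2, 'Architect Y', 1750);
INSERT INTO materials  VALUES (1, 'stone'), (2, 'brick');
INSERT INTO buildings  VALUES
    (1, 'Building A', 1, 1),
    (2, 'Building B', 1, 2),
    (3, 'Building C', 2, 2);
""")

# View 1: a researcher interested in architects.
by_architect = conn.execute("""
    SELECT a.name, b.name
    FROM buildings b JOIN architects a ON a.id = b.architect_id
    ORDER BY a.name, b.name
""").fetchall()

# View 2: a researcher interested in building materials.
by_material = conn.execute("""
    SELECT m.name, COUNT(*)
    FROM buildings b JOIN materials m ON m.id = b.material_id
    GROUP BY m.name ORDER BY m.name
""").fetchall()

print(by_architect)
print(by_material)  # [('brick', 2), ('stone', 1)]
```

Both views are drawn from the same stored data, and a correction to an architect's birth year would be made in exactly one place.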


Typically a researcher will have all four kinds of software at hand and use each of them, alone or in combination, as the occasion demands. The novice is well advised to familiarize him- or herself with these kinds before undertaking a major project, since a significant amount of time can be lost by attempting to fit data into the wrong kind of software. An insufficiently sophisticated package will either not allow you to do what you want or, more seriously, will silently hamper your ability to think about your data; its lack of sophistication may not be obvious until you have input a relatively large amount. A package that offers more sophistication than you need will likely require you to make many unnecessary choices.

revised February 2008