

AV1000
Fundamentals of the digital humanities
Basic technical knowledge and terminology

  1. Units of hardware measurement
  2. Hardware components
    1. Processors and motherboard
    2. Primary storage
    3. Secondary storage
  3. Organisation of secondary storage
    1. Hierarchical structure
    2. Software tools
    3. Tips and techniques
  4. Data structures
    1. Traditional data files
    2. Relational databases
    3. Text files
    4. Image and sound files
  5. Data representation
    1. Bits and numbering systems
    2. Bytes and character encoding
    3. Unicode
  6. Glossary

I. Units of hardware measurement

  1. Data. The smallest unit of data in a computer is a bit, which may be in one of two states (represented by the numbers 0 and 1). Eight bits make up a byte, which is the standard base-unit for measuring memory and storage, hence kilobyte (or K, for 1,000 bytes), megabyte (or MB, for 1,000,000 bytes), gigabyte (or GB, for 1,000,000,000 bytes), or terabyte (or TB, 1,000 times a GB). A file containing a reasonably sized essay, for example, will be measured in K; most software for downloading and RAM-capacity in current machines in MB; current hard-disks in GB; measures of total electronic information in the world in many, many terabytes.
  2. Speed. Computer-circuitry is regulated by an internal clock whose rate is measured in hertz (machine-cycles per second), named after the physicist Heinrich Hertz (1857–94). Hence the speed of computers is measured in these terms, in megahertz (MHz, 1,000,000 cycles per second) and gigahertz (GHz, 1,000,000,000 cycles per second). Current desktop machines are rated from ca. 1–2 GHz.
  3. Throughput. The speed with which computers actually process data is measured in the number of operations they perform per second, FLOPS (for “floating-point operations per second”) or MIPS (“millions of instructions per second”). Throughput is highly technical jargon that you may encounter but are unlikely to use.
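
The units above can be tried out in a few lines of Python. This is a minimal sketch, assuming the decimal multiples given in the list; the function names describe_size and cycles_per_second are invented for illustration.

```python
# Decimal multiples of the byte, largest first, as given in the list above.
UNITS = [("TB", 10**12), ("GB", 10**9), ("MB", 10**6), ("K", 10**3)]

BITS_PER_BYTE = 8


def describe_size(n_bytes):
    """Express a byte count in the largest unit that yields a value of at least 1."""
    for name, factor in UNITS:
        if n_bytes >= factor:
            return f"{n_bytes / factor:.1f} {name}"
    return f"{n_bytes} bytes"


def cycles_per_second(gigahertz):
    """Convert a clock rating in GHz to machine-cycles per second."""
    return int(gigahertz * 10**9)


print(describe_size(25_000))         # an essay-sized file: 25.0 K
print(describe_size(4_000_000_000))  # 4.0 GB
print(25_000 * BITS_PER_BYTE)        # the same essay in bits: 200000
print(cycles_per_second(2))          # 2000000000
```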

II. Hardware components

A. Processors and motherboard

  1. The basic physical unit of hardware is the processor or chip. (These are called “chips” because they are made from wafers of a material that regulates the electrical current passed through it.) These chips are thin, flat, usually rectangular objects with connecting pins emerging from them, somewhat like the legs of a spider, as shown here.
  2. The most important processor in a computer is the central processing unit or CPU. Computers are even now commonly identified by the kind of CPU they contain, since newer ones are always faster and may have new, highly desirable characteristics. It is best to seek advice when attempting to judge the actual worth of a given CPU in terms of overall performance.
  3. The chips in a computer are affixed to a single, flat surface called a motherboard, into which other, smaller boards may be plugged and to which various connectors are attached. These boards are for connecting peripheral devices, such as the monitor, and for adding subsidiary equipment, such as a modem.

B. Primary storage

Primary storage consists of chips from which data is immediately and very rapidly accessible. This storage may be volatile or non-volatile, as follows.

  1. RAM (random-access memory), a type of volatile storage within which any byte may be accessed directly, by its address, in any order rather than in a particular sequence (thus “random-access”). RAM is used for the main working memory of the computer; currently a gigabyte or more is the norm.
  2. ROM (read-only memory), non-volatile storage programmed to contain static data that the computer uses repeatedly.

C. Secondary storage

Secondary storage consists of electro-magnetic or optical devices that provide a stable, non-volatile medium. All such devices are considerably slower than primary storage.

  1. Magnetic disc (non-volatile, cheap, fairly durable, medium speed, random access)
  2. Magnetic tape (non-volatile, cheap, durable, very slow, sequential access). Once the medium of choice for mass-storage, now chiefly of use for system backups and transportation of large quantities of data from one computing system to another.
  3. Optical disc (non-volatile, cheap, durable, slow, random access)
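
The difference between random and sequential access can be imitated in software. The following is only a sketch: an in-memory byte stream stands in for the physical medium, and the two function names are invented for illustration. With random access the wanted byte is fetched directly by its address; with sequential access every earlier byte has to be passed over first.

```python
import io


def read_random(medium, address):
    """Jump straight to the address and read one byte (disc-like access)."""
    medium.seek(address)
    return medium.read(1)


def read_sequential(medium, address):
    """Rewind, then read byte by byte until the address is reached (tape-like access).

    Returns the byte found and the number of reads it took to get there.
    """
    medium.seek(0)
    reads = 0
    byte = b""
    for _ in range(address + 1):
        byte = medium.read(1)
        reads += 1
    return byte, reads


# A stand-in "medium" holding the 256 possible byte values in order.
medium = io.BytesIO(bytes(range(256)))

print(read_random(medium, 200))      # one read
print(read_sequential(medium, 200))  # same byte, after 201 reads
```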

III. Organisation of secondary storage

The operating system, which stands between the user and the physical media of storage, presents these in terms of a logical, hierarchical structure and supplies software tools for their manipulation. Beware: some of the terms are the same as those used to denote physical structures.

A. Hierarchical structure

  1. Disc or other name for a physical medium, considered as a logical unit and assigned a name or designator, such as “C:”.
  2. Volume, a single, logical unit, usually of a very large size.
  3. Directory or folder, a unit of organisation into which discs or volumes are divided; each is named and may contain both files and other directories. The figure to the right shows a graphical representation of a directory structure in the Windows operating system.
  4. File, a collection of data, usually on a single subject or with a single purpose, such as an essay or a spreadsheet; for certain kinds of files, a logical collection of records.
  5. Record, the primary subdivision of formally structured files, usually consisting of a sequence of fields.
  6. Field, the basic unit of segmented information, as in a database or spreadsheet.
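
The directory and file levels of this hierarchy can be explored with Python's standard library. The sketch below builds a small temporary directory tree (the directory and file names are invented for illustration) and prints it as an indented outline, one level of indentation per level of nesting.

```python
import os
import tempfile
from pathlib import Path


def outline(root):
    """Return one indented line for each directory and file below root."""
    root = Path(root)
    lines = []
    for current, dirs, files in os.walk(root):
        dirs.sort()  # visit subdirectories in a predictable order
        depth = len(Path(current).relative_to(root).parts)
        if depth > 0:
            lines.append("  " * (depth - 1) + Path(current).name + "/")
        for name in sorted(files):
            lines.append("  " * depth + name)
    return lines


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "essays" / "drafts").mkdir(parents=True)
    (base / "readme.txt").write_text("notes")
    (base / "essays" / "yeats.txt").write_text("final version")
    (base / "essays" / "drafts" / "yeats-v1.txt").write_text("first draft")
    lines = outline(base)

print("\n".join(lines))
```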

B. Software tools

  1. Windows: Explorer, a file-management tool; the Find tool (under the Start menu), which allows searching for file-names and contents; security mechanisms for setting file and directory access permissions.
  2. Macintosh: totally integrated into the operating system; the Finder, for locating files.

C. Tips and techniques

  1. The point of a hierarchical file system is to represent as nearly as possible the logical structure of one's data and what one does with them. When the match is close, needed objects are easy to find and to maintain; when it is not, objects tend to get lost or to be confused with others similarly named.
  2. An effective scheme of organisation balances the grouping together of similar objects against the representation of their differences. Such schemes are pragmatic entities: they are good if they help the user locate objects, bad if they impede him or her, however elegant they may otherwise be.
  3. Choice of name is almost always important. The identity of a file can in part be carried by the directory in which it is located, but the name should insofar as possible allow you to identify it long after the memory of having created the file has faded. Remember: a file-name may be quite long and have many parts, but a very long name will be a chore to read and decode.
  4. With all versions of Windows, the file-name extension should reflect the software used to access the file. Most programs will automatically designate files they create with the proper extension, but whenever choice is necessary, make sure the extension corresponds to the right program.
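
How a file's identity is divided between directory, name and extension can be seen by taking a path apart. This is a sketch only: the example path and the table pairing extensions with programs are hypothetical, not a description of how Windows itself makes the association.

```python
from pathlib import PurePosixPath

# Hypothetical table pairing extensions with the kind of program that opens them.
PROGRAMS = {
    ".doc": "word processor",
    ".xls": "spreadsheet",
    ".htm": "web browser",
    ".txt": "text editor",
}


def program_for(filename):
    """Look up the kind of program suggested by a file-name's extension."""
    extension = PurePosixPath(filename).suffix.lower()
    return PROGRAMS.get(extension, "unknown")


path = PurePosixPath("essays/yeats/rose-poems-draft2.doc")
print(path.parent)              # essays/yeats  (identity carried by the directory)
print(path.stem)                # rose-poems-draft2  (identity carried by the name)
print(path.suffix)              # .doc
print(program_for(path.name))   # word processor
```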

IV. Data structures

A. Traditional data files

The oldest standard scheme for conceptualising a collection of data partitions it hierarchically into files, each file into records and each record into fields. This scheme may be visualised as a table, thus:

FILE: Authors

R#  Surname      Forename        YoB   YoD   PoB                  PoD
1   Shakespeare  William         1564  1616  Stratford-upon-Avon  Stratford-upon-Avon
2   Byron, Lord  George Gordon   1788  1824  London               Missolonghi
3   Yeats        William Butler  1865  1939  Dublin               Cap Martin


Fields, and thus records, may be fixed-length, as suggested by the rigid table-structure above, or variable-length. If the former, length is specified at the time the data-structure is created and before it is “filled” with data.
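
The two kinds of record can be sketched as follows. The field widths and the use of a tab character as separator are assumptions made for the example, not part of any standard. In the fixed-length form every record occupies the same number of characters and a field is found by its position; in the variable-length form a field is found by counting separators.

```python
# Field names with fixed widths in characters (the widths are assumptions).
FIELDS = [("surname", 12), ("forename", 15), ("yob", 4), ("yod", 4)]


def pack_fixed(record):
    """Pad (or cut) each field to its declared width and join the results."""
    return "".join(str(record[name]).ljust(width)[:width] for name, width in FIELDS)


def unpack_fixed(line):
    """Recover the fields by slicing at the declared positions."""
    record, start = {}, 0
    for name, width in FIELDS:
        record[name] = line[start:start + width].rstrip()
        start += width
    return record


def pack_variable(record):
    """Join the fields with a tab; each field takes only the space it needs."""
    return "\t".join(str(record[name]) for name, _ in FIELDS)


def unpack_variable(line):
    """Recover the fields by splitting at the tabs."""
    return dict(zip((name for name, _ in FIELDS), line.split("\t")))


yeats = {"surname": "Yeats", "forename": "William Butler", "yob": 1865, "yod": 1939}
print(len(pack_fixed(yeats)))     # always 35, whatever the data
print(len(pack_variable(yeats)))  # 30 for this record
print(unpack_fixed(pack_fixed(yeats)))
```

Note that both forms return every field as text; deciding that a year is a number is a further step.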

A database consisting of such a file is often called a flat-file database because of its two-dimensional structure. Contrast the following.

B. Relational databases

A relational database broadly speaking consists of a coherent set of explicitly related flat-files or tables. A relational scheme of two such tables is illustrated below:

Database: Poets and poems

Table 1: Poets

AuthorID  Surname      Forename        YoB   YoD   PoB                  PoD
shakw     Shakespeare  William         1564  1616  Stratford-upon-Avon  Stratford-upon-Avon
byrol     Byron, Lord  George Gordon   1788  1824  London               Missolonghi
yeatw     Yeats        William Butler  1865  1939  Dublin               Cap Martin

Table 2: Poems

AuthorID  Title                              Volume           Year written  Year published  First line
yeatw     To the Rose upon the Rood of Time  The Rose         ?             1893            Red Rose, proud Rose, sad Rose of all my days!
yeatw     After Long Silence                 Words for Music  1929          1931            Speech after long silence; it is right,
shakw     Sonnet XVIII                                        ?             ?               Shall I compare thee to a summer's day?

The common field, AuthorID, allows matching entries in the two tables to be related automatically, by the database software. The division of the data into tables avoids costly and error-prone repetition, for example of biographical data for each poet each time a poem is listed. More importantly, the relational structure allows us to represent complex data structures in such a way that the data may be viewed efficiently from numerous perspectives, some or many of which may not have been anticipated by the designer.
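
The way the common field relates the two tables can be tried with the SQLite database engine that ships with Python. This is a minimal sketch using a cut-down version of the tables above; the table and column names are chosen for the example, and unknown values (shown as “?” above) are stored as empty (NULL) entries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a temporary database held in memory
conn.execute("CREATE TABLE poets (author_id TEXT PRIMARY KEY, surname TEXT, forename TEXT)")
conn.execute("CREATE TABLE poems (author_id TEXT, title TEXT, year_published INTEGER)")

conn.executemany(
    "INSERT INTO poets VALUES (?, ?, ?)",
    [
        ("shakw", "Shakespeare", "William"),
        ("byrol", "Byron, Lord", "George Gordon"),
        ("yeatw", "Yeats", "William Butler"),
    ],
)
conn.executemany(
    "INSERT INTO poems VALUES (?, ?, ?)",
    [
        ("yeatw", "To the Rose upon the Rood of Time", 1893),
        ("yeatw", "After Long Silence", 1931),
        ("shakw", "Sonnet XVIII", None),  # year of publication unknown
    ],
)

# The software matches rows in the two tables wherever author_id is equal,
# so each poet's details are stored once however many poems are listed.
rows = conn.execute(
    "SELECT poets.surname, poems.title "
    "FROM poems JOIN poets ON poets.author_id = poems.author_id "
    "ORDER BY poems.title"
).fetchall()

for surname, title in rows:
    print(f"{surname}: {title}")
```

Byron does not appear in the result because no poem in the second table carries his AuthorID.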

C. Text files

For text files, the logical structure is more difficult to represent in terms of a physical structure of files, records and fields. Poetry written in highly regular forms such as the sonnet comes closest to a predictable structure, but even there multiple overlapping hierarchies will cause problems. Physically, electronic text tends to be segmented into internally unstructured files, with the logical structure indicated by meta-textual markup, just as in HTML paragraphs are indicated by the <p> … </p> tagging element. More on this later.
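
To see how markup carries logical structure, here is a small sketch that uses the HTML parser in Python's standard library to recover the paragraphs from a piece of tagged text. The class name ParagraphCollector and the sample text are invented for illustration.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text found between <p> and </p> tags."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._inside = True
            self.paragraphs.append("")  # start a new, empty paragraph

    def handle_endtag(self, tag):
        if tag == "p":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.paragraphs[-1] += data


text = "<p>Red Rose, proud Rose, sad Rose of all my days!</p><p>Speech after long silence; it is right,</p>"

collector = ParagraphCollector()
collector.feed(text)
collector.close()

for number, paragraph in enumerate(collector.paragraphs, start=1):
    print(number, paragraph)
```

The file itself is one unbroken run of characters; the division into two units exists only because the tags say so.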

D. Image and sound files

Digital images and sounds require very different schemes of organisation. These are not covered here.

V. Data representation

A. Bits and numbering systems

Internally, within the computer, data is most conveniently represented as a series of numbers based on the binary or “base-2” numbering system. Most modern computers organise the binary “digits” in groups of 8, each of which is called a byte. A byte may take on a maximum of 256 values (0 to 255), or 2^8 (2 to the power of 8). Most commonly the value of a byte is expressed in hexadecimal (base-16).

A progression of binary, decimal and hexadecimal equivalents will illustrate how the numbering systems relate to each other:

Binary number Decimal value Hexadecimal value
00000000 0 0
00000001 1 1
00000010 2 2
00000011 3 3
00000100 4 4
00000101 5 5
00000110 6 6
00000111 7 7
00001000 8 8
00001001 9 9
00001010 10 A
00001011 11 B
00001100 12 C
00001101 13 D
00001110 14 E
00001111 15 F
00010000 16 10
11111111 255 FF

Thus the value of a byte in hexadecimal may range from 00 to FF. You have already encountered hexadecimal in the HTML <BODY> tag attribute for background colour, BGCOLOR, where the colour white is given as FFFFFF, i.e. FF (255) units of red, FF units of green and FF units of blue; red as FF0000; black as 000000; and so forth.
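
The table and the colour values can be checked with a few lines of Python. This is a sketch; the function names equivalents and split_colour are invented for illustration.

```python
def equivalents(value):
    """Return a byte value as 8 binary digits, as decimal and as hexadecimal."""
    return format(value, "08b"), str(value), format(value, "X")


def split_colour(hex_colour):
    """Split a six-digit colour such as FF0000 into red, green and blue units."""
    return tuple(int(hex_colour[i:i + 2], 16) for i in (0, 2, 4))


print(2 ** 8)                   # 256 possible values in a byte
print(equivalents(10))          # ('00001010', '10', 'A')
print(equivalents(16))          # ('00010000', '16', '10')
print(equivalents(255))         # ('11111111', '255', 'FF')
print(split_colour("FFFFFF"))   # white: (255, 255, 255)
print(split_colour("FF0000"))   # red: (255, 0, 0)
```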


B. Bytes and character encoding

Since internally data is represented as numbers, we require a consistent assignment of alphanumeric (alphabetic and numeric) characters to hexadecimal numbers. This is not a simple problem to solve, as there are a very large number of graphic characters used to represent the languages of the world, even if one considers only the alphabetic languages. In the early days of computing, when the only recognised character set was that of English, manufacturers adopted the American Standard Code for Information Interchange (better known as ASCII), which allowed for 128 character positions, including several “control characters” used to send commands initiating various functions of the output devices then available. Later, the company IBM independently extended ASCII to include positions for the accented characters of the major Western European languages, in so-called Extended ASCII, represented below:

The first 32 positions (hexadecimal 00 to 1F) here are for control characters (used, originally, to operate a mechanical teletype machine); the bottom 128 are the extensions made by IBM. The order of these is arbitrary. Many of the world's languages obviously cannot be represented using such a limited scheme.
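
The assignment of characters to positions can be inspected directly. The sketch below assumes that the IBM extension described above corresponds to the encoding Python calls “cp437” (the original IBM PC code page); the function name hex_position is invented for illustration.

```python
def hex_position(char, encoding):
    """Return a character's position in a one-byte encoding as two hex digits."""
    encoded = char.encode(encoding)
    return f"{encoded[0]:02X}"


print(hex_position("A", "ascii"))        # 41
print(hex_position("\n", "ascii"))       # 0A, a control character (below 20)
print(hex_position("\u00e9", "cp437"))   # 82, e with acute accent, in the upper 128

# The same accented character has no position at all in plain ASCII.
try:
    "\u00e9".encode("ascii")
    in_ascii = True
except UnicodeEncodeError:
    in_ascii = False
print(in_ascii)                          # False
```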

C. Unicode

Unicode is a relatively new “character coding system designed to support the interchange, processing, and display of the written texts of the diverse languages of the modern world. In addition, it supports classical and historical texts of many written languages.” Information about Unicode may be obtained from the Unicode Consortium Home Page [X].
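
Python's standard library includes a copy of the Unicode character database, so the scheme can be explored directly. This is a sketch; the function name describe is invented for illustration, and UTF-8, which is not discussed above, is only one of several ways of storing Unicode characters as bytes.

```python
import unicodedata


def describe(char):
    """Return a character's Unicode code point, official name and UTF-8 bytes."""
    return f"U+{ord(char):04X}", unicodedata.name(char), char.encode("utf-8")


print(describe("A"))        # ('U+0041', 'LATIN CAPITAL LETTER A', b'A')
print(describe("\u00e9"))   # ('U+00E9', 'LATIN SMALL LETTER E WITH ACUTE', b'\xc3\xa9')
print(describe("\u03b1"))   # ('U+03B1', 'GREEK SMALL LETTER ALPHA', b'\xce\xb1')
```

Characters outside the original ASCII range need more than one byte in UTF-8, which is why a byte and a character can no longer be treated as the same thing.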

VI. Glossary

Following is a list of current terms with which you should be entirely familiar; each is supplied with a link to the Free On-Line Dictionary of Computing.

revised October 2007