AV1000
Fundamentals of the digital humanities
Basic technical knowledge and terminology
- Units of hardware measurement
- Hardware components
- Processors and motherboard
- Primary storage
- Secondary storage
- Organisation of secondary storage
- Hierarchical structure
- Software tools
- Tips and techniques
- Data structures
- Traditional data files
- Relational databases
- Text files
- Image and sound files
- Data representation
- Bits and numbering systems
- Bytes and character encoding
- Unicode
- Glossary
I. Units of hardware measurement
- Data. The smallest unit of data in a computer is a
bit, which may be in one of two states (represented by the numbers
0 and 1). Eight bits make up a byte, which is the standard
base-unit for measuring memory and storage, hence kilobyte (or K, for
1,000 bytes), megabyte (or MB, for 1,000,000 bytes), gigabyte (or GB,
for 1,000,000,000 bytes), or terabyte (1,000 times a GB). A file containing
a reasonably sized essay, for example, will be measured in K; most
software for downloading and RAM-capacity in current machines in MB;
current hard-disks in GB; measures of total electronic information in
the world in many, many terabytes.
- Speed. Computer-circuitry is regulated by an internal clock
whose rate is measured in hertz (cycles per second), a unit named after
the physicist Heinrich Hertz (1857–94). Hence the speed of computers is
measured in these terms, in megahertz (MHz, 1,000,000 cycles per second)
and gigahertz (GHz, 1,000,000,000 cycles per second). Current desktop
machines are rated from ca. 1–2 GHz.
- Throughput. The speed with which computers actually process
data is measured in the number of operations they perform per second,
FLOPS (for “floating-point operations per second”) or
MIPS (“millions of instructions per
second”). Throughput is highly technical jargon that you may
encounter but are unlikely to use.
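The arithmetic behind these units can be tried out directly. The following is a minimal Python sketch, not part of the original handout; the function names (human_size, bits_in, cycle_time_ns) are hypothetical and chosen only for illustration. It uses the decimal (power-of-1,000) prefixes given above.

```python
# Decimal prefixes as given in the text: K, MB, GB, and terabyte (written TB here).
UNITS = ["bytes", "K", "MB", "GB", "TB"]


def human_size(n_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value below 1,000."""
    value = float(n_bytes)
    for unit in UNITS:
        # Stop at the first unit that fits, or at the last unit available.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000


def bits_in(n_bytes: int) -> int:
    """A byte is eight bits."""
    return n_bytes * 8


def cycle_time_ns(hertz: float) -> float:
    """Duration of one clock cycle in nanoseconds for a given clock rate."""
    return 1e9 / hertz


print(human_size(1_440_000))       # capacity of a floppy disc, in bytes
print(human_size(40_000_000_000))  # capacity of a hard disc, in bytes
print(cycle_time_ns(2e9))          # one cycle of a 2 GHz clock
```

A 2 GHz clock thus completes one cycle in half a nanosecond, which gives some sense of why throughput is counted in millions of operations per second.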
II. Hardware components
A. Processors and motherboard
- The basic physical unit of hardware is the processor or
chip. (These are called “chips” because they are
made from wafers of a material that regulates the electrical current
passed through it.)
These chips
are thin, flat, usually rectangular objects with connecting pins
emerging from them, somewhat like the legs of a spider, as shown
here.
- The most important processor in a computer is the central
processing unit or CPU. Computers are even now commonly identified
by the kind of CPU they contain, since newer ones are always faster
and may have new, highly desirable characteristics. It is best to seek
advice when attempting to judge the actual worth of a given CPU in
terms of overall performance.
- The chips in a computer are affixed to a single, flat surface called a motherboard, into which other, smaller boards may be plugged and to which various connectors are attached. These boards are for connecting peripheral devices, such as the monitor, and for adding subsidiary equipment, such as a modem.
B. Primary storage
Primary storage consists of chips from which data is immediately and very rapidly accessible. This storage may be volatile or non-volatile, as follows.
- RAM (random-access memory), a type of volatile
storage within which any byte may be accessed directly, by its
address, in any order rather than in a particular sequence (thus
“random-access”). RAM is used for the main working memory
of the computer; currently a gigabyte or more is the norm.
- ROM (read-only memory), non-volatile storage
programmed to contain static data repeatedly of use by the
computer.
C. Secondary storage
Secondary storage consists of electro-magnetic or optical devices that provide a stable, non-volatile medium. All such devices are considerably slower than primary storage.
- Magnetic disc (non-volatile, cheap, fairly durable, medium speed, random access)
- Hard disc, which consists of one or more rigid platters of ferromagnetic material in a sealed environment, and is thus also known as a fixed disc. The usual storage capacity is ca. 40 GB.
- Floppy disc, a flexible platter of ferromagnetic material contained in a thin plastic case that is inserted into a disc drive. The usual storage capacity is 1.44 MB.
- Removable hard disc, a set of compromise technologies that offer higher storage capacity than the floppy but the convenience of a removable cartridge. This kind includes the proprietary ZIP Drive and some others.
- Magnetic tape (non-volatile, cheap, durable, very slow, sequential access). Once the medium of choice for mass-storage, now chiefly of use for system backups and transportation of large quantities of data from one computing system to another.
- Optical disc (non-volatile, cheap, durable, slow, random access)
- CD-ROM, an adaptation of the technology invented for audio compact discs, using the same physical medium. Each CD contains about 640 megabytes of information. Speed of access is still quite slow in comparison to a hard disc. A CD drive is usually read-only (thus CD-ROM), but recordable (CD-R) and rewritable (CD-RW) drives are now commonplace.
- DVD (abbrev. for digital versatile disc or digital video disc), a new kind of CD-ROM designed to hold a minimum of 4.7 gigabytes of information, enough for a full-length movie. Replacement of the CD seems inevitable.
III. Organisation of secondary storage
The operating system, which stands between the user and the physical media of storage, presents these in terms of a logical, hierarchical structure and supplies software tools for their manipulation. Beware: some of the terms are the same as those used to denote physical structures.
A. Hierarchical structure
- Disc or other name for a physical medium, considered as a logical unit and assigned a name or designator, such as “C:”.
- Volume, a single, logical unit, usually of a very large size.
- Directory or folder, a unit of organisation into which discs or volumes are divided; each is named and may contain both files and other directories. The figure to the right shows a graphical representation of a directory structure in the Windows operating system.
- File, a collection of data, usually on a single subject or with a single purpose, such as an essay or a spreadsheet; for certain kinds of files, a logical collection of records.
- Record, the primary subdivision of formally structured files, usually consisting of a sequence of fields.
- Field, the basic unit of segmented information, as in a database or spreadsheet.
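The nesting of directories and files described above can be made visible with a few lines of code. This is a minimal Python sketch using the standard pathlib module; the function name tree_lines is hypothetical, and the sketch stops at the level of the file (records and fields depend on the format of each file).

```python
from pathlib import Path


def tree_lines(root: Path, depth: int = 0) -> list[str]:
    """Return one indented line per directory and file at or beneath root."""
    # Directories are marked with a trailing slash; indentation shows nesting depth.
    lines = [("  " * depth) + root.name + ("/" if root.is_dir() else "")]
    if root.is_dir():
        for child in sorted(root.iterdir(), key=lambda p: p.name):
            lines.extend(tree_lines(child, depth + 1))
    return lines


if __name__ == "__main__":
    # Prints the structure beneath the current working directory.
    print("\n".join(tree_lines(Path.cwd())))
```

Each level of indentation in the output corresponds to one step down the hierarchy, much as a graphical file manager shows folders within folders.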
B. Software tools
- Windows: Explorer, a file-management device; the Find tool (under the Start menu), which allows searching for file-names and contents; security mechanisms for setting file and directory access permissions.
- Macintosh: totally integrated into the operating system; the Finder, for locating files.
C. Tips and techniques
- The point of a hierarchical file system is to represent as nearly as possible the logical structure of one's data and what one does with them. When the match is close, needed objects are easy to find and to maintain; when it is not, objects tend to get lost or to be confused with others similarly named.
- An effective scheme of organisation balances the grouping together of similar objects against the representation of their differences. Such schemes are pragmatic entities: they are good if they help the user locate objects, bad if they impede him or her, however elegant they may otherwise be.
- Choice of name is almost always important. The identity of a file can in part be carried by the directory in which it is located, but the name should insofar as possible allow you to identify it long after the memory of having created the file has faded. Remember: a file-name may be quite long, have many parts, but a very long name will be a chore to read and decode.
- With all versions of Windows, the file-name extension should reflect the software used to access the file. Most programs will automatically give the files they create the proper extension, but whenever a choice is necessary, make sure the extension corresponds to the right program.
IV. Data structures
A. Traditional data files
The oldest standard scheme for conceptualising a collection of data partitions it hierarchically into files, each file into records and each record into fields. This scheme may be visualised as a table, thus:
FILE: Authors
| R# | Surname | Forename | YoB | YoD | PoB | PoD |
| 1 | Shakespeare | William | 1564 | 1616 | Stratford-upon-Avon | Stratford-upon-Avon |
| 2 | Byron, Lord | George Gordon | 1788 | 1824 | London | Missolonghi |
| 3 | Yeats | William Butler | 1865 | 1939 | Dublin | Cap Martin |
| … | … | … | … | … | … | … |
Fields, and thus records, may be fixed-length, as suggested by the rigid table-structure above, or variable-length. If the former, length is specified at the time the data-structure is created and before it is “filled” with data.
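The difference between fixed-length and variable-length records can be sketched in Python as follows. This is an illustration only: the field widths and the "|" separator are assumptions made for the example, not part of any particular database product, and only four of the fields from the Authors file are used.

```python
# Hypothetical field widths, fixed before any data is entered.
FIELDS = [("Surname", 12), ("Forename", 14), ("YoB", 4), ("YoD", 4)]


def pack_fixed(record: dict) -> str:
    """Pad (or truncate) every field to its declared width."""
    return "".join(str(record[name]).ljust(width)[:width] for name, width in FIELDS)


def unpack_fixed(line: str) -> dict:
    """Recover the fields by position alone; no separators are needed."""
    out, pos = {}, 0
    for name, width in FIELDS:
        out[name] = line[pos:pos + width].rstrip()
        pos += width
    return out


def pack_variable(record: dict, sep: str = "|") -> str:
    """Store each field at its natural length, marking boundaries with a separator."""
    return sep.join(str(record[name]) for name, _ in FIELDS)


def unpack_variable(line: str, sep: str = "|") -> dict:
    """Recover the fields by splitting on the separator."""
    return dict(zip((name for name, _ in FIELDS), line.split(sep)))


yeats = {"Surname": "Yeats", "Forename": "William Butler", "YoB": 1865, "YoD": 1939}
print(repr(pack_fixed(yeats)))     # always 34 characters, padded with spaces
print(repr(pack_variable(yeats)))  # only as long as the data requires
```

The fixed-length form wastes space on padding and silently truncates anything longer than the declared width, but any field can be found by simple arithmetic; the variable-length form is compact but must be scanned for separators.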
A database consisting of such a file is often called a flat-file database because of its two-dimensional structure. Contrast the following.
B. Relational databases
A relational database broadly speaking consists of a coherent set of explicitly related flat-files or tables. A relational scheme of two such tables is illustrated below:
Database: Poets and poems
Table 1: Poets
| AuthorID | Surname | Forename | YoB | YoD | PoB | PoD |
| shakw | Shakespeare | William | 1564 | 1616 | Stratford-upon-Avon | Stratford-upon-Avon |
| byrol | Byron, Lord | George Gordon | 1788 | 1824 | London | Missolonghi |
| yeatw | Yeats | William Butler | 1865 | 1939 | Dublin | Cap Martin |
Table 2: Poems
| AuthorID | Title | Volume | Year written | Year published | First line |
| yeatw | To the Rose upon the Rood of Time | The Rose | ? | 1893 | Red Rose, proud Rose, sad Rose of all my days! |
| yeatw | After Long Silence | Words for Music | 1929 | 1931 | Speech after long silence; it is right, |
| shakw | Sonnet XVIII | | ? | ? | Shall I compare thee to a summer's day? |
The common field, AuthorID, allows matching entries in the two tables to be related automatically, by the database software. The division of the data into tables avoids costly and error-prone repetition, for example of biographical data for each poet each time a poem is listed. More importantly, the relational structure allows us to represent complex data structures in such a way that the data may be viewed efficiently from numerous perspectives, some or many of which may not have been anticipated by the designer.
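How the common field relates the two tables can be tried out with the sqlite3 module from the Python standard library. The sketch below is one possible rendering, not the only one: the lower-case table and column names are adaptations of the headings above, only some of the columns are included, and unknown years are stored as empty (NULL) values.

```python
import sqlite3

# An in-memory database, discarded when the program ends.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE poets (
        author_id TEXT PRIMARY KEY,
        surname   TEXT,
        forename  TEXT,
        yob       INTEGER,
        yod       INTEGER
    );
    CREATE TABLE poems (
        author_id      TEXT REFERENCES poets(author_id),
        title          TEXT,
        year_published INTEGER
    );
""")

# Biographical data is entered once per poet, however many poems are listed.
conn.executemany("INSERT INTO poets VALUES (?, ?, ?, ?, ?)", [
    ("shakw", "Shakespeare", "William", 1564, 1616),
    ("byrol", "Byron, Lord", "George Gordon", 1788, 1824),
    ("yeatw", "Yeats", "William Butler", 1865, 1939),
])
conn.executemany("INSERT INTO poems VALUES (?, ?, ?)", [
    ("yeatw", "To the Rose upon the Rood of Time", 1893),
    ("yeatw", "After Long Silence", 1931),
    ("shakw", "Sonnet XVIII", None),
])


def poems_by(surname: str) -> list[str]:
    """Titles of all poems whose author_id matches a poet with this surname."""
    rows = conn.execute(
        """
        SELECT poems.title
        FROM poems JOIN poets ON poems.author_id = poets.author_id
        WHERE poets.surname = ?
        ORDER BY poems.title
        """,
        (surname,),
    )
    return [row[0] for row in rows]


print(poems_by("Yeats"))
```

The same two tables could equally be queried from the other direction, for example to list the birthplace of the author of each poem, without any change to the stored data.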
C. Text files
For text files, the logical structure is more difficult to represent in terms of a physical structure of files, records and fields. Poetry written in highly regular forms such as the sonnet comes closest to a predictable structure, but even there multiple overlapping hierarchies will cause problems. Physically electronic text tends to be segmented into internally unstructured files, with the logical structure indicated by meta-textual markup, just as in HTML paragraphs are indicated by the <p> … </p> tagging element. More on this later.
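As a small illustration of logical structure carried by markup rather than by records and fields, the following Python sketch uses the standard html.parser module to recover the paragraphs from a fragment of HTML. The class and function names (ParagraphCollector, paragraphs) are hypothetical, and the sketch assumes simple, well-formed, non-nested paragraph tags.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text found between <p> and </p> tags."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._inside = True
            self.paragraphs.append("")  # start a new, empty paragraph

    def handle_endtag(self, tag):
        if tag == "p":
            self._inside = False

    def handle_data(self, data):
        # Text outside any paragraph is ignored.
        if self._inside:
            self.paragraphs[-1] += data


def paragraphs(markup: str) -> list[str]:
    collector = ParagraphCollector()
    collector.feed(markup)
    collector.close()
    return [p.strip() for p in collector.paragraphs]


print(paragraphs("<p>First paragraph.</p><p>Second <em>paragraph</em>.</p>"))
```

The file itself remains an unstructured run of characters; the structure exists only because the software agrees to interpret the tags.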
D. Image and sound files
Digital images and sounds require very different schemes of organisation. These are not covered here.
V. Data representation
A. Bits and numbering systems
Internally, within the computer, data is most conveniently represented as a series of numbers based on the binary or “base-2” numbering system. Most modern computers organise the binary “digits” (bits) in groups of 8, each of which is called a byte. A byte may take on any of 256 values (0 to 255), i.e. 2^8. Most commonly the value of a byte is expressed in hexadecimal (base-16).
A progression of binary, decimal and hexadecimal equivalents will illustrate how the numbering systems relate to each other:
| Binary number | Decimal value | Hexadecimal value |
| 00000000 | 0 | 0 |
| 00000001 | 1 | 1 |
| 00000010 | 2 | 2 |
| 00000011 | 3 | 3 |
| 00000100 | 4 | 4 |
| 00000101 | 5 | 5 |
| 00000110 | 6 | 6 |
| 00000111 | 7 | 7 |
| 00001000 | 8 | 8 |
| 00001001 | 9 | 9 |
| 00001010 | 10 | A |
| 00001011 | 11 | B |
| 00001100 | 12 | C |
| 00001101 | 13 | D |
| 00001110 | 14 | E |
| 00001111 | 15 | F |
| 00010000 | 16 | 10 |
| … | … | … |
| 11111111 | 255 | FF |
Thus the value of a byte in hexadecimal may range from
00 to FF. You have already encountered hexadecimal
in the HTML <BODY> tag attribute for background colour,
BGCOLOR, where the colour white is given as
FFFFFF, i.e. FF (255) units of red, FF
units of green and FF units of blue; red as
FF0000; black as 000000; and so forth.
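The conversions among binary, decimal and hexadecimal, and the reading of a colour value as three bytes, can be checked with a short Python sketch. The function names (byte_forms, colour_to_rgb) are hypothetical. Note that this sketch always writes a byte as two hexadecimal digits (0A rather than A), as is done in colour values.

```python
def byte_forms(value: int) -> tuple[str, str]:
    """Return the 8-digit binary and 2-digit hexadecimal forms of one byte."""
    if not 0 <= value <= 255:
        raise ValueError("a byte holds values from 0 to 255 only")
    return format(value, "08b"), format(value, "02X")


def colour_to_rgb(code: str) -> tuple[int, int, int]:
    """Split a six-digit hexadecimal colour into red, green and blue units."""
    code = code.lstrip("#")
    # Each pair of hexadecimal digits is one byte: red, then green, then blue.
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))


print(byte_forms(10))           # binary and hexadecimal forms of decimal 10
print(int("00010000", 2))       # a binary string read back as a decimal number
print(colour_to_rgb("FFFFFF"))  # white
print(colour_to_rgb("FF0000"))  # red
```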
B. Bytes and character encoding
Since internally data is represented as numbers, we require a consistent assignment of alphanumeric (alphabetic and numeric) characters to hexadecimal numbers. This is not a simple problem to solve, as there are a very large number of graphic characters used to represent the languages of the world, even if one considers only the alphabetic languages. In the early days of computing, when the only recognised character set was that of English, manufacturers adopted the American Standard Code for Information Interchange (better known as ASCII), which allowed for 128 character positions, including several “control characters” used to send commands initiating various functions of the output devices then available. Later, the company IBM independently extended ASCII to include positions for the accented characters of the major Western European languages, in so-called Extended ASCII, represented below:
The first 32 positions (hexadecimal 00 to 1F) here are for control
characters (used, originally, to operate a mechanical teletype
machine); the bottom 128 are the extensions made by IBM. The order of
these is arbitrary. Many of the world's languages obviously cannot be
represented using such a limited scheme.
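The assignment of characters to numbers in the original 128-position ASCII scheme can be inspected in Python. This is a minimal sketch with hypothetical function names; it covers only standard ASCII, not the IBM extensions, and it counts position 7F (delete) as a control character in addition to the first 32 positions mentioned above.

```python
def ascii_code(ch: str) -> int:
    """Return the ASCII position of a single character, or fail if it has none."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} lies outside the 128 ASCII positions")
    return code


def is_control(code: int) -> bool:
    """True for the control positions 00 to 1F, and for 7F (delete)."""
    return code < 0x20 or code == 0x7F


def fits_ascii(text: str) -> bool:
    """True if every character of the text has an ASCII position."""
    return text.isascii()


print(ascii_code("A"), format(ascii_code("A"), "02X"))  # decimal and hexadecimal
print(is_control(0x0A))                                 # line feed is a control character
print(fits_ascii("Yeats"), fits_ascii("Brontë"))
```

The last line shows the limitation in practice: a single accented letter is enough to take a name outside the standard ASCII range.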
C. Unicode
Unicode is a relatively new “character coding system designed
to support the interchange, processing, and display of the written
texts of the diverse languages of the modern world. In addition, it
supports classical and historical texts of many written
languages.” Information about Unicode may be obtained from the
Unicode Consortium Home Page.
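Python strings are sequences of Unicode characters, so the number assigned to each character can be shown directly. The sketch below is illustrative only; the function name describe is hypothetical. It also shows the bytes produced by UTF-8, which is one of several ways of storing Unicode characters as bytes and is not otherwise discussed in this handout.

```python
def describe(text: str) -> list[tuple[str, str, str]]:
    """For each character: the character, its Unicode number, and its UTF-8 bytes in hexadecimal."""
    return [
        (ch, f"U+{ord(ch):04X}", ch.encode("utf-8").hex().upper())
        for ch in text
    ]


for row in describe("Aéα"):
    print(row)
```

The unaccented letter keeps the number it has in ASCII, while the accented and Greek letters receive numbers of their own rather than competing for the same limited set of positions.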
VI. Glossary
Following is a list of current terms with which you should be
entirely familiar; each is supplied with a link to the Free On-Line Dictionary of
Computing.
- Boolean search. A query using the Boolean operators AND, OR, and NOT, and parentheses to
construct a complex condition from simpler criteria.
- computer. A machine that can be programmed to manipulate
symbols.
- data. Numbers,
characters, images, or other method of recording, in a form which can
be assessed by a human or (especially) input into a computer, stored
and processed there, or transmitted.
- ASCII;
cf. Unicode, below. The basis of character sets used in almost all
present-day computers; a means of representing letters of the Roman
alphabet (including accented characters for the standard Western
European languages), numbers, marks of punctuation, miscellaneous
Greek (without accents or breathings) and a few other characters.
- binary. A
number representation consisting of zeros and ones implemented by
nearly all computers because of its ease of implementation using
digital electronics.
- digital. A
description of data which is stored or transmitted as a sequence of
discrete symbols from a finite set; most commonly this means binary
data represented using electronic or electromagnetic signals.
- hexadecimal. A number representation using the digits 0–9,
with their usual meaning, plus the letters A–F (or a–f) to
represent hexadecimal digits with values of (decimal) 10 to 15.
- Unicode; cf. ASCII, above. A system of character-representation designed to address the limitations of ASCII by covering all major modern written languages with characters that have exactly one encoding of uniform size.
- hardware. The physical, touchable, material parts of a computer or other system; cf. software.
- CD-ROM. A non-volatile optical data storage medium using the same physical format as audio compact discs, readable by a computer with a CD-ROM drive.
- CPU (central processing unit). The part of a computer which controls all the other parts.
- disk drive. A peripheral device that reads and writes hard discs or floppy discs. Hard (or fixed) disc: one or more rigid magnetic disks rotating about a central axle with associated read/write heads and electronics, used to store data; a form of non-volatile storage, usually not removable. Floppy disc: a small, portable plastic disk coated in a magnetisable substance used for storing computer data, readable by a computer with a floppy disk drive; a form of non-volatile storage, currently with a capacity of 1.44 MB.
- memory. Usually used synonymously with Random Access Memory (RAM) or Read-Only Memory (ROM), but in the general sense it can be any device that can hold data in machine-readable format. RAM is usually applied to internal working memory that is volatile, i.e. maintained only as long as the computer is switched on; ROM to the non-volatile internal memory chip that stores information and software used by the machine when it initialises.
- monitor. The display device of a computer, consisting of a CRT (cathode-ray tube) and associated electronics, which is connected to the output of a video card in the computer.
- peripheral. Any part of a computer other than the CPU or working memory, i.e. disks, keyboards, monitors, mice, printers, scanners, tape drives, microphones, speakers, cameras etc.
- network. Hardware and software data communication systems.
- Internet. A world-wide communications system that includes commercial (.com or .co), university (.ac or .edu) and other research
(.org, .net) and military (.mil) networks and spans many different physical networks.
- World Wide Web (WWW). An application of the Internet that provides a client-server information retrieval system which implements hypertextual links among physically distributed servers.
- software. The instructions executed by a computer, as opposed to the physical device on which they run (the “hardware”); “program” (rather than “programme”) is a common synonym.
- application program. A complete, self-contained program that performs a specific function directly for the user.
- font. A set of images representing the characters from some particular character set in a particular size and typeface.
- GUI (graphical user interface). The use of pictures (icons) rather than words to represent manipulable objects, such as programs and data, to the user of a computing system.
- hypertext or, more generally, hypermedia. A collection of documents (or “nodes”) containing cross-references or “links” which allow the reader to move easily from one document to another; usually these are WWW documents.
- meta-. A prefix meaning one level of description “higher”, i.e. more abstract; thus in computing a “metalanguage” is an encoding language used to add information about data representing a natural language, and “metadata” is data about data.
- operating system. The low-level software which handles the interface to peripheral hardware, schedules tasks, allocates storage, and presents a default interface to the user when no application program is running.
- analogue computer; contrast with digital, above.
revised October 2007