KCL CCH Minor programme AV1000


AV1000
Fundamentals of the digital humanities
Cleaning up your corpus: an exercise in thinking with rules

The following is an exercise in applying a specialised “language” within Microsoft Word to identify character-strings in a text file automatically and replace them with other character-strings. You should also make use of Word's internal documentation.

Devising and following rules for transforming data

In manipulating alphanumeric data one occasionally needs to transform a large number of identical data-elements in the same way. It is very tedious to do this manually, one instance at a time, so automating the process is quite a good idea. The standard mechanism is called “search-and-replace” or simply “replace”. Wordprocessors and text-editors are usually used for the job, although there are programs specifically written for the task.

Finding the right rule, or the right sequence of commands, can be tricky, and two cautions apply.

The first caution applies to computing in general: the replace command will do exactly what you tell it to do, which is often rather different from what you want it to do. You are therefore well advised to try out a replace operation one instance at a time until you are convinced that there are none which conform to the specifications you have given but are not instances of the thing you want changed. (Word, for example, gives you two buttons in the Replace dialogue-box, “Replace” and “Replace All”; the former changes only the next instance, the latter all of them.) Also be sure to save a copy of a file you are changing before you make each series of changes; the Undo command is a great help here, but in a complex series of changes the number of Undos can lead to confusion.

The second caution follows from the first. This is that the way you actually go about transforming your data will often have to be rather different from the way you would ordinarily describe the change. For example, let's say I have manually numbered some items, 1, 2, 3, … 9, 10, then I want to insert a new item 4. Ordinarily I might think first, so as to make room for the new 4, to change the old 4 into 5, the 5 into 6 and so forth up to the 10, which would become 11. But if I do this with the computer exactly as described, the following will happen:

original sequence   1  2  3  4  5  6  7  8  9  10
1st change          1  2  3  5  5  6  7  8  9  10
2nd                 1  2  3  6  6  6  7  8  9  10
3rd                 1  2  3  7  7  7  7  8  9  10
4th                 1  2  3  8  8  8  8  8  9  10
5th                 1  2  3  9  9  9  9  9  9  10
6th                 1  2  3  10 10 10 10 10 10 10
7th                 1  2  3  11 11 11 11 11 11 11
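
The collapse shown above can be reproduced, and one way around it tried out, with a few lines of Python. This is only an illustrative sketch: the function name renumber and the variable names are invented here, and the numbered items are modelled as a plain list of integers rather than as text in a Word document. It applies the rule “change n into n+1” to every matching item, one rule at a time, first in ascending order (as described in the paragraph above) and then in descending order, starting with the 10.

```python
def renumber(items, order):
    """Apply the rule 'change n into n+1' once for each n in `order`.

    Each rule is applied to ALL matching items, as Replace All would do.
    (Hypothetical helper, written for this illustration only.)
    """
    items = list(items)  # work on a copy; leave the original untouched
    for n in order:
        items = [n + 1 if x == n else x for x in items]
    return items


original = list(range(1, 11))  # 1, 2, 3, ... 10

# 4 -> 5, then 5 -> 6, ... then 10 -> 11: every change feeds the next one.
ascending = renumber(original, range(4, 11))

# 10 -> 11, then 9 -> 10, ... then 4 -> 5: no change is ever re-changed.
descending = renumber(original, range(10, 3, -1))

print("ascending: ", ascending)
print("descending:", descending)
```

Working downward from the highest number leaves a gap at 4 for the new item, because no rule ever touches a number that an earlier rule has already produced.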

The example given below for you to apply is considerably more complex, but it follows the same principle. The computer scientist Edsger Dijkstra used to tell a story that illustrates it much more memorably; see “A Parable”.

Replace in Word

The simplest replace command in Word specifies the entity to be replaced explicitly, e.g. in English, “change ‘4’ into ‘5’”. Often, however, what you need to do is of the form “change ‘4’ followed by any character whatever into ‘5: ’”. A symbol devised to represent that “any character whatever” is called a “wildcard”. Wildcards may also specify a more limited set of characters, e.g. “any numerical character”, “any alphabetic character between a and m” or “any mark of punctuation”. The wildcards for Word are given below, under “Wildcards in Word”.
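
The same idea can be tried outside Word. The sketch below uses Python's re module, whose notation is similar to but not the same as Word's (in Python the “any character whatever” wildcard is a full stop, not a question mark). The sample line is invented for the illustration. It shows both the general wildcard and a more limited set of characters, and it also shows the first caution at work: the general wildcard changes more than was intended.

```python
import re

# An invented sample line; the "14" is NOT an item number we want changed.
line = "4) apples, 4. pears, 14 plums"

# "4 followed by any character whatever" becomes "5: ".
# In Python's notation "." stands for any single character.
any_char = re.sub(r"4.", "5: ", line)

# "4 followed by a closing bracket or a full stop" becomes "5:".
# The square brackets limit the wildcard to the characters listed.
limited = re.sub(r"4[).]", "5:", line)

print(any_char)  # the "4 " inside "14 plums" has been changed too
print(limited)   # only the two item numbers have been changed
```

Note that whatever character the wildcard matches is replaced along with the 4, which is why the first result has lost its bracket and full stop.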

Especially with text files you often want to affect the formatting, as in the exercise given below. Then you require special symbols to represent the formatting effects, e.g. for paragraphing, tabs, spaces, page-breaks. These are given under “Wildcards in Word”, items 2 and 3.

Exercise: cleaning up text copied from the Web

The case in point is transforming (i.e. cleaning up) text copied by cut-and-paste from the Web so that you can concord it effectively. The most common problems with such text are multiple contiguous spaces and multiple contiguous EOL (end-of-line) characters, perhaps with one or more spaces intermingled.

The following find-and-replace commands will eliminate these. In Word select Replace under the Edit menu, tick “Use wildcards”, then enter the following in the order given. Note that the symbol • is used here to represent a blank space; you should type the space character. Word will show you the hidden formatting characters if you select Tools: Options: View and then check the right boxes under “Formatting marks”; this will make it easier for you to follow what is happening.

  1. Change all multiple, contiguous spaces into a single space:

    Find what: •(•@)

    Replace with: •

  2. Change all groups of EOLs intermingled with spaces (an EOL followed by one or more spaces followed by an EOL) into two contiguous EOLs:

    Find what: (^13)•@(^13)

    Replace with: ^p^p

  3. Change all multiple, contiguous EOLs into two EOLs:

    Find what: (^13)(^13)@

    Replace with: ^p^p

Can you follow what these symbol-sequences are describing? You may have to experiment and adapt the above to suit the copied text you are attempting to transform. Again, if you turn on display of the ordinarily hidden formatting characters it will be easier to figure out exactly what needs doing.
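
For comparison, the three steps can be sketched in Python's re module. This is an approximation under stated assumptions, not a description of what Word does internally: the function name clean_web_text and the sample text are invented here, and the EOL is assumed to be the newline character "\n" (Word's wildcard searches use ^13 instead).

```python
import re


def clean_web_text(text: str) -> str:
    """Apply the three clean-up steps in the order given in the exercise.

    Hypothetical helper; assumes every EOL is a single "\n" character.
    """
    # Step 1: a space followed by one or more spaces -> a single space.
    text = re.sub(r" {2,}", " ", text)
    # Step 2: EOL, one or more spaces, EOL -> two contiguous EOLs.
    text = re.sub(r"\n +\n", "\n\n", text)
    # Step 3: an EOL followed by one or more EOLs -> two EOLs.
    text = re.sub(r"\n{2,}", "\n\n", text)
    return text


# Invented sample: doubled spaces, and blank lines with a stray space.
sample = "First  paragraph.\n \n\n\nSecond   paragraph."
cleaned = clean_web_text(sample)
print(repr(cleaned))
```

The order matters here as well: step 2 only has to cope with the pattern it describes because step 1 has already reduced every run of spaces, and step 3 tidies up the extra EOLs that step 2 leaves behind.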

Wildcards in Word

1. Type wildcards for items you want to find

You can fine-tune a search by using any of the following wildcards. On the Edit menu, click Find or Replace. If you don't see the Use wildcards check box, click More. Then select the Use wildcards check box and type the wildcard character and any other text in the Find what box.

To find                                                                  Use this wildcard  Examples
Any single character                                                     ?                  s?t finds "sat" and "set".
Any string of characters                                                 *                  s*d finds "sad" and "started".
One of the specified characters                                          [ ]                w[io]n finds "win" and "won".
Any single character in this range                                       [-]                [r-t]ight finds "right" and "sight". Ranges must be in ascending order.
Any single character except the characters inside the brackets           [!]                m[!a]st finds "mist" and "most", but not "mast".
Any single character except characters in the range inside the brackets  [!x-z]             t[!a-m]ck finds "tock" and "tuck", but not "tack" or "tick".
Exactly n occurrences of the previous character or expression            {n}                fe{2}d finds "feed" but not "fed".
At least n occurrences of the previous character or expression           {n,}               fe{1,}d finds "fed" and "feed".
From n to m occurrences of the previous character or expression          {n,m}              10{1,3} finds "10", "100", and "1000".
One or more occurrences of the previous character or expression          @                  lo@t finds "lot" and "loot".
The beginning of a word                                                  <                  <(inter) finds "interesting" and "intercept", but not "splintered".
The end of a word                                                        >                  (in)> finds "in" and "within", but not "interesting".
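
Several of the wildcards listed above have close counterparts in the regular-expression notation used by Python's re module, which gives a convenient way of checking one's understanding of the examples. The pairings below are hand-made and approximate: they are assumptions for the purpose of illustration, not an official correspondence, and the names EQUIVALENTS and finds are invented here.

```python
import re

# (Word wildcard pattern, hand-translated Python pattern, words it should find)
EQUIVALENTS = [
    ("s?t",       r"s.t",        ["sat", "set"]),
    ("s*d",       r"s.*d",       ["sad", "started"]),
    ("w[io]n",    r"w[io]n",     ["win", "won"]),
    ("[r-t]ight", r"[r-t]ight",  ["right", "sight"]),
    ("m[!a]st",   r"m[^a]st",    ["mist", "most"]),
    ("t[!a-m]ck", r"t[^a-m]ck",  ["tock", "tuck"]),
    ("fe{1,}d",   r"fe{1,}d",    ["fed", "feed"]),
    ("lo@t",      r"lo+t",       ["lot", "loot"]),
    ("<(inter)",  r"\b(inter)",  ["interesting", "intercept"]),
    ("(in)>",     r"(in)\b",     ["in", "within"]),
]


def finds(python_pattern: str, word: str) -> bool:
    """Return True if the pattern is found anywhere in the word."""
    return re.search(python_pattern, word) is not None


for word_pattern, python_pattern, words in EQUIVALENTS:
    for word in words:
        print(f"{word_pattern:10} {python_pattern:12} {word:12} {finds(python_pattern, word)}")
```

The main differences to notice are that Python writes "." where Word writes "?", "^" inside brackets where Word writes "!", "+" where Word writes "@", and "\b" for both the beginning and the end of a word.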


2. Type codes for items you want to find and replace

You can find and replace the following special characters and document elements. Just click Find or Replace on the Edit menu, and then type codes for the items in the Find what and Replace with boxes. As specified, some codes work only if the Use wildcards option is on or off.

Codes that work in the Find what or Replace with box

To specify                Type   Notes
Paragraph mark            ^p     Doesn't work in the Find what box when wildcards are on.
Tab character             ^t
ANSI or ASCII characters  ^0nnn  nnn is the character code.
Em dash (—)               ^+
En dash (–)               ^=
Caret character           ^^
Manual line break         ^l
Column break              ^n
Manual page break         ^m     Also finds or replaces section breaks when wildcards are on.
Nonbreaking space         ^s
Nonbreaking hyphen        ^~
Optional hyphen           ^-

Codes that work in the Find what box only

To specify    Type
Comment mark  ^a
Graphic       ^g

Codes that work in the Find what box only (when wildcards are off)

To specify     Type  Notes
Any character  ^?
Any digit      ^#
Any letter     ^$
Footnote mark  ^f
Endnote mark   ^e
Field          ^d
Section break  ^b
White space    ^w    Any combination of regular and nonbreaking spaces, and tab characters.

Codes that work in the Replace with box only

To specify                     Type
Windows Clipboard contents     ^c
Contents of the Find what box  ^&
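
The ^& code is useful when you want to keep what was found and add something around it. As a rough analogy only, Python's re module offers \g<0> in the replacement text, which likewise stands for whatever the search pattern matched. The sample text below is invented, and the comparison with Word is an assumption made for the sake of illustration.

```python
import re

# Invented sample text.
text = "see items 4 and 10"

# Find each run of digits and put square brackets round it.
# \g<0> in the replacement stands for the text that was matched,
# much as ^& does in Word's Replace with box.
wrapped = re.sub(r"[0-9]+", r"[\g<0>]", text)

print(wrapped)
```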


3. When I use wildcard characters, Word can't search for certain items.

To use wildcards, click Find or Replace on the Edit menu, click More, and then select the Use wildcards check box. When this check box is selected, Word doesn't recognize the codes you enter in the Find what box for the following items: endnote and footnote marks, fields, paragraph marks, section breaks, or white space. To search for these items, you can type the following substitute codes in the Find what box. (Note that no substitute code is available for fields.)

To specify                     Type        Notes
Footnote mark or endnote mark  ^2          Word can't distinguish between footnote marks and endnote marks.
Paragraph mark                 ^13
Section break                  ^12         Word will search for manual page breaks as well as section breaks.
White space                    space {1,}  Type a space and then {1,}.

revised October 2007