

AV1000
Fundamentals of the digital humanities
How to find things (and people) online

  1. Introduction: old and new
  2. Resources
  3. Standard reference sources online
  4. Four techniques for locating online sources
  5. Tricks of the trade
  6. Further reading

I. Introduction: old and new

Finding scholarly or other material online is in some respects no different from the analogous process with printed sources in a conventional library. In both you use a combination of three finding-aids: keyword-searches, hierarchical lists and an assortment of clues you pick up along the way. Bibliographic skills may involve intuition and imagination, but most of what is involved may be taught and so learned.

In the library, these finding-aids take the form of the catalogue, giving access by author, subject and title keywords; bibliographies, in which references are organised according to the agreed-upon divisions of the subject; and a number of secondary references you pick up in review journals, articles and books located during the search. You may supplement these with ancillary reference works, such as Books in Print and the Humanities Citation Index. Experience and imagination will help you develop enough bibliographic skill to find the major and many of the minor sources for any subject within about two weeks of work in a good research library. A temporary apprenticeship to a skilled reference librarian can save many hours if not days of initial blundering.

The library increasingly overlaps with the online world in its reliance on electronic catalogues and collections. What follows is specifically about the World Wide Web, but many of the techniques and cautions that apply there are also relevant to searches on CD-ROM, for example.

On the Web you also find keyword-searching devices, hierarchical lists and clues. These are discussed below. The Web differs from the conventional library primarily in its mutability and its openness.

In brief, the shift in the nature of the medium gives strong mediological reasons for thinking it fundamentally mistaken to expect the online mechanism to serve criteria developed for print. The pragmatic question is how we exploit the new medium effectively and responsibly, both as publishers of scholarly material and as users of others' work.

There is a set of exercises on this topic.

II. Resources

Mostly when looking online for resources in aid of academic work we think of the products of research, and so look by subject, keywords or name, e.g. “classical literature”, “Aeneas”, “Perseus Project”. Another approach to locating knowledge about a subject is, however, to look for researchers in the chosen area by name or through their institutional affiliations. Many publishing scholars now put versions of their work online. Finding their home-pages can often yield great riches in the form of articles in digital form, e.g. as PDF or HTML files. Their CVs will yield bibliographic references to articles and books you can locate in libraries or, sometimes, on the Web-sites of others if you search by the name of the article. In many cases where an author has not put his or her articles, book chapters and the like online, these will be available via “electronic journal” offerings, such as JSTOR and Project MUSE. These in turn are made available through the electronic journals collections of many libraries, such as the Ejournals A–Z List at King's.

Articles you find in digital form, either when researching their topic or by chance when looking for something else, are easily downloaded and collected. If you collect systematically in one or more topical areas, give the files reasonable names and organize them for easy browsing, you can without much effort quickly compile a highly useful local resource. This can be searched manually or by means of a desktop searching utility, such as X1 or Google Desktop Search. (Note: some PDFs cannot be searched digitally because only page-images have been saved.) Following is a snapshot of such a local collection, saved into a dedicated folder on a hard disc. Note the file-naming convention adopted here.

[Figure: list of collection contents]
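For readers without a desktop searching utility, the idea of searching such a local collection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `search_collection` and the list of file suffixes are invented here, and the sketch reads plain-text and HTML files only. PDFs are skipped, since extracting their text needs additional software.

```python
from pathlib import Path


def search_collection(folder, term, suffixes=(".txt", ".html", ".htm")):
    """Return the names of files under `folder` whose text contains `term`.

    The match is case-insensitive. Only files with one of the given
    suffixes are read; everything else (PDFs included) is ignored.
    """
    term = term.lower()
    hits = []
    for path in Path(folder).rglob("*"):
        if path.is_file() and path.suffix.lower() in suffixes:
            text = path.read_text(encoding="utf-8", errors="ignore")
            if term in text.lower():
                # Report the path relative to the collection folder.
                hits.append(path.relative_to(folder).as_posix())
    return sorted(hits)
```

Calling `search_collection("my-articles", "Aeneas")`, where `my-articles` is a hypothetical folder name, would list every readable file in that folder and its sub-folders that mentions the word.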

You can also on occasion find what you are looking for via the “Look inside the book” service implemented for new books by amazon.co.uk and its American parent amazon.com. This service is also available through a9.com, described in the next section.

The basic strategy for locating resources through other resources has much in common with older, pre-digital techniques: find one good source on a subject; look at its notes and references; look these up; find their notes and references; continue iteratively until you begin to find the same items and authors' names mentioned again and again. Look for other things the authors have written. It actually does not take terribly long to locate major sources in a subject this way.
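The iterative strategy just described can be written down as a short procedure. The following Python sketch is illustrative only: the function name `chase_references`, the `max_rounds` limit and the toy data are assumptions made for the example, and in practice the "references" are of course gathered by reading, not from a ready-made table.

```python
def chase_references(start, references, max_rounds=5):
    """Follow references outward from one good source.

    `references` maps each source to the list of sources it cites.
    Returns the sources in order of discovery, plus a tally of how often
    each one was cited by the sources examined.
    """
    seen = [start]
    cited = {}
    frontier = [start]
    for _ in range(max_rounds):
        next_frontier = []
        for source in frontier:
            for ref in references.get(source, []):
                cited[ref] = cited.get(ref, 0) + 1
                if ref not in seen:
                    seen.append(ref)
                    next_frontier.append(ref)
        if not next_frontier:
            # Only familiar items turned up in this round: stop.
            break
        frontier = next_frontier
    return seen, cited


# Hypothetical citation data, for illustration only.
toy = {"A": ["B", "C"], "B": ["C", "D"], "C": ["D"], "D": []}
found, tally = chase_references("A", toy)
```

The tally is the useful by-product: items cited again and again by the sources you have examined are likely to be the major ones.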

Again, do not forget physical libraries, and do actually use them: only a tiny fraction of worthy material is in digital form. Make a short-list of useful online library catalogues: in London, for example, those at the British Library, catalogue.bl.uk; the London School of Economics, catalogue.lse.ac.uk; King's College London, www.kcl.ac.uk/iss/library.

III. Standard reference sources online

Each field of study will have its own reference-sources in digital form. Some, such as the Stanford Encyclopedia of Philosophy or the Perseus Digital Library, are free to anyone. Others, such as the Grove Music Online (the full text of The New Grove Dictionary of Music and Musicians, The New Grove Dictionary of Opera and The New Grove Dictionary of Jazz), require a license. The Grove and many others are accessible via academic libraries, such as King's; see the Databases A-Z list.

General reference sources follow the same pattern. There are several free online encyclopedias, the most interesting of which is the Wikipedia. The best dictionary of the English language, the Oxford English Dictionary, is online and accessible to anyone in King's or via an Athens password from elsewhere.

IV. Four techniques for locating online sources

1. Search engines

Exhaustive indexes to pages on the World Wide Web are automatically maintained by several indexing mechanisms, otherwise known as “search engines” or “Web crawlers”. These indexes normally cover every word on every page, although they vary significantly in how often they are updated (for addition of new pages and removal of broken links), the speed with which they work and what percentage of the Web they actually cover. Three will be illustrated here. For others and for comparisons among them see below.

Finding what you want is often like searching for the proverbial needle in a haystack. Because the number of indexed pages is very large, the number of pages retrievable on a query for common words, such as “history”, is likely to be too great to be useful, e.g. 60,000,000. Three approaches to the problem are currently implemented:

  1. Google is the simplest—but perhaps the most effective. It relies on the behaviour of users: pages are ranked from the most to the least probable according to the number of other pages which link to them. Its accuracy is often startling. Though the basic interface is very simple, more sophisticated searching is also offered. Similar facilities are offered by AltaVista and numerous other search engines.
  2. a9.com uses a web-search engine but adds several other features, notably amazon.com's “Look inside the book” feature, Wikipedia and parallel display of images. Its facilities for searching printed books have been overtaken for many purposes by Google Books.
  3. The so-called “metasearch engines” manage several other search engines in an attempt to exploit the best characteristics of each; some in addition remove duplicates. See Meta-Search Engines for a discussion and evaluation.
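The ranking principle described in the first item above, ordering pages by the number of other pages that link to them, can be illustrated with a deliberately simplified Python sketch. The function name and the toy link data are assumptions made for this example; real search engines weigh links in far more elaborate ways, and nothing here should be read as the actual method of any of them.

```python
def rank_by_inlinks(links):
    """Order pages by how many other pages link to them.

    `links` maps each page to the list of pages it links to.
    Ties are broken alphabetically so the result is predictable.
    """
    counts = {page: 0 for page in links}
    for source, targets in links.items():
        for target in set(targets):      # count each linking page once
            if target != source:         # a page's link to itself is ignored
                counts[target] = counts.get(target, 0) + 1
    return sorted(counts, key=lambda page: (-counts[page], page))


# Hypothetical pages "a" to "d", for illustration only.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c", "c"]}
ranking = rank_by_inlinks(toy_web)
```

In the toy data page "c" comes first because three other pages point to it, while "d", to which nothing points, comes last.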

Formerly a few of the so-called “limited-area search engines” were maintained (e.g. in philosophy and classics) but have been discontinued due to the great labour required. Such engines controlled for the relevance of results by limiting the pages searched to a shortlist of well-chosen sites in a given subject-domain; the technique worked well when the search-term was relatively uncommon in the domain but common elsewhere.
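The principle of the limited-area engine, restricting results to a shortlist of well-chosen sites, is easy to show in miniature. The following Python sketch is an illustration only: the function name `limit_to_sites` is invented here, and the addresses in the example are placeholders, not a recommended shortlist.

```python
from urllib.parse import urlparse


def limit_to_sites(urls, shortlist):
    """Keep only the addresses whose host is on the shortlist.

    A host also qualifies if it is a sub-domain of a shortlisted site,
    so "www.example.org" passes when "example.org" is listed.
    """
    allowed = {site.lower() for site in shortlist}
    kept = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        if host in allowed or any(host.endswith("." + site) for site in allowed):
            kept.append(url)
    return kept


# Placeholder addresses, for illustration only.
results = [
    "http://www.example.org/plato/forms.html",
    "http://shop.example.com/plato-action-figure",
]
filtered = limit_to_sites(results, ["example.org"])
```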

The most difficult search is for an idea that can only be articulated as one or more phrases comprising very common words. When such a phrase can be exactly specified, such as Hamlet's “to be, or not to be”, you can solve the problem e.g. in Google by placing quotation marks around it, which tells the engine to look for those words in the given order. Ordinarily Google will discard the commonest words (chiefly the “closed-class” or function-words, such as articles, prepositions and the like), but you can insist on these by use of special operators. Other refinements are possible; see the Google page on its “Google Services and Tools” for details.
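The difference between an ordinary keyword query and a quoted phrase can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not a description of how Google works internally: the function names and the short stop-word list are invented for the example.

```python
import re

# A tiny illustrative list of very common words; real lists are longer.
STOP_WORDS = {"to", "be", "or", "not", "the", "a", "of", "and", "in"}


def tokens(text):
    """Split text into lower-case words, ignoring punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def keyword_match(query, page):
    """True if every query word that is not a stop word occurs in the page."""
    wanted = [word for word in tokens(query) if word not in STOP_WORDS]
    words = set(tokens(page))
    return all(word in words for word in wanted)


def phrase_match(phrase, page):
    """True if the words of the phrase occur in the page in the given order."""
    wanted, words = tokens(phrase), tokens(page)
    if not wanted:
        return False
    return any(words[i:i + len(wanted)] == wanted
               for i in range(len(words) - len(wanted) + 1))
```

With the stop words discarded, the query `to be, or not to be` has no words left and so matches any page at all, whereas `phrase_match` accepts only pages in which those six words occur together and in order.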

2. Lists

General subject-oriented lists of Web resources, intended for the Internet-browsing public, are maintained, for example by Yahoo at http://www.yahoo.com/. For the humanities, somewhat more focused lists are also available. The most notable ones are given in the course Bibliography. Academic departments at King's and elsewhere tend to offer more specialised lists, as do many individuals.

Those lists that are maintained frequently and carefully—an important proviso—can be very helpful, since they offer the convenience of subject classification and, one hopes, judicious filtering. Since they are built by hand and depend on someone else's judgment, however, they are never as exhaustive as the automatic indexes, and they grow out-of-date very quickly. They may omit on principle or through carelessness exactly what you need. They are, however, perhaps the best way of getting some idea of what is available.

3. Links

Whichever of the above you use, the time-worn technique of picking up and following a trail of references will prove invaluable. On the Web pages that you do find, look especially for items entitled “related links”, under which rubric authors of pages commonly gather references to relevant materials elsewhere.

4. Discussion groups and e-mail

Most if not all fields of study will have one or more online discussion groups. Although some of these are intended for advanced research, in general most may be joined without question, and the members of most will entertain reasonably intelligent questions about the topics they serve. There is no central list of these, but finding them is not difficult using the techniques outlined above. A place to begin is with the main discussion group for humanities computing, Humanist, which is run at King's. Once someone volunteers a useful comment, you are then free to write to the person privately. You can also ask further questions of the group.

V. Tricks of the trade

1. Sampling

When you have too many hits to handle and have done everything possible to eliminate irrelevant items, you need to sample a reasonable number. You will likely be able to depend to some degree on the ranking algorithm in the search-engine(s) you use, so sampling very frequently within the first 100 or so, even looking at all of these, is a good idea. Use the so-called “law of diminishing returns” to guide you: once irrelevant or repetitious hits start turning up in considerable number, abandon the attempt.
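The stopping rule can be stated precisely, as in the following Python sketch. Everything specific in it is an assumption made for illustration: the function name, the window of ten hits and the threshold of two useful hits are arbitrary choices, and `is_relevant` stands in for your own judgment as a reader, which no program supplies.

```python
def sample_hits(hits, is_relevant, window=10, min_useful=2, limit=100):
    """Examine ranked hits in order until returns diminish.

    Stops once the last `window` hits examined contain fewer than
    `min_useful` relevant ones, or after `limit` hits.
    Returns the relevant hits and the number of hits examined.
    """
    kept = []
    recent = []          # relevance of the most recently examined hits
    examined = 0
    for hit in hits[:limit]:
        examined += 1
        relevant = bool(is_relevant(hit))
        if relevant:
            kept.append(hit)
        recent.append(relevant)
        if len(recent) > window:
            recent.pop(0)
        if len(recent) == window and sum(recent) < min_useful:
            break        # diminishing returns: abandon the attempt
    return kept, examined
```

Loosening `window` or `min_useful` makes the sampling more patient; tightening them makes it give up sooner.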

It is crucial for you to keep in mind that no search-results will ever be complete in any useful sense, for the reasons given above. Even if you are looking for a specific word-form, you cannot argue cogently that it never occurs on the Web, since a moment after you perform the search a page with that form may be added. When you use contents of the Web as evidence, be very careful what you claim about that evidence.

2. Coping with broken links

A frequent problem in accessing the Web is an address (URL, or Uniform Resource Locator) that does not work. You will be told either that the corresponding page does not exist at that address or that the server cannot be found on the Internet. There are three common causes for this problem.

  1. error in transcription, i.e. a miscopied address. To spot one of these you need to know how a valid address is constructed—for which see an HTML manual, such as the one listed in the course Bibliography. Look particularly for errors in punctuation, e.g. a comma or forward slash (virgule) where a full-stop should be, correct them and try again.
  2. network or server error, i.e. a problem with the Internet, which may be temporarily unable to find a working address, or a problem with the particular server to which the URL refers. Try a few times immediately, then try again later.
  3. defunct address, i.e. one that is formally correct but refers to a page that has been moved. Good practice is to replace the moved page at the original address with a notice giving the new address, perhaps even with an automatic transfer after a few seconds. Good practice is not always followed, however. If the page you seek is still on the given server, then you can sometimes find it by successively chopping off segments of the URL beginning with the right-most one and proceeding to the left, trying each new address in turn and looking for clues.

    As an extreme but not uncommon example of a defunct address, let us suppose you were looking for the University of Kansas page on military history, for which you have been given the following URL:

    In some cases you may have to conclude that the page no longer exists for whatever reason and do the best you can with those you are able to find.
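The technique of successively chopping off segments of a defunct address, described under the third cause above, can be sketched in Python as follows. The function name `parent_urls` is invented for this illustration, and the address used in the example is a made-up placeholder, not the address of any actual page.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_urls(url):
    """List the shorter addresses to try, right-most segment removed first.

    Any query string or fragment is dropped; the last entry is the
    top level of the server.
    """
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split("/") if segment]
    candidates = []
    while segments:
        segments.pop()                       # chop off the right-most segment
        path = "/" + "/".join(segments)
        if segments:
            path += "/"
        candidates.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return candidates


# A made-up placeholder address, for illustration only.
for candidate in parent_urls("http://www.example.edu/history/military/index.html"):
    print(candidate)
```

Each address in the list is then tried in turn in the browser, looking on each page for clues to where the material has gone.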

VI. Further reading

For one useful guide to online resources see Search Engine Resources.

revised October 2007