a guide to electronic texts
Encoding and Text Formats
‘The purpose of encoding within a text’ writes Susan Hockey, ‘is to provide information which will assist a computer program to perform functions on that text’. [1] The E-Texts listed in the following bibliography are available in a variety of different formats – some are encoded and some are not.
Plain Text or ASCII Files
Plain text files, also called ASCII files, are not encoded. Documents in this format consist, as the name suggests, of plain text: there is no underlining, boldface or italics, and the text is of uniform size with no variety of font styles. The following example shows the first stanza of The Runaway Slave at Pilgrim’s Point, by Elizabeth Barrett Browning, in ASCII form:
[2]
The main advantage of ASCII text is that it is widely available. It can be read by the vast majority of software, and in 1972 it was adopted as an international standard. It is possible to perform simple searches on documents written in plain text, and many E-Texts are available in this form. A text in ASCII form, however, will appear less sophisticated than a word-processed document, and any formatting will be lost. This is an important consideration in poetry, for example, where the form and layout of the text can often make a significant contribution to the meaning.
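As a rough illustration of the kind of simple search that plain text allows, the following Python sketch reports the lines on which a given word occurs. The function name find_word and the three-line sample are hypothetical and are not taken from any E-Text in the bibliography.

```python
import re


def find_word(text: str, word: str) -> list[int]:
    """Return the 1-based numbers of the lines that contain `word` as a whole word."""
    # \b anchors the match at word boundaries, so "sea" does not match "season".
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if pattern.search(line)
    ]


# Hypothetical sample text, invented for this example.
sample = "The sea was calm.\nNo sail in sight.\nShe watched the sea all day."

print(find_word(sample, "sea"))  # prints [1, 3]
```

A search of this kind can only report where a word appears. It knows nothing about titles, stanzas or speakers, because a plain text file does not record them.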
HTML
HTML stands for Hypertext Markup Language. It is the language that most Internet browsers currently read and that is used to control the appearance of many web pages. HTML consists of ‘tags’, which a browser reads in order to arrange the appearance of the text on the screen. HTML tags can be used to instruct a browser to display visual features such as bold or italics. The following example shows a line of HTML code, followed by the text as it appears on screen:
<B>George Eliot</B>’s sixth novel is entitled <I>Middlemarch</I>.
George Eliot’s sixth novel is entitled Middlemarch.
HTML files are therefore more visually appealing than ASCII files and attempts can be made to maintain a sense of the original formatting of a document. Long documents are often easier to read on screen in HTML form than are text files.
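To make the distinction between tags and text more concrete, the sketch below uses the HTMLParser class from Python's standard library to separate the two in the George Eliot example. The class name TagLister is hypothetical. The sketch only lists the tags and recovers the plain text, and does not attempt to reproduce what a browser does when it displays a page.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Collect the opening tags and the text content of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.tags = []   # names of the opening tags, in order of appearance
        self.text = []   # pieces of text found between the tags

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case, so <B> arrives as "b".
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


markup = "<B>George Eliot</B>’s sixth novel is entitled <I>Middlemarch</I>."

parser = TagLister()
parser.feed(markup)
parser.close()  # flush any text still held in the parser's buffer

plain = "".join(parser.text)
print(parser.tags)  # prints ['b', 'i']
print(plain)
```

Once the tags are removed, what remains is the same sentence that a plain text file would contain. The tags add information about appearance only.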
SGML
SGML stands for Standard Generalized Markup Language, which is the parent language of HTML. There are, however, important distinctions between them. Whereas HTML uses tags to specify the way text appears on screen, SGML uses tags to describe the structure of the text and the separate units that make up this text, for example <title>, <poem>, <stanza> etc.
SGML is the markup language recommended by the TEI, as it allows for greater control of a text and enables scholars to perform much more complex and precise searches of a text. A search on a document in ASCII or HTML form could find all the examples of a certain word in the entire text, but a search performed on an SGML-encoded document could limit that search to lines spoken by an individual character. Institutions that adhere to the TEI guidelines will produce texts that are encoded in SGML. However, it should be noted that although most web browsers can read these files, additional software will be required to perform searches and sophisticated textual analyses.
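The idea of limiting a search to one speaker can be sketched in Python. Several assumptions apply. The sample is written as well-formed XML, standing in for SGML, because Python's standard library includes an XML parser but not an SGML one. The element names <scene>, <sp> and <l>, the who attribute, and the two speakers are invented for illustration, loosely in the manner of TEI markup, and do not come from any encoded text in the bibliography.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical encoded scene, invented for this example.
SAMPLE = """
<scene>
  <sp who="anna"><l>The letter came at dawn.</l><l>I burned it unread.</l></sp>
  <sp who="ben"><l>Then no letter can accuse you.</l></sp>
  <sp who="anna"><l>Ash tells its own story.</l></sp>
</scene>
"""


def lines_by_speaker(xml_text: str, speaker: str, word: str) -> list[str]:
    """Return the lines spoken by `speaker` that contain `word` as a whole word."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    root = ET.fromstring(xml_text)
    hits = []
    for speech in root.iter("sp"):
        if speech.get("who") != speaker:
            continue  # skip speeches that belong to other characters
        for line in speech.iter("l"):
            text = line.text or ""
            if pattern.search(text):
                hits.append(text)
    return hits


print(lines_by_speaker(SAMPLE, "anna", "letter"))
# prints ['The letter came at dawn.']
```

A plain word search over the same scene would also return the second speaker's line. The markup is what allows the search to be restricted to a single character.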
PDF
PDF stands for Portable Document Format, and a few E-Texts in this format are listed in the following bibliography. A PDF reader is required in order to view a PDF file. The most commonly used program is Adobe’s Acrobat Reader, which is available free of charge at http://www.adobe.com/acrobat
[1] Hockey, Susan. Electronic Texts in the Humanities (Oxford: Oxford University Press, 2000), p. 24.
[2] Barrett Browning, Elizabeth. ‘The Runaway Slave at Pilgrim’s Point’. Available at gopher://dept.english.upenn.edu/00/Courses/Curran202/Barrett/slave. Accessed 15 May 2002.