ENCODING GUIDELINES FOR REPRESENTATIVE POETRY


Poetry edited by
members of the Department of English at the
University of Toronto for publication by the
University of Toronto Press
from 1912 to 1967

(c) Ian Lancashire 1994

  1. SGML-encoded Root Files
  2. HTML Markup
  3. COCOA- or TACT-style Markup
  4. In the Context of an Electronic Library
  5. Sample Tagging Programs

Representative Poetry root files are encoded in Standard Generalized Markup Language (SGML), an ISO-sponsored syntax for the tagging of electronic documents, but are converted to Hypertext Markup Language (HTML) for display on the World Wide Web. SGML tags are useful for the exchange of electronic documents across many different computers and for the enrichment of texts with information. Although HTML employs the syntax of SGML, its tags are few in number and are designed to display files for browsing on the World Wide Web.

You have access on the World Wide Web only to the HTML encoded file.

SGML-encoded Root Files

Here is a typical SGML-encoded file, Blake's "Little Lamb":
<h1>The Lamb</h1>

<pmdv1 type="poem" n="8" rhyme="aabbccddee">
<heading>
THE LAMB
</heading>
<bibl> Blake 1789 </bibl>
<a href="utel_rp_sourcebib.html#Blake 1789">Source reference.</a>
<docDate> 1789 </docDate>
<docEedit> Ian Lancashire </docEedit>
<docEditr> Northrop Frye </docEditr>
<copytext> 3RP.II.278 </copytext>

<pmdv2 type="stanza" n="1">
<pmdv3 n="8-1"><a name="8">{ }{ } { }{ } Little Lamb, who made thee?</a></pmdv3>
<pmdv3 n="8-2">{ }{ } { }{ } Dost thou know who made thee?</pmdv3>
<pmdv3 n="8-3">{ }{ } Gave thee life, and bid thee feed</pmdv3>
<pmdv3 n="8-4">{ }{ } By the stream and o'er the mead;</pmdv3>
<pmdv3 n="8-5">{ }{ } Gave thee clothing of delight,</pmdv3>
<pmdv3 n="8-6">{ }{ } Softest clothing, woolly, bright;</pmdv3>
<pmdv3 n="8-7">{ }{ } Gave thee such a tender voice,</pmdv3>
<pmdv3 n="8-8">{ }{ } Making all the vales rejoice?</pmdv3>
<pmdv3 n="8-9">{ }{ } { }{ } Little Lamb, who made thee?</pmdv3>
<pmdv3 n="8-10">{ } { }{ } Dost thou know who made thee?</pmdv3>
</pmdv2>

<pmdv2 type="stanza" n="2">
<pmdv3 n="8-11">{ } { }{ } Little Lamb, I'll tell thee,</pmdv3>
<pmdv3 n="8-12">{ } { }{ } Little Lamb, I'll tell thee:</pmdv3>
<pmdv3 n="8-13">{ } He is called by thy name,</pmdv3>
<pmdv3 n="8-14">{ } For he calls himself a Lamb.</pmdv3>
<pmdv3 n="8-15">{ } He is meek, and he is mild;</pmdv3>
<pmdv3 n="8-16">{ } He became a little child.</pmdv3>
<pmdv3 n="8-17">{ } I a child, and thou a lamb.</pmdv3>
<pmdv3 n="8-18">{ } We are called by his name.</pmdv3>
<pmdv3 n="8-19">{ } { }{ } Little Lamb, God bless thee!</pmdv3>
<pmdv3 n="8-20">{ } { }{ } Little Lamb, God bless thee!</pmdv3>
</pmdv2>
</pmdv1>

SGML syntax requires that any tag be delimited or enclosed by angle or diamond brackets. Inside SGML tags, you will always find an element name (e.g., `pmdv1'), and sometimes attributes for that element (e.g., `type="poem"'). Most tags come in pairs, an opener and a closer. The closer tag normally consists only of the element name, without attributes, preceded by a slash (e.g., `</pmdv1>'). Each tag names a feature that applies to all words between the opener tag and the closer tag. Thus the sequence

<heading> THE LAMB
</heading>
indicates that the words `THE LAMB' are a heading. Tags relate to tagged text as variables do to values.

Some tags used in the poem `Little Lamb' mark divisions that nest inside one another. Thus the top-most division (`pmdv1') marks the entire poem. Its closer tag appears at the end. This tag contains the middle division, designating the stanzas repeated in the poem (tagged as `pmdv2'). The lowest level is the verse line, which always appears inside a stanza and is generally repeated (`pmdv3').

This literary structure, then, has verse lines always and only occurring within stanzas, and stanzas always and only occurring within poems.
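The nesting rule just stated can be checked mechanically. The following is a minimal sketch in Python, not part of the Representative Poetry toolchain; the function name `check_nesting` and the shortened sample are illustrative assumptions. It walks the `pmdv1', `pmdv2' and `pmdv3' tags with a stack and reports any division that opens in the wrong place or is never closed.

```python
import re

# Allowed parent for each division tag, following the structure described above.
# None means the tag must appear at the top level.
ALLOWED_PARENT = {"pmdv1": None, "pmdv2": "pmdv1", "pmdv3": "pmdv2"}

# Matches opener and closer tags for the three division elements only.
TAG = re.compile(r"<(/?)(pmdv[123])\b[^>]*>")

def check_nesting(text):
    """Return a list of problems found in the pmdv1/pmdv2/pmdv3 nesting."""
    problems = []
    stack = []  # names of the divisions currently open
    for match in TAG.finditer(text):
        closing = match.group(1) == "/"
        name = match.group(2)
        if closing:
            if not stack or stack[-1] != name:
                problems.append(f"unexpected </{name}>")
            else:
                stack.pop()
        else:
            parent = stack[-1] if stack else None
            if parent != ALLOWED_PARENT[name]:
                problems.append(f"<{name}> inside {parent or 'top level'}")
            stack.append(name)
    problems.extend(f"unclosed <{name}>" for name in stack)
    return problems

# A shortened, hypothetical version of the sample file above.
sample = """<pmdv1 type="poem" n="8">
<pmdv2 type="stanza" n="1">
<pmdv3 n="8-1">Little Lamb, who made thee?</pmdv3>
</pmdv2>
</pmdv1>"""
print(check_nesting(sample))
```

A well-formed poem yields an empty list; a verse line placed directly inside `pmdv1' is reported as a problem.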

There are two anchor tags:

<a href="utel_rp_sourcebib.html#Blake 1789">Source reference.</a>
<a name="8"> { }{ } { }{ } Little Lamb, who made thee?</a>
These mark cross-references. The anchor tag with an `href' attribute gives an address of another file to which the enclosed text refers. The anchor tag with a `name' attribute provides an address for any number of cross-references to it. Thus the first anchor tag refers to a file with another anchor tag having a `name' attribute with the value `Blake 1789'. The second illustrative anchor tag may be used by other anchor tags with `href' attributes as the target for a cross-reference. Here the `href' tag enables a reader to move quickly from the short source reference to a full bibliographical citation in a separate document; and the `name' tag marks the first line of the poem so that a first-line index has a target at which to aim a cross-reference.
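To make the two roles of the anchor tag concrete, here is a small Python sketch using the standard library's `html.parser`. The class name `AnchorCollector` and the helper `collect` are assumptions made for illustration. It gathers every `name' value as a possible target and splits every `href' value into the file addressed and the named anchor within it.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect cross-reference targets and links from <a> tags."""

    def __init__(self):
        super().__init__()
        self.targets = set()  # values of `name' attributes
        self.links = []       # (file, fragment) pairs from `href' attributes

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        if attrs.get("name"):
            self.targets.add(attrs["name"])
        if attrs.get("href"):
            # Everything after `#' names an anchor inside the target file.
            file, _, fragment = attrs["href"].partition("#")
            self.links.append((file, fragment))

def collect(html_text):
    collector = AnchorCollector()
    collector.feed(html_text)
    return collector

anchors = collect(
    '<a href="utel_rp_sourcebib.html#Blake 1789">Source reference.</a>\n'
    '<a name="8">Little Lamb, who made thee?</a>'
)
print(anchors.links, anchors.targets)
```

A link resolves only if the file it addresses contains an anchor whose `name' equals the fragment, which is the relationship a first-line index depends on.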

Finally, note that Representative Poetry uses a few letters or other symbols not available in many computer character sets. These special characters are represented either by simple codes within single braces, or by SGML entities. Which notation appears depends on the browser. SGML root files always use the brace notation.

For example, the single space is tagged as `{ }' in Blake's `Little Lamb' above but is converted to `&nbsp;' (the HTML entity reference for a non-breaking space) for the purpose of display by HTML browsers. Some browsers, like Lynx, support few special characters. Others, like Mosaic and Netscape, support most of the following entity references.

Here are the most important special characters:

  1. non-breaking space: { } or &nbsp;
  2. A-acute: {'A} or &Aacute;
  3. a-acute: {'a} or &aacute;
  4. A-circumflex: {^A} or &Acirc;
  5. a-circumflex: {^a} or &acirc;
  6. A-grave: {`A} or &Agrave;
  7. a-grave: {`a} or &agrave;
  8. A-tilde: {~A} or &Atilde;
  9. a-tilde: {~a} or &atilde;
  10. A-umlaut: {:A} or &Auml;
  11. a-umlaut: {:a} or &auml;
  12. AE-ligature: {AE} or &AElig;
  13. ae-ligature: {ae} or &aelig;
  14. ampersand: &amp;
  15. C-cedilla: {,C} or &Ccedil;
  16. c-cedilla: {,c} or &ccedil;
  17. censored letter: {-}
  18. E-acute: {'E} or &Eacute;
  19. e-acute: {'e} or &eacute;
  20. E-circumflex: {^E} or &Ecirc;
  21. e-circumflex: {^e} or &ecirc;
  22. E-grave: {`E} or &Egrave;
  23. e-grave: {`e} or &egrave;
  24. E-umlaut: {:E} or &Euml;
  25. e-umlaut: {:e} or &euml;
  26. I-acute: {'I} or &Iacute;
  27. i-acute: {'i} or &iacute;
  28. I-circumflex: {^I} or &Icirc;
  29. i-circumflex: {^i} or &icirc;
  30. I-grave: {`I} or &Igrave;
  31. i-grave: {`i} or &igrave;
  32. I-umlaut: {:I} or &Iuml;
  33. i-umlaut: {:i} or &iuml;
  34. N-tilde: {~N} or &Ntilde;
  35. n-tilde: {~n} or &ntilde;
  36. O-acute: {'O} or &Oacute;
  37. o-acute: {'o} or &oacute;
  38. O-circumflex: {^O} or &Ocirc;
  39. o-circumflex: {^o} or &ocirc;
  40. O-grave: {`O} or &Ograve;
  41. o-grave: {`o} or &ograve;
  42. o-macron: {_o}
  43. O-tilde: {~O} or &Otilde;
  44. o-tilde: {~o} or &otilde;
  45. O-umlaut: {:O} or &Ouml;
  46. o-umlaut: {:o} or &ouml;
  47. oe-ligature: {oe}*
  48. Old English capital eth: {D} or &ETH;
  49. Old English capital thorn: {TH} or &THORN;
  50. Old English eth: {d} or &eth;
  51. Old English thorn: {th} or &thorn;
  52. space: { } or &nbsp;
  53. U-acute: {'U} or &Uacute;
  54. u-acute: {'u} or &uacute;
  55. U-circumflex: {^U} or &Ucirc;
  56. u-circumflex: {^u} or &ucirc;
  57. U-grave: {`U} or &Ugrave;
  58. u-grave: {`u} or &ugrave;
  59. U-umlaut: {:U} or &Uuml;
  60. u-umlaut: {:u} or &uuml;
  61. Y-acute: {'Y} or &Yacute;
  62. y-acute: {'y} or &yacute;
  63. y-umlaut: {:y} or &yuml;

Greek is transliterated into English letters and placed within <lang> ... </lang> reference tags. Other characters, e.g., e-macron and oe-ligature, have no entity references yet in the basic Latin character set used above.
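The conversion from brace notation to entity references amounts to a table lookup. The Python sketch below is an illustration only: the table holds a handful of the pairs listed above, and the names `BRACE_TO_ENTITY` and `braces_to_entities` are assumptions. A brace code with no entity reference, such as `{_o}', is left untouched.

```python
import re

# A partial table taken from the list above; extend it as needed.
BRACE_TO_ENTITY = {
    " ": "&nbsp;",
    "'e": "&eacute;",
    "`e": "&egrave;",
    "^o": "&ocirc;",
    ":U": "&Uuml;",
    ",c": "&ccedil;",
    "th": "&thorn;",
}

# One brace code: an opening brace, one or more characters, a closing brace.
BRACE_CODE = re.compile(r"\{([^{}]+)\}")

def braces_to_entities(text):
    """Replace known brace codes with entity references; keep unknown codes."""
    def replace(match):
        return BRACE_TO_ENTITY.get(match.group(1), match.group(0))
    return BRACE_CODE.sub(replace, text)

print(braces_to_entities("{ }{ } Gave thee life, and bid thee feed"))
```

Leaving unknown codes in place, rather than deleting them, keeps characters such as o-macron visible in the output so that they can be handled once a suitable entity exists.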

HTML Markup

This simple encoding system employs SGML (Standard Generalized Markup Language) syntax. HTML is partly indebted to the Text-Encoding Initiative Guidelines, edited by Lou Burnard and Michael Sperberg-McQueen (1994).

The online literature on HTML is sizable.

HTML tags employed in Representative Poetry may be either single, such as <br> (indicating a line-break) or <p> (indicating a paragraph), or paired. The single tags stand for several unprintable characters and, when interpreted by Lynx, act on the text directly. Where the tag <br> appears, for instance, a line-break occurs. The paired tags surround a passage of text and characterize it in some way. The most important tag-pair, <html> </html>, encloses the entire text. Note that the closing tag of this pair is identical to the opening tag except for the added virgule or forward-slash /. This feature characterizes all HTML paired tags.

Here follows an HTML file generated by a sed script on the SGML file of Blake's `Little Lamb'.




<html>
<head>
<title>The Lamb</title>
</head>
<body>


<h3>THE LAMB</h3>
<hr>
Blake 1789: <a href="utel_rp_sourcebib.html#Blake 1789">Original Text Reference.<br></a>
Publication Date: 1789.<br>
Ed. (text): Northrop Frye; (e-text): I. Lancashire.<br>
Rep. Poetry: 3RP.II.278.<br>
<hr>

<a name="8">1&nbsp;&nbsp; &nbsp;&nbsp; Little Lamb, who made thee?</a> <br>
2 &nbsp;&nbsp; &nbsp;&nbsp; Dost thou know who made thee?<br>
3 &nbsp;&nbsp; Gave thee life, and bid thee feed<br>
4 &nbsp;&nbsp; By the stream and o'er the mead;<br>
5 &nbsp;&nbsp; Gave thee clothing of delight,<br>
6 &nbsp;&nbsp; Softest clothing, woolly, bright;<br>
7 &nbsp;&nbsp; Gave thee such a tender voice,<br>
8 &nbsp;&nbsp; Making all the vales rejoice?<br>
9 &nbsp;&nbsp; &nbsp;&nbsp; Little Lamb, who made thee?<br>
10 &nbsp; &nbsp;&nbsp; Dost thou know who made thee?<br>
<br>

11 &nbsp; &nbsp;&nbsp; Little Lamb, I'll tell thee,<br>
12 &nbsp; &nbsp;&nbsp; Little Lamb, I'll tell thee:<br>
13 &nbsp; He is called by thy name,<br>
14 &nbsp; For he calls himself a Lamb.<br>
15 &nbsp; He is meek, and he is mild;<br>
16 &nbsp; He became a little child.<br>
17 &nbsp; I a child, and thou a lamb.<br>
18 &nbsp; We are called by his name.<br>
19 &nbsp; &nbsp;&nbsp; Little Lamb, God bless thee!<br>
20 &nbsp; &nbsp;&nbsp; Little Lamb, God bless thee!<br>
<br>

<hr>
<img src="sample_small.gif" alt="*" align="bottom">
<a href="utel_rp_indexauthors.html">by Poet</a>
<img src="sample_small.gif" alt="*" align="bottom">
<a href="utel_rp_indextitles.html">by Title</a>
<img src="sample_small.gif" alt="*" align="bottom">
<a href="utel_rp_index1stlines.html">by 1st Line</a>
<img src="sample_small.gif" alt="*" align="bottom">
<a href="utel_rp_indexdates.html">by Date</a>
<img src="sample_small.gif" alt="*" align="bottom">
<a href="utel_rp_tagging.html">Tags</a>
<img src="utel.gif" alt="*" align="bottom">
<a href="utel.html">UTEL</a>
<hr>
<a name="credits"><h3>Credits and Copyright</h3></a>

Together with the editors, the Department of
English (University of Toronto), and the University of Toronto Press,
the following individuals share copyright for the work that went
into this edition:
<dl>
<dt>Screen Design (Electronic Edition): <dd>Sian Meikle (University of
Toronto Library)
<dt>Scanning: <dd>Sharine Leung (Centre for Computing in the Humanities)
<dt> <dd>
</dl>
</body>
</html>

No HTML tag may be seen when using the Lynx browser, but these tags are in the electronic text itself.
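The sed script that performs this conversion is not reproduced here. As a rough illustration of the verse-line step only, the following Python sketch turns `pmdv3' lines into the numbered, `<br>'-terminated lines seen in the HTML sample and emits a bare `<br>' at the end of each stanza. The function names are assumptions, and headers, anchors and the page footer are deliberately ignored.

```python
import re

# Captures the line number (after the hyphen) and the text of one verse line.
VERSE = re.compile(r'<pmdv3 n="\d+-(\d+)">(.*?)</pmdv3>')

def verse_to_html(sgml_line):
    """Convert one pmdv3 line to HTML, or return None if it is not a verse line."""
    match = VERSE.search(sgml_line)
    if match is None:
        return None
    number, text = match.groups()
    text = text.replace("{ }", "&nbsp;")  # brace-coded space to entity reference
    return f"{number} {text}<br>"

def stanzas_to_html(sgml_lines):
    """Convert verse lines and stanza ends; skip every other line."""
    html = []
    for line in sgml_lines:
        converted = verse_to_html(line)
        if converted is not None:
            html.append(converted)
        elif "</pmdv2>" in line:
            html.append("<br>")  # blank line between stanzas
    return html

print(verse_to_html('<pmdv3 n="8-3">{ }{ } Gave thee life, and bid thee feed</pmdv3>'))
```

Applied to line 3 of the SGML sample, this reproduces the corresponding line of the HTML sample.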

Here are the HTML tags employed in this library:

The NCSA's Beginner's Guide to HTML gives easy-to-understand instructions on how to encode a document with these and other HTML tags.

COCOA- or TACT-style Markup

A third tagging system exists, suitable for DOS-based text-analysis software like Oxford Concordance Program and TACT. These tags normally consist of an opening angle bracket, an unchanging variable (e.g., "author" in "<author Shakespeare>"), a space, a changing value (e.g., "Shakespeare"), and a closing angle bracket.

In the following list, COCOA- or TACT-style tag values are indicated by "xxxx".

Although simple, this tagging scheme attempts to characterize the text faithfully.

All COCOA- and TACT-style tags are single. They hold true until another tag of the same type--with a different value--appears. Thus the tag characterizes all subsequent text as prose until such time as the tag appears (which signals the beginning of a verse). To turn off a tag without replacing its value with another value, I sometimes use the "-" (or null) value. For example, the tag precedes Browning's "My Last Duchess", a dramatic monologue uttered by that duke (for whom Browning even gives a speech prefix). The tag follows the poem, however, because the next poem, "Meeting at Night", seems to be by the poet himself in the first person.
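The rule that a tag holds until it is replaced or switched off can be modelled as a small table of current values. The Python sketch below is an illustration under assumptions: `<author ...>' follows the example given earlier, while the `<speaker ...>' tag name is hypothetical and stands in for whatever tag marks a speaker in the actual files.

```python
import re

# One COCOA-style tag: variable name, a space, then the value.
COCOA_TAG = re.compile(r"<([^\s<>]+) ([^<>]*)>")

def track_tags(lines):
    """Yield (text, state) pairs; state maps tag variables to current values."""
    state = {}
    for line in lines:
        for variable, value in COCOA_TAG.findall(line):
            if value == "-":
                state.pop(variable, None)  # the null value turns the tag off
            else:
                state[variable] = value    # a new value replaces the old one
        text = COCOA_TAG.sub("", line).strip()
        if text:
            yield text, dict(state)

lines = [
    "<author Browning>",
    "<speaker Duke>",          # hypothetical tag name
    "That's my last Duchess painted on the wall,",
    "<speaker ->",
    "The grey sea and the long black land;",
]
for text, state in track_tags(lines):
    print(state, text)
```

Each line of text is reported with a snapshot of the values in force when it occurs, which is the information a text-analysis program needs in order to attribute words correctly.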

The "xxxx" value of the <tt xxxx> tag gives the type of text for all words following (until the next <tt xxxx> tag). These values include "epigraph", "nt:leftmargin" (note in left margin), "RPheading" (poem title or heading in Representative Poetry), "RPsubheading" (poem subtitle or subheading in RP), "RPheadingno" (the number given to the canto, stanza, etc., in RP), "sppfx" (speech prefix occurring in dramatic poems), "stagedir" (stage direction), and "text" (the poem itself). This tag ensures that text-analysis programs can retrieve words according to type.

Every line in a poem has a prefixed number that identifies both the in-sequence number of the poem in RP, and the line number in the poem. In any <lx yyyy> tag, "x" is the in-sequence number and "yyyy" the poem's line-number. Besides serving as a useful method of reference, exhaustive lineation of this kind ensures that the entire poem is included in the file. Any disruption in lineation indicates a corrupted file. Remove this lineation at your own risk.
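A check for disrupted lineation might look like the following Python sketch. It assumes that a line tag has the literal form `<l8 12>' (poem 8, line 12), which is one reading of the description above, and the function name `lineation_gaps' is illustrative.

```python
import re

# Assumed tag form: <l + poem number + space + line number + >
LINE_TAG = re.compile(r"<l(\d+) (\d+)>")

def lineation_gaps(text):
    """Return (poem, expected, found) for every break in line numbering."""
    gaps = []
    expected = {}  # poem number -> next expected line number
    for poem, line in LINE_TAG.findall(text):
        poem, line = int(poem), int(line)
        want = expected.get(poem, 1)
        if line != want:
            gaps.append((poem, want, line))
        expected[poem] = line + 1
    return gaps

print(lineation_gaps("<l8 1>first\n<l8 2>second\n<l8 4>fourth"))
```

An empty result means every poem in the file is numbered from 1 without interruption; any other result points to the poem and line at which text may have been lost.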

A standard reference for any word or phrase retrieved from this file may be taken from the current values of the <author>, <poemtitle>, <subtitle>, <copytext>, and <lx> tags. These give the poet's name, the title and subtitle of the poem, the volume and page reference in Representative Poetry, and the verse line-number.

In the Context of an Electronic Library

The poems in Representative Poetry are split into poem files, but any library in English literary history consists of books and manuscripts, not authors. For this reason, the RP author files include, from time to time, hypertext links to the earliest copy of the source on which their texts are based.

For example, Representative Poetry includes John Dryden's poem entitled "To the Pious Memory of the Accomplished Young Lady Mrs. Anne Killigrew." This poem first appeared in Killigrew's Poems, published posthumously in 1686. An electronic copy of this text, encoded in HTML but also following a more complicated encoding system for Renaissance texts, also exists online. HTML <a> tags permit us to move back and forth between the modernized edition in Representative Poetry and the source edition of 1686.

The RP author-files give much of the best poetry in English literature up to the late 19th century, but they also lead readers back into the source literature so that they may make their own selection. The hypertext links possible in an electronic library only do--albeit faster, less expensively, and more conveniently-- what the editors end-of-volume did.

Sample Tagging Programs

It makes sense to automate as much of the work in preparing an online library as possible. By preparing scripts (collections of commands that can be executed in sequence automatically, like DOS batch files), sed (the UNIX stream editor), perl (an easy-to-use programming language), fgrep, and other UNIX utilities may be set to transform large numbers of files or to extract information from them for indexes. Here are some examples.

None of these programs is exemplary in form or structure, but they all did exactly what I wanted them to do.

The first example is a UNIX script that generates the `poet files' in the electronic Representative Poetry. First I edited a template for this kind of file and stored it as `rptemplate0.html'. The template included simple codes for types of information, e.g., `xxx (aaa)' for the poet's name, followed by life-dates.

Then I put the following script into a text file. For each poet in the collection, the script copies the template into a file `0' and runs on it a sed script whose name begins with `0', follows with the poet's name, and ends with the extension `.ctrl'. The transformed text is output to the `poet file' (e.g., `arnold0.html').


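#!/bin/sh
# Script to make the poet0.html files: copy the template to a scratch
# file `0', run the poet's sed control file on it, then delete `0'.
# (cp and rm are the UNIX equivalents of the DOS copy and del commands.)

cp rptemplate0.html 0
sed -f 0arnold.ctrl 0 > arnold0.html
rm 0

cp rptemplate0.html 0
sed -f 0blake.ctrl 0 > blake0.html
rm 0

cp rptemplate0.html 0
sed -f 0browning.ctrl 0 > browning0.html
rm 0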

....

The `0arnold.ctrl' file follows. It contains simple editing commands to make five substitutions and to append a list of anchor tags for a poem index.
#sed script to insert fields in the poet0 file
s/xxx/Matthew Arnold/
s/yyy/H. Kerpneck/
s/zzz/arnold/
s/bbb/Arnold/
s/aaa/1822-1888/
/<ol>/a\
<li><a href="utel_rp_poems_arnold14.html">Bacchanalia</a>\
<li><a href="utel_rp_poems_arnold6.html">Consolation</a>\
<li><a href="utel_rp_poems_arnold21.html">Dover Beach</a>\
<li><a href="utel_rp_poems_arnold19.html">Immortality</a>\
<li><a href="utel_rp_poems_arnold15.html">Isolation: To Marguerite</a>\
<li><a href="utel_rp_poems_arnold7.html">Lines Written in Kensington Gardens</a>\
<li><a href="utel_rp_poems_arnold3.html">Memorial Verses April 1850</a>\
<li><a href="utel_rp_poems_arnold23.html">Palladium</a>\
<li><a href="utel_rp_poems_arnold11.html">Philomela</a>\
<li><a href="utel_rp_poems_arnold10.html">Requiescat</a>\
<li><a href="utel_rp_poems_arnold22.html">Rugby Chapel</a>\
<li><a href="utel_rp_poems_arnold4.html">Self-Dependence</a>\
<li><a href="utel_rp_poems_arnold2.html">Shakespeare</a>\
<li><a href="utel_rp_poems_arnold12.html">Sohrab and Rustum</a>\
<li><a href="utel_rp_poems_arnold13.html">Stanzas from the Grande Chartreuse</a>\
<li><a href="utel_rp_poems_arnold8.html">The Buried Life</a>\
<li><a href="utel_rp_poems_arnold1.html">The Forsaken Merman</a>\
<li><a href="utel_rp_poems_arnold5.html">The Future</a>\
<li><a href="utel_rp_poems_arnold9.html">The Scholar-Gipsy</a>\
<li><a href="utel_rp_poems_arnold17.html">Thyrsis: A Monody, to Commemorate the Author's Friend, Arthur Hugh Clough</a>\
<li><a href="utel_rp_poems_arnold16.html">To Marguerite: Continued</a>\
<li><a href="utel_rp_poems_arnold20.html">Worldly Place</a>\
<li><a href="utel_rp_poems_arnold18.html">Youth and Calm</a>
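For readers without sed, the effect of a control file like this one can be approximated in Python. This is a sketch under assumptions: the real `rptemplate0.html' is not shown, so the three-line template below is invented for illustration, as are the function name `fill_template' and its parameters. Like sed's `s' command without the `g' flag, it replaces only the first occurrence of each code on a line, and like the `a\' command it appends the poem list after any line containing `<ol>'.

```python
def fill_template(template, fields, poems):
    """Substitute placeholder codes and append poem links after <ol>."""
    out = []
    for line in template.splitlines():
        for placeholder, value in fields.items():
            # Replace the first occurrence only, as `s/xxx/.../' does in sed.
            line = line.replace(placeholder, value, 1)
        out.append(line)
        if "<ol>" in line:
            out.extend(f'<li><a href="{href}">{title}</a>' for href, title in poems)
    return "\n".join(out)

# Hypothetical template; the actual rptemplate0.html is not reproduced here.
template = "<h1>xxx (aaa)</h1>\n<ol>\n</ol>"
fields = {"xxx": "Matthew Arnold", "aaa": "1822-1888"}
poems = [("utel_rp_poems_arnold21.html", "Dover Beach")]
print(fill_template(template, fields, poems))
```

Keeping the field values and the poem list as data, separate from the substitution logic, means one function can serve every poet instead of one control file per poet.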

The first perl program renumbers verse lines in sequence where it finds the string `00', restarting at `1' again each time it happens on either `h1' or `subhead' tags. During execution the program asks one for the input and output filenames. These could be given in parameters, but I wanted to be quite sure what I was doing and so used this somewhat tedious procedure. The input file contains all poems by a given author.
#!/usr/bin/perl
# Renumber verse lines: each ` 00>' placeholder becomes the next line
# number, restarting at 1 after every <h1> or <subhead> tag.
use strict;
use warnings;

# Prompt for a filename and let the user correct it once.
sub ask_filename {
    my ($prompt, $retry) = @_;
    print "$prompt\n";
    chomp(my $name = <STDIN>);
    print "Is `${name}' the right filename? (y/n)\n";
    chomp(my $answer = <STDIN>);
    if ($answer eq "n") {
        print "$retry\n";
        chomp($name = <STDIN>);
    } else {
        print "ok\n";
    }
    return $name;
}

my $in  = ask_filename("Unnumbered filename?", "Unnumbered filename, eh?");
my $out = ask_filename("Output numbered filename?", "Numbered filename, eh?");

open(my $IN, '<', $in) or die "Cannot read $in: $!\n";
open(my $OUT, '>', $out) or die "Cannot write $out: $!\n";
my $n = 1;
while (<$IN>) {
    if (/<h1/ || /<subhead/) {
        $n = 1;                              # a new poem or section begins
    } elsif (/ 00>/) {
        my $target = index($_, "00>");
        substr($_, $target, 3) = "$n>";      # replace the placeholder
        ++$n;
    }
    print {$OUT} $_;
}
close($IN);
close($OUT);
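The renumbering rule described above (replace each `00' placeholder with the next number, and restart at 1 on an `h1' or `subhead' tag) can also be written as a function over lines held in memory, which makes it easy to try on a small sample. This Python sketch is an illustration, not a translation endorsed by the author; the `<l8 00>' tag form in the example is an assumption.

```python
def renumber(lines):
    """Replace each `00>' placeholder with a running line number."""
    n = 1
    out = []
    for line in lines:
        if "<h1" in line or "<subhead" in line:
            n = 1                                   # restart at each heading
        elif " 00>" in line:
            line = line.replace("00>", f"{n}>", 1)  # first placeholder only
            n += 1
        out.append(line)
    return out

sample = ["<h1>The Lamb</h1>", "<l8 00>a", "<l8 00>b", "<h1>Next</h1>", "<l9 00>c"]
print(renumber(sample))
```

Taking and returning lists of lines, rather than prompting for filenames, leaves the choice of input and output files to the caller.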

The second perl program extracts single poems from this numbered file (as output by the preceding program) and writes each to its own file, named after the author file and numbered in sequence. Note that I supply the input filename on the command line as the first argument. In this way I could make a script containing commands to extract the poems from all the author files at one stroke.
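#!/usr/bin/perl
# Split an author file into one file per poem: a new output file
# (author name plus a running number, extension .sgml) starts at every <h1>.
use strict;
use warnings;

die "usage: $0 authorfile\n" unless @ARGV;
my $period = rindex($ARGV[0], ".");
$period = length($ARGV[0]) if $period < 0;   # filename without extension
my $head = substr($ARGV[0], 0, $period);
my $tail = substr($ARGV[0], $period + 1);
print "head= '${head}', tail = '${tail}'\n";

my $n = 0;
my $POEMFILE;
while (<>) {
    if (/<h1>/) {
        close($POEMFILE) if $POEMFILE;       # finish the previous poem
        ++$n;
        my $poemfilename = "$head$n.sgml";
        open($POEMFILE, '>', $poemfilename)
            or die "Cannot write $poemfilename: $!\n";
    }
    # Lines before the first <h1> belong to no poem and are skipped.
    print {$POEMFILE} $_ if $POEMFILE;
}
close($POEMFILE) if $POEMFILE;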
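
The splitting step can likewise be sketched in Python as two small functions, one that groups lines into poems at each `<h1>' tag and one that derives the output filenames. Both function names, and the input name `blake.sgml' in the example, are assumptions made for illustration; no files are written.

```python
def split_poems(lines):
    """Group lines into poems; a new poem starts at every <h1> tag."""
    poems = []
    for line in lines:
        if "<h1>" in line:
            poems.append([])
        if poems:                 # lines before the first <h1> are skipped
            poems[-1].append(line)
    return poems

def poem_filenames(author_file, count):
    """Name the poem files after the author file, numbered from 1."""
    head = author_file.rsplit(".", 1)[0]
    return [f"{head}{n}.sgml" for n in range(1, count + 1)]

poems = split_poems(["junk", "<h1>A</h1>", "a1", "<h1>B</h1>", "b1"])
print(poems)
print(poem_filenames("blake.sgml", len(poems)))
```

Separating the grouping from the naming makes each part easy to check before any file on disk is touched.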

Ian Lancashire
Department of English, New College
Centre for Computing in the Humanities
Robarts Library
University of Toronto
Toronto, Ont. M5S 1A1