Overview of Standard Generalized Markup Language

 

 

by

 

Daniel V. Pitti

Project Director

Institute for Advanced Technology in the Humanities

University of Virginia

 

 

SGML

 

While Standard Generalized Markup Language (SGML) is both standard (ISO 8879) and generalized, it does not provide an off-the-shelf markup language that one can simply take home and apply to a letter, a novel, an article, a software manual, or a catalog record. It is instead a markup language meta-standard, or in simpler words, a standard for constructing markup languages. SGML provides a syntax and a meta-language for defining and expressing the logical structure of documents, and conventions for naming the components or elements of documents.

 

One can think of SGML as a set of formal rules for defining specific markup languages for individual kinds of documents. Using these formal rules, a community sharing a particular kind of document can get together and create a markup language specific to a shared type of document.

 

The specific markup languages written in compliance with formal SGML requirements are called Document Type Definitions, or DTDs. For example, the Association of American Publishers has developed a set of three DTDs: one for books, one for journals, and one for journal articles. A consortium of software developers and producers has developed a DTD for computer manuals: DocBook. The Library of Congress is currently testing a USMARC DTD.

 

DTDs shared and followed by a community are themselves standards. The Association of American Publishers DTD is registered as ANSI/NISO Z39.59-1988, and after substantial revision, has been approved as an international standard, ISO 12083.

 

SGML is thus very general and abstract. It exists formally over and above individual markup languages for specific document classes. It is also a standard, which is to say, a formal set of conventions in the public domain, not owned by and thus not dependent on any hardware or software producer. That SGML is a standard offers its users reasonable assurance that the information created will not become obsolete because of hardware and software developments.

 

The formality and generality of SGML have very important implications. Because SGML syntax and rules are formal and precise, it is possible to write software that can be easily adjusted to work with any compliant DTD. Typically, an SGML software product has a toolkit that allows the user to adapt its functionality to the specific DTD. As a result, the market driving SGML software development is in principle everyone. This is very different from the MARC software market, which consists almost exclusively of libraries and a few archives and museums.

 

The SGML market includes virtually everyone. Many government agencies are using SGML, including the Department of Defense, Department of Energy, Internal Revenue Service, and Library of Congress. A wide variety of industries are also employing SGML, including software producers; airline, automobile, and tractor manufacturers; print and electronic publishers; and pharmaceutical and medical companies. The academic community also has a number of important initiatives underway, including, among others, the Text Encoding Initiative, an international project to provide encoding standards to support linguistic and literary research, and Encoded Archival Description. Most recently, there are a number of initiatives for developing DTDs for electronic commerce on the Internet and for a wide variety of scientific and academic research.

 

In order to understand why SGML has generated such broad interest from both users and developers, it is useful to consider the nature of markup and what kind of markup SGML promotes. In an article now considered by many to be a classic presentation of document markup theory, James Coombs, Allen Renear, and Steven DeRose distinguished six kinds of markup, three of which I would like to discuss briefly: procedural, descriptive, and referential.[1]

 

Procedural Markup. In the last few years, through the use of word processing systems, we have become familiar with procedural markup. Procedural markup consists of processing instructions to the computer. It tells the computer what to do with specified components of the text. For example, the title of a major section might have instructions that tell the printer to center the text, use a font of a certain size, and perhaps print it in bold italics. Most procedural markup is characterized by being paper directed, that is, it tells the printer how to put the text on paper. If you want to do anything else with the text, the markup is not of much help. If you want to search for the initials "SGML" in the machine-readable version of a book, but only where they occur in a chapter or section title, the procedural markup provides no assistance. Nor does it help if you want to display the text on a computer screen, since paper presentation and monitor presentation are quite different. Finally, procedural markup has a further limitation: to date, all procedural markup has been proprietary. This means, for example, that documents created in WordPerfect cannot be processed flawlessly in Microsoft Word, and vice versa. Each word processing software package uses its own markup. In this environment, the future of the document is tied to the future of the software.

 

Descriptive Markup. A second type of markup mentioned by Coombs, Renear, and DeRose is descriptive markup. With descriptive markup, we arrive at the form of markup recommended by SGML. Descriptive markup identifies the logical components of documents. While procedural markup specifies a particular procedure to be applied to a document component, descriptive markup indicates what the component is. Examples are chapter, chapter title, section, paragraph, author, publisher, and cataloging-in-publication data. None of these gives any indication of what procedures are to be applied to these components. But if you know the elements in a document, you can have processors do whatever you want with them. Descriptive markup liberates the document for multiple uses. It is possible, for example, to use one and the same source document to produce printed, electronic, Braille, and voice-synthesized versions, and, for good measure, to produce HTML and flat ASCII. The fact that descriptive markup can be used in so many different ways is one of its important characteristics. It escapes the single-use trap of procedural markup.
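
The point about one source serving many uses can be sketched in a few lines of Python. This is only an illustration, not part of any SGML tool: the element names (book, chapter, title, para), the sample text, and the function names are all invented here, and the source is written as XML so that Python's standard library can read it.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical descriptively tagged source; the element names are illustrative only.
SOURCE = """<book>
  <chapter>
    <title>SGML Basics</title>
    <para>Markup identifies the parts of a document.</para>
  </chapter>
  <chapter>
    <title>Delivery</title>
    <para>SGML documents can be published in many forms.</para>
  </chapter>
</book>"""


def to_html(book):
    """One use of the source: titles become headings, paragraphs become <p>."""
    parts = []
    for chapter in book.iter("chapter"):
        for child in chapter:
            if child.tag == "title":
                parts.append(f"<h1>{escape(child.text)}</h1>")
            elif child.tag == "para":
                parts.append(f"<p>{escape(child.text)}</p>")
    return "\n".join(parts)


def to_plain_text(book):
    """A second use of the same source: flat text with upper-case titles."""
    parts = []
    for chapter in book.iter("chapter"):
        for child in chapter:
            if child.tag == "title":
                parts.append(child.text.upper())
            elif child.tag == "para":
                parts.append(child.text)
    return "\n".join(parts)


def titles_containing(book, word):
    """A third use: search for a word, but only where it occurs in a title."""
    return [t.text for t in book.iter("title") if word in t.text]


book = ET.fromstring(SOURCE)
html_version = to_html(book)
text_version = to_plain_text(book)
hits = titles_containing(book, "SGML")
```

Note that the second chapter's paragraph also contains "SGML", yet the title search ignores it, because the markup says what each component is rather than how to print it.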

 

It is useful to distinguish two kinds of descriptive markup: structural and nominal. Descriptive structural markup identifies document components and their logical relationship. Structural elements are components that you usually want to present visually in some distinct manner. Examples are chapter titles, paragraphs, block quotes, and the like. Descriptive nominal markup, as you might expect, identifies named entities, both concrete and abstract. Examples are corporate names, personal names, topical subjects, genres, and geographic names. While you may want to visually present these names online or on paper in some particular manner, you usually want to index them in particular ways, to use them to provide access to the source or subject matter of the document.
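
As a rough sketch of how nominal markup supports indexing, the following Python fragment gathers tagged names into a simple index. The element names (persname, corpname, geogname) and the sample letter are made up for the illustration, and the source is again XML so the standard library can parse it.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented sample; the names and element names are for illustration only.
SOURCE = """<letter>
  <para><persname>Thomas Jefferson</persname> wrote from
  <geogname>Monticello</geogname> to <persname>James Madison</persname>.</para>
  <para>The reply reached <geogname>Monticello</geogname> a week later.</para>
</letter>"""

# Elements treated as nominal markup in this sketch.
NOMINAL_TAGS = ("persname", "corpname", "geogname")


def build_index(root):
    """Collect the distinct names found under each nominal element type."""
    index = defaultdict(set)
    for tag in NOMINAL_TAGS:
        for element in root.iter(tag):
            # Normalize internal whitespace so line breaks do not create duplicates.
            index[tag].add(" ".join(element.text.split()))
    return {tag: sorted(names) for tag, names in index.items()}


name_index = build_index(ET.fromstring(SOURCE))
```

The structural element (para) plays no part here; only the nominal elements feed the index, which is the division of labor described above.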

 

Referential Markup. As its name suggests, referential markup refers to information that is not present. It is markup in the third person, so to speak. There are different kinds of referential markup and different ways to use it, but I would like to focus on the kind that enables something about which most of you have heard, and perhaps with which many of you have some experience, namely, hypertext and hypermedia. In addition to supporting text, SGML also makes provision for using text to refer to other text, and to refer to other kinds of digital information derived from the full array of native formats: photographs (color as well as black and white); sound motion pictures; drawings; paintings; audio recordings; three-dimensional objects of all kinds, shapes, and sizes; maps; manuscripts; typescripts; printed pages; mathematical data; financial data; diagrams; musical notation; choreographic notation; and anything else open to being digitally captured and rendered in some useful form. It is possible not only to refer to or point at this other digital information from within SGML-based documents, but also to control the notation information needed to launch the devices necessary for rendering the various objects into humanly intelligible forms. It is thus possible to use electronic text to control and manage extra-SGML information objects of all kinds, as well as to provide access to and navigation through them.
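
The idea of pointing at an external object and using its notation to decide how to render it can be sketched as follows. Everything in this fragment is hypothetical: the extref element, the entity names, the file names, and the two lookup tables merely stand in for the entity and notation declarations a real DTD would carry.

```python
import xml.etree.ElementTree as ET

# Stand-in for entity declarations: a name, the external file, and its notation.
ENTITIES = {
    "portrait": {"file": "portrait.jpg", "notation": "JPEG"},
    "speech": {"file": "speech.wav", "notation": "WAV"},
}

# Stand-in for notation handling: which kind of device renders each notation.
HANDLERS = {"JPEG": "image viewer", "WAV": "audio player"}

SOURCE = (
    '<para>See the <extref entity="portrait">portrait</extref> and hear the '
    '<extref entity="speech">speech</extref>.</para>'
)


def resolve_references(root):
    """Return (link text, external file, rendering device) for each reference."""
    resolved = []
    for ref in root.iter("extref"):
        entity = ENTITIES[ref.get("entity")]
        resolved.append((ref.text, entity["file"], HANDLERS[entity["notation"]]))
    return resolved


links = resolve_references(ET.fromstring(SOURCE))
```

The text itself holds only the reference; the object and the means of rendering it live outside the document, which is what makes the markup referential.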

 

HTML and XML

 

HyperText Markup Language (HTML) is an SGML DTD which has enjoyed enormous success as the encoding standard underpinning the World Wide Web. As a specific application of SGML, the HTML DTD limits itself to simple procedural encoding dedicated to online display and hypermedia linking. Because of HTML’s relative ease of use and its ability to support online display of finding aids, there is a temptation to use it to encode even very complex documents.  Before using HTML, however, it is important to remember that its procedural focus will not represent complex intellectual content and structure in a manner that will enable sophisticated searching, navigation, and display. Evidence of HTML's limited ability to support intelligent searching and document discovery, let alone complex display, navigation and other processing, is not difficult to find. Many of us have used Web search engines to look for both known items and items relevant to a particular topic, and more often than not, we are overwhelmed by voluminous results. Our patience frequently is exhausted looking for an item or two that satisfies our need.

 

The success of HTML as a display format for the Web brings into sharp relief the one major weakness in currently available SGML software, namely that there are limited options for delivering native SGML over the Internet. SGML software developers have produced very good and affordable tools for SGML authoring and editing, data conversion, and database indexing and searching; they also have produced very good publishing tools for in-house and CD-ROM publishing. But delivering SGML documents on the Web has been a serious obstacle to its broader deployment. The prospects for this changing in the near future, though, appear to be bright.

 

In 1996, the World Wide Web Consortium (W3C) founded the XML Working Group to build a set of specifications to make it easier to use SGML on the Web. The Working Group, in a short period of time, wrote a specification for a simplified subset of SGML named Extensible Markup Language (XML). Both Microsoft and Netscape have committed to fully implementing XML in their Internet browsers. Internet Explorer 5.0 provides some but not full support for XML.

 

The motive driving the development of XML is the recognition that HTML will not support complex, community-based use of shared information on the Internet. HTML hardwires a small set of procedurally-oriented tags. Constraining the set of tags has made it easy to build applications that make life relatively easy for authors and Web publishers. The ease of use has been a major factor in the Web’s remarkable success. The small, closed tag set, however, has come at a price: HTML has extremely limited functionality. Jon Bosak has identified three areas in which HTML is wanting: extensibility, structure, and validation.[2] SGML is strong in all of these areas, but its strength, like HTML's weakness, comes at a price: SGML is complicated for both application developers and the users of the applications. The W3C's XML Working Group addressed this weakness by identifying and proscribing some features in SGML that are difficult to implement. The result of their work is XML, a simplified subset of SGML for use on the Web.

 

The ongoing development of XML and the closely related standards Extensible Stylesheet Language (XSL) and Extensible Linking Language (XLink) promises to overcome the last major obstacle to the use of SGML for delivery of complex documents over the Internet.[3]

 

For additional information about SGML, see http://www.oasis-open.org/cover/

 

 

SGML and XML Software Tools

 

The best source of information on SGML/XML tools is Steve Pepper's The Whirlwind Guide to SGML Tools and Vendors: http://www.infotek.no/sgmltool/guide.htm

 

Parsers: Validation of DTDs and documents.

 

Key to the use of SGML are parsers. Essentially, parsers are aware of the formal requirements of the SGML meta-language and syntax, and they use this awareness to do three very important things.

 

First, a parser can read the DTD itself, and make sure that it formally adheres to the standard. It reads all of the element, attribute, and entity declarations to make sure that they are compliant with the specifications in the standard. If naming conventions and syntax are used incorrectly, it will inform the person developing the DTD.

 

Second, once a parser has read the DTD and finds it valid, it can read an encoded document and validate that all of the encoding meets the specifications in the DTD.

 

Finally, the parser outputs the document in a form that other SGML software can use.
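
The second of these tasks, checking a document against the rules in a DTD, can be illustrated with a deliberately simplified sketch. It is not a real SGML parser: the "DTD" here is just a Python dictionary invented for the example, listing which child elements each element may contain, and it ignores the order and occurrence rules that a real content model also expresses.

```python
import xml.etree.ElementTree as ET

# Stand-in for a DTD: each declared element maps to the children it may contain.
CONTENT_MODEL = {
    "memo": {"to", "from", "body"},
    "to": set(),
    "from": set(),
    "body": {"para"},
    "para": set(),
}


def validate(element, model=CONTENT_MODEL):
    """Return a list of error messages; an empty list means no violations found."""
    errors = []
    if element.tag not in model:
        errors.append(f"undeclared element <{element.tag}>")
        return errors
    for child in element:
        # A declared element in the wrong place violates the parent's content model.
        if child.tag in model and child.tag not in model[element.tag]:
            errors.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        errors.extend(validate(child, model))
    return errors


good = ET.fromstring("<memo><to>A</to><from>B</from><body><para>Hi</para></body></memo>")
bad = ET.fromstring("<memo><to>A</to><para>Hi</para><note>x</note></memo>")

good_errors = validate(good)
bad_errors = validate(bad)
```

The first document passes, while the second is reported for a misplaced para and an undeclared note element, which is the kind of feedback a validating parser gives an author.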

 

All SGML-compliant software uses a recognized parser! If it does not, it is not compliant!

 

There are free SGML parsers available for ftp. See: http://www.oasis-open.org/cover/publicSW.html#parsers

 

The best available parser is NSGMLS. NSGMLS is part of a suite of related tools developed by James Clark called SP: http://www.jclark.com/sp/index.htm

               

There are many different parsers available for XML. There are two kinds of parsers: validating and non-validating parsers. Non-validating parsers check to make sure XML documents are well-formed, but do not check to make sure documents adhere to a DTD. Validating parsers, like SGML parsers, make sure the DTD is valid and that the document adheres to the rules in it.
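
The non-validating case is easy to try out. As a minimal sketch, Python's standard xml.etree.ElementTree module, which is built on the non-validating expat parser, checks only that a document is well-formed; the function name and the sample strings below are invented for the illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Properly nested tags: well-formed.
nested_ok = is_well_formed("<note><to>Reader</to></note>")

# Overlapping tags: not well-formed, so the parser rejects the document.
overlapping = is_well_formed("<note><to>Reader</note></to>")
```

A check like this says nothing about whether note and to are the right elements in the right places; that question requires a DTD and a validating parser.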

 

The best available validating XML parser is Microsoft’s XMLINT. See the XML Validation Tool under MSDN Online Tools.

 

Converters and Transformers: moving into and out of SGML.

 

SGML converters are software tools used to convert non-SGML text into SGML text. This is also called "up conversion."

 

SGML transformers either transform a document conforming to one DTD into a document conforming to another DTD, or transform an SGML-encoded document into a non-SGML-encoded document.

 

Custom-created scripts written in Perl are frequently employed for both conversion and transformation. It is also possible to use macro programs in word processing programs such as WordPerfect to mark up text based on formatting clues.
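
A small script shows what up conversion from formatting clues looks like. The sketch below is in Python rather than Perl, and its rules are assumptions made up for the example: blocks are separated by blank lines, a block entirely in capitals is taken to be a title, and everything else is a paragraph. The element names (section, title, para) are likewise hypothetical.

```python
from xml.sax.saxutils import escape


def up_convert(plain_text):
    """Wrap blank-line-separated blocks in tags chosen from formatting clues."""
    blocks = [b.strip() for b in plain_text.split("\n\n") if b.strip()]
    out = ["<section>"]
    for block in blocks:
        # Collapse line breaks inside a block and escape markup characters.
        text = escape(" ".join(block.split()))
        if block.isupper():
            # Clue: an all-capitals block is treated as a title.
            out.append(f"<title>{text}</title>")
        else:
            out.append(f"<para>{text}</para>")
    out.append("</section>")
    return "\n".join(out)


SAMPLE = "PARSERS\n\nParsers read the DTD.\n\nThey also validate documents & report errors."
tagged = up_convert(SAMPLE)
```

Clues like these are rarely reliable across a whole collection, so the output of such a script would normally be run through a parser and corrected by hand where validation fails.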

 

There are also several applications devoted to converting and transforming documents. The most powerful and sophisticated of these tools is Exoterica's OmniMark, which is now free. For more information, see the OmniMark site.

 

For more information, see http://www.oasis-open.org/cover/publicSW.html#conversion

 

Authoring and Editing: writing and maintaining SGML documents.

 

SGML authoring and editing tools, like word processing programs, are used for writing and maintaining documents. While many SGML authoring and editing tools behave very much like word processing programs, with WYSIWYG interfaces, spell checkers, and the like, they also have special features to facilitate creating and maintaining valid documents.

 

The best SGML authoring and editing tools have real-time parsers, which is to say, parsers that compel the author to use only DTD-compliant elements. In addition, good SGML authoring tools provide mechanisms for automatically adding the tags using function keys or point-and-click menus. Really smart ones will even supply the next tag or tags in certain contexts.

 

There are free SGML authoring and editing tools, such as PSGML. See: http://www.oasis-open.org/cover/publicSW.html#editing

 

There are several commercial authoring tools available. The following are reasonably priced and of good quality:

 

Adobe’s Frame+SGML is reasonably priced with an academic discount: http://www.adobe.com/prodindex/framemaker/prodinfosgml.html

 

SoftQuad’s XMetaL: see xmetal.com

 

Interleaf's Author/Editor, which is available on Unix, Windows, and Mac platforms. See: http://www.interleaf.com/products/sgml.htm

 

WordPerfect SGML Edition, which is available on Windows: http://www.corel.com/Office2000/standard.htm

 

Browsers and Electronic Publishing: viewing and browsing SGML-based documents.

 

In order to browse and read SGML and XML-based documents easily in a networked, machine environment, it is essential to have browsers and electronic publishing systems that can transform the descriptive tagging in documents into a presentation form. There are a variety of such products available. They vary considerably in functionality and cost.

 

The most exciting development in SGML/XML publishing is the release of Microsoft’s Internet Explorer 5.0, which, using XSL stylesheets, displays XML documents directly in the browser: http://www.microsoft.com/downloads/

 

There are also a number of tools based on XSL that will transform your XML documents into HTML for delivery on the Web. James Clark’s XT, which we use in the class, is available at http://www.jclark.com/xml/xt.html. I recommend the executable version for Windows, as you do not need to understand Java to work with it. Saxon is also very popular: http://users.iclway.co.uk/mhkay/saxon/



[1]Coombs, James H., Allen H. Renear, and Steven J. DeRose. 1987. "Markup Systems and the Future of Scholarly Text Processing." Communications of the Association for Computing Machinery 30 (11): 933-947.

[2]Bosak, Jon. "XML, Java, and the Future of the Web." See: http://sunsite.unc.edu/pub/sun-info/standards/xml/why/xmlapps.htm

 

[3]XML includes three related initiatives: XML, Extensible Linking Language (XLink) and Extensible Stylesheet Language (XSL). For current information on the status of the development and the latest drafts of each, see http://www.oasis-open.org/sgml/xml.html