Tools for Implementing SGML-Based
Information Systems: Viewers and Browsers,
Text Retrieval Engines, and CD-ROMs

By Kurt Conrad, The Sagebrush Group


This paper/presentation is an update of the one which was delivered at SGML'95. It is intended to be a general introduction to the issues and concepts involved in the selection of software tools for the electronic delivery and retrieval of SGML (Standard Generalized Markup Language) documents. In addition, some of the issues unique to publishing to CD-ROM or via the World Wide Web will be explored.


browsers, CD-ROM, document lifecycle, HTML, IETM, information economics, information politics, metadata, stakeholder interests, text retrieval, vector searches, viewers, World Wide Web


Mr. Conrad is an Information and Management Consultant and founder of The Sagebrush Group, a consulting company that specializes in SGML transition services. He is one of the authors of The Panorama Handbook: SGML and the World Wide Web, which is to be published by O'Reilly and Associates in 1996, and is also contributing to a number of other books and training development efforts. Prior to that, he worked for Boeing Computer Services for 10 years, where he held a variety of training, consulting, and project management positions. From 1992 to 1995 he directed an enterprise-level SGML initiative and spearheaded the implementation of SGML within the Department of Energy. He can be reached at


The electronic document delivery market is one of the most complex and volatile in the entire computing industry. An incredibly vast array of products and fundamentally different technology approaches dominate this industry segment. With computers being viewed as tools for producing more than just paper documents, software vendors are almost tripping over themselves to integrate electronic delivery capabilities (especially HTML capabilities) into their product offerings. For some, SGML is just one of the available options. For others, SGML is a fundamental enabling technology.

This paper provides those who are new to SGML with an introduction to some of the core concepts that can influence the selection of electronic delivery and retrieval software. It describes the capabilities of a wide range of tools by focusing more on product categories than specific vendors. This is because 1) the marketplace is changing rapidly, 2) almost any viewer or browser can be used to deliver SGML documents, and 3) time and space do not allow for a complete survey of specific products.

This paper describes many of the things that I keep in mind when I'm looking at SGML browsing technologies. It starts by introducing a number of basic concepts that relate to electronic delivery and describing some of the major implementation choices. This is followed by a series of sections that describe categories of tools: a schema for classifying viewers and browsers, an overview of some of the query methods which are used in text retrieval systems, and an examination of a few of the issues and trends which are unique to CD-ROM and World Wide Web publishing.

Basic Concepts

A number of economic, organizational, and technical factors influence the design choices and investment decisions that are necessary to implement SGML. Many of these issues have direct bearing on which electronic delivery technologies should be deployed and the returns on investment which can be expected. Some of the more important issues include: the ways that SGML changes cost-benefit profiles within the document lifecycle; how divergent stakeholder interests and metadata requirements can impact electronic delivery; and how technology trends are blurring the distinction between information producers and consumers.

Changes to Cost-Benefit Profiles

There are many different ways to view a document lifecycle. When dealing with electronic delivery technologies, I prefer this one:


The acquisition of information, including the interpretation of information contained within documents


The creation of new documents


The revision of documents to make them conform to various structural and content standards


The revision of documents to make them conform to various appearance or encoding standards


The transformation of documents to a specific published form (e.g., paper or CD-ROM)


The distribution of documents


The holding of documents


The locating and accessing of documents


The reading of documents

Unlike other possible views of the document lifecycle, this one helps to differentiate the steps that involve the mechanical processing of data from those that focus on the way that humans interact with the information contained within documents. This distinction is important because of the fundamental economics of the transition to an SGML-based document management process: the use of SGML generally shifts the cost burden upstream and shifts the realization of benefits downstream.

Up-front costs are increased in a variety of ways. Document analysis, Document Type Definition (DTD) development, new tools and training requirements, and conversions of legacy data are significant expenses. The imposition of new quality control requirements often increases costs during the authoring and editing phases. If authors and editors continue to work with unstructured tools, additional conversion costs are added during the formatting phase of the lifecycle.

In return, SGML gives individuals and organizations better ways to publish, deliver, store, retrieve, view, and interact with their documents. Some of these potential benefits are concerned primarily with mechanical efficiency, others with human interaction and performance. The choices that an organization or project team makes when balancing these potentially competing measures of value have tremendous impact on how (and even whether) potential and intended benefits are fully realized.

Stakeholder Interests and Metadata Requirements

Metadata (data about data) is at the core of these choices. Information, by itself, is not terribly valuable anymore. There is simply too much of it. Metadata, by contrast, is increasing in importance because it provides the handles needed by computers to determine how to process the data and the hooks needed by humans to help identify which pieces of information are relevant to their interests.

What is metadata? The SGML tags within a document instance are metadata. They describe the role of each element within the context of the document's structure. Attributes are metadata, as they further describe important characteristics of the data within the SGML instance. Titles, authors, publication dates, and index numbers are metadata, as are annotations, bookmarks, and other navigational aids.

TV Guide is one of the best examples of metadata and its increasing importance. With the exception of the horoscopes and advertisements, TV Guide is almost entirely metadata, and not so long ago, Wired magazine reported that TV Guide makes more money than the four major networks combined.

When SGML is used to develop a vendor and processing-neutral markup language, the resulting DTD is a formalized framework for capturing and storing metadata. As such, this metadata framework represents a negotiated balance between the divergent stakeholder interests that exist at different points in the document lifecycle.

It is not uncommon, for example, for authors and editors to desire a simple markup language that is easy to use. In some cases, however, the interests of authors and editors may diverge, with authors desiring greater flexibility and editors wanting a greater emphasis on rigorous structures and automated validation.

A similar divergence can exist among those stakeholders who are primarily concerned with the mechanical efficiency of the document lifecycle (the formatting, publishing, delivery, and storage phases) and those stakeholders whose interests are targeted at the retrieval, viewing, and research phases.

Stakeholders, such as publishers, that are mostly concerned with mechanical efficiency will usually express their interests in terms of cost savings and by using phrases like create once, publish many. They will often focus on the structural aspects of the document, as these will usually support the range of publishing variations that are expected. In addition, structure-oriented DTDs will usually support a wide variety of document instances and require little maintenance (thus yielding additional cost savings).

Information consumers, on the other hand, usually desire richer, more complex sets of metadata. Instead of being satisfied with a DTD that reflects the generic structures of the document (e.g., chapter and title), tags that capture the meaning of the data (e.g., purpose, scope, rationale, part number, voltage, person, software package, company) are preferred.

Rich metadata allows documents to better function as databases and can have important benefits when using retrieval tools that support context-sensitive searches. By making retrieval easier and more cost effective, this human-centered approach to SGML can enhance the way that people interact with documents to enrich collaboration, learning, decision making, innovation, and the acquisition and development of knowledge.
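The payoff of meaning-oriented tags for context-sensitive searching can be sketched as follows. This is a minimal illustration, not any vendor's actual query engine, and the element names (part-number, rationale) are invented for the example.

```python
# A sketch of a context-sensitive search: with meaning-oriented tags,
# a query can be scoped to a particular element type instead of
# matching strings anywhere in the text. Element names are invented
# for illustration.
records = [
    {"element": "part-number", "text": "XJ-900"},
    {"element": "rationale",   "text": "Part XJ-900 was discontinued"},
    {"element": "part-number", "text": "RM-100"},
]

def scoped_search(records, query, element=None):
    return [r["text"] for r in records
            if query in r["text"]
            and (element is None or r["element"] == element)]

print(scoped_search(records, "XJ-900"))                        # both hits
print(scoped_search(records, "XJ-900", element="part-number")) # precise hit
```

With a structure-only DTD, the second, more precise query would be impossible: the retrieval tool would have no way to know which strings are part numbers.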

These benefits cannot be measured easily in strictly financial terms, and since such an approach is usually more expensive than the development and use of structure-oriented DTDs, many organizations find it difficult to justify the additional cost. At the same time, these organic measures of value can be central to the SGML implementation effort and a major source of strategic value. As the information density of business transactions continues to increase, organizations that deliver richer, more useful information products to their customers are likely to realize competitive advantages.

Blurring the Distinction Between Producers and Consumers

In traditional paper-based publishing, the various steps in the document lifecycle were finite and discrete, and each phase produced a paper artifact which required human involvement. These fundamental dynamics are so firmly entrenched that even where computers are used, human labor is often required to integrate and interpret individual pieces of information throughout the document lifecycle. Although vast amounts of paper have been replaced with electronic deliverables, different proprietary encodings often act as barriers to exchange and reuse.

SGML-based document management approaches, on the other hand, have been proven to reduce the need for humans to perform mechanical transformation of data and allow them to focus on more creative, knowledge-intensive activities. Because of this, traditional divisions of labor are being de-emphasized and the distinction between information producers and consumers is being blurred.

The very term browsing is an example of this trend. As will be seen in the next section, high-end SGML browsers integrate a wide variety of viewing, retrieval, navigation, and data collection tools. These help to close the gap between viewing and authoring and make the document lifecycle truly a cycle. While none of the tools have the same powerful authoring capabilities found in dedicated editors, the functional trends are fairly clear.

Viewers and Browsers

A wide variety of tools can be used for displaying SGML data. Generally, they fall into three classes: Readers, Viewers, and Browsers. Readers are used to display the contents of files without any interpretation or rendering. Viewers add interpretation and rendering capabilities but base most of their rendering on formatting codes (metadata) which were designed to support the printing of paper hard copy. Browsers abandon the page metaphor to provide an electronic delivery environment that is more in tune with the capabilities and constraints of computer displays. In addition, they are generally more powerful and better able to exploit the information content of an SGML-encoded document to offer improved navigation and retrieval.

This paper uses the following schema to distinguish individual categories of Readers, Viewers, and Browsers:

These categories of tools are differentiated primarily by the way the information is encoded for delivery. This delivery encoding is closely related to the richness of the metadata that the software can make use of, and this relationship has important implications for the document lifecycle. It is not uncommon for SGML DTDs to be designed around the strengths or weaknesses of a particular viewer or browser. Because of this, the metadata that the delivery tool supports can limit not only the options for user interaction and the potential returns on investment, but even the long-term value of SGML documents.

Please note, where specific products are mentioned, they are only used as examples and do not comprise an exhaustive listing of the products in each category. In addition, most vendors are aggressively improving their products and many are working to incorporate more robust support for SGML. It is possible that I may have missed an important product announcement or even that some of the vendors mentioned in this document may announce new product offerings at SGML'96 and change their placement in this schema.

Text Readers

Text Readers simply display the contents of the file. They give you a WYSIWOD (What You See Is What's On Disk) view of your data. If the file only contains text, it usually looks pretty good. If it contains binary data, the non-ASCII characters are displayed in place and the file can be hard to read. In the vast majority of cases, when SGML data is displayed using a text reader, the tags are displayed as ASCII character streams. Most people don't like Text Readers because of their inability to provide a richly formatted visual representation of the document.

Although I don't know of many SGML implementations that use Text Readers as a primary delivery tool, it is not out of the question. A fairly simple filter could be used to convert SGML into an untagged ASCII representation, using carriage returns, line feeds, spaces, tabs, and perhaps even punctuation for visual formatting. You would probably end up with something that looked an awful lot like UNIX man pages: text files that contain fairly consistent formatting and an implicit structure. Vernon Buerg's List program is my favorite tool in this product category.
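Such a filter could be sketched in a few lines. This naive, regex-based version ignores comments, entities, and CDATA, and the element names (title, para, item) are illustrative rather than from any particular DTD.

```python
import re

def sgml_to_text(sgml):
    """Strip the tags from an SGML fragment, inserting blank lines
    after block-level elements to preserve an implicit structure."""
    # Turn the end of each block-level element into a paragraph break
    text = re.sub(r"</(para|title|item)>", "\n\n", sgml, flags=re.IGNORECASE)
    # Drop all remaining tags
    text = re.sub(r"<[^>]+>", "", text)
    # Collapse runs of blank lines and trim the edges
    return re.sub(r"\n{3,}", "\n\n", text).strip()

doc = "<title>Man Pages</title><para>Plain text with implicit structure.</para>"
print(sgml_to_text(doc))
```

The output is exactly the kind of consistently formatted, implicitly structured text file that a Text Reader handles well.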

Native File Viewers

This class of viewing software is used to view word processing and desktop publishing files in their native format. cc:Mail, for example, uses Native File Viewers (Outside In and Quickview Plus) to display attached files. Microsoft Windows 95 includes a utility called Quickview, which can also be used to view a variety of native file formats. In some cases, Native File Viewers do not exist as separate products but are only available as functions within other software products. A few companies, like Mastersoft, do sell Native File Viewers in both the OEM and consumer markets.

The ability to view native word processing and desktop publishing files means that there is virtually no publishing process to speak of. This is another class of viewer that isn't used very often to deliver SGML data, but if it were, the publishing process would involve conversion of the SGML data into some proprietary editing format.

Generally, the quality of the rendering is fairly limited. In some cases, interpretation of the proprietary formatting codes is imperfect and doesn't match the formatting of the native editing environment. In addition, support for embedded graphics tends to be a problem. For most implementations, these aren't major concerns, as the primary goal is to provide low-cost access to legacy documents. For example, many OpenText text retrieval implementations use the Mastersoft Word for Word software to support indexing and viewing of a wide variety of native file formats.

Raster Viewers

Raster Viewers are designed to display bitmapped images (usually TIFF and CCITT Group 4). This allows them to provide a good representation of the printed page, preserving its layout, typography, illustrations, and other visual elements. It is not unusual to find Native File Viewers and Raster Viewers combined into the same product (e.g., AutoVue Professional).

Production costs are fairly limited, usually involving the simple scanning of paper documents. Raster Viewers are very popular in the insurance industry, where imaging systems provide a fairly low-cost alternative to the routing of paper. Raster Viewers are also often used in conjunction with more robust SGML delivery tools to render and display the graphics images which may be referenced in an SGML document instance.

Raster Viewers aren't a very attractive alternative for displaying textual data, however. To a computer, paper is dead and scanned pages aren't a whole lot better. Because raster images are just a collection of dots, they are not very useful for searching or retrieving. As a result, some hybrid systems use a combined image/text approach, where OCR is used to convert the scanned images to text files, but the record copy of the data is still the bitmap.

Not many implementations use Raster Viewers to deliver SGML data and those that do are probably based on the scanning of paper documents. Raster Viewers can provide some degree of interactivity, allowing highlighting and annotations to the page images. Normally, these notations are stored in a separate file and displayed as overlays at the time of presentation.

Page Viewers

Adobe Acrobat, WordPerfect Envoy, and No Hands Common Ground are examples of the products that fit into this category. All of these products use proprietary file formats that store page images. In most cases, these files are produced, not by scanning paper documents, but by printing through a special filter or series of filters. This makes the publishing process fairly easy. It also provides a better visual rendering than is normally seen with Native File Viewers because the native application's print engine is involved in the rendering.

Page Viewers have many important advantages over the viewing of raster images. The biggest single advantage is that they capture textual data in a searchable form (not just as a series of dots). Because the file formats are proprietary, however, the range of tools that can be used for search and retrieval can be rather limited and, in some cases, constrained to the vendor's own product offerings. Another advantage over raster images is that Page Viewers usually have better support for color.

Besides the ability to support highlighting and annotations, some Page Viewers also provide mechanisms for embedding hyperlinks in the deliverables. These are normally used for such things as linking Table of Contents entries and lists of figures to their locations in the document, and linking key terms to their glossary entries.

Because Page Viewers use proprietary file formats, however, attention should be paid to where labor costs are incurred during the production process. Labor which is expended after conversion (i.e., printing) to the vendor's proprietary file format is usually lost when publishing the next version of the document. This is especially true when dealing with hyperlinks. In most cases, hyperlinks are inserted manually using the vendor's publishing tools and must be re-entered when a revised document is imported.

Page Viewers are the first category of software in this schema that are given serious consideration as vehicles for delivering SGML data and as serious alternatives to SGML browsers. Many individuals consider Page Viewers pretty but dumb, however, especially where the publishing of SGML data is concerned. The publishing process is simple, but a lot of important metadata is stripped out and lost. At one time, Adobe claimed that a future version of Acrobat would support bidirectional conversions from and to SGML. This has not happened yet.

Binary Browsers

Binary Browsers also use proprietary, binary file formats (like Page Viewers), but they aren't tied to a page image. While the majority of electronic delivery products seemed to fit this category a couple of years ago, more and more of them seem to be shifting to the Fixed DTD and Arbitrary DTD categories and becoming more SGML-like. At this time, products like Folio VIEWS, Lotus SmarText, HyperWriter, and the Microsoft Help browser still appear to fit in this category.

The products in this category exhibit a wide range of functionality. Although most of them were originally developed to work with word processing files, they can be used to deliver SGML data. To do this, the SGML data stream must be converted into the non-SGML binary files used by the browser. While filters can be used to perform some of the conversion, most of these tools are more like authoring environments than publishing environments.

This authoring process usually requires significant interaction with a vendor-supplied styles editor to design screens, format documents, and add navigation aids and hypermedia links. As is the case with the Page Viewers described above, labor which is expended after the conversion from SGML will probably be lost if a revised document is imported into the publishing environment. Accordingly, vendors are beginning to introduce richer sets of importation filters which allow SGML document structures to be mapped directly to the capabilities of the delivery tool and reduce the level of interactive authoring.

Fixed DTD Browsers

A Fixed DTD Browser is a tool that uses SGML as part of the product architecture but only works with a small number of vendor-selected DTDs. Oracle Book and InfoAccess Guide are examples of products in this category. HTML browsers (Mosaic, Netscape, Internet Explorer, etc.) also fit within this category, in that they operate against a finite set of HTML DTDs (ignoring, for the moment, whether the software actually uses a DTD or just a formalized tag set that could ideally be represented as a DTD).

NCSA Mosaic and some of the other HTML browsers are relatively unusual among Fixed DTD Browsers, however, because the versions of HTML that they support are not proprietary to the organizations that developed the browsers. Most of the tools in this category, on the other hand, use proprietary DTDs. This has certainly proved to be the case with Netscape and Internet Explorer, as each vendor has a well-established history of introducing a series of proprietary versions of HTML that include tool-specific extensions. While a proprietary markup language blurs the distinction between Binary Browsers and Fixed DTD Browsers, some important differences remain.

First, when used with SGML documents, these tools support a publishing process which can be best described as mapping, where the elements in the source DTD are mapped to elements (and thus indirectly to functions) in the delivery environment. Some Fixed DTD Browsers use a markup language which is very structurally oriented. Some use a markup language which is more visually oriented. When evaluating products in this category, it is important to determine whether the target markup language is a good fit for the source data that you will be producing and the ways that you wish to render it.
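The mapping model can be sketched as a simple lookup from source element types to delivery element types. In practice such mappings are usually expressed in a filter tool such as OmniMark or Perl rather than Python, and both the source element names and the mapping below are invented for illustration.

```python
# Each element type in a hypothetical source DTD is mapped to an
# element in the delivery markup language (HTML here), and thus
# indirectly to the delivery tool's rendering functions.
ELEMENT_MAP = {
    "chapter": "div",   # structural container
    "title":   "h1",    # rendered as a heading
    "para":    "p",
    "emph":    "em",
}

def map_element(source_tag, content):
    target = ELEMENT_MAP.get(source_tag)
    if target is None:
        # An unmapped element signals a poor fit between the source
        # DTD and the delivery markup -- worth catching in evaluation.
        raise KeyError(f"no mapping for element: {source_tag}")
    return f"<{target}>{content}</{target}>"

print(map_element("title", "Basic Concepts"))  # <h1>Basic Concepts</h1>
```

A source element with no sensible target (a part number, say, in a visually oriented markup language) is precisely where metadata gets lost in this publishing model.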

Second, there are more options for automating SGML to SGML transformations than for converting SGML to a proprietary binary format. Not only do many of the vendors in this category provide customized filters, but other popular filter tools can also be used (e.g., OmniMark and Perl).

Third, the stronger SGML bias shows in a better separation between content and styles. This allows sets of formatting commands to be designed once for a given DTD and used for both multiple document instances and multiple versions of the same document. While rules-based formatting may be possible with the Binary Browsers, the products in this category tend to have more mature and robust support.

Although most of the tools in the previous categories provide ways to capture input from readers (e.g., bookmarks and annotations), many of the products in this category extend those capabilities to provide low-level data collection and authoring functions that are designed to be integrated back into the document lifecycle. Most of the HTML browsers, for example, include forms capabilities and integrate email functions.

Interactive Electronic Technical Manual (IETM) browsers are a specific subclass of Fixed DTD Browsers. Because they are used to both display technical procedures and support a complex variety of interactive and automated behaviors, they require very rich and sophisticated semantics. IETMs are often used to collect operator input, integrate it with real-time data from diagnostic systems, and use that information to customize the logical flow of the procedure's steps. IETMs can also be used to route operator feedback in real-time to other information management and reporting systems.

The U.S. Department of Defense has developed a number of specifications for IETMs (including MID, the Metafile for Interactive Documents), but IETM browsers tend to remain customized, one-off solutions that are either developed from scratch or based on other SGML and HyTime delivery tools. Commercial, off-the-shelf IETM browsers are hard to find, in part, because the market lacks a single, standard, IETM DTD. Day and Zimmerman's Interactive Presentation Manager (DZIS-IPM), for example, was exhibited at last year's conference, but is no longer being marketed.

Arbitrary DTD Browsers

These browsers are designed from the ground up to render SGML data and are truest to the philosophy of SGML. By accepting arbitrary DTDs, these products do not require that a document instance be restructured, converted, or mapped into a vendor-specified tag set.

These tools not only retain all the metadata in the SGML document instance, but they also maintain clear separation among a document's structure, content, and visual rendering. Electronic Book Technologies' DynaText, Inforium's LivePAGE, Jouve's GTI Publisher, and SoftQuad's Panorama are examples of the products in this category. Synex's ViewPort (which is the basis for Panorama and a number of other browsers) supports arbitrary DTDs, but is a browser engine, not a fully-functional browser.

The publishing process is focused around the DTD that was used to structure the incoming document instance. Styles are defined for each element type in the DTD and stored in a separate file, which is normally called a style sheet. Multiple style sheets can be defined for the same DTD.

One of the primary functions of the browser is to merge data and styles at the time of rendering (often based on the decisions of the reader). This preserves flexibility in a way that is normally not possible with other viewers or browsers. It is also common for multiple style sheets to be used at the same time (e.g., DynaText normally uses one style sheet for the Table of Contents and another for the full text of the document). Some of these browsers even allow the reader to define their own custom styles.
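Render-time merging of content and styles can be sketched as follows. The two style sheets, their properties, and the bracketed output format are invented for illustration; real browsers use far richer style models.

```python
# Two style sheets for the same hypothetical DTD. The browser (or the
# reader) picks one at rendering time, so the document instance itself
# never changes -- only the merge of data and styles does.
full_text_styles = {"title": {"size": 18, "bold": True},
                    "para":  {"size": 11, "bold": False}}
toc_styles       = {"title": {"size": 11, "bold": False}}

DEFAULT = {"size": 11, "bold": False}

def render(element_type, content, style_sheet):
    style = style_sheet.get(element_type, DEFAULT)
    weight = "bold" if style["bold"] else "normal"
    return f"[{style['size']}pt/{weight}] {content}"

# The same element renders differently depending on the sheet in use
print(render("title", "Viewers and Browsers", full_text_styles))
print(render("title", "Viewers and Browsers", toc_styles))
```

Because the styles live outside the instance, re-styling a document (or a whole library of documents sharing a DTD) requires no changes to the SGML data itself.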

Style sheets and other browser-specific sets of metadata are usually stored as SGML files and can be created or revised without using the vendor-provided editing tools. At this time, the DTDs used to structure style sheet files are proprietary, but in time, they are almost certain to become compliant with DSSSL (the Document Style Semantics and Specification Language) or a well-accepted dialect of DSSSL (e.g., DSSSL-Lite or DSSSL-Online).

An important issue to consider when comparing products in this classification is whether the browser renders native SGML document instances, or requires them to be compiled into a delivery format. Pre-compiling tends to make it easier to protect the source data but can also makes the tools less attractive for rendering SGML that is generated on-the-fly from databases and other automated systems.

While many, if not most, electronic delivery tools support hyperlinks so that Tables of Contents and other navigational aids can be built or coded, the tools in this category usually have the most sophisticated methods for handling such navigation. In Panorama, for example, Tables of Contents are supported through a feature called navigators. Multiple element types can be flagged for inclusion in a navigator, thus creating a hierarchical Table of Contents which can be expanded and collapsed. Multiple navigators can be applied to the same document to speed access to different sets of elements (e.g., figures, tables, code fragments, etc.).
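The navigator idea reduces to selecting flagged element types out of the document tree. The flattened document representation below is a simplification invented for the example, not Panorama's actual data model.

```python
# A flattened document tree: (depth, element type, text). Flagged
# element types are pulled out to build an expandable Table of
# Contents; a second navigator over the same document can target a
# completely different set of elements.
document = [
    (1, "chapter", "Basic Concepts"),
    (2, "section", "Changes to Cost-Benefit Profiles"),
    (2, "figure",  "Document lifecycle"),
    (2, "section", "Stakeholder Interests"),
    (1, "chapter", "Viewers and Browsers"),
]

def build_navigator(doc, flagged_types):
    return [(depth, text) for depth, etype, text in doc
            if etype in flagged_types]

# A structural navigator...
for depth, text in build_navigator(document, {"chapter", "section"}):
    print("  " * (depth - 1) + text)

# ...and a second one that speeds access to just the figures
print(build_navigator(document, {"figure"}))
```

Because the navigator is derived from the markup rather than hand-built, it stays correct when the document is revised and republished.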

Some of the most exciting features in this category are those that use HyTime (or similar addressing models) to collect information from users in the form of links, annotations, bookmarks, or even historical journals of the locations which have been visited (Grif's SGML ActiveViews). As compared with other approaches for doing links and annotations, the more extensive use of SGML and HyTime in this product category has two primary advantages: 1) Pointing to locations formalizes the relationships between this user-generated data and the base documents without requiring changes to the base documents. This is in contrast to other approaches (e.g., HTML), which require that anchors or their equivalent be inserted directly in the document and thus require write access to the file. 2) Capturing this data as an SGML data stream makes it easier to recycle it in a formalized publishing process, as compared with tools that store such additions in proprietary binary formats.

These features dramatically blur the distinction between viewing and authoring by allowing both data and metadata to be collected, not through the use of dedicated authoring tools, but as a function of browsing. Such an approach could, for example, help subject matter experts not only to codify and document their understanding of the complex relationships inside libraries of documents but to do it in a form that could be easily recalled and expanded by other users of the system. In addition, the use of addressing allows such analysis to be layered on top of the pool of documents. The layering of analysis and the ability to support multiple layers of analysis would be powerful mechanisms for formalizing the tacit knowledge that organizations depend upon and could create a new medium of communication and exchange.

In time, these consumer-defined metadata collections could even solve many of the infoglut problems that limit the effectiveness of large document collections by providing organic alternatives to the engineering of complex, content-oriented DTDs and other fixed metadata structures. This could prove to be especially valuable in multi-tiered regulatory environments, for example, where groups of experts need to pool and integrate their understandings to both guide organizational behavior and respond efficiently to regulatory changes. To be used in this fashion, tools should allow system integrators to augment link and annotation DTD fragments with their own, application-specific data models.

Text Retrieval Engines

At its simplest, text retrieval involves searching and string matching. It's a rare electronic delivery tool that doesn't support simple text searches within a single, onscreen document. For most applications, however, this is far from adequate. To be useful, text searching must be done against a body of documents. Full text indexing and retrieval systems address these needs.

When discussing text retrieval, the terms precision and recall are fairly important. Precision refers to the ability to retrieve only what is desired, and not a lot of extraneous (noisy) data. Recall is the ability to retrieve everything that is of interest. Ideally, a query returns everything that you are looking for (recall) and nothing that you aren't (precision). Query results are never ideal.
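The two measures can be made concrete with a small calculation. This sketch uses invented document identifiers; the definitions themselves are the standard ones.

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved documents that are relevant.
    Recall: fraction of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# A query returns docs 1-4, but only docs 1, 2, and 5 are relevant
p, r = precision_recall({1, 2, 3, 4}, {1, 2, 5})
print(p, r)  # 0.5 precision (2 of 4 retrieved), ~0.67 recall (2 of 3 relevant)
```

Tightening a query typically raises precision at the cost of recall, and vice versa, which is why query results are never ideal.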

Mechanically, full text retrieval is almost always a two-step process. The first step involves an indexing function. Although vendors use different indexing approaches, this step usually occurs somewhere within the publishing phase of the document lifecycle. An exception to this is some of the indexing which is being done on the World Wide Web, where software tools are indexing documents after they have been published on the Internet.
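The indexing step usually amounts to building an inverted index, a map from each word to the documents that contain it. The sketch below is a generic illustration (the document texts and ids are invented) and ignores real-world concerns such as stemming and stop words:

```python
import re
from collections import defaultdict

def build_index(documents):
    """Index a body of documents once, at publishing time: map each
    lower-cased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index

docs = {
    "a": "SGML viewers and browsers",
    "b": "Text retrieval engines index documents",
    "c": "Publishing SGML documents to CD-ROM",
}
index = build_index(docs)
index["sgml"]       # {"a", "c"}
index["documents"]  # {"b", "c"}
```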

The second step is the specification and processing of a user-defined query. The way that queries can be constructed and processed is the greatest single differentiating factor among retrieval tools. The more common query approaches are boolean searches, weighted thesauruses, vector searches, and context-sensitive searches.

Boolean Searches

The simplest query model is the boolean search. In addition to searching for a specific string, systems that support this approach (virtually all of them) allow multiple strings to be searched for. AND, OR, and NOT operators can be combined with the specified strings to influence precision and recall (e.g., get me the documents that have SGML AND pasta in them).
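When queries run against an inverted index, the boolean operators fall out directly as set operations. This is a generic sketch, with invented function and parameter names:

```python
def boolean_query(index, all_ids, must=(), should=(), must_not=()):
    """Evaluate a boolean query against an inverted index.

    must:     terms joined by AND
    should:   terms joined by OR (ignored if empty)
    must_not: terms removed with NOT
    """
    results = set(all_ids)
    for term in must:                     # AND: intersect posting sets
        results &= index.get(term, set())
    if should:                            # OR: union, then intersect
        ored = set()
        for term in should:
            ored |= index.get(term, set())
        results &= ored
    for term in must_not:                 # NOT: subtract posting sets
        results -= index.get(term, set())
    return results

# "SGML AND pasta" over a toy index:
index = {"sgml": {"a", "c"}, "pasta": {"c", "d"}}
boolean_query(index, {"a", "b", "c", "d"}, must=("sgml", "pasta"))  # {"c"}
```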

Weighted Thesauruses

Verity, with their Topic product, has done most of the pioneering work with weighted thesauruses. This approach was developed originally for the CIA to help process large amounts of incoming data to determine which information deserved further attention.

It works something like this. Imagine being interested in information about outer space. A fairly large vocabulary of words could be used to identify documents about outer space: space, rockets, Boeing, moon, NASA, stars, shuttle, Hubble, etc. Some of these words are likely to be strong indicators of relevance (e.g., NASA) and others are less likely (e.g., movie stars). The weighted thesaurus allows a hierarchy of terms to be constructed, where each node and branch can be given a number to indicate its probable relevancy.

When queries are formed by referencing these key terms, a complex set of string matching and statistical calculations is used to rank target documents for relevancy. When the thesauruses are well-designed and maintained, this is a very effective retrieval approach. It may not be as good a choice for performing predominantly ad-hoc queries, though, where the cost of crafting well-designed vocabularies of search terms is hard to justify and much of the power of the query tool remains unused.
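The scoring idea can be reduced to a toy sketch. The topic vocabulary and weights below are invented, and a real product such as Topic uses a full term hierarchy and far more sophisticated statistics:

```python
# A toy "outer space" topic: each evidence term carries a weight
# reflecting how strongly it indicates relevance.
topic = {"nasa": 0.9, "shuttle": 0.8, "hubble": 0.8,
         "moon": 0.5, "rockets": 0.5, "stars": 0.2, "boeing": 0.2}

def score(text, topic):
    """Sum the weights of the topic terms that appear in the text,
    normalized so a document matching every term scores 1.0."""
    words = set(text.lower().split())
    matched = sum(w for term, w in topic.items() if term in words)
    return matched / sum(topic.values())

docs = {
    "press":  "nasa launches shuttle to service hubble",
    "movies": "the stars of this movie shine",
}
ranked = sorted(docs, key=lambda d: score(docs[d], topic), reverse=True)
# "press" ranks first: its strong indicators (nasa, shuttle, hubble)
# outweigh the weak "stars" match in the movie review
```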

Vector Searches

From what I can tell, Gerard Salton of Cornell University developed this method, and only a few retrieval engines support it at this time.

Imagine taking every article in Byte magazine and counting the number of times that the words hardware and software appear in each article. Plot each article on a grid with the number of occurrences of hardware on the vertical axis and the number of occurrences of software on the horizontal axis. Next, take an article that is a good example of what you are looking for and plot its location on the grid. Vector math can then be used to find the nearest article, which will have a similar combination of hardware and software.

Admittedly, the above example is simplistic and a bit contrived. This approach becomes much more useful, however, when the index contains thousands of keywords that have been carefully chosen. Some researchers are using vector searches to replace hard-coded hyperlinks and are integrating these search engines with graphical displays, where dot clustering, color changes, and other visualization techniques are used to communicate proximity.
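For illustration, here is a minimal version of the vector approach, using the two-keyword hardware/software grid from the example above. Cosine similarity is one common proximity measure; the article texts and names are invented:

```python
import math

def count_vector(text, vocabulary):
    """Represent a document as keyword-occurrence counts."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

def cosine(a, b):
    """Angle-based similarity: 1.0 means identical keyword mix."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

vocab = ["hardware", "software"]
articles = {
    "review":   "hardware hardware hardware software",
    "tutorial": "software software software hardware",
}
query = "new hardware hardware benchmarks and some software"
qv = count_vector(query, vocab)
best = max(articles, key=lambda a: cosine(qv, count_vector(articles[a], vocab)))
# best == "review": its hardware/software mix is closest to the query's
```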

Context-Sensitive Searches

The preceding search methods can be used with either structured or unstructured data and can be found in a wide variety of products. Context-sensitive searches, on the other hand, require structured data and are usually only found in those products that have a solid SGML foundation. OpenText, DynaText, and Explorer are some of the products that support this searching method.

Context-sensitive searches are performed by specifying both the text which is to be found and the element that it is to be found in. This approach can significantly improve precision. By searching for words only in document titles, for example, the absolute number of hits will be lower, and the precision will generally be higher.
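The idea can be sketched with a well-formed fragment and Python's standard XML parser. A real system would use an SGML-aware parser, and the element names here are invented for illustration:

```python
import xml.etree.ElementTree as ET

# A well-formed fragment standing in for an SGML document instance.
doc = ET.fromstring("""
<report>
  <title>1978 Maintenance Summary</title>
  <section>
    <title>Replacement Parts</title>
    <para>Order partnumber 4431 before 1978 units fail.</para>
  </section>
</report>
""")

def search_in_context(root, element, text):
    """Return elements of the given type whose content contains the text."""
    return [el for el in root.iter(element)
            if text in "".join(el.itertext())]

search_in_context(doc, "title", "1978")  # matches only the report title
search_in_context(doc, "para", "1978")   # matches only the paragraph
```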

The desire to perform context-sensitive searches can often have a significant effect on tool selection and markup strategies. Page Viewers, Binary Browsers, and Fixed DTD Browsers can have trouble supporting context-sensitive searches because the original SGML markup is likely to have been stripped out or converted to another metadata framework.

DTDs that are designed to enhance the effectiveness of context-sensitive searches are likely to have more elements than those designed mostly to support visual formatting. Being able to search for part number elements that contain 1978 is very different from just searching for 1978 or even part number 1978. It is possible for DTDs to become incredibly large and complex as information consumers strive to have every possible context be formalized in the SGML DTD to support potential search strategies.

CD-ROM Publishing

Most of what has been described in this paper applies directly to CD-ROM publishing. Virtually all of the tools that have been described can be used for CD-ROMs, just as they can be used with data that resides on local hard disks, networked fileservers, and other media. Licensing arrangements may vary from vendor to vendor, however, making some tools more or less attractive than others.

One of the few issues specific to CD-ROM publishing involves how the documents are encoded on the disk. While some CD-ROMs, like the SGML World Tour, contain native SGML files, intellectual property interests may dictate that the documents be stored in a binary representation that cannot be converted back to revisable text.

An emerging trend in CD-ROM publishing is to integrate CD-ROMs with databases, the World Wide Web, and other online services. CompuServe and Encyclopedia Britannica appear to be some of the first to be doing this. I can envision two popular approaches: 1) using the CD-ROM to distribute pieces of information that are heavily used or fairly static (like graphics), thereby cutting bandwidth requirements, reducing cost, and improving performance; and 2) keeping the data on CD-ROMs more current by integrating fragments of dynamic data that are accessed online.

Most of the early efforts at integrating CD-ROM and online publishing appear to be based on custom software. I am only aware of one commercial tool that has been designed to support CD-ROM/World Wide Web integration. Electronic Book Technologies' Matterhorn product is designed to allow URLs to be embedded on the CD and be used to retrieve Web pages when an appropriate hyperlink or icon has been activated.

Hybrid SGML/HTML Publishing

While HTML's near-ubiquity has made it the de-facto standard for digital paper, it really doesn't support a very complex set of interactive or online behaviors. Even with waves of enhancements and extensions, an increasing number of sites are moving to a two-tiered model: SGML and/or database technologies are used to manage richer metadata and document structures than are possible with just HTML, and a second layer of programs is used to paint HTML representations of this data. In many cases, this transformation layer actually assembles the documents dynamically from a set of SGML entities and/or database fields.

As was mentioned earlier, the production aspects of publishing SGML documents using HTML are roughly the same as for any other product that uses a fixed DTD. HTML is used as a visual display language, and some form of mapping process is used to translate SGML elements into HTML representations. Here, the technology emphasis is on filtering and conversion. Once the desired mappings (and other supporting behaviors) have been determined, a wide variety of tools and approaches exist to execute them. Programming languages like C, Awk, Perl, and OmniMark are often used.
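One minimal way to implement such a mapping is a tag-substitution table. The element names and mappings below are invented, the tags are assumed to be attribute-free, and a production filter would typically be written in a dedicated transformation language:

```python
import re

# Hypothetical element-to-HTML mapping table.
TAG_MAP = {
    "chapter": "div",
    "title":   "h2",
    "para":    "p",
    "emph":    "em",
}

def to_html(sgml):
    """Rewrite mapped start/end tags; leave unmapped tags alone.
    Assumes simple, attribute-free markup."""
    def rewrite(match):
        slash, name = match.group(1), match.group(2).lower()
        return "<%s%s>" % (slash, TAG_MAP.get(name, name))
    return re.sub(r"<(/?)(\w+)>", rewrite, sgml)

to_html("<chapter><title>Tools</title>"
        "<para>An <emph>SGML</emph> primer.</para></chapter>")
```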

Two main production models are emerging, which are differentiated by when the transformations take place.

The batch conversion approach involves taking a collection of SGML documents, converting them to HTML, and loading the resulting files onto a web site. A variation of this is the parallel conversion approach, where non-SGML source documents are converted into both SGML and HTML at the same time. Organizations that opt for the parallel approach, however, often find themselves trapped in a design paradox, where the incremental value of the SGML documents is hard to identify. Instead, the practical emphasis on HTML conversions results in SGML structures that don't differ much from HTML and are therefore unlikely to provide much long-term value. One of the perceived problems with the batch/parallel approach is the amount of disk space consumed.

Real-time conversions are more complex, but offer greater functionality. They use the HTTP protocol to communicate with software applications instead of the file system. Because they eliminate the need to physically store the HTML files, a larger universe of virtual HTML documents can be supported. Even more important, these virtual documents can be heavily customized to support more complex interactions. Multiple filters can be used to tune the look for different browsers. The filters can also be updated to modernize or change the look and feel of the site without forcing the underlying documents to be re-authored or modified. Customized forms and HTML pages can be generated on-the-fly to collect input from the users. Often, this is done by presenting the reader with a set of hotlinks that tell the server software which command is to be executed next.
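A bare-bones sketch of the real-time model follows, using an invented in-memory document store and a hard-coded tag mapping in place of a real transformation engine:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory document store; a real site would pull SGML
# fragments from a repository or database.
STORE = {"/intro": "<para>Welcome to the <emph>virtual</emph> site.</para>"}

def render(sgml):
    """Paint an HTML page from stored SGML at request time."""
    body = (sgml.replace("<para>", "<p>").replace("</para>", "</p>")
                .replace("<emph>", "<em>").replace("</emph>", "</em>"))
    return "<html><body>%s</body></html>" % body

class OnTheFlyHandler(BaseHTTPRequestHandler):
    """Serve virtual HTML documents that are never stored on disk."""
    def do_GET(self):
        sgml = STORE.get(self.path)
        self.send_response(200 if sgml else 404)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        if sgml:
            self.wfile.write(render(sgml).encode("utf-8"))

# HTTPServer(("", 8080), OnTheFlyHandler).serve_forever()  # run the site
```

Because the transformation happens per request, swapping in a different `render` function restyles the whole site without touching the underlying SGML.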

While doing some research on the web to support this update, for example, I found a couple of sites that did some elegant things to repackage their web pages as the result of my queries. The Folio site layered a return-to-query-results-page hotlink at the top of each page. The Fulcrum site actually updated the body of the document to surround each occurrence of the search term with special characters that were linked to the previous and next occurrence of the search term.

Many, if not most, of the companies that produce large-scale publishing and database tools have also developed specific applications to simplify the conversion from their internal formats. The degree of out-of-the-box automation that is provided is usually a function of how constrained (and thus predictable) the non-HTML encoding is. Generally, the more powerful and flexible systems require customization to render attractive HTML pages, and as a result, they function very much like a programming environment.

Often these toolkits provide more than just conversion support. Querying interfaces, link management and verification, automatic generation of Tables of Contents, and even editing, configuration control, and server updating services are becoming increasingly common. As a result, many of the historical distinctions between traditional SGML software vendors are starting to blur as each tries to migrate its tools away from its established niche and into more generalized frameworks for managing web sites. In a very real sense, the World Wide Web has spurred the development of integrated, mass-market, SGML-based document management solutions that support the entire document lifecycle, and not just portions of it. Even more important, these improvements have started to roll back into mainstream SGML products.

The delivery of SGML documents across the World Wide Web introduces another set of issues that deal with granularity. The average HTML document is much shorter than the average SGML document. That, combined with the relative slowness of modem-based internet connections, has led to various approaches for chunking SGML documents into a number of smaller fragments. When the dust settles, I'm hoping that conventions for chunking SGML documents leave some control with the reader to specify how much (or how little) information they wish to receive in a single transaction.
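One simple chunking convention is to split at a reader-selected element boundary. The sketch below assumes well-formed markup and invented element names:

```python
import xml.etree.ElementTree as ET

def chunk(document, unit):
    """Split a document into fragments at the given element boundary,
    so a reader can request one chunk per transaction."""
    root = ET.fromstring(document)
    return [ET.tostring(el, encoding="unicode") for el in root.iter(unit)]

doc = ("<book>"
       "<chapter><para>one</para></chapter>"
       "<chapter><para>two</para><para>three</para></chapter>"
       "</book>")
chunk(doc, "chapter")  # two chapter-sized fragments
chunk(doc, "para")     # three smaller fragments: reader-chosen granularity
```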

At the same time, software vendors are migrating an ever-increasing array of proprietary encodings to the Web. Although HTML is the lingua franca for web documents, hardly a week goes by without an announcement of a viewer, helper app, or plug-in that will allow yet another file format to be transmitted across the Web and rendered through a Web browser. Another important trend is the volatility of the HTML language itself. HTML has found itself to be a battleground for proprietary interests. Wave after wave of Microsoft- and Netscape-specific extensions have done much to undermine the portability of HTML documents. Taken together, these two trends are reducing the World Wide Web to a transport medium that segments the audience based on the client software which is being used. Accessibility is increasingly being determined by your software vendor, not by a vendor-neutral, system-wide architecture of interchange standards.

In the last year, the World Wide Web Consortium has stepped up its efforts to establish a set of next-generation standards that will encompass and thus survive the commercial forces that are dominating the development of the Web. The movement towards an established framework for extending HTML, generic SGML (Extensible Markup Language, or XML), conventions for delivering SGML document fragments, style sheets, and typed links are all evidence of this effort. In the few short years since HTML was introduced, it has proved itself to be an attractive and useful entry point for internet and intranet publishing. But both users and tools have become more sophisticated, and they need richer metadata structures to support a wider variety of behaviors. In the face of these challenges, the Web community appears to be returning to the SGML spring to sip from the primordial waters.

Some have tried to exploit weaknesses in the HTML standard to embed additional SGML elements in the data stream. These elements are intended to provide additional markup for downstream applications, but with the expectation that they will be ignored by HTML browsers. Others have adapted SGML browser technology to the Web, either as helper apps on the client side or as transformation engines on the server side. But in the future, DSSSL-based SGML to HTML transformations are likely to become increasingly commonplace; and the promise of SGML and XML plug-ins (and perhaps even plug-ins that will support the editing of SGML fragments) may even make the direct distribution of SGML data through the Web a practical reality in the not-too-distant future.


Document delivery and retrieval play important, sometimes critical roles in the document lifecycle. Because SGML tends to shift costs upstream and benefits downstream, the selection of electronic delivery and retrieval tools can dramatically influence 1) the cost-benefit ratios for the entire SGML project, and 2) where in the lifecycle benefits are realized.

SGML DTDs represent a negotiated balance among the divergent stakeholder interests that exist at different points in the document lifecycle. The way that the interests of authors (simplicity), publishers (structure), and consumers (richness) are balanced will drive many, if not most, of the major DTD design choices. These metadata choices will, in turn, influence the appropriateness of individual display and retrieval tools.

While a very broad range of readers, viewers, browsers, and text indexing tools can be used to deliver SGML documents, browsers and retrieval engines that are built to support SGML generally offer superior performance. As a rule, they better preserve the metadata richness of the original document instance, provide more flexibility during display, and support context-sensitive searching methods that can enhance the precision of query results.

SGML documents can be, and often are, delivered using HTML, but the limitations of the language have driven many Web publishers to augment their servers with sophisticated back-end tools that generate HTML on-the-fly to better preserve the richness and functionality of the source documents.

The shift from paper-based to SGML-based document lifecycles is blurring the distinction between information producers and consumers. Many of the high-end delivery tools include capabilities for capturing valuable information during the browsing session. These changes have the potential of shortening cycle times, speeding individual learning, improving collaboration, and enhancing organizational adaptation.


Copyright, The Sagebrush Group, 1996-2009.

This article is based on a paper which was presented at SGML'96, December 18-21, 1996, and published in the conference proceedings. It was an update of the SGML'95 version.