290x Filetype PDF File size 0.44 MB Source: frank.itlab.us
programming
by ]an Bentley pearls
SELF-DESCRIBING DATA
You just spent three CPU hours running a simula- %title Efficient string matching:
tion to forecast your company’s financial future, and an aid to bibliographic search
your boss asks you to interpret the output: %journal Communications of the ACM
Xvolume 18
Scenario 1: 3.2% -12.0% 1.1% %number 6
Scenario 2: 12.7% 0.0% 8.6% %month June
Scenario 3: 1.6% -8.3% 9.2% %year 1975
%pages 333-340
Hmmm. Blank lines separate entries in the file. A line that
You dig through the program to find the meaning begins with a percent sign contains an identifying
of each output variable. Good news-Scenario 2 term followed by arbitrary text. Text may be con-
paints a rosy picture for the next fiscal year. Now all tinued on subsequent lines that do not start with
you have to do is uncover the assumptions of each. a percent sign.
Oops-the disaster in Scenario 1 is your company’s The lines in the bibliography file are name-value
current strategy, doomed to failure. What did Sce- pairs: each line contains the name of an attribute
nario 2 do that was so effective? Back to the pro- followed by its value. The names and the values are
gram, trying to discover which input files each one sufficiently self-describing that I don’t need to elabo-
reads. . . . rate further on them. This format is particularly well
Every programmer knows the frustration of trying suited to bibliographies and other complex data
to decipher mysterious data. The first two sections of models. It supports missing attributes (books have no
this column discuss two techniques for embedding volume number and journals have no city), multiple
descriptions in data files. The third section then attributes (such as authors), and an arbitrary order of
applies both methods to a concrete problem. fields (one need not remember whether the volume
number comes before or after the month).
Name-Value Pairs Name-value pairs are useful in many databases.
Many document production systems support biblio- One might, for instance, describe the aircraft carrier
graphic references in a form something like: USS Nimitz in a database of naval vessels with these
pairs:
Xtitle The Art of Computer Programming, name Nimitz
Volume 3: Sorting and Searching class CVN
Xauthor D. E. Knuth number 68
%pub Addison-Wesley displacement 81600
%city Reading, Mass. length 1040
%year 1973 beam 134
Xauthor A. V. Aho draft 36.5
%author M. J. Corasick flightdeck 252
speed 30
officers 447
0 1967 ACM OOOl-0782/87/0600-0479 75a enlisted 5176
June 1987 Volume 30 Number 6 Communications of the ACM 479
Programming Pearls
Such a record could be used for input, storage, and format. The regular structure supports the tabular
output. A user could prepare a record for entry into format, but observe that the header line is another
the database using a standard text editor. The data- kind of self-description embedded in the data.
base system could store records in exadtly this form Name-value pairs are a handy way to give input
(we’ll soon see a representation that is more space to any program. (They are perhaps the tiniest of the
efficient). The same record could be included in the “little languages” described in the September 1986
answer to the query “What ships have a displace- column.) They can help meet the criteria that
ment of more than 75,000 tons?” Kernighan and Plauger propose in Chapter 5 of their
Name-value pairs offer several advantages for this Elements of Progrumming Style (2nd ed., McGraw-Hill,
hypothetical application. A single format can be 1978):
used for reading, storing, and writing records; this Use mnemonic input and output. Make input easy to
simplifies life for user and implementer alike. The prepare (and to prepare correctly). Echo the input and
application is inherently variable-format because any defaults onto the output; make the outpui self-
different ships have different attributes: submarines explanatory.
have no flight decks and aircraft carriers have no Name-value pairs are useful in code far removed
submerged depth. Unfortunately, the example does from input/output. Suppose we wanted to provide a
not document the units in which the various quanti- subroutine that adds a ship to a database. Most lan-
ties are expressed; we’ll return to that shortly. guages denote the (formal) name of a parameter by
Some database systems store records on mass its position in the parameter list. This mechanism
memory in exactly the form shown above. This for- leads to remarkably clumsy calls:
mat makes it particularly easy to add new fields to
records in an existing database. The name-value for-
mat can be quite space efficient, especially com- addship("Nimitz", "CVN", "68",
pared to fixed-format records that have many fields, 81600, 1040, 134, 36.5,
most of which are usually empty. If storage is criti- 447, 5176,,,30,,,252,,,,)
cal, however, then the database could be squeezed
to a compressed format: The missing parameters denote fields not present in
this record. Is 30 the speed in knots or the draft in
feet? A little discipline in commenting helps unravel
the mess:
Each field begins with a two-character name and addshipc "Nimitz", # name
ends with a vertical bar. The input and the stored " CVN" , # class
formats are connected by a data dictionary, which "68" , # number
might start: 81600, # disp
1040, # length
ABBR NAME UNITS . . . 1
na name text
cl class text Many programming languages support named
nu number text parameters, which make the job easier:
di displacement tons
le length feet
be beam feet addshipcname = "Nimitz",
dr draft feet class = "CVN",
fl flightdeck feet number = "68",
sP speed knots disp = 81600,
of officers personnel length = 1040,
en enlisted personnel . . . 1
In this dictionary the abbreviations are always the
first two characters of the name; that may not hold Even if your language doesn’t have named param-
in general. Readers offended by hypocrisy may com- eters, you can usually simulate them with a few
plain that the above data is not in a name-value routines:
480 Communications of the ACM lune 1987 Volume 30 Number 6
Programming Pearls
shipstart “When looking at the result of a computation like
ship&r (name, "Nimitz") this, it is helpful to have an ‘audit trail’ of the vari-
shipstr (class, "CVN" ) ous command lines and data files encountered. We
shipstr (number, "68" ) therefore built a mechanism for ‘commenting’ the
shipnum(disp, 81600) files with various annotations so that when we re-
shipnum(length, 1040) view the output, everything is there on one display
. . . or piece of paper.
shipendc ) “We use several types of comments. An ‘audit
trail’ line identifies a data file or a command-line
The name variables name, class, number, etc., are transformation. A ‘dictionary’ line names the attri-
assigned unique integers. butes in each column of the output. A ‘frame separa-
tor’ sets apart a group of sequential records associ-
ated with a common event. A ‘note’ allows us to
Provenances in Programming place our remarks in the file. All comments begin
The provenance of a museum piece lists the origin with an exclamation mark and the type of the com-
or source of the object. Antiques are worth more ment; other lines are passed through untouched and
when they have a provenance (this chair was built processed as data. Thus the output of the above
in such-and-such, then purchased by so-and-so, pipeline might look like:
etc.). You might think of a provenance as a pedigree
for a nonliving object.
The idea of a provenance is old hat to many pro- ltrail sim.events -k 1.5 -1 3
grammers. Some software shops insist that the ltrail sample -t .Ol
provenance of a program be kept in the source code ltrail bins -t .Ol
itself: in addition to other documentation in a mod- Idict bin-bottom-value item-count
ule, the provenance gives the history of the code 0.00 72
(who changed what when, and why). The prove- 0.01 138
nance of a data file is often kept in an associated file 0.02 121
(a transaction log, for instance). Frank Starmer, a . . .
Professor in the Departments of Computer Science lnote there is a cluster around 0.75
and Medicine at Duke University, tells how his pro- lframe
grams produce data files that contain their own
provenances:
“We constantly face the problem of keeping track All programs in our library automatically copy exist-
of our manipulations of data. We typically explore ing comments from their input onto their output,
data sets by setting up a UNIX@ pipeline like and additionally add a new t r a i 1 comment to doc-
ument their own action. Programs that reformat data
sim.events -k 1.5 -1 3 I (such as bins) add a diet comment to describe the
sample -t .Ol I new format.
bins “We’ve done this in order to survive. This disci-
pline aids in making both input and output data files
The first program is a simulation with the two pa- self-documenting. Many other people have built
rameters k and 1 (set in this example to 1.5 and 3)’ similar mechanisms; wherever possible, I have cop-
The vertical bar at the end of the first line pipes the ied their enhancements rather than having to figure
output into the second program. That program sam- out new ones myself.”
ples the data at the designated frequency, and in Tom Duff of Bell Labs uses a similar strategy in a
turn pipes its output to the third program, which system for processing pictures. He has developed a
chops the input into bins (suitable for graphical large suite of UNIX programs that perform transfor-
display as a histogram). mations on pictures. A picture file consists of text
lines listing the commands that made the picture
(terminated by a blank line) followed by the picture
itself (represented in binary). The prelude provides a
UNIX is a registered trademark of AT&T Bell Laboratories. provenance of the picture. Before Duff started this
’ Note that the two parameters are set by a simple name-value mechanism practice he would sometimes find himself with a
-/. B. wonderful picture and no idea of what transforma-
]une 1987 Volume 30 Number 6 Communications of the ACM 481
Programming Pearls
tions produced it; now he can reconstruct any camps 4772
picture from its provenance. swaps 4676
Duff implements the provenances in a single cpu 0.1083
library routine that all programs call as they begin
execution. The routine copies the old command Given the sort routines and other procedures to do
lines to the output and then writes the command the real work, the control program is easy to build.
line of the current program. Its main loop is sketched in Program 1: the code
reads each input line, copies it to the output, and
A Sorting Lab processes the name-value pair. The s imul.ate ( )
To make the above ideas more concrete, we’ll apply routine performs the experiment and writes the out-
them to a problem described in the July 1985 col- put variables in name-value format; it is called at
umn: building “scaffolding” to experiment with sort each blank line and also at the end of the file.
routines. That column contains code for several sort
routines and a few small programs to exercise them.
This section will sketch a better interface to the rou- loop
tines. The input and output are both expressed in read input line into string S
name-value pairs, and the output contains a com- if end of file then break
plete description of the input (its provenance). if S = *I II then simulated)
Experiments on sorting algorithms involve adjust- write S on output
ing various parameters, executing the specified rou- Fl = first field in S
tine, then reporting key attributes of the computa- F2 = second field in S
tion. The precise operations to be performed can be if Fl= "n" then
specified by a sequence of name-value pairs. Thus N = F2
the input file to the sorting lab might look like this: else if Fl = "alg" then
if F2 = "insert" then
n 100000 alg = 1
input identical else if F2 = "heap" then
alg quick alg = 2
cutoff 10 else if F2 = "quick" then
partition random alg = 3
seed 379 else error("bad alg")
else if Fl = "input" then
. . .
In this example the problem size, n, is set to 100,000. simulatet 1
The input array is initialized with identical
elements (other options might include random, PROGRAM 1. A Loop to Process Name-Value Pairs
sorted,or reversed elements). The sortingalgo-
rithm in this experiment is quit k for quicksort;
insert (for insertionsort) and heap (forheapsort)
might also be available. The cutoff and parti -
t ion names specify further parameters in quit k - This structure is useful for many simulation pro-
sort. grams. The program is easy to build and easy to use.
The input to the simulation program is a sequence Its output can be read by a human and can also be
of experiments in the above format, separated by fed to later programs for statistical analysis. Because
blank. lines. Its output is in exactly the same format all the input variables (which together provide a
of name-value pairs, separated by blank lines. The provenance of the experiment) appear with the out-
first part of an output record contains the original put variables, any particular experiment can be re-
input description, which gives the provenance of peated and studied in detail. The variable format
each experiment. The input is followed by three ad- allows additional input and output parameters to
ditional attributes: camps records the number of be added to future simulations without having to
comparisons made, swaps counts swaps, and cpu restructure previous data. Problem 8 shows how the
records the run time of the procedure. Thus an out- basic structure can be gracefully extended to per-
put record might end with the fields: forming sets of experiments.
482 Communications of the ACM June 1987 Volume 30 Number 6
no reviews yet
Please Login to review.