Previous Contents Next

CHAPTER 7: MACREF

The MACREF utility serves two functions. It is a cross-reference utility of some value in assembly language programming. And—as it turned out—it serves as a demonstration of five important programming techniques: binary search, binary trees, linked lists, dynamic storage allocation, and the use of a finite-state machine.

THE USES OF A CROSS-REFERENCE

A cross-reference is a table that lists the identifiers used in a program in alphabetic order. For each identifier, the cross-reference shows, first, the program line at which the identifier is declared, and second, the numbers of all the program lines in which the identifier is used.

The assemblers and compilers written for big computers routinely produce a cross-reference as part of a listing. I don't know of any personal-computer compilers that do. The symbol file written by MAC and RMAC is not a cross-reference—it shows the value of each symbol but not its point of definition nor its points of use.

A cross-reference is a waste of file space in some circumstances, and an invaluable aid to the programmer in others. It's of little use in a small program, because a small program has few identifiers and they can be found easily. A cross-reference is rarely used by a program's author, at least while the program is in development. The author knows all the identifiers and their uses.

The cross-reference comes into its own when you must modify a large program that you didn't write (or one of your own that you haven't seen in a while). When reading such a program, questions always arise that can only be answered through the cross-reference. What is this identifier meant for? Find its definition point and read the comments; if there are none, find all references to it and see how it is used. Where is this routine called, and under what conditions? The cross-reference points to the calling instructions. Could this variable ever become negative (or zero, or whatever)? The cross-reference leads to all the points where the variable is assigned.

SPECIFYING MACREF

The Roots of MACREF

While planning this book, I decided that the program listings would have to be cross-referenced as an aid to the reader. At first I thought to use an existing cross-referencer. They are fairly common; a cross-reference utility is a common project for budding systems programmers (after a disassembler and just before they begin designing their own editors).

I looked first at a program named XREF distributed by the CP/M User Group. It had several drawbacks, the worst being that it wrote its output to the printer, not to a disk file. This XREF took only the listing file as input. It parsed each statement to isolate the opcode, then took every other identifier as a symbol. After reading it, I made a note on the listing, "better, faster, if it read the .SYM file to prime a table, then scanned for only those names—wouldn't have to recognize opcodes."

Digital Research distributes a utility, also called XREF, with the RMAC assembler. According to its documentation, it does just that: it reads the symbol file created by the assembler, then builds a cross-reference for just those names. That simplifies the logic of the utility. The assembler has already recorded a list of the valid symbols. If the cross-referencer didn't use that information, it would have to duplicate a great deal of the assembler's work. Especially, it would have to distinguish between symbols defined by the user and words reserved by the assembler (opcodes, register names, operators like "NOT"). Even if the program could recognize the usual reserved words, how could it tell the difference between a normal symbol and a macro name used as an opcode? It would have to recognize every MACRO statement and add the macro's name to its list of reserved words. But to do that reliably, the program would have to implement the MACLIB directive and process the included library. Using the .SYM file is much, much simpler.

Cross-Reference the Source?

The Digital Research XREF would have sufficed for my purposes—and this chapter would not have existed—except for a practical consideration of book production. For several reasons, the best way to display the programs herein was to print their source files rather than their assembler output listings. But both XREFs, and all the other cross-referencers I'd heard of, added information to the listing file, not to the source file.

Does it make sense to cross-reference a source program? As a matter of fact, it makes a great deal of sense. To begin with, when the cross-reference is appended to the source file, there is no longer any reason to have the assembler make a print file at all. When the print file is omitted, the assembly runs much faster. And, when you use SID or another symbolic debugging tool, a source listing with a cross-reference is the only hard-copy you need for debugging.

There's another reason to cross-reference the source file: the print file is not a true copy of the source program. Macro invocation lines carry important information about the program's logic, but they are not printed. The listing includes lines that are not present in the source (macro expansions and the repeated statements of REPT, IRP, etc.). For the programs in this book, the print file also contains the text of all the included code. That adds needless bulk; the included routines exist, but their text contributes little to one's understanding of a program's logic.

One step in making a cross-reference is to attach a sequence number to each line of the file. This is necessary because the cross-reference must refer to lines by number. But the numbered lines of a listing file will not match up to the numbered lines of the source file. That makes it difficult to work from a cross-referenced listing on paper to the source file on the screen of a terminal.

One possible inconvenience of cross-referencing a source file is the waste of disk space. It shouldn't be necessary to keep two copies of the file, one with a cross-reference and one without. That would be solved if the cross-referenced file were still acceptable as input to the assembler. If the cross-referenced file can be reassembled, it can be the only copy of the file. Of course that opens the possibility that the cross-referencer's input file might contain an old cross-reference left over from a prior run; that would have to be stripped out.

The Data Requirements

There are five file-handling requirements on our proposed cross-referencer. Prior to reading the input file, it will read a MAC symbol file. Then it will read its input, a MAC-compatible assembly source file. If the input file already contains an old cross-reference, it must be stripped out. The program must write an output that is also a MAC-compatible assembly source file. The output file will be augmented with line numbers and a cross-reference of the identifiers recorded by the assembler.

The Cross-Reference Format

What should the cross-reference look like? The nicest cross-reference style I could recall was that written by an assembler for the IBM 1130 (the 1130, an expensive minicomputer of the late 1960s, had abilities comparable to those of a typical 8086-based desktop system of 1982). For every use of a variable, that assembler showed whether the use was a reference or a modification. That information was often useful; you could tell at a glance which statements might have changed a value.

Could MACREF do something similar without a lot of trouble? At first, I thought not. The semantics of the 8080 assembly language don't make it easy to distinguish references and modifications. Consider

STA CPMFCB+32.

Is CPMFCB being modified, or referenced? What about

LXI H,INDEX INR M

The variable INDEX is being modified, but how could a cross-referencer tell?

Then I thought of a clever dodge, a way that MACREF could pass on all the information it had, while leaving it up to the user to decide its meaning. It would list the operation code associated with every use. That would paint a picture of the kind of uses the identifier was put to, like this:

RAT LDA-55 LXI-87 STA-103 CNZ-135 RATE CALL-41 CALL-119 CALL-128

Several kinds of information can be read out of that. RAT appears to be a byte which is referenced at line 55 (might it be referenced before it is set?). At line 87 and after it could be either referenced or set under the guise of "M". It's definitely being altered at line 103. But what about line 135? That looks like a bug! Almost certainly line 135 should be "CNZ RATE," not "CNZ RAT."

The description of a symbol should also include its hexadecimal value as given in the symbol file, and the line number at which it is defined. I decided, as frosting, that when consecutive references had identical opcodes, the redundant ones should be elided:

0288 0BC9 RATE CALL-41 -119 -128 CNZ-135

Opcodes and the Census

MACREF would have to collect operation codes. This is a different matter than recognizing the assembler's vocabulary of reserved words. To the assembler, any identifier that is not a macro or reserved word is a symbol—and it records the symbols in the .SYM file. MACREF could reverse this logic. Any word it found that was not a symbol must be a reserved word or macro! The first such word on a line could be taken as an opcode. That word must be added to a list of opcodes for printing in the cross-reference.

This logic, coupled with the fact that MACREF would be processing the source file, not the print file, meant that it would automatically tabulate macro invocations as opcodes. For instance, it might display

CPMFCB LDA-150, SERVICE-192

showing that CPMFCB is an operand of the SERVICE macro. This side-effect is another good reason for cross-referencing the source file. Surely the preceding line is more informative than the line

CPMFCB LDA-150, LXI-198

that would result from processing the comparable print file.

Since MACREF would be tabulating opcodes, it could count them with very little extra work. Therefore, it should not be difficult for MACREF to add a census of opcode usage at the end of the cross-reference. This census would be only marginally useful, but it would be interesting and should cost little to produce.

Operations

MACREF is a utility; it takes an input file and delivers an output file containing the same data, augmented. The syntax of the MACREF command is

MACREF input-ref output-ref

MACREF uses the utility convention for its input and output files. The parts of the output-ref that are omitted are supplied from the input-ref; if output-ref is omitted the output will replace the input file.

MACREF requires one other file. It reads a file inputname.SYM, produced by the MAC or RMAC assembler. This symbol file must be present on the same disk as the input file. For example, the command

macref b:program.asm a:.xrf

will read B:PROGRAM.ASM and B:PROGRAM.SYM from the B-drive, and will write an output file PROGRAM.XRF on the A-drive.

The input file is expected to be an assembly-language source file suitable for input to the MAC assembler. The output file is the same, with the addition of a sequence number at the head of each line and a cross-reference report at the end.

The sequence numbers are 4-digit decimal numbers. If MACREF is given a file of more than 9,999 lines the numbers will "wrap around" and start over. The sequence number is followed by a tab character, so that a display of the file will line up relative to tab stops just as it did before, except that the text will be displaced to the right by eight columns.

The cross-reference lines begin with asterisks (*), which makes them comment lines to the assembler. The assembler ignores sequence numbers and comments, so the MACREF output file can be submitted to the assembler just as its input file could be.

If the input file has sequence numbers, MACREF will strip them off before substituting its own. If it contains an old cross-reference, MACREF will drop it (and anything following it) from the output.

DESIGNING THE DATA

The initial sketch for MACREF appears in Figure 7-1. The logic splits comfortably into four phases. In the first phase, the inputname.SYM file is loaded into a symbol table of some kind. Next, the input file must be processed. Each line is read, stripped of an existing sequence number, given a new number, and written to the output file. Once this has been done, the line can be parsed into words and the words identified. If a word appears in the symbol table, it is either used or defined on the line, and this fact must be noted somehow.

In the third phase, the symbol-usage information must be converted into a cross-reference display. The symbol table will be read out in alphabetical order and display lines prepared from the information noted earlier for each symbol.

Finally, the opcodes that were noted must be read out (in alphabetical order) and displayed with their use counts. Then the output file can be closed, using the same included code that we developed in earlier chapters.

Figure 7-1. The initial plan for MACREF.

program MACREF(inputref,outputref) prepare input mechanism for the .SYM file load the symbol table from the .SYM file prepare input mechanism for "inputref" prepare output mechanism for each input line strip any sequence number copy line to the output file with a new sequence number note the definition or use of each symbol in the line end for. add the cross-reference display to the output file add a census of opcode usage to the output file finish the output file end MACREF.

Data Structures and Storage

This program is unusual in that it seemed to call for just about every standard technique of data storage. In the end it involved a linear array (of structures), linked lists (of structures), and a binary tree. It lacks only a FIFO queue and a ring buffer to be complete! Figure 7-2 summarizes the data plan I arrived at.

One item could be disposed of easily. A buffer would be needed to hold each input line. One kilobyte seemed like a generous amount of space to hold an assembly source line. Files in general might have longer lines, but program files generally have short records (mine rarely exceed 80 bytes).

Figure 7-2. Plan for the data structures used in MACREF.

LineBuffer is a space of 1024 bytes to hold a line the SymbolTable is an array[1..?] of a SymRec is a record of Label is a string of 1..16 characters plus 00h Line is a line number in binary Val is the assembler value of the label in binary Use is a pointer to a UseRec A UseRec is a record of Next is a pointer to another UseRec Line is the line number where the symbol appeared Op is a pointer to an OpRec An OpRec is a record of Name is a string of 1..8 characters plus 00h Count is a count of the uses of the opcode Left is a pointer to an OpRec Right is a pointer to an OpRec Pool is a space of 8192 bytes for dynamic allocation

The Symbol Table

The first question was the organization of the symbol table. It would have an entry for each symbol from the .SYM file, and record three items about each:

In addition, an unpredictable number of "uses" would have to be noted for each symbol. During the second phase of processing the table would be searched for a given identifier many, many times.

It seemed sensible to create a record—a "SymRec"—for each symbol as it was read, and to make a linear array of SymRecs. The .SYM file lists symbols in alphabetic order. Therefore the array would be ordered alphabetically by identifier without any special effort on the program's part (or so I thought—it turned out not to be so simple). An ordered array can be searched very quickly with a binary search.

Recording Symbol Uses

There might be a large number of uses recorded for some symbols and few or none for others. It would be impossible to predict in advance either the number of uses for one symbol, or the total number of uses that would have to be noted. That implied that a use of a symbol was an event that should be recorded in a UseRec of its own. The uses of a symbol could be represented by a linked list of UseRecs; the list could be anchored in the SymRec to which they applied.

Recording Operation Codes

The recording of operation codes was a different sort of problem. The opcode related to a particular use could be recorded as a string in that use's UseRec. But a program might contain thousands of uses, while its set of unique opcodes might be small (MACREF uses only 61 unique opcodes). It would be much more economical of storage to record each opcode's name-string just once in an OpRec, then to put a pointer to the OpRec in each UseRec associated with that opcode.

The size of the opcode set can't be predicted in advance, and the opcodes will not occur in alphabetic order in the input. Yet an opcode would have to be looked up (or inserted) in the set of opcodes once for every line processed. The opcode list wouldn't be searched as often as the symbol table, but often enough that access to it should be fairly quick. Furthermore, the census of opcode use displayed at the end ought to be in alphabetical order.

A hash table allows quick insertion and lookup but is inherently unordered; an ordered array allows quick lookup but is slow for insertion; a linked list allows easy insertion but only a sequential search. There is one data organization that has all three characteristics: quick lookup, quick insertion of new entries, and easy preservation of alphabetic order. That is the binary tree, a pleasant structure of which I am very fond. I decided to store OpRecs in a binary tree. Figure 7-3 shows the relationship between the three types of records.

Dynamic Allocation

There remained some unknowns. The program can't know in advance how many SymRecs, UseRecs, and OpRecs it will need. Worse, it can't know the relative proportions of the three kinds of records. What it can be sure of is that SymRecs are created only during the first phase of processing, and UseRecs and OpRecs only during the second phase. Furthermore, no file output would be done while the symbol table was being loaded. The large output buffer needed by the library routines wouldn't be needed until the size of the symbol table was known.

The situation called for two kinds of dynamic allocation. During the first phase, an unknown number of SymRecs would be allocated, from low storage toward high storage. Then, when the actual size of the symbol table was known, a further area would be set aside, beginning at the end of the symbol table and extending upward, to serve as a storage pool. UseRecs and OpRecs could be allocated as needed from this pool of space. The output buffer would follow the pool, filling the rest of storage as usual.

I set an arbitrary limit of 8,192 bytes (341 entries) on the symbol table, and made the pool 8,192 bytes (100 OpRecs and 1,100 UseRecs) in size. That ensured that not more than 16K bytes would be used for records. The program code came out to about 3400 bytes. MACREF, then, requires a minimum of 22K bytes in the Transient Program Area. It won't run in a 24K system, but it should run in a 32K machine.

DESIGN: LOADING SYMBOLS

Figure 7-5 shows my initial plan for the first processing phase (the subroutines appear later). It evolved from inspection of a typical .SYM file (with VDUMP!). A .SYM file is divided into lines and the lines contain a varying number of symbols. It occurred to me to ignore the line-ends. If blanks and control characters are lumped together as filler, then a symbol file is a sequence of

fill, XXXX, fill, SYMBOL, fill, XXXX, ...CpmEof.

This observation simplified the logic of LoadSymbols. It expects that a four-hex-digit value will precede each symbol string, that some fill will separate the value and the symbol, and that end of file will occur after a symbol. It has no expectations beyond that. There can be any number of symbols on a line.

Figure 7-5. The initial plan for reading the .SYM file and loading the symbol table.

procedure LoadSymbols: Sn is a pointer to a SymRec D is a byte NSymbols := 0 D := any byte other than CpmEof loop: ReadFill(D) while ( D<>CpmEof ) NSymbols := NSymbols + 1 GetMain(length(SymRec),Sn) ReadHex(D,Sn) ReadFill(D) ReadLabel(D,Sn) end loop. end LoadSymbols.

MAC Problems Intervene

Some problems arose when I first tested MACREF. No uses were recorded for some symbols, and other symbols turned up as opcodes! After a good deal of investigation, I found that in two cases MAC does not write the .SYM file in alphabetic order. As a result, some symbols went into MACREF's table out of sequence. Then the binary search couldn't find them.

One of the cases is this: if two symbol-strings are identical for the length of the shorter one, and if the longer one appears first in the source program, then MAC will list the longer one first in the .SYM file. That's the reverse of what one would expect. RMAC handles this case correctly.

The second problem affects both MAC and RMAC. Both assemblers treat the question mark "?" as an alphabetic character. But they treat it as if it collated higher than "Z," not lower than "A" as its ASCII value would have it do. So any symbol that contains a "?" character may be out of sequence in the .SYM file. In particular, symbols that begin with "?" will follow all other symbols.

The .SYM file was almost in alphabetic order, but not quite. What to do? The symbol table could be reorganized as a binary tree, but that would entail a lot of work (the program was completely coded at this point, remember). I decided to add one more step to LoadSymbols. After building a SymRec, it would check to see if the new symbol was out of order. If so, it would swap the symbol with its neighbor, and so bring the table into order. In the first case, a symbol would usually only be swapped once, the shorter and longer symbols changing places. In the second case, a symbol with an initial "?" might have to be swapped all the way to the head of the table, but these symbols are fairly uncommon. If the .SYM file were completely unordered, this would be a very inefficient way of sorting it. Since the file is almost ordered, the overhead should be tolerable.

The final logic of LoadSymbols appears in Figure 7-6, and the pseudo-code of its subroutines is shown in Figure 7-7.

Figure 7-6. The plan for loading symbols, as revised to account for the disorder of the .SYM file.

procedure LoadSymbols: Sn is a pointer to the current SymRec Pn is a pointer to the previous SymRec D is a byte NSymbols := 0 D := any byte other than CpmEof GetMain(length(SymRec),Pn) {Pn names a null symbol} loop: ReadFill(D) while ( D<>CpmEof ) NSymbols := NSymbols + 1 GetMain(length(SymRec),Sn) ReadHex(D,Sn) ReadFill(D) ReadLabel(D,Sn) ReOrder(Pn,Sn) {ensure symbols are in order} Pn := Sn {save address of prior SymRec } end loop. end LoadSymbols.

Figure 7-7. The pseudo-code logic for the subroutines of LoadSymbols.

procedure ReadFill(var D: a byte) {advance in .SYM file to some data} while ( D<>CpmEof ) and ( D<=Blank ) D := GetChar end while. end ReadFill. procedure ReadHex(var D: a byte, Sn: a SymRec ) {get the hex value of a symbol} Q is a word Q := 0 if ( D<>CpmEof ) then do four times Q := (Q*16)+hex value represented by D D := GetChar end do. endif. Sn.Val := Q end ReadHex. ReadLabel(var D: a byte, Sn: a SymRec ) {get the label string of a symbol} while ( D>AsciiBlank ) { n.b. CpmEof is < Blank } append D to Sn.Label D := GetChar end while. append 00h to Sn.Label end ReadLabel ReOrder(Pn, Sn: pointers to SymRecs) {move new symbol to proper place} while ( Pn.Label > Sn.Label ) swap contents of Pn, Sn SymRecs Sn := Pn Pn := Pn - length(SymRec) end while. end ReOrder

DESIGN: PARSING INPUT

The overall plan for the second phase came easily (see Figure 7-8). A line of the input file would be prepared and copied to the output. Then it could be parsed, or analyzed into its syntactic units. The only units that concern MACREF are words; that is, strings that begin with an alphabetic and contain only alphabetics and numerics. Everything else could be ignored. The words would have to be divided into program symbols and opcodes.

A line might contain no words, or only one word, or many words. I decided to have a routine GetWord that would return the next word and False, or return a null string and True if there were no more words on the line. Its input would be an index to the line buffer, which it would advance as it worked. There would also be a function Symbol that would search for a word in the symbol table, returning True and the SymRec address if it was found, or returning False and a nil pointer if it did not appear.

Under these conditions the syntax of 8080 assembly language is simple. If the first word on the line is a known symbol, the symbol is being defined on the line. The program can note its definition point and proceed to the next word.

The second word (or the first one, if it wasn't a known symbol) must be the opcode. The program should find or insert the opcode in the list of opcodes, and remember the address of the resulting OpRec. All following words that are known symbols are being used; the program records these uses as UseRecs, using the current line number and the address of the OpRec. Following words that aren't known symbols must be reserved words—register names and operators.

Figure 7-8. The plan for reading and interpreting an assembly source file.

procedure PeruseFile Over is a boolean String is an area to hold a temporary string P is an index over Linebuffer Sn is a pointer to a SymRec Ln is a line number On is a pointer to an OpRec Ln := initial sequence number while (not CopyLine(Ln)) P := 1 Over := GetWord(String,P) { isolate first word } if ( Symbol(String,Sn) ) then { ..it appeared in .SYM } NoteDef(Sn,Ln) { ..this line defines it } Over := GetWord(String,P) { ..move on to opcode } endif. if (not Over) then { there is another word } NoteOpCode(String,On) { ..which is the opcode } Over := GetWord(String,P) { ..move on to the next } endif. while (not Over) if ( Symbol(String,Sn) ) then NoteUse(Ln,On,Sn) Over := GetWord(String,P) end while. Ln := Ln+sequence increment end while. end PeruseFile.

Reading the Line

The process of copying a line from input to output turned into a substantial bit of programming (Figure 7-9).

The first problem was to get rid of a line number that might have been installed by a prior run of MACREF. If this weren't done, successive uses of MACREF would build up successive layers of line numbers. MAC will accept that—it ignores any combination of digits and white space at the head of a line—but the user would not. The ReadLine routine strips leading digits and one blank or tab following the last digit.

In other computing environments than CP/M, it is usual to put sequence numbers on all program files. Often those numbers are important in schemes for automated program maintenance. The programming tools used in IBM operating systems, for example, often rely on the eight-digit sequence numbers that are standard in those systems. There, a program like MACREF would not dare to strip off a line number and substitute its own. On the other hand it could expect that its input would always be numbered, and it could read and use the line numbers it found. ReadLine's logic could be changed to return the value of the line number it found, rather than discarding it.

Figure 7-9. The logic of the subroutines that rad and reproduce the input file.

procedure CopyLine(Ln: a line number) : returns boolean if ( ReadLine ) then { end of file } return true if (line is not start of an old X-ref) then WriteLine(Ln) return false else return true { treat old X-ref as end of file } endif. end CopyLine. procedure ReadLine : returns boolean C is a byte P is an index over Linebuffer C := GetChar if ( C = CpmEof ) then return true if ( C is a digit ) then { strip off sequence number } repeat C := GetChar until ( C is not a digit ) if ( WhiteSpace(C) ) then C := GetChar endif. P := 1 loop { read line through LF } LineBuffer[P] := C while ( C <> AsciiLF ) C := GetChar P := P+1 end loop. return false end ReadLine. procedure WriteLine(Ln: a line number) P is an index over Linebuffer P := 1 Put9999(Ln); PutTab loop PutChar(Linebuffer[P]) while (Linebuffer[P]<>AsciiLF) P := P+1 end loop. end WriteLine.

Writing the Line

The next problem was to add MACREF's own line number when writing the line to the output file. It was during the writing of MACREF that I put together the decimal output routines in the PutSubs include file (Appendix C). Each takes a binary integer in the HL register pair, and sends ASCII characters to the output file by calling the label "PutChar." I would have preferred not to code that label into generalized solution. For one thing, it meant that decimal output to the console required another, nearly identical, set of routines. But there seemed to be no really satisfactory alternative.

The Put9999 routine, then, writes a four-digit positive integer to the output file; it does not suppress leading zeroes (PutZZZZ9 does that). With it in hand, the rest of procedure WriteLine was easy.

Isolating Words

Assembly language syntax is simple enough once the words and other lexical tokens have been isolated. But isolating them is another matter. Think about the problem for a few minutes. Work out a pseudo-code design for the GetWord routine that will pick out each identifier—each symbol and reserved word—from a statement. It must find all identifiers, but only identifiers.

The problem is complicated by the rules for numeric constants; these may end with an alphabetic character. Quoted characters must be skipped. The dollar sign within a word or a number is ignored. The semicolon that begins a comment must be recognized as the end of the statement, but only outside quotes. A naive approach to the design will probably end up with a quite complicated scheme of nested "if" structures controlled by boolean flags. The program that results from such a design is likely to be slower than it could be, and hard to debug.

The Finite-State Automaton

There is a construct that handles this sort of problem in a reliable, and very efficient, way. It is the finite-state automaton, or FSA. The design of an FSA begins with the division of the possible input characters into a small number of useful classes. A classification for the MAC assembly syntax is shown in Table 7-1.

ClassLetters in that Class
0 ignoreall not otherwise classified
1 alphaonly alphabetics used only in labels
2 alphanum alphabetics used also in numeric constants (A-F, H, O, Q)
3 digit decimal digits 0-9
4 dollarthe dollar sign $
5 quote the single-quote or apostrophe
6 endstmtthe semicolon
7 endlinethe linefeed

Place the tip of a pencil on an assembly statement and slowly draw a line through its characters. At any character, the pencil will be pointing to one of five kinds of thing: a word, a number, a quoted string, the end of line, or "other." These are the states that we are interested in.

The next step is to draw a state diagram, like the one in Figure 7-10. As the pencil point encounters a character of a certain class, the state will change. The state diagram shows all the possible transitions from one state to another. The classes of input that can cause a given transition are written beside the arrow that represents the transition. For example, if the pencil point is moving through "other" characters, an encounter with any "other" (class 0) character or with the dollar sign (class 4) character leaves it in the "other" state. However, if it meets an alphabetic (class 1 or 2) character, it enters the "word" state. If it meets a digit (class 3), it enters the "number" state. If it meets a quote (class 5) it enters the "quote" state. Meeting a linefeed, and meeting a semicolon in any state except "quote", sends any state to the end-of-line state.

The state diagram can be translated into a state matrix, as shown in Table 7-2. The matrix contains exactly the same information as the state diagram. The states of the diagram become the rows of the matrix. The columns of the matrix represent the classes of input. Each cell of the matrix gives the name of the state that follows an input of that class in that state. Compare Table 7-2 with Figure 7-10 until you feel at home with this idea.

Class of Input Character
State01234567
0 OtherOtherWordWordNumberOtherQuoteEndEnd
0 OtherOtherWordWordWordWordQuoteEndEnd
0 OtherOtherWordNumberNumberNumberQuoteEndEnd
0 OtherQuoteQuoteQuoteQuoteQuoteOtherQuoteEnd
Table 7-2. A state matrix, an exact translation of Figure 7-10 into another form.

If we equipped a program with a state matrix like Table 7-2, it could walk along an assembly source line and name the state of every character. But we want more; we want to extract the words from the line, one at a time. The program has to take action to pull out the characters of an identifier. Adding an action entry to every cell of the state matrix produces a finite-state automaton.

Table 7-3 shows the state matrix augmented with action numbers. Take the upper left cell: it says, "When the current state is `other' (row 0), and the current byte is in class 0, then do action 0 and enter state 0." In other words, remain in the "other" state and go on to the next character. Work through the other cells of Table 7-3 in the same way, and think about the implications. Note how smoothly the table deals with dollar signs embedded in words and numbers, and how all possible state transitions (not just the ones that MAC would accept) are dealt with.

Class of Input Character
State01234567
0 Othera0/s0a1/s1a1/s1a0/s2a0/s0a0/s3a3/s0a3/s0
1 Worda2/s0a1/s1a1/s1a1/s1a0/s1a2/s0a2/s0a2/s0
2 Numbera0/s0a1/s1a0/s0a0/s2a0/s2a0/s3a3/s0a3/s0
3 Quotea0/s3a0/s3a0/s3a0/s3a0/s3a0/s0a0/s3a3/s0
Action 0Advance the input to the next character. 
Action 1Store the current character as part of a word; advance the input to the next character. 
Action 2Stop, a complete word has been collected. 
Action 3Stop, there are no more words in the input. 
Table 7-3. The state matrix with action entries added, making a finite-state automaton.

Once the FSA matrix has been designed, it is possible to design extremely tight assembly code to implement it. That code can't easily be represented in pseudo-code, since it involves branching to addresses plucked from the array, but it is readable and reliable even so.

The Tree of Opcodes

Binary trees are explained in many programming texts [1] . The idea of a tree is a nice intuitive one that needs little explanation. A binary tree is one that branches two ways at every node. A more formal definition is this: a binary tree is either (1) the empty node, or (2) a node composed of value, left, right, where left and right are binary trees. This is one of those mind-wrenching recursive definitions of which computer-science authors are so fond. A tree is a node that contains other trees, which may be nodes that contain other trees, etc.

Figure 7-11 represents the logic of a routine to find a value in a binary tree, or to insert it if it isn't there. It receives the value to look for (a string), and a tree (which is assumed not to be empty). It returns the node that contains the value. If the value isn't in the node it was given, it calls itself to look deeper into the tree. If the tree doesn't go any deeper, it installs a new node to hold the value and returns that. MakeNode presumably creates a new node in which left and right are empty.

Figure 7-11. The usual logic of inserting a new value in a binary tree.

procedure FindAdd(S: a string, Nd: a tree node) : returns a tree node if ( S = Nd.Value ) then {found it} return (Nd) else if ( S < Nd.Value ) then if ( Nd.Left is empty ) then Nd.Left := MakeNode(S) return Nd.Left else return FindAdd(S,Nd.Left) else if ( Nd.Right is empty ) then Nd.Right := MakeNode(S) return Nd.Right else return FindAdd(S,Nd.Right) endif. endif. end FindAdd.

An ordered binary tree is one constructed so that the values in the left tree are always less than the value in the containing node, and those in the right tree are always greater.

In a real program, the left and right fields of a node will not be binary trees; they will point to binary trees. If the pointers are nil then they represent empty trees. Figure 7-12 shows the logic of a routine to find or insert a value in a such a tree. It receives a value and—read carefully, now—a pointer to a pointer to a node.

If the pointer-to-a-node is nil, the code creates a new node, puts its address in the pointer-to-a-node, and returns.

Otherwise, it works its way into the tree, level by level, until it finds itself pointing to a pointer to a node that contains a matching value, or until the situation in the last paragraph turns up.

It's much easier to code this than to describe it in words, and it codes up nicely in assembly language or in any other language that allows pointer variables.

Figure 7-12. Binary tree insertion, expressed with pointer notation, and using a loop instead of recursion.

procedure FindAdd(S: a string, var T: points to a pointer to a Node) do forever if ( T->pointer is nil ) then T->pointer := MakeNode(S) return else Q := T->pointer { Q->a node } if ( S = Q->Node.Value ) then return else if ( S < Q->Node.Value ) then T := address of Q->Node.Left else T := address of Q->Node.Right endif. endif. endif. end do. end FindAdd.

Searching for Symbols

A binary search over an ordered list is another standard technique, and as intuitively easy as binary trees (there is a deep theoretical relationship between the two, but never mind). Figure 7-13 shows the logic of a binary search over a table, as it is usually explained in textbooks. The table is indexed starting from zero, not from one. Each iteration of the loop eliminates half of the remaining entries from consideration. As a result, the expected number of iterations is the base-2 logarithm of the size of the table (it might, of course, be less).

Figure 7-13. The usual logic for a binary search of an ordered table.

procedure Lookup(S: a string, var I: an index to Table) : returns boolean locals H and L are indexes to Table, too global N is the number of entries in Table L := 0 H := N-1 do forever if ( H < L ) then return false { S not in Table } else I := floor( (H+L)/2 ) if ( S = Table[I] ) then { found it } return true else if ( S < Table[I] ) then H := I-1 else L := I+1 endif. endif. end do. end LookUp.

The logic of Figure 7-13 isn't easy to code in assembly language for the 8080; the machine doesn't have enough registers. I learned from Knuth[2] of a variation on the standard logic that requires one less local variable. That logic (Figure 7-14), although tricky and hard to follow, made a tighter program. The Symbol routine of MACREF is called for every word (including every opcode and register name) in its input; the extra complexity is justified by the need for a quick search.

Figure 7-14. A different plan for binary search. Only one extra variable is used, and the assembly code is smaller.

procedure LookUp(S: a string, var I: an index to Table) : returns boolean M is an index to Table N is the number of entries in Table M := floor(N/2) I := ceiling(N/2) do forever if ( S = Table[I] ) then { found S } return true else if ( M=0 ) then { S not in table } return false else if ( S < Table[I] ) then I := I - ceiling(M/2) M := floor(M/2) else I := I + ceiling(M/2) M := floor(M/2) endif. endif. endif. end do. end LookUp.

DESIGN: OUTPUT

The final two phases of MACREF generate reports. The first is the cross-reference; the second is the census of opcode use. Code to produce a formatted report is almost always tedious to write. You slog along, generating this and that bit of information. It's picky work—everything has to come out looking right, or the users will complain—but there aren't any clever algorithms to be worked out. (After writing that sentence I became determined to find a clever algorithm for displaying output. The "display machines" in the programs of Chapter 5 were the result.)

I worked both phases out in pseudo-code (Figure 7-15 and Figure 7-16) regardless. I structured the logic carefully so that each small routine would be easy to code. The top routine, AddXref, concerns itself solely with writing a heading and with calling WriteXline for each symbol. WriteXline works only on the fixed fields of the line, passing on the work of displaying the variable-length list of UseRecs to WriteUses. That routine is restricted to locating each UseRec in the chain, and to detecting the end of the chain.

At the bottom of the heap, the PutUse routine formats a single UseRec for display. It is here that duplicate opcodes are elided from the display. PutUse is also the point at which the output could reach the maximum length of a line.

Figure 7-15. The logic used to generate the cross-reference listing.

procedure AddXref PutString("* CROSS-REFERENCE",CR,LF) PutString("* dfn. val. symbol and uses",CR,LF) FirstSymbol(Sn) while (Sn is not nil) WriteXline(Sn) NextSymbol(Sn) end while. end AddXref. procedure WriteXline(Sn: pointer to a SymRec) Col is a byte in 1..MaxCol if ( Sn.Line is null and Sn.Use is null ) then return PutChar("*") ; PutBlank if (Sn.Line <> 0) then Put9999(Sn.Line) else PutString("----") PutBlank; PutXXXX(Sn.Val); PutBlank PutString(Sn.Label) Col := 13 + StringLength(Sn.Label) WriteUses(Col,Sn) PutCRLF end WriteXline. procedure WriteUses(Col: a byte, Sn: a pointer to a Symbol) Un is a pointer to a UseRec St is a string Un := Sn.Use St := the null string while ( Un <> nil ) PutUse(Col,St,Un) Un := Un.Next end while. end WriteUses PutUse(Col: a byte, St : a string, Un : a pointer to a UseRec) if ( St <> Un.Op->OpRec.Name ) then { new opcode } St := Un.Op->OpRec.Name if ( Col+5+Length(St) => MaxCol ) then StartNewLine(Col) PutBlank PutString(St) Col := Col + 1 + Length(St) else { same opcode as before } if ( Col+5 => MaxCol ) then StartNewLine(Col) PutBlank; Col := Col + 1 endif. PutChar("-") ; Col := Col + 1 PutZZZZ9(Col,Un.Line) end PutUse.

In the first version of the program, the output was cluttered with a number of symbols that were not defined or used in the source program text. These were labels internal to included subroutines. The assembler put their names in the .SYM file, but they were of no interest when only the source file, without included code, was considered. I decided that if a symbol was neither defined nor used in the input file, it should be dropped from the cross-reference. A single line added to the WriteXline pseudo-code accomplished the change.

Figure 7-16. The opcode census is produced by touring the binary tree of OpRecs in alphabetic order.

procedure AddOpUse PutString("*",CR,LF) PutString("* CENSUS OF OPCODE USAGE") StartNewOp(Col) {i.e. CR,LF,"*",tab,tab, and Col:=16 } if ( OpRoot is not null) then InOrder(Col,OpRoot) PutCRLF end AddOpUse. procedure InOrder(Col, On: a pointer to an OpRec) if ( On.Left is not nil ) then InOrder(Col,On.Left) PutOp(Col,On) if ( On.Right is not nil) then InOrder(Col,On.Right) end InOrder PutOp(Col, On: a pointer to an OpRec) if ( Col+16 <= MaxCol ) then StartNewOp(Col) PutString(On.Name) if ( Length(On.Name) < 8 ) then PutTab(Col) PutBlank PutZZZZ9(On.Count) Col := Col+16 if ( Col+16 <= MaxCol ) then PutTab(Col) end PutOp.

The last phase (Figure 7-16) contains one point of interest. That is the "inorder" traversal of the binary tree of opcodes. It takes just three lines of pseudo-code (and just twenty assembly statements) to read out the opcodes in alphabetic order!

Figure 7-16. The opcode census is produced by touring the binary tree of OpRecs in alphabetic order.

procedure AddOpUse PutString("*",CR,LF) PutString("* CENSUS OF OPCODE USAGE") StartNewOp(Col) {i.e. CR,LF,"*",tab,tab, and Col:=16 } if ( OpRoot is not null) then InOrder(Col,OpRoot) PutCRLF end AddOpUse. procedure InOrder(Col, On: a pointer to an OpRec) if ( On.Left is not nil ) then InOrder(Col,On.Left) PutOp(Col,On) if ( On.Right is not nil) then InOrder(Col,On.Right) end InOrder PutOp(Col, On: a pointer to an OpRec) if ( Col+16 <= MaxCol ) then StartNewOp(Col) PutString(On.Name) if ( Length(On.Name) < 8 ) then PutTab(Col) PutBlank PutZZZZ9(On.Count) Col := Col+16 if ( Col+16 <= MaxCol ) then PutTab(Col) end PutOp.

The assembly version of InOrder is recursive, just as shown in the pseudo-code. There are certain pathological input files that could make the tree of opcodes lopsided and very deep. It is remotely possible that the assembly version of InOrder could nest itself deeply enough to cause the program stack to walk down into the file buffer. That would require an input file that contained more than sixty unique opcodes, all of which appeared in alphabetic or reverse-alphabetic order in the input. If that were a serious possibility, InOrder could be recoded as a looping procedure rather than a recursive one.

THE MACREF PROGRAM

Tabular Documentation

The source of MACREF appears in Listing 7-1. It's a very large program and so it deserves special efforts at documentation. Table 7-4 is a display of the heirarchical structure of the code. With it you can tell which routines are called at each level, and in what order. Table 7-5 documents the information-hiding groups—the parts of the program that "know" about different structures. Tabular displays like these are easier to prepare than written documentation. Often they are of more use to another programmer than a wordy "Principles of Operation" booklet would be.

Table 7-4.A chart of the subroutine struction of MACREF.

LoadSymbols AddXref GetMain PutString ReadFill FirstSymbol GetChar WriteXline ReadHex PutChar GetChar PutBlank ReadLabel Put9999 GetChar PutString ReOrder PutXXXX CmpString StringLength CopyString WriteUses PutUse CmpString StringLength CopyString PutChar StartNewLine PutString PutString PutZZZ9 PutCRLF NextSymbol PeruseFile AddOpUse CopyLine PutString ReadLine StartNewOp GetChar PutString WhiteSpace InOrder WriteLine PutOp Put9999 StartNewOp PutChar PutString GetWord PutString UpperCase StringLength Symbol PutTab CmpString PutBlank NoteDef PutZZZZ9 NoteOpCode PutCRLF MakeNode GetMain CopyString CmpString NoteUse GetMain

Table 7-5. Documentation of how information is "hidden" among the modules of MACREF.

Routines aware of the symbol table array: LoadSymbols ReOrder Symbol FirstSymbol NextSymbol Routines aware of the layout of a SymRec: ReadHex ReadLabel NoteDef NoteUse WriteXline WriteUses Routines aware of the layout of a UseRec: NoteUse WriteUses PutUse Routines aware of the OpRec binary tree: NoteOp InOrder PutUse

Development Time

MACREF is the largest program in the book. As such, it represents a good test of the effectiveness of this style of program design and development. I kept careful track of the time spent on MACREF. In the end, I had recorded 18 hours and 35 minutes spent on design and pseudo-code work. This time was spread over several days during which I had to attend to other jobs as well. It does not include the time I spent musing on the program while driving, waiting in line at the supermarket, etc.

The coding period, during which I translated the pseudo-code to assembly language and also coded a tentative version of the decimal output routines, took 12 hours and 45 minutes over three days.

The test-and-debug interval occupied a rather intense period of 7 hours 15 minutes. Proportionately, that is entirely too much. Here are some of the mistakes that turned up: (1) LengthSymRec was equated to 23, rather than to 24 as Symbol expected, so only the first symbol could be found; (2) a string-compare loop was coded into Symbol, and it was wrong (I tossed it and used the library routine); (3) the interface to the library routine StringLength was changed at the last minute, but the calling code wasn't (a classic type of bug); (4) PutUse didn't save the opcode, and so didn't elide duplicate opcodes; (5) a typo in the FSA definition made it impossible to find names with digits in them; (6) wrong arguments to CopyString made all the opcode names in the census come out blank. And of course, it was in this period that the two kinds of mal-ordering of the .SYM file came separately to light.

Improving MACREF

MACREF is quite complete as it stands, but it does have two shortcomings. You might enjoy trying to correct them. The first was noted earlier: since the program strips existing sequence numbers, those numbers can't be used as input to an automated code-maintenance system. The CopyLine and ReadLine procedures could easily be revised to collect the binary value of an existing sequence number and return it to PeruseFile. But what (if anything) should be done about missing sequence numbers, or numbers that are out of order?

The second problem is the result of a design oversight. I forgot to make any provision for lines containing multiple statements. MAC will accept these (see lines 790-791 of Listing 7-1, for example). MACREF treats such lines as single statements. It counts only the first opcode. If a label were defined in a second statement, MACREF would treat it as a use, not as a definition. It would probably take a fairly radical revision of the program to correct this oversight. Or can you find a reasonably simple fix? (Hint: study the FSA.)

Justifying Assembly Language

In the end, the total time cost was 38 hours and 35 minutes, or about $1000 at typical contract rates. Is the cost justified? Well, for the purposes of this book, emphatically yes. MACREF contains a number of examples of programming techniques that it would be hard to motivate in any other application short of a full-blown compiler.

But suppose that MACREF had been supplied by a consultant on contract? At a guess, a Pascal or PL/I version of the program would have had about half the number of source lines. Design time would have been equal. Coding time, because of the close match of pseudo-code to the language, would have been about a third, and debugging perhaps half. About 35% less time would have been required, so the incremental cost of assembly language to the customer is about 350 dollars. In return, the program runs at easily twice the speed of one written in a compiled language (due mostly to the large disk buffer). Every time the customer uses it, he or she saves a fraction of a minute. How many times must the program be used before it returns 350 dollars worth of time?

But there's another factor! Because MACREF is in assembly language, the cost for modifications will suffer the same 35% increment. And if the customer wants to move up to a larger system, one with a different CPU, it will have to be rewritten, incurring several times the cost of porting a Pascal program to a new system.

A single implementation of MACREF can be built economically in assembly language, then, only if (1) the program will be used frequently (many times a day), (2) the specifications are firm, so that few modifications will be needed over the life of the program, and (3) the user has no intention of changing hardware for a long time.

However, when a program is built for publication its development cost is shared among all its purchasers. Then, every increment in its performance makes it a more attractive product; that (theoretically) attracts more buyers; that in turn spreads the costs further. Thus, when a program is written, not on contract to a single customer but for general sale, one can make a case for the claim that writing it in assembly language will actually lower its per-unit cost. And we programmers who enjoy our intimate relationship with the machine will certainly advance that claim.

[1] For an exhaustive discussion see Donald E. Knuth, The Art of Coputer Programming, Volume 1: Fundamental Algorithms, page 305ff. (second edition, Addison-Wesley, 1973)

[2] Donald E. Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching, page 411 (Addison-Wesley, 1973)