
Applying Structured Content Transformation Techniques to Software Source Code

Roy Amodeo

Senior Software Architect

Content processing systems, like all software systems, must meet reliability, performance, and functional requirements. To this end, they should be developed using proven software engineering methodologies.

Is this a one-way street? Or can software engineering environments benefit from the application of content processing techniques?

This paper will show that considerable benefits can be achieved quickly when legacy document conversion techniques are applied to software programs.

Specifically, we will examine how XML markup can be automatically inserted into a program, essentially by treating the program as if it were a document in some proprietary formatting language. We will then demonstrate that useful software engineering tools can be more easily developed if they leverage the marked-up structure of the program rather than the native syntax.

Sometimes the puppy can teach the old dog a few new tricks.

Introduction

In the realm of structured content processing, benefits accrue from modeling the information content of documents, rather than their presentation. These benefits include the ability to automate the publication of information in many different formats, with content tailored for different audiences.

Software programs are clearly a form of content. The information they contain is expressed in a programming language. Usually authored by humans, they are "published" by compilers. The audience is the computer that runs these programs.

However, programs are not written solely for use by machines. If they were, programming languages would have no need for comments or programming style guidelines. The application developers and maintainers themselves are an audience.

We can create information products from the source code of the program that can make these developers and maintainers more effective. Creating these information products relies on recognizing interesting artifacts in the source code, and transforming them in some way.

The process of creating these information products can be simplified if we make these artifacts explicit with XML markup. But how do we recognize these artifacts in the first place?

Programming language compilers and interpreters employ a parser to do this very thing. One approach is to modify the language parser itself to reproduce the original program, annotated with markup that explicates its structure. As we discuss later in Formal Representations of Programs, these language parsers are quite complex. For organizations without experience in compiler technology, adapting such parsers can be a daunting challenge.

This paper takes a fresh look at the problem of annotating program source code with XML markup by recasting it as a content processing problem. Rather than trying to model the program structure precisely (which a parser must do), we will look only for the cues that indicate the artifacts we are interested in, and we will treat the remainder of the program as text. As we will show, we can capture a great deal of structure with a minimum of effort.

We will show that this technique will identify enough structure to make it possible to craft transformation and analysis tools that are directly useful in the practice of software engineering. Again, these tools are developed in the same way as any other content processing application.

We will also discuss how we assessed the quality of the transformation.

Software Engineering Activities

In large software systems, four activities in particular resonate in the content processing world: refactoring, integration, repurposing and establishing a common look and feel.

Refactoring

Refactoring is an activity that restructures software to remove redundancy and generalize functionality. Often, this is done by decomposing large amounts of software into smaller, reusable modules. Reusability reduces the long-term costs of software maintenance.

In the world of content, modular documentation is an area of increasing interest, for exactly this reason. Large documentation sets containing redundant information are time-consuming and error-prone to maintain.

By refactoring large documentation sets into reusable modules, redundancy can be reduced considerably. When information only has to be changed in one place, author productivity increases dramatically, and the quality of the delivered information improves.

S1000D [S1000D 2007] and DITA [DITA 1998] are two examples of standards which specify how to construct large documentation sets as reusable modular topics. DITA, especially, derives much of its terminology and modeling from the software engineering field.

As part of the modularization process, refactoring also involves the elimination of redundant material. In software, when two functions do the same thing, one can be eliminated. All calls to the eliminated function must be found and changed to refer to the other.

In content processing terms, this is analogous to finding all of the places where content is referenced, and replacing it with a reference to different content.

Simple word search tools (such as the UNIX grep utility) can find all occurrences of a name, but if the name is not unique, or is contained within a keyword or another name, many false hits can be generated. Many languages allow the programmer to use the same name for different functions as long as they are defined in different scopes. The programmer often has to examine the context of the function call carefully to determine the identity of the function being called.

In general, language-specific analysis tools help to keep refactoring activities tractable.

Integration

The need for refactoring is often created when a new software component is integrated into an existing system. Integration is often motivated by the need to quickly add new functionality to a system. Yet, integration rarely goes smoothly.

Often, the integrated code needs to be restructured to fit the existing software architecture. Sometimes the integrated code needs to be generalized to handle new situations for which it was not originally designed. Many times, the integration will uncover errors in the new component that had not surfaced before. And sometimes, substantial changes need to be made to the existing system to accommodate the new component.

Analyzing software that was written by others can be difficult. Programmers need to locate where variables, functions, and types are defined in order to determine their purpose. Functions call other functions, which in turn interact with still other functions through side effects. The amount of time programmers spend searching the code for definitions or declarations quickly becomes significant.

Again, locating function, variable, or type definitions becomes an issue.

Even simple things like a different coding style can make unfamiliar programs more difficult to understand.

These issues occur in content processing when integrating new documentation components into an existing set. Integration is easier if the new components are transformed to a structure more like the documentation set into which they are being integrated.

Repurposing

Repurposing is often the motivation for implementing a content processing system. Repurposing occurs as a requirement in software systems as well.

Many large software systems have subsystems where the same structures occur in different forms. Data structures have to correspond exactly to the functions which build or process them. In such cases, it is desirable to generate all of these structures from one source to ensure consistency. This is a form of repurposing.

For example, two components may communicate across an interface using data structures. Functions on one side of the interface copy their arguments into a data structure. The data structure is then passed across the interface to the other component. There the information in the data structure is used as arguments to another function call on the other side of the interface. The return value from that function is copied back across the interface boundary in the same way.

This technique is often used when the two software components are written in two different programming languages. (The technique is also quite common when a component interprets a tiny language of its own, such as a regular expression library.)

The function which populates the data structure, the data structure itself, and the function which interprets the data structure are all tightly coupled. When a new function is provided in one component, the corresponding function in the other component and the intermediate data structure must also be created. Consistency is easier to achieve if the data structure and both function interfaces are generated from a single source.

Common Look and Feel

An important consideration in content processing is ensuring that related documentation looks related: similar layout, font usage, terminology, and navigation behavior. In software engineering, such concerns fall under coding style guidelines.

Style guidelines often specify code organization, naming conventions, indentation style, and commenting requirements. It is common for organizations to mandate that a consistent coding style be used on a project. Often, corporate-wide guidelines are established and all projects are expected to follow them. Such practices improve programmer productivity when they begin a new project. Programs are much easier to understand when they are written in a familiar style.

Code editors provide some help for keeping coding styles consistent for mainstream languages. They rarely provide any assistance for naming conventions. And they rarely enforce conventions. Most often, they just make it easier to program in an approved style.

Tools that automatically detect coding style violations (where possible) can be quite useful for guaranteeing adherence to corporate standards. Even better are tools that can correct common coding style errors.

The Content Processing Approach

The tasks outlined above cover a variety of activities in software development that have analogues in content processing. What are the best practices for these activities in the content processing world, and can we apply them to software?

The most important step in making projects such as these manageable in the content processing world is the removal of presentation artifacts, and the explication of structural information. The use of structured markup languages, particularly XML, is integral.

In the case of source code, the content is already highly structured. If it were not, the compiler would not understand it. The problem is that only the compiler and the human really understand the structure implied by the source code. So either we become experts in compiler technology and modify the compiler, or we continue to do things manually.

So the problem is not to infer structure, but simply to recognize it and make it explicit so that it is accessible to other processing tools.

The content processing world has been converting "unstructured data" into structured markup for a long time, with great success. This approach will work here as well.

The key to success in this endeavor is to keep things simple. Therefore we will design a custom intermediate structure for this task. In accordance with XP (Extreme Programming) principles, the structure will be quite simple at first, and will only be refined as much as needed for each information product.

While this approach will work with most programming languages, some are easier to process than others. We will choose a harder one.

This paper uses the OmniMark language for two reasons: one is that the language is extremely context-sensitive, and this makes it difficult to process in a naive fashion.

The second reason is that the author recently worked on a project where OmniMark programs had to be transformed as part of an upgrade effort. The approach taken was more crude than this, and the work was largely not reusable. This paper sprang from ideas on how such projects could be streamlined in the future.

On OmniMark Syntax

Compared to many programming languages, OmniMark syntax seems very irregular. Although it has a hierarchical structure, many of these structures do not have an explicit end marker. These structures end when the next structure at the same level begins. The following fragment shows a pattern-matching ("find") rule, an element rule, and a function definition:

define function
   output-comment-element value string text
as
   output "<comment>"
   output text
   output "</comment>%n"

find ";" any-text* => line "%n"
   output-comment-element line

element "comment"
   suppress

While the indenting style gives the human reader a good idea of where each part begins and ends, the indentation is not significant and cannot be relied upon. (The author has seen many programs where the structure is completely obscured because the programmer has left-aligned everything.)

It should be observed that statements (also called "actions") do not have separators in OmniMark. Many common languages employ some kind of statement separator or terminator. For example, languages descended from ALGOL (including C, C++, and Java) use the semi-colon to separate statements. Separators make it easy to determine where each action begins.

Finding the beginning of each action is made more difficult with OmniMark because of its English-like syntax. This tends to lead to longer statements than in many other languages, and these are often broken across multiple lines.

The other difficulty is that many of the keywords in the language can be used in different roles. In the preceding example, the keyword element introduces a rule. But the keyword element can also appear as part of a built-in expression (in this case, testing that para is a sub-element of subsec1):

element "para" when open element is "subsec1"
   output "<p>%c</p>%n"

The English-like nature of the language is further evident in that user-defined functions and built-in constructs often use names to indicate their arguments, rather than the more familiar parenthesized argument list:

   define string source function
      strip                           value string        part
                                 from value string source whole
   as
      repeat scan whole
         match part
         match any => t
            output t
      again

   process
      output "Here is a sentence.%n"
      output strip " " from "Here is a sentence with spaces removed.%n"
      output "The final sentence is normal.%n"
           

These are the primary difficulties, but not the only ones. It should be apparent that many tools will be much easier to write if the structural aspects are marked up with XML, rather than implied by native OmniMark syntax. If it is not, the following section revisits these examples with explicit markup.

Marking Up OmniMark

The examples shown above can be made structurally more explicit with XML markup.

The first example, showing two rules and a function definition:

<function line="1" name="output-comment-element" type="">
   <arguments>
      <argument name="text" type="string" property="value" id="a_2"/>
   </arguments>
   <body>
     <action>output "&lt;comment&gt;"</action>
     <action>output <ref id="a_2">text</ref></action>
     <action>output "&lt;/comment&gt;%n"</action>
   </body>
 </function>
 <rule type="find" line="8">
   <header>
     ";" any-text*
     <variable line="8" scope="local" name="line" type="pattern" id="v_3"/>
     "%n"
   </header>
   <body>
     <action><ref id="f_1">output-comment-element</ref> 
  <ref id="v_3">line</ref></action>
   </body>
 </rule>
 <rule type="element" line="11">
   <header>
     "comment"
   </header>
   <body>
     <action>suppress</action>
   </body>
 </rule>

The second example shows clear differentiation between the uses of the keyword "element":

 <rule type="element" line="1">
   <header>
     "para" when open element is "subsec1"
   </header>
   <body>
     <action>output "&lt;p&gt;%c&lt;/p&gt;%n"</action>
   </body>
 </rule>
				

Finally, the third example shows the statement structure:

<function line="1" name="strip" type="string source">
  <arguments>
    <argument name="part" type="string" property="value" id="a_2"/>
        
    <argument name="whole" type="string source" property="value"
           herald="from" id="a_3"/>
        
  </arguments>
  <body>
    <loop type="scan">
      <expression>
        <ref id="a_3">whole</ref>
      </expression>
      <alternative type="match">
        <header>
          <ref id="a_2">part</ref>
        </header>
      </alternative>
      <alternative type="match">
        <header>
             any
          <variable line="7" scope="local" name="t" type="pattern" id="v_4"/>
        </header>
        <body>
          <action>output <ref id="v_4">t</ref></action>
        </body>
      </alternative>
    </loop>
  </body>
</function>
<rule type="process" line="11">
  <body>
    <action>output "Here is a sentence.%n"</action>
    <action>output <ref id="f_1">strip</ref> " " from 
"Here is a sentence with spaces removed.%n"</action>
    <action>output "The final sentence is normal.%n"</action>
  </body>
</rule>
				

The Conversion Process

A multi-stage pipeline is used to generate the markup. The first three stages are implemented as coroutines [Wilmott 2003] to ensure that they are synchronized, and each brings its own kind of context information to the table:

  1. The first coroutine tracks file name and line number information in the program source code so that it is always available to coroutines further down the pipeline.
  2. The second coroutine uses pattern-matching to recognize the syntax of the original program. The pattern-matching rules add the appropriate structural markup. The original code is retained in source elements so that it can be analyzed if something goes wrong.
  3. The third coroutine tracks variable references. At this point, the structure of the program is available, so this is actually an XML-to-XML transformation using the XML generated in the preceding stage.
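
For readers unfamiliar with this style of processing, the shape of the pipeline can be illustrated with a small Python sketch. This is illustrative only: the real pipeline is written in OmniMark as coroutines, and the stage functions, the regular expression, and the element handling below are simplified stand-ins. Each stage is a generator that consumes the output of the previous one:

import re
from xml.sax.saxutils import escape

def track_lines(source_text):
    # Stage 1: attach line numbers to every line of the program source.
    for number, line in enumerate(source_text.splitlines(keepends=True), start=1):
        yield number, line

def mark_up(numbered_lines):
    # Stage 2: recognize a few rule-starting keywords and bracket each rule,
    # keeping every original character inside source elements.
    rule_start = re.compile(r'^(find|element|process)\b')  # tiny subset of cues
    open_rule = False
    for number, line in numbered_lines:
        cue = rule_start.match(line)
        if cue:
            if open_rule:
                yield "</rule>\n"
            yield '<rule type="%s" line="%d">\n' % (cue.group(1), number)
            open_rule = True
        yield "<source>%s</source>\n" % escape(line.rstrip("\n"))
    if open_rule:
        yield "</rule>\n"

def resolve_references(fragments):
    # Stage 3: in the real pipeline, an XML-to-XML pass that links variable
    # references to their definitions; here it simply passes fragments through.
    for fragment in fragments:
        yield fragment

program = 'element "comment"\n   suppress\n'
for fragment in resolve_references(mark_up(track_lines(program))):
    print(fragment, end="")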

The result of this is a highly redundant hybrid of XML markup and source code, retaining all of the text of the original program, including white space. This intermediate markup is further processed to produce a simpler, less redundant final markup. An example of the intermediate form is shown in Annex B: Example Intermediate Markup.

To produce intermediate markup for the examples shown so far, the program uses 28 pattern-matching rules to recognize language constructs. The rules average about 10 lines of code each, with half of that devoted to specifying the pattern, and half for emitting the markup. Some of these rules perform a basic completeness check, looking for keywords that should have been recognized as structural indicators, but were not.

Because there are no statement separators, the conversion process keeps a dictionary of keywords and user-defined functions that may appear at the beginning of a statement. In general, this works well enough.
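
A rough illustration of that idea follows, again as a Python sketch rather than the OmniMark actually used; the keyword set and the tokenizing regular expression are hypothetical simplifications of the real dictionary.

import re

# Keywords and user-defined function names that may begin an action
# (a tiny, hypothetical subset of the real dictionary).
ACTION_STARTERS = {"output", "set", "suppress", "output-comment-element"}

def split_actions(body_text):
    # Split the text of a rule body into actions, assuming each action
    # begins with a word listed in ACTION_STARTERS.
    actions, current = [], []
    for token in re.findall(r'"[^"]*"|\S+|\s+', body_text):
        if token.strip() and token in ACTION_STARTERS and current:
            actions.append("".join(current))
            current = []
        current.append(token)
    if current:
        actions.append("".join(current))
    return [action.strip() for action in actions if action.strip()]

body = 'output "<comment>"\n   output text\n   output "</comment>%n"\n'
for action in split_actions(body):
    print(action)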

This intermediate markup is automatically tested by using it to regenerate the original source code. The regenerated source code is compared to the original code byte by byte, to make sure that absolutely everything has been captured.

From here, generating the final markup is easy. The source elements are removed (because they are redundant), white space is normalized, and proper indentation is applied to indicate structure.

The final markup is tested by generating an OmniMark program again. The generated program is also compared to the original. In this case, white space differences are ignored, but the programs must otherwise be identical.
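
The comparisons themselves are simple. A minimal Python sketch of the two checks is shown below; the regenerated text is assumed to come from the regeneration steps just described.

import re

def identical_bytes(original, regenerated):
    # Intermediate-markup test: the regenerated source must match the
    # original byte for byte.
    return original == regenerated

def identical_ignoring_white_space(original, regenerated):
    # Final-markup test: white space differences are ignored, but the
    # programs must otherwise be identical.
    normalize = lambda text: re.sub(r"\s+", " ", text).strip()
    return normalize(original) == normalize(regenerated)

original    = 'element "comment"\n   suppress\n'
regenerated = 'element "comment"\n\tsuppress\n'
print(identical_bytes(original, regenerated))                  # False
print(identical_ignoring_white_space(original, regenerated))   # True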

The generation of the intermediate language is facilitated by a strong pattern matching language, XML processing capability, and coroutines. For this reason, OmniMark was ideally suited to implement this task, but any language which has these characteristics would be equally capable of handling this transformation.

Where's the Schema?

Some readers may feel somewhat uncomfortable that we have not specified a DTD (or a schema) yet. Is there a reason for this?

In the process of developing these conversions, it was felt that a DTD would be more of a distraction than a help, especially at the initial stages.

There is a tendency when developing a DTD to attempt to build a complete model of the information. In this case, the ideal DTD would completely model the language's grammar. In fact, that's the approach taken by the projects discussed in Formal Representations of Programs. Such a complete DTD contributes to the complexity of the effort, and it has been our goal to try to do this project in small incremental stages.

If the DTD does not represent the entire grammar, then it should at least represent those portions of the grammar in which we are currently interested, so that the structural markup can be validated. One has to ask whether the presence of such a DTD can increase the correctness of the conversion project, and whether it can speed the diagnosis of errors.

We have already described three tests that are performed automatically as the source code is being converted. A test for completeness ensures that keywords that indicate structure do not "fall through the cracks". A regeneration of the original source code from the intermediate language ensures that the entire input is accounted for. The test to regenerate an equivalent program from the final markup ensures that the structure of the final XML instance is consistent with the structure of the original program.

It is this last test which largely reduces the need for a DTD. Keep in mind that these tools are designed to work on existing programs, so those programs have already been structurally validated against the language's own grammar. This is a far more complete test than validating it against any partial DTD.

By ensuring that the XML structure is consistent with the structure of the source program, we have as much structural validation as we could expect from a DTD.

This project is still in an early stage. We do envision writing a DTD at a later point, both to facilitate the development of tools beyond those described here and to communicate the extent of the structural capture to other tool developers.

In the meantime, the tests that we are performing as we generate the markup ensure an extremely high degree of correctness and are sufficient for now.

Example Applications

As a proof of concept, we implement three useful software engineering tools as content processing applications:

  • an interface function stub generator,
  • a code formatter, and
  • an analysis tool for variables.

Each tool consists of two processes. The first process is common to all of the tools described here. It converts the OmniMark program to XML markup as described above. The second process is different for each tool. It applies standard content processing techniques to the marked up program (or "document") to generate the desired output.

Interface Function Stub Generator

OmniMark provides an external function API that allows a developer to create C or C++ functions that can be called directly from an OmniMark program. This is one of the most common ways to invoke third-party applications directly from an OmniMark program.

The OmniMark external function API for C++ provides an OmniMark environment object that represents the calling context for the external function. The environment object provides accessor methods for retrieving the external function's arguments and for setting the value that the external function will return.

The external function stub generator reads the OmniMark interface to the external function library, and generates a C++ source file containing stubs for the external function implementations. (In practice, a C++ header file is generated as well, but that is not shown here.)

An external function library interface looks like this:

; omblowfish.xin - OmniMark interface to a BlowFish cryptography library

declare function-library "omblowfish"

define external string function 
   omblowfish-version
as "BlowFishLibraryVersion"


define external integer function
   create-encryption-key               value      string  encryption-key
as "BlowFishCreateEncryptionKey"


define external string function
   decode                              value      string  src
                                 using modifiable integer state
as "BlowFishDecode"


define external string function
   encode                              value      string  src
                                 using modifiable integer state
as "BlowFishEncode"

This example defines four external functions and some of the linkage information associated with them. The "declare function-library" declaration sets the name of the external library for the functions which follow. Function definitions containing the keyword external indicate functions defined in the external library. The bodies of these functions consist of a string naming the C or C++ function that OmniMark will call when the function is invoked.

Here is how the library interface looks after it has been marked up.

    <comment> omblowfish.xin - OmniMark interface to a BlowFish cryptography library
    </comment>
    <function-library name="omblowfish"/>
    
    <function line="5" name="omblowfish-version" type="string"
	  external="BlowFishLibraryVersion" library="omblowfish">
      <arguments>
      </arguments>
    </function>
    
    <function line="10" name="create-encryption-key" type="integer"
	  external="BlowFishCreateEncryptionKey" 
     library="omblowfish">
      <arguments>
        <argument name="encryption-key" type="string" property="value" id="a_3"/>
      </arguments>
    </function>
    
    <function line="15" name="decode" type="string" external="BlowFishDecode" 
      library="omblowfish">
      <arguments>
        <argument name="src" type="string" property="value" id="a_5"/>
        <argument name="state" type="integer" property="modifiable"
             herald="using" id="a_6"/>
      </arguments>
    </function>
    
    <function line="21" name="encode" type="string" external="BlowFishEncode" 
     library="omblowfish">
      <arguments>
        <argument name="src" type="string" property="value" id="a_8"/>
        <argument name="state" type="integer" property="modifiable" 
             herald="using" id="a_9"/>
      </arguments>
    </function>

This instance can be used to populate templates to produce a C++ stub file. The template for a C++ function is:

dll_export_fun void 
[:function-name:] (OM_ExtFuncApiHandle_t  Environment)
{
   try 
   {
      /* Argument handling */
      OM_ExtFuncAPI_t                    ExtFuncAPI (Environment);
      [:argument-declarations:]
      [:return-value-declaration:]

      /* Add function implementation here */

      /* Return value handling */
      [:return-value:]
   }

   /* Exception handling */
   catch (OM_Exception_t &  Except)
   {
      OM_DefaultExceptionHandler (Environment, Except);
   }

   catch (...)
   {
      OM_UnexpectedExceptionHandler (Environment);
   }
}

The interesting parts of the conversion are as follows for the function "create-encryption-key":

  • The template slot
    [:function-name:]
    
    gets replaced by the value of the attribute external of the element function. In this case:
    BlowFishCreateEncryptionKey
    
  • The template slot
    [:argument-declarations:]
    
    is replaced by processing the element:
          <arguments>
            <argument name="encryption-key" type="string" property="value" id="a_3"/>
          </arguments>
    to produce
          OM_StreamShelf_t                   EncryptionKey (ExtFuncAPI, 1);
          OM_String_t                        EncryptionKeyValue = EncryptionKey.Value ();
    
    The interesting part of this expansion is that OmniMark allows hyphens in names, whereas C++ does not. So the argument name encryption-key is converted to EncryptionKey for C++.
  • Finally, the attribute type on the element function is used to fill the slots:
          [:return-value-declaration:]
    
    to be filled with
          OMXF_int                              ReturnValue = 0;
    
    and
          [:return-value:]
    
    with
          OMXF_SetCounterReturnValue(p_Env, ReturnValue);
    

The fully generated library stub file is shown in Annex A: Example Library Stub File.
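
To make the slot filling concrete, here is a small Python sketch of the expansion for create-encryption-key. It is illustrative only: the real generator is a content processing application, and the template text and type handling below are simplified from the stub shown in Annex A.

import xml.etree.ElementTree as ET

FUNCTION_XML = """
<function line="10" name="create-encryption-key" type="integer"
          external="BlowFishCreateEncryptionKey" library="omblowfish">
  <arguments>
    <argument name="encryption-key" type="string" property="value" id="a_3"/>
  </arguments>
</function>
"""

def cpp_name(omnimark_name):
    # OmniMark allows hyphens in names; C++ does not, so encryption-key
    # becomes EncryptionKey.
    return "".join(part.capitalize() for part in omnimark_name.split("-"))

def argument_declarations(function):
    # Fill the [:argument-declarations:] slot from the argument elements.
    lines = []
    for position, argument in enumerate(function.find("arguments"), start=1):
        name = cpp_name(argument.get("name"))
        if argument.get("type") == "string":
            lines.append("OM_StreamShelf_t  %s (ExtFuncAPI, %d);" % (name, position))
            lines.append("OM_String_t       %sValue = %s.Value ();" % (name, name))
        else:
            lines.append("OM_CounterShelf_t %s (ExtFuncAPI, %d);" % (name, position))
    return "\n      ".join(lines)

function = ET.fromstring(FUNCTION_XML)
stub = (
    "dll_export_fun void\n"
    "%s (OM_ExtFuncApiHandle_t  Environment)\n"
    "{\n"
    "   try\n"
    "   {\n"
    "      OM_ExtFuncAPI_t   ExtFuncAPI (Environment);\n"
    "      %s\n"
    "      OMXF_int          ReturnValue = 0;\n"
    "\n"
    "      /* Add function implementation here */\n"
    "\n"
    "      OMXF_SetCounterReturnValue(p_Env, ReturnValue);\n"
    "   }\n"
    "   catch (...) { OM_UnexpectedExceptionHandler (Environment); }\n"
    "}\n"
) % (function.get("external"), argument_declarations(function))
print(stub)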

A Code Formatter

In many ways, code formatting is the most obvious kind of transformation that can be applied. In the content processing world, this is called normalization. This example addresses line-breaking issues, indentation, and naming conventions. (All names are lower-cased. OmniMark is case-insensitive where user-defined names are concerned, so the lower-casing corrects inconsistencies in capitalization.)

If we take one of the preceding examples, and mangle it like this:

define string source function STRIP value string Part from value string source Whole as
repeat scan WHole
          match parT
          match any => T output t
again

process
output "Here is a sentence.%n"
output strip " "
from "Here is a sentence with spaces removed.%n"
output "The final sentence is normal.%n"           

Converting this to XML gives a clearer idea of what we are dealing with:

<file path="../examples\formatter.xom">
    <function line="3" name="STRIP" type="string source">
      <arguments>
        <argument name="Part" type="string" property="value" id="a_3"/>
        <argument name="Whole" type="string source" property="value"
          herald="from" id="a_4"/>
      </arguments>
      <body>
        <loop type="scan">
          <expression>
            <ref id="a_4">WHole</ref>
          </expression>
          <alternative type="match">
            <header>
              <ref id="a_3">parT</ref>
            </header>
          </alternative>
          <alternative type="match">
            <header>
              any
              <variable line="6" scope="local" name="T" type="pattern" id="v_5"/>
            </header>
            <body>
              <action>output <ref id="v_5">t</ref></action>
            </body>
          </alternative>
        </loop>
      </body>
    </function>
    <rule type="process" line="9">
      <body>
        <action>output "Here is a sentence.%n"</action>
        <action>output <ref id="f_2">strip</ref> " "
		   from "Here is a sentence with spaces removed.%n"</action>
        <action>output "The final sentence is normal.%n"</action>
      </body>
    </rule>
  </file>

From this, it is easy to normalize into the more cleanly formatted:

define string source function
   strip                           value string        part
                              from value string source whole
as
   repeat scan whole
      match part
      match any => t
         output t
   again

process
   output "Here is a sentence.%n"
   output strip " " from "Here is a sentence with spaces removed.%n"
   output "The final sentence is normal.%n"           

Since our XML intermediate form actually tracks variable relationships using id attributes, we can even rename variables safely. This allows us to support naming conventions like "Hungarian notation" automatically.
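
A sketch of such a renaming pass over the marked-up form is shown below; the element and attribute names are those used in the examples above, while the new name and the renaming policy are purely illustrative.

import xml.etree.ElementTree as ET

MARKUP = """<rule type="find" line="8">
  <header>
    ";" any-text*
    <variable line="8" scope="local" name="line" type="pattern" id="v_3"/>
    "%n"
  </header>
  <body>
    <action><ref id="f_1">output-comment-element</ref> <ref id="v_3">line</ref></action>
  </body>
</rule>"""

def rename(root, variable_id, new_name):
    # Rename one variable definition and every reference linked to it by id,
    # leaving unrelated uses of the same name untouched.
    for variable in root.iter("variable"):
        if variable.get("id") == variable_id:
            variable.set("name", new_name)
    for ref in root.iter("ref"):
        if ref.get("id") == variable_id:
            ref.text = new_name

root = ET.fromstring(MARKUP)
rename(root, "v_3", "pMatchedLine")   # hypothetical Hungarian-style name
print(ET.tostring(root, encoding="unicode"))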

The explicit relationship between references to names and their definitions means that we can easily convert the code to a browsable HTML representation, where every variable reference is linked to its definition. If we extend our intermediate language to mark up keywords explicitly, we can perform syntax coloring when we generate the HTML. This often greatly enhances readability.

Variable Analysis

We have already observed that references to names are explicitly linked to their definitions using id attributes. This enables us to implement a tool that provides us with a report on every name in a program:

  • We can list every place that a name is referenced.

  • We can identify every name that is defined but never used.

  • We can identify local variable definitions that hide names in a higher scope (local variables in an enclosing variable scope, function argument names, and global variable names). Inadvertent name reuse can cause programmer confusion.
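
A condensed Python sketch of how such a report can be assembled from the id attributes follows; hiding detection, which also needs scope nesting information, is omitted, and the embedded markup is a cut-down fragment of the encoding shown in Annex C.

import xml.etree.ElementTree as ET
from collections import defaultdict

MARKUP = """<file path="example4.xom">
  <variable line="1" scope="global" name="a" type="switch" id="v_1"/>
  <variable line="16" scope="local" name="a" type="pattern" id="v_4"/>
  <action line="18">set area-code to <ref line="18" id="v_4">a</ref></action>
</file>"""

def variable_report(root):
    # List where each name is defined, where it is referenced, and which
    # names are defined but never used.
    references = defaultdict(list)
    for ref in root.iter("ref"):
        references[ref.get("id")].append(ref.get("line") or "?")
    for definition in root.iter("variable"):
        print("%s (%s %s) defined at line %s" % (
            definition.get("name"), definition.get("scope"),
            definition.get("type"), definition.get("line")))
        lines = references[definition.get("id")]
        if lines:
            print("   referenced at lines: " + ", ".join(lines))
        else:
            print("   never referenced")

variable_report(ET.fromstring(MARKUP))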

An example program showing the name a used for three different variables is shown in Annex C: Variable Report.

The variable report (shown in Annex C: Variable Report) shows each definition of the variable. For each definition, all the references to that variable are shown. Unique ids allow the report to distinguish between references to different variables with the same name.

The report also shows when a local variable hides a variable declared in a higher-level scope. This can be a common source of errors, since programmers can sometimes write code that references the name of the local variable, thinking that they are using the global. (This is not uncommon when code is pasted from one location in a program to another.) This kind of a variable report can be useful in detecting the possibility of such errors.

It should also be noted that OmniMark programs can use an include statement to pull in the contents of other OmniMark files. These included files may also include other files. Names may be defined in included files and used in later included files or the main program. In these situations, a report listing every name defined, and its location, can be invaluable to a new programmer.

The information that enables the variable report can also be used to build a renaming tool that can safely rename variables without invalidating the references.

Observations and Conclusions

The applications discussed so far were written either in XSLT or in OmniMark, depending on the requirements for pattern matching.

Potential Applications

It should be obvious that the preceding examples barely scratch the surface of what is possible, once program structure is explicitly marked up. Many additional applications are also possible, including (but not limited to):

  • The creation of stubs for API documentation
  • Conversion from one programming language to another (very similarly structured) programming language
  • Generation of input for an interpreter program
  • Generation of stubs for unit testing
  • Extension of the intermediate markup language into a full-blown literate programming language
  • Code colorization
  • Annotating a program with profiling or tracing code to aid in debugging and optimization
  • String capture for internationalization
  • Variable renaming
  • Syntax modernization
  • And many, many more

The techniques described here can be applied to any programming language. For languages where a grammar is unavailable, this may be the only way to build the tools described here. These techniques can be applied even to Unix shell scripting languages, or Windows batch files.

They require a working-level knowledge of the source programming language and facility with content conversion approaches.

Limitations

One of the limitations in this approach is completeness. The approach taken here has been a content engineering approach. A sample set of programs is identified, and the transformation programs are modified until the sample set is completely processed. At this point, the transformations can be extrapolated to larger program sets.

Such an example-guided approach does not guarantee complete handling of all possible legal programs. However, it is not hard to compare the pattern matching rules against the language grammar to look for holes that need to be plugged. Completeness is therefore achievable if a complete definition of the language is available in some form.

Another limitation in this approach involves macro bodies and macro invocations. Macros can expand to arbitrary text, and this text can span programming language construct boundaries. These types of macros can be termed "obfuscatory" because they tend to obscure the structure of the program. The bodies of these macros may not be marked up correctly because they may contain partial language structures. And the invocation of such macros may cause the incorrect markup of program structure.

In general, such macros are discouraged, because human readers have much the same problem understanding the resulting program structure. It is possible to detect macro bodies which contain partial language structures and issue appropriate error messages. At least this means that obfuscatory macros can be detected, even if they introduce markup problems.

Not all such cases are bad style. One recent such example is a macro to reduce the effort of writing DITA-aware OmniMark programs. The processing of DITA elements often involves element rules that look like this:

    element #implied when attribute "class" matches (any** " bkinfo/bkinfo ")
       ...
    
    element #implied when attribute "class" matches (any** " topic/topic ")
       ...
Such code can be made easier to read by creating macros that hide the invariant parts:
    macro dita-element token class-name is
          element #implied when attribute "class" matches (any** " %@(class-name) ")
    macro-end

    dita-element "bkinfo/bkinfo"
       ...
    
    dita-element "topic/topic"
       ...
In this case, the macro body contains the beginning of an element rule, but not its end.

If such macros are deemed useful, the pattern-matching rules can be extended to treat these macro invocations as if they were part of the language. They become de facto language extensions.

For languages with a separate macro preprocessor, like C and C++, include files can also obscure program structure. In those languages, the include directive is part of the preprocessor language, not the programming language proper. Therefore, language structures may begin before the include file and end within it, or vice versa. The "include structure" of the program may not be well-nested with respect to language constructs.

This situation is analogous to SGML text entities and elements, which also do not need to be well-nested with respect to each other.

Include files with partial language structures can be detected, in the same way as macro bodies with partial language structures.

Related Work

Formal Representations of Programs

This paper describes a largely informal process of adding markup to computer programs. It produces useful results quickly, and can be extended incrementally as needed.

A program can be represented more formally as an abstract syntax tree. A parser is the part of an interpreter or compiler that converts source code into an abstract syntax tree. In general, abstract syntax trees contain only the information needed to execute the program, but they can also be extended to contain information about the source code representation.

James Power and Brian Malloy [Power 2002] describe how they modified the GNU parser generator bison to annotate source code with XML for programming languages in the C family.

The Harmonia Research Project [Harmonia] has developed a generalized framework for representing information about computer programs in order to build CASE tools. Work is being done on an XML encoding of its internal syntax tree. Harmonia is the most general of the cases reviewed here, supporting not only C, C++, and Java, but also other languages such as Common Lisp, Scheme, XML, Lambda Calculus, Regular Expressions, Cobol, and a few others. For some of these languages, only the syntax is supported.

The Datrix schema [Datrix] models both the lexical and syntactic structure of computer programs. However, it models only preprocessed code, so details about macro invocations are lost. This schema was designed to handle C, C++, and Java.

The Columbus Schema [Columbus] is another C++ schema. Like Datrix, the Columbus schema seems to be based on preprocessed code as well.

In general, the formal approaches to this problem are fairly complete, but because of this they are more complex than the approach described here. A background in compiler technology would be necessary in order to take this approach.

In addition, Datrix and Columbus in particular can only handle preprocessed code. Macros are common in languages such as C (and OmniMark), and the practicality of these tools is greatly diminished by discarding this information.

Finally, these approaches are fairly language-specific. Many are designed to support the C family of languages. Those that are more general still require a formal grammar for the programming language. Some tools will only work with a restricted set of grammars. In particular, bison requires an LALR(1) grammar and therefore cannot handle languages such as OmniMark.

The approach described in this paper is quite fault-tolerant because it does not require a language grammar. It can give useful information about a program even if some of the constructs are not recognized.

Multi-Syntax Languages

The idea of having more than one syntax to encode a programming language is also not new. There are some very mainstream languages that provide specifications for more than one syntax and provide tools to convert between them.

RELAX NG [Clark 2001] is an XML schema language. The specification uses an XML encoding for the schema, but there is also a non-XML compact syntax [Clark 2002] designed to maximize human readability.

XQuery [XQuery 2007] is a query language for retrieving information from XML documents. The standard specifies an English-like language, but there is also a corresponding XML syntax called XQueryX [XQueryX 2003].

Wilmott [Wilmott 2006] discusses the advantages of recasting XSLT [XSLT 1999] into an elegant human-readable form. He describes the benefits of each syntax, and provides tools to convert the XML form to the human-readable form.

LISP [McCarthy 1979] also falls into the category of a language with multiple syntaxes. Instead of an XML syntax, LISP syntax is commonly expressed in terms of S-expressions [S-expression, Wikipedia] (or nested lists). Since LISP is designed for list processing, it is the ideal language to manipulate itself (much like XSLT).

LISP also initially specified an infix syntax that was thought to be easier for humans. This syntax, known as M-expressions [M-expression, Wikipedia], was never widely adopted, and the S-Expression form of LISP is the syntax most often seen today.

It would seem that we do not need to create a process for transforming a programming language to XML if we use one of these languages. The content processing tools will easily follow.

In reality, it is not that simple. First of all, the essential audience for the computer program is still the computer that runs it. No matter what, we have to please this audience.

Language choice is also often motivated by productivity. The most effective language is usually one that expresses concepts in terms of the problem domain. If you have to map problem domain concepts into the language domain concepts, then you have an extra step to perform in creating your application. Worse, the models may not be perfectly isomorphic, and your translation may be incorrect in subtle ways.

Productivity is also increased by tools. Programmers often prefer to work in a language that has a good debugging tool and good language-specific editors.

Language choice is often dictated by application concerns as well. Often the performance or reliability requirements exclude certain language choices.

Language choice may be mandated by corporate policies either because of commercial agreements, security restrictions, or to ensure that the applications are maintainable without retraining existing staff.

Any of these reasons alone may make the choice of a hybrid language impossible.

Literate Programming

Some success has also been achieved in repurposing software by writing the software within a literate programming [Knuth 1992] framework.

Literate programming intertwines documentation with source code, so that the two are more easily maintained in a consistent fashion. Classically, a literate programming system provides two tools, one for producing documentation, and another one for producing pure source code which can be used to build the application.

Examples include language-specific systems like Javadoc [Kramer 1999] and language-agnostic systems [Légaré 2006].

Literate programming is an alternative only if there is a literate programming system for the language being used, and if it produces the type of information you want from your code.

In general, literate programming systems use markup to indicate embedded documentation and the boundaries of the code fragments. They rarely explicate the finer structure of the source code itself.

Creating a custom literate programming system is a big task. Modeling both the software and the documentation structure simultaneously is often difficult. The processing tools often need to do some major reorganization to build either the documentation or the programs, or both.

In addition, debugging tools would likely have no knowledge of a home-built literate programming system. All error messages and debugging information would be relative to the generated code. In practice, this can make it difficult to trace things back to the original literate program.

It also means that one cannot simply edit code in the debugger and retry until the problem is solved. The corrected code has to be reintegrated back into the literate programming system, and if changes have been made in many places, the process will become error-prone.

Finally, the decision to use a literate programming system is best applied at the beginning of a software development project. Once a substantial amount of source code is written, it requires much more effort to import it into a literate programming system.

Annex A: Example Library Stub File

Here is the fully generated stub file for the blowfish encryption library:

/******************************************************************

   Copyright (C) Stilo International Ltd
   All Rights Reserved

 ******************************************************************/

#include "protype.h"

/* Add C++ include files here */

extern "C" {
#include "libblowfish.h"
  /* Add C include files here */
}

#include "omblowfish.h"

/************************** Local Definitions ***************************/

/* Add local definitions here */

/****************** Exported Function Implementations *******************/

dll_export_fun void 
BlowFishLibraryVersion (OM_ExtFuncApiHandle_t  Environment)
{
   try 
   {
      /* Argument handling */
      OM_ExtFuncAPI_t                       ExtFuncAPI (Environment);
      char                                 *p_ReturnValue = "";

      /* Add function implementation here */

      /* Return value handling */
      ExtFuncAPI.WriteToStreamReturnValue (OM_TemporaryString_t (p_ReturnValue));
   }

   /* Exception handling */
   catch (OM_Exception_t &  Except)
   {
      OM_DefaultExceptionHandler (Environment, Except);
   }

   catch (...)
   {
      OM_UnexpectedExceptionHandler (Environment);
   }
}


dll_export_fun void 
BlowFishCreateEncryptionKey (OM_ExtFuncApiHandle_t  Environment)
{
   try 
   {
      /* Argument handling */
      OM_ExtFuncAPI_t                  ExtFuncAPI (Environment);
      OM_StreamShelf_t                 EncryptionKey (ExtFuncAPI, 1);
      OM_String_t                      EncryptionKeyValue = EncryptionKey.Value ();
      OMXF_int                         ReturnValue = 0;

      /* Add function implementation here */

      /* Return value handling */
      OMXF_SetCounterReturnValue(p_Env, ReturnValue);
   }

   /* Exception handling */
   catch (OM_Exception_t &  Except)
   {
      OM_DefaultExceptionHandler (Environment, Except);
   }

   catch (...)
   {
      OM_UnexpectedExceptionHandler (Environment);
   }
}


dll_export_fun void 
BlowFishDecode (OM_ExtFuncApiHandle_t  Environment)
{
   try 
   {
      /* Argument handling */
      OM_ExtFuncAPI_t                  ExtFuncAPI (Environment);
      OM_StreamShelf_t                 Src (ExtFuncAPI, 1);
      OM_String_t                      SrcValue = Src.Value ();
      OM_CounterShelf_t                State (ExtFuncAPI, 2);
      char                            *p_ReturnValue = "";

      /* Add function implementation here */

      /* Return value handling */
      ExtFuncAPI.WriteToStreamReturnValue (OM_TemporaryString_t (p_ReturnValue));
   }

   /* Exception handling */
   catch (OM_Exception_t &  Except)
   {
      OM_DefaultExceptionHandler (Environment, Except);
   }

   catch (...)
   {
      OM_UnexpectedExceptionHandler (Environment);
   }
}


dll_export_fun void 
BlowFishEncode (OM_ExtFuncApiHandle_t  Environment)
{
   try 
   {
      /* Argument handling */
      OM_ExtFuncAPI_t                  ExtFuncAPI (Environment);
      OM_StreamShelf_t                 Src (ExtFuncAPI, 1);
      OM_String_t                      SrcValue = Src.Value ();
      OM_CounterShelf_t                State (ExtFuncAPI, 2);
      char                            *p_ReturnValue = "";

      /* Add function implementation here */

      /* Return value handling */
      ExtFuncAPI.WriteToStreamReturnValue (OM_TemporaryString_t (p_ReturnValue));
   }

   /* Exception handling */
   catch (OM_Exception_t &  Except)
   {
      OM_DefaultExceptionHandler (Environment, Except);
   }

   catch (...)
   {
      OM_UnexpectedExceptionHandler (Environment);
   }
}


/******************* Helper Function Implementations ********************/

/* Add helper function implementations here */

Annex B: Example Intermediate Markup

The process of converting the source code to a more abstract XML markup employed a highly redundant intermediate stage. The intermediate stage retains all of the original source code, including white space, so that the original input can be reproduced exactly.

For the example:

define function
   output-comment-element value string text
as
   output "<comment>"
   output text
   output "</comment>%n"

find ";" any-text* => line "%n"
   output-comment-element line

element "comment"
   suppress

The following intermediate markup was produced:

<file path="../examples\example1.xom">
  <function line="1" name="output-comment-element" type="">
    <source>define function
   output-comment-element</source>
    <arguments>
      <argument name="text" type="string" property="value" id="a_3"/>
  <source> value string text</source>
    </arguments>
    <source>
as</source>
    <body>
      <unprocessed>
</unprocessed>
      <unprocessed>   </unprocessed>
      <action>
        <unprocessed>output "&lt;comment&gt;"
</unprocessed>
        <unprocessed>   </unprocessed>
      </action>
      <action>
        <unprocessed>output <ref id="a_3">text</ref>
</unprocessed>
        <unprocessed>   </unprocessed>
      </action>
      <action>
        <unprocessed>output "&lt;/comment&gt;%n"
</unprocessed>
        <unprocessed>
</unprocessed>
      </action>
    </body>
  </function>
  <rule type="find" line="8">
    <source>find</source>
    <header>
      <unprocessed> ";" any-text* </unprocessed>
      <variable line="8" scope="local" name="line" type="pattern" id="v_4"/>
      <source>=&gt; line</source>
      <unprocessed> "%n"
</unprocessed>
      <unprocessed>   </unprocessed>
    </header>
    <body>
      <action>
        <unprocessed><ref id="f_2">output-comment-element</ref> <ref id="v_4">line</ref>
</unprocessed>
        <unprocessed>
</unprocessed>
      </action>
    </body>
  </rule>
  <rule type="element" line="11">
    <source>element</source>
    <header>
      <unprocessed> "comment"
</unprocessed>
      <unprocessed>   </unprocessed>
    </header>
    <body>
      <action>
        <unprocessed>suppress
</unprocessed>
      </action>
    </body>
  </rule>
</file>

In the intermediate markup, source elements contain the fragments of code that produced the structural markup, and unprocessed elements contain the rest of the code fragments. The original source code can be reproduced by emitting the content of these elements, and discarding everything else.

The structural markup instance is obtained by discarding the source elements and the unprocessed elements that contain only white space, and retaining the rest.
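
Both derivations are mechanical. The following Python sketch applies them to a cut-down intermediate fragment; the element names are those shown above, but the logic is simplified (in particular, the real transformation also unwraps the retained unprocessed content and normalizes white space).

import xml.etree.ElementTree as ET

INTERMEDIATE = """<rule type="element" line="11">
  <source>element</source>
  <header>
    <unprocessed> "comment"
</unprocessed>
    <unprocessed>   </unprocessed>
  </header>
  <body>
    <action>
      <unprocessed>suppress
</unprocessed>
    </action>
  </body>
</rule>"""

def original_source(root):
    # Reproduce the original program: emit the character content of every
    # source and unprocessed element, in document order, and nothing else.
    return "".join("".join(element.itertext())
                   for element in root.iter()
                   if element.tag in ("source", "unprocessed"))

def structural_markup(root):
    # Derive the structural instance: drop source elements and
    # white-space-only unprocessed elements, and keep everything else.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "source" or (
                child.tag == "unprocessed"
                and not "".join(child.itertext()).strip()
            ):
                parent.remove(child)
    return root

root = ET.fromstring(INTERMEDIATE)
print(original_source(root))   # the original two-line fragment
print(ET.tostring(structural_markup(root), encoding="unicode"))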

Annex C: Variable Report

This annex shows a small program, its XML encoding, and a report describing each variable definition and reference.

A Program Showing the Reuse of Variable Names

This is a short program that uses the same name for different variables. Specifically, the variable name a is used for both a global variable of type switch and for a locally defined pattern variable.

global switch a

process
   do xml-parse scan file "people.xml"
      output "%c"
   done

element #implied
   output "%c"

element tel
   local string area-code
   local string local-number
   do when parent is a
      do scan attribute value
         match "(" digit{3} => a ")" blank* (digit{3} "-" digit{3}) => n
            set local-number to n
            set area-code to a

         match digit{3} => a blank+ digit{3} => n1 blank+ digit{4} => n2
            set local-number to n1 || "-" || n2
            set area-code to a
         else
            not-reached message "Illegal phone number."
      done
      do when a
         output "(" || area-code || ") "
      done
      output local-number
   else
      suppress
   done

The XML Encoding of the Preceding Program

Here is the XML encoding of the program shown above.

<file path="../examples\example4.xom">
  <variable line="1" scope="global" name="a" type="switch" id="v_1">
  </variable>
  <rule type="process" line="3">
    <body>
      <block line="4" type="xml-parse">
        <header>
          scan file "people.xml"
        </header>
        <body>
          <action line="5">output "%c"</action>
        </body>
      </block>
    </body>
  </rule>
  <rule type="element" line="8">
    <header>
      #implied
    </header>
    <body>
      <action line="9">output "%c"</action>
    </body>
  </rule>
  <rule type="element" line="11">
    <header>
      tel
    </header>
    <body>
      <variable line="12" scope="local" name="area-code" type="string" id="v_2">
      </variable>
      <variable line="13" scope="local" name="local-number" type="string" id="v_3">
      </variable>
      <block type="conditional">
        <alternative line="14" type="conditional">
          <header>
            parent is <ref line="14" id="v_1">a</ref>
          </header>
          <body>
            <block line="15" type="scan">
              <expression>
                attribute value
              </expression>
              <alternative line="16" type="match">
                <header line="16">
                  "(" digit{3}
                  <variable line="16" scope="local" name="a" type="pattern" id="v_4"/>
                    
                  ")" blank* (digit{3} "-" digit{3})
                  <variable line="16" scope="local" name="n" type="pattern" id="v_5"/>
                    
                </header>
                <body>
                  <action line="17">set <ref line="17" id="v_3">local-number</ref> 
to <ref line="17" id="v_5">n</ref></action>
                 <action line="18">set <ref line="18" id="v_2">area-code</ref> to 
<ref line="18" id="v_4">a</ref></action>
                </body>
              </alternative>
              <alternative line="20" type="match">
                <header line="20">
                  digit{3}
                  <variable line="20" scope="local" name="a" type="pattern" id="v_6"/>
                    
                  blank+ digit{3}
                  <variable line="20" scope="local" name="n1" type="pattern" id="v_7"/>
                    
                  blank+ digit{4}
                  <variable line="20" scope="local" name="n2" type="pattern" id="v_8"/>
                    
                </header>
                <body>
                  <action line="21">set <ref line="21" id="v_3">local-number</ref> 
                                         to <ref line="21" id="v_7">n1</ref> 
                                            || "-"
                                            || <ref line="21" id="v_8">n2</ref></action>
                  <action line="22">set <ref line="22" id="v_2">area-code</ref> 
                                          to <ref line="22" id="v_6">a</ref></action>
                </body>
              </alternative>
              <alternative type="default">
                <body line="23">
                  <action line="24">not-reached message "Illegal phone number."</action>
                </body>
              </alternative>
            </block>
            <block type="conditional">
              <alternative line="26" type="conditional">
                <header>
                  <ref line="26" id="v_6">a</ref>
                </header>
                <body>
                  <action line="27">output "("
                  || <ref line="27" id="v_2">area-code</ref> 
                  || ") "</action>
                </body>
              </alternative>
            </block>
            <action line="29">output <ref line="29" id="v_3">local-number</ref></action>
          </body>
        </alternative>
        <alternative type="default">
          <body line="30">
            <action line="31">suppress</action>
          </body>
        </alternative>
      </block>
    </body>
  </rule>
</file>

A Sample Variable Definition and Reference Report

This is a report showing where each name (variable or function) is defined in a program, and where each name is referenced. The program is the one shown at the beginning of this annex.

<variables>
  <variable name="area-code" id="v_2">
    <defined scope="local" type="string" file="../examples\example4.xom" line="12"/>
    <referenced file="../examples\example4.xom" line="18"/>
    <referenced file="../examples\example4.xom" line="22"/>
    <referenced file="../examples\example4.xom" line="27"/>
  </variable>
  <variable name="a" id="v_1">
    <defined scope="global" type="switch" file="../examples\example4.xom" line="1"/>
    <referenced file="../examples\example4.xom" line="14"/>
  </variable>
  <variable name="a" id="v_4">
    <defined scope="local" type="pattern" file="../examples\example4.xom" line="16"/>
    <hides id="v_1"/>
    <referenced file="../examples\example4.xom" line="18"/>
  </variable>
  <variable name="a" id="v_6">
    <defined scope="local" type="pattern" file="../examples\example4.xom" line="20"/>
    <hides id="v_1"/>
    <referenced file="../examples\example4.xom" line="22"/>
    <referenced file="../examples\example4.xom" line="26"/>
  </variable>
  <variable name="local-number" id="v_3">
    <defined scope="local" type="string" file="../examples\example4.xom" line="13"/>
    <referenced file="../examples\example4.xom" line="17"/>
    <referenced file="../examples\example4.xom" line="21"/>
    <referenced file="../examples\example4.xom" line="29"/>
  </variable>
  <variable name="n1" id="v_7">
    <defined scope="local" type="pattern" file="../examples\example4.xom" line="20"/>
    <referenced file="../examples\example4.xom" line="21"/>
  </variable>
  <variable name="n2" id="v_8">
    <defined scope="local" type="pattern" file="../examples\example4.xom" line="20"/>
    <referenced file="../examples\example4.xom" line="21"/>
  </variable>
  <variable name="n" id="v_5">
    <defined scope="local" type="pattern" file="../examples\example4.xom" line="16"/>
    <referenced file="../examples\example4.xom" line="17"/>
  </variable>
</variables>

In the example, there are two definitions of the name a that are reported as hiding a previous definition (those with ids v_4 and v_6).

Acknowledgements

The author would like to thank Mario Blažević and Jacques Légaré for identifying many relevant research papers and for their diligent review; Kirill Lisovsky, Patrick Baker, and Chris Carruthers for their comments on refining this approach, additional applications, and overall suggestions; and my many colleagues at Stilo International who sat through a very hand-wavy presentation of the ideas in this paper. Many thanks also to Helen St. Denis for proof-reading the paper at almost the last possible moment. Any mistakes still present (or introduced afterwards) are mine.

Bibliography

[Clark 2001] James Clark, RELAX NG Specification, 2001, http://www.oasis-open.org/committees/relax-ng/spec-20011203.html

[Clark 2002] James Clark, RELAX NG Compact Syntax, 2002, http://www.oasis-open.org/committees/relax-ng/compact-20021121.html

[Columbus] Rudolf Ferenc, Árpád Beszédes, Data Exchange with the Columbus Schema for C++, http://www.inf.u-szeged.hu/~ferenc/research/ferencr_columbus_schema_cpp.pdf

[Datrix] R. Ferenc, S. E. Sim, R. C. Holt, R. Koschke, T. Gyimothy, Towards a standard schema for C/C++, Proceedings of the Eighth Working Conference on Reverse Engineering, 49-58, 2001

[DITA 1998] M. Priestley, DITA XML: a reuse by reference architecture for technical documentation. In Proceedings of the 19th Annual international Conference on Computer Documentation (Santa Fe, New Mexico, USA, October 21 - 24, 2001). SIGDOC '01. ACM Press, New York, NY, 152-156.

[Harmonia] Marat Boshernitsan, Susan L. Graham, Designing an XML-based exchange format for Harmonia, Proceedings of the Seventh Working Conference on Reverse Engineering, http://harmonia.cs.berkeley.edu/papers/harmonia-xml.pdf, 2000

[Knuth 1992] Donald E. Knuth, Literate Programming, Stanford, California: Center for the Study of Language and Information, 1992, CSLI Lecture Notes, no. 27

[Kramer 1999] Douglas Kramer, API documentation from source code comments: a case study of Javadoc, Proceedings of the 17th annual international conference on Computer documentation, 147-153, 1999

[Légaré 2006] Jacques Légaré, Literate Programming Using OmniMark, http://developers.stilo.com/0001.html, 2006

[M-expression, Wikipedia] M-expression, http://en.wikipedia.org/wiki/M-expression

[McCarthy 1979] John McCarthy, History of Lisp, http://www-formal.stanford.edu/jmc/history/lisp/lisp.html, 1979

[OmniMark 2007] OmniMark language documentation, https://developers.omnimark.com/docs/html/

[Power 2002] James F. Power, Brian A. Malloy, Program annotation in XML: a parse-tree based approach, Proceedings of the Ninth Working Conference on Reverse Engineering, 190-198, 2002

[S-expression, Wikipedia] S-expression, http://en.wikipedia.org/wiki/S-expression

[S1000D 2007] S1000D International specification for technical publications, http://www.s1000d.org/

[Wilmott 2003] Sam Wilmott, What programming language designers should do to help markup processing, Extreme Markup Languages 2003, http://www.idealliance.org/papers/extreme/Proceedings/html/2003/Wilmott01/EML2003Wilmott01.html, 2003

[Wilmott 2006] Sam Wilmott, Rethinking XSLT, Extreme Markup Languages 2006, http://www.idealliance.org/papers/extreme/proceedings/html/2006/Wilmott01/EML2006Wilmott01.html, 2006

[XQuery 2007] Boag et al., XQuery 1.0: An XML Query Language, http://www.w3.org/TR/2007/REC-xquery-20070123/, 2007

[XQueryX 2003] Ashok Malhotra et al., XML Syntax for XQuery 1.0 (XQueryX), http://www.w3.org/TR/2003/WD-xqueryx-20031219/, 2003

[XSLT 1999] XSL Transformations (XSLT) Version 1.0, http://www.w3.org/TR/xslt/
