Excel to XML conversions: sample program

Sample

This program converts the content of an Excel spreadsheet (XLS) file, saved as a tab-delimited text file, to an XML document.

The program uses groups to enable find rules when they are needed to produce particular elements, and to disable them when they are not needed. (In OmniMark, groups can be used to enable and disable any number of rules with a single statement.) The program uses "find groups" to enable and disable find rules used when producing the major elements because this method is simpler than using an "open element" test on each of several find rules.
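
The group mechanism can be pictured as a small state machine. The following is a minimal Python sketch, not OmniMark code: the names `ALWAYS`, `GROUPS`, and `translate` are hypothetical, and it assumes (as the rule ordering in this program suggests) that rules in the #implied group stay active and are tried before the rules of the current group.

```python
import re

# Handlers return (output, next_group); next_group None keeps the current group.
def start_table(match):
    return "<table>", "table"

def end_table(match):
    return "</table>", "none"

def numeric_cell(match):
    return "<td>" + match.group(0) + "</td>", None

# Assumption: rules outside any named group are always active and tried first.
ALWAYS = [(re.compile(r"Project/Activity"), start_table)]

# Rules that are only tried while their group is the current one.
GROUPS = {
    "none": [],
    "table": [
        (re.compile(r"Week Total"), end_table),
        (re.compile(r"\d+"), numeric_cell),
    ],
}

def translate(text):
    group, pos, out = "none", 0, []
    while pos < len(text):
        for pattern, handler in ALWAYS + GROUPS[group]:
            match = pattern.match(text, pos)
            if match:
                piece, next_group = handler(match)
                out.append(piece)
                group = next_group or group
                pos = match.end()
                break
        else:
            pos += 1  # catch-all, like "find any": discard one character
    return "".join(out)
```

Switching `group` enables or disables a whole set of rules at once, which is the effect the program gets from a single "next group is" statement.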

The program also uses information from the pattern processor while processing the child elements. For example, it issues a warning when non-numeric data appears anywhere in a table other than in the first row or first column.

During the translation, the program checks for invalid input data.

  context-translate

  global stream document-prolog initial  {
     "<!DOCTYPE xls2xml SYSTEM %"xls2xml.dtd%" [" _
     "]>"
     }

The following find rules are used to build the major elements used in the translation. The initial document-start rule performs any required initial work on the final output document. Since the output of the pattern processor will be sent to the XML parser, this part of the program must begin with a document prolog. The document prolog should either include a DTD or refer to a DTD (as is done in this program, above).

The next set of find rules determines which major element of the document to process and which group of find rules to use. The final find rules are used to clean up the output.

All of these find rules are in the #implied group.

  ;-----------------------------------------------------------------------
  ; Rules to begin construction of major elements.

  document-start
    output document-prolog

The input file starts with a text title, and at that point the "xls2xml" element of the document is not yet open. The program can therefore recognize the first text it "sees" as the title and begin the XML document at the same time. Once the document has begun, the rule no longer fires, so no subsequent text is recognized as a title.

  find  line-start white-space+
        [any-text except "%t"]+ => title
        any-text+
        unless open element is xls2xml
     output "<xls2xml>"
     output "%n<title>" || title || "</title>"
     ;
     ; We know that the next major element will be a ulist, 
     ; so open it now.
     ;
     output "%n<ulist>"
     next group is ulist

The document table begins with the text "Project/Activity". This find rule must come before the find rules that are enabled while "<ulist>" is open, so that it takes priority over their text recognition rules.

  find  "Project/Activity" => text
        "%t"?
     output "</ulist>" when open element is ulist
     repeat
        exit when open element isnt (dl | p | table | ulist)
        output "</%q>"
     again
     output "%n<table><tr><th>" || text || "</th>"
     next group is table

The last major element, "<dl>", starts after the paragraph that begins with "*Note:".

  find  (  "*Note:"
           [any-text except "%t"]*
        )  => text
        "%t"?
     output "%n<p>" || text || "</p>"
     output "%n<dl>"
     next group is dl

  ;-----------------------------------------------------------------------
  ; Rules that produce an unordered list (ulist).

From the analysis of the input document, we know not only that unordered list items occur in the first two cells of each row, but also that some rows may be blank. An error is produced when a "non-blank" row starts with two empty cells, or when any cell other than the first two has data.

  group ulist

  find  line-start
        [any-text except "%t"]+ => legend
        "%t"
        [any-text except "%t"]+ => info
     ;
     ; Change the tab to a space.
     ;
     output "%n<li>" || legend || " " || info || "</li>"

  find  [any-text except "%t"]+ => info
        when open element is ulist
     put #error "Warning: invalid data %"" || info || "%" in ulist%n"

  find  any   ; Blank lines do not cause an error.
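
The row handling in this group can be illustrated outside OmniMark. The following is a rough Python sketch of the same checks, working on whole rows rather than on a character stream; the function name `ulist_items` is hypothetical, and the sketch ignores XML escaping just as the sample does.

```python
def ulist_items(lines):
    """Return (items, warnings) for tab-delimited rows of an unordered list."""
    items, warnings = [], []
    for line in lines:
        cells = line.rstrip("\n").split("\t")
        if not any(cells):
            continue  # blank rows do not cause an error
        legend = cells[0]
        info = cells[1] if len(cells) > 1 else ""
        if legend and info:
            # The tab between the first two cells becomes a space.
            items.append("<li>%s %s</li>" % (legend, info))
            extras = cells[2:]
        else:
            extras = cells  # no valid item: every non-empty cell is stray data
        for cell in extras:
            if cell:
                warnings.append('Warning: invalid data "%s" in ulist' % cell)
    return items, warnings
```

For example, the rows "Name", tab, "Alice" and "Dept", tab, "Sales", tab, "extra" would yield two list items and one warning about "extra".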

  ;-----------------------------------------------------------------------
  ; Rules that produce a table.

Cells in the first row or column can contain non-numeric data; these cells are converted to "<th>". All other cells must either be blank or contain numeric data; these cells are converted to "<td>".

Blank rows are allowed in the input, but no corresponding row will appear in the output. Every input row (blank or non-blank row) ends with a newline "%n".

Blank cells can occur in a non-blank row; an empty cell is output for these. This is done by picking up and discarding the tab that follows a non-blank cell, then outputting a blank cell for any tab that has not been discarded. Note that this will not output the last cell of a row when the cell is blank because the last cell is never followed by a tab.

The output document structure is used to decide when to report non-numeric data as an error.

The table ends when the string "Week Total" is encountered.

  group table

  find  "Week Total"
     repeat
        exit when open element isnt table
        output "</%q>"
     again
     next group is #implied

  find  digit+ => text
        "%t"?
     output "%n<tr>" when open element isnt tr
     do when occurrence of open element tr = 1
        output "%n<th>" || text || "</th>"
     else
        output "%n<td>" || text || "</td>"
     done

  find  [any-text except "%t"]+ => text
        "%t"?
     output "%n<tr>" when open element isnt tr
     do when occurrence of open element tr = 1    ; First row.
        output "%n<th>" || text || "</th>"
     else                                         ; Not the first row.
        output "%n<td>" || text

Note that "td" is not closed here; it's closed below. It's left open so that the occurrence number can be counted.

        ;
        ; Non-numeric data is allowed only in the 
        ; first row and first column.
        ;

        do when occurrence != 1                   ; Not the first column.
           local counter row
           local counter column
           set row to occurrence of open element tr  ; Current row.
           set column to occurrence                  ; Current column.
           put #error "Warning: non-numeric data %"" || text
              || "%" in row " || "d" % row || ", column "
              || "d" % column || " of table%n"
        done
        output "</td>"
     done

  find  "%t" when open element is tr
     do when occurrence of open element tr = 1
        output "%n<th></th>"
     else
        output "%n<td></td>"
     done

  find  "%n" when open element is tr
     output "</tr>"

  find any
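
The cell handling in this group, including the loss of a trailing blank cell, can be sketched in Python. This is an illustration only, not a translation of the OmniMark rules: `convert_row` is a hypothetical name, the sketch assumes every non-blank row starts with a non-blank cell, and it treats a cell as numeric only when it consists entirely of digits. Like the find rules as written, it emits "th" only for the first row.

```python
def convert_row(line, row_number):
    """Return (xml, warnings) for one tab-delimited row; row_number starts at 1."""
    text = line.rstrip("\n")
    if text.strip("\t") == "":
        return "", []  # blank row: no corresponding output row
    cells = text.split("\t")
    if cells[-1] == "":
        cells.pop()  # the last cell has no tab after it, so a blank one is lost
    tag = "th" if row_number == 1 else "td"
    warnings = []
    for column, cell in enumerate(cells, start=1):
        # Non-numeric data is allowed only in the first row and first column.
        if row_number > 1 and column > 1 and cell and not cell.isdigit():
            warnings.append(
                'Warning: non-numeric data "%s" in row %d, column %d of table'
                % (cell, row_number, column))
    xml = "<tr>" + "".join("<%s>%s</%s>" % (tag, c, tag) for c in cells) + "</tr>"
    return xml, warnings
```

Splitting on tabs makes the blank-cell behavior explicit: an empty string between two tabs becomes an empty cell, while an empty string after the final tab is dropped.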

  ;-----------------------------------------------------------------------
  ; Rules that produce a list of paired items (dl).

All input data should be in pairs -- a number followed by a description. The structure of the output document is used to determine when to check for a missing description or missing number.

  group dl

  find digit+ => text
     output "%n<dt>" || text || "</dt>"
     do when occurrence != 1 and previous isnt dd
        put #error "Warning: Product number before %"" || text
              || "%" did not have description%n"
     done

  find [any-text except "%t"]+ => text
     output "%n<dd>" || text
     do when previous isnt dt
        put #error "Warning: Product description %"" || text
              || "%" without product number%n"
     done
     output "</dd>"

The "dd" element is closed only after the check has run. This allows the rule to use the "previous" test to check whether the previous element was "dt".
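
The pairing check can be sketched in Python by remembering the kind of the previous item. This is a simplified illustration that works on a list of already separated tokens; `pair_items` is a hypothetical name, and a token counts as a product number only when it consists entirely of digits.

```python
def pair_items(tokens):
    """Return (elements, warnings) for a sequence of number/description tokens."""
    out, warnings, previous = [], [], None
    for token in tokens:
        if token.isdigit():
            # Two numbers in a row: the earlier one had no description.
            if previous == "dt":
                warnings.append(
                    'Warning: Product number before "%s" did not have description'
                    % token)
            out.append("<dt>%s</dt>" % token)
            previous = "dt"
        else:
            # A description must directly follow a number.
            if previous != "dt":
                warnings.append(
                    'Warning: Product description "%s" without product number'
                    % token)
            out.append("<dd>%s</dd>" % token)
            previous = "dd"
    return out, warnings
```

Here the `previous` variable plays the role that the structure of the output document plays in the OmniMark program.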

  ;-----------------------------------------------------------------------
  ; Clean-up rules.

Clean-up rules are placed at the end so that they will fire only when no other pattern identifies what is to be processed.

  group #implied

  find any

  find-end
     repeat over reversed current elements as this-element
        output "</" || name of current element this-element || ">"
     again
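
The final clean-up closes whatever elements are still open, innermost first. A minimal Python sketch of that idea, assuming the open elements are tracked in a list ordered from outermost to innermost (the name `close_open_elements` is hypothetical):

```python
def close_open_elements(open_elements):
    """Build end tags for all open elements, closing the innermost first."""
    return "".join("</%s>" % name for name in reversed(open_elements))
```

If "xls2xml" and "dl" are still open at the end of the input, this produces the end tag for "dl" followed by the end tag for "xls2xml".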

  ;-----------------------------------------------------------------------
  ; XML output processing

  element (table | ulist | dl)
     output "<%q>%c%n</%q>"

  element (dt | dd | li | td | th | tr | p)
     output "%n<%q>%c</%q>"

  element title
     output "<%q>%c</%q>%n"

  element #implied
     output "%c</%q>"

The final part of the program simply puts the tags back in. A more elaborate conversion might use the element rules to perform other types of work. For example, referents cannot be written to the XML parser, so element rules could be used to handle forward links on the output side.

----

Generated: April 21, 1999 at 2:01:42 pm