Excel to XML conversions: sample program
Introduction: Excel to XML conversion
Sample
This program converts the content of an Excel spreadsheet (XLS) file, saved as a tab-delimited text file, to an XML document.
The program uses groups to enable find rules when they are needed to produce particular elements, and to disable them when they are not needed. (In OmniMark, groups can be used to enable and disable any number of rules with a single statement.) The program uses "find groups" to enable and disable find rules used when producing the major elements because this method is simpler than using an "open element" test on each of several find rules.
The program also uses information from the pattern processor while processing the child elements. For example, it issues a warning when non-numeric data appears anywhere in a table other than in the first row or first column.
During the translation, the program checks for invalid input data.
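The group mechanism described above can be pictured as a small state machine. The following Python sketch is not OmniMark and is not part of the sample program; the names `RULES`, `classify`, and `run` are hypothetical, and the per-group tests are deliberately simplified. It only illustrates the idea that a few always-active rules switch the current group, while the remaining rules apply only when their group is enabled.

```python
# Illustrative sketch only: mimics OmniMark rule groups with a dictionary.
# RULES, classify and run are hypothetical names, not OmniMark features.

def classify(line, group):
    """Return (label, next_group) for one input line."""
    # Rules in the always-active set are tried first and may switch groups.
    if line.startswith("Project/Activity"):
        return "table-start", "table"
    if line.startswith("*Note:"):
        return "note", "dl"
    # Otherwise only the rules of the currently enabled group apply.
    label = RULES.get(group, lambda _line: "ignored")(line)
    return label, group


# One simplified rule per group; the real program has several rules in each.
RULES = {
    "ulist": lambda line: "list-item" if "\t" in line else "ignored",
    "table": lambda line: "table-row" if line.strip() else "ignored",
    "dl": lambda line: "pair" if line.strip() else "ignored",
}


def run(lines, group="ulist"):
    """Label every line, carrying the enabled group from line to line."""
    labels = []
    for line in lines:
        label, group = classify(line, group)
        labels.append(label)
    return labels
```

Switching the group in one place means that no individual rule needs its own test of which element is open, which is the simplification the sample program is after.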
   context-translate

   global stream document-prolog initial
      { "<!DOCTYPE xls2xml SYSTEM %"xls2xml.dtd%" [" _
        "]>" }
The following find rules are used to build the major elements used in the translation. The initial document-start rule performs any required initial work on the final output document. Since the output of the pattern processor will be sent to the XML parser, this part of the program must begin with a document prolog. The document prolog should either include a DTD or refer to a DTD (as is done in this program, above).
The next set of find rules determines which major element of the document to process and which group of find rules to use. The final find rules are used to clean up the output.
All of these find rules are in the #implied group.
   ;-----------------------------------------------------------------------
   ; Rules to begin construction of major elements.

   document-start
      output document-prolog
The input file starts with a text title, and at that point the "xls2xml" element of the document is not yet open. The program can therefore treat the first text it "sees" as the title and begin the XML document there. Once the document has begun, the "unless open element" test prevents the program from recognizing any subsequent text as a title.
   find line-start white-space+ [any-text except "%t"]+ => title any-text+
         unless open element is XLS2XML
      output "<xls2xml>"
      output "%n<title>" || title || "</title>"
      ;
      ; We know that the next major element will be a ulist,
      ; so open it now.
      ;
      output "%n<ulist>"
      next group is ulist
The document table begins with the text "Project/Activity". This find rule must appear before the find rules in the "ulist" group so that it takes priority over the text-recognition rules that are enabled while "<ulist>" is open.
   find "Project/Activity" => text "%t"?
      output "</ulist>" when open element is ulist
      repeat
         exit when open element isnt (dl | p | table | ulist)
         output "</%q>"
      again
      output "%n<table><tr><th>" || text || "</th>"
      next group is table
The last major element, "<dl>", starts after the paragraph that begins with "*Note:".
   find ( "*Note:" [any-text except "%t"]* ) => text "%t"?
      output "%n<p>" || text || "</p>"
      output "%n<dl>"
      next group is dl

   ;-----------------------------------------------------------------------
   ; Rules that produce an unordered list (ulist).
From the analysis of the input document, we know not only that unordered list items occur in the first two cells of each row, but also that some rows may be blank. An error is produced when a "non-blank" row starts with two empty cells, or when any cell other than the first two has data.
   group ulist

   find line-start [any-text except "%t"]+ = legend "%t"
         [any-text except "%t"]+ = info
      ;
      ; Change the tab to a space.
      ;
      output "%n<li>" || legend || " " || info || "</li>"

   find [any-text except "%t"]+ = info when open element is ulist
      put #error "Warning: invalid data %"" || info || "%" in ulist%n"

   find any
      ; Blank lines do not cause an error.

   ;-----------------------------------------------------------------------
   ; Rules that produce a table.
Cells in the first row or column can contain non-numeric data; these cells are converted to "<th>". All other cells must either be blank or contain numeric data; these cells are converted to "<td>".
Blank rows are allowed in the input, but no corresponding row will appear in the output. Every input row (blank or non-blank row) ends with a newline "%n".
Blank cells can occur in a non-blank row; an empty cell is output for these. This is done by picking up and discarding the tab that follows a non-blank cell, then outputting a blank cell for any tab that has not been discarded. Note that this will not output the last cell of a row when the cell is blank because the last cell is never followed by a tab.
The output document structure is used to decide when to report non-numeric data as an error.
The table ends when the string "Week Total" is encountered.
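The tab handling described above can be sketched outside OmniMark. The Python function below is an illustration only, and `split_row` is a hypothetical name. It assumes, as the table rules appear to, that a tab produces a blank cell only after the row has been started by a non-blank cell, so a row consisting of tabs alone produces no cells at all.

```python
def split_row(row):
    """Split one tab-delimited row into cells (sketch of the tab handling)."""
    cells = []
    text = row.rstrip("\n")
    i = 0
    while i < len(text):
        if text[i] == "\t":
            # A tab that no cell consumed stands for a blank cell, but only
            # once the row has started (assumption taken from the table rules).
            if cells:
                cells.append("")
            i += 1
        else:
            end = text.find("\t", i)
            if end == -1:
                end = len(text)
            cells.append(text[i:end])
            # Discard the tab that follows a non-blank cell, if there is one.
            i = end + 1
    return cells
```

For example, `split_row("a\t\tb\t\n")` returns `['a', '', 'b']`: the blank cell in the middle is kept, while the blank last cell is dropped because no tab follows it.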
   group table

   find "Week Total"
      repeat
         exit when open element isnt table
         output "</%q>"
      again
      next group is #implied

   find digit+ => text "%t"?
      output "%n<tr>" when open element isnt tr
      do when occurrence of open element tr = 1
         output "%n<th>" || text || "</th>"
      else
         output "%n<td>" || text || "</td>"
      done

   find [any-text except "%t"]+ => text "%t"?
      output "%n<tr>" when open element isnt tr
      do when occurrence of open element tr = 1   ; First row.
         output "%n<th>" || text || "</th>"
      else                                        ; Not the first row.
         output "%n<td>" || text
Note that "td" is not closed here; it's closed below. It's left open so that the occurrence number can be counted.
         ;
         ; Non-numeric data is allowed only in the
         ; first row and first column.
         ;
         do when occurrence != 1   ; Not the first column.
            local counter row
            local counter column
            set row to occurrence of open element tr   ; Current row.
            set column to occurrence                   ; Current column.
            put #error "Warning: non-numeric data %"" || text ||
                       "%" in row " || "d" % row ||
                       ". column " || "d" % column ||
                       " of table%n"
         done
         output "</td>"
      done

   find "%t" when open element is tr
      do when occurrence of open element tr = 1
         output "%n<th></th>"
      else
         output "%n<td></td>"
      done

   find "%n" when open element is tr
      output "</tr>"

   find any

   ;-----------------------------------------------------------------------
   ; Rules that produce a list of paired items (dl).
All input data should be in pairs -- a number followed by a description. The structure of the output document is used to determine when to check for a missing description or missing number.
   group dl

   find digit+ => text
      output "%n<dt>" || text || "</dt>"
      do when occurrence != 1 and previous isnt dd
         put #error "Warning: Product number before %"" || text ||
                    "%" did not have description%n"
      done

   find [any-text except "%t"]+ => text
      output "%n<dd>" || text
      do when previous isnt dt
         put #error "Warning: Product description %"" || text ||
                    "%" without product number%n"
      done
      output "</dd>"
The method used to close the "dd" element allows us to use the "previous test" to check whether the previous element was "dt".
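The pairing check can also be pictured as a single pass that remembers what kind of item came before. The Python sketch below is an illustration only: `check_pairs` is a hypothetical name, the input is assumed to be already split into tokens, and the warning texts are shortened rather than copied from the program.

```python
def check_pairs(tokens):
    """Return warnings for a sequence of number/description tokens."""
    warnings = []
    previous = None  # kind of the previous token: "dt", "dd", or None
    for token in tokens:
        if token.isdigit():
            # A number directly after another number means the earlier
            # number never received its description.
            if previous == "dt":
                warnings.append(f'number before "{token}" has no description')
            previous = "dt"
        else:
            # A description is valid only directly after a number.
            if previous != "dt":
                warnings.append(f'description "{token}" has no number')
            previous = "dd"
    return warnings
```

The sketch keeps its own `previous` variable; the OmniMark rules get the same information from the structure of the output document, so they need no extra state.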
   ;-----------------------------------------------------------------------
   ; Clean-up rules.
Clean-up rules are placed at the end so that they will fire only when no other pattern identifies what is to be processed.
   group #implied

   find any

   find-end
      repeat over reversed current elements as this-element
         output "</" || name of current element this-element || ">"
      again

   ;-----------------------------------------------------------------------
   ; XML output processing

   element (table | ulist | dl)
      output "<%q>%c%n</%q>"

   element (dt | dd | li | td | th | tr | p)
      output "%n<%q>%c</%q>"

   element title
      output "<%q>%c</%q>%n"

   element #implied
      output "%c</%q>"
The final part of the program simply puts the tags back in. A more elaborate conversion might use the element rules to perform other types of work. For example, you can't write referents to the XML parser, so you could use element rules to handle forward links on the output side.