Validating markup
Markup processing can be conceptually divided into separate steps: parsing (with optional validation), filtering, and generating the final output. Some applications may have other distinct steps, such as analysis, aggregation, and reporting. The parsing of markup is typically performed by the OmniMark actions do sgml-parse or do xml-parse. An example using the well-formed XML parser might look like:
      
  process
     do xml-parse scan #main-input
        output "%c"
     done
        
    
If validation is desired or required, then the validating XML parser, the Xerces parser, or the SGML parser (as appropriate) can be used instead of the non-validating XML parser. Validating parsers validate the input document as they parse it. The remaining processing steps are usually accomplished by markup rules.
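For instance, a minimal sketch of the same kind of pipeline using the validating SGML parser might look like this (it assumes, as elsewhere in this topic, that the input arrives on #main-input and carries its own document type declaration):

  process
     ; the SGML parser validates the instance against its DTD as it parses
     do sgml-parse document scan #main-input
        output "%c"
     done
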
Validation can be performed separately from parsing, using a schema library such as OMRELAXNG. Separation of parsing and validation steps makes the processing pipeline more flexible: the parser is not required to perform all possible validations, validators are not required to perform parsing, and the user is free to combine any of the available parsers and validators as needed.
Legacy OmniMark programs typically process the markup coming from the parser immediately, so the body of the do sgml-parse contains a single output "%c" action to fire the markup rules. We can accomplish the same thing by applying do markup-parse to our markup source:
      
  process
     do xml-parse scan #main-input
        do markup-parse #content
           output "%c"
        done
     done
        
      
In order to validate the markup coming from the parser, we can pass #content to a markup validator. For instance, we can validate it using the OMRELAXNG library:
      
  import "omrelaxng.xmd" prefixed by relaxng.
  
  process
     do xml-parse scan #main-input
        using output as relaxng.validator against relaxng.compile-schema file "my-schema.rng"
           output #content
     done
        
      
 The function relaxng.compile-schema used above reads a given textual representation of a RELAX NG
        schema and returns its compiled representation as an instance of the relaxng.relaxng-schema-type opaque
        type. The compiled schema is then passed to the markup sink function relaxng.validator, which validates
        all markup written into it against the schema.
      
 The same compiled schema can be used to validate multiple document instances:
      
  import "omrelaxng.xmd" prefixed by relaxng.
  
  process
     local relaxng.relaxng-schema-type my-compiled-schema initial {relaxng.compile-schema file "my-schema.rng"}
  
     repeat over #args as input-file
        do xml-parse scan file input-file
           using output as relaxng.validator against my-compiled-schema
              output #content
        done
     again
        
    
In the examples above, #content has only been validated, without any further processing. To accomplish both, we need to send the parsed markup in two directions. This can be done by adding a markup sink function that does the processing, and using the & operator to split the stream:
      
  define markup sink function
     markup-processor into value string sink destination
  as
     using output as destination
     using group "process markup"
     do markup-parse #current-input
        output "%c"
     done
  
  process
     do xml-parse scan #main-input
        using output as relaxng.validator against relaxng.compile-schema file "my-schema.rng"
                        & markup-processor into #current-output
           output #content
     done
        
      
      
Keep in mind that #content is a markup source, not a DOM tree of the document. This means that the markup is streamed, as the parser creates it, to be both validated and processed concurrently. We can send the markup to an arbitrary number of destinations. For instance, we can validate the markup against two different schemas and process it in two different ways, and the markup will stream from the parser to all four destinations concurrently:
      
     do xml-parse scan #main-input
        using output as relaxng.validator against relaxng.compile-schema file "my-schema-1.rng"
                        & relaxng.validator against relaxng.compile-schema file "my-schema-2.rng"
                        & markup-processor into #current-output
                        & another-markup-processor
           output #content
     done
        
      
 Apart from adding more destinations to widen the processing pipeline, we can also extend the pipeline by
        breaking it into multiple steps. For example, we could direct the markup into a preparatory filtering phase
        before the main processing:
      
  define markup sink function
     prepare-markup into value markup sink destination
  as
     using output as destination
     using group "prepare markup"
     do markup-parse #current-input
        output "%c"
     done
  
  process
     do xml-parse scan #main-input
        using output as relaxng.validator against relaxng.compile-schema file "my-schema-1.rng"
                        & prepare-markup into markup-processor into #current-output
           output #content
     done
        
    
However, the easiest way to augment a legacy OmniMark program with schema validation is to use the function relaxng.validated. This function takes the markup source created by a parser as an argument, and produces another markup source that can be processed further. Here's an example of its use:
      
  process
     do xml-parse scan #main-input
        do markup-parse relaxng.validated #content against relaxng.compile-schema file "my-schema.rng"
           output "%c"
        done
     done
        
      
relaxng.validated makes the markup-processor function defined earlier unnecessary: the markup source produced by validated can be processed by do markup-parse directly. Another difference between the functions validator and validated is that the latter inserts all validation errors into the markup it produces, so in the previous example a markup-error rule would fire for them. The validator function, on the other hand, reports all validation errors in OmniMark's log stream. This behavior can be modified by specifying a different markup sink as the destination for validation errors:
      
  define markup sink function
     error-processor into value string sink destination
  as
     using output as destination
     do markup-parse #current-input
        output "%c"
     done
  
  process
     do xml-parse scan #main-input
        using output as relaxng.validator against relaxng.compile-schema file "my-schema.rng"
                                report-errors-to error-processor into #current-output
                        & markup-processor
           output #content
     done
  
  markup-error
     log
     output "<!--%n"
        || "   Validation error: "
        || #message || "%n"
        || "-->"
        
    
We've already said that a markup parser converts a string source to a markup source. Another way of looking at a markup parser is as a converter from a concrete representation of a markup stream (SGML or XML, with or without DTD) to its abstract representation. The abstract markup stream is simply a sequence of data characters interspersed with abstract markup events. The markup events are abstract because their original textual representation is abstracted away and only their meaning is kept. As a consequence, a markup schema designed for validating XML can equally be applied to validating SGML:
      
     do sgml-parse document scan #main-input
        do markup-parse relaxng.validated #content against relaxng.compile-schema file "my-schema.rng"
           output "%c"
        done
     done
        
      
One problem with this example is that SGML is case-insensitive by default, while the RELAX NG specification treats all names as case-sensitive. There are different solutions to this mismatch: one can either modify the schema to use all-uppercase names, or use an SGML declaration to specify that SGML names should be case-sensitive. Since these solutions are somewhat intrusive, both validator and validated have an optional argument, case-insensitive, that can be used to specify that case should be ignored during validation:
      
     do sgml-parse document scan #main-input
        do markup-parse relaxng.validated #content
                        against relaxng.compile-schema file "my-schema.rng"
                        case-insensitive true
           output "%c"
        done
     done
        
    
External text entity events coming from a markup parser require special attention. If the markup parser is to proceed with parsing, it has to be supplied with replacement text for the external text entity in question. In other words, an external-text-entity rule must be run, even if only the default one. To prevent mix-ups, OmniMark also requires that exactly one external-text-entity rule be run for each entity reference. For this reason, it is good practice to split external text entity events out of the markup stream meant for other processing as soon as the stream is produced by the markup parser, and to divert them into a different markup sink responsible for resolving and expanding the entities.
      
The function split-external-text-entities in the library OMMARKUPUTILITIES can help with this task. It takes two markup sinks as arguments, one responsible for resolving external text entities and the other for handling the rest of the markup stream. The result of this function is a markup sink that can be directly fed all the markup produced by the parser:
      
  define markup sink function
     entity-resolver
  as
     using group "resolve entities"
     do markup-parse #current-input
        output "%c"
     done
  
  process
     do sgml-parse document scan #main-input
        using output as split-external-text-entities (entity-resolver,
                                                      relaxng.validator against relaxng.compile-schema file "my-schema.rng"
                                                              case-insensitive true
                                                              report-errors-to error-processor into #current-output
                                                      & markup-processor into #current-output)
           output #content
     done
  
  group "resolve entities"
  external-text-entity #implied
     output file "%eq"
        
      
 If you are using the function validated, the external text entity events will be ignored by the schema
        and reproduced in the returned markup source. If you apply do markup-parse to the result of
        validated, external-text-entity rules will fire as usual.
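
For example, here is a sketch combining the earlier validated example with such a rule. It reuses the hypothetical schema file my-schema.rng and the entity-resolution idiom shown above, and assumes each external text entity is to be replaced by the contents of the file named by its system identifier:

  import "omrelaxng.xmd" prefixed by relaxng.

  process
     do sgml-parse document scan #main-input
        do markup-parse relaxng.validated #content
                        against relaxng.compile-schema file "my-schema.rng"
                        case-insensitive true
           output "%c"
        done
     done

  ; fires from the do markup-parse above and supplies the replacement text
  external-text-entity #implied
     output file "%eq"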
Copyright © Stilo International plc, 1988-2010.