By default, the data content of an XML or SGML document is streamed through to the current output scope by the parser. You can intercept and process data content in one of four ways:
  - data-content rules,
  - translate rules,
  - "%c" in a markup rule, or
  - #content in a markup rule.
If you add a data-content rule to your program, it fires whenever a contiguous piece of text data occurs in your input. You can then process that text by scanning "%c":
    data-content
       repeat scan "%c"
          ...
       again
You can restrict a data-content rule to a particular element by adding a condition to the rule:
    data-content when element is "product-name"
       repeat scan "%c"
          ...
       again
A data-content rule processes a contiguous sequence of text characters. A contiguous sequence of text characters is bounded by:
If you put translate rules into your program, they scan data content (and attribute content) automatically, without the need for you to explicitly initiate scanning. In effect, translate rules work like find rules, except that they are initiated by do xml-parse or do sgml-parse instead of submit.
    translate "$" digit+ => dollars ("." digit{2} => cents)?
       output dollars
       output "," || cents
          when cents is specified
       output "$"
When processing SGML, you can also use translate rules to capture and process entities.
You can also process data content by scanning "%c" in an element rule or another markup rule. However, you should be aware that such a scan operates on the result of all the parsing operations that take place on the content of the element, including the processing of any element, translate, or data-content rules, not on the raw data content of the element.
You should scan "%c" only if you know that the current element contains only data content, or if you want to scan the result of parsing the current element. Bear in mind that even if the element has only data content, any applicable translate rules and data-content rules will fire before the scanning operation takes place, and the scanning source will be the output of those rules acting on the data content, not the raw data content of the element.
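As a minimal sketch of scanning "%c" in an element rule, the fragment below wraps each run of digits in square brackets and passes everything else through unchanged. The element name "price", the pattern variable names, and the bracketed output are hypothetical and chosen only for illustration. Like the other samples here, it is a rule fragment rather than a complete program, and what it scans is whatever the other rules in the program have already produced for the element's content.

```omnimark
; Hypothetical example: "price" is an assumed element name.
; The scan source is the parsed (already processed) content of the
; element, not its raw data content.
element "price"
   repeat scan "%c"
   match digit+ => amount
      ; Wrap each run of digits in square brackets.
      output "[" || amount || "]"
   match any => passed
      ; Pass every other character through unchanged.
      output passed
   again
```

If a translate rule in the same program rewrote the digits first, the digit+ pattern would be matched against the rewritten text, which is why this approach is best kept to elements whose content you know.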
While "%c" enables another pass through already-processed content, #content provides access to unprocessed content. The downside of using #content for processing data content is that, besides the plain textual content, it may contain unprocessed markup events: the type of #content is markup source, not string source.
Before you can scan or submit #content, you first need to convert it to a string source using the take operator, as in the following data-content rule:
    data-content
       repeat scan #content take any*
          ...
       again
You should be aware, however, that the take operator will throw any markup event it encounters in #content, such as an SDATA entity reference. You should always wrap applications of the take and drop operators to a markup source in a scope that catches all markup event throws. The following rule will process all data content inside an element, while ignoring all element tags and other markup:
    element #implied
       repeat
          submit #content take any*
          exit
        catch #markup-point
        catch #markup-start
        catch #markup-end
       again