Detecting the end of data
When you write pattern-matching programs that operate on files, there is one important pattern you don't have to worry much about: the end of the input. OmniMark knows when it has reached the end of the file, and any find rule that is partially matched fails.
But if you write programs that perform pattern matching on streams that are not files, particularly network data streams, you may have to take responsibility for detecting the end of the data. To do that, you will need to understand how data in the stream you are processing is terminated, and make sure you don't accidentally miss the terminator.
For instance, suppose you are receiving a series of HTML files over an open TCP/IP connection. You might think that it would be enough to create a find rule to match "</HTML>" as the end-of-data marker. But consider what happens if the following program fragment encounters an HTML file in which the author has neglected to close a block quote ("<blockquote>") with the appropriate close tag ("</blockquote>"):
   find "</html>"
      ;found the end of the data

   find "<blockquote>" ((lookahead not "</blockquote>") any)+
      ;found a blockquote

Because the "</blockquote>" tag is missing, the second find rule keeps matching characters, including "</HTML>". If you were reading from a file, that wouldn't matter, because the rule would fail when the end of the file was reached and the first rule would have a chance to fire. But here there is no end of file. There may or may not be more characters received over the network connection, but there is nothing to say "stop" to the pattern. It may hang, or it may incorrectly consume hundreds of complete pages before it finds a terminating "</blockquote>".
And there is a second, more subtle, way in which this program could fail. The lookahead command needs to look ahead 13 characters to match "</blockquote>". If there are not 13 characters available (because, perhaps, "</html>", which is only 7 characters, comes immediately after the opening "<blockquote>"), and no end-of-data condition has been signalled, lookahead will hang waiting for more characters.
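The first failure can be reproduced on a small, finished buffer. The sketch below is in Python rather than OmniMark, and the sample pages and the regular expression are illustrative assumptions: the expression has the same shape as the second find rule (an opening tag, then one or more characters that do not begin the close tag). On a live connection the scan would wait for more input instead of stopping at the end of the buffer.

```python
import re

# Two pages sent back to back on one connection; the first never closes
# its <blockquote>. (Made-up sample data.)
stream = (
    "<html><blockquote>unclosed quote</html>"
    "<html><blockquote>second page</blockquote></html>"
)

# Same shape as the second find rule: "<blockquote>" followed by one or
# more characters, each checked not to begin "</blockquote>".
blockquote = re.compile(r"<blockquote>(?:(?!</blockquote>).)+", re.DOTALL)

match = blockquote.search(stream)
swallowed = match.group(0)

# The match runs straight past the first page's "</html>" and only stops
# at the close tag that belongs to the second page.
print(repr(swallowed))
print("</html>" in swallowed)
```

The end-of-page marker ends up inside the matched text, so a rule that looks for "</html>" never gets a chance to see it.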
Server-side programs, and all programs that deal with network data streams, may get their input from a TCP or other network connection that remains open across many transactions. The server-side program may need to detect the end of data in some other manner -- often by examining the data itself for some application-specific characteristic.
The first step in dealing with this problem is to carefully select and understand the protocol used to package transmitted data. In many cases this may involve negotiation with the person sending the data, but that's what protocols are -- agreements about how data is encoded.
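There are five basic ways in which end-of-data can be recognized on a connection, be it a TCP connection or other source of input:
- the connection is closed
- a timeout occurs
- the data is end-mark-encoded
- the data is length-encoded
- the data is sent in packets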
The techniques can be combined -- and in the case of using packets, must be.
All five techniques have the important property that the end of data can be determined without looking at any data following it. This is the key property required to prevent hanging. For example, most connections can be closed, even those that are not closed in the normal processing of data -- so a closed connection should be recognized as a data end, even when other indications are used for normal data ends.
Basing the end of data solely on a timeout is not a good technique in general, because of the variable latency inherent in systems. Timeout values need to be set quite high, because normal, if rare, occurrences may cause delays that are long compared to typical processing times. Waiting out such a long timeout every time you need to detect the end of data adds unnecessary delay to normal processing. Even so, a timeout, like a closed connection, can usefully be combined with other protocols, so that a program remains reliable in the presence of excessive delays.
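As a rough illustration of combining these indications, here is a minimal Python sketch (not OmniMark) that treats an end mark as the normal end of data and falls back to a closed connection or a timeout. The function name `read_record`, the Control-Z end mark, and the default timeout are assumptions made for the example. It reads one byte at a time so that it never consumes anything past the end mark.

```python
import socket

END_MARK = b"\x1a"  # Control-Z, assumed here to be the agreed terminator


def read_record(conn, timeout_seconds=5.0):
    """Return (data, reason); reason is 'end-mark', 'closed' or 'timeout'."""
    conn.settimeout(timeout_seconds)
    collected = bytearray()
    while True:
        try:
            chunk = conn.recv(1)  # one byte at a time: never reads past the mark
        except socket.timeout:
            # Nothing arrived in time; report what was collected so far.
            return bytes(collected), "timeout"
        if chunk == b"":
            # recv() returning no bytes means the peer closed the connection.
            return bytes(collected), "closed"
        if chunk == END_MARK:
            return bytes(collected), "end-mark"
        collected += chunk
```

A caller would normally accept only the "end-mark" result as a complete record and treat the other two reasons as errors or as a signal to shut down.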
Both end-mark-encoded and length-encoded data can easily be recognized with OmniMark patterns. However, it is important to realize that unless you are willing to examine the input one byte at a time, there is no completely foolproof way to do data processing and end-of-data recognition in the same process. For data blocks of a reasonable size, the appropriate technique is to use a single rule to collect the data and then submit the collected data for processing. If the data blocks are large, you may need to adopt a layered approach in which one process is responsible for end-of-data recognition, and a second process is responsible for data processing.
Returning to the problem we discussed above, we can provide a solution as follows. First, we add Control-Z characters between the HTML pages being sent. (It is preferable that the end-of-data character be part of the protocol, rather than being assumed to exist in the data. There is always a chance that the data is corrupt.) Next, we adapt the program as follows:
   group #implied
      process-start
         next group is protocol-processing

   group protocol-processing
      find [any except "%26#"]* => the-data "%26#"
         using group data-processing
            submit the-data

   group data-processing
      find "<blockquote>" ((lookahead not "</blockquote>") any)+
         ;found a blockquote
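The same two-layer idea can be sketched in Python for readers who want to experiment with it outside OmniMark. This is an illustration under stated assumptions, not a translation of the program above: `split_pages`, `blockquotes`, and the sample data are hypothetical, and the regular expression only mimics the shape of the blockquote rule.

```python
import re

END_MARK = "\x1a"  # Control-Z, the separator assumed between pages

# Per-page pattern with the same shape as the data-processing find rule.
BLOCKQUOTE = re.compile(r"<blockquote>(?:(?!</blockquote>).)+", re.DOTALL)


def split_pages(received):
    """Protocol layer: yield each complete, Control-Z-terminated page."""
    start = 0
    while True:
        end = received.find(END_MARK, start)
        if end == -1:
            return  # trailing text without a terminator is not yet complete
        yield received[start:end]
        start = end + 1


def blockquotes(page):
    """Data layer: scan a single page; a match cannot leave the page."""
    return [m.group(0) for m in BLOCKQUOTE.finditer(page)]


received = (
    "<html><blockquote>unclosed quote</html>\x1a"
    "<html><blockquote>second page</blockquote></html>\x1a"
)
for page in split_pages(received):
    print(blockquotes(page))
```

Because each page is cut out before it is scanned, the unclosed block quote on the first page can run at most to the end of that page and cannot swallow the second one.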
Length-encoded data can be picked up with patterns such as the following:
   any {2} => length-value
   any {binary length-value} => data-value
The data captured by data-value can then be submitted or scanned for further pattern matching.
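For comparison, a length-encoded record can be read the same way in Python. This is a minimal sketch that assumes a 2-byte length field in big-endian (network) byte order; the byte order is an assumption of the example, and `read_exact` and `read_length_encoded` are hypothetical helper names. It expects a buffered binary stream, such as an `io.BytesIO` or the object returned by a socket's `makefile("rb")`.

```python
import io
import struct


def read_exact(stream, count):
    """Read exactly count bytes, or raise EOFError if the stream ends early."""
    data = stream.read(count)
    if len(data) != count:
        raise EOFError(f"expected {count} bytes, got {len(data)}")
    return data


def read_length_encoded(stream):
    """Read one record: a 2-byte big-endian length, then that many bytes."""
    (length,) = struct.unpack(">H", read_exact(stream, 2))
    return read_exact(stream, length)


# Two records back to back: "hello" (length 5) and "hi" (length 2).
buffer = io.BytesIO(b"\x00\x05hello\x00\x02hi")
print(read_length_encoded(buffer))
print(read_length_encoded(buffer))
```

The reader never needs to look past the end of a record, because the length field says exactly how many bytes belong to it.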
Prerequisite Concepts: Input/Output