Detecting the end of data - OmniMark Concept

Detecting the end of data


When you write pattern-matching programs that operate on files, there is one important pattern you don't have to worry much about: the end of the input. OmniMark knows when it has reached the end of the file, and any find rule that is partially matched fails.

But if you write programs that perform pattern matching on streams that are not files, particularly network data streams, you may have to take responsibility for detecting the end of the data. To do that, you will need to understand how data in the stream you are processing is terminated, and make sure you don't accidentally miss the terminator.

For instance, suppose you are receiving a series of HTML files over an open TCP/IP connection. You might think that it would be enough to create a find rule that matches "</html>" as the end-of-data marker. But consider what happens if the following program fragment encounters an HTML file in which the author has neglected to close a block quote ("<blockquote>") with the appropriate close tag ("</blockquote>"):

  find "</html>"
     ;found the end of the data
  find "<blockquote>" ((lookahead not "</blockquote>") any)+
     ;found a blockquote

Because the "</blockquote>" tag is missing, the second find rule keeps matching characters, including "</html>". If you were reading from a file, that wouldn't matter, because the rule would fail when the end of the file was reached and the first rule would have a chance to fire. But here there is no end of file. More characters may or may not arrive over the network connection, but there is nothing to say "stop" to the pattern. It may hang, or it may incorrectly consume hundreds of complete pages before it finds a terminating "</blockquote>".

And there is a second, more subtle, way in which this program could fail. The lookahead needs to look ahead 13 characters to match "</blockquote>". If there are not 13 characters available (because, perhaps, "</html>", which is only 7 characters, comes immediately after the opening "<blockquote>"), and no end-of-data condition has been signalled, lookahead will hang waiting for more characters.

Server-side programs, and all programs that deal with network data streams, may get their input from a TCP or other network connection that remains open across many transactions. The server-side program may need to detect the end of data in some other manner -- often by examining the data itself for some application-specific characteristic.

The first step to dealing with this problem is to carefully select and understand the protocol used to package transmitted data. In many cases this may involve negotiation with the person sending the data, but that's what protocols are -- agreements about how data is encoded.

There are five basic ways in which end-of-data can be recognized on a connection, be it a TCP connection or other source of input:

Data ends when a connection is closed by the sending process. This is the typical case in batch programs.
Data ends when a timeout occurs. Timeout typically applies to the interval between incremental "get" operations -- while waiting for the "next" bit, rather than being a property of the whole transaction.
Data ends when a predetermined number of bytes/octets or other unit of encoding has been received. The number can be prefixed to the data, transmitted in some prior data (such as "content-length" in an HTTP header), or be a fixed size. The only necessity is that it be agreed upon prior to actual transmission of the data. This method can be called length encoding.
Data ends on recognition of a specific byte/octet or other unit of encoding, or on recognition of a specific sequence of data. Examples in common use are line ends, double line ends (at the end of an HTTP header), and the "Control Z" file end mark of CP/M and older releases of MS-DOS. The only requirement is that the character or sequence used as the end mark must not otherwise occur in the data, at least not without some sort of quoting or escaping by which it can be distinguished from a real end mark. This method can be called end-mark encoding. End-mark encoding is more common in text-based applications than in binary data ones, because it is more likely that there is an available non-text character that can be used (such as line-end or Control Z) than it is that there is a non-data binary octet.
Data is lumped into "packets", each of which has its end indicated in a length-encoded or end-mark-encoded manner. There is a unique end-of-data packet that marks the end of a sequence of data packets. This is not really a separate technique, but rather a way of employing one or more of the above techniques. Packet-encoding is usually thought of in conjunction with length encoding, but it is common in text-based, end-mark encoding applications too. HTTP headers actually consist of zero or more "packets" (lines), with a zero-text line serving as a data-end packet.

The techniques can be combined -- and in the case of using packets, must be.
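As an illustration of combining techniques, the following minimal sketch (written in Python rather than OmniMark so that it can be run on its own) reads an HTTP-style header as a series of line "packets" ended by an empty line, then uses the "Content-Length" value to length-decode the body. The function name `read_header_lines` and the sample data are assumptions made for this example, and an in-memory `io.BytesIO` object stands in for a network connection.

```python
import io

def read_header_lines(stream):
    """Read line "packets" until the empty line that ends the header."""
    lines = []
    while True:
        line = stream.readline()
        if line == b"":
            # The connection closed before the data-end packet arrived.
            raise EOFError("stream closed before the blank line")
        text = line.rstrip(b"\r\n")
        if text == b"":
            return lines          # the zero-text line is the data-end packet
        lines.append(text)

# Hypothetical input: a header, a 5-byte body, then the start of later data.
stream = io.BytesIO(b"Host: example.com\r\nContent-Length: 5\r\n\r\nhelloNEXT")

headers = read_header_lines(stream)
fields = dict(h.split(b": ", 1) for h in headers)
body = stream.read(int(fields[b"Content-Length"]))  # length encoding for the body
rest = stream.read()                                # left for the next transaction
```

Neither step needs to look at any bytes beyond the end of the unit it is reading: the header stops at the empty line, and the body stops after the announced number of bytes.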

All five techniques have the important property that the end of data can be determined without looking at any data following it. This is the key property required to prevent hanging. For example, most connections can be closed, even those that are not closed in the normal processing of data -- so a closed connection should be recognized as a data end, even when other indications are used for normal data ends.

Basing the end of data solely on a timeout is not a good technique in general because of the variable latency inherent in systems. Timeout values need to be set quite high because normal, if rare, occurrences may cause delays that are long compared to typical processing times. Relying on such a long timeout to signal the end of data therefore causes unnecessary waiting and significant delays in normal processing. Even though a timeout is not usually a good end-of-data indication, like connection closing it can usefully be combined with other protocols, so that a program remains reliable in the presence of excessive delays.
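The combination described here can be sketched as follows, in Python rather than OmniMark so that it can be run on its own. The end mark is the normal signal, while a closed connection and an idle timeout act as safety nets. The function name `recv_until_mark`, the Control-Z end mark, and the 0.2 second idle timeout are assumptions chosen for this example. A local socket pair stands in for a real network connection.

```python
import socket

def recv_until_mark(conn, end_mark=b"\x1a", idle_timeout=0.2):
    """Read one message and report why reading stopped."""
    conn.settimeout(idle_timeout)   # applies to each individual recv call
    buf = bytearray()
    while True:
        try:
            byte = conn.recv(1)     # one byte at a time: never read past the mark
        except socket.timeout:
            return bytes(buf), "timeout"
        if not byte:
            return bytes(buf), "closed"
        if byte == end_mark:
            return bytes(buf), "end-mark"
        buf += byte

sender, receiver = socket.socketpair()
sender.sendall(b"complete\x1apartial")

first = recv_until_mark(receiver)    # ends normally at the end mark
second = recv_until_mark(receiver)   # sender stalls, so the timeout fires
sender.close()
third = recv_until_mark(receiver)    # the closed connection is seen as a data end
receiver.close()
```

The caller can treat "end-mark" as a complete message and handle "timeout" or "closed" as an incomplete one, instead of hanging while it waits for a mark that never arrives.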

Both end-mark-encoded and length-encoded data can easily be recognized with OmniMark patterns. However, it is important to realize that unless you are willing to examine the input one byte at a time, there is no completely foolproof way to do data processing and end-of-data recognition in the same process. For data blocks of a reasonable size, the appropriate technique is to use a single rule to collect the data and then submit the collected data for processing. If the data blocks are large, you may need to adopt a layered approach in which one process is responsible for end-of-data recognition, and a second process is responsible for data processing.

Returning to the problem we discussed above, we can provide a solution as follows: First, we add Control-Z characters between the HTML pages being sent. (It is preferable that the end-of-data character be part of the protocol, rather than being assumed to exist in the data. The data itself can always be corrupt.) Next we adapt the program as follows:

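  ; Start in the protocol layer, which only looks for the end mark.
  group #implied
  process-start
     next group is protocol-processing

  ; Collect everything up to the next Control-Z ("%26#") and pass it on.
  group protocol-processing
  find [any except "%26#"]* => the-data "%26#"
     using group data-processing
        submit the-data

  ; The submitted data ends where the-data ends, so an unclosed
  ; blockquote cannot run past the end of the page.
  group data-processing
  find "<blockquote>" ((lookahead not "</blockquote>") any)+
     ;found a blockquote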
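For readers who want to experiment with the same two-layer idea outside OmniMark, here is a rough Python analogue. It is a sketch under stated assumptions, not a translation of the program above. The names `split_pages` and `blockquotes` are invented for this example. The input is a complete string rather than a live stream, and the regular expression only approximates the OmniMark pattern.

```python
import re

END_MARK = "\x1a"  # Control-Z between pages, as in the protocol described above

# Approximates: "<blockquote>" ((lookahead not "</blockquote>") any)+
BLOCKQUOTE = re.compile(r"<blockquote>((?:(?!</blockquote>).)+)", re.DOTALL)

def split_pages(stream_text):
    """Protocol layer: yield each page that is terminated by END_MARK."""
    start = 0
    while True:
        end = stream_text.find(END_MARK, start)
        if end == -1:
            return                     # no end mark yet: not a complete page
        yield stream_text[start:end]
        start = end + 1

def blockquotes(page):
    """Data layer: scan a single page; it cannot see past the page's end."""
    return BLOCKQUOTE.findall(page)

stream_text = (
    "<html><blockquote>unclosed</html>\x1a"
    "<html><blockquote>closed</blockquote></html>\x1a"
)
found = [blockquotes(page) for page in split_pages(stream_text)]
```

In the first page the unclosed block quote still swallows "</html>", but it stops at the page boundary and does not run on into the second page.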

Length-encoded data can be picked up with patterns such as the following:

  any {2} => length-value any {binary length-value} => data-value

The data captured by data-value can then be submitted or scanned for further pattern matching.
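The same length-decoding step can be sketched in Python for comparison. The sketch assumes a two-byte, big-endian length prefix. The byte order is an assumption of this example and must match whatever the sending side actually uses. The helper names `read_exact` and `read_length_encoded` are likewise invented here.

```python
import io
import struct

def read_exact(stream, count):
    """Read exactly count bytes, or raise if the stream ends first."""
    data = bytearray()
    while len(data) < count:
        chunk = stream.read(count - len(data))
        if not chunk:
            raise EOFError("stream ended inside a length-encoded record")
        data += chunk
    return bytes(data)

def read_length_encoded(stream):
    """Read a 2-byte big-endian length, then that many bytes of data."""
    (length,) = struct.unpack(">H", read_exact(stream, 2))
    return read_exact(stream, length)

# Two records back to back; io.BytesIO stands in for a network connection.
stream = io.BytesIO(b"\x00\x05hello\x00\x02hi")
records = [read_length_encoded(stream), read_length_encoded(stream)]
```

Because each read asks for an exact number of bytes, the reader never waits for data beyond the end of the current record.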

Prerequisite Concepts
Input/Output
