In this chapter, we'll fill in some background about pipelines and get you up to speed on the concepts before we dive into the actual features of XProc. We're going to assume that you're familiar with XML: that you're comfortable with the vocabulary of tags and attributes, parsing and validation, and that you've used an XML application or two, such as XSLT.
At a high level of abstraction, a pipeline is what you get whenever the output of one process feeds into the input of another. If you work with XML, you've probably already used pipelines even if you didn't think about them in those terms. If you validate a document and then transform it, or if you apply XInclude and then validate, or if you use XSLT to run two different transformations, those are all simple “pipeline” operations.
In another context, pipelines are a common feature of Unix command lines, for example:
cat ch01.xml | grep "<para" | wc -l
What that says is:
Run the cat command over ch01.xml. That will display the contents of the file.
The “|” symbol means that instead of displaying those lines in the shell window, that output will become the input to the grep command. What grep will do is display all the lines that contain “<para” and discard all the rest.
Finally, with another “|”, the output of grep will become the input to wc. The wc command normally reports line, word, and byte counts, but we gave it the -l option, so it will report only the number of lines.
What this pipeline does is print the number of paragraphs in the document.
By using a pipeline we've taken three useful utilities and composed them to count the approximate number of paragraphs in ch01.xml. I say “approximate” because those are all line-based commands and XML isn't really line based. Our XML editor might have saved the file with great big long lines, so that there's more than one “<para>…</para>” per line, which would throw off the count.
Part of the power of Unix comes from the fact that it provides a large and useful vocabulary of focused commands and a mechanism for easily composing them. By the same token, the XML technology stack provides a large and useful vocabulary of “commands”: parsing, validation, transformation, XInclude, XML Base, escaping and unescaping markup, document comparison, etc. What's historically been missing is a standard, easy-to-use composition mechanism.
None of this is meant to suggest that XProc invented the notion of XML pipelines. Far from it, in fact. For as long as there have been APIs for processing XML, there have been mechanisms for constructing pipelines. One of the explicit goals of the XProc development exercise was to invent as little as possible: the committee attempted to survey the landscape of existing pipeline technologies and standardize the common core. The key features that XProc introduces are standardization and ease of use.
Through standardization we hope to foster the development of interoperable pipelines. The pipelines that you run in your web framework should also run on my command line and the pipelines that run on my command line should also run in your XML editor.
The ease of use goal is met, we hope, by providing an XML vocabulary for describing pipelines. That's what XProc really is. We hope that an analogy with XSLT is apt. Before XSLT, it was obviously possible for programmers to transform XML documents. What XSLT provided was an XML-based model that opened transformation up to a much wider audience. Many people who are able to write XSLT transformations would not feel comfortable writing a transformation application in a standard programming language.
Before XProc, it was obviously possible for programmers to build XML pipelines. What XProc provides is an XML-based model that we hope will make it possible for a much wider audience to take advantage of the benefits of focused tools, loosely connected through pipelines. The fact that many projects developed their own vocabularies for exactly this purpose makes us optimistic that having a standard mechanism will be widely successful.
So what does an XML pipeline look like? Here's an XProc pipeline that counts the number of paragraphs in a document:
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc"
            xmlns:db="http://docbook.org/ns/docbook"
            version='1.0'>

  <p:identity>
    <p:input port="source" select="//db:para"/>
  </p:identity>

  <p:count/>

</p:pipeline>
What this pipeline does is produce a single XML document that contains the number of DocBook para elements in the document:
<c:result xmlns:c="http://www.w3.org/ns/xproc-step">??FIXME:??</c:result>
It's not completely isomorphic to the Unix command line example because the abstractions are just a little bit different. Where the command line tools work with lines, XProc works with documents; where the command line tools use regular expressions, XProc uses XPath.
We'll cover all the details in subsequent chapters, but here's a short summary of how this pipeline works:
There's nothing equivalent to the cat command because the pipeline expects the processor to pass it the document(s) to process.
Where grep used a regular expression to find lines that contain “<para”, the XProc pipeline uses an XPath expression to select precisely the DocBook paragraph elements. (The p:identity step copies its input to its output, unchanged; by selecting all the paragraphs on input, we get every paragraph in a separate document on output.)
The p:count step just counts the documents that come to it, producing as output a single c:result element containing the number of documents it saw.
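The first point above can be made visible by spelling out what p:pipeline leaves implicit. The following is a sketch, assuming the XProc 1.0 rule that p:pipeline is shorthand for a p:declare-step with a primary source input and a primary result output. The parameters port that p:pipeline also declares is omitted here because neither step uses parameters.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:db="http://docbook.org/ns/docbook"
                version="1.0">

  <!-- The processor binds the document(s) to process to this port;
       this is what stands in for "cat". -->
  <p:input port="source"/>

  <!-- With no explicit binding, this port receives the primary
       output of the last step, here p:count. -->
  <p:output port="result"/>

  <!-- Each selected db:para becomes a separate document. -->
  <p:identity>
    <p:input port="source" select="//db:para"/>
  </p:identity>

  <!-- Counts the documents it receives and emits one c:result. -->
  <p:count/>

</p:declare-step>
```

How a document is actually handed to the source port depends on the processor you use to run the pipeline.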
In fairness, that's not a very useful pipeline. It was chosen for its similarity to the preceding Unix example. Here's a more realistic XProc pipeline:
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" version='1.0'>

  <p:xinclude/>

  <p:validate-with-relax-ng>
    <p:input port="schema">
      <p:document href="docbook.rng"/>
    </p:input>
  </p:validate-with-relax-ng>

  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="docbook.xsl"/>
    </p:input>
  </p:xslt>

</p:pipeline>
That pipeline performs XInclude, RELAX NG validation, then transformation. These pipelines are all quite simple. As we'll see, pipelines can be very sophisticated: they can interact with web services, make choices, iterate over sequences, recover from errors, and be extended in a variety of ways.
Although there are exceptions, many XML technologies were designed with a very specific focus: XML Base is about base URIs, XInclude is about transclusion, XQuery and XSLT are about querying and transformation, XML Schema and RELAX NG are about validation, etc.
There are many common use cases where the order and interaction between these processes is obvious. Your favorite stylesheet processor, for example, may give you options for performing validation or XInclude processing, or both. That processor supports the most common scenario of those technologies: XInclude, followed by validation, followed by transformation. In fact, those options allow you to control a very limited pipeline within the product.
But equally, there are many use cases where the order and interaction between these processes is not simply less obvious; no single correct order exists:
You might want to validate before performing XInclude processing (in order to determine if the xi:include elements themselves occurred in legal places). You might then want to perform validation again, probably with a different schema after XInclude to assess the validity of the composed document.
Along the same lines, you might want to perform validation both before and after a transformation.
If a document can't be processed by some part of a pipeline, you might want to recover in application-specific ways.
You might want to interact with a web service as part of a larger process, perhaps choosing to perform different kinds of processing depending on the result of that interaction.
In these and countless other scenarios, the steps have no single, predefined, correct order.
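As a sketch of the first and third scenarios, the pipeline below validates before XInclude, expands the inclusions, and then validates again inside a p:try so that a failure can be handled. The schema name pre-xinclude.rng and the validation-failed element are hypothetical, chosen only for illustration; docbook.rng is the schema used in the earlier example.

```xml
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" version="1.0">

  <!-- Check that xi:include elements occur in legal places.
       pre-xinclude.rng is a hypothetical schema name. -->
  <p:validate-with-relax-ng>
    <p:input port="schema">
      <p:document href="pre-xinclude.rng"/>
    </p:input>
  </p:validate-with-relax-ng>

  <p:xinclude/>

  <p:try>
    <p:group>
      <!-- Validate the composed document with a different schema. -->
      <p:validate-with-relax-ng>
        <p:input port="schema">
          <p:document href="docbook.rng"/>
        </p:input>
      </p:validate-with-relax-ng>
    </p:group>
    <p:catch>
      <!-- Application-specific recovery: this sketch just emits a
           marker document (a hypothetical element name). -->
      <p:identity>
        <p:input port="source">
          <p:inline><validation-failed/></p:inline>
        </p:input>
      </p:identity>
    </p:catch>
  </p:try>

</p:pipeline>
```

Only the second validation is wrapped in p:try, so in this sketch a failure of the first validation still stops the pipeline. Which failures deserve recovery is exactly the kind of decision that belongs to the application.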
Decomposing a complex “business process” (to use the vernacular) into a sequence of simpler steps, operations on XML documents in our case, is a natural and powerful abstraction. What's needed to achieve this vision is some “glue language” that allows us to express the nature and order of operations that comprise our application: we need a pipeline language.
Dozens of vendor- and application-specific languages have been invented for this task.
There are many different ways to model pipelines. The examples above are completely linear pipelines where each step requires a very small amount of context. These pipelines are simple and fast, but can only support the kinds of processes that require no branching and very little context.
This model can be expanded and generalized to what we might call the “flowing water” pipeline model. These are conceptually just like the oil, water, and natural gas pipelines we are familiar with in the real world. In an oil pipeline, there are processing stations of various sorts (reservoirs, filters, valves, distillation plants, etc.) connected together by physical pipelines. Crude oil flows in at one end and moves from station to station through pipes. Each station performs some process and directs the resulting output through one or more pipes to the next process. In an XML pipeline, the processing stations are XML processes and the pipelines are the pathways down which XML documents may “flow”.
Another model is based on events and state transitions. In this model there are no fixed pathways between processes. Rather, each document has an associated state and each step in the process applies some action and optionally moves the document to another state. Event driven pipelines may process a document many times, moving it through a complex network of states until some step concludes that the work is done. They are flexible and powerful, but can be tricky to understand.
XProc pipelines follow the “flowing water” model. This model was selected because it's sufficiently powerful to address many real use cases and also conceptually quite easy to explain and understand.
The discussion of a processing model raises the question of streaming. Streaming is difficult to define succinctly. For our purposes it's sufficient to say that a streaming process differs from a non-streaming one in that it can process an arbitrarily large document, or arbitrarily many documents, without running out of memory.
Some steps are more naturally streamable than others. A step that deletes all elements that have a revisionflag attribute with the value “deleted” is conceptually easy to stream. A step that adds an attribute to the document element if the document contains an odd number of elements with no attributes appears impossible to stream.
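Both kinds of step are easy to write down in XProc, which is part of why the language can't assume streaming. The following is a minimal sketch using the standard p:delete, p:choose, and p:add-attribute steps; the attribute name odd-count is hypothetical. Because the two steps are chained here, the count is taken after the deletion.

```xml
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" version="1.0">

  <!-- Conceptually easy to stream: whether to drop an element can be
       decided as soon as its start tag has been read. -->
  <p:delete match="*[@revisionflag='deleted']"/>

  <!-- Hard to stream: the document element's start tag cannot be
       written until every element has been seen and counted. -->
  <p:choose>
    <p:when test="count(//*[not(@*)]) mod 2 = 1">
      <p:add-attribute match="/*"
                       attribute-name="odd-count"
                       attribute-value="true"/>
    </p:when>
    <p:otherwise>
      <p:identity/>
    </p:otherwise>
  </p:choose>

</p:pipeline>
```

Nothing in the markup says whether an implementation streams either step; that is left entirely to the processor.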
XProc tries very hard to be neutral about streaming. There's no requirement that an implementation be able to stream, but the language attempts to avoid imposing semantics that would prevent it from streaming.
Generally speaking, an XProc pipeline document has a document element of p:pipeline (or possibly p:declare-step) and contains one or more steps. The steps are connected, either implicitly or explicitly, and documents flow between steps along those connections. The name of each step identifies its type which determines what kind of processing it performs.
[[NICE GRAPHICAL REPRESENTATION OF A STEP HERE]]
In addition to the connections between them, steps can have options and parameters, some steps can be nested, variables may be declared, etc.
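To make the vocabulary concrete, here is a sketch in which every connection is written out explicitly with p:pipe and one step sets an option. The step names main, expand, and transform are arbitrary labels chosen for this example. The fixup-xml-base option on p:xinclude and the empty parameters binding on p:xslt reflect our reading of the XProc 1.0 step library and should be checked against the specification.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                version="1.0" name="main">

  <p:input port="source"/>

  <!-- Explicitly connect the pipeline's output to the last step. -->
  <p:output port="result">
    <p:pipe step="transform" port="result"/>
  </p:output>

  <!-- The step's name (p:xinclude) identifies its type; the name
       attribute only labels this instance so others can refer to it. -->
  <p:xinclude name="expand" fixup-xml-base="true">
    <p:input port="source">
      <p:pipe step="main" port="source"/>
    </p:input>
  </p:xinclude>

  <p:xslt name="transform">
    <p:input port="source">
      <p:pipe step="expand" port="result"/>
    </p:input>
    <p:input port="stylesheet">
      <p:document href="docbook.xsl"/>
    </p:input>
    <!-- No stylesheet parameters are passed in this sketch. -->
    <p:input port="parameters">
      <p:empty/>
    </p:input>
  </p:xslt>

</p:declare-step>
```

Apart from the option and the parameters binding, leaving out the p:pipe elements should give the same pipeline, because each step's primary input connects by default to the primary output of the step before it.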
We'll consider the anatomy of pipelines in more detail in the following chapters.
In order to run an XProc pipeline, you need an XProc processor. At the time of this writing, there are two complete implementations: XML Calabash, written by the author, and Calumet, written by Vojtech Toman of EMC. By the time you read this, there may be more; see http://xproc.org/implementations/ for an up-to-date list.
All pipelines that conform to XProc: An XML Pipeline Language should produce the same results if run by any conformant processor. Not surprisingly, we're going to use XML Calabash for most of our examples in this book.
There is one caveat to that promise: a pipeline that conforms to XProc: An XML Pipeline Language produces the same results on any conformant processor only if that processor supports all the steps the pipeline uses.
Steps come in three flavors: required steps, optional steps, and extension steps. The required and optional steps are described in the XProc specification. As you could probably guess, all processors are required to support the required steps but may or may not choose to support (all of) the optional steps. Any step not described in the specification is an extension step.
Extension steps come in a few flavors as well. Every pipeline has the potential to be used as a step in another pipeline. Extension steps implemented as XProc pipelines are, naturally, every bit as interoperable as the steps they contain.
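As a sketch of a pipeline used as a step, the example below declares the paragraph-counting pipeline from earlier in the chapter as a step type and then calls it from an enclosing pipeline. The ex prefix, its namespace URI, and the name ex:count-paras are hypothetical; the only requirement we rely on is that a user-defined step type must not be in the XProc namespace.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:db="http://docbook.org/ns/docbook"
                xmlns:ex="http://example.com/ns/steps"
                version="1.0">

  <p:input port="source"/>
  <p:output port="result"/>

  <!-- A pipeline given a type becomes a step that other
       pipelines can call by that name. -->
  <p:declare-step type="ex:count-paras">
    <p:input port="source"/>
    <p:output port="result"/>

    <p:identity>
      <p:input port="source" select="//db:para"/>
    </p:identity>

    <p:count/>
  </p:declare-step>

  <!-- The enclosing pipeline: expand inclusions, then count. -->
  <p:xinclude/>
  <ex:count-paras/>

</p:declare-step>
```

Because ex:count-paras is built only from p:identity and p:count, it should run wherever those steps are supported.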
Implementations may also support the creation of extension steps directly using whatever APIs their underlying programming language supports. In Appendix A, Introduction to XML Calabash, we'll examine how you can write your own XML Calabash extensions in Java. Such steps are only going to work in the implementation for which they were written.
There's one last wrinkle in this story. The http://exproc.org/ website is attempting to collect a set of community-supported extension steps. With luck, the vocabulary of extensions defined at EXProc.org will also be portable across different implementations.