The tutorial below is also a valid CDuce program. You can save it as "tutorial.cd" in a fresh directory and learn...
(* -*- tuareg -*-
CDuce tutorial for the OCaml programmer
=======================================
CDuce is a programming language dedicated to the manipulation of XML
documents. The official documentation is at
http://www.cduce.org/documentation.html
------------------
This whole file constitutes a valid CDuce program.
-*- tuareg -*- on the first line tells emacs to load
the tuareg-mode which is normally used for editing OCaml code,
but works pretty well with CDuce too.
Run this program from a fresh directory, by executing:
cduce tutorial.cd
It should not display any error message.
It is recommended to practice both with the interactive mode of cduce,
and by modifying and compiling the code from emacs with tuareg-mode
or caml-mode (other text editors are probably fine too).
To start cduce in interactive mode, just type "cduce" on the command line.
Most tips for using the ocaml toplevel apply here too
(see http://wiki.cocan.org/tips_for_using_the_ocaml_toplevel).
Prerequisites for this tutorial:
- you should be reasonably familiar with XML,
- you should be reasonably familiar with OCaml,
- you should realize that CDuce is pretty different from OCaml,
although it shares some syntactic similarities,
- you should have a basic idea of what regular expressions are,
and their usual notations (star, plus, question mark, vertical bar)
*)
(*
Note about comments: C-style comments using /* and */ should be used
for text that contains unmatched quotes, while OCaml-style comments
using (* *) are preferred for commenting out pieces of code.
*)
/*
* Let's create a simple but realistic example
* that we will use throughout this tutorial.
*/
type a = <a>[ b* ]
type b = <b ..>[ (<c>String)* | Char* ]
let doc : a = <a> [ <b> []
<b name="b1"> [ <c> "c text 1"
<c> "c text 2" ]
<b name="b2"> "Pure Text" ]
/* doc represents the following XML code:
<a>
<b/>
<b name="b1">
<c>c text 1</c>
<c>c text 2</c>
</b>
<b name="b2">Pure Text</b>
</a>
*/
/* You can input and output XML data using some predefined functions.
Here is a small list that should be enough for us now:
Output:
print_xml: converts any data to a string (type String)
print: prints a string to stdout
dump_xml (CDuce versions >= 0.4.1):
takes any data and prints it directly to stdout
dump_to_file: takes a file name (first argument), a string (second arg.),
and writes the string to the file.
Input:
load_xml: takes a file name or a URI, and loads it as XML.
For a full list of primitives, see: http://www.cduce.org/memento.html
Let's get started: the following code defines a function "test_io" which
writes some XML data to a file, and reads it back from the file. Not very
useful, but instructive.
*/
let test_io (file : Latin1) (data : Any) : Bool =
let _ = dump_to_file file (print_xml data) in
let data2 = load_xml file in
if data = data2 then `true
else `false
let _ =
match test_io "doc.xml" doc with
`true -> [] (* "nil" *)
| `false -> raise "test_io didn't work as expected"
/* A few notes about the constructs that we saw above:
- match-with is similar to OCaml, and if-then-else is just a specialization
of a match-with to booleans;
- _ has the same meaning as in OCaml;
- there is no exception type: "raise" accepts data of any type as argument;
- [] is used here like () in OCaml (unit type). It is actually pretty much
like the empty list or nil. Lists are called sequences in CDuce; unlike
OCaml lists, their type can specify which kinds of elements they contain,
in which order, and how many times each may occur.
*/
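/* A minimal sketch of the last remark, using names that are not part of
   the original tutorial (flag_run and its two values are made up for
   illustration): a sequence type is written like a regular expression
   over the types of its elements. */
type flag_run = [ Int Bool* ]   /* one Int, then any number of Bools */
let flag_run_short : flag_run = [ 1 ]
let flag_run_long : flag_run = [ 1 `true `false ]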
/* We already did some pretty advanced stuff:
- we defined the structure of an XML document (type a);
- we defined an XML document (doc) of type a;
- we exported and imported it back from a file;
- we saw how to apply functions and one syntax for defining a function.
Now we will see how to manipulate effectively XML data, i.e. transform
an XML tree into some other data, which typically would be ready to
be exported to OCaml.
*/
/* Let's explore several syntactic constructs that will allow us
to do some common tasks */
/* Task 1: extract all the b nodes from doc.
The slash operator (/) expects:
- on the left: a sequence of XML nodes (an expression);
- on the right: a pattern for matching all subnodes.
It is important to note that the lefthand expression is a sequence, not just
a node. This is why we have to put square brackets around doc. So [doc] is
a sequence of one element.
*/
let bnodes = [doc] / <b ..> _
/* The previous example was not extremely useful because it returns
all the children of the single node <a>. That could have been
achieved directly using simple pattern matching:
*/
let achildren =
match doc with
<a> children -> children
/* Before continuing, let's have a closer look at the pattern matching above.
"children", on the left side of the arrow, binds the variable "children"
to the sequence which constitutes the contents of the <a> node.
The pattern matching is complete because the type of doc is a.
It is however possible to cast "doc" to a more general type
(a supertype of a).
For example, the predefined type "Any" represents any possible CDuce value,
XML or not: a value of any type can be cast to type Any.
Let's do it: we create doc2, which is the same document as doc, just
with the general type Any:
*/
let doc2 = doc : Any
/* But now, if you try to define the "achildren" example using doc2 instead
of doc, cduce will complain, because the single pattern <a> children
no longer covers every value of type Any.
The peculiarity of this type system, as opposed to the type system of
OCaml, is that there is no polymorphism that uses type parameters
(e.g. 'a). For example, you cannot define a polymorphic identity function
in CDuce: it would always return something of type Any.
In OCaml, the identity function can be defined as follows:
let identity x = x
Its signature is:
val identity : 'a -> 'a
So in OCaml, (identity 123) has type int like 123.
In CDuce, a generic identity function would always return an object of type
Any. Let's define it:
*/
let identity (x : Any) : Any = x
/* If you try it in the cduce toplevel, you get this:
# let identity (x : Any) : Any = x;;
> val identity : Any -> Any = <fun>
# identity 123;;
- : Any = 123
Adding 1 to the literal 123 works all right (by the way, note the funny
type "124" which is a subtype of Int):
# 123 + 1;;
- : 124 = 124
But if you try to use the result of identity as an Int, you get one of
those common type errors. That's the problem we are talking about:
# identity 123 + 1;;
Characters 0-12:
This expression should have type:
{ .. } | Int
but its inferred type is:
Any
which is not a subtype, as shown by the sample:
Atom
These error messages can be confusing, but they often mean that a more
specific type was expected. It may mean that you forgot a downcast
(see below) or that your data doesn't fit one of your type definitions.
*/
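/* A partial workaround, sketched here as an assumption rather than as the
   approach recommended by the CDuce authors: a function can be given
   several arrow types at once (an overloaded function). The name
   identity_int_or_string is made up for this example. The function keeps
   the Int or String type of its argument, but only for the types listed
   in its interface. */
let identity_int_or_string (Int -> Int; String -> String) x -> x
let identity_plus_one = (identity_int_or_string 123) + 1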
/*
It is possible to view the same object with another type:
- a more general type (supertype) is always allowed;
- a more specific type (subtype) is allowed, if it matches the structure
of the object.
The former is something which is possible in OCaml.
The latter is a downcast and it is not possible in OCaml,
since it requires storing some type information at runtime. In CDuce,
some typing happens at runtime (dynamically), so downcasts
are possible, and naturally may cause runtime errors.
1. You can change the type of an object to a supertype (upcast) using ":".
This is done statically, so you will get a message from the compiler
if the given type does not include the current type of the object.
2. You can change the type of an object to any compatible type
(downcast or upcast) using ":?".
This is done at runtime and raises an exception if the requested type
is not compatible with the structure of the object.
The usefulness of static type conversions is limited, just like in OCaml,
since there is little need to purposefully set the type of an object
to a more general type: it is done automatically when the object
is passed as an argument to a function that expects a more general type.
Downcasts are not possible directly in OCaml, and are generally
considered bad practice anyway. Here, we will use them to check and assign
a type to an XML document, which usually comes from some data loaded
at runtime. Typically, we would load our "doc.xml" file as follows:
*/
let doc_reloaded = load_xml "doc.xml" :? a
/* The command above may fail if the file "doc.xml" does not contain
an XML document that conforms to type a.
It is now clear that we use the dynamic cast operator ":?" as a way
of matching the structure of a document against some predefined pattern,
i.e. a type.
Once an XML document has been validated, it can be passed as an argument
to functions that work exclusively on that type.
*/
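/* The same structural check can also be written as a pattern match, which
   returns a fallback value instead of raising an exception. This is only a
   sketch: doc2_as_a and doc2_children are names introduced for this
   example. Since a type name can be used as a pattern, "x & a" succeeds
   only when the value conforms to type a. */
let doc2_as_a =
  match doc2 with
    x & a -> [x]   /* a sequence holding the document, now seen with type a */
  | _ -> []        /* anything else: the empty sequence */

/* With a catch-all branch, the "achildren" example is accepted on doc2. */
let doc2_children =
  match doc2 with
    <a> children -> children
  | _ -> []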
/* Let's go back to our sheep, as we say in French.
We wanted to extract some nodes from our data.
We saw that we can take a sequence of nodes, select
and regroup all the children that match some pattern, using the
slash operator:
let bnodes = [doc] / <b ..> _
We were saying that this thing above was a bit complicated for just
extracting the children of <a>. Let's jump to task 2.
*/
/* Task 2: Extract only the <b> nodes that have a "name" attribute.
Very easy, we just have to make the pattern (righthand side of the slash)
a little more specific:
*/
let named_bnodes = [doc] / <b name=_ ..> _
/* Using the same technique twice, we can extract the grandchildren of <a>: */
let cnodes = [doc] / <b ..> _ / <c> _
/* Note that the code above only selects the <c> nodes without attributes,
because we omitted the ".." wildcard.
It's okay because this is what we want, but using .. may be a good habit
in general.
It is nice to be able to go down the hierarchy using a sequence of
node patterns separated by slashes, like for a filesystem.
This explains why the expression (on the left) must be
a sequence of nodes rather than just a node.
*/
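/* For comparison, here is the same selection with the ".." wildcard on
   the <c> pattern as well, so that <c> nodes carrying attributes would be
   kept too (cnodes_any_attr is a name introduced for this example). On
   doc it selects the same two <c> nodes, since none of them has
   attributes. */
let cnodes_any_attr = [doc] / <b ..> _ / <c ..> _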
/* Task 3: Extract the strings that are enclosed within <c> tags, as a sequence
of strings (rather than a sequence of <c> nodes) */
/* From the previous example, we know how to extract the <c> nodes,
and they are already stored in the cnodes variable.
We are going to convert the sequence of <c> nodes into a sequence
of the same length containing what we want. For this, we use
the map-with construct. It is analogous to List.map in OCaml, but unlike
List.map it is not a function.
*/
let ccontents = map cnodes with <c> x -> x
/* Note that what follows the mandatory "with" keyword is a pattern matching,
not a function. But we can create our own fmap function which
would take a function as its first argument, and map the sequence passed
as second argument:
*/
let fmap (f : Any -> Any) (seq : [Any*]) : [Any*] =
map seq with x -> f x
/* As opposed to OCaml's List.map and other polymorphic functions,
the result of fmap would always be of type [Any*] which is the most
general type of sequence.
So if you want to use such a function, the result would have
to be downcast using ":?", which involves a runtime check of
your data. So you should probably not use that technique.
However, a workaround is presented at http://www.cduce.org/tips.html
*/
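/* A short usage sketch of fmap, reusing identity and ccontents defined
   above (identity_mapped is a name introduced here). The result of fmap
   has type [Any*], so the dynamic cast ":?" is needed to see it again as
   a sequence of strings; the check happens at runtime. */
let identity_mapped = (fmap identity ccontents :? [ String* ])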
/* Task 4: Write a function that selects <b> nodes that have a "name"
attribute of a certain value. This value should be passed
as a parameter to the function.
Here is the solution:
*/
let select_bnode (name : String) (seq : [b*]) : [b*] =
transform seq with
x & <b name=y ..> _ -> if y = name then [x] else []
let b1_nodes = select_bnode "b1" bnodes
let b2_nodes = select_bnode "b2" bnodes
/* This solution introduces two main novelties:
- the transform-with syntactic construct,
- the "&" operator in patterns.
First, let's see what transform-with does. Like map-with, it is a language
construct, not a function. Like map-with, it scans the elements of
a sequence and returns another sequence.
Its role is to allow mapping and filtering of data at the same time.
Each item of the list is pattern-matched and must be
converted into a sequence of zero, one or maybe more elements.
With map-with it would result in a sequence of sequences, but here
the result is flattened, i.e. all the sequences are joined together.
In OCaml versions before 4.10, there is no such builtin functionality
(List.concat_map was added in 4.10),
but an equivalent polymorphic function could be written as follows:
# let transform f l = List.flatten (List.map f l) ;;
val transform : ('a -> 'b list) -> 'a list -> 'b list = <fun>
In the transform-with construct, pattern-matching always succeeds, since
an invisible catch-all case is added and is equivalent to returning
the empty sequence []. In other words, all elements that don't match
are discarded.
Now let's look at the pattern. It uses "&", placed between two patterns.
The first pattern "x" matches everything and is just used to bind
a variable (x) to the whole element. The second pattern "<b name=y ..> _"
selects <b> elements that have a "name" attribute.
So the "&" here is used like the "as" keyword in OCaml's pattern matching.
It is however more general since it can force a single object
to match two different patterns.
Please note that CDuce also has a "::" operator, whose role is to name
subsequences; it only appears from within the square brackets of sequence
patterns, e.g.:
# match [1 2 3 4] with [ _ x :: (_ _) _ ] -> x;;
- : [ 2 3 ] = [ 2 3 ]
*/
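/* Another small sketch of transform-with (bnode_names is a name introduced
   for this example): collect the value of the "name" attribute of every
   <b> node that has one. Nodes without a "name" attribute do not match
   the pattern and are silently dropped. */
let bnode_names = transform bnodes with <b name=n ..> _ -> [n]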
/* Task 5: Understanding types */
/* CDuce provides a broad set of types, which are reminiscent of OCaml types.
In addition to those, XML types exist and can be used to represent
some XML data. However, there are several considerations to take
into account:
*/
/* 1) so-called XML types can represent more than just XML documents. In XML,
data are always string-based. Here, other types can be used, such
as Ints or records. When converting an object of an XML type to real XML,
an error occurs if some part of it has no XML representation:
for instance Ints are translated to their string representation, but
other types like records cause an error.
The following object has an XML type (an <a> tag that contains a record),
and it can be manipulated within CDuce:
*/
let xml_with_record = <a> { x = 1; y = 2 }
/* but it cannot be converted to a traditional XML file because
records don't exist in real XML. So if you try,
print_xml (<a> { x = 1; y = 2 })
would fail.
*/
/* 2) type and variable names can be capitalized or not,
but they are case-sensitive, just like XML attribute labels.
In addition, type names can be used in pattern-matching,
just like capture variables.
For example, the meaning of
match 123 with t -> 456
depends on the context:
- If a type t was defined, it means that the structure of the matched
value (here 123) is checked against the pattern defined by type t.
- If there is no such type as t, then t is considered as a capture variable,
which here is bound to the matched value 123.
Test 1: t as a variable (the warning is expected):
# match 123 with t -> 456;;
Characters 15-23:
Warning: The capture variable t is declared in the pattern but not used in the body of this branch. It might be a misspelled or undeclared type or name (if it isn't, use _ instead).
- : 456 = 456
Test 2: t as a type
# type t = Int;;
# match 123 with t -> 456;;
- : 456 = 456
Test 2 works because 123 actually belongs to type t, which is an alias for Int.
Using an incompatible type such as String results in an error:
# match 123 with String -> 456;;
Characters 6-9:
This expression should have type:
String
but its inferred type is:
123
which is not a subtype, as shown by the sample:
123
*/
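/* The same mechanism gives a simple way to dispatch on the type of a
   value. A minimal sketch (describe and its messages are made up for this
   example): each branch uses a type name as its pattern, and the final
   "_" branch makes the match exhaustive for an argument of type Any. */
let describe (x : Any) : Latin1 =
  match x with
    Int -> "an integer"
  | String -> "a string"
  | _ -> "something else"
let described_int = describe 123
let described_doc = describe doc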
/* Task 6: Making CDuce functions and data available to an OCaml program.
Here is what you need:
- a CDuce program (a.cd)
- a compatible OCaml interface file for the CDuce program (a.mli)
A CDuce file will constitute an OCaml module. Essentially, cduce will
compile it into an OCaml implementation file, which uses the CDuce runtime
library.
Sequence of commands to produce the OCaml implementation:
ocamlfind ocamlc -c a.mli -package cduce
cduce --compile a.cd
cduce --mlstub a.cdo > a.ml
Then a.ml is compiled normally with either ocamlc or ocamlopt,
using the CDuce library:
ocamlfind ocamlopt -c a.ml -package cduce
In a Makefile, in addition to the rules you already use to compile
all your .mli and .ml files, you can add those two:
a.cdo: a.cd a.cmi
cduce --compile a.cd
a.ml: a.cdo
cduce --mlstub a.cdo > a.ml
The correspondence between OCaml and CDuce types is described
in the official documentation at
http://www.cduce.org/manual_interfacewithocaml.html#transl
We will just give a few remarks and a simple example.
About translating OCaml types to CDuce types:
- Not all CDuce types can be converted to OCaml types.
- Some CDuce types can be converted into different kinds of OCaml types,
depending on how you define the OCaml interface.
- Some types remain abstract in OCaml. The most common example is the Char
type which forms the String type (String is an alias for [Char*]).
If you want to use OCaml's string type, you have two options:
1. If your string only uses Unicode codes 0 to 255, then you can convert
it from String to Latin1, e.g. yourstring :? Latin1
2. If your string may contain Unicode characters above 255, then you
may want to export them as-is. The OCaml type you get is
Cduce_lib.Encodings.Utf8.t, and it can be converted to a regular
OCaml string (UTF8 encoded) with the
Cduce_lib.Encodings.Utf8.to_string function.
It is recommended that you browse through the available functions of
the CDuce library using a tool like ocamlbrowser.
Example: some OCaml types, followed by their CDuce counterparts
type opt = string option
type opt8 = Cduce_lib.Encodings.Utf8.t option
type stringlist = string list
type variant = A | B of variant
type variantpoly = [ `A | `B of variant ]
type f = bool -> unit
type flab = lab:bool -> unit
type point = { x : float;
y : float }
*/
type opt = [ Latin1? ]
type opt8 = [ String? ]
type stringlist = [ Latin1* ]
type variant = `A | (`B, variant)
type variantpoly = `A | (`B, variant)
type f = Bool -> []
type flab = Bool -> []
type point = { x = Float;
y = Float }
/* Complete information about interfacing CDuce and OCaml is given at
http://www.cduce.org/manual_interfacewithocaml.html
*/