Release 3.0 on December 29, 2012

The 2.0 version was not handling mixed content correctly. That is
now fixed.

README:

This tool makes it easier for you to extract information out of
XML Schemas.

The tool converts a schema into a single file and puts the schema 
into a Russian Doll form. Plus it adds a few enhancements such as:

- If an element can be substituted, an attribute is added to
  the element declaration that idref's to each element that can
  substitute for it.

- A targetNamespace attribute is added to each element and
  attribute declaration to show what namespace they are part of.

With the schema in this enhanced form, I have found that extracting
information out of schemas is greatly facilitated.


Here are a few examples of queries that I've needed to perform
on schemas in the past:

- What are all the elements and attributes that are declared
  to be of type xs:QName (or xs:string, or xs:gYear, etc.)?

- For simpleType A, what are its applicable facets? (Take
  into account the facets in all its ancestor simpleTypes)

- How many element declarations are in the schema? How many
  complexType definitions? simpleTypes? attributes?

- How many lines of schema code are there?

With my tool it is easy to get answers to those questions.
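
To make the first and third queries concrete, here is a minimal
sketch in Python (chosen only for illustration; the tool itself is
XSLT 2.0). It assumes the schema is already in one file. The sample
schema and the function names are made up for this example, and the
type test simply compares the prefixed name as written, which only
works if the schema uses the same prefix throughout.

```python
import xml.etree.ElementTree as ET
from collections import Counter

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical sample schema, used only for illustration.
SAMPLE = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Book">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Title" type="xs:string"/>
        <xs:element name="Year" type="xs:gYear"/>
      </xs:sequence>
      <xs:attribute name="isbn" type="xs:string"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

def count_components(root):
    """Count element, attribute, simpleType, and complexType nodes."""
    counts = Counter()
    for node in root.iter():
        for kind in ("element", "attribute", "simpleType", "complexType"):
            if node.tag == "{%s}%s" % (XS, kind):
                counts[kind] += 1
    return counts

def names_with_type(root, type_name):
    """Names of element/attribute declarations whose type equals type_name.

    Simplification: compares the QName as written, so it assumes one
    fixed prefix (here "xs") for the XML Schema namespace.
    """
    hits = []
    for node in root.iter():
        if node.tag in ("{%s}element" % XS, "{%s}attribute" % XS):
            if node.get("type") == type_name:
                hits.append(node.get("name"))
    return hits

root = ET.fromstring(SAMPLE)
counts = count_components(root)
string_typed = names_with_type(root, "xs:string")
print(dict(counts))
print(string_typed)
```

On a real schema spread over several files, the same loop would have
to be repeated per file, which is the difficulty described next.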


Without this tool, it can be difficult to get the info you desire
from XML Schemas. Here are a few reasons for the difficulty:

1. The schema may be scattered over multiple files. So you have 
   to search through multiple files to find the info you want.

2. A simpleType may be part of a long chain of restrictions. And the
   simpleTypes may be scattered over multiple files. That
   makes it difficult to know exactly what is the net value space 
   for the simpleType.

3. Likewise a complexType may be part of a long chain of derive-by-
   extensions and derive-by-restrictions. And the complexTypes
   may be scattered over multiple files. That makes it difficult
   to know exactly what is the final set of elements and attributes
   in a complexType.

4. An element may be substituted. So, many different elements may
   be possible at a certain point in a schema.

5. Consider an element declaration with a type attribute. The type
   definition could be located in many places: in the document that
   the element declaration is located in, in a document that it
   includes or imports, or one that they include or import. It 
   could be in the document that included the document that
   contains the element declaration. And many more places. Ouch!

6. The elements and attributes in a no-namespace schema are
   part of one namespace when they are included by a schema with
   targetNamespace A and another namespace when they are included 
   by a schema with targetNamespace B. 
   


Let's take a quick look at what my tool does. In the "schemas"
folder (in the same folder as this README.txt) are two schema
files, A.xsd and B.xsd. Open them. 

   - A.xsd declares a Book element and includes B.xsd 

   - B.xsd declares an Author element (which is ref'ed in the Book 
     element). 

My tool reads in those schemas, replaces all their ref's with inlined 
values, adds identifiers, adds namespaces, and outputs a single file,
results.xml. Open results.xml to see what my tool has done. 
Hopefully you agree that the file has the schemas in a form that 
can be readily queried.

To run my tool you need to first download Saxon Home Edition (Saxon HE),
which is available here:

    http://saxon.sourceforge.net/#F9.4HE


Then, place saxon9he.jar into this folder:

    programs/saxon

After you've done that, you can simply double-click on run.bat 
in the schemas folder.


Now for some useful details about the tool...

It is implemented using XSLT 2.0.

My tool consists of a number of XSLT programs. Each program performs
cumulative processing on the schema. That is, the output of one
program is the input to the next program, and each program makes
a tiny change to the schema. After many such tiny changes I end up 
with a schema that is in a form that is quite amenable to querying.

Before I describe each program, let me give you the big picture.

a. A schema may consist of multiple schema documents. I collect
   them all into one document. That makes querying much easier.

b. Schemas do a lot of ref'ing. For example, an element declaration
   may have a type attribute whose value is the name of a simpleType.
   That type attribute is a reference. To make it easy to fetch the 
   items being referenced, I use id-idref pairs. I add an id to the 
   simpleType (for example) and add an idref to the element 
   declaration. With that id-idref pair in place, I can use the 
   XSLT key() function to easily fetch the simpleType, starting from
   the element declaration.
   
   In other words, I take good advantage of id-idref pairs.

c. After adding id-idref pairs I go through the schemas and fetch
   each item being ref'ed and inline it. For example, an element
   declaration has a type attribute whose value is the name of a
   simpleType. Now that the simpleType has an id attribute on it,
   and the element declaration has an idref attribute on it, I can
   use the XSLT key() function to immediately fetch the simpleType.
   Then I inline it in the element declaration (and toss out its
   type attribute, of course). Do you recognize that what I am 
   doing is converting the schema into Russian Doll form? That form 
   is much easier to query than the Salami Slice form.

d. A simpleType may restrict another simpleType that restricts
   another simpleType and so on. I take that simpleType chain and
   reduce it down to a single simpleType.

e. Likewise a complexType may be part of a long complexType chain.
   I take that complexType chain and reduce it down to a single
   complexType.
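
Steps b and c can be sketched as follows. This is an illustrative
Python sketch, not the tool's XSLT: a dictionary stands in for the
XSLT key() lookup, the id value T1 and the sample schema are made up,
and dropping name and id from the inlined copy is this sketch's own
choice.

```python
import copy
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical input: a simpleType carrying an id and an element
# declaration carrying the matching idref (the value T1 is made up).
SAMPLE = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="ISBN" id="T1">
    <xs:restriction base="xs:string">
      <xs:length value="13"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:element name="Code" type="ISBN" idref="T1"/>
</xs:schema>
"""

def build_id_index(root):
    """Map each id value to its node (stands in for an XSLT key)."""
    return {n.get("id"): n for n in root.iter() if n.get("id") is not None}

def inline_types(root, index):
    """Inline the referenced type into each typed element declaration."""
    # Snapshot the declarations first, since the tree is modified below.
    for elem in list(root.iter("{%s}element" % XS)):
        target = index.get(elem.get("idref"))
        if elem.get("type") is None or target is None:
            continue
        inlined = copy.deepcopy(target)
        # Sketch's choice: the inlined copy is anonymous, so drop name and id.
        inlined.attrib.pop("name", None)
        inlined.attrib.pop("id", None)
        inlined.tail = None
        elem.append(inlined)
        # The type attribute is no longer needed once the type is inlined.
        del elem.attrib["type"]

root = ET.fromstring(SAMPLE)
inline_types(root, build_id_index(root))
code = root.find("{%s}element" % XS)
inlined_type = code.find("{%s}simpleType" % XS)
print(ET.tostring(code, encoding="unicode"))
```

After the call, the Code element declaration contains its own copy of
the simpleType, which is the Russian Doll form described in step c.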


Okay, now for a description of each program:

1. My first program reads in all the schema files and
puts them all into a single file. More specifically, you give this
program the name of your "main" schema file and the program will
input it plus all the schema files it includes, imports, and
redefines, and all the schema files they include, import, and
redefine, and so on. If the schema files loop (e.g. A includes B 
which includes C which includes A) that is no problem. This program 
handles that. This program adds xml:base to each xs:schema. The 
value of xml:base is the full path to the schema file. Additionally,
the program adds an id attribute to each xs:schema to uniquely 
identify each schema. 

This program is located in this folder:

    programs/input-schemas

In that folder is a sub-folder, schemas. It contains some schema
files for testing the program.

This program may be all you need. With this program you can already
perform lots of useful queries, such as "How many element declarations
are there?" or "How many schema files are there?" You may
not need the following programs, which take the output of this
first program and cumulatively perform further processing.
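
The loop handling in program 1 can be illustrated with a "seen" set.
The Python sketch below is only an illustration of that idea, not the
tool's XSLT. The schema files are held in an in-memory dictionary, the
id format (S0, S1, ...) is made up, and schemaLocation values are
treated as plain keys rather than being resolved relative to the
referring file as a real implementation would have to do.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"
REFERRING = {"{%s}%s" % (XS, k) for k in ("include", "import", "redefine")}

SCHEMA = ('<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'
          '<xs:include schemaLocation="%s"/></xs:schema>')

# Hypothetical in-memory "files": A includes B, B includes C, C includes A.
FILES = {
    "A.xsd": SCHEMA % "B.xsd",
    "B.xsd": SCHEMA % "C.xsd",
    "C.xsd": SCHEMA % "A.xsd",
}

def collect(main, load):
    """Return the schema roots reachable from main, each read only once."""
    seen, order, pending = set(), [], [main]
    while pending:
        location = pending.pop()
        if location in seen:
            continue  # already read; this is what breaks include loops
        seen.add(location)
        root = ET.fromstring(load(location))
        root.set("id", "S%d" % len(order))  # made-up id format
        root.set(XML_BASE, location)        # record where the schema came from
        order.append(root)
        for child in root:
            if child.tag in REFERRING and child.get("schemaLocation"):
                pending.append(child.get("schemaLocation"))
    return order

schemas = collect("A.xsd", FILES.__getitem__)
bases = [s.get(XML_BASE) for s in schemas]
ids = [s.get("id") for s in schemas]
print(bases, ids)
```

With all the roots in one list, a query such as "How many schema
files are there?" is just the length of that list.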


2. The next program is actually a collection of small programs. They
are located in this folder:

    programs/transform-input-schemas

Remember that program #1 outputs a single file containing all the
schema files. I will refer to this as the "single file."

Here are the small programs that make up program 2:

(a) add-xsd-datatypes-schema: I created a schema, schema-data-types.xsd,
that contains a simpleType for each XSD datatype. This program
inputs schema-data-types.xsd and appends it to the single file.

(b) add-idref-on-includes-imports-redefines: Suppose a schema has an
include element that references schema A. Recall from program 1 that each 
schema has an id attribute. This program adds an attribute on the
include element, idref, with a value that matches the id attribute.
Ditto for all import elements and redefine elements.

Thus, there is now an id-idref pair, which makes for easy processing.

(c) add-import-xsd-datatypes-schema: recall that I added to the
single file a schema that contains a simpleType for each XSD 
datatype. This program inserts into all the other schemas an 
import element that imports the datatypes schema.

(d) add-id-to-top-level-items: this program adds an id attribute
onto each global element, attribute, simpleType, complexType,
group, and attributeGroup. What's the reason for doing this?
Well, only global stuff can be referenced, so this program
prepares them to be part of an id-idref pair. In program 3 
I insert the matching idref attributes.

(e) do-input-transforms: this program simply invokes each of the
above programs, a - d, and outputs the resulting single file.


3. This program is a real workhorse. For every reference in a
schema it adds an idref. Thus, between program 2 (d) and this program
I have created an id-idref pair. This makes it easy to jump
from an item with a reference to the referenced item.

The program is located in this folder:

    programs/add-idref-substitutionGroupRef-and-targetNamespace

Suppose an element declaration has a type attribute with value A. 
Suppose A is a global complexType. It has an id attribute, say 2DE. 
This program adds an idref attribute on the element declaration
whose value is 2DE. Ditto for an attribute with a type attribute.

An idref attribute is added to these items:

- elements with a type attribute
- attributes with a type attribute
- elements with a ref attribute
- attributes with a ref attribute
- group with a ref attribute
- attributeGroup with a ref attribute
- restriction (in a simpleType or complexType) with a 
  base attribute
- extension (in a complexType) with a base attribute
- list with an itemType attribute
- union with a memberTypes attribute

This program also adds a targetNamespace attribute onto each
element and attribute declaration. Obviously, its value is the 
namespace of the item. What about a no-namespace schema that is 
coerced into the namespace of an including schema? No problem. 
My program handles that.

Finally, if an element substitutes for another element, I add
an attribute, substitutionGroupRef, to the former.

Phew! This program does a lot.
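
The namespace coercion mentioned above can be sketched in a few
lines. This Python sketch is an illustration under narrow
assumptions: it only stamps global (top-level) declarations, it
ignores local declarations (whose namespace also depends on form and
elementFormDefault), and the schemas, namespace URI, and function
names are made up.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schemas: A has a targetNamespace and includes B,
# which has none.
A = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="urn:example:a">
  <xs:include schemaLocation="B.xsd"/>
  <xs:element name="Book"/>
</xs:schema>"""

B = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Author"/>
  <xs:attribute name="lang"/>
</xs:schema>"""

def effective_namespace(schema, includer_namespace=""):
    """A no-namespace schema takes on the namespace of its includer."""
    own = schema.get("targetNamespace")
    return own if own is not None else includer_namespace

def stamp_globals(schema, namespace):
    """Add targetNamespace to each global element/attribute declaration."""
    for child in schema:
        if child.tag in ("{%s}element" % XS, "{%s}attribute" % XS):
            child.set("targetNamespace", namespace)

a, b = ET.fromstring(A), ET.fromstring(B)
ns_a = effective_namespace(a)
stamp_globals(a, ns_a)
stamp_globals(b, effective_namespace(b, includer_namespace=ns_a))
stamped = {c.get("name"): c.get("targetNamespace")
           for c in list(a) + list(b) if c.get("name")}
print(stamped)
```

If B were instead included by a schema with a different
targetNamespace, the same declarations would be stamped with that
other namespace.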


4. An element may be substitutable by other elements. This
program adds an attribute, substitutable-by-these-elements, to 
an element declaration if it can be substituted. The value of 
substitutable-by-these-elements is a space-separated list of 
idref values ... one idref value for each element that can substitute
for the element.

The program is located in this folder:

    programs/add-refed-by-substitutionGroup
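
Conceptually this step inverts the substitutionGroupRef pointers
added by program 3. The Python sketch below illustrates that
inversion only. It assumes substitutionGroupRef holds the id of the
head element (an assumption, as are the id values and the sample
schema), and it handles direct members only, not chains where a
substituting element is itself substitutable.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical input; ids and substitutionGroupRef values are made up.
SAMPLE = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Publication" id="E1"/>
  <xs:element name="Book" id="E2"
              substitutionGroup="Publication" substitutionGroupRef="E1"/>
  <xs:element name="Magazine" id="E3"
              substitutionGroup="Publication" substitutionGroupRef="E1"/>
</xs:schema>
"""

def add_substitutable_by(root):
    """Give each head element a space-separated list of member ids."""
    members = defaultdict(list)  # head id -> ids of direct substitutes
    elements = list(root.iter("{%s}element" % XS))
    for elem in elements:
        head = elem.get("substitutionGroupRef")
        if head is not None:
            members[head].append(elem.get("id"))
    for elem in elements:
        if elem.get("id") in members:
            elem.set("substitutable-by-these-elements",
                     " ".join(members[elem.get("id")]))

root = ET.fromstring(SAMPLE)
add_substitutable_by(root)
by_name = {e.get("name"): e for e in root.findall("{%s}element" % XS)}
print(by_name["Publication"].get("substitutable-by-these-elements"))
```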
 

5. A complexType may extend another. This program merges the
stuff in the base type into the child type.  Now, the child
type is standalone (i.e. no longer dependent on a base type).

The program is located in this folder:

    programs/inline-complexContent-derive-by-extension
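
A minimal sketch of the merge, in Python for illustration only. It
relies on the XML Schema rule that an extension's content is the
base type's content followed by the added content. It assumes the
base type is not itself derived, looks the base up by name rather
than by idref, and ignores simpleContent, attributeGroups, and
wildcards. The sample schema and function names are made up.

```python
import copy
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

def q(local):
    """Qualified tag name in the XML Schema namespace."""
    return "{%s}%s" % (XS, local)

# Hypothetical sample: Book extends Publication.
SAMPLE = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="Publication">
    <xs:sequence>
      <xs:element name="Title" type="xs:string"/>
    </xs:sequence>
    <xs:attribute name="language" type="xs:string"/>
  </xs:complexType>
  <xs:complexType name="Book">
    <xs:complexContent>
      <xs:extension base="Publication">
        <xs:sequence>
          <xs:element name="ISBN" type="xs:string"/>
        </xs:sequence>
        <xs:attribute name="edition" type="xs:string"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
</xs:schema>
"""

PARTICLES = (q("sequence"), q("choice"), q("all"), q("group"))

def flatten_extension(derived, base):
    """Return a standalone complexType equivalent to derived extending base."""
    extension = derived.find(q("complexContent") + "/" + q("extension"))
    merged = ET.Element(q("complexType"), {"name": derived.get("name")})
    content = ET.SubElement(merged, q("sequence"))
    # Base content comes first, then the content added by the extension.
    for source in (base, extension):
        for child in source:
            if child.tag in PARTICLES:
                content.append(copy.deepcopy(child))
    # Attributes declared on the base and on the extension all apply.
    for source in (base, extension):
        for child in source:
            if child.tag == q("attribute"):
                merged.append(copy.deepcopy(child))
    return merged

root = ET.fromstring(SAMPLE)
types = {t.get("name"): t for t in root.findall(q("complexType"))}
book = flatten_extension(types["Book"], types["Publication"])
element_names = [e.get("name") for e in book.iter(q("element"))]
attribute_names = [a.get("name") for a in book.findall(q("attribute"))]
print(element_names, attribute_names)
```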

-----------------------------------------------------------
The next programs do the task of replacing references with
inlined items. That is, they do the task of converting the
schema to Russian Doll form.
-----------------------------------------------------------


6. A complexType may restrict another. This program merges
the base type and the child type.  After merging, the child
type is standalone (i.e. no longer dependent on the base type).

The program is located in this folder:

    programs/inline-complexContent-derive-by-restriction


7. Suppose an element declaration has a type attribute. This program
fetches the item referenced by the type attribute and inlines the 
item within the element (and removes the type attribute). Ditto for
attribute declarations with a type attribute.

The program is located in this folder:

    programs/inline-types


8. A simpleType may be at the end of a long simpleType chain
(e.g. simpleType A restricts simpleType B, which restricts C and so on).
This program merges the chain into a single simpleType. Wow!

The program is located in this folder:

    programs/reduce-simpleTypes
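
One way to picture the merge is to walk from the most derived type
up to the built-in type, collecting facets along the way. The Python
sketch below is an illustration, not the tool's XSLT. It assumes
every type in the chain is a restriction (no list or union), looks
up bases by unprefixed name, keeps the most derived value of each
single-valued facet, accumulates pattern facets from every step
(they all apply), and does not handle enumeration. The sample types
are made up.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Facets for which the most derived value is the one that counts.
SINGLE_VALUED = {
    "length", "minLength", "maxLength", "minInclusive", "maxInclusive",
    "minExclusive", "maxExclusive", "totalDigits", "fractionDigits",
    "whiteSpace",
}

# Hypothetical chain: ShortCode restricts Code, which restricts xs:string.
SAMPLE = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="Code">
    <xs:restriction base="xs:string">
      <xs:maxLength value="20"/>
      <xs:pattern value="[A-Z0-9]+"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="ShortCode">
    <xs:restriction base="Code">
      <xs:maxLength value="8"/>
      <xs:minLength value="2"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
"""

def reduce_chain(name, types):
    """Return (built-in base, facets, patterns) for the named simpleType."""
    facets, patterns, base = {}, [], name
    while base in types:
        restriction = types[base].find("{%s}restriction" % XS)
        for facet in restriction:
            kind = facet.tag.split("}")[1]
            if kind == "pattern":
                patterns.append(facet.get("value"))
            elif kind in SINGLE_VALUED:
                # The walk starts at the most derived type, so the first
                # value seen for a facet is the one that wins.
                facets.setdefault(kind, facet.get("value"))
        base = restriction.get("base")
    return base, facets, patterns

root = ET.fromstring(SAMPLE)
types = {t.get("name"): t for t in root.findall("{%s}simpleType" % XS)}
base, facets, patterns = reduce_chain("ShortCode", types)
print(base, facets, patterns)
```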


9. Suppose an element declaration has a ref attribute. This program
will fetch the element being ref'ed and replace the ref'ing 
element declaration with the ref'ed element (of course it retains
minOccurs and maxOccurs from the ref'ing element). Ditto for attribute
declarations containing a ref attribute, groups containing
a ref attribute, and attributeGroups containing a ref
attribute.

The program is located in this folder:

    programs/inline-ref
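
The replacement for element declarations can be sketched as below,
in Python for illustration only. It uses the Book/Author arrangement
from the A.xsd and B.xsd example, but the schema text, the id value
E1, and the function name are made up. It makes a single pass (refs
inside an inlined copy are not followed), and dropping the id from
the inlined copy is this sketch's own choice.

```python
import copy
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical input: Book ref's the global Author element via idref E1.
SAMPLE = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Author" type="xs:string" id="E1"/>
  <xs:element name="Book">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Author" idref="E1"
                    minOccurs="1" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

def inline_refs(root):
    """Replace each ref'ing element with a copy of the ref'ed element."""
    tag = "{%s}element" % XS
    index = {n.get("id"): n for n in root.iter() if n.get("id") is not None}
    # Snapshot parents and children, since children are replaced in place.
    for parent in list(root.iter()):
        for i, child in enumerate(list(parent)):
            if child.tag != tag or child.get("ref") is None:
                continue
            target = index.get(child.get("idref"))
            if target is None:
                continue
            inlined = copy.deepcopy(target)
            inlined.attrib.pop("id", None)  # sketch's choice
            # Occurrence constraints belong to the ref'ing declaration.
            for occurs in ("minOccurs", "maxOccurs"):
                if child.get(occurs) is not None:
                    inlined.set(occurs, child.get(occurs))
            inlined.tail = child.tail
            parent[i] = inlined

root = ET.fromstring(SAMPLE)
inline_refs(root)
inner = root.find(".//{%s}sequence/{%s}element" % (XS, XS))
print(ET.tostring(inner, encoding="unicode").strip())
```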


Done!

By this point the original schema has gone through a lot
of changes. The result is a single file in which references have
been replaced by inlined values, type chains have been merged,
elements that can be substituted have a pointer to those elements
that may substitute, and so forth.

Any suggestions? Let me know.

Found a bug? Let me know.

Roger L. Costello
costello@mitre.org

December 29, 2012