Web technologies -- Laboratory 5 -- 2009-2010 -- info.uvt.ro

Parsing XML documents

Parsing is a central task when working with XML documents. There are several types of parsers, including:

DOM parsers:
- allow the navigation of the XML document as if it were a tree.
- the main drawback is that the whole document must be parsed and loaded into memory before it can be navigated.
- DOM documents can be created either by parsing an XML file or by users who want to build an XML document programmatically.
SAX parsers:
- event-driven API in which the XML document is read sequentially, using callbacks that are triggered when the different item types (elements, text, and so on) are encountered.
- overcomes DOM's memory problem, and is fast and efficient at reading files sequentially.
- its main drawback is that random access to information inside an XML file is quite difficult.
FlexML parsers:
- follow the SAX approach and rely on events during the parsing process.
- FlexML is not a parsing library in itself; instead, it converts a DTD file into a scanner specification usable with the classic Flex lexical analyzer generator.
Pull parsers:
- use an iterator design pattern in order to sequentially read various XML items such as elements, attributes or data.
- this method allows the programmer to write recursive-descent parsers, that is, applications in which the structure of the parsing code mirrors the structure of the XML being processed.
- examples of parsers from this category include StAX and the .NET System.Xml.XmlReader.
Non-extractive parsers:
- a newer technology in which the object-oriented modeling of the XML is replaced with 64-bit Virtual Token Descriptors (VTD).
- a representative parser in this category is VTD-XML.
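
To make the pull (iterator) style concrete, the following is a minimal sketch that uses the StAX API shipped with the JDK (javax.xml.stream). The class name PullExample, the method elementNames and the sample document are assumptions made only for illustration. Note that the application asks for the next event itself instead of registering callbacks.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class PullExample {

	// Collects the local names of all elements, in document order.
	public static List<String> elementNames(String xml) throws XMLStreamException {
		List<String> names = new ArrayList<>();
		XMLStreamReader reader = XMLInputFactory.newInstance()
				.createXMLStreamReader(new StringReader(xml));
		try {
			// The application pulls the next event; nothing is pushed through callbacks.
			while (reader.hasNext()) {
				int event = reader.next();
				if (event == XMLStreamConstants.START_ELEMENT) {
					names.add(reader.getLocalName());
				}
			}
		} finally {
			reader.close();
		}
		return names;
	}

	public static void main(String[] args) throws XMLStreamException {
		// Hypothetical sample document, used only for illustration.
		String xml = "<books><book title=\"XML\">text</book></books>";
		System.out.println(elementNames(xml));  // prints [books, book]
	}
}
```

Because the loop is ordinary code, it can be split into methods that call each other for nested elements, which is what makes recursive-descent parsing natural in this style.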

SAX

SAX (Simple API for XML) is a serial-access XML parser API. A SAX parser is included in the Xerces library.

The following fragment of code shows how we could use SAX to parse an XML document:

import org.xml.sax.*;
import org.xml.sax.helpers.XMLReaderFactory;
import java.io.IOException;


public class SAXExample {

	public static void main(String[] args) throws IOException {
      
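		try {
			// Obtain the default SAX parser and register our callbacks with it
			XMLReader parser = XMLReaderFactory.createXMLReader();
			ContentHandler handler = new TextExtractor();
			parser.setContentHandler(handler);

			// Read the document identified by args[0]; the handler's callbacks fire as it is read
			parser.parse(args[0]);
			System.out.println(args[0] + " is well-formed.");
		}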
		catch (SAXException e) {
			System.out.println(args[0] + " is not well-formed.");
		}
	}

}

import org.xml.sax.*;
import java.io.*;

public class TextExtractor implements ContentHandler {
 
	public TextExtractor() { }
    
	// Handles the #PCDATA, i.e. the text nodes.
	// Only the range [start, start + length) of the buffer belongs to this event.
	public void characters(char[] text, int start, int length) throws SAXException {
		System.out.println("Found text node: ");
		System.out.println(new String(text, start, length));
	}
    
	public void setDocumentLocator(Locator locator) {}
	// Handles the start of a document event.
	public void startDocument() {
		System.out.println("Entering document");
	}
	// Handles the end of a document event.
	public void endDocument() {
		System.out.println("Leaving document");
	}
	// Handles the beginning of the scope of a prefix-URI Namespace mapping.
	public void startPrefixMapping(String prefix, String uri) {}
	// Handles the ending of the scope of a prefix-URI Namespace mapping.
	public void endPrefixMapping(String prefix) {}
	// Triggers each time a start element is found.
	public void startElement(String namespaceURI, String localName,	String qualifiedName, Attributes atts) {
		System.out.println("Found element: " + localName);

		System.out.println("Attributes:");
		for (int i=0; i<atts.getLength(); i++) {
			System.out.println("Found attribute: " + atts.getLocalName(i) + " with value: " + atts.getValue(i));
		}
	}
	// Triggers each time an end element is found.
	public void endElement(String namespaceURI, String localName, String qualifiedName) {
		System.out.println("Leaving element: " + localName);
	}
	// Handles ignorable whitespace (whitespace in element content, as determined by the DTD).
	public void ignorableWhitespace(char[] text, int start, int length) throws SAXException {}
	// Handles a processing instruction, given as a target and its data.
	// Note that the XML declaration (version, encoding) is not reported through this method.
	public void processingInstruction(String target, String data){}
	// Handles a skipped entity.
	public void skippedEntity(String name) {}
}
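
Implementing every ContentHandler method is verbose when only a few events are of interest. As a hedged alternative, the sketch below extends org.xml.sax.helpers.DefaultHandler, which supplies empty implementations, and overrides only three callbacks. It also accumulates text in a StringBuilder, because a SAX parser may deliver the text of one element through several characters() calls. The class name TitleCollector and the element name title are assumptions made for illustration and do not come from the assignment.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class TitleCollector extends DefaultHandler {

	private final List<String> titles = new ArrayList<>();
	private StringBuilder buffer;  // non-null only while inside a <title> element

	@Override
	public void startElement(String uri, String localName, String qName, Attributes atts) {
		// The default factory is not namespace-aware, so the qualified name is used.
		if ("title".equals(qName)) {
			buffer = new StringBuilder();
		}
	}

	@Override
	public void characters(char[] text, int start, int length) {
		// One text node may arrive in several chunks, so append instead of assigning.
		if (buffer != null) {
			buffer.append(text, start, length);
		}
	}

	@Override
	public void endElement(String uri, String localName, String qName) {
		if ("title".equals(qName)) {
			titles.add(buffer.toString());
			buffer = null;
		}
	}

	// Parses the given XML text and returns the content of every <title> element.
	public static List<String> collect(String xml)
			throws SAXException, IOException, ParserConfigurationException {
		TitleCollector handler = new TitleCollector();
		SAXParserFactory.newInstance().newSAXParser()
				.parse(new InputSource(new StringReader(xml)), handler);
		return handler.titles;
	}

	public static void main(String[] args) throws Exception {
		// Hypothetical sample document, used only for illustration.
		String xml = "<books><book><title>A</title></book>"
				+ "<book><title>B &amp; C</title></book></books>";
		System.out.println(collect(xml));  // prints [A, B & C]
	}
}
```

The same pattern (remember where you are in startElement, collect in characters, store in endElement) is one possible way to fill a list of beans from a SAX stream.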

Links:

SAX Tutorial

DOM

DOM (Document Object Model) is a convention for representing XML documents as a tree of nodes. A DOM parser is also included in the Xerces library.

DOM handles XML files as being made of the following types of nodes:

Document node
Element nodes
Attribute nodes
Leaf nodes:
- Text nodes
- Comment nodes
- Processing instruction nodes
- CDATA section nodes
- Entity reference nodes
- Document type nodes
Non-tree nodes

The following fragment of code shows how we could use DOM to traverse an XML tree:

import javax.xml.parsers.*;  // JAXP
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
import java.io.IOException;


public class DOMExample {

	public static void main(String[] args) {
          
		DOMExample iterator = new DOMExample();
		try {
			// Use JAXP to find a parser
			DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
			// Turn on namespace support
			factory.setNamespaceAware(true);
			DocumentBuilder parser = factory.newDocumentBuilder();
      
			// Read the entire document into memory
			Node document = parser.parse(args[0]); 
      
			// Process it starting at the root
			iterator.followNode(document);

		}
		catch (SAXException e) {
			System.out.println(args[0] + " is not well-formed.");
			System.out.println(e.getMessage());
		}   
		catch (IOException e) { 
			System.out.println(e); 
		}
		catch (ParserConfigurationException e) { 
			System.out.println("Could not locate a JAXP parser"); 
		}
  
	}
 
	public void followNode(Node node) throws IOException {
		// Print information on node.
		System.out.println("Node name:" + node.getNodeName());
		System.out.println("Node type:" + node.getNodeType());
		System.out.println("Node local name:" + node.getLocalName());
		System.out.println("Node value:" + node.getNodeValue());

		// Process the children.
		NodeList children = node.getChildNodes();
		for (int i = 0; i < children.getLength(); i++) {
			Node child = children.item(i);
			// Recursion on child.
			followNode(child); 
		}    
	}
}
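
Since the whole tree is in memory, DOM also supports direct lookups without writing a traversal. The sketch below is only an illustration of this: it uses Document.getElementsByTagName to fetch all elements with a given name and reads one attribute from each. The class name DOMLookupExample, the method attributeValues and the sample document are assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DOMLookupExample {

	// Returns the value of the given attribute for every element with the given tag name.
	public static List<String> attributeValues(String xml, String tag, String attribute)
			throws Exception {
		DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
		Document doc = builder.parse(new InputSource(new StringReader(xml)));

		List<String> values = new ArrayList<>();
		// getElementsByTagName searches the whole tree, in document order.
		NodeList matches = doc.getElementsByTagName(tag);
		for (int i = 0; i < matches.getLength(); i++) {
			Element element = (Element) matches.item(i);
			values.add(element.getAttribute(attribute));
		}
		return values;
	}

	public static void main(String[] args) throws Exception {
		// Hypothetical sample document, used only for illustration.
		String xml = "<books><book title=\"First\"/><book title=\"Second\"/></books>";
		System.out.println(attributeValues(xml, "book", "title"));  // prints [First, Second]
	}
}
```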

DOM also allows users to create a new XML document or change the structure of an already existing one:

import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.Element;

public class DOMCreatorExample {

	public static void main(String[] av) throws IOException {
		DOMCreatorExample dc = new DOMCreatorExample();
		Document doc = dc.makeXML();
	}

	public Document makeXML() {
		try {
			DocumentBuilderFactory fact = DocumentBuilderFactory.newInstance();
			DocumentBuilder parser = fact.newDocumentBuilder();
			Document doc = parser.newDocument();

			Node root = doc.createElement("books");
			doc.appendChild(root);

			// createElement returns an Element, so no cast is needed to set attributes
			Element book = doc.createElement("book");
			book.setAttribute("title", "Processing XML with Java");
			book.setAttribute("author", "Elliotte Rusty Harold");
			book.appendChild(doc.createTextNode("A complete tutorial about writing Java programs that read and write XML documents."));
			root.appendChild(book);
      
			return doc;

		} catch (Exception ex) {
			ex.printStackTrace();
			return null;
		}
	}
}
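
The example above builds the tree in memory but never writes it out. One common way to serialize a DOM Document with the standard JDK is an identity Transformer from javax.xml.transform. The following is a minimal sketch under that assumption; the class name DOMWriterExample and the method serialize are illustrative names.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DOMWriterExample {

	// Serializes the document to a string using an identity transformation.
	public static String serialize(Document doc) throws TransformerException {
		// A Transformer created without a stylesheet copies its input to its output.
		Transformer transformer = TransformerFactory.newInstance().newTransformer();
		transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
		StringWriter out = new StringWriter();
		transformer.transform(new DOMSource(doc), new StreamResult(out));
		return out.toString();
	}

	public static void main(String[] args) throws Exception {
		// Build a small hypothetical document, in the same way as makeXML() does.
		Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
		Element root = doc.createElement("books");
		doc.appendChild(root);
		Element book = doc.createElement("book");
		book.setAttribute("title", "Processing XML with Java");
		root.appendChild(book);

		System.out.println(serialize(doc));
	}
}
```

To write to a file instead of a string, the StreamResult can be constructed from a java.io.File, for example new StreamResult(new File("books.xml")), where the file name is again only an example.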

Exercises

Parse the XML created in your assignment from Web technologies -- Laboratory 3 -- 2009-2010 -- info.uvt.ro using both SAX and DOM. Print the parsing time of each method (hint: use System.currentTimeMillis() to get the start and end times). Then, using SAX only, populate the ArrayList and beans used to create the HTML code of your assignment from Web technologies -- Laboratory 4 -- 2009-2010 -- info.uvt.ro with the retrieved data.
Create the XML from your assignment in Web technologies -- Laboratory 3 -- 2009-2010 -- info.uvt.ro using DOM. Write the result to an XML file.
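
The timing hint from the first exercise can be sketched as a small helper that wraps any piece of work between two System.currentTimeMillis() calls. The class name ParseTimer and the method timeMillis are hypothetical, and the loop in main is only a placeholder for the real parsing code.

```java
public class ParseTimer {

	// Runs the task once and returns the elapsed wall-clock time in milliseconds.
	public static long timeMillis(Runnable task) {
		long start = System.currentTimeMillis();
		task.run();
		return System.currentTimeMillis() - start;
	}

	public static void main(String[] args) {
		// Placeholder workload; replace the body with a call to your SAX or DOM parsing code.
		long elapsed = timeMillis(() -> {
			StringBuilder sb = new StringBuilder();
			for (int i = 0; i < 100000; i++) {
				sb.append(i);
			}
		});
		System.out.println("Elapsed: " + elapsed + " ms");
	}
}
```

For small documents a single run may take only a few milliseconds, so repeating the measurement several times gives a more reliable comparison.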