Web technologies -- Laboratory 5 -- 2009-2010 -- info.uvt.ro

From Wikiversity

Parsing XML documents

Parsing is a central task when working with XML documents. Several types of parsers exist, including:

  • DOM parsers:
    • allow the XML document to be navigated as if it were a tree.
    • the main drawback is that the whole document must be parsed and loaded into memory before the application can start working with it.
    • DOM documents can be created either by parsing an XML file or programmatically, by users who want to build a new XML document.
  • SAX parsers:
    • event-driven API in which the XML document is read sequentially and callbacks are triggered as the different items (start tags, end tags, text and so on) are encountered.
    • avoids the DOM’s memory problem, and is fast and efficient at reading files sequentially.
    • the main drawback is that random access to information inside the XML file is difficult, because the parser keeps no record of what it has already read.
  • FleXML parsers:
    • follow the SAX approach and rely on events during the parsing process.
    • FleXML is not a parsing library in itself; instead it converts a DTD file into a specification for the Flex lexical analyzer generator, from which a parser for that DTD is generated.
  • Pull parsers:
    • use the iterator design pattern to read XML items such as elements, attributes or character data sequentially; the application asks for the next item instead of receiving callbacks.
    • this method allows the programmer to write recursive-descent parsers:
      • applications in which the structure of the parsing code mirrors the structure of the XML being processed.
      • examples of parsers from this category include StAX and the .NET System.Xml.XmlReader.
  • Non-extractive parsers:
    • a newer technology in which the object-oriented modeling of the XML is replaced with 64-bit Virtual Token Descriptors, which record where each token sits in the original document instead of extracting it.
    • a representative parser in this category is VTD-XML.
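
To make the pull model more concrete, here is a minimal sketch using StAX (javax.xml.stream), which ships with the JDK. The class name PullExample, the method bookTitles and the sample <books> document are assumptions made for illustration only. The point to notice is that the application drives the loop and asks for each event, rather than being called back by the parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class PullExample {

    // Hypothetical helper: returns the "title" attribute of every <book> element.
    public static List<String> bookTitles(String xml) {
        List<String> titles = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                // The application pulls the next event whenever it is ready for it.
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT
                            && "book".equals(reader.getLocalName())) {
                        // A null namespace means the attribute namespace is not checked.
                        titles.add(reader.getAttributeValue(null, "title"));
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Wrapped to keep the sketch short; real code may prefer to propagate it.
            throw new IllegalArgumentException("Document is not well-formed", e);
        }
        return titles;
    }

    public static void main(String[] args) {
        // Assumed sample document, not taken from the laboratory text.
        String xml = "<books><book title=\"A\"/><book title=\"B\">text</book></books>";
        System.out.println(bookTitles(xml));
    }
}
```

Because the loop is ordinary code, it can stop early or hand the reader to a method dedicated to one element type, which is what makes the recursive-descent style mentioned above possible.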

SAX

SAX (Simple API for XML) is a serial-access, event-driven API for parsing XML. A SAX parser is included in the Apache Xerces library.

The following fragment of code shows how we could use SAX to parse an XML document:

import org.xml.sax.*;
import org.xml.sax.helpers.XMLReaderFactory;
import java.io.IOException;


public class SAXExample {

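	public static void main(String[] args) throws IOException {
		// The path or URL of the XML document is expected as the first argument.
		if (args.length == 0) {
			System.out.println("Usage: java SAXExample <file-or-URL>");
			return;
		}
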
		try {
			XMLReader parser = XMLReaderFactory.createXMLReader();
			ContentHandler handler = new TextExtractor();
			parser.setContentHandler(handler);

			parser.parse(args[0]);
			System.out.println(args[0] + " is well-formed.");
		}
		catch (SAXException e) {
			System.out.println(args[0] + " is not well-formed.");
		}
	}

}

The handler used above is defined in a separate file, TextExtractor.java:

import org.xml.sax.*;

public class TextExtractor implements ContentHandler {
 
	public TextExtractor() { }
    
	// Handle #PCDATA, i.e. character data. The parser may report one text node in several calls.
	public void characters(char[] text, int start, int length) throws SAXException {
		System.out.println("Found text node: ");
		// Copy only the reported slice of the buffer.
		System.out.println(new String(text, start, length));
	}
    
	public void setDocumentLocator(Locator locator) {}
	// Handles the start of a document event.
	public void startDocument() {
		System.out.println("Entering document");
	}
	// Handles the end of a document event.
	public void endDocument() {
		System.out.println("Leaving document");
	}
	// Handles the beginning of the scope of a prefix-URI Namespace mapping.
	public void startPrefixMapping(String prefix, String uri) {}
	// Handles the ending of the scope of a prefix-URI Namespace mapping.
	public void endPrefixMapping(String prefix) {}
	// Triggers each time a start element is found.
	public void startElement(String namespaceURI, String localName,	String qualifiedName, Attributes atts) {
		System.out.println("Found element: " + localName);

		System.out.println("Attributes:");
		for (int i=0; i<atts.getLength(); i++) {
			System.out.println("Found attribute: " + atts.getLocalName(i) + " with value: " + atts.getValue(i));
		}
	}
	// Triggers each time an end element is found.
	public void endElement(String namespaceURI, String localName, String qualifiedName) {
		System.out.println("Leaving element: " + localName);
	}
	// Handles ignorable whitespace, i.e. whitespace in element content that a validating parser can identify from the DTD.
	public void ignorableWhitespace(char[] text, int start, int length) throws SAXException {}
	// Handles a processing instruction such as <?xml-stylesheet ...?>. The XML declaration (<?xml version="1.0" ...?>) is not reported here.
	public void processingInstruction(String target, String data){}
	// Handles a skipped entity.
	public void skippedEntity(String name) {}
}
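
Implementing ContentHandler directly requires a body for every callback, even the unused ones. One possible alternative is to extend org.xml.sax.helpers.DefaultHandler, which provides empty implementations, and override only what is needed. The sketch below also buffers character data, since a single text node may arrive in several characters() calls. The class name TitleCollector and the <title> child elements in the sample are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class TitleCollector extends DefaultHandler {

    private final List<String> titles = new ArrayList<>();
    private StringBuilder buffer; // non-null only while inside a <title> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The JAXP factory is not namespace-aware by default, so compare the qualified name.
        if ("title".equals(qName)) {
            buffer = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] text, int start, int length) {
        // Append rather than assign: the text may be delivered in chunks.
        if (buffer != null) {
            buffer.append(text, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName) && buffer != null) {
            titles.add(buffer.toString());
            buffer = null;
        }
    }

    // Hypothetical helper: parses the given XML text and returns the content of each <title>.
    public static List<String> collect(String xml) {
        TitleCollector handler = new TitleCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalArgumentException("Could not parse document", e);
        }
        return handler.titles;
    }

    public static void main(String[] args) {
        // Assumed sample document, not taken from the laboratory text.
        System.out.println(collect("<books><book><title>XML &amp; Java</title></book></books>"));
    }
}
```

The handler has to keep its own state (here the buffer field) because SAX does not remember where in the document it is; this is the price of its low memory use.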

DOM

DOM (Document Object Model) is a convention for representing an XML document as a tree of nodes. A DOM parser is included in the Apache Xerces library.

DOM models an XML document as being made up of the following types of nodes:

  • Document node
  • Element nodes
  • Attribute nodes
  • Leaf nodes:
    • Text nodes
    • Comment nodes
    • Processing instruction nodes
    • CDATA nodes
    • Entity reference nodes
    • Document type nodes
  • Non-tree nodes (entity and notation nodes, which have no parent and are reached through the document type node)
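
In the Java DOM API, Node.getNodeType() reports these types as numeric codes, which are hard to read in printed output. The following sketch shows one way to map the codes to names using the constants defined on org.w3c.dom.Node. The class name NodeTypeNames, its two helper methods and the sample document are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class NodeTypeNames {

    // Maps the numeric code returned by Node.getNodeType() to a readable name.
    public static String typeName(Node node) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE: return "Document";
            case Node.ELEMENT_NODE: return "Element";
            case Node.ATTRIBUTE_NODE: return "Attribute";
            case Node.TEXT_NODE: return "Text";
            case Node.COMMENT_NODE: return "Comment";
            case Node.PROCESSING_INSTRUCTION_NODE: return "Processing instruction";
            case Node.CDATA_SECTION_NODE: return "CDATA section";
            case Node.ENTITY_REFERENCE_NODE: return "Entity reference";
            case Node.DOCUMENT_TYPE_NODE: return "Document type";
            default: return "Other (" + node.getNodeType() + ")";
        }
    }

    // Hypothetical helper: builds a DOM tree from XML text held in a string.
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample document, not taken from the laboratory text.
        Document doc = parse("<books><!-- note --><book title=\"A\">text</book></books>");
        Node book = doc.getDocumentElement().getLastChild();
        System.out.println(typeName(doc));
        System.out.println(typeName(doc.getDocumentElement().getFirstChild()));
        System.out.println(typeName(book));
        System.out.println(typeName(book.getAttributes().getNamedItem("title")));
        System.out.println(typeName(book.getFirstChild()));
    }
}
```

Note that the attribute is obtained through getAttributes() rather than getChildNodes(): attribute nodes belong to their element but are not among its children.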

The following fragment of code shows how we could use DOM to traverse an XML tree:

import javax.xml.parsers.*;  // JAXP
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
import java.io.IOException;


public class DOMExample {

	public static void main(String[] args) {
          
		DOMExample iterator = new DOMExample();
		try {
			// Use JAXP to find a parser
			DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
			// Turn on namespace support
			factory.setNamespaceAware(true);
			DocumentBuilder parser = factory.newDocumentBuilder();
      
			// Read the entire document into memory
			Node document = parser.parse(args[0]); 
      
			// Process it starting at the root
			iterator.followNode(document);

		}
		catch (SAXException e) {
			System.out.println(args[0] + " is not well-formed.");
			System.out.println(e.getMessage());
		}   
		catch (IOException e) { 
			System.out.println(e); 
		}
		catch (ParserConfigurationException e) { 
			System.out.println("Could not locate a JAXP parser"); 
		}
  
	}
 
	// Prints a node and then, recursively, all of its descendants (depth-first).
	public void followNode(Node node) {
		// Print information on node. getNodeType() returns one of the numeric constants defined in org.w3c.dom.Node.
		System.out.println("Node name: " + node.getNodeName());
		System.out.println("Node type: " + node.getNodeType());
		System.out.println("Node local name: " + node.getLocalName());
		System.out.println("Node value: " + node.getNodeValue());

		// Process the children.
		NodeList children = node.getChildNodes();
		for (int i = 0; i < children.getLength(); i++) {
			Node child = children.item(i);
			// Recursion on child.
			followNode(child); 
		}    
	}
}

DOM also allows users to create a new XML document or change the structure of an already existing one:

import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.Element;

public class DOMCreatorExample {

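	public static void main(String[] av) throws IOException {
		DOMCreatorExample dc = new DOMCreatorExample();
		Document doc = dc.makeXML();
		// makeXML() returns null on failure; otherwise confirm the tree by printing its root element.
		if (doc != null) {
			System.out.println("Root element: " + doc.getDocumentElement().getNodeName());
		}
	}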

	public Document makeXML() {
		try {
			DocumentBuilderFactory fact = DocumentBuilderFactory.newInstance();
			DocumentBuilder parser = fact.newDocumentBuilder();
			Document doc = parser.newDocument();

			// Declare the nodes as Element so that no casts are needed to set attributes.
			Element root = doc.createElement("books");
			doc.appendChild(root);

			Element book = doc.createElement("book");
			book.setAttribute("title", "Processing XML with Java");
			book.setAttribute("author", "Elliotte Rusty Harold");
			book.appendChild(doc.createTextNode("A complete tutorial about writing Java programs that read and write XML documents."));
			root.appendChild(book);
      
			return doc;

		} catch (Exception ex) {
			ex.printStackTrace();
			return null;
		}
	}
}
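
A document built in memory usually has to be written out as XML text at some point. One common way to do this with the JDK alone is an identity transformation from javax.xml.transform, sketched below. The class name DOMSerializer, its helper methods and the shortened sample content are assumptions made for illustration; other serialization APIs exist and may suit a given project better.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DOMSerializer {

    // Serializes a DOM document to a string using an identity transformation.
    public static String toXml(Document doc) {
        try {
            // A Transformer created without a stylesheet copies its input unchanged.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    // Hypothetical helper: builds a small document similar to the one created above.
    public static Document sample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("books");
            doc.appendChild(root);
            Element book = doc.createElement("book");
            book.setAttribute("title", "Processing XML with Java");
            book.appendChild(doc.createTextNode("A tutorial."));
            root.appendChild(book);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not locate a JAXP parser", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml(sample()));
    }
}
```

Passing a StreamResult that wraps a FileWriter or an OutputStream instead of the StringWriter would write the same XML to a file.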

Exercises
