XPath Performance in Java

In this post we’ll survey some options for evaluating XPath expressions with Java in a more industrial, performance-sensitive fashion. Although not exhaustive our analysis will provide a framework for comparing the solutions and expose some tradeoffs.

This exploration was motivated by a client who needed run a zillion XPaths against a set of fairly large xml documents.

The Contenders

The Code

The following groovy code can be pasted in the Groovy console. You’ll need to add the following jars to the Groovy console classpath via the Script/Add Jar to Classpath menu option.

  • commons-jxpath-1.3.jar
  • jaxen-1.1.1.jar
  • xom-1.2.7.jar

import javax.xml.namespace.*
import javax.xml.parsers.*
import javax.xml.xpath.*

import org.apache.commons.jxpath.JXPathContext
import org.jaxen.dom.DOMXPath



class Timer {
    String name
    long start = System.currentTimeMillis()
    
    Timer(String name) { this.name = name }
    
    def mark() { println "$name\t\t${System.currentTimeMillis() - start}" }
}



def createDoc = { path ->
    InputStream is = new FileInputStream(path)
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance()
    factory.setNamespaceAware(true)
    factory.newDocumentBuilder().parse(is)
}

def createXomDoc = { path ->
    nu.xom.Builder builder = new nu.xom.Builder()
    builder.build(path)
}

def nsMap = [
    "atom": "http://www.w3.org/2005/Atom",
    "xlink": "http://www.w3.org/1999/xlink",
]

NamespaceContext nsCtx = [
    getNamespaceURI: {
        nsMap.get(it)
    },
    getPrefix: {},
    getPrefixes: { nsMap.keySet().iterator() }
] as NamespaceContext



def jdk = { doc, xp, ctx ->
    XPath xpath = XPathFactory.newInstance().newXPath()
    xpath.setNamespaceContext(ctx)
    XPathExpression expr = xpath.compile(xp)
    expr.evaluate(doc, XPathConstants.NODESET)
}

def jxpath = { doc, xp, ctx ->
    JXPathContext jpc = JXPathContext.newContext(doc)
    jpc.setLenient(true)
    ctx.getPrefixes().each { pre ->
        jpc.registerNamespace(pre, ctx.getNamespaceURI(pre))
    }
    jpc.selectNodes(xp)
}

def jaxen = { doc, xp, ctx ->
    DOMXPath xpath = new DOMXPath(xp)
    ctx.getPrefixes().each { pre ->
       xpath.addNamespace(pre, ctx.getNamespaceURI(pre))
    }
    xpath.selectNodes(doc)
}

def xom = { xomDoc, xp, ctx, Closure clo ->
    nu.xom.XPathContext xpc = new nu.xom.XPathContext();
    ctx.getPrefixes().each { pre ->
       xpc.addNamespace(pre, ctx.getNamespaceURI(pre))
    }
    nu.xom.Nodes nodes = xomDoc.query(xp, xpc)
    for (int i = 0; i < nodes.size(); i++) {
        clo.call(nodes.get(i))
    }
}   



// now test the above

//def path = "/home/John/backup/projects/xpath2/usgs_feed.xml"
def path = "/home/John/backup/projects/xpath2/sam.xml"

def doc = createDoc(path)
def xomDoc = createXomDoc(path)

//def xpath = "//atom:entry/atom:id/text()"
//def xpath = "//atom:entry/atom:link[not(@rel)]/@href/text()"
def xpath = "//atom:entry/atom:link[string-length(@rel) = 0]/@href"



def count = 3000 // number of iterations for each solution
//def show = { println it }
def show = { }

2.times {
    Timer t = new Timer("jdk")
    count.times {
        jdk(doc, xpath, nsCtx).each {
            show it.nodeValue
        }
    }
    t.mark()
    
    t = new Timer("jxpath")
    count.times {
        jxpath(doc, xpath, nsCtx).each {
            show it.nodeValue
        }
    }
    t.mark()
    
    t = new Timer("jaxen")
    count.times {
        jaxen(doc, xpath, nsCtx).each {
            show it.nodeValue
        }
    }
    t.mark()
    
    t = new Timer("xom")
    count.times {
        xom(xomDoc, xpath, nsCtx) {
            show it.getValue()
        }
    }
    t.mark()
}

The input xml referenced by the path variable was a snapshot of the atom feed from Sam Ruby’s blog.

The Explanation

The create*Doc() closures in lines 21 and 28 create node trees from XML text. All but XOM use vanilla DOM trees.

The nsMap and nsCtx variables defined in lines 33 and 38 provide XML namespace information to the four eponymous XPath evaluators defined in lines 48, 55, 64, and 72, respectively. These four closures take a node tree, an XPath expression, and a NamespaceContext as arguments and return a collection – except for xom() which takes a fourth closure argument to process each matching node.

After creating node trees from the XML text in lines 90-91 (XOM is again the exception) the code in lines 103-135 calls each processor count times with the same XPath expression and prints the elapsed time. The entire process is then repeated to account for start up and caching effects.

The upfront creation of the DOM trees followed by “intense” processing was inspired by our client’s workflow.

If you’d like to examine the actual results change the count variable to 1 and transpose the commenting on the show() closure in lines 100 and 101. You should see something like this:

/blog/2012/02/25/Hearing-Aid
/blog/2012/02/22/Your-Next-Desktop-Could-be-a-Phone
/blog/2012/02/20/Mulligan
/blog/2012/02/19/OpenID-upgrades
/blog/2012/02/18/WunderWiki
/blog/2012/02/15/Twelve-Steps
/blog/2012/02/13/On-The-Move
/blog/2012/02/09/Dominoes
/blog/2012/02/04/Default-to-Incognito
/blog/2012/02/02/Wunderbar
/blog/2012/01/23/Port-Forwarding
/blog/2012/01/20/AWDwR-updated-for-Rails-3-2
/blog/2012/01/17/The-Presidents-challenge
/blog/2012/01/04/Bootstrapping-Debian-Unstable
/blog/2011/12/24/Ubuntu-vs-Ruby
/blog/2011/11/24/Experience-with-Git
/blog/2011/10/10/Building-Dart
/blog/2011/10/08/Thunderbird-add-ons-in-Flux
/blog/2011/10/03/No-more-XML-parsing-failed-errors
/blog/2011/08/31/AWDwR-updated-for-Rails-3-1
jdk		10

The Results

Here are the times in milliseconds reported on my 8GB 2.4GHz dual processor Fedora box:

jdk		5094
jxpath	1981
jaxen	2275
xom		2298
jdk		4949
jxpath	2076
jaxen	2236
xom		2198

That’s pretty much consistent with our results at the client: JXPath was the fastest and ahead of the JDK by a wide margin for our particular workload. Kudos to my colleague RC for suggesting it.

The Caveats

JXPath implements a large subset of XPath 1.0. If your XPath expressions are wild enough you’ll find some holes. I’ll try to update this post with a few.

Other Options

If your XPath expressions are simple and stable you could dodge the entire node tree creation step by implementing a custom SAX DefaultHandler or XML pull parser.

VTD-XML takes a novel approach to implement “the world’s fastest XPath 1.0 Processor”.

Update

Forgot to add:

java version "1.6.0_31"
Java(TM) SE Runtime Environment (build 1.6.0_31-b04)
Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode)

This entry was posted in Technical. Bookmark the permalink.

Leave a Reply