OpenSearch with Scala

OpenSearch is a specification for accessing search services. A website that provides OpenSearch functionality advertises the capability by including a link element like the following in its XHTML:

<link title="dzone" type="application/opensearchdescription+xml" rel="search" href="http://www.dzone.com/links/DZone-opensearch.xml"/>

The href attribute above points to the site’s OpenSearch description document. This document tells clients how to consume the OpenSearch services provided by the site.

<?xml version="1.0" encoding="UTF-8"?>
<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
    <ShortName>DZone Search</ShortName>
    <Description>DZone Search Provider</Description>
    <Url type="text/html"
         template="http://www.dzone.com/search.html?query={searchTerms}"/>
</OpenSearchDescription>

The “Url” elements in the OpenSearch description contain search templates: URLs with curly-brace parameters that a client replaces with the desired values. Besides {searchTerms}, common parameters include {startPage?} and {count?}. A trailing question mark marks a parameter as optional.
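Template substitution can be sketched in a few lines. This is a minimal illustration, not part of the client below: the TemplateFill object and its fill() method are hypothetical names, and the sketch assumes that optional parameters with no value should be replaced by an empty string and that values should be URL-encoded.

```scala
import java.net.URLEncoder
import java.util.regex.Matcher

// Hypothetical helper for illustration; not part of the client in this post.
object TemplateFill {
  // Matches {name} and {name?}; group 1 is the name, group 2 the optional marker.
  private val Param = """\{([^{}?]+)(\?)?\}""".r

  // Replace every parameter in the template with its URL-encoded value.
  // Optional parameters without a value become the empty string;
  // a required parameter without a value is an error.
  def fill(template: String, values: Map[String, String]): String =
    Param.replaceAllIn(template, m => {
      val name = m.group(1)
      val optional = m.group(2) != null
      val replacement = values.get(name) match {
        case Some(v) => URLEncoder.encode(v, "UTF-8")
        case None if optional => ""
        case None =>
          throw new IllegalArgumentException("missing required parameter: " + name)
      }
      // Escape $ and \ so the value is inserted literally.
      Matcher.quoteReplacement(replacement)
    })
}
```

For example, TemplateFill.fill("http://example.com/search?q={searchTerms}&n={count?}", Map("searchTerms" -> "tungsten")) leaves the count empty instead of sending the literal text {count?} to the server.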

A description can contain more than one Url element, one for each flavor of search, for example when the site offers results in more than one media type. Although “text/html” is common on the web, “application/atom+xml” is more sophisticated, and it is the type the client below asks for by default.
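Picking the template for a given media type might look like the sketch below. It is only an illustration under a few assumptions: the DescriptionTemplates object and templateFor() are hypothetical names, the description is already in memory as a string, and the JDK's DOM parser is used here instead of scala.xml so the snippet needs nothing beyond the JDK.

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

// Hypothetical helper for illustration; not part of the client in this post.
object DescriptionTemplates {
  private val OpenSearchNs = "http://a9.com/-/spec/opensearch/1.1/"

  // Returns the template of the first Url element whose type attribute
  // equals mediaType, or None if the description has no such element.
  def templateFor(descriptionXml: String, mediaType: String): Option[String] = {
    val factory = DocumentBuilderFactory.newInstance()
    factory.setNamespaceAware(true)
    val doc = factory.newDocumentBuilder()
      .parse(new ByteArrayInputStream(descriptionXml.getBytes("UTF-8")))
    val urls = doc.getElementsByTagNameNS(OpenSearchNs, "Url")
    (0 until urls.getLength)
      .map(i => urls.item(i).asInstanceOf[Element])
      .find(_.getAttribute("type") == mediaType)
      .map(_.getAttribute("template"))
  }
}
```

Returning an Option lets the caller fall back to another media type, say from “application/atom+xml” to “text/html”, when the site does not offer the preferred one.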

Modern browsers provide UI support for OpenSearch. If the page you are viewing in Firefox provides OpenSearch services, you can add it to your menu of quick search engines.

Let’s examine a simple OpenSearch client written in the Scala programming language.

package synaptic

import java.io._
import java.net._
import scala.xml._

import com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl

// Find and consume a web site's OpenSearch service.
object Search {

  // HTTP GET to InputStream
  def get(url : String) : InputStream = {
    val u = new URL(url)
    val conn = u.openConnection()
    conn.getInputStream()
  }
  
  // HTTP GET to String
  def getAsString(url : String) : String = {
    scala.io.Source.fromInputStream(get(url)).getLines().mkString("\n")
  }
  
  // get OpenSearch search template from web site
  def getSearchTemplate(url : String,
      mediaType : String = "application/atom+xml") : String = {
    // get page containing an opensearch descriptor
    var xml : Elem = XML.load(get(url))
    
    // parse OpenSearch description href from a link element
    var href : String = (( xml \\ "link").filter(
      u => { (u \ "@rel").text == "search" &&
      (u \ "@type").text == "application/opensearchdescription+xml" })(0)
      \ "@href").text
    
    // resolve the description link against the page URL; an absolute
    // href is kept as-is, and root-relative ("/x") and page-relative ("x")
    // hrefs are both handled
    href = new URL(new URL(url), href).toString

    // get OpenSearch description
    xml = XML.load(get(href))
    
    // parse OpenSearch template from a Url element
    ((xml \\ "Url").filter(
        u => { (u \ "@type").text == mediaType })(0) \ "@template").text
  }
  
  // perform search against template with a few OpenSearch parameters
  def search(template : String,
      query : String,
      startPage : Int = 1,
      pageSize : Int = 5) : String = {
    // URL-encode the query so spaces and reserved characters survive
    val url = template.replace("{searchTerms}", URLEncoder.encode(query, "UTF-8"))
      .replace("{startPage?}", startPage.toString)
      .replace("{count?}", pageSize.toString)

    getAsString(url)
  }

  // search for "tungsten"-related results on www.nature.com
  def main(args: Array[String]): Unit = {
    makeXmlLoadWorkWithXhtml()

    val url = "http://www.nature.com"
   
    val template = getSearchTemplate(url)
    
    val results = search(template, "tungsten")
    
    println(results)
  }
  
  // only included to make XML.load() work with xhtml
  def makeXmlLoadWorkWithXhtml() = {
    System.setProperty("javax.xml.parsers.SAXParserFactory",
        "synaptic.MyXMLParserFactory")
  }  
}

// only included to make XML.load() work with xhtml
@throws(classOf[org.xml.sax.SAXNotRecognizedException])
@throws(classOf[org.xml.sax.SAXNotSupportedException])
@throws(classOf[javax.xml.parsers.ParserConfigurationException])
class MyXMLParserFactory extends SAXParserFactoryImpl() {
  super.setFeature("http://xml.org/sax/features/validation", false)
  super.setFeature("http://apache.org/xml/features/disallow-doctype-decl", false)
  super.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false)
  super.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)
}

The getSearchTemplate() method retrieves a site’s xhtml, finds the OpenSearch description href in a link element, makes that href absolute if necessary, retrieves the description, and extracts the first OpenSearch template whose type matches the requested media type.
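The "make the href absolute" step is easy to get subtly wrong, so here is a small sketch of one way to do it with java.net.URL. The HrefResolve object and resolve() are hypothetical names used only for this illustration.

```scala
import java.net.URL

// Hypothetical helper for illustration; not part of the client in this post.
object HrefResolve {
  // Resolve href against the URL of the page it was found on.
  // An absolute href is returned as-is; a relative href is resolved
  // against the page URL.
  def resolve(pageUrl: String, href: String): String =
    new URL(new URL(pageUrl), href).toString
}
```

With a page URL of http://www.example.com/links/index.html, an href of /opensearch.xml resolves to http://www.example.com/opensearch.xml, while opensearch.xml resolves to http://www.example.com/links/opensearch.xml.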

The search() method substitutes the query, start page, and page size into a given search template, performs the search with an HTTP GET, and returns the response body as a string.

Here’s one of the search results returned from the above search. It’s an Atom entry embellished with extension elements from other XML vocabularies (the dc, prism, and sru prefixes below).

    <entry>
        <title>Characterization of exposures among cemented tungsten carbide workers. Part I: Size-fractionated exposures to airborne cobalt and tungsten particles</title>
        <link href="http://dx.doi.org/10.1038/jes.2008.37"/>
        <id>http://dx.doi.org/10.1038/jes.2008.37</id>
        <updated>2011-07-15T13:47:05+00:00</updated>
        <content type="xhtml">
            <div xmlns="http://www.w3.org/1999/xhtml">         <p>
                    <b>Characterization of exposures among cemented tungsten carbide workers. Part I: Size-fractionated exposures to airborne cobalt and tungsten particles</b>
                </p>
                <p>Journal of Exposure Science and Environmental Epidemiology 19, 475 (2008). <a href="http://dx.doi.org/10.1038/jes.2008.37">doi:10.1038/jes.2008.37</a></p>
                <p>Authors: Aleksandr B Stefaniak, M Abbas Virji &amp; Gregory A Day</p>
</div>
        </content>
        <dc:identifier>doi:10.1038/jes.2008.37</dc:identifier>
        <dc:title>Characterization of exposures among cemented tungsten carbide workers. Part I: Size-fractionated exposures to airborne cobalt and tungsten particles</dc:title>
        <prism:productCode>jes</prism:productCode>
        <dc:creator>Aleksandr B Stefaniak</dc:creator>
        <dc:creator>M Abbas Virji</dc:creator>
        <dc:creator>Gregory A Day</dc:creator>
        <prism:publicationName>Journal of Exposure Science and Environmental Epidemiology</prism:publicationName>
        <prism:issn>1559-0631</prism:issn>
        <prism:eIssn>1559-064X</prism:eIssn>
        <prism:doi>10.1038/jes.2008.37</prism:doi>
        <dc:publisher>Nature Publishing Group</dc:publisher>
        <prism:publicationDate>2008-07-16</prism:publicationDate>
        <prism:volume>19</prism:volume>
        <prism:number>5</prism:number>
        <prism:startingPage>475</prism:startingPage>
        <prism:endingPage>491</prism:endingPage>
        <prism:url>http://dx.doi.org/10.1038/jes.2008.37</prism:url>
        <prism:channel/>
        <prism:section/>
        <prism:genre>Research Article</prism:genre>
        <prism:copyright>© 2009 Nature Publishing Group</prism:copyright>
        <sru:recordSchema>info:srw/schema/11/pam-v2.1</sru:recordSchema>
        <sru:recordPacking>unpacked</sru:recordPacking>
        <sru:recordData>
            <pam:message xmlns:pam="http://prismstandard.org/namespaces/pam/2.0/" xsi:schemaLocation="http://prismstandard.org/namespaces/pam/2.0/ http://prismstandard.org/schemas/pam/2.1/pam.xsd">
                <pam:article>
                    <xhtml:head xmlns:xhtml="http://www.w3.org/1999/xhtml"/>
                </pam:article>
            </pam:message>
        </sru:recordData>
        <sru:recordPosition>1</sru:recordPosition>
        <sru:extraRecordData/>
    </entry>

A couple of caveats: this client parses the site’s initial HTML as XML, so if the page isn’t well-formed XML (as is often the case) you’re out of luck. Also, Scala’s XML.load() seems to have a problem with xhtml; the MyXMLParserFactory cruft at the bottom of the client is a workaround. Full details are available in this post
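For pages that are not well-formed, one possible fallback is to skip XML parsing and scan the markup for link tags with regular expressions. The sketch below is a heuristic, not a real HTML parser: LinkScan and openSearchHref() are hypothetical names, only quoted attribute values are recognized, and character entities in the href are not decoded.

```scala
// Hypothetical fallback for illustration; not part of the client in this post.
object LinkScan {
  // A whole <link ...> tag, case-insensitive, possibly spanning lines.
  private val LinkTag = """(?is)<link\b[^>]*>""".r
  // name="value" or name='value'; group 3 or group 4 holds the value.
  private val Attr = """(?is)([a-z][a-z0-9:-]*)\s*=\s*("([^"]*)"|'([^']*)')""".r

  // Collect the quoted attributes of one tag, with lower-cased names.
  private def attributes(tag: String): Map[String, String] =
    Attr.findAllMatchIn(tag).map { m =>
      val value = if (m.group(3) != null) m.group(3) else m.group(4)
      m.group(1).toLowerCase -> value
    }.toMap

  // Returns the href of the first link element that advertises an
  // OpenSearch description, or None if the page has no such element.
  def openSearchHref(html: String): Option[String] =
    LinkTag.findAllIn(html).map(attributes).collectFirst {
      case attrs if attrs.get("rel").exists(_.equalsIgnoreCase("search")) &&
        attrs.get("type").exists(_.equalsIgnoreCase("application/opensearchdescription+xml")) &&
        attrs.contains("href") => attrs("href")
    }
}
```

The returned href may still be relative, so it needs the same absolutizing step that getSearchTemplate() applies.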

