Monday, May 30, 2011

SADI Workshop

SADI stands for Semantic Automated Discovery and Integration (something you won't find on the SADI Framework website no matter how hard you look) and is a relatively lightweight way to serve and access database information for the Semantic Web. This post is a summary of a two-day SADI workshop. Well, actually, it is a summary of the first day and a half, because on the second day we did the 'practical' part of the workshop. The practical part really didn't go very well. For some reason, we started with the SADI server, which is much more complex than the client would have been. There were numerous connectivity and software issues. I left half way through after the presenter looked at the code, shrugged, and said "I can't fix it." But despite the weakness of the workshop, the SADI framework itself is promising and I recommend people look into it.

OK, if you are already familiar with RDF and OWL and the like, you will want to skip down to Bioinformatics Web Services, where the good stuff starts.


Semantic web:
- extension of current www
- machine understandable annotations
- semantic agreement defined by ontologies
- logic-based representation languages
- automated reasoning
- enables answering of sophisticated questions

Languages (Berners-Lee 2000):
- Unicode + URI
- XML + NS + XML Schema
- RDF + RDF Schema
- Logic (SWRL)
- Proof
- Trust

Ontologies:
- an ontology is a specification of a domain
- eg. 'A cat is a type of animal'
- ontologies are used to infer implicit knowledge

Databases:
- provide structured information
- eg. NCBI/Entrez, EBI/EB-eye
- expose the deep web – surface web = 167 terabytes, deep web = 91,000 terabytes

How do we integrate these resources? We can't create a global schema, changing everyone's database. This is where linked data and RDF come into play.

Eg. a standard table/column representation may hold some data, eg. treatments for depression, with prices. Another table from a different source may list the same treatments, with outcomes and side effects. In the database world, to merge this information, we have to execute a join. But when we map both sources to RDF, we see they use the same identifier, so we don't have to do a join.
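To make that concrete, here is a minimal sketch (not from the workshop; all data and the `ex:` prefix are invented) of why shared identifiers make the join implicit: once both sources use the same URI for a treatment, merging is just a union of triples, and everything known about a resource is already grouped by its URI.

```python
prices = {  # source 1: hypothetical treatment prices
    "ex:NT1": {"price": 40},
    "ex:NT2": {"price": 25},
}
outcomes = {  # source 2: hypothetical outcomes and side effects
    "ex:NT1": {"outcome": "improved", "side_effect": "drowsiness"},
}

# Express both sources as (subject, predicate, object) triples.
triples = set()
for table in (prices, outcomes):
    for subject, row in table.items():
        for predicate, obj in row.items():
            triples.add((subject, predicate, obj))

# The "join" is implicit: everything known about ex:NT1 is already
# grouped under its URI.
about_nt1 = {(p, o) for (s, p, o) in triples if s == "ex:NT1"}
```

No join key matching, no schema alignment: the shared URI does the work.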

In addition, we can take the data we’ve obtained from databases, and add information from ontologies, and information from the web (eg., in the form of published papers). This is especially useful in fields like bioinformatics, where there are numerous well-established standards.


RDF

The Resource Description Framework (RDF) describes resources and the relations between them. A resource is anything represented by a URI; in a graphical representation, a resource is a node in the graph.

Properties link resources. Properties represent binary relations between two resources.

Literals are data values, information that we have about the resources. They are dates, numbers, strings, etc.

Statements are the building blocks of RDF, expressing the relations. A statement is a triple of subject, property, and object, eg.:

(NT1, hasOutcome, NT01)

In Turtle syntax, a declaration such as @prefix ss: introduces a namespace prefix that abbreviates the URIs being used.
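As a quick illustration (not from the workshop), the triple above can be written as a plain Python tuple, with a toy prefix map standing in for @prefix declarations; the `ss:` URI here is invented:

```python
prefixes = {"ss": "http://example.org/ss#"}  # assumed namespace URI

def expand(qname, prefixes=prefixes):
    """Expand a prefixed name like 'ss:NT1' to a full URI."""
    prefix, local = qname.split(":", 1)
    return prefixes[prefix] + local

# The statement (NT1, hasOutcome, NT01) as a (subject, property, object) triple:
statement = (expand("ss:NT1"), expand("ss:hasOutcome"), expand("ss:NT01"))
```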

RDF/XML syntax:
- specifies the namespaces in the header
- a namespace is a document that describes the RDF elements being used
- rdfs: the RDF Schema vocabulary provides the basic elements, ie, 'Resource', 'Class', 'Literal', 'subClassOf', 'domain', 'range', etc.
- rdfs also provides annotations meant for humans, eg. 'label', 'comment', 'seeAlso', 'isDefinedBy'

13 Entailment rules:
Eg. transitivity: if x rdfs:subClassOf y and y rdfs:subClassOf z, then x rdfs:subClassOf z.
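The transitivity rule can be sketched as a fixed-point computation: keep adding derived subClassOf pairs until nothing new appears. This is a naive toy implementation, not how production reasoners work:

```python
def entail_transitive(pairs):
    """Compute the transitive closure of a set of (sub, super) pairs."""
    closed = set(pairs)
    changed = True
    while changed:
        changed = False
        for (x, y1) in list(closed):
            for (y2, z) in list(closed):
                # If x subClassOf y and y subClassOf z, derive x subClassOf z.
                if y1 == y2 and (x, z) not in closed:
                    closed.add((x, z))
                    changed = True
    return closed

inferred = entail_transitive({("Cat", "Mammal"), ("Mammal", "Animal")})
# ("Cat", "Animal") is now entailed, without being explicitly asserted.
```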

An RDF database is a ‘triple-store’. A query on the database is a graph-pattern with variables. The variables will be bound as a result of the query.


SELECT ?treat
WHERE {
  ?treat rdf:type ss:DrugTreatment .
  ?treat ss:targets ss:Depression .
  ?treat ss:hasOutcome ?out .
  ?out rdf:type example:CognitiveMeasure .
}

This may not be explicitly represented in the database; this is where we use the rules.
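To see how "a query is a graph pattern with variables" works mechanically, here is a toy matcher (all data invented, variables marked with a leading '?') that finds the bindings satisfying every triple pattern, the way the query above does:

```python
triples = {
    ("ss:NT1", "rdf:type", "ss:DrugTreatment"),
    ("ss:NT1", "ss:targets", "ss:Depression"),
    ("ss:NT1", "ss:hasOutcome", "ss:NT01"),
    ("ss:NT01", "rdf:type", "example:CognitiveMeasure"),
}

pattern = [
    ("?treat", "rdf:type", "ss:DrugTreatment"),
    ("?treat", "ss:targets", "ss:Depression"),
    ("?treat", "ss:hasOutcome", "?out"),
    ("?out", "rdf:type", "example:CognitiveMeasure"),
]

def matches(pattern, triples):
    """Return variable bindings satisfying every triple pattern."""
    results = [{}]  # start with one empty binding
    for (s, p, o) in pattern:
        next_results = []
        for binding in results:
            for (ts, tp, to) in triples:
                b = dict(binding)
                ok = True
                for (pat, val) in ((s, ts), (p, tp), (o, to)):
                    if pat.startswith("?"):
                        # Variable: must be unbound or bound to this value.
                        if b.get(pat, val) != val:
                            ok = False
                        else:
                            b[pat] = val
                    elif pat != val:
                        ok = False  # constant that doesn't match
                if ok:
                    next_results.append(b)
        results = next_results
    return results

bindings = matches(pattern, triples)
```

The result binds ?treat to ss:NT1 and ?out to ss:NT01, just as the SPARQL query would.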

In addition to queries returning variable bindings, we can construct graphs. We merge data by specifying more than one URI (source). There are SPARQL endpoints on the web that return results in various formats (XML, JSON). Eg. Bio2RDF – 30 billion linked triples of biological data. Bio2RDF supports a faceted text search. Instead of retrieving websites, it retrieves sources where it finds that pattern. You can also retrieve raw data.

Describing a service with SPARQL is essentially a shortcut for a SPARQL query executed on the database:
?s ?p ?o .
?s ?p ?o . FILTER(?s = <http://bio2rdf/ns:id>)
Or you can enter the query into the query window (OpenLink Virtuoso SPARQL query).

RDF and Ontologies

RDF is already an ontology language. But:
- there is no negation (eg., 'a drug is not a disease')
- there is no way to express complex descriptions, eg. "f is a p and a c"
So we need a language above RDF with this expressivity. This is the Web Ontology Language (OWL).



Why do we write axioms in the context of the semantic web? A key goal of the semantic web is to create data such that the data carries its own meaning. Then, most applications would be able to use the data as-is. Often, light-weight semantics is enough: RDF plus shared controlled vocabularies. The meanings of the terms are fixed and shared by different parties. If everyone agrees with the definition, we can just pass data.

But this is not all we want to do in the semantic web. Take even a simple scenario, where your data describes a specific drug manufactured in Italy, and you want it to answer questions about all kinds of drugs used in the EU. You need extra information, eg., that Italy is a part of the EU.

So, we begin with axioms. The typical kinds of axioms are 'class' and 'subClassOf' axioms. In addition, we often write subproperty axioms, eg. 'ObjectProperty: hasAgent' with 'SubPropertyOf: hasParticipant'. These are the main workhorses of knowledge representation on the semantic web. They are used wherever axioms are used at all, and are particularly useful for querying.

Other kinds of axioms often serve as scaffolds. These are auxiliary axioms that support the hierarchy of concepts and properties. They are basically redundant axioms about concepts, and are used to detect inconsistencies. They are used in semantic reasoning the way types are used in programming: you don't strictly need them, but they help identify when something is being used in an unexpected way. Similarly, data validation checks that classes and properties in an RDF dataset are being used correctly; eg., you may be applying a predicate to a resource when it is intended to be applied to other kinds of resources. And specifically in the context of SADI, axioms are used to define the input and output classes of a service.

Knowledge representation as a discipline is the study of how to represent knowledge in such a way as to support reasoning. Description logics are a family of logics tailored for ontology development; they are concept-centric by design.

Description logics contain three main primitives:
- concepts or classes ( = roughly nouns and adjectives)
- properties ( = roughly stative verbs)
- individuals

Description logic has a tradition of distinguishing between general and specific axioms: TBox ("schema") and ABox ("data"). For example:
- ABox axioms are instance and property assertions: "T is an antidepressant. It targets depression."
  - Individual: T
  - Types: Antidepressant
  - Facts: targets Depression
- TBox axioms apply to many instances: "Every TCA drug is an antidepressant."
  - Class: TCA
  - SubClassOf: Antidepressant

Class hierarchies are central for description logics. They are created using subclass axioms. Subclass is transitive, and DL processors should perform this inference without explicit instructions. So if you write A is a subclass of B, and B is a subclass of C, you do not need to write A is a subclass of C.

Property hierarchies can be helpful. If you know something specific, eg., T targets P, then you know that T isRelatedTo P. Property hierarchies are created with subProperty axioms. Thus we have, eg.:
- ObjectProperty: hasAgent and SubPropertyOf: hasParticipant
- ObjectProperty: hasParticipant and SubPropertyOf: isRelatedTo
This is also transitive.

In description logics, we have special notation:
Thing (⊤) denotes the class of all things; semantically, any individual is an instance of Thing, so any class is a subclass of Thing.
Nothing (⊥) denotes the empty class. Semantically, Nothing contains no instances. Nothing is a subclass of any class, and all subclasses of Nothing are empty.

There are Boolean operations on class expressions: unions, intersections and negations. These allow us to form complex class expressions.
- Class Union: the equivalent of class:Q or class:X
- Class Intersection: the equivalent of class:Q and class:X
- Class Complement: the equivalent of not class:Q
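The Boolean operations have a simple set-based intuition (toy data below; note that real OWL reasoning works over an open world, so complement is not computed by enumerating a finite universe like this):

```python
universe = {"aspirin", "prozac", "flu", "depression"}
drug = {"aspirin", "prozac"}
disease = {"flu", "depression"}

union = drug | disease         # Class Union: drug or disease
intersection = drug & disease  # Class Intersection: drug and disease
complement = universe - drug   # Class Complement: not drug
```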

The Axioms:

We can define basic property restrictions in description logics, which (so the urban legend goes) were inspired by relational database constraints: a non-null field requires a proper value; a field with a foreign key references exactly one record. The restrictions are:
- existential property restrictions – some value of the property exists
- universal property restrictions
- cardinality restrictions
Existential property restrictions are used to define IO in SADI. Universal and cardinality restrictions are less evident.

We also have ways of saying things about properties in description logics. Eg. ‘x contains y’ is the same as saying ‘y belongs to x’. We say ‘is_contained_by’ is an inverse property of ‘contains’. In OWL, we just declare that:
ObjectProperty: contained_by
InverseOf: contains
Other special property axioms: transitivity, symmetry, reflexivity.
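An inverse property declaration licenses a simple inference: for every 'contains' triple, the corresponding 'contained_by' triple holds with subject and object swapped. A toy materialization of that (property names from the text, data invented):

```python
triples = {
    ("box1", "contains", "pill1"),
    ("box2", "contains", "pill2"),
}

# ObjectProperty: contained_by InverseOf: contains
inverse = {("contained_by", "contains")}

# Derive the inverse triples by swapping subject and object.
derived = {
    (o, inv, s)
    for (inv, prop) in inverse
    for (s, p, o) in triples
    if p == prop
}
```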

Class disjointness is defined via intersection, ie., we say that two classes are disjoint if their intersection is empty. Disjointness axioms are mainly used for trapping inconsistencies. (Caution: if you think something is disjoint, you'd better be very sure.)

Another type of axiom defines property domains and ranges. These again are mostly traps for inconsistencies. Eg. the property 'targets' might have a domain of 'drug treatment' and a range of 'disease'. It's a lot like typechecking in programming languages.

Description Logic Reasoning Basics

Ontological consistency (aka 'satisfiability') checking: given an ontology, check to see if it implies a contradiction. It's useful by itself, but also:
- concept (class) satisfiability – classes that cannot have instances are suspicious. This can be reduced to ontology consistency: add one axiom, that 'something belongs to C'. If this results in a contradiction, the class cannot be satisfied – no entity can be a member of that class.
- Concept subsumption – is class C1 implicitly a subclass of C2? Check to see if (C1 and not C2) is unsatisfiable, ie, no member of C1 can be outside C2.
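The subsumption check has a closed-world toy analogue: C1 is subsumed by C2 exactly when 'C1 and not C2' has no members. (A real DL reasoner decides this over all possible models, not over one finite instance set; the drug names below are just illustration.)

```python
def subsumed(c1, c2):
    """C1 is subsumed by C2 iff (C1 and not C2) is empty."""
    return len(c1 - c2) == 0

tca = {"imipramine", "amitriptyline"}
antidepressant = {"imipramine", "amitriptyline", "fluoxetine"}

# Every TCA is an antidepressant, but not vice versa.
tca_is_antidepressant = subsumed(tca, antidepressant)
reverse = subsumed(antidepressant, tca)
```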

Classification – discover implicit subclass links between named classes

Instance classification (aka realization) – find all most specific types of an instance described with an ABox. Eg. if Individual:T targets:P then P will be classified as a disease. This can be implemented by repetitive application of the instance check: given an ABox for an instance obj, does obj classify into the class C?

Another reasoning task: conjunctive query answering. Eg. Find all ?D, ?T such that ?D:Drug ?D:isAgent

Tableaux reasoning: tableaux reasoners decide satisfiability. If the ontology is consistent, it has a model, a special structure on which the ontology is true. Given an ontology, a tableaux reasoner tries to find a model for it; if it fails, the ontology is inconsistent.

The main mechanism for this is 'case analysis'. If we have an object that belongs to type A or type B, we replace it with an object that belongs to just one of them, say A, and try to find a model for that simplified ontology. If this succeeds, you have a model for the original ontology. If it fails, try the other type.

Various tableaux reasoners for OWL:
- Pellet – Clark & Parsia, dual-licensed, Java
- FaCT++ – University of Manchester, open source C++ with Java API
- RacerPro
- HermiT

There are also rule-based systems that are not tableaux reasoners.

So – how is OWL related to description logics? OWL is essentially a concrete syntactic incarnation of DL. OWL itself has several serializations (ie, ways of expressing it):
- mapped to RDF, so any RDF syntax can be used to represent OWL
- (etc)

OWL goes beyond description logics in a few ways:
- datatypes in addition to regular classes
- data literals in addition to individuals
- separation of properties into object and data properties
- punning: classes and properties can be used as individuals
- annotation properties

Practical possibilities for reasoning with OWL:
- call reasoners from ontology editors, eg., Protégé. Classification, realization, consistency, and class satisfiability
- stand-alone use of reasoners, eg., SPARQL querying
- programmatic use of reasoners via APIs. Eg, you can request all the subclasses of a given class, including implicit ones.

Demo. Take Protégé, load an ontology with both TBox and ABox sections, run an available reasoner.

Demo with stand-alone Pellet. Open Pellet via its .jar or .sh, check the consistency of one consistent and one inconsistent ontology, classify an ontology, realize an ontology, compute a SPARQL query on it.

Practical Aspects of OWL

There has been a huge increase in the diversity of ontology editors, development environments, etc.

Some best practices:
- become acquainted with the capabilities and limits of KR languages and technologies
- be objective – no distinction without a difference – a subclass must differ from a parent in a significant way
- classes should have at least one instance; and disjoint classes on the same level, and leaf classes, should never share instances
- Clearly state the intent of the ontology. "This ontology describes entities and relationships in 'domain'" vs. a specific list of knowledge intended to be captured, outcomes expected from use, etc.
- Reuse ontologies. Import, reference, and refine.

Design patterns are needed (inspired by software design patterns). They are categorized into three groups:
- lists and n-ary relationships
- modeling, realist mentality

N-ary relationships: create a class of specific individuals that may not exist in the real world. Eg. 'treatment' – we can see diseases, drugs, outcomes, etc.; I cannot see a 'treatment', but it's something that I want to talk about.

- use an OWL ontology in Protégé
  - a simplified pharma ontology

Web Services

Web services are programs that can be called over the internet. The service is executed on the server, a remote system. Typically, we're talking about services called over HTTP with messages encoded in XML.

One common platform is SOAP+WSDL+UDDI. These are services that are discoverable via a registry and which are self-describing.

Reasons to use web services:
- distributed computing – if data resides in one place, other data resides in another, and it all has to be used in combination
- shared use of costly resources – eg., large data, especially live, is difficult to replicate
- platform independence – WS allows you to use your efficient program in Linux from your iPhone app
- software component reuse – a good program already exists, others can use it

Some web service examples include:
- stock quotes
- postal code lookup
- unit conversion
- BLAST genetic sequence search

Main WS Platform: SOAP+WSDL+UDDI

- SOAP – simple object access protocol – standardizes the message format and communications protocol. It's a generic XML-based way of passing any XML content; it relieves programmers of having to develop libraries to handle messages. The message consists of a body containing the payload and a header carrying auxiliary information, such as authentication data, IDs of messages to be acknowledged, etc.

- WSDL – web services description language – an XML format that uses XML Schema to define what your service does, specifically: (i) input/output types, using XML Schema types, (ii) concrete ways (bindings) the service can be called, and (iii) concrete locations (endpoints) where the service can be called. The main part of any WSDL is the definition of input and output types, in XML Schema.

- UDDI – universal description, discovery and integration – an XML-based standard for registries that facilitate finding services. Once the service is written and described with WSDL, you want it to be discoverable. UDDI is a searchable, categorized collection of service descriptions in XML. White pages: info about the provider – the name and nature of the business, the URL, etc. Yellow pages: categorization of the services in terms of standard nomenclature. Green pages: more technical info, eg. WSDL. (Caution: there is no real support for UDDI any more; it's up to version 4-point-something, but nobody uses it.)

Why would you want to mix services with semantics? There are three major goals the WS community is trying to achieve:
- automated discovery – find the right kind of services that consume the right kind of data and produce the right kind of data
- automatic composition – compose (connect) services – eg. a service that produces a list of employees connects to a service that consumes a list of people
- automatic invocation – even if a service is semantically and syntactically correct, it may produce the wrong sort of result; this allows specification of services that produce a specific result.
The first two goals can be accomplished by semantic categorization: assign a class/type/category to a service to enable its discovery. Eg. categorize a service as a 'LatLongCoordinatesComputation' service, and ensure that this is a subclass of 'location service'. Discovery also works via specifying input and output data types.

SAWSDL is an attempt to standardize this – it extends WSDL with semantic categorizations. The sawsdl:modelReference attribute maps operation, input and output elements to URIs of ontology classes. SAWSDL can be used for service matchmaking: if you know how the service you are looking for is categorized, and how its IO is categorized, you can write a 'fake' SAWSDL document, which becomes the input for a query. (Aside: compare this with OWL-S.)

Semantic categorization goes a long way, but:
- one can have several services with the same IO but doing different things
- fine categorization of services only alleviates the problem, but doesn’t solve it.
- In general, categorizing a service and IO doesn’t explicitly define what the service computes, ie., the relation between input and output.

Bioinformatics Web Services

Bioinformatics web services tend to be much simpler than W3C examples. Specifically:
- they tend to be stateless, atomic
- they are data-centric
They tend to ignore existing semantic standards for reasonable reasons. Eg. They ignore SAWSDL because it’s so lightweight. But OWL-S is, by contrast, huge and daunting; people have no desire to go into that level of complexity. Lately they’ve tended even to reject basic standards for web services, eschewing SOAP (in favour of RESTful services) and UDDI (in favour of BioMoby and BioCatalogue).

The existing standards were too complicated; they require too much effort for not enough gain.

SADI – you use simple HTTP operations, no project specific protocol. Services consume and produce RDF data, no message scaffolding. OWL is used to describe the service interface, reusing existing ontologies where possible. No SOAP, no XML schema.

The input is an RDF graph rooted at a particular node; the output is an RDF graph rooted at the same node. Hence there is an explicit relationship between input and output. Eg., in a temperature converter, instead of accepting '6 C' as input and returning '38 F' as output, you attach new unit-value pairs to the same temperature object. The output has the same root node as the input, that is, the node has the same URI.
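Here is a sketch of that input/output relationship (URIs, property names, and the service function are all invented for illustration; triples are plain tuples): the service adds new property values to the same node rather than returning a detached result.

```python
def temperature_service(input_graph, node):
    """Attach a Fahrenheit reading to the node carrying a Celsius one."""
    output = set(input_graph)  # output includes the input graph
    for (s, p, o) in input_graph:
        if s == node and p == "ex:celsius":
            # New unit-value pair attached to the SAME node (same URI).
            output.add((node, "ex:fahrenheit", o * 9 / 5 + 32))
    return output

# Using 100 C here so the float arithmetic is exact.
graph = {("ex:temp1", "ex:celsius", 100)}
result = temperature_service(graph, "ex:temp1")
# The output graph is rooted at the same URI, ex:temp1.
```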

SADI is basically a philosophy for expressing web services:
- a service is identified by an HTTP URL, which is used both to identify and to invoke the service
- the service responds to HTTP GET with an RDF document that describes the service (note that these are different URIs)
- the service description indicates the input OWL class and the output OWL class
- the service responds to HTTP POST by invoking the service; the input is RDF containing OWL-class instances, the output is RDF
- indicate errors with the appropriate HTTP error code (importantly, don't send an HTTP 200 OK with an error document)

To call a SADI service, send a POST where the POST body is an RDF document, with the appropriate Content-Type HTTP header.

Eg. see slide module 3, 3.2.1
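A minimal sketch of constructing that POST (no network call is made; the service URL and the empty RDF/XML payload are placeholders, not a real SADI endpoint):

```python
import urllib.request

# Placeholder RDF/XML payload describing the input instances.
rdf_body = b"""<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
</rdf:RDF>"""

# Build the request: the body is an RDF document, and the Content-Type
# header says so.
request = urllib.request.Request(
    url="http://example.org/sadi/service",   # placeholder service URL
    data=rdf_body,                           # POST body: RDF document
    headers={"Content-Type": "application/rdf+xml"},
    method="POST",
)
# urllib.request.urlopen(request) would actually invoke the service.
```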

For asynchronous services, the server immediately sends a stub in response, and the client polls periodically for the rest. The response code is 202 (accepted but incomplete), with rdfs:isDefinedBy pointing at where to poll. A response to a poll can return 302 (Moved Temporarily); the service suggests how long to wait using the HTTP Retry-After header. The response, when it arrives, is appended to the original response.
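The polling loop can be sketched like this (a toy, not SADI client code: `make_fake_server` stands in for real HTTP responses, modelled as (status, headers, body) tuples):

```python
import time

def make_fake_server(ready_after_polls):
    """Simulate a server that answers 202 until the result is ready."""
    state = {"polls": 0}
    def fake_server():
        state["polls"] += 1
        if state["polls"] < ready_after_polls:
            return (202, {"Retry-After": "0"}, None)  # still working
        return (200, {}, "final RDF result")          # done
    return fake_server

def poll_until_done(server, max_polls=10):
    """Poll until the server stops answering 202, honouring Retry-After."""
    for _ in range(max_polls):
        status, headers, body = server()
        if status == 202:
            time.sleep(float(headers.get("Retry-After", "0")))
            continue
        return body
    raise TimeoutError("service did not finish")

result = poll_until_done(make_fake_server(ready_after_polls=3))
```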

For a parameterized SADI service, eg., one taking a substitution matrix, etc.: the solution is to create a parameterized OWL class; the client sends an instance of that class along with the request.

The SADI registry: clients need to find SADI services. Ideally you would search for them in the wild, but we can't find them that way, so we have a registry. Enter a URL in the form to register. There's a SPARQL query endpoint and a Java API.

Service Interface Definition

The service description describes services by their input and output properties. The property restrictions describe the data consumed by the service. All property restrictions must be satisfied by each input instance. The service description can point to URIs that define the input classes accepted by the service.

In these input classes, instances should be dynamically identifiable. Use necessary and sufficient conditions, and avoid universal and exact cardinality specifications. This addresses the 'open world assumption' - the idea that just because you don't know something is true, you can't say it's false. Eg. say 'minimum one name', not 'exactly one name'.

You can also specify input classes by specifying particular classes to an input property. Eg., if I am only interested in compounds, not genes, I can specify in my input that it must be of class 'compound'. Use 'some' or 'someValuesFrom' restriction to define this. By contrast, don't use 'allValuesFrom' - this is a universal condition that can never be satisfied because of the open world assumption.

An input class can also combine multiple restrictions: eg. a 'datedValue' must have both a date and a value. Again, the idea is to identify necessary and sufficient conditions.

Output classes use property restrictions to describe data produced by the service. Instances don't have to be dynamically identifiable. The range of attached data should be indicated by the property restriction. Ie., clients should be able to identify services not only by what they output but by the specific values of what they output.


(Notes on getting Perl modules to load locally)

Set the PERL5LIB environment variable, and set CPAN's install base dir:

vi /home/stephen/.bashrc

#Perl stuff

edit .cpan/CPAN/

'makepl_arg' => q[INSTALL_BASE=~/Library/Perl],
'mbuild_install_arg' => q[--install_base=~/Library/Perl],

See also the SADI Google group.

(end notes)


Seriously, folks, if you're going to do a workshop on new technologies:

- don't do more than a day of theoretical stuff before even trying anything practical, and work from more accessible stuff to more complex stuff; do a little demo, some hands-on, then explain the theory behind it, rinse, lather, repeat.

- don't assume everybody knows the notation you are using; as you write things out, say what they mean, rather than just saying "this" and pointing

- if you are demonstrating an environment on the screen, make sure it looks the same as your participants' - and not all customized with labels and shortcuts and such

- make sure you know the software - if the instructions on the website say 'download, run make, run the installer' then don't tell participants 'just use CPAN'

- if your demonstration server is a box sitting on your desktop in your office, don't slam it with 30 download requests all at once (and when you do that, don't blame local connectivity, which was perfectly fine)

- make sure the software works. That means, don't commit a new untested version of the software the night before the workshop.

- when you do all of this and your practical session blows up, don't blame the participants. It's not their fault the software doesn't work.

The last is the most important. I've done lots of tech demos. They break; it happens, and everyone understands. You patch things up as best you can. But blaming the participants is a cheap shot. There's nothing they can do to fix your mess, so there's nothing they can do to escape the blame. It just makes everyone feel bad and turns what could have been a worthwhile experience into nastiness and recriminations.
