Welcome to SORTA

System for Ontology-based Re-coding and Technical Annotation

 

Background

SORTA, a matching tool built in MOLGENIS, semi-automatically matches data values to standard codes from ontologies or local terminologies. For each data value, SORTA suggests a list of the most relevant standard codes, ranked by lexical similarity expressed as a percentage. Users can then pick the correct matches from the suggested list.


Demo

Click here for a demo.
The demo version does not have full functionality: data are not saved in the database and are lost when the session expires. To get access to SORTA, please contact the administrator for login credentials. To try it out, use the examples below; clicking one of the two example links takes you directly to the match results.


Matching with the Human Phenotype Ontology

To reproduce the matching results: 1. copy the example below; 2. click the demo link above; 3. paste the example into the text area; 4. select the Human Phenotype Ontology.

Name
Hearing impairment
protruding eyeball
hyperextensibility at elbow joint

Matching with Orphanet

To reproduce the matching results: 1. copy the example below; 2. click the demo link above; 3. paste the example into the text area; 4. select the Orphanet ontology.

Name;Synonym;OMIM
3-oxoacyl-CoA thiolase deficiency;peroxisomal thiolase deficiency;604054
2-ketoglutarate dehydrogenase deficiency;2-ketoglutaric aciduria;203740
acid sphingomyelinase deficiency;sfingomyelinase deficiency;607608

 


Technical design

SORTA combines Lucene with an N-gram string matching algorithm to achieve high performance and accuracy. Lucene matching scores are too abstract for users to interpret and are not comparable with one another. We therefore use the N-gram algorithm to recalculate the similarity scores (as percentages) between data values and the concepts retrieved by Lucene. The recalculated scores are clearer and comparable, which lets us explore a uniform cut-off value.

  • Step 1 - Index the standard concepts in Lucene to establish a knowledge base.
  • Step 2 - Lucene retrieves the most relevant concepts for data values from the knowledge base.
  • Step 3 - The N-gram algorithm is applied to re-calculate the similarity scores between data values and concepts retrieved by Lucene.
  • Step 4 - Users can pick the correct matches from the list of concepts sorted based on N-gram similarity scores.
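
Steps 3 and 4 can be illustrated with a minimal Python sketch. It is not SORTA's actual implementation: this page does not specify the N-gram variant, so the sketch assumes character bigrams with a Dice-style overlap, and a hard-coded candidate list (hypothetical) stands in for the concepts that Lucene would retrieve in Step 2. The function names ngrams, similarity and rerank are illustrative only.

```python
from collections import Counter


def ngrams(text, n=2):
    """Return a multiset of character n-grams for a string.

    Assumption: the text is lower-cased and split on whitespace, and each
    word is padded with '^' and '$' so word boundaries contribute n-grams.
    """
    grams = Counter()
    for token in text.lower().split():
        padded = "^" + token + "$"
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


def similarity(a, b, n=2):
    """Dice-style n-gram similarity between two strings, as a percentage."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 0.0
    # Counter '&' keeps the minimum count of each shared n-gram.
    shared = sum((grams_a & grams_b).values())
    return 100.0 * 2 * shared / total


def rerank(value, candidates, n=2):
    """Re-score candidates (Step 3) and sort them best-first (Step 4)."""
    scored = [(similarity(value, c, n), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Hypothetical candidates; in SORTA these would come from Lucene.
    candidates = ["Proptosis", "Protruding eye", "Hearing impairment"]
    for score, label in rerank("protruding eyeball", candidates):
        print(f"{score:5.1f}%  {label}")
```

Because every score falls on the same 0 to 100 scale regardless of the query, a single cut-off value can be applied across all data values, which is not possible with raw Lucene scores.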

 


 

Ontology model

Standard codes (ontologies) can be imported using the EMX format. The model can be browsed in the web browser, either as a UML diagram or as a flat list.