Naming conventions - SourceForge



Naming Conventions for

Controlled Vocabularies (CVs) and Ontologies



1 Rationale for this document

1.1 Target audience

1.2 Naming Conventions

1.2.1 What are the profits from a Naming Convention ?

1.2.2 Separation between Knowledge and Implementation Levels

2 (Meta-) Reference Terminology

2.1 Naming representational artifacts

2.1.1 Terminology or Vocabulary

2.1.2 Semi structured data

2.1.3 Controlled Vocabulary

2.1.4 Glossary

2.1.5 Dictionary

2.1.6 Hierarchy

2.1.7 Taxonomy, Meronymy

2.1.8 Folksonomy

2.1.9 Thesaurus (Structured Vocabulary)

2.1.10 Directed acyclic graph, DAG

2.1.11 Object model

2.1.12 Ontology

2.1.13 Knowledgebases

2.1.14 RAs on the interpretation continuum

2.2 Naming representational units

3 General principles for creating sound RUs

3.1 Univocity

3.2 Positivity

3.3 Objectivity

3.4 Try to avoid multiple parenthood first

3.5 Avoid overloaded term names

4 Naming Classes

4.1 Implementing the Class name

4.2 Class name precision

4.2.1 Avoid linguistic ellipses

4.3 Synonyms

4.3.1 Avoid different sorts of Synonyms

4.3.2 Property synonyms

4.4 Acronyms and Abbreviations

4.5 Registered Product- and Company-names

4.6 Lexical Properties of class names

4.6.1 Capitalisation

4.6.2 Character set

4.6.3 Formattings

4.6.4 Punctuation

4.6.4.1 Word separators

4.6.4.2 Hyphens, dash and slash

4.6.5 Wordform and tense

4.6.5.1 Plurals and sets

4.6.6 Word length and word compositions

4.6.6.1 Compound vs. atomic names for representational units

4.6.6.2 Splitting and merging classes

4.6.7 Affixes (prefix, suffix, infix and circumfix)

4.6.8 Logical connectives

4.6.9 "Taboo" words and Characters

4.6.10 Specific language requirements

5 Depicting representational units within text

6 Class definitions

6.1 General rules for creating sound normalized definitions

6.2 Property definitions

6.2.1 Implementation of definitions

7 Unique identifiers

7.1 Capturing the class name and ID using the autoID plugin in Protégé-owl

7.2 Life science Identifier, (LSID: )

8 Namespace

9 Location of webaccessible repository

10 Ontology Imports in Protégé-owl

10.1 The “lang” attribute issue

10.1.1 Import

10.1.1.1 Importing from repositories (extracted from the Protégé wiki)

10.1.1.2 Changing the imported ontology to be the newest updated version

11 Properties (Attributes and Relations)

11.1 Assigning "key-properties" to top level classes

12 Naming of Ontology files and Ontology Versions

13 Ontology updating procedures

14 OBO vs OWL

15 Acknowledgements

16 References

1 Rationale for this document

This document defines naming conventions for controlled vocabularies (CVs) and ontologies. Metadata annotation elements are not covered here; these are addressed in the document [1].

These recommendations have been developed to guide the activities of the Metabolomics Standards Initiative (MSI) [2] Ontology Working Group (OWG) [3].

The MSI OWG seeks to facilitate the consistent description of metabolomics experiment components by reaching a consensus on a core set of CVs and then developing an ontology. The CVs are developed in close collaboration with the HUPO Proteomics Standards Initiative (PSI) [4] and structured as taxonomies in OWL and OBO format. The ontology is developed as part of the Ontology for Biomedical Investigations (OBI, previously ‘FuGO’) [5], a larger, multi-domain collaborative effort.

These naming conventions are also used in the context of the OBI, developed in OWL.

The table below gives an overview of which sections of the document are relevant to purely taxonomic CVs and which are relevant to the development of ontologies.

[To be added]:

|Document Sections |Relevant to CVs |Relevant to Ontology |

|2 (example….) |( |NO |

| | | |

| | | |

| | | |

1.1 Target audience

This document is addressed to those involved in the preparation and review of symbolic representational artefacts like taxonomies, controlled vocabularies and ontologies.

1.2 Naming Conventions

From ISO/IEC 11179-5:2005(E) (file c035347_ISO_IEC_11179-5_2005(E)-1.zip):

The name for an administered item is specified within a context. A naming convention (NC) describes what is known about how names are formulated. It may be simply descriptive, e.g. where the Registration Authority has no control over the formulation of names for a specific context and merely registers names that already exist. This naming convention is prescriptive: it specifies how names shall be formulated, and a Registration Authority is expected to enforce compliance. The objective of this prescriptive naming convention is to normalize name consistency, name appearance, and name semantics. The NC can also enforce the exclusion of irrelevant facts about administered items.

The naming conventions reference document shall cover all relevant documentation aspects:

• the scope of the naming convention, e.g. established industry name;

• the authority that establishes names;

• semantic rules governing the source and content of the terms used in a name, e.g. terms derived from data models, terms commonly used in the discipline, etc.;

• syntactic rules covering required term order;

• lexical rules covering controlled term lists, name length, character set, language;

• a rule establishing whether or not names must be unique.

In addition to the scope and authority rules needed to document descriptive naming conventions, prescriptive conventions should be documented by semantic, syntactic, lexical, and uniqueness rules. Semantic rules enable meaning to be conveyed. Syntactic rules relate items in a consistent, specified order. Lexical (word form and vocabulary) rules reduce redundancy and increase precision. A uniqueness rule documents how to prevent homonyms occurring within the scope of the naming convention.

Semantic rules: Just an example….

a) Object classes represent things of interest in a universe of discourse that may, for instance, be found in a model of that universe. EXAMPLE Cost

b) One and only one object class term shall be present.

Syntactic rules:

a) The object class term shall occupy the first (leftmost) position in the name.

b) Qualifier terms shall precede the part qualified. The order of qualifiers shall not be used to differentiate names.

c) The property term shall occupy the next position.

d) The representation term shall occupy the last position. If any word in the representation term is redundant with any word in the property term, one occurrence will be deleted. EXAMPLE Cost Budget Period Total Amount
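The syntactic rules above can be sketched in a few lines of Python. This is only an illustration, not part of ISO/IEC 11179-5: the function name `compose_name` is hypothetical, and the way the example name is split into an object class term ("Cost"), a property term ("Budget Period Total") and a representation term ("Amount") is an assumption made for the sketch.

```python
def compose_name(object_class: str, property_term: str, representation_term: str) -> str:
    """Assemble a name in the order: object class, property, representation."""
    property_words = property_term.split()
    seen = {word.lower() for word in property_words}
    # Rule d: a representation word that repeats a property word is dropped,
    # so that only one occurrence remains in the final name.
    representation_words = [
        word for word in representation_term.split() if word.lower() not in seen
    ]
    return " ".join(object_class.split() + property_words + representation_words)


# Assumed decomposition of the example name used in the text.
print(compose_name("Cost", "Budget Period Total", "Amount"))
# A redundant "Amount" in the property term is emitted only once.
print(compose_name("Cost", "Budget Period Total Amount", "Amount"))
```

Both calls yield the same name, because the word shared by the property and representation terms is kept only once.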

Lexical rules:

a) Nouns are used in singular form only. Verbs (if any) are in the present tense.

b) Name parts and words in multi-word terms are separated by underscores. No special characters are allowed.

c) All words in the name are in mixed case. The rules of “mixed case” are defined by the RA. These rules may be different for different parts of the administered item name (object class, property, representation class).

d) Abbreviations, acronyms, and initialisms are allowed. EXAMPLE Cost Budget Period Total Amount

Uniqueness rule:

All names in each language shall be unique within this context.

Or:

Lexical rules:

a) Nouns are used in singular form only, unless the concept itself is plural. Verbs (if any) are in the present tense.

b) Name parts are separated by capitalizing the first character of the second through nth word.

c) All words in the name are in mixed case.

d) Abbreviations, acronyms, and initialisms are allowed only when used normally within business terms.

e) Words contain letters and numbers only.

Uniqueness rule:

All names shall be unique within a DTD.
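Rules of this kind lend themselves to automatic checking. The following is a minimal sketch, assuming the second rule set (capitalized word boundaries, letters and numbers only, unique names). The names `NAME_PATTERN`, `to_mixed_case` and `check_names` are hypothetical, and the regular expression is one possible reading of rule e, not a normative definition.

```python
import re

# Assumed reading of rule e: letters and digits only, starting with a letter.
# Word boundaries are marked by capitals (rule b), so separators are rejected.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def to_mixed_case(words: list[str]) -> str:
    """Join words, capitalizing the first character of the second through nth word."""
    first, *rest = words
    return first + "".join(word[:1].upper() + word[1:] for word in rest)


def check_names(names: list[str]) -> dict[str, list[str]]:
    """Report names that break the character rule or the uniqueness rule."""
    problems: dict[str, list[str]] = {"bad_characters": [], "duplicates": []}
    seen: set[str] = set()
    for name in names:
        if not NAME_PATTERN.fullmatch(name):
            problems["bad_characters"].append(name)
        if name in seen:
            problems["duplicates"].append(name)
        seen.add(name)
    return problems


name = to_mixed_case(["cost", "budget", "period"])
report = check_names([name, "cost_budget", name])
print(name, report)
```

A check like this covers only the lexical and uniqueness rules; the semantic rules still need human review.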

ISO 11179:

A part of ISO/IEC 11179 develops a set of principles, methods, and procedures for specifying what is needed (at a minimum) to document the association between the various types of administered items and one or more classification schemes. This includes the names, definitions, and other aspects of the classification scheme and its contents. These can be captured through use of a set of attributes. Particular attributes are specified in ISO/IEC 11179, along with a structure for the contents of these attributes. Users may extend the set of attributes as necessary. Additional information may accompany a taxonomy or ontology; for example, to provide a suggested set of qualifiers that could be applied to the object class, property, or representation taxa to more fully qualify the classification of the particular administered item. A part of ISO/IEC 11179 summarizes the basic attributes and model specified in ISO/IEC 11179-3:2003. An example in this part of ISO/IEC 11179 shows how selected components of data elements can be associated with a classification scheme through the attributes specified in this part of ISO/IEC 11179. Use of one or more classification schemes is intended to provide a sound conceptual basis for the development of metadata having enhanced semantic purity and design integrity.

Scope of this NC:

The scope of a naming convention specifies the range within which it is in effect. The scope of a naming convention may be as broad or narrow as the Registration Authority, or other authority, determines is appropriate. The scope should document whether the naming convention is descriptive or prescriptive.

This NC document proposes recommendations for naming Knowledge Representation (KR) idioms (classes and relationships), also called representational units, in an effort to facilitate communication and development. Defining naming conventions and then strictly adhering to them will simplify ontology alignment and merging. The conventions proposed here have been defined on the assumption that the ontology will be posted under the Open Biomedical Ontologies (OBO) umbrella [4] and ultimately built in OWL. This document can be downloaded from

1.2.1 What are the profits from a Naming Convention ?

A rigorous formal and logically consistent way of naming RUs eases

• Indexing and Categorisation of RUs

• Integrated tool access across different ontologies

• Ontology alignment (mapping), difference detection and merging (e.g. through PROMPT)

• Consistent visualisation

• Unified understanding of meaning to humans as well as web agents

• Avoidance of masked redundant content

The overall benefit is easier access to different ontologies through a unified mechanism, and thereby better exploitation of the available ontological resources. These naming conventions were initially created for the MSI ontologies, but also to foster the integration of RAs within the OBO ontology libraries. Applying these NCs eases integrated access to the OBO ontologies through meta-tools, e.g. those currently being developed by the NCBO BioPortal.

[To be worked on:]

Authority:

Semantic principle:

Semantics concerns the meanings of name parts and possibly separators that delimit them. The set of semantic rules documents whether or not names convey meaning, and if so, how.

ISO/IEC 11179-5:2005(E)

Name parts may be derived from structure sets that identify relationships among (classify) members.

Syntactic principle:

Syntax specifies the arrangement of parts within a name. The arrangement may be specified as relative or absolute, or some combination of the two.

Relative arrangement specifies parts in terms of other parts, e.g., a rule within a convention might require that a qualifier term must always appear before the part being qualified appears.

Lexical principle:

Lexical issues: preferred and non-preferred terms, synonyms, abbreviations, part length, spelling, permissible character set, case sensitivity, etc. The result of applying lexical rules should be that all names governed by a specific naming convention have a consistent appearance.

1.2.2 Separation between Knowledge and Implementation Levels

The description of a metadata standard should distinguish between the ontology conceptualization and an ontology implementation, i.e. a concrete realization of an ontology in a particular representation language (in various languages, syntaxes, versions etc.). This separation rests on the observation that any ontology is based on a language-independent conceptual model. The conceptualization represents the view of the engineering team on the application domain, which is then implemented using an ontology editor and stored in a specific format. The same conceptualization might result in several implementations, with various classes, properties and axioms, depending on the concrete representation paradigm, language and syntax. An Ontology Conceptualization (OC) represents the abstract or core idea of an ontology. It describes the core properties of an ontology, independent of any implementation details. An Ontology Implementation (OI) represents a specific implementation of a conceptualization. Therefore, it describes implementation-specific properties of an ontology.
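The distinction can be illustrated with a small sketch in which one language-independent record is rendered in two different syntaxes. Everything here is an assumption made for illustration: the `ConceptualClass` structure, the `EX:` identifiers and the two rendering functions are hypothetical, and the output covers only a tiny fragment of the OBO flat-file and OWL functional-style syntaxes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConceptualClass:
    """Language-independent description of one class (the conceptualization)."""
    identifier: str
    name: str
    parent: Optional[str] = None


def to_obo(cls: ConceptualClass) -> str:
    """One implementation: an OBO-style [Term] stanza."""
    lines = ["[Term]", f"id: {cls.identifier}", f"name: {cls.name}"]
    if cls.parent:
        lines.append(f"is_a: {cls.parent}")
    return "\n".join(lines)


def to_owl_functional(cls: ConceptualClass) -> str:
    """Another implementation: OWL functional-style axioms."""
    iri = ":" + cls.identifier.replace(":", "_")
    axioms = [
        f"Declaration(Class({iri}))",
        f'AnnotationAssertion(rdfs:label {iri} "{cls.name}")',
    ]
    if cls.parent:
        axioms.append(f"SubClassOf({iri} :{cls.parent.replace(':', '_')})")
    return "\n".join(axioms)


# Hypothetical class with hypothetical identifiers.
spinach = ConceptualClass("EX:0000002", "spinach", parent="EX:0000001")
print(to_obo(spinach))
print(to_owl_functional(spinach))
```

The record itself corresponds to the OC; each rendering function produces one possible OI of it.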

2 (Meta-) Reference Terminology

Papers: RefTermin, Interpretation continuum, What are the differences…, DAG,

First, we clarify the terminology used to talk about the idioms that are the subject of this text. We introduce a common reference terminology to harmonize cross-domain understanding of the things being discussed.

[Here Graphic: Andrew, Ontogenesis…]

Knowledge representations (KR, also called representational models) are referred to with the term ‘representational artefact’ (RA). A representational artefact is made of related ‘representational units’ (RU, also known as KR idioms), in most cases classes and properties.

RAs and RUs can only be introduced holistically. They cannot easily be presented one after another, because each idiom relates closely to all the others. Readers should therefore not expect to understand every term immediately when going through this text in order. Understanding builds up gradually: you may have to read the whole text several times, and your grasp of each chapter will deepen with each pass. Do not worry if some words remain unclear on first reading.

2.1 Naming representational artifacts

The most often cited RAs will be described highlighting their differences.

2.1.1 Terminology or Vocabulary

Any set of symbols or terms (in most cases words or word compositions) used for communication that can be interpreted by the addressee in the way intended by the addresser. 'Interpreted' means that the terms are felt to be descriptive: receiving them induces some kind of understanding or conceptual model which ideally overlaps as much as possible with the conceptual model of the addresser. In this sense a terminology is the medium for exchanging knowledge models. Language-related terminologies consist of words suitable for describing a domain of interest.

Key characteristic (primary intrinsic quality, or quale): Intended meaning

2.1.2 Semi structured data

Semi-structured data are usually understood as documents that contain free-text fragments structured in accordance with some schema. Typical semi-structured documents are forms and tables, which have a strict structure (fields, parts, etc.) while the content of each part is free text.

2.1.3 Controlled Vocabulary

Any terminology that is maintained by some registration authority or standardisation body (which can be very small, e.g. a single working group) in the sense that the terms used are controlled. 'Controlled' means that the sense or the appearance of the terms is defined in a consistent manner. All terms should have unambiguously defined and non-redundant meanings. Usually homonyms (one term that refers to different meanings) are resolved and synonyms (different terms that refer to the same meaning) are captured. The word "CV" does not say anything about the structure of the terminology or RA, i.e. a CV can be a simple list of terms or an ontology. Formal statements about the relationships between the terms can be made but are not required. A CV does not have to state anything about the meaning of its terms, but usually informal definitions are provided for each term.

Key characteristic: A standard body enumerates and defines the terms for usage.

Usually all RAs which have ID enumerated RUs are CVs.

A controlled vocabulary is a representational artefact based on a hierarchical list of standardized terms which are used to annotate data. The relation between parent and child terms used to build the hierarchy is an 'is_a' (subsumption) relationship and all terms are agreed standards.

or: (taken from )

A *controlled vocabulary* is a list of terms that have been enumerated explicitly. This list is controlled by and is available from a controlled vocabulary registration authority. All terms in a controlled vocabulary should have an unambiguous, non-redundant definition. This is a design goal that may not be true in practice. It depends on how strict the controlled vocabulary registration authority is regarding registration of terms into a controlled vocabulary. At a minimum, the following two rules should be enforced:

• If the same term is commonly used to mean different concepts in different contexts, then its name is explicitly qualified to resolve this ambiguity.

• If multiple terms are used to mean the same thing, one of the terms is identified as the preferred term in the controlled vocabulary and the other terms are listed as synonyms or aliases
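The two minimum rules above can be sketched as a small data structure. This is an illustrative sketch only: the class `ControlledVocabulary`, its methods, and the example terms and definitions are hypothetical and not taken from any existing CV.

```python
class ControlledVocabulary:
    """Minimal CV: unique names, one preferred term per meaning, synonyms as aliases."""

    def __init__(self) -> None:
        self._definitions: dict[str, str] = {}  # preferred term -> definition
        self._synonyms: dict[str, str] = {}     # synonym -> preferred term

    def register(self, term: str, definition: str, synonyms: tuple[str, ...] = ()) -> None:
        """Add a preferred term; reject any name that is already in use."""
        taken = self._definitions.keys() | self._synonyms.keys()
        for name in (term, *synonyms):
            if name in taken:
                # Rule 1: an ambiguous name must be explicitly qualified instead.
                raise ValueError(f"'{name}' is already registered; qualify it")
        self._definitions[term] = definition
        for synonym in synonyms:
            self._synonyms[synonym] = term

    def preferred(self, name: str) -> str:
        """Rule 2: resolve a synonym or alias to its preferred term."""
        return name if name in self._definitions else self._synonyms[name]


cv = ControlledVocabulary()
# Homonym resolved by qualification (hypothetical entries).
cv.register("column (chromatography)", "A tube holding the stationary phase.",
            synonyms=("LC column",))
cv.register("column (table)", "A vertical series of cells in a table.")
print(cv.preferred("LC column"))
```

Registering the bare name "column" a second time would raise an error here, which mirrors the role of a registration authority that refuses ambiguous entries.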

2.1.4 Glossary

A list of terms and their explanation in natural language.

WIKI: A glossary is a list of terms in a particular domain of knowledge with the definitions for those terms. Traditionally, a glossary appears at the end of a book and includes terms within that book which are either newly introduced or at least uncommon.

In a more general sense, a glossary contains explanations of concepts relevant to a certain field of study or action.

2.1.5 Dictionary

Any list whose entries refer to entries in another list. In contrast to a thesaurus, a dictionary usually defines words.

2.1.6 Hierarchy

A hierarchy is a nested set of symbols or terms (in most cases words or word compositions). In a hierarchy the principle used to build the nested structure is not specified; it can be any transitive relation (e.g. part-of, is-a, …) and even multiple relations at the same time (???). The term refers to the graph structure and does not specify the semantics behind the parent-child relationship. In this sense nested XML elements are hierarchical when displayed as such, but the meaning of 'B being nested in A' is not defined within the XML. A hierarchy means whatever its hierarchical relationship means.

There are single-parent hierarchies (mono-hierarchies, and DAG???) and multiple-parent hierarchies (poly-hierarchies or DCG ???), in which one term can be found under more than one parent. Multiple parenthood is a well-established practice for profiting from multiple inheritance of properties.

Key characteristic: Graph structure
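Whether a given hierarchy is a mono-hierarchy or a poly-hierarchy can be read directly from its graph structure. The sketch below assumes the hierarchy is given as a list of (child, parent) pairs; the function names and the example terms are hypothetical.

```python
from collections import defaultdict


def parents_by_term(edges: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Collect the direct parents of every child term."""
    parents: dict[str, set[str]] = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)
    return parents


def is_mono_hierarchy(edges: list[tuple[str, str]]) -> bool:
    """True when no term has more than one direct parent."""
    return all(len(p) <= 1 for p in parents_by_term(edges).values())


mono = [("finger", "hand"), ("hand", "arm")]
poly = mono + [("finger", "digit")]  # 'finger' now has two parents
print(is_mono_hierarchy(mono), is_mono_hierarchy(poly))
```

Note that the check says nothing about what the parent-child relation means, which matches the definition of a hierarchy given above.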

2.1.7 Taxonomy, Meronymy

When the hierarchy is built from one transitive relation only, i.e. the nested (child) term stands in an 'is-a' (generalisation-specialisation) relationship to its parent term throughout, we speak of a taxonomy (from the Greek verb τάσσειν or tassein = "to classify" and νόμος or nomos = law, science, cf. "economy"). Taxonomy was once only the science of classifying living organisms.

A taxonomy is a hierarchy built according to one intrinsic property of the items to be taxonomized.

A taxonomy is a collection of controlled vocabulary terms organized into a hierarchical structure. Each term in a taxonomy is in one or more parent-child relationships to other terms in the taxonomy. There may be different types of parent-child relationships in a taxonomy (e.g., whole-part, genus-species, type-instance), but good practice limits all parent-child relationships to a single parent to be of the same type. Some taxonomies allow poly-hierarchy, which means that a term can have multiple parents. This means that if a term appears in multiple places in a taxonomy, then it is the same term. Specifically, if a term has children in one place in a taxonomy, then it has the same children in every other place where it appears. There is not one generally accepted definition so far.

The difference between a classification and a taxonomy is that a taxonomy classifies in a structure according to one defined relation between the entities and that a classification uses more arbitrary (or extrinsic) grounds. As an example of intrinsic grounds, spinach is a vegetable and not every vegetable is spinach, so spinach is a subclass of vegetable. The decision to place spinach in the class vegetable is based upon data intrinsic to the entities, so this would be a piece of taxonomy (a taxonomy with a subclass hierarchy). An extrinsic reason could be for instance classification of building components according to the branches of the building industry. This would lead to a classification, not a taxonomy. A taxonomic relation is a relation between entities in the taxonomy (a subclass relation for instance), a classification relates the entities to something that is external (like branches of an industry or safety classes).

A taxonomy can also be built according to other transitive relations.

When the relation is of 'part-of' type, then we call such a taxonomy a Meronymy. For example, 'finger' is a meronym of 'hand' because a finger is part of a hand.
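Because the relation used throughout is transitive, indirect parents can be derived from the direct ones. The following sketch shows this for an is-a taxonomy and for a meronymy. The function `ancestors` is hypothetical, and the statements "vegetable is-a food" and "hand part-of arm" are added here purely for illustration.

```python
def ancestors(term: str, parent_of: dict[str, set[str]]) -> set[str]:
    """All direct and indirect parents of a term under ONE transitive relation."""
    found: set[str] = set()
    stack = list(parent_of.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parent_of.get(current, ()))
    return found


# Taxonomy: every edge is an 'is-a' edge.
is_a = {"spinach": {"vegetable"}, "vegetable": {"food"}}
# Meronymy: every edge is a 'part-of' edge.
part_of = {"finger": {"hand"}, "hand": {"arm"}}

print(ancestors("spinach", is_a))   # spinach is-a vegetable, hence also is-a food
print(ancestors("finger", part_of)) # finger part-of hand, hence also part-of arm
```

Keeping the two relations in separate structures reflects the requirement that a taxonomy or meronymy uses one relation only.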

2.1.8 Folksonomy

A collection of terms allocated to resources by end users in order to categorise or index them in a way that these end users consider useful is called a folksonomy. Terms in such 'democratic' folksonomies are typically added in a fast, pragmatic, decentralized and uncontrolled manner, without necessarily making the underlying structures or principles explicit. The process of folksonomic data annotation (in most cases website annotation) is intended to make a body of information increasingly easier to search, discover, and navigate over time. A well-developed folksonomy is accessible as a shared vocabulary that is both originated by, and familiar to, its primary users. Part of the appeal of folksonomy is its independence from search engine (e.g. Google) censorship.

2.1.9 Thesaurus (Structured Vocabulary)

A thesaurus is an associatively networked CV. The terms refer to each other through different relations. A thesaurus does not need to have a taxonomic structure. Usually it is a list of controlled terms that verbally refer to each other. The relationships are informal and vary in their level of detail (they can be simple 'broader than' or even 'related to' relations). A formal definition of a thesaurus designed for indexing is: "A list of every important term (single-word or multi-word) in a given domain of knowledge and a set of related terms for each term in the list." (wiki).

The NCI Thesaurus is represented in OWL and would be better called the NCI ontology.

2.1.10 Directed acyclic graph, DAG

A DAG is a directed graph with no directed cycles; that is, for any node, there is no nonempty directed path starting and ending on itself. DAGs appear in models where it doesn't make sense for a node to have a path to itself.
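The acyclicity condition can be tested mechanically. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later), which raises `CycleError` when no topological order exists. The helper name `is_dag` and the example graphs are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def is_dag(neighbours: dict[str, set[str]]) -> bool:
    """True when the directed graph (node -> linked nodes) has no directed cycle."""
    try:
        # Consuming the order forces the sorter to inspect the whole graph.
        tuple(TopologicalSorter(neighbours).static_order())
    except CycleError:
        return False
    return True


acyclic = {"finger": {"hand"}, "hand": {"arm"}}
cyclic = {"a": {"b"}, "b": {"c"}, "c": {"a"}}
print(is_dag(acyclic), is_dag(cyclic))
```

The direction in which the edges are read does not matter for this test, since reversing every edge of a graph neither creates nor removes a directed cycle.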

2.1.11 Object model

Wiki: In the computing discipline object model has two related but distinct meanings:

1. The properties of objects in general, in a specific computer programming language, technology, notation or methodology that uses them. For example, the Java object model, the COM object model, or the object model of OMT. Such object models are usually defined using concepts such as class, message, inheritance, polymorphism, and encapsulation. There is an extensive literature on formalized object models as a subset of the formal semantics of programming languages.

2. A collection of objects or classes through which a program can examine and manipulate some specific part of its world. In other words, the object-oriented interface to some service or system. Such an interface is said to be the object model of the represented service or system. For example, the HTML Document Object Model (DOM) [1] is a collection of objects that represent a page in a web browser, used by script programs to examine and dynamically change the page. There is a Microsoft Excel object model [2] for controlling Microsoft Excel from another program.

Definitions of Object model on the Web (Google. define: Object model):

A description of the structural relationships among components of a library object including its metadata.

cs.cornell.edu/wya/DigLib/MS1999/glossary.html

An object model defines the structural relationships and dynamic interaction between a group of related objects.

resources/Glossary.html

a group of objects that work together for a common purpose. The JavaScript object model comprises all the elements that make up a Web page.

builder.5100-31-5076597.html

The definition of an abstract representation that is used for real data, devices, operator stations, programs, event conditions, and event enrollments.

glossary_menue/glossary_tase2.html

A collection of related managed objects forming a logical and consistent grouping.

en/US/products/ps6456/products_programming_usage_guide_chapter09186a0080490c90.html

From: Technical Committee H7 , Object Model Features Matrix, Doc. No.: X3H7-93-007v12b, Doc. Date: May 25, 1997 :

The H7 Object Model Features Matrix is organized by rows denoting various object models (or object-oriented languages/systems), and columns denoting specified object model features. The intent is to describe each object model (language/system) with respect to the specified features (an entry is intended to be text describing the model's support for the feature, not "yes" or "no"). In the text version, the presentation of the matrix is in column order; that is, each column is defined, and the entries for each row for that column follow. This is to facilitate comparing models according to a given feature. In this Web version, each row and column has a separate page.

The term object model also refers to the collection of concepts used to describe objects in a particular object-oriented language, specification, or analysis and design methodology, and corresponds closely to the use of the term data model in "the relational data model". Thus, we speak of "the Smalltalk object model" or "the OMG object model". This is in contrast to the use of object model to describe the collection of objects created to model a particular system or application, as in "the Automatic Teller Machine object model" or "the object model of a windowing system" [RBPE+91]. From our point of view, [RBPE+91] defines a particular object model (our sense), which includes concepts like object, inheritance, attribute, and so on, and uses it to define the object models (second sense) of various applications. This dual usage is unfortunate, but is common in the literature.

2.1.12 Ontology

Ontology has a long-established tradition in philosophy. The topic is treated as 'Categories' in Aristotle's Metaphysics, but the word 'ontology' itself was first established in the 17th century.

The Encyclopaedia Britannica defines ontology as “the theory or study of being as such; i.e., of the basic characteristics of all reality”. This is a philosophy centered definition.

The field has expanded rapidly with the advent of information technology, and within IT the word has shifted in meaning.

“Ontology” is the buzzword used on the internet when discussing the Semantic Web. The Web Ontology Working Group at W3C emphasises that ontologies are a machine-readable set of definitions that create a taxonomy of classes and subclasses and relationships between them.

The word ontology was introduced to the biological community through the Gene Ontology, an effort that in fact built a taxonomic CV. This has created much confusion over what an ontology is. An ontology resembles both a kind of taxonomy plus definitions and a kind of knowledge representation language that allows additional relations to be captured, not just the one used to build the taxonomic structure. In an ontology one can specify the relation that is used to build the hierarchy.

A clear border between a rich “taxonomy” and a “simple ontology” is nevertheless hard to define.

The fundamental difference between a classification and an ontology is in the richness of information available. Both provide a list or structure of classes, but a classification stops at that point, whereas an ontology also provides further information on the classes such as definitions, attributes and relations.

Ontology is defined in the DIP Glossary as “The formalization of a terminology (set of terms and possibly their interrelations) used in some domain of discourse. An ontology represents consensual knowledge about a domain of discourse (in form of terms and possible interrelation among them) in a formal way that can be shared between agents and makes this knowledge accessible by machines. …” The most popular definition of an ontology from the Semantic Web and AI perspective is the one provided in [Gruber, T. R. (1993). A translation approach to portable ontologies. Knowledge Acquisition, 5(2):199-220, 1993. ]: “An ontology is an explicit specification of a conceptualization”, where “a conceptualization is an abstract, simplified view of the world that we wish to represent for some purpose.” Another widely used extended definition is provided in [Borst, P.; Akkermans, H.; Top, J. (1997). Engineering Ontologies. International Journal of Human-Computer Studies, (46)365-406, 1997.]: “An ontology is a formal, explicit specification of a shared conceptualisation.”

Ontologies can be considered RAs intended to represent knowledge in the most formal and reusable way possible. Formal ontologies (as considered in AI) are represented in logical formalisms (like OWL) which allow automatic inference over them or over datasets aligned to them.

We would describe an ontology as a CV expressed in a formal representation language, which makes it possible to formally capture a defined semantics. The best-known representation languages used to structure ontologies are OWL (DL semantics) and OBO. Ontology representation languages differ in their semantic expressivity. Ontologies are rich enough to express meanings as formal, and hence computer-accessible, models through the use of defined, related RUs. Ontology representation languages have a defined syntax, semantics and grammar. The use of any one of the following semantic idioms is usually regarded as making a CV an ontology: object properties, cardinalities, restrictions and axioms.

2.1.13 Knowledgebases

From Data, Information, and Process Integration with Semantic Web Services (DIP), and :

Knowledge Base (KB) is a term with a wide usage and multiple meanings. It can be seen as a dataset described through some formal semantics. A KB, similar to an ontology, is represented with respect to a knowledge representation (or just a logical) formalism, which usually allows automatic inference. It could include multiple axioms, definitions, rules, facts and statements. In contrast to ontologies, KBs are not intended to represent a (shared/consensual) schema, a basic theory, or a conceptualization of a domain. Thus, the ontologies are just a specific case of knowledge bases.

In short, a knowledge base is the ontology in use: it is instantiated, or its classes are used to annotate data. In this sense the ontology is the T(erminological)Box, and the data annotated through the TBox is called the A(ssertional)Box. Both together make a knowledge base.
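The split between TBox and ABox can be shown with a toy knowledge base. This is a minimal sketch under simplifying assumptions: each class has at most one superclass, the class hierarchy is acyclic, and the class and instance names are hypothetical.

```python
# TBox: terminological knowledge, here class -> direct superclass.
tbox = {"mass_spectrometer": "instrument", "instrument": "device"}
# ABox: assertional knowledge, here instance -> asserted class.
abox = {"instrument_42": "mass_spectrometer"}


def inferred_classes(instance: str, tbox: dict[str, str], abox: dict[str, str]) -> list[str]:
    """Asserted class of an instance followed by all its superclasses."""
    classes = []
    current = abox[instance]
    while current is not None:
        classes.append(current)
        current = tbox.get(current)  # None once the top of the hierarchy is reached
    return classes


print(inferred_classes("instrument_42", tbox, abox))
```

The ABox states only one fact about the instance; the remaining class memberships follow from the TBox, which is a very small case of the automatic inference mentioned above.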

14 RAs on the interpretation continuum

[To be refined:]

We can sort the different RAs according to their formality and semantic expressivity:

Lassila and McGuinness have presented an ontology spectrum that presents various levels of formalization (2001 Deborah L. McGuinness. Ontologies come of age. In Dieter Fensel, Jim Hendler, Henry Lieberman, and Wolfgang Wahlster, editors, Spinning the semantic web: bringing the world wide web to its full potential. MIT press, 2002. Available on-line at ).

Along this spectrum are:

catalogs: a finite list of terms

glossary: list of terms and natural language meanings

thesauri: relating terms by synonymy, typically non-hierarchical but hierarchy perhaps deducible from broader/narrower description

taxonomies

informal is-a: a hierarchically arranged scheme without strict subclassing – the example provided is of Yahoo’s “categories”

formal is-a: strict control of inheritance

ontologies

frames: including property information, with inheritance of properties

value-restrictions: constraints on properties

2 Naming representational units

We recommend using the term ‘class’ to refer to the representational unit that models a ‘universal’ in an ontological representational artefact. Each class has a ‘class name’, a term (string) to designate the class. An ‘Instance’ is the representation of a ‘particular’ in reality. A particular instantiates a universal and an instance instantiates a class. Properties of universals are represented through representational units called ‘properties’. Properties which have fillers of simple datatypes (e.g. integer, string, boolean, ...) are called ‘attributes’ or ‘datatype properties’. Properties which have classes or instances as their fillers (also called ‘range’) are called ‘relations’ or ‘object properties’. Confusingly other formats use the word "property" for restrictions. The word ‘domain’ can mean a group of classes that a property is asserted to (in owl), but also describes the area of interest of a representational artefact.

For a detailed recommendation have a look at the full paper:

General principles for creating sound RUs

The Reality of Computer Based Information: [modify, maybe move to metadata advantages…]

A number of problems are found as a result of the way data is held in information systems:

• Arbitrary or inappropriate restrictions on the data that can be held.

• History data cannot be held.

• Fudge or false data may be introduced to overcome restrictions.

• Uncontrolled redundancy of data requiring reconciliation of different versions.

• Difficulty in integrating data from different sources because of incompatibility in definitions and format.

• The same RU may be replicated.

• The same functionality may be replicated.

All of these problems either restrict the way a project develops or increase its costs. Some financial and time penalties are:

• Translating data is expensive. The cost of interfaces to translate the meaning of data can account for 25-70% of the total cost of a system development project.

• The need to translate data means that users of different systems can often only share data sequentially, and not concurrently. This can extend the time required for critical business processes.

• There is a slower response to the need for change in systems. Interfaces cost time as well as money.

• Quality suffers. Duplication of data is inefficient and invites errors, which may lead to inferior business decisions.

• Staff time is wasted trying to locate and reconcile data.

Imposing arbitrary or inappropriate restrictions through the RU means:

• Arbitrary or inappropriate restrictions are placed on the data that can be held.

• Fudge or false data may be introduced to overcome the restrictions in the RU. This may have to be programmed around.

• The entity type will only work within the context defined. A change in business rules may require a change in the database structure.

• The resultant system is harder to share.

Failing to correctly recognise entity types means:

• The same RU may be replicated.

On RU names:

One of the biggest problems in managing data is identifying what is being talked about. That is, what is a sound basis for identifying and naming entity types? In order to be able to hold data about something we need to identify what it is. In order to be able to share data about something, we need to have a consistent view of what it is about, independent of the context for a particular use.

When data is context dependent, then it means that the data could mean something else in another context. In order to make such data independent of its context, the context must be made an explicit part of the data, rather than something assumed.

Often you will find that the same word falls in more than one place. This is quite normal and can be for a number of reasons.

• The word represents a complex concept.

• The word is a homonym (has more than one meaning).

• The discussion brings related concepts to light.

When this happens all the mappings should be considered valid, at least initially. Additional mappings should be created and added in the relevant place, perhaps with a qualifying word or number to differentiate them.

---

Become acquainted with the capabilities and limitations of both the representation formalism and its implementation (an ontology engineering tool) of your choice.

Save often! Always save to a new version number including the date. Protégé-OWL is not yet completely stable. Undo is difficult and bugs occasionally corrupt ontologies beyond retrieval.

Don’t get into 'analysis paralysis'! You will not get it right the first time! Sometimes one has to throw things away and start again. Do not fall into ‘naïve euphoria’ either. Not every fancy, just-built piece of representation is an ontology worth bothering others with.

General Ontology Engineering Axioms:

Every class has at least one instance at the KB level

Distinct classes on the same level and leaf classes never share instances
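Both axioms can be checked mechanically once instances are recorded per class. The following is a minimal sketch under the assumption that the knowledge base can be exported as a mapping from class name to a set of instance identifiers; the class and instance names are hypothetical.

```python
from itertools import combinations

# Hypothetical leaf classes with the instances asserted for them.
instances_by_class = {
    "plant_cell": {"c1", "c2"},
    "animal_cell": {"c3"},
    "fungal_cell": set(),          # violates the first axiom
}


def empty_classes(instances_by_class):
    """Classes that have no instance at the KB level."""
    return sorted(c for c, members in instances_by_class.items() if not members)


def overlapping_classes(instances_by_class):
    """Pairs of classes that share at least one instance."""
    return [
        (a, b)
        for a, b in combinations(sorted(instances_by_class), 2)
        if instances_by_class[a] & instances_by_class[b]
    ]


print(empty_classes(instances_by_class))
print(overlapping_classes(instances_by_class))
```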

1 Univocity

Names of RUs (including the ones for relations) should have the same meaning on every occasion of use and refer to the same universals and kinds of entities in reality. Each name should refer to exactly one RU, and each RU should represent exactly one entity in reality (a universal in the case of a class). In effect, it should unambiguously refer to the same entity in reality. Note that this principle of univocity excludes homonyms, terms that are used as names of more than one RU. For example, if you use the term ‘cell’ as a name of the class representing (the type of) cells as found in all organisms, the same term should not be used as a name for a more specialized class representing (the type of) cells as found only in plants. Likewise, the term ‘part of’ should not be used to name more than one relation, e.g., partonomy, set membership, etc.

Furthermore:

Don’t confuse universals with ways of getting to know types

Don’t confuse universals with ways of talking about types

Don’t confuse universals with data about types
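The part of the univocity principle that excludes homonyms lends itself to an automatic check. The sketch below assumes that RUs are available as (identifier, name) pairs; the identifiers and the helper find_homonyms are hypothetical.

```python
from collections import defaultdict

# Hypothetical (identifier, name) pairs for representational units.
units = [
    ("RU_0001", "cell"),
    ("RU_0002", "plant_cell"),
    ("RU_0003", "cell"),      # homonym: second RU carrying the name 'cell'
    ("RU_0004", "part_of"),
]


def find_homonyms(units):
    """Return every name that is attached to more than one RU identifier."""
    ids_by_name = defaultdict(set)
    for identifier, name in units:
        ids_by_name[name].add(identifier)
    return {name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1}


print(find_homonyms(units))
```

Such a check only finds identical strings; deciding whether two differently named RUs denote the same universal still requires a human curator.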

2 Positivity

Complements of classes such as ‘non-mammal’ or ‘non-membrane’ are not necessarily themselves classes and don’t designate genuine universals. Similarly, do not represent the absence of a wing as the presence of the non-existence of a wing, e.g.: 'wing' has_status "absent". The positivity recommendation may need to be weakened; sometimes it can make sense to have e.g. an "ex-vivo" role or a “non-living_organism”.

3 Objectivity

No distinction without a difference. A child class must differ from its parent class in a distinctive way. A child class must share all the properties of its parent classes (inheritance principle) and have additional ones that the parents have not. Each class must be defined in a formula which states the necessary and sufficient conditions for being an instance of the corresponding universal. The sibling class of a given parent class should have differentia which are really distinct. This means that the universals of these classes at least have distinct (ideally non-overlapping = single inheritance) extensions. The distinction between each pair of siblings must be explicitly represented (opposition principle).

Which universals exist is not a function of our biological knowledge. Be aware that terms such as ‘unknown’ or ‘untypified’ or ‘unlocalized’ do not designate genuine universals. To characterize classes, formulate intrinsic properties (properties that are inherent to the universal represented by the RU) rather than extrinsic ones (properties that are asserted from outside, e.g. accession numbers). ‘Intrinsic’ describes a characteristic or property of some thing or action which is essential and specific to that thing or action, and which is wholly independent of any other object, action or consequence. A characteristic which is not essential or inherent is extrinsic (from ).

4 Try to avoid multiple parenthood first

No class in the hierarchy should have more than one superclass when starting to build an ontology. Multiple inheritance can generate subtle but systematic ambiguity in the meaning of formal relations like is_a and part_of within the ontology. One should not press "is_a" into service to mean a variety of different things (see the univocity principle). Domain experts should build single-parenthood taxonomies of their views of reality. Other domain experts build the same for theirs, and only later are all these taxonomies ‘multidimensionally’ aligned within OBO; secure common nodes will then result, which make consistent (!) multiple inheritance possible.
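A simple way to monitor this recommendation is to list all classes with more than one asserted parent. This is a minimal sketch assuming the hierarchy is available as (child, parent) pairs; the class names are hypothetical.

```python
# Hypothetical is_a assertions as (child, parent) pairs.
is_a = [
    ("plant_cell", "cell"),
    ("guard_cell", "plant_cell"),
    ("guard_cell", "epidermal_cell"),   # second asserted parent
]


def multiple_parents(is_a):
    """Map each class with more than one asserted superclass to its parents."""
    parents = {}
    for child, parent in is_a:
        parents.setdefault(child, set()).add(parent)
    return {c: sorted(p) for c, p in parents.items() if len(p) > 1}


print(multiple_parents(is_a))
```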

There are, however, many opinions on this issue, and we might discuss this matter further when we feel there is a real need for multiple parenthood. Alan Rector’s normalisation and untangling practices also have to be discussed here…

5 Avoid overloaded term names

The use of overloaded terms such as “experiment”, “method”, “technique” and “instructions” has to be avoided. They are ambiguous and have many meanings across diverse domains. In this example, a series of events or actions should be represented as a single atomic “Protocol” or as a collection of atomic “Protocols” rather than by the terms above.

Naming Classes

Each class representing a universal in a representational artefact is labelled with a human readable class name. Class names should be short, easy to remember and as self-explanatory as the pragmatic compromise allows. This class name should be used as default browser key when navigating through the class hierarchy and should therefore be as intuitive as possible to the ontology engineer building the ontological structure. However this class name will not necessarily be used as the main search attribute by the end-users when they are searching for classes. For this a short and intuitive class name should be captured as preferred synonym, which would be the term of highest usage frequency found in the literature of that domain, i.e. the term with the highest user acceptance. Use a name that is most widely accepted in the user domain. The class should represent and be named after the intrinsic, underlying nature of the universal to be represented, not according to extrinsic properties or roles a class can play in a particular context. Embodying the whole meaning of the class - with all its relationships to other classes - in its name is in most cases neither possible nor recommended. Keep semantics in the definitions and formalize it explicitly as properties and axioms. For example, a class “distinct_identifiable_physical_part” should be just called “physical_part”. For the preferred synonym readability should have higher priority than constraining interpretation through the class names. For the class name that is used for OE, it is the other way round.

Epistemological statements (using meta-level jargon) don't belong in the class names so avoid calling the class “instrument” “instrument_class” or the relation “has_part” “has_part_relation”.

1 Implementing the Class name

Each class has a name. An OBO term must have a name: tag. This is the element in OBO-XML. OBO names are mapped to rdfs:label, which is a sub-element of

2 Class name precision

Class names should be precise, concise and linguistically correct (i.e. they should conform to the rules of the language used). Often terms for RUs are not precise, i.e. they do not capture the intended meaning. Imprecise terms are especially problematic in the absence of good definitions. For example the term “anatomic_structure, system or substance” does not give us any clue as to whether the scope of the adjective prefix “anatomic” is restricted to structure or extends also to system and substance. This ambiguity can lead to problems like the following: If “anatomic” is restricted to “structure” only, then “drug” and “chemical” would be classified under this class, since these are clearly substances. If it is not restricted “drug” and “chemical” could not be classified under this class.

1 Avoid linguistic ellipses

Be explicit and try to avoid ellipses, because what you leave out or consider implicitly clear is not necessarily known to others, and certainly not to computers. An ellipsis is a rhetorical figure of speech: the omission of a word or words required by strict grammatical rules but not by sense. In human language the missing words are implied by the context. The use of an ellipsis often points to slang words, which should be avoided or captured as synonyms, e.g. "chemo" for "chemotherapy". The aposiopesis is a special form of rhetorical ellipsis (wiki). Typical examples of ellipsis are: “Pat embraces Meredith, and Meredith, Pat”, in which the second instance of the word “embraces” is implied rather than explicit; and “And so to bed”, which appears on several occasions in the diary of Samuel Pepys, meaning “and so I went to bed”.

The Plant Ontology used to use 'cell' to mean 'plant cell' in this way, which led to problems when they had to extend the ontology to deal with bacteria in plants. They have now changed the definition and name of their former 'cell' to ‘plant cell’ and created a broader ‘cell’ class. The general rule is, for every expression 'E': 'E' means: E. The term ‘E’ means what the word ‘E’ means, but the word ‘E’ may mean different things...

Sometimes hyphen usage is a hint of an ellipsis. This should be avoided, e.g. "bio- and genetechnology" would become "biotechnology and genetechnology" and would then probably be modelled as two separate classes, "biotechnology” and “genetechnology".

Confusingly, we sometimes use the same general terms to refer to both universals and collections of particulars. Consider:

· HIV is an infectious retrovirus

· HIV is spreading very rapidly through Asia

This, however, could also be regarded as ellipsis usage: the first "HIV" stands for "HIV-Virus", the second for "HIV-Disease".

3 Synonyms

One definition of synonymy, as proposed by ISO 1087-1:2000: A synonym is a “… relation between or among terms in a given language representing the same concept, with a note to the effect that terms which are interchangeable in all contexts are called synonyms; if they are interchangeable only in some contexts, they are called quasi-synonyms.“ [I don’t think ‘quasi-synonym’ should exist, see next chapter].

The number of synonyms for a class is not limited, and the same text string can be used as a synonym for more than one class (???). Add a synonym when you edit or delete a class name whose old name is still a valid synonym, e.g. if you change "respiration" to "cellular_respiration", think of keeping "respiration" as a synonym (but in this case make it a superclass…). This helps other users to find "familiar" classes. 'Jargon' type phrases, abbreviations and acronyms are synonymous with the full name as long as they are not used in any other sense elsewhere. Translations of the class name into other languages are sometimes captured as synonyms, too. We recommend capturing translations in a different element, e.g. one called 'Translation'. In any case, the language the term belongs to has to be indicated. OWL provides a convenient way to do this by setting the 'lang' attribute, e.g. for an rdfs:label annotation property.

To capture synonyms in OWL, one could use the rdfs:comment field and add a comma-separated list of synonyms after a simple in-text “synonym: ” marker. This is a very dirty solution and can only be a preliminary one. Another way would be to create a new metaclass with a new string datatype property “has_synonyme” and derive all new classes from this new metaclass (see also ). This has the disadvantage that the whole ontology becomes OWL Full. Capturing synonyms in further rdfs:label fields has the disadvantage that, when several labels are present, it is not possible to know which one is the preferred class name (the human-readable class name to display as the browser key) and which is another kind of synonym, given that the preferred class name is itself formalised with rdfs:label. Usually the alphabetically first rdfs:label would be displayed.

1 Avoid different sorts of Synonyms

As we saw above, some ontologists perceive synonyms as not always 'synonymous' in the strictest sense of the word, as they feel that they do not always mean exactly the same as the class they are attached to. Some ‘synonyms’ seem to be broader or narrower in meaning than the class name; a synonym may be a related phrase or an alternative wording or spelling, or may use a different system of nomenclature. Having a single, broad relationship between a class and its synonyms is adequate for most search purposes, but for applications such as semantic matching, the inclusion of a more formal relationship set can be valuable. Here synonym types are sometimes introduced, as GO does. Such relationships can be stored in the OBO format flat file.

Synonym types:

Some synonym relationship types are:

* the term is an exact synonym to the class name, “ornithine_cycle” is an exact synonym of “urea_cycle”

* the term is related to the class name, “cytochrome_bc1_complex” is a related synonym of “ubiquinol-cytochrome-c_reductase_activity”

* the synonym is broader than the class name, “cell division” is a broad synonym of “cytokinesis”

* the synonym is narrower or more precise than the class name, “pyrimidine-dimer_repair_by_photolyase” is a narrow synonym of “photoreactive_repair”

* the synonym is related to the class name, but is not exact, broader or narrower, “virulence” has a synonym type of other related to the class name “pathogenesis”

However, we do not recommend capturing such ‘synonym types’ as the GO style guide suggests. Capture only exact synonyms.

For the OWL format one could use the W3C standard for thesauri, the ‘Simple Knowledge Organisation System’ (SKOS, ), to encode synonym types through relations like “narrower than” and “broader than”. It also provides a “preferred label” and a "related to" element for terminological mapping:

The SKOS Core Vocabulary includes the following properties for asserting semantic relationships between concepts: skos:semanticRelation, skos:broader, skos:narrower and skos:related. In a property hierarchy semanticRelation is the top semantic relationship and others are children relationships. To assert that one concept is broader in meaning (i.e. more general) than another, where the scope (meaning) of one falls completely within the scope of the other, use the skos:broader property. To assert the inverse, that one concept is narrower in meaning (i.e. more specific) than another, use the skos:narrower property.

For example:

[Figure: example concepts “mammals” and “animals”, related through skos:broader and skos:narrower]
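The mammals/animals example can be sketched without any RDF library as plain (subject, predicate, object) triples. The reading used here is that “mammals” has the broader concept “animals”; the helper with_inverses is hypothetical and only illustrates that skos:narrower is the inverse of skos:broader.

```python
# Asserted statement: 'mammals' has the broader concept 'animals'.
triples = {("mammals", "skos:broader", "animals")}

INVERSE = {"skos:broader": "skos:narrower", "skos:narrower": "skos:broader"}


def with_inverses(triples):
    """Add the inverse statement for every broader/narrower triple."""
    closed = set(triples)
    for subject, predicate, obj in triples:
        if predicate in INVERSE:
            closed.add((obj, INVERSE[predicate], subject))
    return closed


print(sorted(with_inverses(triples)))
```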

When you add a synonym in OBO-format using OBO-Edit, choose a type from the pull-down selector (see the DAG-Edit user guide for more information). DAG-Edit will incorporate the synonym type into the OBO format flat file when you save. The default synonym type is the broadest, 'synonym' (equivalent to 'related' above).

2 Property synonyms

One should also capture object property synonyms (see section 4.1 of ), e.g.:

4 Acronyms and Abbreviations

Ideally, abbreviations in names should be avoided and acronyms resolved. Names for RUs should be explicit, e.g. "number_of_residues" should be used instead of a totally unintuitive "n_res". Acronyms should be included in the synonyms list and resolved if used as preferred class name. However, when an acronym is commonly used with very high frequency in everyday language in place of its full name (it is then called an anacronym), for example “laser”, it should be used as the class name, while its resolved name is listed in the synonym list. Domain-specific acronyms should be resolved. Only acronyms that denote the main focus of the ontology and are found frequently in it can stay as they are. Resolving e.g. “NMR” as “nuclear_magnetic_resonance_spectroscopy” in each RU within an NMR ontology makes too many terms unnecessarily long and hard to read.

Top-level classes should never have abbreviations or acronyms in their names; there are, however, bottom-level classes in which an acronym or abbreviation could be used. In these cases of compound terms on the bottom level, the acronym should be unambiguous and be resolved in at least one of the synonyms. Do not allow abbreviations which coincide with expressions that have other meanings ('chronic olfactory lung disorder' should never be abbreviated as 'cold'). If acronyms cannot be avoided, capitalise them. There is no clear policy on when to spell out abbreviations, so use your common sense.

5 Registered Product- and Company-names

Proprietary names should be captured as they are, as long as this is not prohibited by the allowed character rule for the element used to represent it. In our case we are not restricted here, but should discuss, whether we allow spaces, or substitute them with the underscore.

[parsers, add and refine]

6 Lexical Properties of class names

1 Capitalisation

Names should be in lower-case letters throughout, except for acronyms, which are capitalised (if their use in class names can't be avoided), and proprietary names, which are written as such. Proper names and brand names may break the convention rules unless RDF field restrictions prevent this. E.g. there can be a "CBS_station" (starting with a capital letter) and there can be a CamelCase brand name. This is the recommendation of the OBO Consortium. The other KR domains (semantic web / OWL, Protégé group) use capitals at the beginning of class names, while properties start with lower-case letters.

Internal capitalization is, however, enforced by some computer systems and mandated by the coding standards of many programming languages; e.g. Java coding style dictates that UpperCamelCase be used for classes, and lowerCamelCase be used for instances and members. So unless you plan to use auto-generated Java classes or any MDA approaches to convert the ontology into software code, avoid CamelCase.

2 Character set

Terms designating RUs should consist mainly of alphabetic characters, numerals and underscores. Whether you will be allowed to use the space as word delimiter depends on the way the implementation handles the strings for the representational unit in question. Avoid special characters where possible. Avoid character combinations that may have a special meaning in regular expressions, programming languages or XML. These recommendations are largely dependent on what the parsers for the implementation format of the specific RU can handle.

OWL identifiers (values of the rdf:ID / :NAME property) must begin with a letter or underscore and contain only letters, numerals, and the underscore character (‘_’). Spaces are not allowed here. For the full, less restrictive specification see :

NCNameStartChar ::= Letter | '_'

NCNameChar ::= NameChar - ':'

NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
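A candidate identifier can be tested against these productions with a regular expression. The sketch below is an ASCII-only approximation and not a full implementation of the grammar: Letter is narrowed to A-Z and a-z, Digit to 0-9, and CombiningChar and Extender are left out. The function name is_ncname is hypothetical.

```python
import re

# ASCII-only approximation of the grammar above: Letter is narrowed to
# A-Z/a-z and Digit to 0-9; CombiningChar and Extender are left out.
NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_ncname(identifier):
    """Return True if the whole identifier matches the simplified NCName pattern."""
    return NCNAME_ASCII.fullmatch(identifier) is not None


for candidate in ["plant_cell", "plant cell", "3d_structure", "obo:cell"]:
    print(candidate, is_ncname(candidate))
```

Names containing a space, starting with a digit, or containing a colon are rejected, which matches the restrictions described above.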

If you keep the class name in another element, e.g. the rdfs:label, you are in principle not restricted in your character usage.

3 Formattings

No accents, subscripts or superscripts are allowed (e.g. cm3 replaces cm³ and CO2 replaces CO₂). The names of chemical elements from the periodic table should be written out in full and should not be abbreviated with their symbols (use hydrogen, copper and zinc rather than H, Cu and Zn). Greek symbols should be spelled out, e.g. "alpha" instead of α. Temperature designations like 37 °C can be represented as 37C or, better, be represented formally through a proper units ontology.
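Part of this normalisation can be automated with the Python standard library. The following is a minimal sketch, assuming that Unicode compatibility decomposition (NFKD) is an acceptable way to flatten sub- and superscripts and to strip accents; the function name plain_name is hypothetical, and chemical symbols or temperature units are not handled.

```python
import unicodedata


def plain_name(text):
    """Spell out Greek letters, flatten sub-/superscripts and drop accents."""
    pieces = []
    for char in text:
        name = unicodedata.name(char, "")
        if name.startswith("GREEK") and " LETTER " in name:
            # e.g. 'GREEK SMALL LETTER ALPHA' -> 'alpha'
            pieces.append(name.split(" LETTER ")[1].lower())
        else:
            pieces.append(char)
    # NFKD maps superscript/subscript digits to plain digits and splits
    # accented letters into base letter + combining accent.
    decomposed = unicodedata.normalize("NFKD", "".join(pieces))
    return "".join(c for c in decomposed if not unicodedata.combining(c))


print(plain_name("cm\u00b3"), plain_name("CO\u2082"), plain_name("\u03b1_helix"))
```

Accented Greek letters and other special cases would need additional rules; the sketch covers only the examples given in the text.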

4 Punctuation

Various kinds of punctuation connect name parts, including separators such as spaces, hyphens, and grouping symbols such as parentheses. These may have:

a) No semantic meaning. A naming rule may state that separators will consist of one blank space or exactly one special character (for example the underscore) regardless of semantic relationships of parts. Such a rule simplifies name formation.

b) Semantic meaning. Separators can convey semantic meaning by, for example, assigning a different separator between words in the qualifier term from the separator that separates words in the other part terms. In this way, the separator identifies the qualifier term clearly as different from the rest of the name. For example, in the data element name “Cost_Budget-Period_Total_Amount” the separator between words in the qualifier term is a hyphen; other name parts are separated by underscores.

Other languages, e.g. Asian languages, form words using two characters which, separately, have different meanings, but when joined together have a third meaning unrelated to their parts. This may pose a problem in the interpretation of a name, because ambiguity may be created by the juxtaposition of characters. A possible solution is to use one separator to distinguish when two characters form a single word, and another when they are individual words.

1 Word separators

Class name terms should be delimited by the "_" (underscore) separator. The underscore substitutes for the space character. Whether you will be allowed to use the space as word delimiter depends on the way the implementation handles the strings for the representational unit in question. Under the OBO umbrella one can find the "MyClass", "My Class", "My-Class", "My_Class", "My_class" and "my class" conventions, sometimes even within one ontology. One convention is not necessarily better or worse than another as long as it is used consistently within the ontology. Java programmers, for example, use the "MyClass" (CamelCase) convention, because that is the standard for naming Java classes, whereas text miners use the "My class" convention, because it is easier to tokenize with natural language processing tools. The CamelCase convention has problems capturing class names like “Sample_pH”, which would then read “SamplePH”. XML-based languages don't like the space as a separator, so check how your parser copes with it in the (meta-) RU which captures the name for the RU. The current Jena XML parser does not cope with spaces in class names when using the Protégé-2000 OWL plugin. When the class name is captured within the rdf:ID element, it also becomes part of a namespace and URI in OWL, and these, as explained above, should not contain spaces or special characters. This is not an issue when using the rdfs:label element to capture the class name. The easiest thing, however, is to avoid the space altogether.
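The “Sample_pH” problem can be made concrete with a small conversion sketch. The three helper functions are hypothetical and deliberately naive; they only illustrate that the word boundary of the underscore form cannot be recovered once it has been converted to CamelCase.

```python
import re


def to_underscore(label):
    """Replace runs of spaces with a single underscore, e.g. 'my class' -> 'my_class'."""
    return re.sub(r" +", "_", label.strip())


def to_camel_case(name):
    """Join underscore-separated words, capitalising the first letter of each."""
    return "".join(word[:1].upper() + word[1:] for word in name.split("_"))


def from_camel_case(name):
    """Naive inverse: insert an underscore before every inner capital letter."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name)


camel = to_camel_case("Sample_pH")
print(camel, from_camel_case(camel))
```

The round trip does not return the original name, which is one practical argument for keeping the underscore form as the primary class name.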

2 Hyphens, dash and slash

The hyphen should be avoided as a word separator; it should be used as in normal written English, as long as the representation formalism allows it. Java will interpret the hyphen as a minus. Using the hyphen as a separator would also cause ambiguity between hyphens required by English, e.g. “copper-based_compound”, and hyphens used to restrict or refine the meaning of a name, e.g. "bow-boat_part" and "bow-the_weapon", as is still done in some ontologies. The hyphen has many meanings which we take for granted, but which have to be assigned more explicitly to be processed by computers. When using the hyphen one should be aware that its meanings can conflict: it can generally mark an undefined "somehow-related-to" relationship, it can mark a closer semantic binding as in “copper-based_compound” and can encode substantiation as in "abdomen-sonography", but it can also mark a divergence in meaning between the two words, as in "black-white". In “bio- and genetechnology” it encodes an ellipsis, standing for the morpheme “technology”. Sometimes the hyphen encodes different logical connectors like "and" or "or", and it can be used to separate syllables when breaking a word in two at the end of a line. In sentences it can of course also encode separation marks for additional thoughts squeezed into a sentence, as in “Enzymes – except Prions – are useful Proteins”. The hyphen also demarcates numerical, spatial or temporal ranges, as in “1–4 telephone calls”, “Bremen–Hamburg” and “25.09.–28.12”, or is used as a minus or to indicate an omission, as in “the PC is worth 300,–“. Last but not least, it can be confused with a minus.

We need to differentiate between the hyphen and the dash. There are two kinds of dashes: the n-dash and the m-dash. The n-dash is called that because it is the same width as the letter "n". The m-dash is longer, the width of the letter "m". We use the n-dash for numerical ranges, as in "6-10 years." When we need a dash as a form of parenthetical punctuation in a sentence, we use the m-dash.

The slash "/" means OR or AND in most cases and should be avoided in class names as should logical connectives in general.

5 Wordform and tense

Names for RUs should be in the singular form throughout. This prevents redundancy and misclassifications, for example creating a class "experiments" (plural) and then "experiment" as its subclass deeper in the hierarchy (true only if the .NAME field is used, which is checked to keep a unique string). If you want to import legacy XML or generate XML feeds from the ontology you have to use the singular form anyway, since this is the expected convention for XML tags.

Class names are always nouns, so use "randomisation" instead of "randomise". Nouns are the most concrete part of speech. Verbs can be converted to nouns (e.g. “cleans” to “cleaning”). Adjectives and adverbs, however, seldom convey meanings captured by classes. They correspond more to properties.

Class and property names should be uniformly captured in the present tense. Sometimes a time perspective is indicated within class or property names, e.g. ”to_be_measured”, “measuring”, “measurement_taken”. Class names should be normalized consistently into the present tense form, e.g. “measurement”.

1 Plurals and sets

If you have to capture plurals, you have three possibilities, e.g. “protocols”, “set_of_protocols” or “protocol_set”. The last form is recommended, because it is easier to spot (also for text mining). It is preferred over “set_of_x” because it is placed alphabetically directly beneath its singular form within the hierarchy. Use plurals sparingly. Creating for each singular x a plural container of the form “x_set” creates a lot of classes, which we might not use at all. An instance of 'protocol' is a protocol and an instance of 'protocol_set' is a set of protocols. Be aware of the difference: each class 'A' in an ontology has the implicit meaning 'the class A'.

[Refine, (Chebi comment)]

Discriminate carefully between Class and Set: Both classes and sets are marked by granularity, but sets are timeless. A class endures through time and survives the turnover in its instances. A class is not determined by its instances (as a state is not determined by its citizens and as an organism is not determined by its molecules). A set is determined by its members. It is an abstract structure, existing outside time and space. The set of human beings existing at t is (timelessly) a different entity from the set of human beings existing at t' because of births and deaths.

4.6.6 Word length and word compositions

Names for RUs that show up in the hierarchy (i.e. in the browser or as display key) and have to be read quickly for orientation purposes should be at least four characters long, but otherwise as short as possible, so that they are easy to read and understand. Avoid human readable or preferred names that look like full sentences. Short and maximally intuitive names are to be preferred. Names are useful only if they are in fact used.

[see JacobKoehler paper."intelligibility of GO terms" + DILS paper].

Word compositions longer than five words / morphemes should be avoided. When class names are made out of several words, try to use words that are already defined in higher hierarchy levels of the ontology. ‘Recycle’ words whenever possible. Build compound names out of simpler ones from the ontology in a consistent LEGO-like approach. Consistent means that the binding operators (words used to connect the other parts of the class name) are used in the same way throughout the ontology.
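The two length limits given in this section (at least four characters, at most five words) are easy to check mechanically. The following is a minimal sketch; the function name is hypothetical and the underscore is assumed to be the word separator.

```python
def check_name_length(name, min_chars=4, max_words=5):
    """Return a list of warnings for one RU name (underscore-separated words assumed)."""
    warnings = []
    words = [word for word in name.split("_") if word]
    if len(name) < min_chars:
        warnings.append(f"shorter than {min_chars} characters")
    if len(words) > max_words:
        warnings.append(f"more than {max_words} words")
    return warnings


for candidate in ["NMR", "sample_temperature_in_autosampler"]:
    print(candidate, check_name_length(candidate))
```

Such a check only flags candidates for review; whether a short acronym or a long compound is acceptable remains a curatorial decision.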

A formal class name can be given to a class, i.e. a name that is formally controlled through linguistic rules and axioms, e.g. OBOL-normalized names that adhere to defined principles of word/morpheme/affix order and form. ???

4.6.6.1 Compound vs. atomic names for representational units

Sometimes one encounters rather long names for RUs, which encode a lot of semantics within the name. These complex names are compositions of many words and therefore are called compound terms. They often consist of a noun phrase, like "sample_temperature_in_autosampler" embedding a prepositional term (localizational property like "in_autosampler").

[Compositionality – see Chris Mungall's OBOL , see Okren]

When the representational formalism allows properties to be formalized and the atomic components are already present, these classes can be refactored / dissected / decomposed into more primitive existing classes (atoms) and attributes or relations between them. This is encouraged for OWL ontologies. When only an is_a hierarchy (without properties) is provided, compound names should be kept in the long form to capture what the user really wants to express, and one has to keep the semantics within the class name. As long as one is working with CVs, one should aim to be reasonably descriptive, even at the risk of some verbal redundancy or longer names. That is why one often finds rather long class names in taxonomic CVs.
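As an illustration of such a decomposition, the sketch below splits a compound name at an embedded preposition. The preposition list and all names in the code are assumptions made for this example; the preposition only marks where a properly named property would go.

```python
from typing import NamedTuple, Optional


class Decomposition(NamedTuple):
    head: str      # the class that remains after splitting
    relation: str  # the preposition, a stand-in for a property name
    filler: str    # the class the property would point to


PREPOSITIONS = ("in", "of", "on", "from")  # illustrative list, not exhaustive


def decompose(name: str) -> Optional[Decomposition]:
    """Split an underscore-separated compound name at its first inner preposition."""
    words = name.split("_")
    # A preposition in first or last position cannot link two parts, so skip those.
    for index in range(1, len(words) - 1):
        if words[index] in PREPOSITIONS:
            return Decomposition(
                "_".join(words[:index]), words[index], "_".join(words[index + 1:])
            )
    return None


print(decompose("sample_temperature_in_autosampler"))
print(decompose("ethics_commission"))
```

A name without an inner preposition, such as "ethics_commission", is returned as not decomposable and would stay as it is.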

When word combinations with genitive, dative or accusative case occur, variants are possible, e.g. combination into one single word (Breaking_off_the_experiment → experiment_breakoff) or connection with a hyphen (NMR_of_Hydrogen → Hydrogen-NMR).

According to DIN 12/1993, when new terms are created out of existing already defined class names the following types of multi-word terms can be distinguished (B. Schaeder, Fachlexicographie: Fachwissen und seine Repraesentation in Woerterbuechern, 1994, Tübingen):

Determinative term (Concept) linkage:

A second term occurs additionally, as a feature in the content of the original term, whereby the latter is restricted. The resulting multi-word term is a subterm. E.g. randomised_study.

Disjunctive term linkage:

The new multi-word term encompasses the scope of both constituent terms. E.g. consensus_study.

Integrating term linkage:

Objects associated to terms are combined into the next higher whole. E.g. sponsor-investigator.

Conjunctive term integration:

The new term merges the contents of both constituent terms, and is their next common subterm. E.g. investigator_study.

[To be evaluated…]

4.6.6.2 Splitting and merging classes

Simple (sometimes hyphen-separated) and bimorphemic compound terms like "histology-result" should only be atomised into "histology" and "result" when the occurring morphemes themselves represent important classes that are of use in other multi-word creations. E.g. for a clinical trial the atomic morphemes "ethics" and "commission" are not important, so a multi-word term like "ethics_commission" can stay as it is and needs to be defined only once.

The standard procedure for refactoring / splitting a class is to obsolete the original class and add a suitable comment directing annotators to the new classes (see Metadata Annotation document on ). Classes are merged in cases where two classes have exactly the same meaning in all contexts (i.e. are synonymous). Usually this situation arises when one class exists, and another wording of the same concept is added as a new class instead of as a synonym, either because a curator didn't find the old class or didn't know it meant the same thing.

For owl: When two classes are merged, e.g. class A and class B are merged into class A, the class name and the ID of class B are made synonyms of class A.

For obo: When two classes are merged, e.g. class A and class B are merged into class A, the ID of class B is made a secondary ID, and the class name is made a synonym. Usually, the ID that has existed longer is used as the primary ID, but exceptions can be made; e.g. the name of the class with the newer ID may be more correct or the definition may be better. Secondary IDs are stored in the OBO flat file with the 'alt_id' tag.
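The OBO merge rule can be sketched on plain dictionaries. The IDs and class names below are hypothetical, and the dictionary keys merely mirror the OBO tag names mentioned above; this is not an OBO parser or writer.

```python
def merge_terms(primary, secondary):
    """Merge `secondary` into `primary`; neither input dict is modified."""
    merged = {
        "id": primary["id"],
        "name": primary["name"],
        "alt_id": list(primary.get("alt_id", [])),
        "synonym": list(primary.get("synonym", [])),
    }
    # The ID of the merged-away class survives as a secondary ID.
    merged["alt_id"].append(secondary["id"])
    merged["alt_id"].extend(secondary.get("alt_id", []))
    # Its name and its synonyms become synonyms of the surviving class.
    for label in [secondary["name"], *secondary.get("synonym", [])]:
        if label != merged["name"] and label not in merged["synonym"]:
            merged["synonym"].append(label)
    return merged


class_a = {"id": "EX:0000001", "name": "experiment_breakoff"}     # hypothetical
class_b = {"id": "EX:0000002", "name": "experiment_termination"}  # hypothetical
merged = merge_terms(class_a, class_b)
print(merged["alt_id"])
print(merged["synonym"])
```

Which of the two IDs becomes the primary one is left to the caller, since the text allows exceptions to the "older ID wins" rule.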

4.6.7 Affixes (prefix, suffix, infix and circumfix)

The word stem should be used, and affixes to names should be avoided where possible or at least be used consistently. Since each class 'A' implicitly means 'the class A', prefixes or suffixes involving “_class” must be avoided. The same applies to suffixes like "_entity" and "_type". When an ontology has many terms starting with the same prefix, for example “sample_number”, “sample_origin”, …, this suggests transforming the postfixes into properties of a [prefix] class when building the ontology.

If subclasses are named using the class name and a further descriptive morpheme, this should be done in a consistent way throughout the subclasses. For example, a class "receptor" can have two subclasses named “catecholamine_receptor” and “peptide_receptor”. Naming them just “catecholamine” and “peptide” would be bad practice, since ellipses have to be avoided and “peptide” designates a completely different class anyway. Nor should the names “catecholamine_receptor” and “peptide” be mixed. Note that if one prefixes "receptor" subclass names in the form xy_receptor with the ligand as xy, e.g. "adrenaline_receptor", one cannot consistently integrate receptors that are named after their downstream signal transduction module rather than the ligand, e.g. "G-protein_coupled_receptor".

Infixes, circumfixes, articles, conjunctions and possessive forms of words should be used consistently, but be avoided when possible.
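Two of the checks described in this section can be sketched in Python: flagging redundant suffixes, and grouping names that share a first word as candidates for a class with properties. The function names and the sample list are assumptions made for illustration.

```python
from collections import defaultdict

REDUNDANT_SUFFIXES = ("_class", "_entity", "_type")


def redundant_suffix_names(names):
    """Names ending in a suffix that only restates 'this is a class'."""
    return [name for name in names if name.endswith(REDUNDANT_SUFFIXES)]


def shared_prefix_groups(names, min_size=2):
    """Map each shared first word to the remainders of the names that start with it."""
    groups = defaultdict(list)
    for name in names:
        first, separator, rest = name.partition("_")
        if separator:
            groups[first].append(rest)
    return {prefix: rests for prefix, rests in groups.items() if len(rests) >= min_size}


names = ["sample_number", "sample_origin", "receptor_type", "protocol"]
print(redundant_suffix_names(names))
print(shared_prefix_groups(names))
```

Here "number" and "origin" would be reviewed as possible properties of a class "sample".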

4.6.8 Logical connectives

Logical connectives such as "and", "or" and "not" should not be used within names for RUs, because they will be formalised as constraints and axioms later (and hence will allow for reasoning). 'rabbit or whale' does not designate a special universal of mammal.
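A check for connectives has to match whole words, otherwise names such as "sponsor" (which contains "or") would be flagged. A minimal sketch, with a hypothetical function name and assuming spaces or underscores as word separators:

```python
CONNECTIVES = {"and", "or", "not"}


def connectives_in(name):
    """Return the logical connectives that occur as whole words in a name."""
    words = name.lower().replace(" ", "_").split("_")
    return [word for word in words if word in CONNECTIVES]


print(connectives_in("rabbit or whale"))
print(connectives_in("sponsor_investigator"))
```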

4.6.9 "Taboo" words and Characters

Where possible, words from the metalevel (the representation formalism / KR language) should not be used within names for RUs. The use of database or ontology language keywords, for example "Model", "Class", "KIF", "Clips" and "OWL", and of XML-style tags or characters designating tags or regular expressions should be avoided when possible, because you never know whether all parsers you might need to use will handle these. Avoiding them also reduces the risk of parser problems when translations into other formats have to be made.

Other words and morphemes to be avoided are highly ambiguous ones, e.g. the affixes “set” and “setting”, which are among the most ambiguous words in English. "Set" alone has over 20 different meanings (for instance, it can refer to the process of setting parameters or to a collection of parameters).

4.6.10 Specific language requirements

Consistency is required when language-specific spellings or characters are encountered.

Where there are differences in the accepted spelling between British and US usage, use the US form, e.g. polymerizing, signaling rather than polymerising, signalling.

A common source of misspelled tags is the translation from other alphabets or characters. For example, the Umlaut, commonly used in German, is usually represented by the Latin-1 character set. Since this character set is often unavailable, Germans frequently represent an Umlaut character by means of a longhand encoding, such as "ue" for "ü". Consistency is required in these special cases to avoid mixture of "ü"s and "ue"s.
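One way to detect such mixtures is to map every name to its longhand form and report names that collapse onto the same string. This is a sketch under the assumption that longhand ("ue") is the chosen target spelling; the function names are hypothetical.

```python
import unicodedata

LONGHAND = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}


def to_longhand(text):
    """Replace umlauts by their longhand encoding."""
    # NFC first, so that "u" + combining diaeresis is treated like a precomposed "ü".
    composed = unicodedata.normalize("NFC", text)
    return "".join(LONGHAND.get(char, char) for char in composed)


def spelling_variants(names):
    """Group names that differ only in umlaut versus longhand spelling."""
    groups = {}
    for name in names:
        groups.setdefault(to_longhand(name), set()).add(name)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


print(spelling_variants(["Tübingen", "Tuebingen", "Woerterbuch"]))
```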

5 Depicting representational units within text

Be consistent in your notation. We use bold type to depict relations involving particulars, italics for universals and for relations between universals, and Roman type for particulars.

[to be added: Formatting convention when using ontological repr units in literature – see OBO RO]

Italics

Bold

“ “

‘ ‘

UPPERCASE throughout

lowercase throughout

underlined

One recommendation: If you use boldface to emphasize that you speak of the term and not of its denotation, then do not use boldface for other purposes. Use single quotes to refer explicitly to the term 'class'. Since classes are not terms, but rather have terms as names, one should say "the class called 'human'" (where 'human' is the term used to name the class in question), or "the class human" (where italics are used to emphasize that human represents a class). One might, though, want to reserve italics for universals, e.g. "the class representing (the universal) human", and then one should say "the class human" or "the class 'human'" (the last is a shortcut, and this kind of shortcut should be introduced explicitly).

6 Class definitions

Class definitions should provide the context and meaning of the class in a way that eases its interpretation. The definition should contain important keywords that describe the class's inherent attributes and its relations to other classes in natural language. In reality, however, proper definitions cannot be created for all universals, especially at the root level of the ontology (e.g. it is hard to define “thing”). A class should be given a humanly intelligible definition only when the necessary and sufficient conditions for being an instance of the corresponding universal are really understood. Before that, do not make up pseudo-definitions (e.g. circular definitions), but provisionally collect the necessary conditions in the comment field.

Proofread your definitions carefully to eliminate typos and double spaces. As with class names, avoid using abbreviations that may be ambiguous. Definitions should be as brief as possible, but as complex as necessary. They should begin with an upper-case letter, can consist of more than one sentence if necessary, and should always end with a period (full stop). Definitions should start in the following way: “A [class described] is a [superclass], which/that [most relevant intrinsic properties (attributes and relations to other classes)]. It…. [Enter]”. When using the word “it”, make sure you always refer to the described class only.
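The purely formal requirements on a definition (upper-case start, final period, no double spaces, the "A ... is a ..." opening) can be checked automatically. The sketch below is only a rough approximation: the function name is hypothetical and the regular expression is an assumed reading of the template, not a grammar for it.

```python
import re

# Rough pattern for "A [class] is a [superclass] ..." (also accepts "An").
TEMPLATE = re.compile(r"^An? .+? is an? .+")


def definition_format_issues(definition):
    """Return the formal problems found in one natural-language definition."""
    issues = []
    if not definition[:1].isupper():
        issues.append("does not begin with an upper-case letter")
    if not definition.endswith("."):
        issues.append("does not end with a period")
    if "  " in definition:
        issues.append("contains a double space")
    if not TEMPLATE.match(definition):
        issues.append("does not follow 'A [class] is a [superclass] ...'")
    return issues


print(definition_format_issues("A human being is a mammal which is rational."))
print(definition_format_issues("organ of sight"))
```

Content rules, such as whether the superclass is the right one, cannot be checked this way and still need a curator.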

In practice one would first capture non-formal definitions as they come from domain experts or glossaries, or as gathered by a google:define search. These are captured with their provenance (meta-)data, after a “tempdef” marker. Then one creates a second definition which is more formal and standardized according to the principles mentioned below (put after the def marker, see metadata section). Currently all definitions are captured together with metadata in the rdfs:comment field, which is not the cleanest solution, since the comment field can hold anything from editorial notes, scope notes and provenance notes to definitions. The xml:lang attributes do not have to be set, because they can be set once for all classes in the metadata ontology description tab, and these lang attributes, at least for the rdfs:label field, tend to cause problems when importing these ontologies.

6.1 General rules for creating sound normalized definitions

1. Each definition refers to only one class.

2. Definitions should be as clear and concise as possible in order to convey the essence, "Das Wesen" (Silesius) of the universal to the user of the ontology.

3. The definition should be written at the same level of specificity as the class itself.

4. Definitions should define classes and their referred universals and not the words used to refer to classes (class names), so in definitions avoid terms like ‘class’, 'descriptor', 'name', etc. that refer to RUs and not to the universals in reality. E.g. the definition of 'eye' is 'organ of sight', not 'is name of organ of sight', nor ‘class or concept describing an organ of sight’. Avoid using acronyms within definitions.

5. The definition should explain which characteristics (or properties) distinguish members of this class from others (the superclass and siblings).

6. Definitions should use simple, easy-to-understand words that are meaningful to most users. In the best case all terms in the definition can be found as classes in higher levels of the ontology and are thus defined.

7. Definitions should be positive and not negative. Definitions like ‘all animals that are not a mammal’ or ‘all non-membrane proteins’, which do not designate natural kinds, are not helpful, since complements of universals are not necessarily themselves universals.

8. The formal rules for definitions laid down by Aristotle should be applied. When A is_a B, the definition of ‘A’ takes the form: An A is a B which C..., e.g. “A human being is a mammal which is rational”. Essence = Genus + Differentiae. If a class has several parents, i.e. multiple parenthood cannot be avoided, mention all parent classes in the definition.

9. The definition should be free from words sharing the same root as the thing being defined (to be represented) and should not contain the class name itself. Avoid circularity in definitions like these:

An A is an A which is B (person = person with identity documents)

An A is the B of an A (heptolysis = the causes of heptolysis)

10. Each definition should reflect the position in the hierarchy to which a defined RU belongs. The position of an RU within the hierarchy enriches its own definition by automatically incorporating the definitions of all RUs above it. The entire information content of the hierarchy can then be translated cleanly into a computer representation.

11. The definition must be correct in all possible contexts the class is used, so that the class and all its synonyms are intersubstitutable with its definition in such a way, that the result is both grammatically correct and truth preserving.

12. Include some examples of well-known prototypical instances or subclasses of the class.
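Rules 7 and 9 lend themselves to a simple automated pre-check. The sketch below works on the defining part only (the text after "is") and uses the examples given in the rules. It is an assumption-laden approximation: it detects the literal class name and the markers "not" / "non-", but it does not detect words that merely share a root, which would need a stemmer.

```python
import re


def contains_class_name(class_name, definiens):
    """True if the class name occurs as a whole phrase in the defining part."""
    phrase = class_name.lower().replace("_", " ")
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, definiens.lower()) is not None


def negative_markers(definiens):
    """Return the words that make the defining part a negative definition."""
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", definiens.lower())
    return [word for word in words if word == "not" or word.startswith("non-")]


print(contains_class_name("person", "person with identity documents"))
print(contains_class_name("eye", "organ of sight"))
print(negative_markers("all non-membrane proteins"))
```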

Additionally have a look at the following paper by Jacob Koehler:



[Do we need definitions for particulars that we currently represent as classes, e.g. do the brand names of nmr-instrument vendors need definitions???]

If you refer to other classes, use their real natural language names and avoid the artificial underscore delimiter.

In the future definitions might be autogenerated through semantic conversions. Automated inference of class definitions is already available from the Obol page. Note that these are automated, highly experimental and subject to change: Obol []

2 Property definitions

Object properties (relations) should have a definition as follows:

"A relation that indicates a (class name from one relationship) is (nature of relation) for an (class name from other relationship).” For example, the definition for the property ‘storage (of material)’ might read:

“A relation that indicates a material is stored in a facility.”

[refine]

Notice that the formal definition is clear, concise, and unambiguous (i.e. you could look at something and say whether or not it belonged to the entity type). Definitions with words like 'and', 'or', or 'where' in them should be viewed with suspicion.

1 Implementation of definitions

OBO terms have optional human-readable text definitions; this is the def: tag and is represented as in OBO-XML definitions are currently mapped to

Note that this is not ideal: a comment is more general than a definition. This will be fixed in the future to use an owl:AnnotationProperty.

Unique identifiers

[refine]

Following the decentralized web paradigm, every single RU (class or relation) should be versioned independently rather than versioning the ontology as a whole. Therefore it is necessary to consider conventions for unique identifiers for RUs. If one tries to edit a set of modular ontologies held together by just the class name strings, then every time somebody wants to change a name, fix a spelling error, etc., there is a global change that is intrinsically unreliable or, if the ontologies are distributed, requires a major organisational effort. When the identifiers are formal ID numbers and human readable class names are kept as labels, you can change the label without disturbing the linkages. Hence versioning becomes easier when using unique formal identifiers for RUs in representational artefacts. Some ontology editors, like Protégé-2000, construct identifiers out of the ontology name and numbers automatically.

A unique identifier MUST NOT be deleted once used. IDs should be conserved at all times so that, even if a term is defunct or has a new ID, someone searching using the old ID can find it.
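The two points made here (labels can change without touching links, and IDs are never deleted) can be illustrated with a small ID-keyed registry. The class, its method names and the IDs are hypothetical and only serve this sketch.

```python
class TermRegistry:
    """ID-keyed store: labels may change, IDs are never removed."""

    def __init__(self):
        self._terms = {}

    def add(self, term_id, label):
        if term_id in self._terms:
            raise ValueError(f"{term_id} was already used and cannot be reassigned")
        self._terms[term_id] = {"label": label, "obsolete": False, "replaced_by": None}

    def rename(self, term_id, new_label):
        self._terms[term_id]["label"] = new_label

    def make_obsolete(self, term_id, replaced_by=None):
        # Mark instead of delete, so that searches on the old ID still succeed.
        self._terms[term_id]["obsolete"] = True
        self._terms[term_id]["replaced_by"] = replaced_by

    def lookup(self, term_id):
        return dict(self._terms[term_id])


registry = TermRegistry()
registry.add("EX:0000001", "measurment")  # label with a spelling error
registry.add("EX:0000002", "temperature_measurement")
registry.add("EX:0000003", "measuring")   # duplicate of EX:0000001
links = [("EX:0000002", "is_a", "EX:0000001")]  # linkage refers to IDs only

registry.rename("EX:0000001", "measurement")  # fix the label; links stay untouched
registry.make_obsolete("EX:0000003", replaced_by="EX:0000001")
print(registry.lookup("EX:0000003"))
```

The registry deliberately offers no delete operation, which is one simple way to enforce the MUST NOT above.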

OBO encourages numeric local IDs. Anything that is a valid XML ID can be used. As a rule of thumb, while user-friendly names for RUs should not cause problems for human processing, their IDs should not cause problems for machine processing. Always remember that an ID is associated with a definition and a universal rather than with the preferred class name. The numeric identifier resides in the rdf:ID field and the human readable name of the class is in the rdfs:label field. These correspond to the X and Y fields in the OBO-Format.

OBO IDs consist of an (all capitalised) prefix + underscore or “:” (not in owl) + local ID. The prefix can be the more commonly used short form (e.g. ‘OBI’ or ‘msi-nmr’) or a long form (e.g. the full URI prefix). Only the long form + local ID is used in proper OWL files (although the short form can be used as a qname). Currently the long form is left implicit for most OBO ontologies; OBO will come up with a default mapping (which can be overridden by the ontology maintainer), e.g. ONTOLOGYSHORTNAME_21 → urn:lsid:ontologyshortname.: ONTOLOGYSHORTNAME_21, and there will be widgets in Protégé for substituting the short with a long form throughout an ontology. OBO has to decide whether to go with URNs or more standard URIs as the default short->long mapping.

[The RECOMMENDED system of identifiers for the PSI CVs consists of two parts. Part one should be the official ‘namespace abbreviation’ PSI:XXX. The second part corresponds to a numeric accession number having the pattern “000000”. Therefore, the local identifier is XXX:000000 and the complete PSI CV unique identifier is of the format “PSI:XXX:000000”.]
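The recommended format can be expressed as a regular expression. This is a sketch only: it assumes that the namespace abbreviation XXX consists of upper-case letters, which the text does not state, and the function name is hypothetical.

```python
import re

# "PSI:" + namespace abbreviation (assumed: upper-case letters) + ":" + six digits.
PSI_ID = re.compile(r"PSI:(?P<namespace>[A-Z]+):(?P<accession>[0-9]{6})")


def parse_psi_id(identifier):
    """Return (namespace, accession) for a well-formed identifier, else None."""
    match = PSI_ID.fullmatch(identifier)
    if match is None:
        return None
    return match.group("namespace"), match.group("accession")


print(parse_psi_id("PSI:MS:000123"))
print(parse_psi_id("PSI:MS:123"))
```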

Within OBO an "OBO_REL_"-prefix is used to name relations within the rdf:ID field, e.g. rdf:ID="OBO_REL_part_of". The OBO prefix / idspace equates to an XML/RDF namespace: A mapping between a "local" ID space and a "global" ID space. The value for this tag should be a local idspace, a space, a URI, optionally followed by a quote-enclosed description, like this: idspace: GO urn:lsid::GO: "gene ontology terms".
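The structure of the idspace value (local idspace, space, URI, optional quoted description) can be made explicit with a small parser. The regular expression below is an assumption derived from that one-sentence description, not from the OBO specification, and the function name is hypothetical. The sample line is the one quoted above, with its URI left exactly as given.

```python
import re

IDSPACE_LINE = re.compile(
    r'idspace:\s*(?P<local>\S+)\s+(?P<uri>\S+)(?:\s+"(?P<description>[^"]*)")?'
)


def parse_idspace(line):
    """Split an idspace line into local idspace, URI and optional description."""
    match = IDSPACE_LINE.fullmatch(line.strip())
    return match.groupdict() if match else None


print(parse_idspace('idspace: GO urn:lsid::GO: "gene ontology terms"'))
```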

OWL is layered over RDF/RDFS. The OBO identifier model is consistent with the RDF/XML identifier model.

OBO IDs:

An OBO identifier consists of an idspace and a local ID (e.g. GO and 0008045). These are normally flattened using a colon separator, e.g. GO:0008045. An idspace can have both short and long forms. The short form would be unique within OBO, the long form would be a guaranteed globally unique URI prefix.

The current OWL mapping simply substitutes the : for a _ in the ID, and prepends a generic URI prefix.
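The current mapping amounts to a one-line string transformation. In the sketch below the URI prefix is a placeholder, since the text does not name the generic prefix that is actually used; the function name is hypothetical as well.

```python
# Placeholder only: the generic URI prefix is not given in the text.
GENERIC_PREFIX = "http://example.org/obo#"


def obo_id_to_owl_uri(obo_id, prefix=GENERIC_PREFIX):
    """Replace the colon separator by an underscore and prepend a URI prefix."""
    idspace, separator, local_id = obo_id.partition(":")
    if not (separator and idspace and local_id):
        raise ValueError(f"not of the form idspace:local_id: {obo_id!r}")
    return f"{prefix}{idspace}_{local_id}"


print(obo_id_to_owl_uri("GO:0008045"))
```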

In future this will be changed. The RDF/XML ID will be composed of the idspace URI (corresponding to XML namespace) and the local ID (e.g. , or whatever URI scheme we choose to use). The idspace short form (eg GO) will be used as the XML namespace qname.

1 Capturing the class name and ID using the autoID plugin in Protégé-owl

Within our current ontologies the unique class ID goes in the rdf:ID field. The value of the rdf:ID field is restricted and can contain special characters only at certain positions. The rdf:ID field can contain the following characters:

at the beginning: £ $ _ and :

but not :@[{./=-+ ................