Business / Engineering Problem
In most industries, practitioners work with more than one terminology, classification scheme, or data model.
For instance, in the AEC industry, there are a few ontologies to describe the semantics of building models,
such as the Industry Foundation Classes (IFC), the CIMsteel Integration
Standards (CIS/2), and the OmniClass Construction Classification System. For various model rebuilding and data
exchange purposes, comparison and mapping between heterogeneous ontologies in the same industry are often
inevitable.
Ontology comparison and mapping are commonly performed manually by domain experts who are familiar
with one or more industry-specific taxonomies. The manual process is time-consuming, inefficient, and hard to scale,
especially when the experts start from scratch. Automated comparison and mapping based on ontology structure and the
linguistic similarity between concepts have therefore grown in popularity in recent years. This research proposes a
different approach: ontology mapping that uses a document corpus as a medium for semantic similarity
comparison.
Test Case
In the AEC industry, the growing demand for
Building Information Models (BIM) has led to the establishment of various
description and classification standards to facilitate data exchange. OmniClass and IfcXML are two of the
most commonly used data models for buildings and construction.
OmniClass, which consists of 15 tables, categorizes
elements and concepts in the AEC industry and provides a rich vocabulary that practitioners can use in legal
documents. It contains a set of object data elements that represent the parts of buildings or processes, together with
the relevant information about those parts.
IfcXML, which specializes in modeling CAD models and work processes, is frequently
used by practitioners to build information-rich product and process models and as a data format for
interoperability among different software applications. It is a single XML schema file composed of concept terms that are
highly hierarchical and cross-linked. This test case focuses on the mapping between OmniClass and IfcXML.
[Figure captions: OmniClass; IfcXML]
Main Idea and Implementation
Based on the intuition that related terms tend to appear in the same paragraphs or sections, concept comparison and
matching by co-occurrence is proposed to map different sets of terms in heterogeneous ontologies. The number of
co-occurrences of two concepts in the corpus indicates how close the two topics are and serves as a means to
evaluate the relatedness between them.
The two ontologies, OmniClass and IfcXML, are first preprocessed. The entity concept terms
of both ontologies are extracted, the unique IDs and suffixes of the concepts are removed, and duplicate concepts are
discarded. All preprocessed concept terms of OmniClass and IfcXML are then matched against each section of
the International Building Code (IBC) XML files. The concept tags, <OMNICLASS> and <IFCXML>, are inserted into the
sections that match the concepts in stemmed form.
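The preprocessing and tagging steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation used in this research: the ID pattern, the handling of the "Ifc" prefix, the naive suffix-stripping stemmer, and the names normalize, stem, and tag_sections are all hypothetical, and the section text in any usage is made up rather than quoted from the IBC.

```python
import re

def normalize(term):
    """Clean a raw concept term (ID and prefix formats are assumptions)."""
    term = re.sub(r"^[\d\s.\-]+", "", term)            # drop a leading numeric ID, if any
    term = re.sub(r"^Ifc(?=[A-Z])", "", term)          # assumed: drop the "Ifc" prefix
    term = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", term)   # split CamelCase words
    return term.lower().strip()

def stem(word):
    """Very naive suffix stripper, a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_phrase(text):
    """Tokenize text into lower-case words and stem each one."""
    return [stem(w) for w in re.findall(r"[a-z]+", text.lower())]

def contains(section_stems, concept_stems):
    """True if the concept's stems occur as a contiguous run in the section."""
    n = len(concept_stems)
    return n > 0 and any(
        section_stems[i:i + n] == concept_stems
        for i in range(len(section_stems) - n + 1)
    )

def tag_sections(sections, concepts):
    """Map each section ID to the set of normalized concepts it mentions."""
    unique = {normalize(c) for c in concepts}  # duplicates collapse here
    unique.discard("")
    tags = {}
    for sec_id, text in sections.items():
        sec_stems = stem_phrase(text)
        tags[sec_id] = {c for c in unique if contains(sec_stems, stem_phrase(c))}
    return tags
```

Running tag_sections once with the OmniClass terms and once with the IfcXML terms gives, for every section, the two sets of concepts that would carry the <OMNICLASS> and <IFCXML> tags.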
In the example shown on the right, the concepts "Concrete" and "Steel Decking" from OmniClass and the concepts
"IfcSlab" and "steel" from IfcXML are all matched to the same section 2209.2 of IBC. This suggests that they may be
related in some respects. Their relatedness can be further confirmed by considering their
co-occurrence in other sections.
Relatedness Analysis
The number of sections in which two concepts co-occur, and the number of times the two concepts are matched to each
of these sections, indicate the semantic similarity between the two concepts. Three relatedness measures
are used for concept comparison between OmniClass and IfcXML: the cosine similarity measure, the Jaccard
similarity coefficient, and the market basket model. Cosine similarity measures the similarity between two
n-dimensional vectors by the angle between them. The Jaccard similarity coefficient is a set-theoretic statistical
measure of the overlap between two n-dimensional vectors relative to their union. The market basket model is a
probabilistic data-mining technique for finding item-item correlations.
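As a rough illustration of the three measures, the sketch below represents each concept as a dictionary that maps section IDs to match counts. The function names and this data layout are assumptions made for the example. For the market basket model, the source does not state which score was used, so the sketch simply returns the standard support, confidence, and lift values from association-rule mining.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two per-section count vectors."""
    dot = sum(count * b.get(sec, 0) for sec, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def matched_sections(vec):
    """Sections in which the concept is matched at least once."""
    return {sec for sec, count in vec.items() if count > 0}

def jaccard(a, b):
    """Size of the shared section set divided by the size of the union."""
    sa, sb = matched_sections(a), matched_sections(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def market_basket(a, b, n_sections):
    """Support, confidence (a -> b), and lift, treating sections as baskets."""
    sa, sb = matched_sections(a), matched_sections(b)
    both = len(sa & sb)
    support = both / n_sections
    confidence = both / len(sa) if sa else 0.0
    lift = confidence / (len(sb) / n_sections) if sa and sb else 0.0
    return support, confidence, lift

# Hypothetical counts: concept a matched in sections s1 (twice) and s2,
# concept b matched in s1 and s3, in a corpus of four sections.
a = {"s1": 2, "s2": 1}
b = {"s1": 1, "s3": 1}
scores = (cosine(a, b), jaccard(a, b), market_basket(a, b, n_sections=4))
```

Cosine similarity uses the match counts, whereas the Jaccard and market basket scores in this sketch only use whether a concept is matched to a section at all.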
Although in most circumstances related concepts can be captured by treating each section as an independent dimension in
concept co-occurrence comparison, some related concepts rarely co-occur in the same sections. Examples are Is-A-related
concepts (e.g. "concrete" and "building materials") and concepts that are within the same scope (e.g. "steel" and "concrete").
The hierarchical structure of the corpus is therefore taken into account in order to capture concepts that are related but do not co-occur in the same sections.
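One possible way to exploit the corpus hierarchy is to propagate each concept match from a section to its ancestor sections with a decreasing weight, so that concepts matched in sibling sections meet at their common parent. The source does not describe how the hierarchy is actually used, so the sketch below is purely an assumption: the dotted section-ID convention, the decay factor, and the names ancestors and roll_up are all hypothetical.

```python
def ancestors(section_id):
    """Ancestor IDs of a dotted section ID, nearest first.

    Assumes IDs such as "2209.2.1", whose parent is "2209.2".
    """
    parts = section_id.split(".")
    return [".".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]

def roll_up(tags, decay=0.5):
    """Propagate concept matches to ancestor sections with decaying weight.

    tags maps a section ID to the set of concepts matched in that section.
    Returns a mapping: section ID -> {concept: weight}.
    """
    weighted = {}
    for sec_id, concepts in tags.items():
        weight = 1.0
        for node in [sec_id] + ancestors(sec_id):
            bucket = weighted.setdefault(node, {})
            for concept in concepts:
                # Keep the strongest evidence seen for this concept at this node.
                bucket[concept] = max(bucket.get(concept, 0.0), weight)
            weight *= decay
    return weighted
```

With this layout, two concepts that are never matched to the same section can still receive a non-zero co-occurrence weight at a shared parent section, and the weighted vectors can be passed to a measure such as cosine similarity.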
Results
Besides identical concepts such as "curtain walls" from OmniClass and "IfcCurtainWall" from IfcXML, related concepts that
cannot be matched by conventional term matching techniques, for instance, "roof decking" from OmniClass and "IfcSlab" from
IfcXML, are captured via this co-occurrence analysis.
In the test case, the market basket model outperforms the other two relatedness
analysis approaches in terms of both root mean square error (RMSE) and F-measure, a combination of precision and recall.
The market basket model shows the highest recall and a moderately high precision.
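For reference, the two evaluation metrics can be computed as in the minimal sketch below. It assumes that relatedness scores are compared against reference scores for RMSE, and that predicted concept pairs are compared against a reference set of true pairs for the F-measure. The balanced F1 form is an assumption, since the source only describes the F-measure as a combination of precision and recall, and the function names are hypothetical.

```python
import math

def rmse(predicted, expected):
    """Root mean square error between predicted and reference scores."""
    if len(predicted) != len(expected) or not predicted:
        raise ValueError("score lists must be non-empty and of equal length")
    squared = [(p - e) ** 2 for p, e in zip(predicted, expected)]
    return math.sqrt(sum(squared) / len(squared))

def precision_recall_f(predicted_pairs, true_pairs):
    """Precision, recall, and balanced F-measure over sets of concept pairs."""
    true_positive = len(predicted_pairs & true_pairs)
    precision = true_positive / len(predicted_pairs) if predicted_pairs else 0.0
    recall = true_positive / len(true_pairs) if true_pairs else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

A lower RMSE and a higher F-measure both indicate closer agreement with the reference mapping.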
[Figure: Evaluation results of the three measures using RMSE]
[Figure: Evaluation results of the three measures using F-Measure]
Contributions
This research proposes a new approach to comparing and mapping heterogeneous ontologies in order to achieve interoperability
between data models. It enables information exchange and sharing among project stakeholders. Once the mapping between
ontologies is complete, data can also be updated and checked for consistency even when the data sources use different data
models.