SeMi (SEmantic Modeling machIne) is a tool to semi-automatically build large-scale Knowledge Graphs from structured sources such as CSV, JSON, and XML files. To achieve this goal, SeMi builds the semantic models of the data sources in terms of concepts and relations within a domain ontology. Most research contributions on automatic semantic modeling focus on detecting the semantic types of source attributes. However, inferring the correct semantic relations between these attributes is critical to reconstruct the precise meaning of the data. SeMi covers the entire process of semantic modeling:
- it provides a semi-automatic step to detect semantic types;
- it exploits a novel approach to infer semantic relations, based on a graph neural network trained on background linked data.
Semantic models can be formalized as graphs, where leaf nodes represent the attributes of the data source and the other nodes and relationships are defined by the ontology.
Consider the following JSON file in the public procurement domain:
```json
{
    "contract_id": "Z4ADEA9DE4",
    "contract_object": "Excavations",
    "proponent_struct": {
        "business_id": "80004990927",
        "business_name": "municipality01"
    },
    "participants": [
        {
            "business_id": "08106710158",
            "business_name": "company01"
        }
    ]
}
```

Consider also the following domain ontology related to public procurement:
The resulting semantic model is the following:
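To make the graph formalization concrete, below is a minimal sketch of this semantic model as a `networkx` graph. It is illustrative only: node and attribute names are chosen for readability, and the object properties follow the initial JARQL model generated later in this guide.

```python
import networkx as nx

# Semantic model as a directed graph: internal nodes are ontology
# classes, leaf nodes are source attributes, labeled edges are
# ontology properties.
sm = nx.DiGraph()

# Data properties: class node -> source attribute (leaf).
sm.add_edge("Contract0", "contract_id", label="dcterms:identifier")
sm.add_edge("Contract0", "contract_object", label="rdfs:description")
sm.add_edge("BusinessEntity0", "proponent_struct.business_id", label="dcterms:identifier")
sm.add_edge("BusinessEntity1", "participants.business_id", label="dcterms:identifier")
sm.add_edge("BusinessEntity1", "participants.business_name", label="rdfs:label")

# Object properties: semantic relations between class nodes (in the
# initial model generated below, both entities are linked to the
# contract through pc:contractingAuthority).
sm.add_edge("Contract0", "BusinessEntity0", label="pc:contractingAuthority")
sm.add_edge("Contract0", "BusinessEntity1", label="pc:contractingAuthority")

# The leaves are exactly the attributes of the data source.
print([n for n in sm if sm.out_degree(n) == 0])
```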
Before installing SeMi, you need to check the following requirements.
To download SeMi, you can run the commands available here.
To install SeMi, you can use the following instructions.
Using the following scripts, you can generate a semantic model starting from a target source and a domain ontology.
Semantic types (or semantic labels) consist of a combination of an ontology class and an ontology data property. To perform the semantic type detection process, you need to execute two different scripts. The first script is the following:
```
$ node run/semantic_label_indexer.js pc data/pc/input/
```

- `pc` is the Elasticsearch index name.
- `data/pc/input/` is the input folder containing the files to be indexed.
This step is necessary to create the Elasticsearch index used as a reference to detect the semantic types. The second script is the following:
```
$ node run/semantic_label.js pc data/pc/input/Z4ADEA9DE4.json data/pc/semantic_types/Z4ADEA9DE4_st_auto.json
```

- `pc` is the Elasticsearch index name.
- `data/pc/input/Z4ADEA9DE4.json` is the input file.
- `data/pc/semantic_types/Z4ADEA9DE4_st_auto.json` is the automatically generated semantic type file.
In SeMi, we consider semantic type detection as a semi-automatic task. For this reason, the manually refined version of the semantic type is available in the file `data/pc/semantic_types/Z4ADEA9DE4_st.json`.
Below is an image that represents the semantic types.
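For intuition, semantic labeling of this kind can be implemented by indexing the values of attributes whose semantic type is already known and, for a new attribute, retrieving the most textually similar indexed attribute. The sketch below illustrates the idea with the Elasticsearch Python client; it is not SeMi's actual implementation (which is written in Node.js), and the index document layout is a simplifying assumption.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index the values of an attribute whose semantic type (ontology class
# + data property) is known; the document layout is hypothetical.
es.index(index="pc", document={
    "semantic_type": "gr:BusinessEntity---dcterms:identifier",
    "values": "80004990927 08106710158",
})
es.indices.refresh(index="pc")

# For an unlabeled attribute, retrieve the most similar indexed values:
# the semantic type of the best hit becomes the predicted label.
result = es.search(index="pc", query={
    "match": {"values": "03382820920 80004990927"}
})
print(result["hits"]["hits"][0]["_source"]["semantic_type"])
```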
The Multi-Edge and Weighted Graph (MEWG) includes all plausible semantic models of a data source based on a domain ontology. To create this graph, you can run the following command:
```
$ node run/graph.js data/pc/semantic_types/Z4ADEA9DE4_st.json data/pc/ontology/ontology.ttl rdfs:domain rdfs:range owl:Class data/pc/semantic_models/Z4ADEA9DE4
```

- `data/pc/semantic_types/Z4ADEA9DE4_st.json` is the input semantic type file.
- `data/pc/ontology/ontology.ttl` is the domain ontology file.
- `rdfs:domain` is the domain property in the ontology.
- `rdfs:range` is the range property in the ontology.
- `owl:Class` is the property used in the ontology to identify classes.
- `data/pc/semantic_models/Z4ADEA9DE4` is the output path for the generation of the graph in different formats.
This script generates the graph in two formats:

- `data/pc/semantic_models/Z4ADEA9DE4.graph` is the multi-edge and weighted graph.
- `data/pc/semantic_models/Z4ADEA9DE4_graph.json` is a beautified representation of the weighted graph.
Below is an image that represents the MEWG:
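Conceptually, the MEWG has one node per class occurrence in the detected semantic types and one edge for every ontology object property whose `rdfs:domain` and `rdfs:range` match a pair of nodes. Below is a minimal sketch of this construction using `rdflib` and `networkx`; the uniform edge weight is a simplifying assumption, since SeMi assigns its own weights.

```python
import networkx as nx
from rdflib import Graph, URIRef
from rdflib.namespace import RDF, RDFS, OWL

onto = Graph().parse("data/pc/ontology/ontology.ttl", format="turtle")

PC = "http://purl.org/procurement/public-contracts#"
GR = "http://purl.org/goodrelations/v1#"

# One graph node per class occurrence in the detected semantic types.
nodes = {
    "Contract0": URIRef(PC + "Contract"),
    "BusinessEntity0": URIRef(GR + "BusinessEntity"),
    "BusinessEntity1": URIRef(GR + "BusinessEntity"),
}

mewg = nx.MultiDiGraph()
mewg.add_nodes_from(nodes)

# Add one weighted edge for every object property whose rdfs:domain and
# rdfs:range match the classes of a pair of nodes.
for prop in onto.subjects(RDF.type, OWL.ObjectProperty):
    domains = set(onto.objects(prop, RDFS.domain))
    ranges = set(onto.objects(prop, RDFS.range))
    for u, cu in nodes.items():
        for v, cv in nodes.items():
            if u != v and cu in domains and cv in ranges:
                mewg.add_edge(u, v, label=str(prop), weight=1.0)
```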
To compute the Steiner tree on the MEWG, you can run the following command:
```
$ node run/steiner_tree.js data/pc/semantic_types/Z4ADEA9DE4_st.json data/pc/semantic_models/Z4ADEA9DE4_graph.json data/pc/semantic_models/Z4ADEA9DE4
```

- `data/pc/semantic_types/Z4ADEA9DE4_st.json` is the semantic type file.
- `data/pc/semantic_models/Z4ADEA9DE4_graph.json` is the beautified representation of the weighted graph.
- `data/pc/semantic_models/Z4ADEA9DE4` is the output path for the generation of the Steiner tree in different formats.
This script generates the Steiner tree in two formats:

- `data/pc/semantic_models/Z4ADEA9DE4.steiner` is the Steiner tree.
- `data/pc/semantic_models/Z4ADEA9DE4_steiner.json` is a beautified representation of the Steiner tree.
Below is an image that represents the Steiner tree.
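For intuition, the Steiner tree is the minimum-weight subtree of the MEWG that connects all the nodes carrying semantic types (the terminals). Continuing the sketch above, the `networkx` Steiner tree approximation can illustrate the idea; note that it works on a simple undirected graph, so the multi-edge directed MEWG is first collapsed by keeping the cheapest edge between each pair of nodes (SeMi's own implementation may differ).

```python
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

# Terminals: the nodes carrying the detected semantic types.
terminals = ["Contract0", "BusinessEntity0", "BusinessEntity1"]

# Collapse the multi-edge directed MEWG into a simple undirected graph,
# keeping the cheapest edge between each pair of nodes.
simple = nx.Graph()
for u, v, data in mewg.edges(data=True):
    w = data.get("weight", 1.0)
    if not simple.has_edge(u, v) or simple[u][v]["weight"] > w:
        simple.add_edge(u, v, weight=w, label=data.get("label"))

tree = steiner_tree(simple, terminals, weight="weight")
print(sorted(tree.edges(data="label")))
```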
For the automatic generation of the semantic model, you can run the following command:
```
$ node run/jarql.js data/pc/semantic_types/Z4ADEA9DE4_st.json data/pc/semantic_models/Z4ADEA9DE4_steiner.json data/pc/ontology/classes.json data/pc/semantic_models/Z4ADEA9DE4
```

- `data/pc/semantic_types/Z4ADEA9DE4_st.json` is the semantic type file.
- `data/pc/semantic_models/Z4ADEA9DE4_steiner.json` is the beautified representation of the Steiner tree.
- `data/pc/ontology/classes.json` is the list of all classes in the ontology.
- `data/pc/semantic_models/Z4ADEA9DE4.query` is the output JARQL semantic model.
Below is an example of the semantic model serialized using SPARQL and the JARQL syntax:
```sparql
CONSTRUCT {
    ?Contract0 dcterms:identifier ?cig.
    ?Contract0 rdf:type pc:Contract.
    ?Contract0 rdfs:description ?oggetto.
    ?Contract0 rdf:type pc:Contract.
    ?BusinessEntity0 dcterms:identifier ?strutturaProponente__codiceFiscaleProp.
    ?BusinessEntity0 rdf:type gr:BusinessEntity.
    ?BusinessEntity1 dcterms:identifier ?partecipanti__identificativo.
    ?BusinessEntity1 rdf:type gr:BusinessEntity.
    ?BusinessEntity1 rdfs:label ?partecipanti__ragioneSociale.
    ?BusinessEntity1 rdf:type gr:BusinessEntity.
    ?BusinessEntity1 dcterms:identifier ?aggiudicatari__identificativo.
    ?BusinessEntity1 rdf:type gr:BusinessEntity.
    ?BusinessEntity1 rdfs:label ?aggiudicatari__ragioneSociale.
    ?BusinessEntity1 rdf:type gr:BusinessEntity.
    ?Contract0 pc:contractingAuthority ?BusinessEntity0.
    ?Contract0 pc:contractingAuthority ?BusinessEntity1.
}
WHERE {
    ?root a jarql:Root.
    OPTIONAL { ?root jarql:cig ?cig. }
    OPTIONAL { ?root jarql:oggetto ?oggetto. }
    OPTIONAL { ?root jarql:strutturaProponente ?strutturaProponente. }
    OPTIONAL { ?strutturaProponente jarql:codiceFiscaleProp ?strutturaProponente__codiceFiscaleProp. }
    OPTIONAL { ?root jarql:partecipanti ?partecipanti. }
    OPTIONAL { ?partecipanti jarql:identificativo ?partecipanti__identificativo. }
    OPTIONAL { ?root jarql:partecipanti ?partecipanti. }
    OPTIONAL { ?partecipanti jarql:ragioneSociale ?partecipanti__ragioneSociale. }
    OPTIONAL { ?root jarql:aggiudicatari ?aggiudicatari. }
    OPTIONAL { ?aggiudicatari jarql:identificativo ?aggiudicatari__identificativo. }
    OPTIONAL { ?root jarql:aggiudicatari ?aggiudicatari. }
    OPTIONAL { ?aggiudicatari jarql:ragioneSociale ?aggiudicatari__ragioneSociale. }
    BIND (URI(CONCAT('http://purl.org/procurement/public-contracts/contract/',?cig)) as ?Contract0)
    BIND (URI(CONCAT('http://purl.org/goodrelations/v1/businessentity/',?strutturaProponente__codiceFiscaleProp)) as ?BusinessEntity0)
    BIND (URI(CONCAT('http://purl.org/goodrelations/v1/businessentity/',?partecipanti__identificativo)) as ?BusinessEntity1)
}
```
To create the KG resulting from the initial semantic model, you have to run the JARQL tool with the following command:
```
$ ./jarql.sh data/pc/input/Z4ADEA9DE4.json data/pc/semantic_models/Z4ADEA9DE4.query > data/pc/output/Z4ADEA9DE4.ttl
```

- `data/pc/input/Z4ADEA9DE4.json` is the input file.
- `data/pc/semantic_models/Z4ADEA9DE4.query` is the semantic model in the JARQL format.
- `data/pc/output/Z4ADEA9DE4.ttl` is the output RDF file serialized in Turtle.
Below is an example of the generated RDF file:
```turtle
<http://purl.org/procurement/public-contracts/contract/Z4ADEA9DE4>
    <http://purl.org/dc/terms/identifier>
        "Z4ADEA9DE4"^^<http://www.w3.org/2001/XMLSchema#string> ;
    <http://purl.org/procurement/public-contracts#contractingAuthority>
        <http://purl.org/goodrelations/v1/businessentity/03382820920> ,
        <http://purl.org/goodrelations/v1/businessentity/80004990927> ;
    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
        <http://purl.org/procurement/public-contracts#Contract> ;
    <http://www.w3.org/2000/01/rdf-schema#description>
        "C.E. 23 Targa E9688 ( RIP.OFF.PRIVATE ) MANUTENZIONE ORDINARIA MEZZI DI TRASPORTO"^^<http://www.w3.org/2001/XMLSchema#string> .

<http://purl.org/goodrelations/v1/businessentity/03382820920>
    <http://purl.org/dc/terms/identifier>
        "03382820920"^^<http://www.w3.org/2001/XMLSchema#string> ;
    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
        <http://purl.org/goodrelations/v1#BusinessEntity> ;
    <http://www.w3.org/2000/01/rdf-schema#label>
        "CAR WASH CARALIS DI PUSCEDDU GRAZIANO C S N C"^^<http://www.w3.org/2001/XMLSchema#string> .

<http://purl.org/goodrelations/v1/businessentity/80004990927>
    <http://purl.org/dc/terms/identifier>
        "80004990927"^^<http://www.w3.org/2001/XMLSchema#string> ;
    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
        <http://purl.org/goodrelations/v1#BusinessEntity> .
```
The approach for generating the initial semantic model has a main limitation: the Steiner tree includes the shortest paths that connect the semantic type classes within the graph, but these paths do not necessarily express the correct semantic description of the target source. In the example above, for instance, both business entities are linked to the contract through pc:contractingAuthority, even though one of them represents the participants rather than the contracting authority. For this reason, a refinement process is required to identify a more accurate semantic model.
The semantic model refinement requires preparing the training, test, and validation datasets as input to the deep learning model. This model is a graph neural network whose main goal is to reconstruct linked data edges using the latent representations of entities and properties. The architecture of the graph neural network is an auto-encoder composed of:
- an encoder based on Relational Graph Convolutional Networks (R-GCNs);
- a decoder based on DistMult.
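For intuition, DistMult scores a candidate fact (s, r, o) as the sum of the element-wise product of the subject embedding, the relation embedding (a diagonal relation matrix), and the object embedding: the higher the score, the more plausible the edge. Below is a minimal NumPy sketch of the scoring function; the random embeddings stand in for those produced by the R-GCN encoder.

```python
import numpy as np

def distmult_score(e_s, w_r, e_o):
    """DistMult: score(s, r, o) = sum_i e_s[i] * w_r[i] * e_o[i]."""
    return float(np.sum(e_s * w_r * e_o))

rng = np.random.default_rng(0)
dim = 100  # matches the --n-hidden 100 used in the training command below

e_contract = rng.normal(size=dim)   # subject entity embedding (from the encoder)
w_authority = rng.normal(size=dim)  # relation embedding
e_business = rng.normal(size=dim)   # object entity embedding

print(distmult_score(e_contract, w_authority, e_business))
```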
The training, validation, and test datasets are built by splitting a linked data repository (the background knowledge), which is in turn built from the semantic models defined by domain experts on various sources similar to the target source.
In our example, the input sources are available in the `data/pc/input` folder and the ground-truth semantic model is available in the `semi/data/learning_datasets/pc.query` file.
The background linked data is available in the `data/pc/learning_datasets/complete.ttl` file. This background knowledge is then split into the following datasets:

- the training dataset, available in the `data/pc/learning_datasets/training.ttl` file;
- the validation dataset, available in the `data/pc/learning_datasets/valid.ttl` file;
- the test dataset, available in the `data/pc/learning_datasets/test.ttl` file.
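A minimal sketch of such a split is shown below, using `rdflib` and a random 80/10/10 partition of the background triples; the actual proportions used by SeMi are an assumption here.

```python
import random
from rdflib import Graph

# Parse the background knowledge and shuffle its triples.
bg = Graph().parse("data/pc/learning_datasets/complete.ttl", format="turtle")
triples = list(bg)
random.Random(42).shuffle(triples)

# Assumed 80/10/10 partition into training, validation, and test facts.
n = len(triples)
splits = {
    "training.ttl": triples[: int(0.8 * n)],
    "valid.ttl": triples[int(0.8 * n): int(0.9 * n)],
    "test.ttl": triples[int(0.9 * n):],
}
for name, part in splits.items():
    out = Graph()
    for t in part:
        out.add(t)
    out.serialize(f"data/pc/learning_datasets/{name}", format="turtle")
```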
For the graph neural network training, you can launch the following script:
```
$ python src/link_prediction/link_predict.py --directory data/pc/learning_datasets/ --train data/pc/learning_datasets/training.ttl --valid data/pc/learning_datasets/valid.ttl --test data/pc/learning_datasets/test.ttl --score pc --parser PC --gpu 0 --graph-batch-size 1000 --n-hidden 100 --graph-split-size 1
```

- `--directory data/pc/learning_datasets/` is the directory in which the entity and property dictionaries are stored. This directory also stores the trained model with its related outputs.
- `--train data/pc/learning_datasets/training.ttl` is the file containing the training facts.
- `--valid data/pc/learning_datasets/valid.ttl` is the file containing the validation facts.
- `--test data/pc/learning_datasets/test.ttl` is the file containing the test facts.
- `--score pc` is the subdirectory in which the scores resulting from the training and evaluation process are stored.
- `--parser PC` is the parameter that drives the construction of the dictionaries of entities and relationships.
- `--gpu 0` is the parameter that establishes how many GPUs (if available) can be used to train the model.
- `--graph-batch-size 1000` is the number of edges extracted at each step of the graph sampling process.
- `--n-hidden 100` is a hyperparameter of the model that defines the number of neurons (and consequently the dimension of the embeddings) at each network layer.
- `--graph-split-size 1` is the portion of edges used as positive examples.
The outputs of the training stage are the following:

- `entities.dict`: dictionary that maps ids to entity URIs.
- `relations.dict`: dictionary that maps ids to property URIs.
- `model_state.pth`: serialized (PyTorch) state of the trained model.
- `train.npy`: NumPy representation of the training dataset.
- `valid.npy`: NumPy representation of the validation dataset.
- `test.npy`: NumPy representation of the test dataset.
- `emb_nodes.json`: JSON file with the entity embeddings.
- `emb_rels.json`: JSON file with the object property embeddings.
- `score.json`: fact scores obtained on the test dataset.
The goal of this stage is to refine the edge weights of the MEWG by exploiting the embeddings obtained from the graph neural network training. In this way, the information from the background knowledge is incorporated to improve the accuracy of the semantic model.
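One simple way to realize this idea (an illustrative scheme, not necessarily SeMi's exact formula) is to lower the weight of the MEWG edges whose corresponding facts receive a high plausibility score from the trained model, so that a subsequent minimum-weight tree computation prefers them. Continuing the earlier MEWG sketch:

```python
# Hypothetical plausibility scores predicted by the graph neural network
# for candidate MEWG edges (higher = better supported by the background
# knowledge); keys are (source_node, target_node, property_uri).
scores = {
    ("Contract0", "BusinessEntity0",
     "http://purl.org/procurement/public-contracts#contractingAuthority"): 0.91,
}

for u, v, key, data in mewg.edges(keys=True, data=True):
    score = scores.get((u, v, data["label"]), 0.0)
    # A plausible edge gets a lower cost, so it is preferred by the
    # minimum-weight Steiner tree computed on the refined graph.
    mewg[u][v][key]["weight"] = data["weight"] * (1.0 - score)
```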
The first step is to produce the JARQL representation of the MEWG:
```
$ node run/jarql.js data/pc/semantic_types/Z4ADEA9DE4_st.json data/pc/semantic_models/Z4ADEA9DE4_graph.json data/pc/ontology/classes.json data/pc/semantic_models/Z4ADEA9DE4_plausible
```

- `data/pc/semantic_types/Z4ADEA9DE4_st.json` is the semantic type file.
- `data/pc/semantic_models/Z4ADEA9DE4_graph.json` is the beautified representation of the weighted graph.
- `data/pc/ontology/classes.json` is the list of all classes in the ontology.
- `data/pc/semantic_models/Z4ADEA9DE4_plausible.query` is the output JARQL serialization of the plausible semantic models.
Then, you can run the refinement process with the following command:
```
$ node run/refinement.js data/pc/semantic_types/Z4ADEA9DE4_st.json data/pc/model_datasets/scores/pc/6000/score.json data/pc/semantic_models/Z4ADEA9DE4_steiner.json data/pc/semantic_models/Z4ADEA9DE4_graph.json data/pc/semantic_models/Z4ADEA9DE4
```

- `data/pc/semantic_types/Z4ADEA9DE4_st.json` is the semantic type file.
- `data/pc/model_datasets/scores/pc/6000/score.json` is the score file generated during the training at epoch 6000.
- `data/pc/semantic_models/Z4ADEA9DE4_steiner.json` is the beautified version of the initial semantic model file generated through the Steiner tree algorithm.
- `data/pc/semantic_models/Z4ADEA9DE4_graph.json` is the beautified version of the weighted graph file including all plausible semantic models.
- `data/pc/semantic_models/Z4ADEA9DE4` is used as the output path for the refined semantic model files.
This script generates two different outputs:

- `data/pc/semantic_models/Z4ADEA9DE4_refined.graph` is the refined semantic model file.
- `data/pc/semantic_models/Z4ADEA9DE4_refined_graph.json` is the beautified version of the refined semantic model file.
Below is an image that represents the refined semantic model.
For the generation of the refined semantic model serialized in JARQL, you need to run the following command:
```
$ node run/jarql.js data/pc/semantic_types/Z4ADEA9DE4_st.json data/pc/semantic_models/Z4ADEA9DE4_refined_graph.json data/pc/ontology/classes.json data/pc/semantic_models/Z4ADEA9DE4_refined
```

- `data/pc/semantic_types/Z4ADEA9DE4_st.json` is the semantic type file.
- `data/pc/semantic_models/Z4ADEA9DE4_refined_graph.json` is the beautified version of the refined semantic model file.
- `data/pc/ontology/classes.json` is the list of all classes in the ontology.
This script generates the following output file:

- `data/pc/semantic_models/Z4ADEA9DE4_refined.query` is the JARQL serialization of the refined semantic model.
To create the KG resulting from the refined semantic model, you have to run the JARQL tool with the following command:
```
$ ./jarql.sh data/pc/input/Z4ADEA9DE4.json data/pc/semantic_models/Z4ADEA9DE4_refined.query > data/pc/output/Z4ADEA9DE4_refined.ttl
```

- `data/pc/input/Z4ADEA9DE4.json` is the input file.
- `data/pc/semantic_models/Z4ADEA9DE4_refined.query` is the refined semantic model in the JARQL format.
- `data/pc/output/Z4ADEA9DE4_refined.ttl` is the output RDF file serialized in Turtle.





