Dataset module

Introduction

The Dataset class is used to create a valid object whose main purpose is to be used within the Experiment class. This approach allows us to easily create different models to be trained, leaving the complexity to the class itself.

There are different types of datasets. The main class is kgeserver.dataset.Dataset, which allows you to create basic datasets from a CSV or JSON file without any filtering; this can leave in triples or relations that are not useful for the dataset and make the binary file very large.

There are several Dataset classes that work with some of the best known free knowledge graphs on the Internet. Those are:

  • WikidataDataset: This class can manage all queries to the Wikidata endpoint, including Wikidata ids prefixed with Q, as in Q1492. It is fully ready to perform any query you need, with very good results.
  • ESDBpediaDataset: This class is not as mature as WikidataDataset, but it is able to perform SPARQL queries to get entities and relations from the Spanish DBpedia.

The most interesting feature these Dataset classes provide is building a local dataset by making multiple parallel queries to the SPARQL endpoints to retrieve all the information about a given topic. You start by getting a seed_vector of the entities you want to focus on, and then build an n-level graph by querying each entity for its relations with other, new entities.

The seed vector can be obtained through the load_from_graph_pattern method. After that, you should store it in a variable and pass it as an argument to the load_dataset_recurrently method. This is the function that makes several queries to fill the dataset with the desired levels of depth.

To save the dataset in a binary format, use the save_to_binary method. This allows the dataset to be opened later without executing any query.
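
Below is a minimal sketch of that workflow using WikidataDataset. The graph pattern, depth and file name are illustrative, and passing the pattern through the where keyword argument is an assumption based on the method signatures documented below:

from kgeserver.wikidata_dataset import WikidataDataset

# Hypothetical example: seed the dataset with entities that have a BNE id (wdt:P950).
dataset = WikidataDataset()
seed_vector = dataset.load_from_graph_pattern(
    where="{ ?subject wdt:P950 ?bne . ?subject ?predicate ?object }")

# Expand the graph two levels deep by querying each entity for its relations.
dataset.load_dataset_recurrently(2, seed_vector)

# Store the result so it can be reopened later without any SPARQL query.
dataset.save_to_binary("wikidata_2levels.bin")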

Binary Dataset

The binary dataset files are created using Pickle. They basically store all the entities, all the relations and all the triples, along with some extra information needed to rebuild the dataset later. The binary file is stored as a Python dictionary containing the following keys: __class__, relations, test_subs, valid_subs, train_subs and entities.

The relations and entities entries are lists, and their lengths indicate the number of relations or entities the dataset has. The __class__ entry is for internal use of the kgeserver.dataset class. The triples are stored in three different entries, called test_subs, valid_subs and train_subs. Those subsets are created to be used by the next module, the algorithm module, which will evaluate the dataset. This is a common practice when machine learning algorithms are used. If you need all the triples, they can easily be joined in Python by concatenating the three lists:

triples = dataset["test_subs"] + dataset["valid_subs"] + dataset["train_subs"]

The commonly used split assigns 80% of the triples to training, and the remaining triples are divided equally between test and validation. You can create a different split by providing a value to dataset.train_split. There is also a dataset.improved_split method, which takes a bit longer, but is better for testing the dataset.
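
The following is a small sketch of reading such a binary file directly with pickle, assuming it was created with save_to_binary and therefore contains the dictionary described above (the file name is illustrative):

import pickle

with open("wikidata_2levels.bin", "rb") as f:
    dataset = pickle.load(f)

print(len(dataset["entities"]), "entities")
print(len(dataset["relations"]), "relations")

# Join the three subsets back into a single list of triples.
triples = dataset["test_subs"] + dataset["valid_subs"] + dataset["train_subs"]
print(len(triples), "triples in total")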

Dataset Class

This class is used to create basic datasets. They can be filled from CSV files, JSON files or even simple SPARQL queries.

Methods

Here are all the methods available in the Dataset class.

class kgeserver.dataset.Dataset(sparql_endpoint=None, thread_limiter=4)[source]

Class used to create, modify, export and import Datasets from Wikidata

__init__(sparql_endpoint=None, thread_limiter=4)[source]

Creates the dataset class

The default endpoint is the official Wikidata endpoint.

Parameters:
  • sparql_endpoint (string) – The URI of the SPARQL endpoint
  • thread_limiter (integer) – The number of concurrent HTTP queries
__weakref__

list of weak references to the object (if defined)

_load_elements_into_dict(el_dict, el_list)[source]

Insert elements from a list into a dict

Parameters:
  • el_dict (dict) – The dict into which the elements are inserted
  • el_list (list) – The list containing the elements
_process_entity(entity, verbose=None)[source]

Add all relations and entities related to the given entity to the dataset

Additionally, this method should return a list of the entities it is connected to, so those entities can be scanned in the next exploration level.

This method is not implemented by the parent class. It MUST be implemented by a child class.

Parameters:
  • entity (string) – The URI of the element to be processed
  • verbose (int) – The level of verbosity. 0 is low, and 2 is high
Returns:

Entities to be scanned in next level

Return type:

List

add_element(element, complete_list, complete_list_dict)[source]

Add element to a list of the dataset. Avoids duplicate elements.

Parameters:
  • element (string) – The element that will be added to the list
  • complete_list (list) – The list to which the element will be added
  • complete_list_dict (dict) – The dict which represents the list.
  • only_uri (bool) – Allow loading objects other than URIs
Returns:

The id of the added element in the list

Return type:

integer

add_triple(subject, obj, pred)[source]

Add the triple (subject, obj, pred) to the dataset

This method will add the three elements and append the relation tuple to the dataset

Parameters:
  • subject (string) – Subject of the triple
  • obj (string) – Object of the triple
  • pred (string) – Predicate of the triple
Returns:

If the operation was correct

Return type:

boolean
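
A tiny illustrative call follows; the URIs are made up, and note the argument order of the signature above (subject, object, predicate):

from kgeserver.dataset import Dataset

ds = Dataset()
# Argument order from the signature: subject, obj, pred.
ds.add_triple("http://example.org/Alice",
              "http://example.org/Bob",
              "http://example.org/knows")
ds.show()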

build_levels(n_levels)[source]

Generates a simple chain of triplets for the desired levels

Deprecated:
Parameters:n_levels (integer) – Depth of the search on the Wikidata graph
Returns:A list of chained triplets
Return type:list
build_n_levels_query(n_levels=3)[source]

Builds a CONSTRUCT SPARQL query of the desired depth

Deprecated:
Parameters:n_levels (integer) – Depth of the search on the Wikidata graph
Returns:The desired chained query
Return type:string
check_entity(entity)[source]

Check the entity given and return a valid representation

The parent class assumes all entities are valid

Parameters:entity (string) – The input entity representation
Returns:A valid representation or None
Return type:string
check_relation(relation)[source]

Check the relation given and return a valid representation

The parent class assumes all relations are valid

Parameters:relation (string) – The input relation representation
Returns:A valid representation or None
Return type:string
control_thread()[source]

Starts a loop waiting for the user to request information about progress

This method should not be called from any method other than 'load_dataset_recurrently'

TODO: Should end when parent thread ends...

execute_query(query, headers={'Accept': 'application/json'})[source]

Executes a SPARQL query to the endpoint

Parameters:query (string) – The SPARQL query
Returns:A tuple of (http_status, json_or_error)
exist_element(element, complete_list_dict)[source]

Check if an element exists in a given list

Parameters:
  • element (string) – The element itself
  • complete_list_dict (dict) – The dictionary to search in
Returns:

Whether the item was found or not

Return type:

bool

get_entity(id)[source]

Gets the entity given an id

Parameters:id (integer) – The id to find
get_entity_id(entity)[source]

Gets the id given an entity

Parameters:entity (string) – The entity string
get_relation(id)[source]

Gets the relation given an id

Parameters:id (int) – The relation identifier to find
get_relation_id(relation)[source]

Gets the id given a relation

Parameters:relation (string) – The relation string
get_status()[source]

Returns a formatted string with the current progress

This is a helper method and should not be called from any method other than dataset.load_dataset_recurrently

Returns:Current download progress
Return type:string
improved_split(ratio=0.8)[source]

Split made with the sklearn library, with a different split for each label

This split function makes a different split for each label present in the dataset. This helps to distribute all the splits better.

Parameters:ratio (float) – The ratio of all triplets required for train_subs
Returns:A dictionary with the split subs
Return type:dict
load_dataset_from_csv(file_readable, separator_char=', ')[source]

Given a CSV file, loads it into the dataset

This method will not open or close any file; it should be given an iterable object in which each iteration yields a single line. If the CSV does not use commas, you should also provide the separator character, and the code will split each line with it. Only the first three columns are used, in the order (object, predicate, subject).

Parameters:
  • file_readable (Iterable) – An iterator object
  • separator_char (string) – the separator string used in each line
Returns:

If the process ends correctly

Return type:

boolean
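
A hedged usage sketch (the file name and separator are arbitrary):

from kgeserver.dataset import Dataset

dataset = Dataset()
# The method only consumes an iterable of lines, so the caller opens and closes the file.
with open("triples.csv") as csv_file:
    dataset.load_dataset_from_csv(csv_file, separator_char=",")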

load_dataset_from_json(json)[source]

Loads the dataset object from JSON

The required JSON structure is: {'object': {}, 'subject': {}, 'predicate': {}}

Parameters:json (list) – A list of dictionaries parsed from JSON
Returns:If operation was successful
Return type:bool
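
A hedged sketch follows; the exact shape of the inner dictionaries is not specified above, so the 'value' keys below are an assumption borrowed from the SPARQL JSON results format:

from kgeserver.dataset import Dataset

dataset = Dataset()
# Assumed item shape (not guaranteed by the documentation above).
triples = [
    {"subject": {"value": "http://example.org/Alice"},
     "predicate": {"value": "http://example.org/knows"},
     "object": {"value": "http://example.org/Bob"}},
]
dataset.load_dataset_from_json(triples)
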
load_dataset_from_nlevels(nlevels, extra_params='')[source]

Builds an n-levels query, executes it, and loads the data into the object

Deprecated:
Parameters:
  • nlevels (integer) – Depth of the search on the Wikidata graph
  • extra_params (string) – Extra SPARQL instructions for the query
  • only_uri (bool) – Allow loading objects other than URIs
load_dataset_from_query(query)[source]

Receives a SPARQL query and fills the dataset object with the response

The method will execute the query itself and call another method to fill in the dataset object

Parameters:
  • query (string) – A valid SPARQL query
  • only_uri (bool) – Allow loading objects other than URIs
load_dataset_recurrently(levels, seed_vector, verbose=1, limit_ent=None, ext_callback=<function Dataset.<lambda>>, **keyword_args)[source]

Loads into the dataset all the entities from the seed vector and their relations

Because the Wikidata endpoint cannot execute queries that take a long time to complete, it is necessary to construct the dataset entity by entity, without using SPARQL CONSTRUCT. This method will start several concurrent threads to make multiple SPARQL SELECT queries.

Parameters:
  • seed_vector (list) – A vector of entities to start with
  • levels (integer) – The depth to get triplets
  • verbose (integer) – The level of verbosity. 0 is low, and 2 is high
Returns:

True if operation was successful

Return type:

bool

load_entire_dataset(levels, where='', batch=100000, verbose=True)[source]

Loads the dataset by querying Wikidata to the desired depth

Deprecated:
Parameters:
  • levels (integer) – Depth of the search
  • where (string) – Extra where statements for the SPARQL query
  • batch (integer) – Number of elements returned by each query
  • verbose (bool) – True to show all the steps the method performs
Returns:

True if operation was successful

Return type:

bool

load_from_binary(filepath, **kwargs)[source]

Loads the dataset object from disk

Fills this dataset object from the binary file

Parameters:filepath (string) – The path of the binary file
Returns:True if operation was successful
Return type:bool
load_from_graph_pattern()[source]

Get the root entities where the graph build should start

This should return a list of elements from which to start seeking their children and building a dataset graph from those root elements. This method will return all the entities in the dataset.

MUST be implemented by a child class

Returns:A list of entities
Return type:list
process_entity(entity, append_queue=<function Dataset.<lambda>>, max_tries=10, callback=<function Dataset.<lambda>>, verbose=0, _times=0, **kwargs)[source]

Wrapper for child method dataset._process_entity

Will call the dataset._process_entity method and examine its return value, which should be a list of elements to be queried again, or None.

This method will run in a single thread

Parameters:
  • entity (string) – The URI of the element to be scanned
  • append_queue (function) – A function that receives the subject of a triplet as an argument
  • verbose (integer) – The level of verbosity. 0 is low, and 2 is high
  • callback (function) – The callback function. Default is return
  • max_tries (int) – If an exception is raised, max number of attempts
  • _times (int) – Reserved for recursive calls. Don’t use
Returns:

If operation was successful

Return type:

boolean

save_to_binary(filepath, improved_split=False)[source]

Saves the dataset object to disk

The dataset will be saved in the format required by the original library, ready to be used for training.

Parameters:filepath (string) – The path of the file where the dataset should be saved
Returns:True if operation was successful
Return type:bool
show(verbose=False)[source]

Show all elements of the dataset

By default, prints only one line with the number of entities, relations and triplets. If verbose, prints every list. Use it wisely.

Parameters:verbose (bool) – If true, prints every item of all lists
train_split(ratio=0.8)[source]

Split subs into three lists: train, valid and test

The triplets must have specific names and sizes to be compatible with the original library. Splits the original triplets (self.subs) into three different lists: train_subs, valid_subs and test_subs. The 'ratio' param determines the fraction assigned to train_subs; the rest is divided, half for valid and the other half for test.

Parameters:ratio (float) – The ratio of all triplets required for train_subs
Returns:A dictionary with the split subs
Return type:dict

WikidataDataset

This class enables you to generate a dataset from the information in the Wikidata knowledge base. It only needs a simple graph pattern to start building a dataset. An example of a graph pattern that can be passed to the WikidataDataset.load_from_graph_pattern method:

"{ ?subject wdt:P950 ?bne . ?subject ?predicate ?object }"

It is required to bind at least three variables, because they will be used in the next queries. Those variables should be called "?subject", "?predicate" and "?object".
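
For example, a short sketch (passing the pattern through the where keyword argument is an assumption based on the signature documented below):

from kgeserver.wikidata_dataset import WikidataDataset

wikidata = WikidataDataset()
pattern = "{ ?subject wdt:P950 ?bne . ?subject ?predicate ?object }"
seed_vector = wikidata.load_from_graph_pattern(where=pattern)
print(len(seed_vector), "seed entities found")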

Methods

class kgeserver.wikidata_dataset.WikidataDataset(sparql_endpoint=None, thread_limiter=4)[source]
__init__(sparql_endpoint=None, thread_limiter=4)[source]

Creates the WikidataDataset class

The default endpoint is the official Wikidata endpoint.

Parameters:
  • sparql_endpoint (string) – The URI of the SPARQL endpoint
  • thread_limiter (integer) – The number of concurrent HTTP queries
_process_entity(entity, verbose=0, graph_pattern='{0} ?predicate ?object . ?predicate a owl:ObjectProperty . FILTER NOT EXISTS {{ ?object a wikibase:BestRank }}')[source]

Takes an entity and explores all relations and entities related to it

This will execute a SPARQL query with the given params to build the dataset from the object elements of the triples retrieved from the server.

Returns:A list with new entities to be scanned
check_entity(entity)[source]

Check the entity given and return a valid representation

Parameters:entity (string) – The input entity representation
Returns:A valid representation or None
Return type:string
check_relation(relation)[source]

Check the relation given and return a valid representation

Parameters:relation (string) – The input relation representation
Returns:A valid representation or None
Return type:string
entity_labels(entity, langs=['es', 'en'])[source]

Saves the label for a given entity

Makes a SPARQL query to retrieve the requested label(s) of the entity so they can be used in other services.

Some SPARQL endpoints may return more languages than requested. E.g.: Wikidata will return 'en-ca', 'en-gb', 'en-us' and more, if available, when 'en' has been requested. Those languages will also be returned by this function.

Sample call: wd.entity_labels("Q1", langs=['en', 'es'])
Sample return value: {'en-ca': 'universe', 'es': 'universo', 'en-gb': 'universe', 'en': 'universe'}
Parameters:
  • entity (string) – The entity to query for
  • langs (list) – The languages to be asked for
Returns:

The label on each requested language

Return type:

dict

extract_entity(entity, filters={'literal': False, 'wdt-statement': False, 'wdt-entity': True, 'wdt-reference': False, 'wdt-prop': True, 'bnode': False})[source]

Given an entity, returns the valid representation, ready to be saved

The filters argument allows you to avoid adding elements to lists that will not be used. It is a dictionary of the shape {'filter': bool}. The valid filters (and their defaults) are:

  • wdt-entity - True
  • wdt-reference - False
  • wdt-statement - False
  • wdt-prop - True
  • literal - False
  • bnode - False
Deprecated:

Must be implemented in a child class

Parameters:
  • entity (dict) – The entity to be analyzed
  • filters (dict) – A dictionary to allow filter entities
Returns:

The entity itself or False

extract_from_statement(entity, uri)[source]

Extract triplets from a statement

It should receive the entity which is the subject of the triple and the URI of the statement

Parameters:
  • entity (string) – The entity to which the statement is related
  • uri (string) – The uri of the statement
Returns:

The entities the statement is related to

Return type:

list

get_entity(id)[source]

Gets the entity URI given an id

Parameters:id (integer) – The id to find
get_entity_id(entity)[source]

Gets the id given an entity

Parameters:entity (string) – The entity string
get_relation(id)[source]

Gets the relation URI given an id

Parameters:id (int) – The relation identifier to find
get_relation_id(relation)[source]

Gets the id given a relation

Parameters:relation (string) – The relation string
get_seed_vector(verbose=0, where='?subject wdt:P950 ?bne .')[source]

Auxiliary method that outputs a list of seed elements

This seed vector will contain the 'root nodes' of a tree with the desired depth, built by the parent class (load_dataset_recurrently)

Parameters:
  • verbose – The desired level of verbosity
  • where (string) – The SPARQL where clause used to construct the query
Returns:

A list of entities

Return type:

list

is_statement(uri)[source]

Check if a URI is a Wikidata statement

Parameters:uri (string) – The URI to test
Returns:Whether the URI is a statement or not
Return type:boolean
load_from_graph_pattern(verbose=0, where='', **kwargs)[source]

Auxiliary method that outputs a list of seed elements

This seed vector will contain the 'root nodes' of a tree with the desired depth, built by the parent class (load_dataset_recurrently)

Parameters:
  • verbose – The desired level of verbosity
  • where (string) – The SPARQL where clause used to construct the query
  • batch_size (int) – The size of batches queried each time
Returns:

A list of entities

Return type:

list

ESDBpediaDataset

Similarly to what happens with WikidataDataset, this class allows you to create datasets from the Spanish DBpedia. The graph pattern you should pass to the ESDBpediaDataset.load_from_graph_pattern method looks like this:

{ ?subject dcterms:subject <http://es.dbpedia.org/resource/Categoría:Trenes_de_alta_velocidad> . ?subject ?predicate ?object }

As for WikidataDataset, you need to bind the same three variables: "?subject", "?predicate" and "?object".
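
A hedged sketch for the Spanish DBpedia case (depth and file name are illustrative; the where keyword assumption is the same as above):

from kgeserver.dbpedia_dataset import ESDBpediaDataset

dbpedia = ESDBpediaDataset()
pattern = ("{ ?subject dcterms:subject "
           "<http://es.dbpedia.org/resource/Categoría:Trenes_de_alta_velocidad> . "
           "?subject ?predicate ?object }")
seed_vector = dbpedia.load_from_graph_pattern(where=pattern)
dbpedia.load_dataset_recurrently(2, seed_vector)
dbpedia.save_to_binary("es_dbpedia_trains.bin")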

Methods

class kgeserver.dbpedia_dataset.ESDBpediaDataset(thread_limiter=2)[source]
__init__(thread_limiter=2)[source]

Creates the ESDBpediaDataset class

The default endpoint is the Spanish DBpedia one.

Parameters:
  • new_endpoint (string) – The URI of the SPARQL endpoint
  • thread_limiter (integer) – The number of concurrent HTTP queries
_process_entity(entity, verbose=0, graph_pattern='{0} ?predicate ?object . ')[source]

Takes an entity and explores all relations and entities related to it

This will execute a SPARQL query with the given params to build the dataset from the object elements of the triples retrieved from the server.

Returns:A list with new entities to be scanned
load_from_graph_pattern(verbose=0, where='', **kwargs)[source]

Auxiliary method that outputs a list of seed elements

This seed vector will contain the 'root nodes' of a tree with the desired depth, built by the parent class (load_dataset_recurrently)

Parameters:
  • verbose – The desired level of verbosity
  • where (string) – The SPARQL where clause used to construct the query
  • batch_size (int) – The size of batches queried each time
Returns:

A list of entities

Return type:

list