Dataset module¶
Introduction¶
The dataset class is used to create a valid object intended to be used within the Experiment class. This approach allows us to easily create different models to be trained, leaving the complexity to the class itself.
There are different types of datasets. The main class is kgeserver.dataset.Dataset, which only allows creating basic datasets from a CSV or JSON file, without any restriction on triples or relations; this can pull in data that is not useful for the dataset and makes the binary file very large.
Several Dataset subclasses work with some of the best-known free knowledge graphs on the Internet:
- WikidataDataset: This class can handle all queries to the Wikidata portal, including Wikidata ids prefixed with Q, like Q1492. It is fully ready to perform any query you need, with very good results.
- ESDBpediaDataset: This class is not as mature as WikidataDataset, but it is able to perform SPARQL queries to get entities and relations from the Spanish DBpedia.
The most interesting feature these Dataset classes provide is building a local dataset by making multiple parallel queries to the SPARQL endpoints to retrieve all the information about a given topic. You start by getting a seed_vector of the entities you want to focus on, and then build an n-level graph by querying each entity for its relations with other, new entities.
The seed vector can be obtained through the load_from_graph_pattern method. After that, you should store it in a variable and pass it as an argument to the load_dataset_recurrently method. This is the function that makes several queries to fill the dataset with the desired levels of depth.
To save the dataset in a binary format, use the save_to_binary method. This allows the dataset to be opened later without executing any query.
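A minimal sketch of that workflow, assuming the WikidataDataset class and a graph pattern like the one shown later on this page. The output file name is illustrative, and depending on the version the pattern may instead be configured through get_seed_vector rather than passed to load_from_graph_pattern:

from kgeserver.wikidata_dataset import WikidataDataset

# Create a dataset bound to the default Wikidata SPARQL endpoint
dataset = WikidataDataset(thread_limiter=4)

# 1. Obtain the seed vector: the root entities the graph will grow from
seed_vector = dataset.load_from_graph_pattern(
    "{ ?subject wdt:P950 ?bne . ?subject ?predicate ?object }")

# 2. Expand the graph two levels deep with parallel SPARQL SELECT queries
dataset.load_dataset_recurrently(2, seed_vector, verbose=1)

# 3. Save the result so it can be reopened later without any query
dataset.save_to_binary("wikidata_dataset.bin")   # illustrative file name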
Binary Dataset¶
The binary dataset files are created using Pickle. They basically store all the entities, all the relations and all the triples, plus some extra information needed to rebuild the dataset later. The binary file is stored as a Python dictionary which contains the following keys: __class__, relations, test_subs, valid_subs, train_subs and entities.
The relations and entities entries are lists, and their lengths indicate how many relations or entities the dataset has. The __class__ entry is for internal use of the kgeserver.dataset class. The triples are stored in three different entries, called test_subs, valid_subs and train_subs. Those subsets are created to be used by the next module, the algorithm module, which will evaluate the dataset. This is common practice when machine learning algorithms are used. If you need all the triples, they can be joined easily in Python by concatenating the three lists:
triples = dataset["test_subs"] + dataset["valid_subs"] + dataset["train_subs"]
The commonly used split ratio is 80% of the triples for training, with the rest divided equally between test and validation triples. You can create a different split by passing a ratio value to dataset.train_split. There is also a dataset.improved_split method, which takes a bit longer to compute but is better for testing the dataset.
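For instance, a sketch of reopening a saved binary file and regenerating the splits with a different ratio. The file name is illustrative, and the key layout is the one described above:

import pickle

from kgeserver.dataset import Dataset

# Reopen a previously saved binary dataset without running any query
dataset = Dataset()
dataset.load_from_binary("wikidata_dataset.bin")

# Recreate the subsets with a 90/5/5 split instead of the default 80/10/10;
# both methods return a dictionary with the split subsets
splits = dataset.train_split(ratio=0.9)
better_splits = dataset.improved_split(ratio=0.9)   # slower, per-label split

# The binary file itself is a pickled dictionary, so it can also be read directly
with open("wikidata_dataset.bin", "rb") as f:
    raw = pickle.load(f)
triples = raw["test_subs"] + raw["valid_subs"] + raw["train_subs"]
print(len(raw["entities"]), len(raw["relations"]), len(triples))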
Dataset Class¶
This class is used to create basic datasets. They can be filled from CSV files, JSON files or even simple SPARQL queries.
Methods¶
All the methods available on the dataset class are shown here.
- class kgeserver.dataset.Dataset(sparql_endpoint=None, thread_limiter=4)[source]¶ Class used to create, modify, export and import Datasets from Wikidata
- __init__(sparql_endpoint=None, thread_limiter=4)[source]¶ Creates the dataset class
The default endpoint is the original Wikidata one.
Parameters: - sparql_endpoint (string) – The URI of the SPARQL endpoint
- thread_limiter (integer) – The number of concurrent HTTP queries
- __weakref__¶ list of weak references to the object (if defined)
- _load_elements_into_dict(el_dict, el_list)[source]¶ Insert elements from a list into a dict
Parameters:
- _process_entity(entity, verbose=None)[source]¶ Add all relations and entities related with the entity to the dataset
Additionally, this method should return a list of the entities it is connected to, so those entities can be scanned in the next level of exploration.
This method is not implemented by the parent class. It MUST be implemented through a child object
Parameters: Returns: Entities to be scanned in the next level
Return type: List
- add_element(element, complete_list, complete_list_dict)[source]¶ Add element to a list of the dataset. Avoids duplicate elements.
Parameters: Returns: The id on the list of the added element
Return type: integer
- add_triple(subject, obj, pred)[source]¶ Add the triple (subject, object, pred) to the dataset
This method will add the three elements and append the tuple of the relation to the dataset
Parameters: Returns: If the operation was correct
Return type: boolean
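As an illustration of add_triple, a minimal sketch follows; the URIs below are placeholders, not real Wikidata identifiers:

from kgeserver.dataset import Dataset

dataset = Dataset()
# Registers the subject, object and predicate (avoiding duplicates) and
# appends the triple to the dataset; returns True if the operation succeeded
ok = dataset.add_triple("http://example.org/Q1",   # subject (placeholder)
                        "http://example.org/Q2",   # object (placeholder)
                        "http://example.org/P1")   # predicate (placeholder)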
- build_levels(n_levels)[source]¶ Generates a simple chain of triplets for the desired levels
Deprecated: Parameters: n_levels (integer) – Depth of the search on the Wikidata graph Returns: A list of chained triplets Return type: list
- build_n_levels_query(n_levels=3)[source]¶ Builds a CONSTRUCT SPARQL query of the desired depth
Deprecated: Parameters: n_levels (integer) – Depth of the search on the Wikidata graph Returns: The desired chained query Return type: string
- check_entity(entity)[source]¶ Check the entity given and return a valid representation
The parent class assumes all entities are valid
Parameters: entity (string) – The input entity representation Returns: A valid representation or None Return type: string
- check_relation(relation)[source]¶ Check the relation given and return a valid representation
The parent class assumes all relations are valid
Parameters: relation (string) – The input relation representation Returns: A valid representation or None Return type: string
- control_thread()[source]¶ Starts a loop waiting for the user to request information about progress
This method should not be called from any method other than load_dataset_recurrently
TODO: Should end when parent thread ends...
- execute_query(query, headers={'Accept': 'application/json'})[source]¶ Executes a SPARQL query to the endpoint
Parameters: query (string) – The SPARQL query Returns: A tuple compound of (http_status, json_or_error)
- exist_element(element, complete_list_dict)[source]¶ Check if an element exists in a given list
Parameters: Returns: Whether the item was found or not
Return type:
- get_entity_id(entity)[source]¶ Gets the id given an entity
Parameters: entity (string) – The entity string
- get_relation(id)[source]¶ Gets the relation given an id
Parameters: id (int) – The relation identifier to find
- get_relation_id(relation)[source]¶ Gets the id given a relation
Parameters: relation (string) – The relation string
- get_status()[source]¶ Returns a formatted string with the current progress
This is a helper method and should not be called from any method other than dataset.load_dataset_recurrently
Returns: Current download progress Return type: string
- improved_split(ratio=0.8)[source]¶ Split made with the sklearn library, with a different split for each label
This split function makes a different split for each label present in the dataset, which helps to distribute all the splits better.
Parameters: ratio (float) – The ratio of all triplets required for train_subs Returns: A dictionary with the split subs Return type: dict
- load_dataset_from_csv(file_readable, separator_char=', ')[source]¶ Given a CSV file, loads it into the dataset
This method will not open or close any file; it should be given an iterable object that yields one line per iteration. If the CSV does not use commas, you should also provide the separator char, and the code will try to split each line with it. Only the first three columns are used, in the order (object, predicate, subject).
Parameters: - file_readable (Iterable) – An iterator object
- separator_char (string) – the separator string used in each line
Returns: If the process ends correctly
Return type: boolean
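A short sketch of how this could be called; the file name and separator are illustrative, and the caller is responsible for opening and closing the file:

from kgeserver.dataset import Dataset

dataset = Dataset()
# The method expects an iterable that yields one line per iteration,
# reading the first three columns as (object, predicate, subject)
with open("triples.csv") as csv_file:
    dataset.load_dataset_from_csv(csv_file, separator_char=";")
dataset.show()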
- load_dataset_from_json(json)[source]¶ Loads the dataset object with a JSON
The JSON structure required is: {'object': {}, 'subject': {}, 'predicate': {}}
Parameters: json (list) – A list of dictionary parsed from JSON Returns: If operation was successful Return type: bool
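A hedged sketch of how this might be used; each list item carries the three keys shown above, written here in the SPARQL JSON-results style ({'value': ...}), although the exact inner structure of each entry may differ between versions:

import json

from kgeserver.dataset import Dataset

dataset = Dataset()
# A list of dictionaries parsed from JSON, one per triple (placeholder URIs)
triples = json.loads("""[
  {"subject":   {"value": "http://example.org/Q1"},
   "predicate": {"value": "http://example.org/P1"},
   "object":    {"value": "http://example.org/Q2"}}
]""")
dataset.load_dataset_from_json(triples)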
- load_dataset_from_nlevels(nlevels, extra_params='')[source]¶ Builds an n-levels query, executes it, and loads the data into the object
Deprecated: Parameters:
- load_dataset_from_query(query)[source]¶ Receives a SPARQL query and fills the dataset object with the response
The method will execute the query itself and will call another method to fill in the dataset object
Parameters:
- load_dataset_recurrently(levels, seed_vector, verbose=1, limit_ent=None, ext_callback=<function Dataset.<lambda>>, **keyword_args)[source]¶ Loads into the dataset all entities with a BNE ID and their relations
Because the Wikidata endpoint cannot execute queries that take a long time to complete, it is necessary to construct the dataset entity by entity, without using SPARQL CONSTRUCT. This method will concurrently start several threads to make multiple SPARQL SELECT queries.
Parameters: - seed_vector (list) – A vector of entities to start with
- levels (integer) – The depth to get triplets
- verbose (integer) – The level of verbosity. 0 is low, and 2 is high
Returns: True if operation was successful
Return type:
- load_entire_dataset(levels, where='', batch=100000, verbose=True)[source]¶ Loads the dataset by querying Wikidata on the desired levels
Deprecated: Parameters: Returns: True if operation was successful
Return type:
- load_from_binary(filepath, **kwargs)[source]¶ Loads the dataset object from the disk
Loads this dataset object with the binary file
Parameters: filepath (string) – The path of the binary file Returns: True if operation was successful Return type: bool
- load_from_graph_pattern()[source]¶ Get the root entities where the graph build should start
This should return a list of elements whose children will be explored, so a dataset graph can be built from these root elements. This method will return all the entities in the dataset.
MUST be implemented through a child object
Returns: An entities list Return type: list
- process_entity(entity, append_queue=<function Dataset.<lambda>>, max_tries=10, callback=<function Dataset.<lambda>>, verbose=0, _times=0, **kwargs)[source]¶ Wrapper for the child method dataset._process_entity
It will call dataset._process_entity and examine the return value, which should be a list of elements to be queried again, or None.
This method will run in a single thread
Parameters: - element (string) – The URI of element to be scanned
- append_queue (function) – A function that receives the subject of a triplet as an argument
- verbose (integer) – The level of verbosity. 0 is low, and 2 is high
- callback (function) – The callback function. Default is return
- max_tries (int) – If an exception is raised, max number of attempts
- _times (int) – Reserved for recursive calls. Don’t use
Returns: If operation was successful
Return type: boolean
- save_to_binary(filepath, improved_split=False)[source]¶ Saves the dataset object on the disk
The dataset will be saved in the format required by the original library, ready to be trained.
Parameters: filepath (string) – The path of the file where should be saved Returns: True if operation was successful Return type: bool
- show(verbose=False)[source]¶ Show all elements of the dataset
By default it prints only one line with the number of entities, relations and triplets. If verbose, it prints every list. Use wisely.
Parameters: verbose (bool) – If true prints every item of all lists
- train_split(ratio=0.8)[source]¶ Split subs into three lists: train, valid and test
The triplets must have a specific name and size to be compatible with the original library. This splits the original triplets (self.subs) into three different lists: train_subs, valid_subs and test_subs. The ratio param sets the share kept for train_subs; the rest is split in half between valid and test.
Parameters: ratio (float) – The ratio of all triplets required for train_subs Returns: A dictionary with the split subs Return type: dict
WikidataDataset¶
This class lets you generate a dataset from the information present in the Wikidata knowledge base. It only needs a simple graph pattern to start building a dataset. An example of a graph pattern that should be passed to the WikidataDataset.load_from_graph_pattern method:
"{ ?subject wdt:P950 ?bne . ?subject ?predicate ?object }"
It is required to bind at least three variables, because they will be used in the following queries. Those variables should be called ?subject, ?predicate and ?object.
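A short sketch of building a Wikidata dataset; here the seed entities are obtained with get_seed_vector (documented below), which wraps a WHERE clause equivalent to the pattern above, and the output file name is illustrative:

from kgeserver.wikidata_dataset import WikidataDataset

wd = WikidataDataset()

# Seed entities: everything with a BNE identifier (wdt:P950), as in the pattern above
seeds = wd.get_seed_vector(where="?subject wdt:P950 ?bne .")

# Grow the graph one level around those seeds and save it for later use
wd.load_dataset_recurrently(1, seeds, verbose=1)
wd.save_to_binary("wikidata_bne.bin")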
Methods¶
- class kgeserver.wikidata_dataset.WikidataDataset(sparql_endpoint=None, thread_limiter=4)[source]¶
- __init__(sparql_endpoint=None, thread_limiter=4)[source]¶ Creates the WikidataDataset class
The default endpoint is the original Wikidata one.
Parameters: - sparql_endpoint (string) – The URI of the SPARQL endpoint
- thread_limiter (integer) – The number of concurrent HTTP queries
- _process_entity(entity, verbose=0, graph_pattern='{0} ?predicate ?object . ?predicate a owl:ObjectProperty . FILTER NOT EXISTS {{ ?object a wikibase:BestRank }}')[source]¶ Take an entity and explore all relations and entities related to it
This will execute the SPARQL query with the passed params to build the dataset from the object elements of the triples retrieved from the server.
Returns: A list with new entities to be scanned
- check_entity(entity)[source]¶ Check the entity given and return a valid representation
Parameters: entity (string) – The input entity representation Returns: A valid representation or None Return type: string
- check_relation(relation)[source]¶ Check the relation given and return a valid representation
Parameters: relation (string) – The input relation representation Returns: A valid representation or None Return type: string
- entity_labels(entity, langs=['es', 'en'])[source]¶ Saves the label for a given entity
Makes a SPARQL query to retrieve the requested label(s) of the entity, so they can be used by other services.
Some SPARQL endpoints may return more languages than requested. E.g. Wikidata will return 'en-ca', 'en-gb', 'en-us' and more, if available, when 'en' has been requested. Those languages will also be returned by this function.
Sample call: wd.entity_labels("Q1", langs=['en', 'es'])
Sample return value: {'en-ca': 'universe', 'es': 'universo', 'en-gb': 'universe', 'en': 'universe'}
Parameters: Returns: The label in each requested language
Return type: dict
- extract_entity(entity, filters={'literal': False, 'wdt-statement': False, 'wdt-entity': True, 'wdt-reference': False, 'wdt-prop': True, 'bnode': False})[source]¶ Given an entity, returns the valid representation, ready to be saved
The filters argument allows avoiding the addition of elements into lists that will not be used. It is a dictionary with the shape {'filter': bool}. The valid filters (and their defaults) are:
- wdt-entity - True
- wdt-reference - False
- wdt-statement - False
- wdt-prop - True
- literal - False
- bnode - False
Deprecated: Must be implemented in a child class
Parameters: Returns: The entity itself or False
- extract_from_statement(entity, uri)[source]¶ Extract triplets from a statement
It should receive the entity which is the subject of the triple and the URI of the statement
Parameters: Returns: The entities the statement is related to
Return type:
- get_entity_id(entity)[source]¶ Gets the id given an entity
Parameters: entity (string) – The entity string
- get_relation(id)[source]¶ Gets the relation URI given an id
Parameters: id (int) – The relation identifier to find
- get_relation_id(relation)[source]¶ Gets the id given a relation
Parameters: relation (string) – The relation string
- get_seed_vector(verbose=0, where='?subject wdt:P950 ?bne .')[source]¶ Auxiliary method that outputs a list of seed elements
This seed vector will contain the 'root nodes' of a tree with the desired depth, built by the parent class (load_dataset_recurrently)
Parameters: - verbose – The desired level of verbosity
- where (string) – The SPARQL WHERE clause used to construct the query
Returns: A list of entities
Return type:
- is_statement(uri)[source]¶ Check if a URI is a Wikidata statement
Parameters: uri (string) – The URI to test Returns: Whether it is a statement or not Return type: boolean
ESDBpediaDataset¶
Similarly to what happens with WikidataDataset, this class allows you to create datasets from the Spanish DBpedia. The graph pattern you should pass to the ESDBpediaDataset.load_from_graph_pattern method looks like this:
{ ?subject dcterms:subject <http://es.dbpedia.org/resource/Categoría:Trenes_de_alta_velocidad> . ?subject ?predicate ?object }
As with WikidataDataset, you need to bind the same three variables: ?subject, ?predicate and ?object.
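A sketch of the equivalent workflow for the Spanish DBpedia; the output file name is illustrative, and the call to load_from_graph_pattern follows the usage described here (check the reference below for the exact signature in your version):

from kgeserver.dbpedia_dataset import ESDBpediaDataset

es = ESDBpediaDataset(thread_limiter=2)

# Seed the dataset with Spanish high-speed trains, then expand one level
pattern = ("{ ?subject dcterms:subject "
           "<http://es.dbpedia.org/resource/Categoría:Trenes_de_alta_velocidad> . "
           "?subject ?predicate ?object }")
seeds = es.load_from_graph_pattern(pattern)
es.load_dataset_recurrently(1, seeds, verbose=1)
es.save_to_binary("dbpedia_trains.bin")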
Methods¶
- class kgeserver.dbpedia_dataset.ESDBpediaDataset(thread_limiter=2)[source]¶
- __init__(thread_limiter=2)[source]¶ Creates the ESDBpediaDataset class
The default endpoint is the Spanish DBpedia one.
Parameters: - new_endpoint (string) – The URI of the SPARQL endpoint
- thread_limiter (integer) – The number of concurrent HTTP queries
- _process_entity(entity, verbose=0, graph_pattern='{0} ?predicate ?object . ')[source]¶ Take an entity and explore all relations and entities related to it
This will execute the SPARQL query with the passed params to build the dataset from the object elements of the triples retrieved from the server.
Returns: A list with new entities to be scanned