Process Functions

Get subclasses

Get a list of Wikidata IDs of entities which are subclasses of the subject.

wikidatasets.processFunctions.get_subclasses(subject)

Get a list of Wikidata IDs of entities which are subclasses of the subject.

Parameters: subject (str) – Wikidata ID of the subject (e.g. 'Q5' for human).
Returns: result – List of Wikidata IDs of entities which are subclasses of the subject.
Return type: list
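
To illustrate what a subclass lookup involves, here is a minimal sketch that builds a SPARQL query for the transitive subclasses of a subject and extracts the IDs from a result in the standard SPARQL JSON format. Whether get_subclasses works this way internally is an assumption; build_subclass_query and parse_subclass_ids are hypothetical helper names, and the sample response is hand-written with made-up IDs so the example runs without network access.

```python
import json


def build_subclass_query(subject):
    """Return a SPARQL query listing every transitive subclass of `subject`."""
    # wdt:P279 is Wikidata's "subclass of" property; the `+` path operator
    # follows it one or more times, so indirect subclasses are included.
    return "SELECT ?item WHERE { ?item wdt:P279+ wd:%s . }" % subject


def parse_subclass_ids(response_text):
    """Extract bare Wikidata IDs from a SPARQL JSON result."""
    data = json.loads(response_text)
    ids = []
    for binding in data["results"]["bindings"]:
        # Values are full URIs such as http://www.wikidata.org/entity/Q111;
        # keep only the part after the last slash.
        uri = binding["item"]["value"]
        ids.append(uri.rsplit("/", 1)[-1])
    return ids


# Hand-written sample response (made-up IDs, for illustration only).
sample_response = json.dumps({
    "results": {
        "bindings": [
            {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q111"}},
            {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q222"}},
        ]
    }
})

print(build_subclass_query("Q5"))
print(parse_subclass_ids(sample_response))
```

With the library installed, the documented call itself would simply be get_subclasses('Q5').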

Query wikidata dump

Goes through a Wikidata dump and collects the entities that are instances of the test entities, the dictionary of labels, or both.

wikidatasets.processFunctions.query_wikidata_dump(dump_path, path, n_lines, test_entities=None, collect_labels=False)

This function goes through a Wikidata dump and collects the entities that are instances of test_entities, the dictionary of labels, or both.

Parameters:
  • dump_path (str) – Path to the latest-all.json.bz2 file downloaded from https://dumps.wikimedia.org/wikidatawiki/entities/.
  • path (str) – Path to where pickle files will be written.
  • n_lines (int) – Number of lines in the dump. The fastest way I found to count them is $ bzgrep -c ".*" latest-all.json.bz2. An upper bound is enough, as the value is only used to display a progress bar.
  • test_entities (list) – List of entities to test each line against. For each line (entity), we check whether it has a fact of the type (id, query_rel, test_entity).
  • collect_labels (bool) – Boolean indicating whether the labels dictionary should be collected.
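
The scan described above can be sketched as follows. This is not the library's implementation: scan_dump is a hypothetical name, query_rel is assumed to default to P31 ("instance of"), only English labels are kept, and results are returned rather than pickled to path. The sketch relies on the dump layout in which each entity sits on its own line inside one large JSON array, and it runs on a tiny hand-made dump with made-up entity IDs.

```python
import bz2
import json
import os
import tempfile


def scan_dump(dump_path, test_entities=None, collect_labels=False, query_rel="P31"):
    """Return (matching entity IDs, {entity ID: English label})."""
    test_entities = set(test_entities or [])
    matches, labels = [], {}
    with bz2.open(dump_path, "rt", encoding="utf-8") as dump:
        for line in dump:
            # Each entity line ends with a comma; the first and last lines
            # of the dump are the array brackets.
            line = line.strip().rstrip(",")
            if line in ("", "[", "]"):
                continue
            entity = json.loads(line)
            if collect_labels:
                label = entity.get("labels", {}).get("en", {}).get("value")
                if label is not None:
                    labels[entity["id"]] = label
            # Look for a fact of the type (id, query_rel, test_entity).
            for claim in entity.get("claims", {}).get(query_rel, []):
                value = claim["mainsnak"].get("datavalue", {}).get("value")
                if isinstance(value, dict) and value.get("id") in test_entities:
                    matches.append(entity["id"])
                    break
    return matches, labels


# Tiny hand-made dump (made-up entity IDs) to exercise the function.
sample_entities = [
    {"id": "Q1001", "labels": {"en": {"value": "example person"}},
     "claims": {"P31": [{"mainsnak": {"datavalue": {"value": {"id": "Q5"}}}}]}},
    {"id": "Q1002", "labels": {"en": {"value": "example thing"}}, "claims": {}},
]
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.json.bz2")
    with bz2.open(sample_path, "wt", encoding="utf-8") as f:
        f.write("[\n" + ",\n".join(json.dumps(e) for e in sample_entities) + "\n]\n")
    matches, labels = scan_dump(sample_path, test_entities=["Q5"], collect_labels=True)

print(matches)
print(labels)
```

Reading the compressed file line by line keeps memory use flat, which matters because the full dump is far too large to load at once.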

Build dataset

Builds datasets from the pickle files produced by query_wikidata_dump.

wikidatasets.processFunctions.build_dataset(path, labels, return_=False, dump_date='23rd April 2019')

Builds datasets from the pickle files produced by query_wikidata_dump.

Parameters:
  • path (str) – Path to a directory that already contains a pickles/ directory. All the .pkl files in pickles/ are concatenated into one dataset.
  • labels (dict) – Dictionary collected by the query_wikidata_dump function when collect_labels is set to True.
  • return_ (bool) – Boolean indicating whether the built dataset should be returned in addition to being written to disk.
  • dump_date (str) – String indicating the date of the Wikidata dump used. It is used in the readme of the dataset.
Returns:
  • edges (pandas.DataFrame) – DataFrame containing the edges between entities of the graph.
  • attributes (pandas.DataFrame) – DataFrame containing edges linking entities to their attributes.
  • entities (pandas.DataFrame) – DataFrame containing a list of all entities & attributes with their Wikidata IDs and labels.
  • relations (pandas.DataFrame) – DataFrame containing a list of all relations with their Wikidata IDs and labels.
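
As a rough illustration of the concatenation step, the sketch below loads every .pkl file found under path/pickles/ and splits the facts into edges and attributes. Several things here are assumptions rather than the library's behaviour: each pickle is taken to hold a list of (head, relation, tail) triples, a fact counts as an edge only when its tail is itself one of the collected entities, and plain lists stand in for the pandas DataFrames the real function returns. load_facts and split_facts are hypothetical names, and the entity IDs are made up.

```python
import pickle
import tempfile
from pathlib import Path


def load_facts(path):
    """Concatenate the lists stored in every .pkl file under path/pickles/."""
    facts = []
    for pkl in sorted(Path(path, "pickles").glob("*.pkl")):
        with open(pkl, "rb") as f:
            facts.extend(pickle.load(f))
    return facts


def split_facts(facts, entity_ids):
    """Split facts into entity-to-entity edges and entity-to-attribute facts."""
    edges, attributes = [], []
    for head, rel, tail in facts:
        # Assumption: the tail must be a collected entity for the fact
        # to count as an edge of the graph.
        if tail in entity_ids:
            edges.append((head, rel, tail))
        else:
            attributes.append((head, rel, tail))
    return edges, attributes


# Two small pickle files with made-up entity IDs.
with tempfile.TemporaryDirectory() as tmp:
    pickles_dir = Path(tmp, "pickles")
    pickles_dir.mkdir()
    with open(pickles_dir / "part_0.pkl", "wb") as f:
        pickle.dump([("Q1001", "P50", "Q1002")], f)
    with open(pickles_dir / "part_1.pkl", "wb") as f:
        pickle.dump([("Q1002", "P27", "Q9999")], f)

    facts = load_facts(tmp)
    collected = {head for head, _, _ in facts}
    edges, attributes = split_facts(facts, collected)

print(edges)
print(attributes)
```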