Process Functions
Get subclasses
Get a list of Wikidata IDs of entities which are subclasses of the subject.
wikidatasets.processFunctions.get_subclasses(subject)
Get a list of Wikidata IDs of entities which are subclasses of the subject.
Parameters: - subject (str) – String describing the subject (e.g. 'Q5' for human).
Returns: - result (list) – List of Wikidata IDs of entities which are subclasses of the subject.
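The page does not say how the list of subclasses is obtained. As an illustration only, the sketch below assumes a SPARQL query over the "subclass of" property (P279) and a response in the standard SPARQL JSON results layout. The helper names build_subclass_query and extract_ids, the transitive `+` path, and the IDs in the sample response are all hypothetical and are not part of wikidatasets.

```python
import re


def build_subclass_query(subject: str) -> str:
    """Build a SPARQL query selecting items that are subclasses of `subject`.

    Assumption: subclasses are followed transitively (one or more P279 hops).
    """
    if not re.fullmatch(r"Q\d+", subject):
        raise ValueError(f"not a Wikidata item ID: {subject!r}")
    # P279 is Wikidata's "subclass of" property.
    return f"SELECT ?item WHERE {{ ?item wdt:P279+ wd:{subject} . }}"


def extract_ids(response: dict) -> list:
    """Pull the item IDs out of a SPARQL JSON results document."""
    ids = []
    for binding in response["results"]["bindings"]:
        uri = binding["item"]["value"]
        # Entity URIs end with the item ID, e.g. .../entity/Q100.
        ids.append(uri.rsplit("/", 1)[-1])
    return ids


# Hand-made response with placeholder IDs, shaped like a query service reply.
sample_response = {
    "head": {"vars": ["item"]},
    "results": {
        "bindings": [
            {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q100"}},
            {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q200"}},
        ]
    },
}

query = build_subclass_query("Q5")
subclass_ids = extract_ids(sample_response)
```

No network request is made here; sending the query to an endpoint is left out so the sketch stays self-contained.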
Query Wikidata dump
Go through a Wikidata dump. It can either collect entities that are instances of test entities or collect the dictionary of labels. It can also do both.
wikidatasets.processFunctions.query_wikidata_dump(dump_path, path, n_lines, test_entities=None, collect_labels=False)
This function goes through a Wikidata dump. It can either collect entities that are instances of test_entities or collect the dictionary of labels. It can also do both.
Parameters: - dump_path (str) – Path to the latest-all.json.bz2 file downloaded from https://dumps.wikimedia.org/wikidatawiki/entities/.
- path (str) – Path to where pickle files will be written.
- n_lines (int) – Number of lines in the dump. Fastest way I found to count them was $ bzgrep -c ".*" latest-all.json.bz2. This can be an upper bound, as it is only used for displaying a progress bar.
- test_entities (list) – List of entities to test each line against. For each line (entity), we check whether it has a fact of the type (id, query_rel, test_entity).
- collect_labels (bool) – Boolean indicating whether the labels dictionary should be collected.
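The per-line check described above can be sketched as follows. This is not the library's implementation: it assumes query_rel is P31 ("instance of"), that only English labels are wanted, and it uses two hand-made, heavily trimmed records in place of a real dump. The helper names parse_dump_line, is_instance_of and english_label are hypothetical.

```python
import json

INSTANCE_OF = "P31"  # assumption: the relation being tested is "instance of"


def parse_dump_line(line: str):
    """Decode one line of the JSON dump; return None for the array brackets."""
    line = line.strip().rstrip(",")  # entity lines end with a trailing comma
    if line in ("[", "]", ""):
        return None
    return json.loads(line)


def is_instance_of(entity: dict, test_entities: set) -> bool:
    """True if the entity has a fact (id, P31, t) for some t in test_entities."""
    for claim in entity.get("claims", {}).get(INSTANCE_OF, []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # "novalue" / "somevalue" snaks carry no datavalue
        if snak["datavalue"]["value"].get("id") in test_entities:
            return True
    return False


def english_label(entity: dict):
    """Return the English label, or None if the entity has none."""
    return entity.get("labels", {}).get("en", {}).get("value")


# Trimmed stand-ins for dump lines (a real dump is read from the .bz2 file).
sample_lines = [
    "[",
    '{"id": "Q42", "labels": {"en": {"language": "en", "value": "Douglas Adams"}}, '
    '"claims": {"P31": [{"mainsnak": {"snaktype": "value", '
    '"datavalue": {"value": {"id": "Q5"}, "type": "wikibase-entityid"}}}]}},',
    '{"id": "Q64", "labels": {"en": {"language": "en", "value": "Berlin"}}, '
    '"claims": {"P31": [{"mainsnak": {"snaktype": "value", '
    '"datavalue": {"value": {"id": "Q515"}, "type": "wikibase-entityid"}}}]}}',
    "]",
]

selected, labels = [], {}
for raw in sample_lines:
    entity = parse_dump_line(raw)
    if entity is None:
        continue
    labels[entity["id"]] = english_label(entity)  # collect_labels behaviour
    if is_instance_of(entity, {"Q5"}):            # test_entities behaviour
        selected.append(entity["id"])
```

With a real dump, the same loop could iterate over a file opened with bz2.open(dump_path, "rt") instead of sample_lines.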
Build dataset
Builds datasets from the pickle files produced by query_wikidata_dump.
wikidatasets.processFunctions.build_dataset(path, labels, return_=False, dump_date='23rd April 2019')
Builds datasets from the pickle files produced by query_wikidata_dump.
Parameters: - path (str) – Path to a directory that already contains a pickles/ directory. All the .pkl files in pickles/ will be concatenated into one dataset.
- labels (dict) – Dictionary collected by the query_wikidata_dump function when collect_labels is set to True.
- return_ (bool) – Boolean indicating whether the built dataset should be returned in addition to being written to disk.
- dump_date (str) – String indicating the date of the Wikidata dump used. It is used in the readme of the dataset.
Returns: - edges (pandas.DataFrame) – DataFrame containing the edges between entities of the graph.
- attributes (pandas.DataFrame) – DataFrame containing edges linking entities to their attributes.
- entities (pandas.DataFrame) – DataFrame containing a list of all entities & attributes with their Wikidata IDs and labels.
- relations (pandas.DataFrame) – DataFrame containing a list of all relations with their Wikidata IDs and labels.
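The concatenation step mentioned for the path parameter can be illustrated with the standard library alone. The sketch below is a simplified stand-in, not build_dataset itself: it assumes each .pkl file holds a list of (head, relation, tail) triples, which the page does not state, and the file names part_0.pkl and part_1.pkl as well as the helper concatenate_pickles are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def concatenate_pickles(path):
    """Load every .pkl file in <path>/pickles/ and chain the contents into one list.

    Assumption: each file contains a list of (head, relation, tail) triples.
    """
    facts = []
    for pkl in sorted(Path(path, "pickles").glob("*.pkl")):
        with pkl.open("rb") as handle:
            facts.extend(pickle.load(handle))
    return facts


# Build a throwaway directory with two small pickle files, then merge them.
with tempfile.TemporaryDirectory() as root:
    pickles_dir = Path(root, "pickles")
    pickles_dir.mkdir()
    batches = {
        "part_0.pkl": [("Q42", "P31", "Q5")],
        "part_1.pkl": [("Q64", "P17", "Q183")],
    }
    for name, batch in batches.items():
        with (pickles_dir / name).open("wb") as handle:
            pickle.dump(batch, handle)
    all_facts = concatenate_pickles(root)
```

The merged list could then be split into the edges, attributes, entities and relations tables described above; that step is omitted here because the page does not specify how it is done.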