Using the Python API

In this tutorial you will learn how to use fairly as a Python package to clone, create and upload datasets to research data repositories.

If you have not done so already, install the fairly package.

Cloning a dataset

The Python API provides the flexibility to explore the metadata of a remote dataset before downloading it. A remote dataset is any dataset which is not stored locally.

  1. In a Python script, import the fairly package and open a remote dataset:

[1]:
import fairly

# Open a remote dataset
dataset = fairly.dataset("doi:10.4121/21588096.v1")

  2. You can now explore the metadata of the dataset as follows:

[2]:
dataset.id
[2]:
{'id': '21588096', 'version': '1'}
[3]:
dataset.url
[3]:
'https://data.4tu.nl/datasets/a37120e2-96db-48e4-bd65-a54b970bc4fe/1'
[5]:
print(dataset.size)

# number of files
len(dataset.files)
33339
[5]:
6
[6]:
# complete metadata
dataset.metadata
[6]:
Metadata({'authors': [Person({'fullname': 'Stefan Nielsen', 'orcid_id': '0000-0002-9214-2932', 'figshare_id': 12882551})], 'keywords': ['Earthquakes', 'artificial neural network', 'precursor'], 'description': '<p>These are the accuracy results for the whole dataset A and B together. This is a second batch (2/2) of cycles where network was trained, tested and verified 50 times with different combinations of test, train and verification groups. There is a first batch of 50 in a separate file</p>', 'license': {'id': 2, 'name': 'CC0', 'url': 'https://creativecommons.org/publicdomain/zero/1.0/'}, 'title': 'Earthquake Precursors detected by convolutional neural network', 'doi': '10.4121/21588096.v1', 'type': 'dataset', 'access_type': 'open', 'custom_fields': {'Time coverage': '2012-2022', 'Publisher': '4TU.ResearchData', 'Organizations': 'University of Durham, Department of Earth Sciences.', 'Geolocation Longitude': '138.204', 'Geolocation Latitude': '36.546', 'Geolocation': 'Japan and surrounding area', 'Format': '*.py, *.csv, *.txt'}, 'categories': [13555], 'online_date': '2022-11-24T07:50:39'})
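The Metadata object shown above behaves like a mapping, so individual fields can be read by key and nested values by chained lookups. As a plain-Python illustration (a dict populated with values copied from the output above, not a live fairly call):

```python
# A plain dict standing in for the Metadata mapping shown above
metadata = {
    "title": "Earthquake Precursors detected by convolutional neural network",
    "doi": "10.4121/21588096.v1",
    "license": {"id": 2, "name": "CC0",
                "url": "https://creativecommons.org/publicdomain/zero/1.0/"},
    "keywords": ["Earthquakes", "artificial neural network", "precursor"],
}

# Read individual fields by key, nested values by chained lookups
print(metadata["title"])
print(metadata["license"]["name"])   # CC0
print(", ".join(metadata["keywords"]))
```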
  3. You can store the dataset, i.e. its metadata and files, in a local directory as follows. The directory will be created if it does not exist.

[7]:
# store dataset locally (i.e. clone dataset)
local_dataset = dataset.store("./cloned-dataset")

Creating a local dataset

A `local dataset` is a dataset that is stored locally. When creating your own dataset, you start with a local dataset.

  1. Initialize a new dataset:

[2]:
import fairly

# Initialize a local dataset
dataset = fairly.init_dataset("./local-dataset") # path is created if it does not exist
  2. Set the dataset’s metadata attributes by passing attribute names and values as keyword arguments:

[9]:
dataset.set_metadata(
    title="My first dataset",
    keywords=[ "fairly", "python", "api" ],
    authors=[ "0000-0002-0516-185X",
             { "name": "Jane", "surname": "Doe" }
             ],
)
[10]:
# Metadata attributes can be passed one by one as follows
dataset.metadata["license"] = "CC-BY-4.0"
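Conceptually, both calls update the same underlying metadata mapping: set_metadata() merges keyword arguments into it, while item assignment sets a single field. A minimal plain-Python sketch of that merging behavior (an illustration of the idea, not fairly's actual implementation):

```python
def set_metadata(metadata, **attributes):
    """Merge keyword arguments into a metadata mapping (sketch only)."""
    metadata.update(attributes)

metadata = {}
set_metadata(metadata, title="My first dataset",
             keywords=["fairly", "python", "api"])
metadata["license"] = "CC-BY-4.0"   # setting a single field directly
print(metadata["title"], metadata["license"])
```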
  3. Add files and folders to the dataset:

[11]:
dataset.includes.extend([
    "README",
    "*.csv",
     "train/*.jpg"
])
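The entries above can be literal names or filename patterns that are expanded against the dataset directory. The pattern syntax is ordinary shell-style globbing, which can be illustrated with Python's fnmatch module (this shows the matching idea only, not fairly's internals, and fairly may treat directory separators differently):

```python
from fnmatch import fnmatch

# Hypothetical files in a dataset directory
files = ["README", "results.csv", "notes.txt",
         "train/cat.jpg", "train/readme.md"]
patterns = ["README", "*.csv", "train/*.jpg"]

# A file is included if it matches any of the inclusion patterns
included = [f for f in files if any(fnmatch(f, p) for p in patterns)]
print(included)  # ['README', 'results.csv', 'train/cat.jpg']
```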
  4. To save the dataset’s attributes to the manifest.yaml file, call the save() method:

[12]:
# Save changes and update manifest.yaml
dataset.save()

Uploading a dataset

To upload a dataset to a research data repository, we must first register an access token for an account in the data repository. Check the tutorial on the JupyterLab extension to learn how to register an access token.

Once you have registered an access token, you can upload a dataset with a single command:

[ ]:
# Upload dataset to data repository
remote_dataset = dataset.upload('zenodo')

Pushing changes to a data repository

After uploading a dataset to a data repository, you can use the push() method to push changes to the dataset’s metadata and files to the data repository. The push() method automatically finds the remote version of the dataset from the information available in the manifest file, and updates the remote metadata if any metadata fields were modified locally.

To be able to push updates to an existing dataset in a repository, you need write access to the dataset. For most repositories this requires you to be the owner of the dataset. Most data repositories also prevent updates once a dataset is “published” (i.e. editing is limited to datasets that are not yet published).

Changing metadata in a dataset

For example, to update the title of a dataset for which you have a local copy, you can do the following:

[4]:
ds = fairly.dataset("./local-dataset")
ds.metadata["title"] = "New title"
ds.save_metadata() # save changes to manifest.yaml

ds.push() # push changes to data repository to update an existing dataset

Changing files in a dataset

You can add, remove, or modify files in a local dataset as you wish. If file inclusion or exclusion rules are defined using patterns (e.g. '*.txt'), fairly automatically identifies added, removed, or modified files. Otherwise, you need to indicate explicitly what should be included or excluded, using the includes.append and excludes.append methods.

[6]:
# include a new file or directory
ds.includes.append("new file.txt")

# remove a file or directory
# exclude a file or directory from the dataset
ds.excludes.append("old file.txt")

ds.save() # save changes to manifest.yaml

Once the changes are saved to the manifest file, the remote version can be updated by calling the push method:

[ ]:
ds.push() # push changes to data repository

To learn more about the Fairly Python API, check the API reference.