fairly.dataset package

Submodules

fairly.dataset.local module

class fairly.dataset.local.LocalDataset(path: str, auto_refresh: bool = True)[source]

Bases: Dataset

_path

Path of the dataset

Type:

str

_manifest_path

Path of the dataset manifest

Type:

str

_includes

File inclusion rules

Type:

set

_excludes

File exclusion rules

Type:

set

_md5s

MD5 checksum cache of the files

Type:

Dict

_yaml

YAML object

Class Attributes:

_regexps (Dict): Regular expression cache of the file rules

property created: datetime

Creation date and time of the dataset

property excludes: Set

Exclusion rules of the dataset files

get_archive_method() str[source]

Returns archiving method to be used for the dataset.

get_archive_name() str[source]

Returns archive name to be used for the dataset.

get_remote_dataset(remote=None) RemoteDataset[source]

property includes: Set

Inclusion rules of the dataset files

property modified: datetime

Last modification date and time of the dataset

property path: str

Path of the dataset

pull(source=None, notify: Callable = None) RemoteDataset[source]

Pulls changes made to metadata and files from the data repository to update the local dataset. The dataset must exist in the data repository.

Parameters:
  • source – Source repository identifier or client. If not specified, the identifier in the manifest is used.

  • notify (Callable) – Notification callback function.

Returns:

Remote dataset

Raises:

ValueError("No source dataset") – If source dataset is not specified.

push(target=None, notify: Callable = None) RemoteDataset[source]

Pushes local changes to metadata and files to the data repository to update a remote dataset. The dataset must exist in the data repository.

Parameters:
  • target – Target repository identifier or client. If not specified, the identifier in the manifest is used.

  • notify (Callable) – Notification callback function.

Returns:

Remote dataset

Raises:

ValueError("No target dataset") – If target dataset is not specified.
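The error handling described for push() can be sketched as follows. This is a minimal illustration, not the library's implementation: StubLocalDataset is a hypothetical stand-in that only imitates the documented fallback to the manifest identifier and the documented ValueError, and it returns a string where the real method returns a RemoteDataset. The helper push_or_none and the identifier "example-repository" are also assumptions made for the example.

```python
class StubLocalDataset:
    """Hypothetical stand-in for LocalDataset so the sketch runs without fairly."""

    def __init__(self, manifest_target=None):
        self._manifest_target = manifest_target

    def push(self, target=None, notify=None):
        # Imitates the documented behaviour: fall back to the identifier
        # in the manifest, and raise ValueError when no target is known.
        target = target or self._manifest_target
        if target is None:
            raise ValueError("No target dataset")
        return f"remote dataset at {target}"


def push_or_none(dataset, target=None):
    """Push local changes; return None when no target dataset is known."""
    try:
        return dataset.push(target)
    except ValueError as err:
        print(f"Push skipped: {err}")
        return None


print(push_or_none(StubLocalDataset()))
print(push_or_none(StubLocalDataset(), "example-repository"))
```

The same pattern should apply to pull(), which is documented to raise ValueError("No source dataset") in the corresponding situation.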

property remote_datasets: Dict

Known remote datasets of the dataset.

reproduce() LocalDataset[source]

Reproduces an actual copy of the dataset.

save() None[source]

Saves metadata and file inclusion/exclusion rules.

save_files(force: bool = False) None[source]

Stores the dataset file list, if it exists.

Parameters:

force (bool) – Set True to enforce save even if existing dataset is modified

Raises:

Warning("Existing dataset is modified") – If dataset is modified.

set_remote_dataset(dataset) None[source]

property size: int

Total size of the dataset in bytes.

synchronize(source, notify: Callable = None) None[source]

property template: str

Metadata template of the dataset

property title: str

Title of the dataset.

upload(repository=None, notify: Callable = None, strategy: str = 'auto', force: bool = False) RemoteDataset[source]

Uploads dataset to the repository.

Available upload strategies:
  • auto: Mirror if folders are supported, otherwise archive folders individually.

  • mirror: Upload files and folders as they are.

  • archive_all: Create a single archive file for all files and folders.

  • archive_folders: Create an individual archive file for each folder.

Parameters:
  • repository – Repository identifier or client. If not specified, template identifier is used.

  • notify (Callable) – Notification callback function.

  • strategy (str) – Folder upload strategy (default = “auto”)

  • force (bool) – Set True to upload dataset even if a remote version exists (default = False)

Returns:

Remote dataset

Raises:
  • ValueError("Invalid repository") – If repository argument is invalid.

  • ValueError("Invalid upload strategy") – If upload strategy is invalid.

  • ValueError("Invalid archiving method") – If archiving method is invalid.

  • ValueError("Invalid archive name") – If archive name is invalid.

  • Warning("Remote dataset exists") – If remote dataset exists.
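The strategy rules listed above can be illustrated with a small, self-contained sketch. It is an assumption about how the choice might be expressed, not the code used by upload(): the function resolve_strategy and its supports_folders flag are hypothetical names, and only the four strategy names and the "Invalid upload strategy" message come from the documentation.

```python
UPLOAD_STRATEGIES = ("auto", "mirror", "archive_all", "archive_folders")


def resolve_strategy(strategy="auto", supports_folders=False):
    """Return the concrete strategy to use for a repository.

    supports_folders is a hypothetical flag describing whether the
    target repository can store folders.
    """
    if strategy not in UPLOAD_STRATEGIES:
        raise ValueError("Invalid upload strategy")
    if strategy == "auto":
        # Mirror if folders are supported, otherwise archive each folder.
        return "mirror" if supports_folders else "archive_folders"
    return strategy


print(resolve_strategy("auto", supports_folders=True))
print(resolve_strategy("auto", supports_folders=False))
print(resolve_strategy("archive_all"))
```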

fairly.dataset.remote module

class fairly.dataset.remote.RemoteDataset(client, id=None, auto_refresh: bool = True, **kwargs)[source]

Bases: Dataset

_client

Client object

Type:

Client

_id

Dataset identifier

Type:

str

_details

Dataset details

Type:

Dict

property client: Client

Client of the dataset.

property created: datetime

Creation date and time of the dataset

property doi: str

DOI of the dataset.

get_versions() List[RemoteDataset][source]

Returns all available versions of the dataset.

Returns:

List of remote datasets of all available versions.

property id: Dict

Identifier of the dataset.

property modified: datetime

Last modification date and time of the dataset

property plain_id: str

Plain identifier of the dataset.

reproduce() RemoteDataset[source]

Reproduces an actual copy of the dataset.

property size: int

Total size of the dataset in bytes.

property status: str

Status of the dataset.

Possible statuses are as follows:
  • “draft”: Dataset is not published yet.

  • “public”: Dataset is published and is publicly available.

  • “embargoed”: Dataset is published, but is under embargo.

  • “restricted”: Dataset is published, but accessible only under certain conditions.

  • “closed”: Dataset is published, but accessible only by the owners.

  • “error”: Dataset is in an error state.

  • “unknown”: Dataset is in an unknown state.
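As a reading aid, the status values can be grouped by what the descriptions above say about publication. The helpers below are hypothetical and not part of fairly; they only encode the listed descriptions, and they treat "error" and "unknown" as not published because the list does not describe them as published.

```python
# Statuses that the list above describes as "published".
PUBLISHED_STATUSES = frozenset({"public", "embargoed", "restricted", "closed"})


def is_published(status):
    """Return True if the status is described as published."""
    return status in PUBLISHED_STATUSES


def is_publicly_available(status):
    """Return True only for the status described as publicly available."""
    return status == "public"


for status in ("draft", "public", "embargoed", "error"):
    print(status, is_published(status), is_publicly_available(status))
```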

store(path: str = None, notify: Callable = None, extract: bool = False, max_workers: int = None) LocalDataset[source]

Stores the dataset to a local directory.

If no path is provided, the DOI is used as the directory name, with slashes and backslashes replaced by underscores. The local directory is created if it does not exist.

Parameters:
  • path (str) – Path to the local directory (optional).

  • notify (Callable) – Notification callback method (optional).

  • extract (bool) – Set True to extract archive files (default False).

  • max_workers (int) – Number of workers (optional).

Returns:

LocalDataset object of the stored local dataset.

Raises:
  • ValueError("Empty path")

  • ValueError("Directory is not empty")
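The default path rule described for store() amounts to a simple string replacement. The sketch below shows that rule in isolation; default_store_path is a hypothetical helper name and the DOI values are made-up examples, so this should not be read as the library's own code.

```python
def default_store_path(doi: str) -> str:
    """Derive a directory name from a DOI by replacing / and \\ with _."""
    return doi.replace("/", "_").replace("\\", "_")


# Made-up DOI, used only to show the replacement.
print(default_store_path("10.1234/example.5678"))  # 10.1234_example.5678
```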

property title: str

Title of the dataset.

property url: str

URL address of the dataset.

Module contents

Dataset class module.

Dataset class is used to represent datasets in a standardized manner. It is an abstract class.

Implementations:

  • LocalDataset

  • RemoteDataset
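The relationship between the abstract class and its implementations can be sketched with Python's abc module. DatasetSketch and InMemoryDataset are hypothetical, heavily simplified classes written for this illustration: they reuse a few documented member names (auto_refresh, title, size, reproduce) but do not reflect how fairly implements them.

```python
from abc import ABC, abstractmethod


class DatasetSketch(ABC):
    """Simplified, hypothetical stand-in for fairly.dataset.Dataset."""

    def __init__(self, auto_refresh=False):
        self._auto_refresh = auto_refresh

    @property
    def auto_refresh(self):
        return self._auto_refresh

    @property
    @abstractmethod
    def title(self):
        """Title of the dataset."""

    @property
    @abstractmethod
    def size(self):
        """Total size of the dataset in bytes."""

    @abstractmethod
    def reproduce(self):
        """Return a copy of the dataset."""


class InMemoryDataset(DatasetSketch):
    """Hypothetical concrete implementation backed by a bytes payload."""

    def __init__(self, title, payload, auto_refresh=False):
        super().__init__(auto_refresh)
        self._title = title
        self._payload = payload

    @property
    def title(self):
        return self._title

    @property
    def size(self):
        return len(self._payload)

    def reproduce(self):
        return InMemoryDataset(self._title, self._payload, self._auto_refresh)


try:
    DatasetSketch()  # abstract members are missing, so this fails
except TypeError as err:
    print(f"Cannot instantiate the abstract class: {err}")

dataset = InMemoryDataset("Example", b"12345")
print(dataset.title, dataset.size)
```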

class fairly.dataset.Dataset(auto_refresh: bool = False)[source]

Bases: ABC

Dataset class.

_metadata

Metadata.

Type:

Metadata

_files

Files list.

Type:

list

_modified

Last known modification date.

Type:

datetime.datetime

_auto_refresh

Auto-refresh flag.

Type:

bool

property auto_refresh: bool

Auto-refresh flag of the dataset.

abstract property created: datetime

Creation date and time of the dataset.

diff_files(dataset: Dataset = None) Diff[source]

diff_metadata(dataset: Dataset = None) Diff[source]

file(val: str) File[source]

Returns specified file of the dataset.

Automatically refreshes file information if dataset is modified.

property files: List[File]

List of files of the dataset.

get_file(val: str, refresh: bool = False) File[source]

Returns specified file of the dataset.

Parameters:
  • val (str) – File identifier.

  • refresh (bool) – Set True to enforce file information retrieval.

Returns:

File object if file is found, None otherwise.

get_files(refresh: bool = False) Dict[str, File][source]

Returns dictionary of files of the dataset.

Parameters:

refresh (bool) – Set True to enforce file list retrieval.

Returns:

Dictionary of files of the dataset. Keys are paths, values are File objects.
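Because the returned dictionary is keyed by path, files can be selected by working on the keys alone. The sketch below assumes nothing about File objects: it uses placeholder values in place of them, and select_paths is a hypothetical helper rather than a fairly function.

```python
from fnmatch import fnmatchcase


def select_paths(files, pattern):
    """Return the sorted paths in a path-keyed mapping that match a glob pattern."""
    return sorted(path for path in files if fnmatchcase(path, pattern))


# Placeholder mapping standing in for the result of dataset.get_files().
files = {
    "data/b.csv": object(),
    "README.md": object(),
    "data/a.csv": object(),
}
print(select_paths(files, "*.csv"))  # ['data/a.csv', 'data/b.csv']
```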

get_metadata(refresh: bool = False) Metadata[source]

Returns metadata of the dataset.

Parameters:

refresh (bool) – Set True to enforce metadata retrieval (default False).

Returns:

Metadata of the dataset.

property is_modified: bool

Checks if the existing dataset is modified.

Returns:

True if the existing dataset is modified, False otherwise.

property metadata: Metadata

Metadata of the dataset.

Refreshes metadata automatically if the metadata object is not modified by the user, the auto-refresh flag is set, and the metadata is modified externally.
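The three conditions above must all hold for a refresh to happen. As a minimal sketch, they can be written as a single predicate; should_refresh and its parameter names are hypothetical and only restate the documented rule.

```python
def should_refresh(user_modified, auto_refresh, externally_modified):
    """Return True only when all three documented conditions hold."""
    return (not user_modified) and auto_refresh and externally_modified


# Local edits by the user block the refresh even if the source changed.
print(should_refresh(user_modified=True, auto_refresh=True, externally_modified=True))
# No local edits, flag set, source changed: refresh.
print(should_refresh(user_modified=False, auto_refresh=True, externally_modified=True))
```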

abstract property modified: datetime

Last modification date and time of the dataset.

abstract reproduce() Dataset[source]

Reproduces an actual copy of the dataset.

save_metadata(force: bool = False) None[source]

Stores the dataset metadata, if it exists.

Parameters:

force (bool) – Set True to enforce save even if existing dataset is modified (default False).

Raises:

Warning("Existing dataset is modified") – If dataset is modified.
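Since Warning is raised here as an exception, a caller can catch it and decide whether to retry with force=True. The sketch below is one possible pattern under that assumption: StubDataset is a hypothetical stand-in that imitates only the documented raise, and save_metadata_safely is an illustrative helper, not part of fairly.

```python
class StubDataset:
    """Hypothetical stand-in that imitates the documented Warning."""

    def __init__(self, modified):
        self.is_modified = modified
        self.saved = False

    def save_metadata(self, force=False):
        if self.is_modified and not force:
            raise Warning("Existing dataset is modified")
        self.saved = True


def save_metadata_safely(dataset, overwrite=False):
    """Save metadata; force the save only when the caller allows it."""
    try:
        dataset.save_metadata()
        return "saved"
    except Warning as warning:
        if not overwrite:
            print(f"Not saved: {warning}")
            return "skipped"
        dataset.save_metadata(force=True)
        return "forced"


print(save_metadata_safely(StubDataset(modified=False)))
print(save_metadata_safely(StubDataset(modified=True)))
print(save_metadata_safely(StubDataset(modified=True), overwrite=True))
```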

set_metadata(**kwargs) None[source]

Sets metadata attributes.

Parameters:

**kwargs – Metadata attributes.

abstract property size: int

Total size of the dataset in bytes.

abstract property title: str

Title of the dataset.