Navigating a SPARC Dataset

What is the structure of a SPARC dataset?
Updated at: 03/02/2020

General Organization

SPARC hosts a growing collection of diverse datasets, spanning anatomy, physiology, and modeling and simulation. Each dataset comprises multiple files, sometimes thousands of them. To make it easier to understand and work with SPARC data, we’ve implemented a consistent file structure and naming convention, based on the Brain Imaging Data Structure (BIDS), so that all data in SPARC are organized in a similar manner. Understanding this file structure will make it easier to browse and use SPARC data.

For each SPARC dataset, you can:

  1. View details: When you first access a SPARC dataset, you are taken to a landing page with basic information about the dataset: title, subtitle, authors, descriptive image, license, and dataset size. Additional details can be found by clicking the Details tab: a more detailed description, related publications, a link to an experimental protocol, how to cite the dataset, the types of files included, information about the experimental design, and the completeness of the dataset, i.e., whether it is a slice of a larger dataset.
  2. Download data: From the landing page, you can download the entire dataset using the Get Dataset button. Datasets smaller than 5 GB can be downloaded free of charge. Datasets larger than 5 GB require an Amazon S3 account and may incur charges.
  3. Browse data files: You can also browse available data files and metadata through the portal and download individual data files. To access the data file structure, click on the Files tab.

SPARC Dataset Structure

Under the Files tab, you will see a listing of files, e.g., Readme, and folders. Data, code, and supporting materials are organized into folders. Accompanying these data folders are a series of documents and spreadsheets that contain the critical metadata needed to understand what is in these folders.

Metadata Files

The following metadata files are available:

  • Dataset_description: Spreadsheet listing the basic details about the dataset that are shown in the Details tab described above.
  • Subjects: Spreadsheet listing subjects by their identifiers along with key details about subject characteristics, e.g., age, weight, experimental groups.
  • Samples (if necessary): Spreadsheet listing specimens used in this study by their identifiers along with key details.
  • Readme: Investigators are encouraged to include a ReadMe file with important details necessary to understand the dataset.

File Folders:

Data, code and additional documents are organized into folders. The only required folder is the primary data folder: the folder where the main data products are found. These data products may be images, spreadsheets, physiological traces, etc. More than one type may be available. Note that these files may not be what is considered “raw” data, that is, data that comes right off the instrument, but may be minimally processed, e.g., a stitched mosaic. In that case, the raw data may sometimes be found in a folder labeled source (not shown).

Datasets may have additional folders, e.g., derivatives, which contains any data products derived from the original data. Such derivatives may include measurements from images, 3D reconstructions from serial sections, or file conversions. The docs folder contains supplementary materials that will help you understand the dataset, e.g., figures or diagrams. Some datasets contain code, which will be found in the code folder.

Note: Not all datasets have all of the above folders; which ones are present depends on the nature of the dataset.
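
Once a dataset has been downloaded and unpacked, checking which of these folders are present can be sketched in a few lines of Python. This is a minimal illustration, not an official SPARC tool: the function name `list_dataset_folders` is hypothetical, and the sketch assumes the folders appear as lowercase directories (primary, source, derivatives, docs, code) at the top level of the dataset, as they are named in the text above.

```python
from pathlib import Path

# Folder names taken from the description above; the exact on-disk
# spelling and location are assumptions of this sketch.
KNOWN_FOLDERS = ("primary", "source", "derivatives", "docs", "code")


def list_dataset_folders(dataset_root):
    """Return a dict mapping each known folder name to True if it exists."""
    root = Path(dataset_root)
    return {name: (root / name).is_dir() for name in KNOWN_FOLDERS}
```

Calling `list_dataset_folders("path/to/dataset")` returns one entry per folder name, so a dataset that contains only primary and docs would report the other three as absent.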

Primary Data Folder

Inside the primary data folder, data files are organized according to subjects and/or samples. Each subject/sample has its own folder that is named according to the identifiers found in the subjects/specimen spreadsheets referenced above. If data were derived directly from the subject, e.g., recordings from brainstem neurons in vivo, the data files will be accessible after clicking on the subject. If the data were derived from specimens, e.g., images of slices, then the data files will be accessible within the specimen folders. Note that some investigators will further organize their data within the subjects/specimen folder, e.g., according to data type.

Within each folder, you may see a spreadsheet labeled manifest. The manifest contains a list of files within the folder with some helpful description. Data submitters are also encouraged to include a ReadMe file when necessary to provide additional details about the data.
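
In a downloaded copy of a dataset, these descriptive files can be located with a short script. The sketch below is only an illustration: the function name `find_descriptive_files` is hypothetical, and it assumes the files are named manifest or ReadMe (in any letter case, with any extension), which may not hold for every dataset.

```python
from pathlib import Path


def find_descriptive_files(folder):
    """Return manifest and ReadMe files found anywhere under `folder`."""
    hits = []
    for path in Path(folder).rglob("*"):
        # Match on the lowercase name without its extension,
        # e.g. "manifest.xlsx" -> "manifest", "README.txt" -> "readme".
        if path.is_file() and path.stem.lower() in {"manifest", "readme"}:
            hits.append(path)
    return sorted(hits)
```

Running it on the primary folder lists every manifest and ReadMe file together with the subject or specimen folder it sits in, which shows where descriptions are available before any data files are opened.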

Working with a SPARC dataset

As the SPARC portal evolves, we will make it easier to search for, understand and browse individual SPARC dataset files. Right now, it can be somewhat confusing. The key is to use the metadata files to understand the data folders. Currently, that involves manually opening the files while you browse the data folders.

Step by Step

  1. Open the primary data folder in the main directory.

  2. Open the Subjects.xlsx file from the main directory.

  3. The subfolders in the primary folder should be labeled according to the Subject IDs in the Subjects.xlsx file.

  4. Open a folder for an individual subject.
    a. If data were derived directly from a subject, you will see data files. (Note: When multiple types of data are acquired from the same subject, they may be organized into subfolders.)

    b. If data were derived from specimens extracted from that subject (not shown above), you will see another folder labeled with the specimen name. Details about the specimen IDs can be found in the Specimens.xlsx file in the main directory.

  5. Within the subject's folder, look for a manifest or ReadMe file to find details about the contents of the directory.
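
For a dataset that has been downloaded, steps 3 and 4 can be approximated with a small Python sketch. It is an illustration under stated assumptions rather than a SPARC-provided tool: the function names are hypothetical, and the Subject IDs are assumed to have already been read from Subjects.xlsx (with whatever spreadsheet tool you prefer) into a list of strings.

```python
from pathlib import Path


def compare_subject_folders(primary_dir, subject_ids):
    """Compare subfolders of `primary_dir` with the expected Subject IDs.

    Returns (matched, missing_folders, unlisted_folders) as sorted lists.
    """
    folders = {p.name for p in Path(primary_dir).iterdir() if p.is_dir()}
    expected = set(subject_ids)
    return (
        sorted(folders & expected),  # IDs that have a folder
        sorted(expected - folders),  # IDs without a folder
        sorted(folders - expected),  # folders with no matching ID
    )


def describe_subject_folder(subject_dir):
    """Split a subject folder's contents into file names and subfolder names."""
    entries = list(Path(subject_dir).iterdir())
    files = sorted(p.name for p in entries if p.is_file())
    subfolders = sorted(p.name for p in entries if p.is_dir())
    return files, subfolders
```

Files returned by `describe_subject_folder` correspond to case 4a (data taken directly from the subject), while subfolders may be specimen folders (case 4b) or groupings by data type; the sketch cannot tell these apart, so the specimen spreadsheet is still needed to interpret them.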

