
DVID (Distributed, Versioned, Image-oriented Dataservice) was designed as a high-level data service that can be easily installed and managed locally, yet still scale to the available storage system and number of compute nodes. If data transmission or computer memory is an issue, it lets us choose a local first-class datastore that is eventually (or continuously) synced with remote datastores. By "first class", we mean that each DVID server, even on a laptop, behaves identically to larger institutional DVID servers save for resource limitations like the size of the data that can be managed. Our vision is something like a GitHub for image-oriented data, although there are a number of differences due to the size and typing of the data as well as the approach to transferring versioned data between DVID servers. We hope to leverage the significant experience the community has built in crafting workflows and management tools for distributed, versioned operations.

DVID has been designed foremost as a malleable system with exchangeable components (a rough code sketch of both layers follows the list):

  • Data type packages -- allow easy addition of new data types and access patterns.
  • Storage engines -- allow tradeoffs among access speed, data size, and concurrency.
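
To make the split concrete, here is a minimal Go sketch (DVID itself is written in Go) of what these two exchangeable layers might look like. The interface and method names, and the Request/Response placeholders, are illustrative assumptions, not DVID's actual API:

```go
package dvid

// Request and Response are placeholders for whatever client protocol
// the server speaks (e.g., HTTP). Both are hypothetical.
type Request struct {
	Path string
	Body []byte
}
type Response struct {
	Body []byte
}

// KeyValueStore is a hypothetical storage-engine interface. Any backend
// that can get and put ordered key-value pairs (an embedded LSM-tree
// store, a distributed database, etc.) could satisfy it, each making
// different tradeoffs in access speed, data size, and concurrency.
type KeyValueStore interface {
	Get(key []byte) (value []byte, err error)
	Put(key, value []byte) error
	Delete(key []byte) error
	// RangeQuery supports chunked data types that read runs of
	// adjacent keys, e.g., spatially indexed blocks.
	RangeQuery(begin, end []byte) (values [][]byte, err error)
}

// DataType is a hypothetical data-type package interface. A new data
// type (image volumes, label maps, graphs, ...) only has to decide how
// its data maps onto key-value pairs and what client API it exposes.
type DataType interface {
	Name() string
	// ServeRequest translates a client request into reads and writes
	// against whatever KeyValueStore the server was configured with.
	ServeRequest(req Request, store KeyValueStore) (Response, error)
}
```

The point of the layering is that a data type decides only how its data maps onto key-value pairs, while any backend satisfying the storage interface can be swapped in underneath it.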

DVID promotes the view of data as a collection of key-value pairs where each key is composed of global identifiers for versioning and data identification as well as a datatype-specific index (e.g., a spatial index) that allows large data to be broken into chunks. DVID focuses on how to break data into these key-value pairs in a way that optimizes data access for various clients.
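
As a concrete illustration, a key for one chunk of a versioned image volume might be assembled like this; the field order and widths below are assumptions for the sketch, not DVID's actual encoding:

```go
package dvid

import "encoding/binary"

// ChunkKey assembles a storage key from the global identifiers plus a
// datatype-specific index. Big-endian encoding keeps numerically
// adjacent chunks lexicographically adjacent, so a range query over
// the key-value store can fetch a run of chunks in one pass.
// The field order and widths are illustrative assumptions.
func ChunkKey(dataID, versionID uint32, spatialIndex uint64) []byte {
	key := make([]byte, 16)
	binary.BigEndian.PutUint32(key[0:4], dataID)        // which data instance
	binary.BigEndian.PutUint32(key[4:8], versionID)     // which version node
	binary.BigEndian.PutUint64(key[8:16], spatialIndex) // e.g., block index
	return key
}
```

A spatial index such as a Z-order (Morton) code over block coordinates would keep chunks that are near each other in 3D close together in key order, so a range query over the underlying store can fetch a whole neighborhood efficiently.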

A DVID server is limited to local resources, and the user determines which repos, data extents, and versions are held on that server. Overwrites are allowed, but once a version is locked, no further edits are allowed on that version. This lets manual or automated editing proceed during an open period without accumulating unnecessary deltas.
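
The locking rule can be pinned down in a few lines; the VersionNode type below is hypothetical, meant only to capture the invariant:

```go
package dvid

import "errors"

// ErrLocked signals that edits must go to a newly spawned child version.
var ErrLocked = errors.New("version is locked; commit edits to a child version")

// VersionNode is a hypothetical node in a repo's version DAG.
type VersionNode struct {
	Locked bool
	data   map[string][]byte
}

// Put overwrites a key in place while the version is open, so interim
// edits do not accumulate as deltas. Once the node is locked, it is
// immutable and further edits must target a child version.
func (v *VersionNode) Put(key string, value []byte) error {
	if v.Locked {
		return ErrLocked
	}
	if v.data == nil {
		v.data = make(map[string][]byte)
	}
	v.data[key] = value
	return nil
}
```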

Why is distributed versioning central to DVID instead of a centralized approach?

  • Significant processing can occur in small subsets of the data or on alternative, compact representations: FlyEM mostly works in portions of the data after using the full data to establish context. We can't even see individual cells if we zoom out to the scale of our volumes. And if we want to work on a neuron, it's a sparse volume that can be relatively compact, and proofreading occurs within that sparse volume and its neighboring structures. Frequently, we can also transform voxel-level data into more compact data structures like region adjacency graphs.
  • Research groups may want to silo data but eventually share and sync it: It's not clear researchers want just one centralized datastore; they might require one per institution or group. Researchers don't always want to share data. As soon as you support more than one centralized location and think about syncing, you are basically looking at a distributed data problem, and you'll either adopt elegant git-like techniques or fall back on some ad hoc solution. And sometimes researchers want to share only a particular version of their repo, e.g., the state of the repo at the time of a publication that requires open access to the data, while continuing to work on the repo privately.
  • As computers increase in power, forcing centralization wastes significant resources: Since significant workflows require only relatively small subsets of the data, we can move data to workstations and laptops and use the graphics and computation resources of those systems. Allowing distributed data persistence also lets us explore other funding mechanisms rather than having deep-pocketed institutions foot the bill for all storage and computation.
  • Large, multi-tenancy datastores can be difficult to optimize for particular use cases and to guarantee throughput/latency: Shared resources can be exhausted when many users hit them while working toward seasonal deadlines. Aside from this timing issue, certain applications require tight bounds on availability, e.g., image acquisition. Since data access optimization via caching and other techniques is very specific to an application, datastore systems should (1) be relatively simple, so systems exclusive to an application can be created, and (2) have well-defined interfaces to both the storage engine and the client, so application-specific optimizations can be made. A research group can buy servers and deploy a relatively simple system dedicated to a particular use case, or run applications in lockstep so optimizations are easier to make, e.g., formatting data to suit a particular access pattern. After a particular use case like image acquisition is addressed, some or all of the data can be synced to another DVID server that may be optimized for different uses, like parallel proofreading operations in disjoint but small subvolumes.