2013-10-30 OMERO.features Google Hangout (14:00 GMT / 10:00 EDT)
Chris Coletta, Ivan Cao-Berg, Josh Moore, Lee Kamentsky, Simon Li
S: Aim of these meetings is to come up with a plan for OMERO.features before we meet up next year. Consider feature storage, retrieval, transfer between systems/applications, and whether we can use ROIs.
L: Need the name of each feature and its value, and should support multiple features per site. Should be linked to one or more of: imaging site, image, segmentation, ROI, analysis protocol, parameters. Square array containing rows of features.
C: No standard for storing features, potential for us to create an industry standard? Can we start with something like Mahotas which has names, Numpy arrays, and stores things such as a regression ground truth?
I: In Mahotas FeatureSet name is linked to a 2D or 3D image. How are we planning to handle transformations of features, such as PCA? Need to support labs who do not want to recalculate their features, i.e. allow direct import without recalculation. Do we want to ensure full reproducibility, down to specifying what hardware was used for the calculations, e.g. by including the machine ID in the metadata?
L: Work towards full reproducibility in the long term.
S: Two discussions here: calculate/store, and transfer between systems. CRS group has done some work with a Neo4j graph DB in OMERO.biobank.
L: Start by mapping out a model of how a feature is created, then decide which part of the model we should support.
C: Can everything be done in a 2D matrix, by adding rows or columns to encode each sample?
J: Use ROI spec to define the origin of some features, and support sparse features? Any feature can be linked to an ROI. Do we support linking at a higher level in OMERO such as project/dataset?
I: Sparse feature-sets rather than sparse features. Within a single feature-set there should be a fixed number of floats.
L: Linking ROIs and supporting sparse feature-sets should work.
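The fixed-width, sparse-at-the-feature-set-level idea could be sketched roughly as follows. This is only an illustration under assumptions: the `FeatureSet` class, its methods, and the example feature names are hypothetical and not part of OMERO or Mahotas.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FeatureSet:
    """Hypothetical feature-set: every stored row has len(feature_names) floats."""
    name: str
    feature_names: List[str]
    rows: Dict[int, List[float]] = field(default_factory=dict)  # keyed by ROI ID

    def add(self, roi_id: int, values: List[float]) -> None:
        # Fixed width within a feature-set: reject partial rows.
        if len(values) != len(self.feature_names):
            raise ValueError(
                f"{self.name} expects {len(self.feature_names)} values, "
                f"got {len(values)}"
            )
        self.rows[roi_id] = [float(v) for v in values]

    def get(self, roi_id: int) -> Optional[List[float]]:
        # Sparse at the feature-set level: an ROI may have no row at all.
        return self.rows.get(roi_id)


# Illustrative feature names and ROI IDs only.
texture = FeatureSet("example-texture", ["contrast", "entropy"])
texture.add(101, [0.42, 1.7])
```

Here an ROI either has the complete feature-set or no row for it; a row with only some of the floats is rejected.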
J: Standardise at level of feature-set, e.g. Broad feature-set, WND-CHARM Small feature-set, CMU SLF. In future have a standard list of feature-sets. OMERO should present the list to users along with metadata.
L: Make row ID abstract, and point to another table to allow for expansion instead of forcing it to be a ROI ID. Some features are calculated from multiple images, we need to record this. Perhaps just have a rectangle of floats which is agnostic to features and keys.
J: Could have multiple IDs for a row, but how do you encode the relationship between them? By convention based on the feature-set name?
I: In practice most people never query individual features, they retrieve them in bulk.
J: Store in postgres? Requires permissions to allow creation of tables on the fly.
I: What’s the retrieval speed in Postgres? Could you do the calculation within Postgres?
L: Build classifier within Postgres?
J: Could translate similarity algorithm into SQL. But should we stick to just bulk retrieval instead?
I: Bulk retrieval only.
C: Agreed, but also want a pointer to the original data matrix.
J: Pre-generate a HDF5 table for retrieval? Numpy matrix? Something modelled on SciDB in R (very large R matrix backed by remote DB).
S: Comes down to wanting one query to request a list of feature-sets for a specified list of ROIs, HDF5/numpy/etc matrix is generated and returned.
C: Covers most of what we need.
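A minimal sketch of the single bulk query S describes, assuming an in-memory store in place of the real backend. The `STORE` layout, the `fetch_features` name, and the choice of NaN for missing feature-sets are all assumptions made for illustration; the returned list of lists could be turned into a Numpy array or written to HDF5 by the caller.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical store: feature-set name -> (column names, {ROI ID: values}).
STORE: Dict[str, Tuple[List[str], Dict[int, List[float]]]] = {
    "shape": (["area", "perimeter"], {1: [10.0, 12.5], 2: [20.0, 18.0]}),
    "intensity": (["mean"], {1: [0.5], 3: [0.9]}),
}


def fetch_features(feature_sets: Sequence[str], roi_ids: Sequence[int]):
    """Return (column names, matrix) with one row per requested ROI.

    ROIs lacking a requested feature-set get NaN in those columns
    (an assumed convention, not something agreed in the meeting).
    """
    columns: List[str] = []
    matrix: List[List[float]] = [[] for _ in roi_ids]
    for fs_name in feature_sets:
        names, rows = STORE[fs_name]
        columns.extend(f"{fs_name}.{n}" for n in names)
        missing = [float("nan")] * len(names)
        for out_row, roi_id in zip(matrix, roi_ids):
            out_row.extend(rows.get(roi_id, missing))
    return columns, matrix


columns, matrix = fetch_features(["shape", "intensity"], [1, 2])
```

One request names the feature-sets and the ROIs, and one rectangular matrix comes back, which matches the bulk-retrieval-only position above.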
J: Must be compatible across languages.
L: Probably store own IDs for sites, load up site metadata, and join the requested feature table to the site IDs/metadata outside OMERO. User selects by ID outside OMERO.
S: No need to support both ROIs and image IDs; a whole-field feature is just a ROI covering the whole image.
J: OMERO provides bookkeeping, querying, cross-language support through a common remote API.
L: Should columns be strongly typed, e.g. a ROI instead of just a number?
J: Would a UUID creation service be useful for a site/ROI/etc combination?
L: Probably not, put it all in table: [site1, site2, ROI1, ROI2, ... features ...]
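The flat row layout, combined with the earlier question about strongly typed columns, might look something like this. The `Column` type, the `kind` tags, and the column names are hypothetical and serve only to illustrate the idea.

```python
from typing import List, NamedTuple, Tuple


class Column(NamedTuple):
    name: str
    kind: str  # "site", "roi" or "feature" (illustrative type tags)


# Two sites and two ROIs identify a row; the remaining columns are features.
SCHEMA: List[Column] = [
    Column("site1", "site"),
    Column("site2", "site"),
    Column("roi1", "roi"),
    Column("roi2", "roi"),
    Column("correlation", "feature"),
    Column("overlap", "feature"),
]


def split_row(row: list) -> Tuple[dict, List[float]]:
    """Split a flat row into its typed ID columns and its feature values."""
    if len(row) != len(SCHEMA):
        raise ValueError("row width does not match schema")
    ids = {c.name: int(v) for c, v in zip(SCHEMA, row) if c.kind != "feature"}
    features = [float(v) for c, v in zip(SCHEMA, row) if c.kind == "feature"]
    return ids, features


ids, features = split_row([7, 8, 501, 502, 0.83, 0.40])
```

With typed columns the ID combination is recoverable from the row itself, so no separate UUID service is needed in this sketch.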
C: How does additional metadata fit in, such as protocols, parameters?
L: Could have a feature-set that is just metadata, containing IDs of protocols etc. Join this metadata-feature-set to the actual feature-sets.
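As a rough illustration of joining a metadata-only feature-set to an actual feature-set, the sketch below uses Python's built-in `sqlite3` module as a stand-in for whatever backend is eventually chosen. Table names, column names, and ID values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Actual feature-set: one row per ROI (column names are illustrative).
    CREATE TABLE shape_features (
        roi_id INTEGER PRIMARY KEY, area REAL, perimeter REAL);
    -- Metadata feature-set: same row key, values are IDs of protocols etc.
    CREATE TABLE shape_metadata (
        roi_id INTEGER PRIMARY KEY, protocol_id INTEGER, params_id INTEGER);
    """
)
conn.executemany(
    "INSERT INTO shape_features VALUES (?, ?, ?)",
    [(1, 10.0, 12.5), (2, 20.0, 18.0)],
)
conn.executemany(
    "INSERT INTO shape_metadata VALUES (?, ?, ?)",
    [(1, 900, 31), (2, 901, 31)],
)

# Join on the shared row key to select features produced by one protocol.
rows = conn.execute(
    """
    SELECT f.roi_id, f.area, f.perimeter
    FROM shape_features AS f
    JOIN shape_metadata AS m ON m.roi_id = f.roi_id
    WHERE m.protocol_id = ?
    """,
    (900,),
).fetchall()
```

The feature table stays a plain rectangle of floats, and protocol or parameter information lives in a parallel table that shares the row key.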
J: Features could be based on anything in the OMERO model, or combinations of them.
C: Associate tags/attributes against each feature row?
L: Comes down to database normalisation.
J: Effectively a (possibly incomplete) key-value map attached to each feature-set row. Query based on key-values, OMERO takes care of the query logistics and data formatting.
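A small sketch of querying rows by a possibly incomplete key-value map. The row structure, the `select_rows` helper, and the rule that a missing key means no match are assumptions for illustration, not an agreed design.

```python
from typing import Dict, List, Tuple

# Each row: (feature values, key-value annotations). Keys may be missing.
Row = Tuple[List[float], Dict[str, str]]

ROWS: List[Row] = [
    ([0.1, 0.2], {"channel": "DAPI", "protocol": "p1"}),
    ([0.3, 0.4], {"channel": "GFP"}),  # no "protocol" key
    ([0.5, 0.6], {"channel": "DAPI", "protocol": "p2"}),
]


def select_rows(rows: List[Row], **criteria: str) -> List[List[float]]:
    """Return feature values of rows whose annotations match every criterion.

    A row that lacks a requested key does not match (one possible convention).
    """
    return [
        values
        for values, annotations in rows
        if all(annotations.get(k) == v for k, v in criteria.items())
    ]


dapi = select_rows(ROWS, channel="DAPI")
dapi_p1 = select_rows(ROWS, channel="DAPI", protocol="p1")
```

The caller only states the key-values of interest; how the rows are located and formatted is left to the server side.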
L: Could attach protocols at the level of the whole table instead. Include segmentations, parameters, etc.
I: Retrieving features should be sufficient. OMERO should prioritise efficient storage and retrieval. Also would be nice if OMERO could export OME-TIFFs including the features, so they could be imported into another server.
L: Next meeting same time next week (Wednesday 10:00 EDT)