2016-02-09 Tuesday Team Meeting
Dundee: Dominik, Mark C, Kenny, Gus, Simone, Simon, Petr, June, Helen, Balaji, Roger, Will, Ola, Chris A (14:36 UK), Josh B
Remote: Sebastien B, Ian, Kelli, Colin, Harald, Stick, Ilya, David, Melissa, Eleanor, Josh M, Wilma, Chris C, Josh B, Emil (14:33 UK), Rebecca, Jason
Agenda - 2:30pm Start
Accepting minutes from last meeting
- accepted
Project Timelines (2-3 minutes each)
Spaces - 14:30 UK
- Mainline (J-M)
- Web download: PR opened
- Ola: testing tomorrow
- Polyline/Polygon (insight): not compatible with model, PR opened
- Units support in Web viewer: Will is looking at it.
- Ice 3.6: J-M to ask on the ZeroC forum if/when 3.6.2 will be released
- Blog post about Windows support: draft under review
- Balaji: in pretty good shape for the 5.1.8 release; e-mail discussion about documentation changes re. MicroManager
- Josh: will need to make a decision ASAP
- Model (Sebastien)
- Mark: has now made graph operations Folder-aware and opened a related design issue
- Mark: will now be making further changes to Shape properties
- David: current PR finally green, so now moving on to nested Folder tags
- Sebastien: measurement tool now working with folders; client work moving on to nested folders; still on schedule for M1 and the demo (Feb 26)
- Metadata (Josh)
- currently looking into activating web in devspace
- feature calculation bits later
- Eleanor: finished one more dataset, now looking at the next
Other releases/upgrades - 14:38 UK
- Figure
- Josh: might unit changes require figure changes?
- Will: units are already supported reasonably well; Figure just displays units as saved rather than converting, so probably okay (see the sketch below)
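A minimal sketch of the conversion in question, assuming the OMERO 5.1 units model as exposed in Python; the values are hypothetical:

    from omero.model import LengthI
    from omero.model.enums import UnitsLength

    # a pixel size as it might be saved on the server, e.g. 0.25 micrometer
    stored = LengthI(0.25, UnitsLength.MICROMETER)

    # the LengthI copy-constructor converts to the requested unit, which is
    # what Figure would need to do instead of displaying units as saved
    in_nm = LengthI(stored, UnitsLength.NANOMETER)
    print("%s %s" % (in_nm.getValue(), in_nm.getSymbol()))  # 250.0 nm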
- FLIMfit/OPT (Ian)
- FLIMfit - "latest" builds working and tested.
- Localisation - final touches to the updated UI and its presentation.
- ImageJ
- Learning
- Sysadmin
- Kenny: ironed out some kickstart PXE bugs
- Jason: thanks for help with stats.
Glencoe Update (Chris) - 14:41 UK
- getting things back into open source, including fixes to physical pixel sizes
- will need an announcement of what may have been affected and how badly (scalebars, etc.) since 5.1.0
- images with micron pixel sizes should have been fine, but EM tends to be in nm
- Sebastien: may be able to use configurations for test data to determine which readers may be affected
- PRs coming re. issues with OMERO reader, CellProfiler, etc. on large data with IO via OMERO
- finishing up populate metadata issues
AOB (5 mins max - technical discussions should be highlighted to relevant people and rescheduled) - 14:46 UK
- Ilya and Chris C introduction
- OME for machine learning, extracting features from biological images
"Distributed Feature Calculation with Pydoop" presentation by Simone - 14:49 UK
- Ilya: given the performance issues Bio-Formats readers have with non-streaming random access to image files on a distributed filesystem, could it be better to use OMERO as an image server for the cluster nodes running the map operation (feature extraction)? (see the sketch below)
- or perhaps a non-Hadoop distributed computing framework would be better
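A minimal sketch of such a map task using OMERO.py's BlitzGateway; the host, credentials and Image ID are hypothetical:

    from omero.gateway import BlitzGateway

    conn = BlitzGateway("user", "password", host="omero.example.org", port=4064)
    conn.connect()
    try:
        image = conn.getObject("Image", 123)  # hypothetical Image ID
        pixels = image.getPrimaryPixels()
        for z in range(image.getSizeZ()):
            # getPlane returns a 2D numpy array: the unit of work a node
            # would feed into feature extraction
            plane = pixels.getPlane(theZ=z, theC=0, theT=0)
    finally:
        conn.close()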
- Josh B: Hadoop is designed for text-based streaming; something like Spark may be a better fit (sketched below)
- Simone: depends in part on available memory per node
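A sketch of the same fan-out expressed in PySpark, as one possible better fit; extract_features is a stand-in for the WND-CHARM call and the task list is hypothetical:

    from pyspark import SparkContext

    def extract_features(task):
        image_id, z, c, t = task
        # placeholder: fetch the plane (e.g. from OMERO as sketched above)
        # and run feature extraction on it
        return (task, "features")

    sc = SparkContext(appName="wndcharm-features")
    tasks = [(123, z, 0, 0) for z in range(50)]  # hypothetical work list
    results = sc.parallelize(tasks, len(tasks)).map(extract_features).collect()
    sc.stop()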
- Chris A: why did performance drop much further from ideal when splitting by plane instead of by series?
- Simone: many more map processes so correspondingly more overhead
- Chris A: but why so much more overhead? For jobs taking many minutes, things like JVM startup should be negligible
- Simone: one issue is “stragglers” where some of the jobs take rather longer than others
- Simone: the issue is that HDFS makes seeks expensive
- Chris A: is unconvinced that explains the difference, especially given the durations and the trend: IO is a small fraction of this overhead, so look at the framework for the culprit
- Simone: suspects the issue is more one of using so many cores for so little data
- Chris A: though current datasets are mostly z=1, t=1 in size
- Josh M: some do have plenty of planes though
- Ilya: with many more cores and further fragmenting of the data, would the overhead be much worse again?
- Josh: probably yes, guessing that much of the overhead is Bio-Formats setId when the reader initializes
- Chris A: is still struck by how quickly large HCS planes can be read in other scenarios, so is doubtful that IO is the problem
- Ilya: better to gather more performance data points before drawing conclusions, especially with still more cores; concerned about the trend
- Josh B: expects that more nodes will mean more overhead
- Simone: also, the experiments were on a relatively small dataset given how many cores were applied
- Simon: the problem parallelizes well by planes (Josh M: or tiles), so focus on how to get the planes out
- Ilya: it also matters how large a plane it makes sense to calculate features on
- Simone: these are 384x384-pixel planes
- Simon: pull planes from OMERO server?
- Chris A: too many Bio-Formats instances; would need Java processes on the nodes rather than on the server
- can initialize Bio-Formats locally (see the sketch below)
- Mark: series is a property of Image
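A sketch of initializing Bio-Formats locally on a node, assuming the python-bioformats/javabridge bindings; the series comes with the task, as Mark notes, and the file path is hypothetical:

    import javabridge
    import bioformats

    javabridge.start_vm(class_path=bioformats.JARS)
    try:
        with bioformats.ImageReader("/gpfs/data/plate.ome.tiff") as reader:
            # series is a property of the Image, so select it per task
            plane = reader.read(series=0, z=0, c=0, t=0, rescale=False)
    finally:
        javabridge.kill_vm()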
- Josh M: what infrastructure are we building for testing WND-CHARM across the whole of IDR?
- Simone: just presenting past Hadoop work; not sure which approach is best, but we should definitely use some framework rather than something manual
- Chris A: consider a classical grid approach (rather more lightweight): a “poor man’s parallelization framework” with a scheduler and writing out planes as separate files (see the sketch below)
- Ilya: now that the Hadoop-based solution is implemented, it’s worth trying it on a real problem
- Simon: still some setup to do to achieve that
- Ilya: it all rides on having a good distributed filesystem
- Chris A: suggest taking what has already been done, except move to GPFS and local Bio-Formats on nodes
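A minimal sketch of that lightweight approach, assuming planes have already been read (locally, as above) and a scheduler such as SGE or GNU parallel maps one job per output file; paths are hypothetical:

    import os
    import numpy as np

    def dump_planes(image_id, planes, out_dir="/gpfs/scratch/planes"):
        # write each plane out as its own file; one scheduler job per path
        if not os.path.isdir(out_dir):
            os.makedirs(out_dir)
        paths = []
        for (z, c, t), plane in planes:
            name = "img%d_z%d_c%d_t%d.npy" % (image_id, z, c, t)
            path = os.path.join(out_dir, name)
            np.save(path, plane)
            paths.append(path)
        return paths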
- Chris C: WND-CHARM currently reads one pixel plane per TIFF. Would it help for WND-CHARM to integrate more tightly with Bio-Formats for reading the pixel data?
- Josh M: would be useful to not rely on reading TIFF; could instead provide numpy arrays (see the sketch below)
- Ilya: would be good if it effectively wraps a pointer to memory accessible by the local process
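A sketch of skipping the TIFF round-trip via the PyImageMatrix wrapper in the WND-CHARM Python bindings; treat the exact API as an assumption. as_ndarray() exposes the matrix buffer to the local process, roughly the in-memory handover Ilya describes:

    from wndcharm.PyImageMatrix import PyImageMatrix

    def matrix_from_plane(plane):
        # copy a 2D numpy plane into a WND-CHARM image matrix in memory,
        # rather than writing a TIFF for WND-CHARM to re-read
        h, w = plane.shape
        matrix = PyImageMatrix()
        matrix.allocate(w, h)
        matrix.as_ndarray()[:] = plane
        return matrix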
- Chris A: existing Avro-based solution should suffice
- Josh M: except don’t serialize to HDFS
- Simone: currently going via socket, not file
- Avro-serialized data transferred via the Hadoop protocol
- Chris A: can we omit HDFS from the Java-to-Python code and use the local filesystem?
- Simone: absolutely
- Chris A: so we already have the link from Bio-Formats to WND-CHARM
- Simone: some caveats re. RGB, etc.
- Chris A: can use the channel splitter, etc. to regularize the input pixel data
- Chris A: so for GPFS we just need code to get the filepath, series, etc. from OMERO? (see the sketch below)
- Simone: yes
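A sketch of that lookup with BlitzGateway: resolve an Image to its fileset's original file paths plus the series, so a node can call setId locally against GPFS; the managed-repository prefix is hypothetical:

    from omero.gateway import BlitzGateway

    def fileset_info(conn, image_id, prefix="/gpfs/omero/ManagedRepository/"):
        # conn: an open BlitzGateway connection
        image = conn.getObject("Image", image_id)
        fileset = image.getFileset()
        # each original file knows its path and name within the repository
        paths = [prefix + f.getPath() + f.getName()
                 for f in fileset.listFiles()]
        return paths, image.getSeries()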
- Chris A: given the size of the problem, it is okay to make filesets or images the unit of work; there are still plenty more of them than available CPUs
- Josh M: initialization of large filesets (plates) takes many seconds even with memo files, but may still be negligible against WND-CHARM feature extraction
- Ilya: for resumability we need finer granularity than the fileset to tell whether features are already computed (see the sketch after this exchange)
- Chris A: a distributed processing framework should bring us resumability
- Josh M: probably not the biggest problem; even the “parallel” command-line utility offers useful logging
- Chris A: could be useful to have a longer-running Java process per node, with the scheduler tending to hand a node work from the same fileset
- Simone: concerned about ending up reimplementing Hadoop
- Chris A: if working at the fileset level but wanting decent resumability, there is already a fair bit of coding to be done
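A minimal sketch of the finer-grained resumability Ilya asks for: one features file per plane, with workers skipping any output that already exists; the file layout is hypothetical:

    import os

    def pending_output(out_dir, image_id, z, c, t):
        # return the output path if features still need computing,
        # or None if this plane was already done on a previous run
        name = "img%d_z%d_c%d_t%d.features" % (image_id, z, c, t)
        out = os.path.join(out_dir, name)
        return None if os.path.exists(out) else out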
- Simon: what tile sizes make sense for WND-CHARM?
- Ilya: should compute multiple feature sets per image
- a low-res 512^2 version, as well as 512-pixel tiles of the full-size image
- Chris C: some feature values depend on the size of the pixel plane, so variety in plane size can be an issue
- 200x200 is analyzed quickly, 512x512 may be on the large side
- Ilya: either makes sense; the question is more one of whether we do multiple overlapping computations at different sizes (see the sketch below)
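A sketch of the "multiple feature sets per image" idea: a crude low-res view plus fixed-size tiles of the full-resolution plane; 512 is used here, but the right size is exactly the open question:

    def views(plane, tile=512):
        # plane: a 2D numpy array
        # low-res view: stride-based downsampling to roughly tile^2 pixels
        step = max(1, max(plane.shape) // tile)
        yield "lowres", plane[::step, ::step]
        # full-resolution tiles (edge tiles may be smaller)
        for y in range(0, plane.shape[0], tile):
            for x in range(0, plane.shape[1], tile):
                yield "tile_%d_%d" % (y, x), plane[y:y + tile, x:x + tile]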
Done 17:15 UK