2016-02-09 Tuesday Team Meeting
Dundee: Dominik, Mark C, Kenny, Gus, Simone, Simon, Petr, June, Helen, Balaji, Roger, Will, Ola, Chris A (14:36 UK), Josh B
Remote: Sebastien B, Ian, Kelli, Colin, Harald, Stick, Ilya, David, Melissa, Eleanor, Josh M, Wilma, Chris C, Josh B, Emil (14:33 UK), Rebecca, Jason
Agenda - 2:30pm Start
Accepting minutes from last meeting
- accepted
Project Timelines (2-3 minutes each)
Spaces - 14:30 UK
- Mainline (J-M)
- Web download: PR opened
- Ola: testing tomorrow
- Polyline/Polygon (insight): not compatible with model, PR opened
- Units support in Web viewer: Will is looking at it.
- Ice 3.6: J-M to ask on the ZeroC forum if/when 3.6.2 will be released
- Blog post about Windows support: draft under review
- Balaji: in pretty good shape for the 5.1.8 release; e-mail discussion about documentation changes re. MicroManager
- Josh: will need to make a decision ASAP
- Model (Sebastien)
- Mark: has now made graph operations Folder-aware and opened a related design issue
- Mark: will now be making further changes to Shape properties
- David: current PR finally green, so now moving on to nested Folder tags
- Sebastien: measurement tool now working with folders; client work moving on to nested folders; still on schedule for M1 and the demo (Feb 26)
- Metadata (Josh)
- currently looking into activating web in devspace
- feature calculation bits later
- Eleanor: finished one more dataset, now looking at the next
Other releases/upgrades - 14:38 UK
- Figure
- Josh: might unit changes require figure changes?
- Will: units are already supported reasonably well; Figure just displays units as saved rather than converting, so probably okay (see the sketch below)
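A minimal sketch of the conversion in question, assuming the OMERO 5.1 units model as exposed in Python; the values are hypothetical:

    from omero.model import LengthI
    from omero.model.enums import UnitsLength

    # a pixel size as it might be saved on the server, e.g. 0.25 micrometer
    stored = LengthI(0.25, UnitsLength.MICROMETER)

    # the LengthI copy-constructor converts to the requested unit, which is
    # what Figure would need to do instead of displaying units as saved
    in_nm = LengthI(stored, UnitsLength.NANOMETER)
    print("%s %s" % (in_nm.getValue(), in_nm.getSymbol()))  # 250.0 nm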
- FLIMfit/OPT (Ian)
- FLIMfit - "latest" builds working and tested.
- Localisation - final touches to the updated UI and its presentation.
- ImageJ
- Learning
- Sysadmin
- Kenny: ironed out some kickstart PXE bugs
- Jason: thanks for help with stats.
Glencoe Update (Chris) - 14:41 UK
- getting things back into open source, including fixes to physical pixel sizes
- will need an announcement of what may have been affected and how badly (scalebars, etc.) since 5.1.0
- images with micron pixel sizes should have been fine, but EM tends to be in nm
- Sebastien: may be able to use configurations for test data to determine which readers may be affected
- PRs coming re. issues with OMERO reader, CellProfiler, etc. on large data with IO via OMERO
- finishing up populate metadata issues
AOB (5 mins max - technical discussions should be highlighted to relevant people and rescheduled) - 14:46 UK
- Ilya and Chris C introduction
- OME for machine learning, extracting features from biological images
"Distributed Feature Calculation with Pydoop" presentation by Simone - 14:49 UK
- Ilya: given the performance issues Bio-Formats readers have with non-streaming random access to image files on a distributed filesystem, could it be better to use OMERO as an image server for the cluster nodes running the map operation (feature extraction)? (see the sketch below)
- or perhaps a non-Hadoop distributed computing framework would be better
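A minimal sketch of such a map task using OMERO.py's BlitzGateway; the host, credentials and Image ID are hypothetical:

    from omero.gateway import BlitzGateway

    conn = BlitzGateway("user", "password", host="omero.example.org", port=4064)
    conn.connect()
    try:
        image = conn.getObject("Image", 123)  # hypothetical Image ID
        pixels = image.getPrimaryPixels()
        for z in range(image.getSizeZ()):
            # getPlane returns a 2D numpy array: the unit of work a node
            # would feed into feature extraction
            plane = pixels.getPlane(theZ=z, theC=0, theT=0)
    finally:
        conn.close()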
- Josh B: Hadoop is designed for text-based streaming; something like Spark may be a better fit (sketched below)
- Simone: depends in part on available memory per node
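A sketch of the same fan-out expressed in PySpark, as one possible better fit; extract_features is a stand-in for the WND-CHARM call and the task list is hypothetical:

    from pyspark import SparkContext

    def extract_features(task):
        image_id, z, c, t = task
        # placeholder: fetch the plane (e.g. from OMERO as sketched above)
        # and run feature extraction on it
        return (task, "features")

    sc = SparkContext(appName="wndcharm-features")
    tasks = [(123, z, 0, 0) for z in range(50)]  # hypothetical work list
    results = sc.parallelize(tasks, len(tasks)).map(extract_features).collect()
    sc.stop()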
- Chris A: why did performance drop much further from ideal when splitting by plane instead of by series?
- Simone: many more map processes so correspondingly more overhead
- Chris A: but why so much more overhead? For jobs taking many minutes, things like JVM startup should be negligible
- Simone: one issue is “stragglers” where some of the jobs take rather longer than others
- Simone: the issue is that HDFS makes seeks expensive
- Chris A: is unconvinced that explains the difference, especially given the durations and the trend: IO is a small fraction of this overhead, so look at the framework for the culprit
- Simone: suspects the issue is more one of using so many cores for so little data
- Chris A: though current datasets are mostly z=1, t=1 in size
- Josh M: some do have plenty of planes though
- Ilya: with many more cores and further fragmenting of the data, would the overhead be much worse again?
- Josh: probably yes, guessing that much of the overhead is Bio-Formats setId when the reader initializes
- Chris A: is still struck by how quickly large HCS planes can be read in other scenarios, so is doubtful that IO is the problem
- Ilya: better to gather more performance data points before drawing conclusions, especially with still more cores; concerned about the trend
- Josh B: expects that more nodes will mean more overhead
- Simone: also, the experiments were on a relatively small dataset given how many cores were applied
- Simon: the problem parallelizes well by planes (Josh M: or tiles), so focus on how to get the planes out
- Ilya: it also matters how large a plane it makes sense to calculate features on
- Simone: these are 384x384-pixel planes
- Simon: pull planes from OMERO server?
- Chris A: too many Bio-Formats instances; would need Java processes on the nodes rather than on the server
- can initialize Bio-Formats locally (see the sketch below)
- Mark: series is a property of Image
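A sketch of initializing Bio-Formats locally on a node, assuming the python-bioformats/javabridge bindings; the series comes with the task, as Mark notes, and the file path is hypothetical:

    import javabridge
    import bioformats

    javabridge.start_vm(class_path=bioformats.JARS)
    try:
        with bioformats.ImageReader("/gpfs/data/plate.ome.tiff") as reader:
            # series is a property of the Image, so select it per task
            plane = reader.read(series=0, z=0, c=0, t=0, rescale=False)
    finally:
        javabridge.kill_vm()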
- Josh M: what infrastructure are we building for testing WND-CHARM across the whole of IDR?
- Simone: just presenting past Hadoop work; not sure which approach is best, but we should definitely use some framework rather than something manual
- Chris A: consider a classical grid approach (rather more lightweight): a “poor man’s parallelization framework” with a scheduler and writing out planes as separate files (see the sketch below)
- Ilya: now that the Hadoop-based solution is implemented, it’s worth trying it on a real problem
- Simon: still some setup to do to achieve that
- Ilya: it all rides on having a good distributed filesystem
- Chris A: suggest taking what has already been done, except move to GPFS and local Bio-Formats on nodes
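A minimal sketch of that lightweight approach, assuming planes have already been read (locally, as above) and a scheduler such as SGE or GNU parallel maps one job per output file; paths are hypothetical:

    import os
    import numpy as np

    def dump_planes(image_id, planes, out_dir="/gpfs/scratch/planes"):
        # write each plane out as its own file; one scheduler job per path
        if not os.path.isdir(out_dir):
            os.makedirs(out_dir)
        paths = []
        for (z, c, t), plane in planes:
            name = "img%d_z%d_c%d_t%d.npy" % (image_id, z, c, t)
            path = os.path.join(out_dir, name)
            np.save(path, plane)
            paths.append(path)
        return paths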
- Chris C: WND-CHARM currently reads one pixel plane per TIFF. Would it help for WND-CHARM to integrate more tightly with Bio-Formats for reading the pixel data?
- Josh M: would be useful to not rely on reading TIFF; could instead provide numpy arrays (see the sketch below)
- Ilya: would be good if it effectively wraps a pointer to memory accessible by the local process
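A sketch of skipping the TIFF round-trip via the PyImageMatrix wrapper in the WND-CHARM Python bindings; treat the exact API as an assumption. as_ndarray() exposes the matrix buffer to the local process, roughly the in-memory handover Ilya describes:

    from wndcharm.PyImageMatrix import PyImageMatrix

    def matrix_from_plane(plane):
        # copy a 2D numpy plane into a WND-CHARM image matrix in memory,
        # rather than writing a TIFF for WND-CHARM to re-read
        h, w = plane.shape
        matrix = PyImageMatrix()
        matrix.allocate(w, h)
        matrix.as_ndarray()[:] = plane
        return matrix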
- Chris A: existing Avro-based solution should suffice
- Josh M: except don’t serialize to HDFS
- Simone: currently going via socket, not file
- Avro-serialized data transferred via the Hadoop protocol
- Chris A: can we omit HDFS from the Java-to-Python code and use the local filesystem?
- Simone: absolutely
- Chris A: so we already have the link from Bio-Formats to WND-CHARM
- Simone: some caveats re. RGB, etc.
- Chris A: can use the channel splitter, etc. to regularize the input pixel data
- Chris A: so for GPFS we just need code to get the filepath, series, etc. from OMERO? (see the sketch below)
- Simone: yes
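A sketch of that lookup with BlitzGateway: resolve an Image to its fileset's original file paths plus the series, so a node can call setId locally against GPFS; the managed-repository prefix is hypothetical:

    from omero.gateway import BlitzGateway

    def fileset_info(conn, image_id, prefix="/gpfs/omero/ManagedRepository/"):
        # conn: an open BlitzGateway connection
        image = conn.getObject("Image", image_id)
        fileset = image.getFileset()
        # each original file knows its path and name within the repository
        paths = [prefix + f.getPath() + f.getName()
                 for f in fileset.listFiles()]
        return paths, image.getSeries()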
- Chris A: given the size of the problem, it is okay to make filesets or images the unit of work; there are still plenty more of them than available CPUs
- Josh M: initialization of large filesets (plates) takes many seconds even with memo files, but may still be negligible against WND-CHARM feature extraction
- Ilya: for resumability we need finer granularity than the fileset to tell whether features are already computed (see the sketch after this exchange)
- Chris A: a distributed processing framework should bring us resumability
- Josh M: probably not the biggest problem; even the “parallel” command-line utility offers useful logging
- Chris A: could be useful to have a longer-running Java process per node, with the scheduler tending to hand a node work from the same fileset
- Simone: concerned about ending up reimplementing Hadoop
- Chris A: if working at the fileset level but wanting decent resumability, there is already a fair bit of coding to be done
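A minimal sketch of the finer-grained resumability Ilya asks for: one features file per plane, with workers skipping any output that already exists; the file layout is hypothetical:

    import os

    def pending_output(out_dir, image_id, z, c, t):
        # return the output path if features still need computing,
        # or None if this plane was already done on a previous run
        name = "img%d_z%d_c%d_t%d.features" % (image_id, z, c, t)
        out = os.path.join(out_dir, name)
        return None if os.path.exists(out) else out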
- Simon: what tile sizes make sense for WND-CHARM?
- Ilya: should compute multiple feature sets per image
- a low-res 512^2 version, as well as 512-pixel tiles of the full-size image
- Chris C: some feature values depend on the size of the pixel plane, so variety in plane size can be an issue
- 200x200 is analyzed quickly, 512x512 may be on the large side
- Ilya: either makes sense; the question is more one of whether we do multiple overlapping computations at different sizes (see the sketch below)
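A sketch of the "multiple feature sets per image" idea: a crude low-res view plus fixed-size tiles of the full-resolution plane; 512 is used here, but the right size is exactly the open question:

    def views(plane, tile=512):
        # plane: a 2D numpy array
        # low-res view: stride-based downsampling to roughly tile^2 pixels
        step = max(1, max(plane.shape) // tile)
        yield "lowres", plane[::step, ::step]
        # full-resolution tiles (edge tiles may be smaller)
        for y in range(0, plane.shape[0], tile):
            for x in range(0, plane.shape[1], tile):
                yield "tile_%d_%d" % (y, x), plane[y:y + tile, x:x + tile]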
Done 17:15 UK