2011.03.14 Followup

Attending: Andy, Josh

AJ: Josh, I've marked up your message with my comments. However, I'm conscious that this email list and thread are in danger of being used for ongoing discussions about this work, which may not be welcomed by everyone. Therefore, I'm suggesting we reduce the circulation for these ongoing discussions to the core development group (OMERO team, me & Alison) by default and then let others opt in. So if you do want to be included in these types of discussions, can you reply to me by the end of the week? Don't worry, we'll still engage with everyone, but we'll try not to bombard your inbox, and remember you can always get to the tickets using the link Josh has provided.

JM: The versioning of data sets/sources will eventually be of interest, according to Helen. Though this is likely much farther down the road, it would be good if we could prevent doing anything that would preclude it.

AJ: In terms of the pilot, it is plausible that we may have multiple versions of data extracts (e.g. BMI readings for the cohort up to 2009 and then another up to 2011).

JM: From Andrew's 4 safe factors ("safe people accessing safe data in a safe setting to produce safe outputs"), code development in OMERO will initially focus on safe outputs, though the system that is developed must not do anything to undermine the other 3. At the same time, the overall project should provide guidelines and best practices on safe settings, etc.

AJ: I'd disagree, and say our focus is on safe setting not safe outputs. Safe setting is the secure management and access of the data - e.g. single copy of data held centrally, accessed over secure protocols, audit logs, unable to download actual data etc. Safe outputs is the disclosure control - e.g. does a cell in R/STATA output table potentially identify an individual or small group of people.
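
To illustrate the kind of safe-outputs check described above, here is a minimal sketch of small-cell suppression on a table of counts. The function name, the example table, and the threshold of 5 are all assumptions made for illustration; they are not ISD or HIC policy, and real disclosure control involves more than this single rule.

```python
def suppress_small_cells(counts, threshold=5):
    """Return a copy of `counts` with potentially disclosive cells masked.

    Any count between 1 and threshold - 1 is replaced by None.
    The threshold of 5 is an illustrative assumption, not a stated policy.
    """
    return {
        key: (None if 0 < n < threshold else n)
        for key, n in counts.items()
    }


# Hypothetical cross-tabulation: (age band, group) -> number of people
table = {
    ("40-49", "A"): 120,
    ("40-49", "B"): 3,    # small cell: could identify individuals
    ("50-59", "A"): 0,
}

safe_table = suppress_small_cells(table)
```

In this sketch the cell holding 3 people is masked while the other two pass through unchanged.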

JM: A (the?) primary goal is to increase the community's confidence in sharing data for population research while minimizing the risk of identifying individuals. An open question for me is to what extent the risk possible with OMERO must be quantified or at least formalized. Any pointers to existing guidelines for clinical data software would be appreciated.

AJ: It is one of the goals (see research focus group feedback for others - e.g. data prep/cleaning, computation, collaboration...). I think the minimum we need is a detailed audit log of users' actions which can then be referenced if there is any concern. In terms of current practice - ISD have information on their website about the disclosure control they apply, but I don't have the link to hand as my main computer has died this morning. A term that also seems to get used in the research literature is 'obfuscation'. HIC don't currently check researcher analyses & outputs - hence an audit log of what users do is a marked improvement. Having intelligent tools to make sense of that information may well be out of scope of this work.
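
As a rough picture of what a detailed audit log of users' actions could record, the sketch below appends one JSON line per action to a stream. The field names (`time`, `user`, `action`, `target`) and the `log_action` helper are hypothetical and are not part of OMERO; an in-memory buffer stands in for the real log destination.

```python
import io
import json
from datetime import datetime, timezone


def log_action(stream, user, action, target):
    """Append one audit entry to `stream` as a single JSON line."""
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,
        "target": target,
    }
    stream.write(json.dumps(entry) + "\n")
    return entry


# In-memory stand-in for an append-only log file
audit_log = io.StringIO()
log_action(audit_log, "researcher1", "query", "extract-2009")
log_action(audit_log, "researcher1", "export", "extract-2009")

# Each line can later be parsed back if a concern is raised
entries = [json.loads(line) for line in audit_log.getvalue().splitlines()]
```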

JM: Current practice includes the emailing of PGP encrypted data sets. The initial handshake with an OMERO server (in which passwords, etc. are sent) is guaranteed to be SSL-encrypted. Later communication, however, can be sent in plain text. For OMERO/HIC, either communication with the server which is handling sensitive data or the data itself should be encrypted, possibly both. Does anyone have an idea of which?

AJ: A minor point, but the PGP files tend to get FTP'd. We should use encryption for all communication, not just the handshake. Or are you suggesting SSL for the handshake and then encrypting the data using some other algorithm (e.g. PGP)? It would also be nice to be able to demonstrate 2-factor authentication - is this possible within OMERO just now?

JM: Where appropriate, I've added the questions above as "tasks" under the "stories" which make up the OMERO/HIC "requirement" (http://trac.openmicroscopy.org.uk/ome/ticket/4625). That's all OME-speak for our todo list. Feedback is very welcome.

AJ: I've had a look over the tickets/wiki and this is a good start. One task that isn't listed (and arguably the overall objective) is the development of the APIs for STATA/R so that these packages can talk directly to the OMERO architecture, i.e. without the export step. This ticket would have a higher priority than the web API work.

JM: Understood. For whatever reason, I had the impression that the web interface had become prioritized, so it's good you spoke up. http://trac.openmicroscopy.org.uk/ome/ticket/4672 added.

AJ: In terms of the export - once one or more files are generated, how do the researchers access them? Do they have to download them to their local machines, or do they map a drive (e.g. using ExpanDrive), or do we install a set of software tools onto the server so they can work on them on the server?

JM: Initially, I would have expected something like exporting a zip (in the case of multiple files). During server-side execution, the files could automatically be unpacked for the script.
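
The zip export and server-side unpacking described above could look roughly like the following sketch, which uses only Python's standard `zipfile` module. The helper names `export_zip` and `unpack_for_script` and the file names are hypothetical; nothing here is an existing OMERO call.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def export_zip(files):
    """Bundle a dict of {file name: bytes} into a zip archive held in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


def unpack_for_script(archive_bytes, workdir):
    """Unpack an exported archive into `workdir` and list what was extracted."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        archive.extractall(workdir)
    return sorted(p.name for p in Path(workdir).iterdir())


# Hypothetical export of two extract files
exported = export_zip({"bmi_2009.csv": b"id,bmi\n1,24.1\n",
                       "bmi_2011.csv": b"id,bmi\n1,25.0\n"})

# A scratch directory stands in for the script's server-side working area
with tempfile.TemporaryDirectory() as workdir:
    unpacked = unpack_for_script(exported, workdir)
```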

AJ: In terms of the AAA security at column/row level - for now the model is to mimic the current HIC data release practice, where we build a project-specific subset of data, in which case all members of that project will have full access to that extract. There will be no master prescribing, smr01, scidc datasets in the OMERO/HIC bubble - these will remain on the NHS environment, and subsets will be generated by me/Alison and imported into the OMERO/HIC tables for use by the researchers.

AJ: Aggregates via a website may be possible against some datasets (e.g. godarts) and not possible against others - a long list of political reasons comes to mind. At the moment we do offer a boxi (business objects) environment against some data, but this is only accessible on the NHS network, so it has had limited uptake and requires significant investment. Some researchers have suggested they'd like to see something similar to the genetics world of the hapmap / zebrafish / 1958 birth cohort data browsers. I'm not sure if we'd want to provide a predefined set of questions or an interactive exploration environment similar to the above or boxi/olap. Anyway, I'm meeting with Andrew next week to discuss the web frontend work, but the priority is always going to be the APIs, not the website.

JM: Alright.

AJ: As far as the import goes, we currently generate an xml file and a csv file. The xml file lists the number of rows, column names, data types, descriptions etc. So it would be good if the import tool could work either with just a csv, or with a csv & xml file pair. I'll generate a csv+xml example pair in the next few days.
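
Until the example pair is available, the sketch below shows one way an import tool might accept either a csv alone or a csv & xml pair, using the xml to check the row count and column names. The xml layout used here (a `dataset` element with a `rows` attribute and `column` children) is purely an assumption for illustration; the real HIC format may differ.

```python
import csv
import io
import xml.etree.ElementTree as ET


def check_import(csv_text, xml_text=None):
    """Read a csv and, if xml metadata is supplied, validate the csv against it.

    Returns (column names, number of data rows).
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    if xml_text is None:
        return header, len(data)  # csv-only import: nothing to validate

    root = ET.fromstring(xml_text)
    expected_columns = [col.get("name") for col in root.findall("column")]
    expected_rows = int(root.get("rows"))
    if header != expected_columns:
        raise ValueError(f"columns {header} do not match {expected_columns}")
    if len(data) != expected_rows:
        raise ValueError(f"found {len(data)} rows, expected {expected_rows}")
    return header, len(data)


# Hypothetical pair; the xml structure is an assumed layout
sample_csv = "id,bmi\n1,24.1\n2,27.8\n"
sample_xml = """
<dataset rows="2">
  <column name="id" type="int" description="Pseudonymised identifier"/>
  <column name="bmi" type="float" description="Body mass index"/>
</dataset>
"""

columns, row_count = check_import(sample_csv, sample_xml)
```

Calling `check_import(sample_csv)` without the xml skips validation, which matches the csv-only case.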

JM: Looking forward to it. Rules for proper handling of these files should accompany them! :)

AJ: Do I need an account set up for trac to make comments etc., or do I use my uni login, or do you prefer emails?

JM: You can't set up a trac account, but Chris can tell us how easy it is to get one. Emails are a good fallback. Is there anything else we need in place in terms of project tools: mailing lists and/or email aliases, private file sharing location, etc.?
