2011.04.01 Inaugural Meeting
Attending: Josh, Jason, Simon, Andy
Agenda
- Matters Arising (<10 mins)
- New Hires (5 mins)
- Hardware
- "4 CPUs (up to ~3.3GHz each), 50 - 100GB storage, 8GB memory."
- "Initially this would share the host machine's network card but we would look to order a separate 1GB card (in the coming days) which would be dedicated to the VM, cost is likely to be £100-200 (charged to SHIP). If you think this isn't going to meet the immediate resource needs then we will need to purchase a new host server."
- "The VM will reside inside the university firewall but outside of ours, and not on our domain - thus it will have a 134.36.204.* IP and be accessible across the uni network, but not externally. If external access is needed then this would need to be authorised by ICS."
- "We can provide a Windows 2008 R2 VM or we can host a Linux VM, either way you can have full admin rights. Thus if you have an existing template VM that we can host within HyperV - please make this available to us. If you don't then please indicate which OS is preferred - n.b. we can setup the win2k8 instance very quickly as we have templates."
- Access to Datasets (10 mins)
- Requirements for initial prototype (20 mins)
- group-based vs. individual-based
- requirements on auditing, who's reviewing how; per row
- Requirements for Wellcome visit demo (20 mins)
- Any other business (<5 mins)
- Process & visibility of web resources
Notes
* Matters arising
* Don't need privileges on plone (yet); have privileges on trac
* New Hires
* Job ad is out on monster
* Has it been posted to NoSQL? Not yet
* http://www.lifesci.dundee.ac.uk/vacancies/2011/03/30/software-developer-2-posts
* Simon also looking into known developer groups
* Andy: http://careers.stackoverflow.com/
* Simon official? As far as we know.
* Hardware
* VM with full admin rights
* Hosted by HIC (physical environment - similar to J. Monk's)
* Can move the VM elsewhere
* One genetics file is a Gig.
* NIC to the cluster.
* Should only be able to see 134.36.204.*
* Andy: May need to pester Simon to get it configured.
* Simon: patched & updated debian (with our keys), put up somewhere so Andy can access it
* leading into datasets...
* Datasets
* Josh: what can't we do with the data on the VM?
* Andy: "don't download data" / "data should stay on the VM"
* Can pipe it, but ...
* SOPs should protect us
* Not a real issue
* Perhaps a list of what people can and can't do
* Simon: shell into box, but don't scp all the data
* Andy: just behave and there are no problems
* At the moment, we don't have any control anyway (what we're trying to fix)
* Jason: everyone sign?
* Andy: anyone who can access VM needs to sign.
* Josh: use SSH keys. Simon in charge; only people who Andy has ok'd.
* Prototype
* Andy: group-based at the moment (within GoDARTs)
* smallest group can be one person
* never heard of a project with individual permissions
* for pilot, group is fine
* Auditing system
* Andy: never done this before
* Just setting it up on the Oracle box
* Auditing of every query which is run
* login
* select of one record, etc.
* admins are tracked as well
* Reviewing by default only done by HIC data coordinator (Alison & Andy)
* Already have external audits by independent company
* But they may do it through the HIC offices
* researcher signs SOP, need to make sure data is used appropriately
* i.e. visualize spot check
* Level 1
* human recalling time-series of what researcher did
* Level 2 could be audited, but outside the scope for the moment
* Simon: ok to instant message you to clarify points?
* We spent about 30 minutes talking
* Andy: just email us!
* Previous projects had IRC open 24/7
* Jason: Simon, do you want jabber?
* No, it's ok.
* Andy: NHS blocks lots.
* Can join IRC when pinged.
* Status
* Josh:
* Simon is committing the start of the researcher tools
* Josh will start working on the auditing bits
* Andy: got the data from HIC, but they've lost the scripts with linkages / aliases
* ## aliases may be a problem
* Having to reverse engineer their data
* Luckily had a bit of code to do this.
* Having to go in and fill in XML by hand (tedious)
* Sample file with 100 rows per file, plus schema file
* They will be popped in as they're done.
* PLINK is completely command-line (a bit messy)
* Visit demo
* Josh: any way to pin down what that needs to be
* Andy: loading in / pulling data out probably won't show up there
* Josh: could at least mock up an auditing web page
* "Oh my god, Bob did X..."
* Jason: thinking...
* What would Andrew (who has an iPad) think of...
* walking with visitors around HIC
* he's able to get a web form, securely login to site Y,
* and can see 2 other investigators,
* click on one of them,
* see series of datasets associated with them,
* click on dataset,
* see some listing of types of data (or something like that)
* graphical representation of some of that data
* would that make the point?
* Andy:
* going into provenance / governance of the data usage
* nice way of going about it.
* login, see a couple of users
* see how much data that is in
* queries over a couple of days, number of bytes they've moved
* looking at metadata for web pages
* pick couple of samples, subsets to graph X people with type 2
* and these are the queries run against this variable
* demoing the audit trail is fine
* Jason
* the problem is that it's AAA, important in safe haven
* but we also want to show some science
* Andy: number of summary statistics for whole database
* that would be useful
* Jason: few pages of a presentation like an admin interface
* Safe haven is shown by administrator (Prof. Morris) monitoring what's done
* could we also have someone write a query against the database to get some value
* Andy: for the researchers,
* running a query to get a subset to do analysis on
* or pulling genetics data and running against PLINK
* Simon
* Core thread is governance and administration
* but with some demonstration of how a scientist may use it
* Andy
* genetics file formats are regularly transformed
* perhaps that would be an example
* Jason: the point would be they don't do that anymore
* Jason: Back to "The Day"
* 2 users, click on user, see a dataset
* 2 datasets and a fused dataset (incorporates calculation which occurred)
* ...
* once we get the data, then we'll be complete about what we're looking at
* we have reasonable SNP analysis people (D.M.) for how to process that data
* then we'll figure out if we can run that analysis and show the result
* need to know what the analysis and the visualization look like.
* how do we show that we did it, and that Andrew can review it.
* AOCB
* Genetics
* Jason: didn't understand operation
* Andy: don't yet either
* 18K patients (10K cases)
* 3 chip array analyses (blood tests), 750K SNPs
* metab. chip did 120K (different locations)
* looking at overlap from chips for people who are in both, imputing the gap
* one of the files is binary (reversing it led to increase in size 20M -> 1GB)
Action Items
* Simon: talk to Chris about which NoSQL boards to post job URL to
* Jason: post job URL to the OME list
* Josh: look into careers at stackoverflow for Dundee
* Simon: provide VM to Andy (Fusion)
* Andy: draw up a list of best practices for data on VM
* Simon, Josh, Jason: get signed forms.
* Andy: confirm that making the notes public is ok
* Next meeting in 2 weeks on 15th.