On Saturday, at the Quantified Self (QS) Conference at Stanford, I had the opportunity to lead a small discussion on the topic of a Data Commons for QS-type data. I would really like to thank all of the participants. It was a very productive and vibrant conversation. We had people from big companies, from startups, anthropologists, an epidemiologist, health data veterans and people new to the conversation. It was really interesting to see how others’ views and conceptions of a QS Data Commons differed from mine, and I learned a lot through the discussion.
The general concept is that we should create a common repository for personal data where I can ‘donate’ my data. For example, I should be able to send my activity data (FitBit, Jawbone UP, Nike, etc.) into this repository to join the activity data streams of others, where the value of the aggregate data will very quickly surpass the value of any individual data stream.
I think there are a lot of potential use cases and places where this repository would add value. Much like the Zeo sleep data repository, this Data Commons would be open to academic researchers to create and execute novel studies. It could also be open to everyday hackers, companies and aspiring entrepreneurs, which would accelerate progress in this field. Innovators need raw materials, often data sets, that they can play with to test hypotheses, build models and build prototypes.
Some interesting debates came up that will need much further discussion: the minimal feature set needed, identity vs. anonymity, upfront data standardization vs. evolving standards, value to individuals vs. value to developers/academics. I personally believe this should be a very minimal repository to start, with little standardization or functionality. I believe innovators will see the potential value of the Commons and create the tools they need to clean and map the datasets. Waiting for agreement on these details will slow progress. Beau Gunderson from Singly says it best in this tweet:
Others believed that the data would be far less valuable without the addition of structure and metadata. I think some metadata are needed, like provenance of the data (device, model, etc.), so that good tools can be created, but agreeing on and adding richer metadata can come later. With a first implementation, we would quickly learn the right things to capture going forward, and probably have better ideas about how to capture them.
One other main question that I, and the group, struggled with was identity. On the one hand, if you do not include identity, there is little need for much security, encryption, privacy controls, account handling, etc., so development will be much faster and simpler. On the other hand, without identity, there may be little direct value to the users donating data, and different data streams from the same user might not be linked. This might be another area where we start with anonymity and add identity later. But I do think there must be a way to link data streams back to the same person if the Commons is to reach its full potential. Without this, finding unique and interesting correlations between streams will not be possible, and great insights and findings will remain unmined.
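One way to get linkage without identity, sketched here as a hypothetical (the group did not settle on any mechanism), is a stable pseudonym: the Commons derives the same opaque token from the same user every time, so a FitBit stream and a Zeo stream can be joined, but the token cannot be reversed without a secret held by the Commons. The function name, salt, and email identifiers below are all illustrative assumptions, not anything agreed at the conference.

```python
import hashlib
import hmac

def pseudonym(user_id: str, secret_salt: bytes) -> str:
    """Derive a stable, non-reversible pseudonym from a user identifier.

    The same user_id always yields the same token, so separate data
    streams can be linked to one person; without the secret salt the
    token cannot be mapped back to the original identifier.
    """
    return hmac.new(secret_salt, user_id.encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical secret held by the Commons, never published.
salt = b"commons-secret"

# Two different device streams donated by the same (fictional) user:
fitbit_stream_owner = pseudonym("alice@example.com", salt)
zeo_stream_owner = pseudonym("alice@example.com", salt)

# The streams link via the shared pseudonym without exposing identity.
print(fitbit_stream_owner == zeo_stream_owner)  # True
```

The design trade-off is that whoever holds the salt can re-link tokens, so this only defers the privacy question; it does not eliminate it.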
Then, as in all great discussions, there were several points where I learned that an assumption I held was wrong. I have seen discussions of a Data Commons turn into detailed discussions of data standards and data formats many times, so I tried, from the outset, to guide this discussion away from standards. But over the weekend the topic kept coming up, and I had some great discussions with Nitin Borwankar, David Lee, David Spira, Jon Crosby, Beau Gunderson and others. Many of these talks focused on adopting a simple and common format for data export from any device or app (JSON was often suggested), without yet worrying about systematic header fields and metadata. Everyone would just agree to use JSON (or another transfer format), use their existing headers and publish the schema. Then we can see what people are using and standardize as we progress. I like this approach and came around to the view that standards are needed, and are actually key. If we adopt some minimal export standard from the outset, the Commons will be cleaner and data will flow much more easily. 100Plus is committed to this cause and will participate to try to foster a community around a common data export format, a Data Commons, and general agreement that exporting data should be a core feature of any product in the fields of QS, mHealth, Digital Health, [next week’s name here]…
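To make the "publish your schema, standardize later" idea concrete, here is a minimal sketch of what such an export might look like. The field names, the schema URL, and the device model string are all hypothetical, invented for illustration; the only real commitments are the ones described above: JSON as the transfer format, a small provenance header (device, model), and a pointer to a published schema.

```python
import json

# Hypothetical minimal export: a flat JSON document with a small
# provenance header, a published (not standardized) schema reference,
# and the raw samples left in the device's own units.
export = {
    "device": "FitBit",                                 # provenance: device
    "model": "Ultra",                                   # provenance: model (illustrative)
    "schema": "https://example.com/schemas/steps-v0",   # published schema, hypothetical URL
    "samples": [
        {"timestamp": "2012-09-15T08:00:00Z", "steps": 412},
        {"timestamp": "2012-09-15T09:00:00Z", "steps": 1038},
    ],
}

payload = json.dumps(export, indent=2)

# Any consumer can parse the payload and follow the schema reference
# before deciding how to map the fields into its own data model.
parsed = json.loads(payload)
print(parsed["device"], len(parsed["samples"]))
```

The point of the sketch is the shape, not the fields: each vendor keeps its own headers, but because the container is plain JSON with provenance and a schema pointer, the community can observe what converges in practice and standardize that.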