Last night we hosted a Data for Good (@Data_for_Good) event entitled “Top Tips for Data Quality Assurance”, a great evening that brought together all kinds of people from the datasphere to hear data experts talk about…well, data!
We began the night with Neity Kic (@peaceforlives) discussing an all-important issue: cleaning dirty data. Dealing with multiple formats and merging datasets can get complicated, especially when records are nearly identical, but she took us through several methods for handling these problems.
@peaceforlives kicks off the meetup with the avoid Excel tip if you want cleaner #data, R-friendly/CSV format better. pic.twitter.com/Z9ZxQ0vLhB
— Viafoura (@viafoura) October 21, 2014
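For readers who want to try the near-identical records problem from Neity's talk, here is a minimal sketch using only Python's standard library. It is not the method shown at the meetup: the `normalize` and `near_duplicates` helpers, the 0.9 threshold and the sample names are all assumptions made up for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences vanish."""
    return " ".join(text.lower().split())


def near_duplicates(records, threshold=0.9):
    """Return index pairs whose normalized similarity is at least `threshold`.

    The threshold is an illustrative assumption; tune it on your own data.
    """
    cleaned = [normalize(r) for r in records]
    pairs = []
    for i, j in combinations(range(len(cleaned)), 2):
        score = SequenceMatcher(None, cleaned[i], cleaned[j]).ratio()
        if score >= threshold:
            pairs.append((i, j))
    return pairs


# Hypothetical organisation names, not data from the talk.
names = ["Acme Widgets Inc.", "acme  widgets inc", "Maple Leaf Foods"]
print(near_duplicates(names))
```

Comparing every pair is fine for a few thousand rows; on larger datasets you would likely want to group records first (for example by postal code) and only compare within each group.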
The evening carried on with Kry Lui, who spoke about filling in missing values in datasets, definitely not a favourite task of our data scientists. He started by explaining the patterns that missing values can form, and argued that those patterns can themselves be used to help fill in the gaps. By looking at the surrounding data points, you can infer missing values using K-means, K-Nearest Neighbour and other more advanced methods, while also taking into account the weight and distance of those points.
@VictorFAnjos excited, and all smiles as Kry Lui hits the whiteboard. #bigdata #cleandata pic.twitter.com/8Y8YZBfg7m
— Viafoura (@viafoura) October 21, 2014
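To make the nearest-neighbour idea from Kry's talk concrete, here is a rough sketch of distance-weighted K-Nearest Neighbour imputation in plain Python. It is an illustration, not what was on the whiteboard: the `knn_impute` function, the inverse-distance weighting and the toy rows are assumptions, and it only handles gaps in a single numeric column.

```python
import math


def knn_impute(rows, target, k=2):
    """Fill missing values (None) in column `target`.

    Each gap is replaced by the inverse-distance-weighted mean of the
    target values of the k nearest complete rows. Distance is Euclidean
    over the other columns, which are assumed to be numeric and complete.
    """
    complete = [r for r in rows if r[target] is not None]
    if not complete:
        raise ValueError("need at least one row with a known target value")

    def features(row):
        return [v for col, v in enumerate(row) if col != target]

    filled = []
    for row in rows:
        new_row = list(row)
        if row[target] is None:
            scored = sorted(
                ((math.dist(features(row), features(other)), other[target])
                 for other in complete),
                key=lambda pair: pair[0],
            )
            nearest = scored[:k]
            # Closer neighbours count for more; the small constant avoids
            # division by zero when a neighbour has identical features.
            weights = [1.0 / (d + 1e-9) for d, _ in nearest]
            new_row[target] = (
                sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)
            )
        filled.append(new_row)
    return filled


# Toy data (made up): column 0 is a feature, column 1 has one gap.
data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [2.5, None]]
print(knn_impute(data, target=1, k=2))
```

In practice the features should be put on a comparable scale first, otherwise whichever column has the largest numbers dominates the distance.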
Next came Adam Jacobs from MaRS Discovery District, who gave us a non-profit’s perspective on data: specifically, survey data, and how difficult it can be to weight responses when the data returned varies so widely. For example, response rates differ from industry to industry, which makes it hard to get a true picture across the board. The problem in the end? Not enough data. With so little data, we’re left with a limited descriptive base, and when dealing with surveys that can be a real problem.
Adam Jacobs of @MaRSDD provides a real world example of #data cleaning via their survey data! pic.twitter.com/tfHDoSZmQq
— Viafoura (@viafoura) October 22, 2014
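One common way to correct for uneven response rates is post-stratification, where each response is weighted by its group's share of the population divided by its share of the sample. The recap does not say which approach MaRS uses, so treat the sketch below as a generic illustration; the function name, industries and counts are all made up.

```python
def poststratification_weights(population_counts, response_counts):
    """Return a weight per group: population share / sample share.

    Groups with no responses get None, since no weight can rescue
    a group that nobody answered for.
    """
    total_pop = sum(population_counts.values())
    total_resp = sum(response_counts.values())
    weights = {}
    for group, pop in population_counts.items():
        responses = response_counts.get(group, 0)
        if responses == 0:
            weights[group] = None
            continue
        weights[group] = (pop / total_pop) / (responses / total_resp)
    return weights


# Hypothetical numbers: biotech firms respond far less often than software firms.
population = {"software": 600, "biotech": 400}
responses = {"software": 90, "biotech": 10}
print(poststratification_weights(population, responses))
```

With these invented counts, each biotech response ends up counting roughly four times as much as it would unweighted, which shows why so little data is a problem: a handful of answers from one industry can end up driving the whole estimate.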
Last, but certainly not least, Samara (Canada) (@SamaraCDA), a not-for-profit, described their plan for using data to increase political participation in Canada. They’ll be working with Data for Good to clean up their datasets on political contributions in Canada, which until now have been largely unstudied, and to derive meaningful insights from them. They plan to attack survey data at all levels to create a comprehensive map of financial contributions made to political parties. We wish both them and the Data for Good volunteers luck!
As usual, after the talks we launched into Q&A and had a lively evening of networking and data discussion. See our live recaps at @viafoura on Twitter and at @Data_for_Good.