The Preparation and Exploitation of Information in Data Science

John Coles; Ron Rudnicki

Senior Research Scientists, CUBRC

The flood of data from collection devices, social media, and traditional data sources holds the potential to reveal great insights, if only it is properly exploited. A growing number of data mining libraries are available for use and development (e.g., Weka, Google, R), but the real work of data exploitation is knowing when, why, and how to use these capabilities. This talk will discuss how to take full advantage of data by approaching preparation from two different angles: enhancing data representation structures and finding naturally occurring data structures. To realize the promise of data understanding, a shift needs to occur from data representations specific to individual systems to common representations that cut across those systems. Ontologies are well suited to accomplishing this shift, provided that they are developed using a set of best practices that shortens development cycles and improves adaptability to new data sources while still covering the models of all the applications within the ecosystem. However, the vast array of information being stored, and the disparate systems that generate it, also demand a paradigm shift in how data is accessed and prepared for analysis. To gain access to the full spectrum of available data, it is critical that data analytics methods and systems be capable of discovering the data models in use instead of assuming them a priori.