Data Science Graduate Education: Content, Projects, and Computer Languages

Randy Paffenroth

Department of Mathematical Sciences, Worcester Polytechnic Institute

In this talk I will discuss the entire lifecycle of a graduate program in Data Science. At Worcester Polytechnic Institute (WPI), the Data Science Program lies at the intersection of Business, Computer Science, and Mathematics, and I will demonstrate how such an interdisciplinary program can be put together. In particular, I will attempt to show how we keep the program faithful to all of its component parts and to WPI’s focus on project-based education. In interdisciplinary fields such as Data Science, students often come from a variety of backgrounds; for example, some may have strong mathematical training but less experience in programming. Accordingly, we have attempted to design a program that addresses the needs of many different types of students and provides them with a sound foundation in all aspects of Data Science. As one particular focus, I will discuss how the Python language’s ease of use, open source license, and access to a vast array of libraries make it particularly well suited for such students. Specifically, I will discuss how Python, IPython notebooks, scikit-learn, NumPy, SciPy, and pandas can be used in several phases of graduate Data Science education, starting from introductory classes (covering topics such as data gathering, data cleaning, statistics, regression, classification, and machine learning) and culminating in degree capstone research projects. Having access to such libraries allows interesting problems to be addressed early in the educational process, and the experience gained with such "black box" routines provides a firm foundation for the students’ own software development, analysis, and research later in their academic experience.