Introduction to Data Science and Analytics

Learn the fundamentals of data science and analytics, from problem formulation through to model building and interpretation of results.

Introduction to Data Science and Analytics is aimed at the professional who wants an introduction to data science with a strong focus on business applications. By the end of the course, beginners will be capable of building, tuning and deploying regression and classification models for a variety of business problems. Participants will also gain an understanding of unsupervised learning techniques and big data architecture.

Taught using a variety of open source and cloud technologies, the course covers techniques for handling, manipulating and analyzing high-volume (1M+ row), high-dimensional business data.

Some familiarity with R or Python (through self-study or a MOOC) is recommended prior to starting the course.

Who Should Attend?

This course is suitable for anyone who would like a better understanding of data science and how it can be used to make better decisions with data.

Prerequisites

It is recommended that participants have completed an introductory R programming course or MOOC and at least one introductory statistics unit at the university level.

Laptop Required Specs

  • Intel i3 processor, 4GB RAM
  • Either Mac or Windows operating system

Software Requirements

  • Excel 2010, 2013 or 2016
  • R and RStudio (latest versions)
  • A free trial or paid subscription to Microsoft Azure ML Studio

Course Objectives

  • Provide participants with an understanding of the fundamentals of data science and analytics, from a commercial perspective
  • Provide an overview of the current state of big data and implications for business
  • Introduce machine learning to beginners using applied examples
  • Give participants a framework for selecting the most appropriate data science method for a given use case

Upon successful completion of this course, participants will be able to:

  • Build, tune and deploy basic machine learning models in the cloud
  • Reduce the dimensionality of large datasets using unsupervised learning techniques
  • Use Azure ML Studio and R to analyze and perform predictions on large datasets
  • Understand distributed computing and how it is used by businesses to query large datasets

Datasets

This course utilizes the following datasets as learning tools:

  • A 750,000 row, 30 variable digital marketing dataset from the insurance sector
  • A 227,000 row, 21 variable airlines dataset

Data Fields

The digital marketing dataset includes information about customer demographics, product category purchased and the digital marketing channel the customer engaged with at each respective online touchpoint.

The airlines dataset includes information on domestic US flights that departed Houston in 2011. The fields include departure time, arrival time, flight number and destination location (alongside 17 other fields).

Data Format

The data is provided to participants in unstructured .dat format. Participants are taught how to import the dataset into Excel and convert the .dat file into an .xlsx file.

Day 1

I. Introductions, Ice Breaker (9:00am – 9:15am)

II. Introduction to Data Science, Big Data and Analytics (9:15am – 10:00am)

  • What is data science? What does a data scientist do?
  • What is analytics? What is Predictive Analytics?
  • The analytics value chain – Myth and reality
  • Mapping business problems to data science problems
  • The 5 V’s of big data
  • The current state and future of machine learning and AI

III. Q&A / Break (10:00am – 10:15am)

IV. The Data Science Process (10:15am – 11:00am)

  • Ask an interesting question
  • Get the data
  • Explore the data
  • Model the data (including comparison of regression and classification problems)
  • Communicate and visualize the results
  • Walkthrough of process using a real world DataSeer data science consulting project
  • Technology overview – from Azure to Hadoop to RStudio

V. Q&A / Break (11:00am – 11:15am)

VI. Linear Regression I – ‘Breaking Open the Blackbox’ (11:15am – 12:00nn)

  • Recognizing regression model applications in business
  • Why is it called ‘ordinary least squares’ (OLS) regression?
  • Comparison of OLS with Least Absolute Deviations (LAD) regression
  • A ‘simple’ linear regression model calculated from first principles using Solver
  • A ‘multiple’ linear regression model calculated from first principles using Solver
  • Caution on extrapolating outside the range of provided data
  • Using a linear regression model to make predictions
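
As a rough illustration of the OLS versus LAD comparison in this session, the Python sketch below fits both models to a small made-up dataset containing one outlier. It is not the course's Solver workbook: the data, the function names and the brute-force LAD search (which relies on the standard result that an optimal LAD line passes through at least two data points) are assumptions chosen for illustration.

```python
from itertools import combinations

def squared_loss(params, xs, ys):
    """Sum of squared residuals: the quantity OLS minimizes."""
    b0, b1 = params
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

def absolute_loss(params, xs, ys):
    """Sum of absolute residuals: the quantity LAD minimizes."""
    b0, b1 = params
    return sum(abs(y - (b0 + b1 * x)) for x, y in zip(xs, ys))

def fit_ols(xs, ys):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fit_lad(xs, ys):
    """Brute-force LAD: try every line through two data points, keep the best."""
    best_params, best_loss = None, float("inf")
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2:
            continue  # a vertical line cannot be written as y = b0 + b1 * x
        slope = (y2 - y1) / (x2 - x1)
        candidate = (y1 - slope * x1, slope)
        loss = absolute_loss(candidate, xs, ys)
        if loss < best_loss:
            best_params, best_loss = candidate, loss
    return best_params

def predict(params, x):
    """Use fitted (intercept, slope) to predict y for a new x."""
    b0, b1 = params
    return b0 + b1 * x

# Hypothetical data: the last point is an outlier.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 30]

ols = fit_ols(xs, ys)
lad = fit_lad(xs, ys)
print("OLS (intercept, slope):", ols)
print("LAD (intercept, slope):", lad)
print("Predictions at x = 6:", predict(ols, 6), predict(lad, 6))
```

On this data the single outlier pulls the OLS line well away from the other four points, while the LAD line still passes through them. Predicting at x = 6 also steps outside the range of the data, which is where the caution on extrapolation applies.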

VII. LUNCH (12:00nn – 1:00pm)

VIII. Workshop: Team Activity – Fitting a Regression Model to Business Data (1:00pm – 2:15pm)

IX. Machine Learning Fundamentals and Linear Regression II (2:15pm – 3:00pm)

  • What is a machine learning model?
  • What is a test and training dataset?
  • Polynomial regression
  • Variable transformations
  • Feature selection and feature engineering
  • Overfitting and underfitting

X. Q&A / Break (3:00pm – 3:15pm)

XI. Workshop: Apply Your Regression Skills in a Kaggle Style Competition (3:15pm – 5:00pm)

  • In this workshop, small groups use their new regression skills to build models on real business data with the aim of achieving the lowest possible root mean square error on a hold out test dataset

XII. Workshop Feedback, Presentation from Winning Model and Day 1 Wrap Up (5:00pm – 5:15pm)

Day 2

I. Introduction to Azure ML Studio (9:00am – 9:45am)

  • Azure ML Studio – Overview, capabilities, limitations
  • Business considerations – data center locations, data confidentiality, cost
  • Connecting ML Studio to a data source
  • Exploring and pre-processing data in ML Studio
  • Running experiments and setting up workflows
  • Supervised and unsupervised learning in ML Studio
  • Deploying and productionizing an ML Studio model as a service

II. Workshop: Team Activity – Fitting Yesterday’s Regression Model in Azure ML Studio and Comparing Results (9:30am – 10:30am)

III. Q&A / Break (10:30am – 10:45am)

IV. Overview of Other Regression Models and Fitting in ML Studio (11:15am – 12:00nn)

  • Decision trees
  • Random forest
  • Neural networks
  • Parametric and non-parametric models
  • Comparing models in Azure ML Studio

V. LUNCH (12:00nn – 1:00pm)

VI. Logistic Regression I – ‘Breaking Open the Blackbox’ (1:00pm – 1:45pm)

  • Recognizing classification model applications in business
  • True positives, false positives, true negatives and false negatives
  • Walkthrough of a classification model case study from business
  • What does ‘maximum likelihood’ mean?
  • A logistic regression model calculated from first principles using Solver
  • Using a logistic regression model to make probability predictions
  • Converting probability predictions into labelled predictions
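
To make the 'maximum likelihood' idea in this session concrete, here is a minimal Python sketch that fits a one-variable logistic regression by repeatedly nudging the coefficients in the direction that raises the log-likelihood. It substitutes plain gradient ascent for Excel Solver, and the data, learning rate, step count and 0.5 threshold are all assumptions for illustration rather than course material.

```python
import math

def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, xs, ys):
    """Log of the probability the model assigns to the observed 0/1 labels."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def fit_logistic(xs, ys, learning_rate=0.1, steps=5000):
    """Gradient ascent on the average log-likelihood; returns (b0, b1)."""
    n = len(xs)
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        errors = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        b0 += learning_rate * sum(errors) / n
        b1 += learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
    return b0, b1

def to_label(probability, threshold=0.5):
    """Convert a probability prediction into a labelled (0/1) prediction."""
    return 1 if probability >= threshold else 0

# Hypothetical data: x could be number of site visits, y whether a purchase occurred.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(xs, ys)
probabilities = [sigmoid(b0 + b1 * x) for x in xs]
labels = [to_label(p) for p in probabilities]
print("Coefficients:", b0, b1)
print("Log-likelihood before and after fitting:",
      log_likelihood(0.0, 0.0, xs, ys), log_likelihood(b0, b1, xs, ys))
print("Labelled predictions:", labels)
```

The threshold is a business decision as much as a statistical one: lowering it below 0.5 labels more cases as positive, trading extra false positives for fewer false negatives.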

VII. Q&A / Break (1:45pm – 2:00pm)

VIII. Comparing Classification Models and Logistic Regression II (2:00pm – 2:30pm)

  • Assessing model accuracy
  • True positives, false positives, true negatives and false negatives
  • The Receiver Operating Characteristic (ROC) curve and Area Under the Curve (AUC)
  • A brief introduction to other classification models – Decision trees, support vector machines, gradient boosting and neural nets
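
The accuracy measures listed above can be sketched in a few lines of Python. The example below counts true and false positives and negatives at a chosen threshold, then computes AUC using its pairwise interpretation: the share of positive/negative pairs in which the positive case receives the higher score, with ties counted as half. The labels, scores and function names are made up for illustration.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return (tp, fp, tn, fn) for probability predictions at a given threshold."""
    tp = fp = tn = fn = 0
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def auc(y_true, y_prob):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [p for y, p in zip(y_true, y_prob) if y == 1]
    negatives = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical actual labels and model probabilities.
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.6, 0.9]

tp, fp, tn, fn = confusion_counts(y_true, y_prob)
accuracy = (tp + tn) / len(y_true)
print("TP, FP, TN, FN:", tp, fp, tn, fn)
print("Accuracy:", accuracy)
print("AUC:", auc(y_true, y_prob))
```

Accuracy depends on the threshold chosen, whereas AUC summarizes ranking quality across all thresholds, which is one reason the two can disagree when comparing models.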

IX. Workshop: Apply Your Classification Skills in a Kaggle Style Competition (2:30pm – 5:00pm)

  • In this workshop, small groups use their new classification skills to build models on real business data in Azure ML Studio with the aim of achieving the lowest possible root mean square error on a hold out test dataset

X. Workshop Feedback, Presentation from Winning Model and Day 2 Wrap Up (5:00pm – 5:15pm)

Day 3

I. Decision Trees (9:00am – 9:45am)

  • Introduction to decision trees
  • Advantages and disadvantages compared to statistical models
  • Decision tree algorithms
  • Prediction using decision trees
  • Example using business data
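
As a hedged illustration of how a decision tree algorithm chooses a split, the sketch below scores every candidate threshold on one numeric feature by weighted Gini impurity and keeps the lowest. The outline does not say which algorithm the course teaches, so this CART-style criterion, the customer ages and the purchase labels are all assumptions.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def best_split(xs, ys):
    """Return (threshold, weighted_gini) for the best split of the form x <= threshold."""
    values = sorted(set(xs))
    # Candidate thresholds are midpoints between neighbouring distinct values.
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    n = len(xs)
    best_threshold, best_score = None, float("inf")
    for t in thresholds:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold, best_score = t, score
    return best_threshold, best_score

def predict(x, threshold, left_label, right_label):
    """Prediction from a one-split tree (a 'decision stump')."""
    return left_label if x <= threshold else right_label

# Hypothetical data: customer age and whether they bought the product.
ages = [22, 25, 30, 35, 40, 50]
bought = [0, 0, 0, 1, 1, 1]

threshold, score = best_split(ages, bought)
print("Best threshold and weighted Gini:", threshold, score)
print("Prediction for a 45 year old:", predict(45, threshold, 0, 1))
```

A full tree repeats this search on each resulting subset, across all features, until a stopping rule is met.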

II. Q&A / Break (10:15am – 10:30am)

III. Workshop: Team Activity – Decision Trees (10:00am – 11:15am)

IV. Unsupervised Learning and PCA (11:15am – 12:00nn)

  • Unsupervised learning models
  • Introduction to Principal Components Analysis
  • Visualizing PCA results
  • Interpreting PCA results
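
For readers who want to see the mechanics behind Principal Components Analysis, here is a small dependency-free Python sketch. It centers a two-feature dataset, builds the covariance matrix, finds the first principal component by power iteration, and reports the share of variance that component explains. The data points and function names are invented for illustration; in practice a library routine (for example `prcomp` in R) would be used instead.

```python
import math

def covariance_matrix(rows):
    """Return (covariance matrix, mean-centered rows) for a list of numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, centered

def mat_vec(m, v):
    """Multiply a square matrix by a vector."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def first_component(cov, iterations=200):
    """Power iteration: returns (unit direction, variance along that direction)."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = mat_vec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
    return v, eigenvalue

# Hypothetical data: two correlated customer attributes on similar scales.
rows = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]

cov, centered = covariance_matrix(rows)
pc1, variance_pc1 = first_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained_ratio = variance_pc1 / total_variance

# Project each observation onto the first component (one score per row).
scores = [sum(c[j] * pc1[j] for j in range(len(pc1))) for c in centered]
print("First component direction:", pc1)
print("Share of variance explained:", explained_ratio)
```

When the first component explains most of the variance, the two original columns can be replaced by the single column of scores with little loss of information, which is the sense in which PCA reduces dimensionality.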

V. LUNCH (12:00nn – 1:00pm)

VI. Workshop: Team Activity – PCA with a Customer Demographics Dataset (1:00pm – 2:45pm)

VII. Fundamentals of Big Data Engineering (2:45pm – 3:30pm)

  • Introduction to distributed computing
  • MapReduce
  • Hadoop and HDFS
  • Hive
  • Mahout
  • Spark
  • Event ingestion and stream processing
  • How will things change as IoT ramps up?
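
The MapReduce pattern listed above can be previewed on a single machine. The Python sketch below counts flights per destination in three phases (map, shuffle, reduce); it only mimics the programming model and does not use Hadoop, and the flight records and field names are hypothetical rather than taken from the course's airlines dataset.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (key, value) pair for each input record."""
    yield record["dest"], 1

def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the grouped values for one key into a single result."""
    return key, sum(values)

# Hypothetical flight records.
flights = [
    {"flight": 428, "dest": "DFW"},
    {"flight": 1121, "dest": "ATL"},
    {"flight": 1700, "dest": "DFW"},
    {"flight": 2280, "dest": "MIA"},
    {"flight": 971, "dest": "ATL"},
    {"flight": 1542, "dest": "DFW"},
]

mapped = [pair for record in flights for pair in map_phase(record)]
grouped = shuffle(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(counts)
```

In a real cluster the map and reduce functions run in parallel on different nodes, each working on its own slice of the data, and the framework handles the shuffle across the network.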

VIII. Q&A / Break (3:30pm – 3:45pm)

IX. Workshop: Big Data Engineering (3:45pm – 4:45pm)

  • In this workshop, small groups will create and query a Hadoop cluster

X. Workshop Feedback, Course Wrap Up and Awarding of Certificates (4:45pm – 5:15pm)