Are you realising the full potential of Microsoft SQL Server?

Thursday, February 23, 2017

One of our Sydney consultants, Dawie Kruger, shares some of his thoughts on the features available in Microsoft SQL Server that he feels are not being fully utilised by users. 

I was privileged to attend Microsoft Ignite 2017 this year on the Gold Coast. There was considerable talk about the new features offered in SQL Server 2016 Service Pack 1. If you are not familiar with this yet, Microsoft is releasing over 15 Enterprise Edition features into the SQL Server Standard and Express editions. These features include In-Memory OLTP, Partitioning, Compression and additional security features like Always Encrypted.



These are some great new features, and my fellow attendees agreed with me that this is a fantastic step forward.

One of the sessions I attended was Explore In-Memory OLTP Architectures and Customer Case Studies, presented by Sunil Agarwal (Principal Program Manager at Microsoft Corporation). One of the first questions in this session was “Who is currently using In-Memory or Compression in SQL Server?” I was surprised by how few hands were raised.

This got me thinking, “How many companies are running the latest versions of software but not utilising them to their full potential?” At one of my previous clients, the CIO embarked on a strategy to identify how the existing applications could be used to their full potential as an alternative to upgrading to the “latest and greatest”. We found that in most cases features and capabilities were not being utilised, and once configured properly they had such a significant impact on performance and operational support that the urgency of upgrading the applications was reduced.

In other cases, I have seen how applying features like Compression or Partitioning to tables in Microsoft SQL Server not only improved performance and stability but also won back gigabytes of hard disk space, which is still a hot commodity on today’s VM hosts.

In this blog I will focus on two Microsoft SQL Server technologies, Compression and In-Memory OLTP, that I feel are underutilised in the DW/BI space.

Compression

Table compression in SQL Server has been available in two flavours, row and page, since SQL Server 2008, and in my personal experience it is always something to consider. The example below is from a site where compression resulted in an average 38% reduction in disk space required by the solution.

We did see a slight increase in the build time of some of the compressed tables during the daily ETL run; however, the improved query speed more than compensated for this, with cube processing times decreasing by 50%.
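If you want to gauge what compression would buy you on one of your own tables, a minimal sketch along these lines can help (the connection string and the dbo.FactSales table are placeholders, not the client solution described above):

import pyodbc

# Placeholder connection details -- adjust for your environment.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=MyDW;Trusted_Connection=yes;",
    autocommit=True,
)
cursor = conn.cursor()

# Estimate the space saving before committing to the change.
cursor.execute("""
    EXEC sp_estimate_data_compression_savings
        @schema_name      = 'dbo',
        @object_name      = 'FactSales',
        @index_id         = NULL,
        @partition_number = NULL,
        @data_compression = 'PAGE';
""")
for row in cursor.fetchall():
    print(row)

# Apply page compression by rebuilding the table.
cursor.execute("ALTER TABLE dbo.FactSales REBUILD WITH (DATA_COMPRESSION = PAGE);")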

In-Memory OLTP

In-Memory OLTP is a technology that allows you to store database tables in server memory (RAM). A demonstration at Ignite highlighted the improvements made to in-memory tables in SQL Server 2016 and the benefits to be gained if they are used correctly. In SQL Server 2016 there are three durability levels for In-Memory tables, which determine how well the data survives a failure:

  • Full Durability – Data is stored in memory with every transaction committed to disk. (Full Redundancy)

  • Delayed Durability – Data stored in memory with a delay in writing to disk.

  • Non-Durable - Data stored in memory only.


These options allow for multiple different uses in handling and processing data. For example, instead of adding pressure to TempDB with temp tables, why not create a non-durable in-memory table? Or why not land data in an in-memory table, allowing you to load multiple sources at the same time into a single table without locking or latching?
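A hedged sketch of the non-durable staging idea looks like this (table and column names are invented, and the database needs a memory-optimised filegroup before this will run):

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=MyDW;Trusted_Connection=yes;",
    autocommit=True,
)
cursor = conn.cursor()

# A non-durable (SCHEMA_ONLY) memory-optimised staging table:
# the schema survives a restart, the data does not.
cursor.execute("""
    CREATE TABLE dbo.StageSales_InMem
    (
        SaleID     BIGINT        NOT NULL PRIMARY KEY NONCLUSTERED,
        CustomerID INT           NOT NULL,
        Amount     DECIMAL(18,2) NOT NULL,
        LoadedAt   DATETIME2     NOT NULL
    )
    WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_ONLY);
""")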

To see the full benefits offered by In-Memory OLTP there are some development practices to take into consideration. At present you get substantially higher performance gains when you use natively compiled stored procedures to load your data into In-Memory tables. On a test bench we saw record inserts jump from 3,000 to 8,000 per second simply by converting the SQL table to In-Memory, then jump again to 12,000 inserts per second by natively compiling the stored procedures.
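For illustration, a minimal natively compiled insert procedure for the staging table sketched above might look like the following (continuing with the same pyodbc cursor; names remain hypothetical):

cursor.execute("""
    CREATE PROCEDURE dbo.usp_InsertStageSale
        @SaleID BIGINT, @CustomerID INT, @Amount DECIMAL(18,2)
    WITH NATIVE_COMPILATION, SCHEMABINDING
    AS
    BEGIN ATOMIC WITH (TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'English')
        INSERT INTO dbo.StageSales_InMem (SaleID, CustomerID, Amount, LoadedAt)
        VALUES (@SaleID, @CustomerID, @Amount, SYSUTCDATETIME());
    END;
""")

# Example call from the loader.
cursor.execute("EXEC dbo.usp_InsertStageSale @SaleID = ?, @CustomerID = ?, @Amount = ?", 1, 42, 99.95)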

One aspect that is truly exciting is that Table Compression and In-Memory Tables are compatible technologies that can be used in conjunction with each other. I strongly encourage you to explore them if you aren’t already leveraging this functionality.


A disclaimer here is that the figures seen in this blog are indicative examples only and are dependent on the type and volume of data used. Just as you would investigate and test practices like indexing strategies, these features should be properly tested before being implemented in production systems.







A Crash Course in Survival Analysis: Customer Churn (Part III)

Friday, February 17, 2017

Joshua Cortez, a member of our Data Science Team, has put together a series of blogs on using survival analysis to predict customer churn. This is the third and final blog of this series.
Survival Regression

Knowing how long customers stay is all well and good, but what if we want to know the factors that influence churn? What if we want to predict how long it will be before a given customer churns? Survival regression helps us do just that.

Here we'll specifically focus on Cox regression, which uses Cox's Proportional Hazards Model to perform survival regression. The main insight the Cox regression model gives us is its coefficients. The exponentials of the coefficients correspond to the hazard ratios. What does a hazard ratio mean? It is a relative measure of the instantaneous rate of failure. Don't worry if that sounds confusing; it's better to consider an example.

For example, let's say we've fitted a Cox regression model to our example telco data set, and one of the variables is gender. This variable takes on two values: 1 for male, and 0 for female. What does it mean if the hazard ratio of gender is 1.10? It means that at any time, whether it is 6 months since signing up or 12 months since signing up, males are 10% more likely to churn than females.

The hazard ratio is a relative measure, not an absolute measure. So it should be read as how males fare in relation to females (or vice versa).
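In symbols, this is Cox's proportional hazards model, in which the hazard ratio for a variable is the exponential of its fitted coefficient:

$$ h(t \mid x) = h_0(t)\,\exp(\beta_1 x_1 + \cdots + \beta_p x_p), \qquad \mathrm{HR}_j = \exp(\beta_j) $$

so a gender coefficient of about 0.095 corresponds to the hazard ratio exp(0.095) ≈ 1.10 in the example above.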

Proportional hazards assumption

Cox regression has a very important assumption, the proportional hazards assumption. The variables in the model should first be tested to check whether or not they satisfy this assumption.

It means that the hazard ratio of every variable should be constant over time. For example, say we have the variable "Dependents", which is 1 if the customer has dependents (e.g. children) and 0 if he/she doesn't, and that those with dependents are 30% more likely to churn than those without. If the proportional hazards assumption holds, that 30% applies at any time. It shouldn't happen that from the first month up to the third month those with dependents are 30% more likely to churn, and then from the third month to the sixth month it changes to 15%. The hazard ratio should be more or less the same across time.


It's possible to test this assumption using a statistical test in the survival package in R.
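The post runs this test with R's survival package; as an illustrative equivalent only, here is a minimal sketch in Python using the lifelines library (it assumes the telco data has already been prepared with numeric 0/1 covariates, a 'tenure' duration column and a 'Churn' event column):

import pandas as pd
from lifelines import CoxPHFitter
from lifelines.statistics import proportional_hazard_test

# Assumed pre-processed copy of the telco churn data.
df = pd.read_csv("telco_churn_prepared.csv")

cph = CoxPHFitter()
cph.fit(df, duration_col="tenure", event_col="Churn")

# Test the proportional hazards assumption for each covariate.
results = proportional_hazard_test(cph, df, time_transform="rank")
results.print_summary()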

The test says that only the following variables satisfy the proportional hazards assumption: PaperlessBilling, SeniorCitizen, Dependents, and Gender. Let's now use these variables to fit a model.

Side note: it's unfortunate that we have to leave out the other variables, but there are other methods (stratified Cox regression, Cox regression with time-dependent covariates, pseudo-observations) that can incorporate variables that don't satisfy the proportional hazards assumption. These, however, are outside the scope of these blog posts.

R's Cox regression results

After filtering the variables, we can (finally) fit a model and interpret its results. Here are the results from calling the coxph function in R.
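For readers following along in Python rather than R, the equivalent fit with only the filtered variables looks something like this (continuing from the lifelines sketch above; column names are assumed):

model_cols = ["tenure", "Churn", "PaperlessBilling", "SeniorCitizen", "Dependents", "gender"]

cph = CoxPHFitter()
cph.fit(df[model_cols], duration_col="tenure", event_col="Churn")

cph.print_summary()        # coefficients, exp(coef) = hazard ratios, p-values
print(cph.hazard_ratios_)  # hazard ratio for each covariate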



Here's what the Cox regression model tells us:

  • Gender isn't a good indicator of churn. This confirms what we saw earlier when we compared the survival curves between female and male customers. They're equally likely to churn.

  • Senior citizenship, having dependents, and having paperless billing are indicative of churn. We can also quantify their effects.

    1. Senior citizens are 30% more likely to churn than non-senior citizens.

    2. Customers without dependents are twice as likely to churn as those with dependents. This also validates the survival curves from earlier. The difference is that now we have a number to compare the two groups.

    3. Customers with paperless billing are 1.8 times more likely to churn than those without paperless billing.

Taking these insights to action

We saw three significant factors driving churn. We can leave out the senior citizen factor, which leaves us with two to examine. Senior citizens might be more at risk of churning not because they're willingly opting out of the subscription service, but because they're passing away. Furthermore, the hazard ratio of the senior citizen factor (30%) is lower than that of the other two anyway.

The paperless billing factor is surprising, since you'd expect customers to prefer the convenience and speed of receiving their bills online. There are various possibilities as to why this is happening, and the business should investigate whether its paperless billing processes are properly implemented. Also, as the majority of customers are on paperless billing, this is an important issue.



We see from the graph below that there are more than twice as many customers without dependents as with dependents. At the same time, those without dependents are twice as likely to churn. This means that there should be more effort to retain this segment of customers.

These are just some of the ways that survival analysis can be used to address business problems. Barry Leventhal has recommended other use cases:

  • Business Planning
    • Forecast the monthly number of lapses and use it to monitor current lapse rates.

  • Lifetime Value (LTV) prediction
    • Derive LTV predictions by combining expected survival times with monthly revenues.

  • Active customers
    • Predict each customer's time to next purchase, and use it to identify "active" vs. "inactive" customers.

  • Campaign evaluation
    • Monitor the effects of campaigns on survival rates.

With that, I hope this branch of statistics can be useful to a problem that you're solving. 


Additional Resources

The following resources were extremely helpful in making these posts.  Check them out if you want to learn more about survival analysis.

Microsoft Ignite

Friday, February 17, 2017

Two of our consultants, Michael Betterton and Dawie Kruger, are attending Microsoft Ignite on the Gold Coast this week, and in this blog they share some of the insights from the event.

The opening keynote, delivered by Scott Guthrie, VP of Cloud and Enterprise at Microsoft, carried a strong cloud message, with a key takeaway being Microsoft's focus on hybrid cloud/on-premises solutions. Themes of cloud, analytics, the Internet of Things and new infrastructure approaches prevailed. One of the case studies discussed was Rolls Royce, who have an IoT solution on all of the airplane engines they make that streams data to Azure and uses Azure ML to predict engine failure, displaying the results in Power BI.

The keynote finished with a demo using Azure Cognitive Services. The demo had two components: the first was real-time facial recognition. The second was the serverless cognitive services offering, which is a lego-brick style drag-and-drop code editor. In the demo the presenter built sentiment analysis of tweets that fed into a Dynamics instance in Azure, creating cases for negative tweets, all without provisioning a single VM.

Also discussed was the vision for SQL Server and beyond, presented by Lindsay Allen, the program manager for SQL. The approach is to modernise business apps by moving them to the cloud, either through IaaS (infrastructure as a service) or through PaaS (platform as a service). The PaaS offering is hosting SQL databases in the cloud. A new offering from Microsoft being released soon is the “CL Series Databases” PaaS, which lets you move on-premises databases into a single elastic cluster in the cloud with little effort. This cluster can scale to 150 PB. The smart database features in Azure have also been updated with the ability to suggest index creation/drops, schema issues and parameterisation of queries for PaaS databases.

Understanding Windows Containers - using containers you are able to create a base OS or software-layer environment for developers that can be deployed to UAT and production servers, and they help eliminate the discrepancies that cause applications to crash or “work differently” in prod. It takes the concept of virtualisation to the software layer instead of the server layer. Some benchmarks saw new containers being spun up in less than 3 seconds using PowerShell scripts. Containers are not meant to be used for databases, as they are blown away and have very small footprints; however, they do make for some interesting concepts for app development and deployment. More on containers can be read here.

There were a few sneak peeks of upcoming features in SQL coming in June. In no particular order they are:

  • High Availability, Disaster Recovery and Availability Groups will be available across operating systems (Windows + Linux), with any combination of OSs.

  • Clustered Columnstore indexes available for LOB data types.

  • The ability to pause and resume online index rebuilds.

  • The addition of machine learning to the query optimiser to attempt to learn from its mistakes. This is a first step only, and the key implementation here is Table Valued Functions and assessing the size of the result set.

  • Python integration - apparently R wasn’t enough for Microsoft, and soon Python will have full integration with the database engine.

  • Graph Data Structures – this involves full support for CRUD of graphs and database-engine integration. There was only a very short demo, but queries over the graph structure were written in normal T-SQL with its own syntax extensions. It will be interesting to see applications of this in the future.

Look out for the next series of blogs that go into more detail on some of these topics.

A Crash Course in Survival Analysis: Customer Churn (Part I)

Wednesday, February 01, 2017

Joshua Cortez, a member of our Data Science Team, has put together a series of blogs on using survival analysis to predict customer churn. This is part one of the blog series.

Introduction

Customer churn is familiar to many companies offering subscription services. Simply put, customer churn is the event of a customer opting out of their subscription. They may do so for a variety of reasons: dissatisfaction with their plan because of, say, consistently poor reception on a mobile phone network, the allure of better subscription packages or plans from competitors, and so on.

 

(1)

Businesses want to understand how and why their customers churn to improve their profits and deliver better services. In this blog post series, we'll explore a branch of statistics called survival analysis to uncover insights that will be useful to understand and curb churn. We'll use a churn dataset from a blog in the IBM Watson analytics community that describes a fictitious telco's customers and how long they stayed before they churned.

Survival Analysis

Survival analysis has traditionally been used in medicine and the life sciences to analyse how long it takes before a person dies - hence the "survival" in survival analysis. The same techniques, however, can be used to model other events that organisations care about, such as the failure of a machine, or customer churn. Okay, cool. But what kinds of insights can we get from survival analysis?

We'll talk about two main ideas in more detail in future blog posts: survival curves (in part II), and survival regression (in part III). We'll discuss what they are, and what kinds of insights they bring to the table.

For today, an introduction to these concepts and an overview of our test dataset.

1. Survival Curves


An example survival curve – by charting the results we can visualise the changes over time and likelihood of churn (2).

What we can do with it: 

i. Show how the likelihood of customer churn changes over time.
ii. Determine the optimal intervention point.

Questions it can answer:

i. How many years/months on average do our customers stay?

ii. How long do male customers stay compared to female customers?
iii. Does our understanding of our customer lifecycle match reality?
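Survival curves are covered properly in part II, but as a quick hedged preview, here is how one can be produced from the telco data in Python with the lifelines library (the file name and the Yes/No churn encoding are assumptions):

import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

df = pd.read_csv("telco_churn.csv")                # assumed local copy of the IBM Watson telco dataset
df["Churn"] = (df["Churn"] == "Yes").astype(int)   # assuming churn is recorded as Yes/No

kmf = KaplanMeierFitter()
kmf.fit(durations=df["tenure"], event_observed=df["Churn"])
kmf.plot_survival_function()   # probability a customer is still subscribed after t months
plt.show()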

2. Survival Regression

Survival regression allows us to apply a model to the survival analysis to predict when an event is likely to occur.

What we can do with it: 

i. Model the relationship between customer churn, time, and other customer characteristics.

Questions it can answer:

i. What's the probability that this customer who is a female non-senior citizen with dependents will stay for 2 years?
ii. What are the significant factors that drive churn?

Examples of how survival analysis can be applied to other industries beyond telecommunications (2)
- Insurance - time to lapsing on policy
- Mortgages - time to mortgage redemption
- Mail Order Catalogue - time to next purchase
- Retail - time till food customer starts purchasing non-food
- Manufacturing - lifetime of a machine component
- Public Sector - time intervals to critical events

A worked example

Let's get started by examining our sample churn dataset. Our dataset has 7,043 customers and 20 variables. Most of the variables are categorical and describe attributes of a customer.

Categorical Variables:

- Gender, SeniorCitizen, Partner, Dependents, PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling, PaymentMethod, Churn

Numeric Variables:

- Tenure, MonthlyCharges, TotalCharges


Here's a simple exploratory plot to get to know our data: a histogram of monthly charges. We can see how monthly charges are distributed across customers. A large proportion of customers are paying around $20 per month.
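If you are following along in Python, a quick sketch of this plot (using the same assumed CSV as the earlier preview) is:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("telco_churn.csv")   # assumed local copy of the IBM Watson telco dataset
df["MonthlyCharges"].plot.hist(bins=30)
plt.xlabel("Monthly charges ($)")
plt.ylabel("Number of customers")
plt.show()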

If you want to see more of the data, you can download the csv file from here.

In the next post we’re going to talk about survival curves and apply these to our dataset.


Sources:

(1) http://www.superoffice.com/blog/wp-content/uploads/2015/05/reduce-customer-churn.png
(2) http://www.barryanalytics.com/Downloads/Presentations/Survival Analysis.pdf




Australian Utility Week - Key Takeaways

Tuesday, December 06, 2016

Our Utility Practice Lead, Rhubesh Gommiah, attended The Australian Utility Week 2016 conference held in Sydney last week. 

The conference focused on the theme of ‘Building the Digital Utility’. In this post, Rhubesh shares some of his key takeaways from the event.

This was my third consecutive attendance at this event and this year Matt Feltham from our NSW/QLD region co-presented with Ivan Beirne from Unitywater as part of the ‘Analytics in Action’ track. Their presentation focused on the tangible benefits delivered by the business intelligence solution that Altis has assisted Unitywater in implementing. Ivan gave some great examples of how the solution is being used in his area to help improve asset performance and utilisation and better target asset investment.

As in the previous years, there were multiple tracks available as part of the two-day conference programme. The sessions that I ended up attending covered a broad range of topics from the Internet of Things (IOT) to water smart meters and the decentralisation of the energy sector. Some of the presentations that caught my attention included:

  • A long-term Altis client, TransGrid, demonstrating how they are experimenting with Augmented Reality technologies to facilitate the work of field employees and improve their safety.

  • The CEO of another Altis client, Vector, describing how his organisation is embracing the change in the energy sector and adopting new technologies like solar, battery storage and electric vehicles instead of being disrupted by these technologies.

  • A university researcher highlighting the future potential of water smart meters from a customer and smart home perspective. Interestingly, some of the use cases highlighted such as leak detection and appliance fingerprinting are already being trialled by one of our Victorian water clients as part of a project that Altis is currently involved in.

There were also a number of start-ups present at the event, with most of these new entrants in the sector being technology companies which have been capitalising on opportunities around IOT and data analytics. For example, a new electricity retailer in NSW is offering an integrated solution of advanced pool automation services and electricity to swimming pool owners and operators. Their cloud-based solution integrates data from installed sensors with AEMO and weather data to optimise the operation of pools and reduce electricity costs for their customers. Another start-up is building a database of addresses with rooftop solar after having developed a machine learning algorithm to scan satellite images and identify rooftops with solar panels.

For traditional utilities, embarking on the digital journey clearly offers a range of opportunities for innovation. In particular, by using analytics to capitalise on the ever-increasing amounts of data that they are collecting, they can re-invent the way they engage with their customers and manage their assets. 

To find out more about how the changing landscape of data and analytics can benefit your utility organisation, please contact us.


AWS re:Invent

Tuesday, December 06, 2016

This week, Guillaume Jaudouin, our AWS Practice Lead, takes us through some of the announcements made by AWS during the re:Invent week.


As in previous years, the AWS re:Invent week came with a host of new announcements; you can get the complete list here.

At Altis Consulting we focus on data and what we can do with it, so don’t expect full hardware specs and comparisons of the new EC2 instance families that have just been released. If you are still keen to talk gear, drop me an email and we can chat when no one is watching. Instead, in this blog we will talk about two new services arriving in the Big Data and Analytics arena.

The first one, AWS Athena, is already available in a few regions in the US. This service allows direct querying of data stored in S3. Queries are written in standard SQL, and supported file formats include CSV, JSON and Parquet. Athena integrates with AWS QuickSight, or any JDBC-compatible BI tool, for visualisation of the output.
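As a hedged sketch of what that looks like in practice (the bucket, database and table names below are made up), a standard SQL query can be submitted over files in S3 with a few lines of boto3:

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Run a standard SQL query directly over data sitting in S3.
response = athena.start_query_execution(
    QueryString="SELECT customer_id, SUM(amount) AS total FROM sales_parquet GROUP BY customer_id",
    QueryExecutionContext={"Database": "my_data_lake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/queries/"},
)
print(response["QueryExecutionId"])   # poll get_query_execution / get_query_results with this id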

In conversations about data lakes, customers often have the following question: “Once I’ve got all my data stored in my data lake on S3, how do I access it?” The answer is often that further processing needs to be done on EMR, or that data can be pushed to Redshift and then analysed with a data discovery tool. Well, AWS Athena will certainly change this conversation.

Of course there is still a place for data analytics heavyweights like Redshift and EMR; Athena certainly won’t replace either of them. Rather, they work together in a bi-modal way, with Mode 1 being addressed by EMR and Redshift and Mode 2 supported by Athena for ad-hoc querying of the data. If you are unfamiliar with bi-modal BI, I suggest you check out point 6 of this blog post by my colleague James Black.

(source: twitter.com/AWSreinvent) 

The other service that was announced is AWS Glue. The service is not available yet, but it is presented as ETL as a service. There is not a lot of information available at the moment, but it seems that ETL processes can be created without the need for scripting, although the generated scripts can be modified if needed. As expected, it will integrate with most AWS data stores (S3, RDS, Redshift) and crawl them to generate a catalogue of the data available in your account.

(source: AWS.AMAZON.COM) 

These are exciting times as we can now envisage a complete fully-managed bi-modal big data and analytics platform on AWS – all of it without the need to spin up a single EC2 instance. 
One last thing: you can now move loads of data to AWS using a truck!

(source: AWS.AMAZON.COM) 

Fixed Price Agile Scrum for a Data Warehousing Project - Can it work?

Wednesday, November 30, 2016

Alok Joshi from our Melbourne office shares his insights on whether Agile Scrum can work on Fixed Price Data Warehouse projects.

I've often heard the argument that Agile Scrum cannot fundamentally work with a fixed price project due to the nature of the predetermined scope of work. Aside from the price, what are you actually fixing? Is it requirements or outcomes? Outcomes are subject to the interpretation and skill of the person or team tasked with completing the project, whereas requirements can be agreed and clarified. For example, if the requirement from accounting is to have a quarterly report that shows month-end expenses by department, is the outcome then to show the information in a bar chart? A tabular structure? A line graph? This can be clarified with accounting, but at the end of the day it is up to the team executing the work to provide the capability. This especially goes for more complex scenarios requiring specialised skills, like dimensionally modelling a data warehouse. A requirement may be to model the data to optimise for scalability and performance, and there could be a dozen ways to achieve this outcome. So let's talk about what we should actually be fixing here: the requirements to be met, and how the core components of Agile Scrum can be adapted.
Requirements and Product Backlog

The product backlog in a fixed price engagement is your scope of work, and in Agile Scrum this is your master list of requirements to be prioritised. What this means for fixed price is that all items in the product backlog need to be completed by the end of the project. This is in contrast to a scrum backlog, where the highest priority tasks are “picked” at the start of each sprint. Therefore, all components of the development lifecycle are captured here in the form of tasks (or stories if applicable), from requirements, analysis and design, to build, test and deploy. This is your work breakdown structure and your map to follow for the project.
 
In Agile Scrum, Product Owners have the opportunity to revisit the product backlog at any time and reprioritise the tasks that they want the team to tackle in the next sprint. So why not in a fixed price project? The only difference is that if new items are brought into the backlog and prioritised, or removed, this equates to a change in scope and the relevant change control processes would be engaged, i.e. a change request and subsequent agreed re-scoping. This shouldn’t be seen in a negative light, but as an essential step in ensuring customer requirements are met.
Planning and Estimation

In Agile Scrum projects, sprint planning is where the team comes together and decides, as a team, what is achievable for the duration of the sprint, with a sense of priority and guidance from the product owner. The team picks tasks from the top of the product backlog and estimates the effort required to complete them.
 
In a fixed price project, however, the sprint goals are predetermined, the backlog priorities are already set, and the tasks to be completed are already estimated to be feasible for the sprint duration. So why hold sprint planning at all? The simple answer is to create the opportunity to challenge the timeline, the work being performed and the outcomes the sprint is looking to achieve. Are the goals we set two months ago for this sprint still relevant? Are there more pressing items to address? It is better to answer these questions intermittently than at the end!
Showcases and Retrospectives

Showcases enable the team to demonstrate the capability that has been delivered in the sprint. This does not need to change in a fixed price project. The team can organise showcases so that whatever part of the project they are working on can be demonstrated to stakeholders. It is better to get feedback (good or bad) early, so you know what works and what needs to be addressed.
 
This is where retrospectives in Agile make the process of continual improvement within the team much more achievable. It is important for the team to reflect on what went well and what could have been done better. Holding these sessions is just as important in a fixed price engagement. If the processes or products being developed are found not to be working, it provides the opportunity once more for change control processes to kick in and allow the team to continue down the right path.
 
At the end of the day, it's about expectations that are agreed upon and being able to manage them as flexibly as possible when something does come up (and it will!). Having a framework to help keep everyone (delivery team, project manager, customers/stakeholders) on board and aligned makes for a much happier outcome.

What do you think about applying Agile Scrum to Fixed Price Data Warehouse projects? Have you used this approach? Share your thoughts.

Is my team going to make the playoffs?

Wednesday, October 26, 2016

Joshua Cortez, a member of our Data Science Team, looks at using analytics to predict sports leagues outcomes.

You are halfway through a sports season and you want to know where your team will finish. Here is a statistical way to determine that. This framework was inspired by a blog by Daniel Weitzenfeld and a paper by Baio and Blangiardo. They originally modelled football games, and I have created a generic framework that can be applied to any sports league. In this case I applied it to the current Philippine university basketball league.

Using Bayesian methods, I predict that DLSU is most likely to win the league and that UE and UP will finish 7th and 8th.  Below is the heat map representing the probabilities of where each team will finish.

Data: 

Scores of games from a season of Philippine collegiate basketball. Format is double round robin and half of the season’s schedule is finished. 

Framework for Modelling:

Scored points depend on each team’s offensive and defensive strengths. A team’s offence increases their own total points, while their defence decreases the total points of the opposing team. An additional factor, home-court advantage, is also considered; but in the Philippine collegiate basketball setting courts are neutral, so home-court advantage has zero effect.

The mathematical model relies on total points scored per team - no need for other statistics such as steals, rebounds, blocks etc. We can estimate offensive and defensive strengths based only on points scored.
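The original posts don't publish their code, but a minimal sketch of this kind of model in Python with PyMC might look like the following (the priors, the Poisson scoring assumption and the toy data are illustrative choices, not the author's exact model):

import numpy as np
import pymc as pm

# Toy results: team indices and final scores for a handful of played games.
home_idx = np.array([0, 1, 2, 3])
away_idx = np.array([4, 5, 6, 7])
home_pts = np.array([82, 75, 90, 68])
away_pts = np.array([77, 80, 85, 71])
n_teams = 8

with pm.Model() as model:
    # Latent offensive and defensive strength for each team.
    att = pm.Normal("att", mu=0.0, sigma=1.0, shape=n_teams)
    dfn = pm.Normal("dfn", mu=0.0, sigma=1.0, shape=n_teams)
    baseline = pm.Normal("baseline", mu=np.log(75.0), sigma=1.0)  # league-average scoring

    # Expected points: own offence pushes the score up, the opponent's defence pulls it down.
    mu_home = pm.math.exp(baseline + att[home_idx] - dfn[away_idx])
    mu_away = pm.math.exp(baseline + att[away_idx] - dfn[home_idx])

    pm.Poisson("home_score", mu=mu_home, observed=home_pts)
    pm.Poisson("away_score", mu=mu_away, observed=away_pts)

    trace = pm.sample(1000, tune=1000)

Sampling from the posterior gives a distribution of strengths per team, and simulating the unplayed fixtures from those draws yields the finishing-position probabilities shown in the heat map.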

Visualisation of Framework:



Alternative visualisation of framework

Results:

After fitting a Bayesian model to the data, the underlying offence and defence attributes (median) of each team are uncovered. We can see that DLSU’s dominance in the season is due to its offensive dominance.

Teams ranked by offence. There is a 95% chance that the true value lies within each bar.  

Teams ranked by defence. There is a 95% chance that the true value lies within each bar.

Using the uncovered offensive and defensive ratings for each team, we can simulate the second half of the season. The darker squares have higher probability.

It’s clear that DLSU has the highest chance of finishing the tournament in 1st place. UE and UP have the highest chances of finishing in either 7th or 8th place. 

Summary:  

This generic model can be applied to any sports league to predict the likelihood of where your favourite team will finish.








Six takeaways from the TDWI Conference

Wednesday, October 19, 2016

James Black from our Sydney office attended the TDWI conference "Analytics across the Enterprise" earlier this month in San Diego, California. Here he shares his key takeaways from the conference.

The main theme of the conference was developing insight-driven organisations that succeed in using analytics to create real business impact. 

 

As with all TDWI conferences, there was a myriad of courses on offer across a variety of learning tracks, from 'Getting Started with Analytics', to 'Managing Analytical Data' and 'Visualising and Communicating using Data' through to a first ever TDWI Data Scientist boot camp. 

Attending the conference was a rewarding experience – getting to hear and interact with the industry thought leaders who present the courses, and sharing experiences with my fellow attendees from the front line of Analytics and Data Warehousing. For me, the overarching theme of the courses I attended was the challenge for traditional DW/BI practitioners to adapt our structured methods and processes to the more fluid and changeable landscape of the Big Data world.

Here are my six key takeaways from the conference:

1. ‘Traditional’ DW/BI is still important, but it’s no longer enough

Data Warehousing and Business Intelligence as we know it – reports, dashboards, cubes, ETL, 3NF, Star Schemas – has been around for decades and is to some extent overshadowed these days by all the hip, exciting, new-fangled stuff going on related to Big Data. But it remains hugely important to how most organisations make business decisions.

However, many businesses are moving beyond conventional BI to gain a competitive edge, not merely by seeking to answer the ‘who’, ‘what’ and ‘when’ of DW/BI, but also the analytical questions of ‘what if..’, ‘what should..’ and ‘what will happen in the future’. This is the power of Advanced Analytics – analytics that predict what is likely to happen and future behaviours, and analytics that prescribe future courses of action.

2. The enterprise is no longer the centre of the data universe

The question of what data of value sits outside of an organisation’s core transactional and customer management systems is of course nothing new. But those halcyon days, where the biggest concern for Data Architects was how to get all that data residing in spreadmarts and Access databases into the Data Warehouse, are over.

Data that is required for Analytics is everywhere – inside and outside of the enterprise. The data internal to the enterprise remains important, but the data external to the enterprise is becoming just as important – web data, social media, open data and government datasets, text and sensor data, system logs and machine generated information. 

As well as the location of the data, the type of data available for analysis has also changed – not just greater volumes at a lower latency but non-relational, inconsistent and without metadata. Data that doesn’t fit neatly into the structured, architected models of our usual relational databases - streaming and telemetry data, social media data, JSON, XML, textual data. New types of databases are being used to store this non-relational information – Key Value, Document and Graph databases – the ‘No SQL’ databases.

3. A new paradigm for architecting and modelling BI solutions

So the data is not always relational, we won’t always know the exact format before we receive it, and the format may change – resulting in a shift in how we architect and model future DW/BI solutions.

Does unstructured data make data modelling impractical? Does NoSQL imply no data modelling? Do E-R and star schema models still matter? The answer to these questions is that data modelling is still an important process; data modelling for relational structures is not going away; but the data modelling process must change to keep pace with the rapidly evolving world of data.

In traditional BI, we model before storing data. For these new data sources, we can no longer model and architect all the data in advance - we will have to store the data first and then model it. This is the ‘Schema on Read’ paradigm versus the ‘Schema on Write’ process of traditional Data Warehousing.

The process of architecting DW/BI solutions also needs to become more adaptive. This means no ‘one size fits all’ architecture with rigid ‘enterprise’ standards but allowing greater access to the data at earlier stages of its processing i.e. rather than waiting for the data to be transformed and loaded to the DW, users get access to it in its rawer forms. 

While Data Architecture has usually involved control and centralisation, the requirements of Data Scientists are for empowerment and decentralisation. Architecture and governance remain hugely important but some happy medium is required in relation to rigour and security, so for example, sensitive customer data isn’t just dumped in to the data lake for anyone to access.

4. Big Data – moving beyond the hype and into the mainstream

Big Data has justified the hype and it was clear at the conference that many attendees were from organisations that were already utilising the data, technologies and techniques of Big Data. Big Data is not the goal, nor is it the question – it’s not data for the sake of data. Instead, Big Data enables business strategies through Analytics. 

However, to fully grasp the opportunities that Big Data may provide, businesses are moving beyond the 3 original “V”s of Big Data – Volume, Velocity and Variety. There are now additional “V”s that need to be taken into account:

  • Veracity – the data must be trustworthy and credible.

  • Visualisation – after processing the data, the importance of visualisation is that it makes the data comprehensible. This also doesn’t mean simply graphs and charts but visualisations that enable data discovery and exploration.

  • Value - Data in and of itself has no intrinsic value. The business must be able to use the data for Advanced Analytics in order to get value from it.

 

5. Hadoop is still the elephant in the room

A number of the TDWI courses focused on the dizzying array of technologies that have sprung up to enable the delivery of Big Data solutions. It is clear that Hadoop (and its huge ecosystem of components) is still the major platform for Big Data applications across both analytical and operational implementations. This is due to its storage and processing capabilities, high performance and availability, and competitive pricing.

Rick van der Lans, in his course on New Data Storage Technologies, stated that there are a number of barriers to widespread Hadoop usage: the complexity of the APIs for accessing data on Hadoop, which require programming skills (making Hadoop unsuitable for non-technical users); the low productivity of developing against Hadoop APIs; and limited tool support, as many reporting and analytics tools don’t support interfaces to Hadoop via MapReduce and HBase.

This has led to the growth of ‘SQL-on-Hadoop’ engines which, according to Rick, offer the performance and scalability of Hadoop with a programming interface – SQL – that is known to non-technical users, requires a lot less coding and is supported by many more reporting and analytical tools. The first SQL-on-Hadoop engine was Hive, and since then demand has grown, as has the choice of engines – Drill, Impala, Presto, Jethro, Splice and a myriad of other applications.

Whether it’s applications on Hadoop; alternatives to Hadoop such as Spark; high-performing SQL-based appliances like Greenplum, HANA or Netezza; or the large number of NoSQL products out there – the choice is seemingly endless and confusing. The recommendation coming out of the conference is to do your research into the technologies and undertake Proofs of Concept before settling on a Big Data technology. It’s very much a case of ‘caveat emptor’ – buyer beware – you don’t just want an application with a funky name; it has to meet the operational or analytical needs (or both) of your organisation, and there have to be resources available who have the skills to utilise it.

 

6. Stop being the Department of No and become the Department of Yes

So how does a BI department or team evolve to cope with the brave new world of Analytics? One of the complaints from analysts and data scientists is that the traditional method of BI delivery takes too long. In a world where we don’t always know the requirements, the model, or the solution until we have loaded the data, how does the BI team prevent itself from being the bottleneck?

‘Stop being the Department of No and become the Department of Yes’ was a great quote I heard from Mark Madsen, one of the TDWI course instructors. By giving people a place for their ungoverned data and providing access to everything, even in its raw form - the sandbox approach for quick data load and analytics - we become adaptable to change and disruption. This is the ‘Bi-Modal’ approach:

  • Mode 1 is the traditional, well governed, secure, end-to-end DW/BI solution. This is where the repeatable processes that have standards, rules and models applied will exist and these projects will still offer huge value.

  • Mode 2 is the ‘sandbox approach’ – an area where the data can be loaded quickly to let the data scientists get to work and the business play with the data and decide if there is merit/value in transitioning to Mode 1.

This ‘two-stream’ approach to development and governance is certainly a theme coming through in the various courses. As well as allowing rapid access to the data for quick wins, this also allows for prototyping, PoCs and a ‘Fail Early’ approach for deciding on the best Big Data technology for an organisation.

Do you rely on TM1 for your financial budgeting and planning or reporting?

Friday, October 14, 2016

Do you know that on 24 November 2016 all TM1 servers will stop working? This is due to the SSL certificates expiring on that day. To keep TM1 up and running you will need to ensure that new SSL certificates are in place.


What are your options?

  1. Disable SSL (insecure)
  2. Switch to V2 certificates (recommended)
  3. Wait to see if IBM releases a fix pack for 10.2
  4. Upgrade TM1 to 10.2.2, install FP6 and then the SSL certificate fix (time consuming)

How can we help?

Altis Managed Services have developed a solution based on Option 2, where we will remotely manage the loading of V2 certificates through your development, test and production environments.  For just $5,000 we can remove your risk and ensure TM1 is working when you need it.

Moving forward, we can manage your entire TM1 environment for as little as $5,000 per month and give you absolute confidence and peace of mind in the safety and management of your data.

Want to learn more?

Contact us at Connect@altis.com.au or call us at 02 9211 1522.


