James Black from our Sydney office attended the TDWI conference "Analytics across the Enterprise" earlier this month in San Diego, California. Here he shares his key takeaways from the conference.
The main theme of the conference was developing insight-driven organisations that succeed in using analytics to create real business impact.
As with all TDWI conferences, there was a myriad of courses on offer across a variety of learning tracks, from 'Getting Started with Analytics' to 'Managing Analytical Data' and 'Visualising and Communicating using Data', through to the first-ever TDWI Data Scientist boot camp.
Attending the conference was a rewarding experience: I got to hear from and interact with the industry thought leaders who present the courses, and to share experiences with fellow attendees from the front line of Analytics and Data Warehousing. For me, the overarching theme of the courses I attended was the challenge for traditional DW/BI practitioners to adapt our structured methods and processes to the more fluid and changeable landscape of the Big Data world.
Here are my six key takeaways from the conference:
1. ‘Traditional’ DW/BI is still important, but it’s no longer enough
Data Warehousing and Business Intelligence as we know it – reports, dashboards, cubes, ETL, 3NF, Star Schemas – has been around for decades and is to some extent overshadowed these days by all the hip, exciting, new-fangled stuff going on related to Big Data. But it remains hugely important to how most organisations make business decisions.
However, many businesses are moving beyond conventional BI to gain a competitive edge, seeking to answer not merely the ‘who’, ‘what’ and ‘when’ of DW/BI but also the analytical questions of ‘what if...’, ‘what should...’ and ‘what will happen in the future’. This is the power of Advanced Analytics – analytics that predict what is likely to happen, including future behaviours, and analytics that prescribe future courses of action.
2. The enterprise is no longer the centre of the data universe
The question of what data of value sits outside an organisation’s core transactional and customer management systems is of course nothing new. But those halcyon days, when the biggest concern for Data Architects was how to get all the data residing in spreadmarts and Access databases into the Data Warehouse, are over.
Data that is required for Analytics is everywhere – inside and outside of the enterprise. The data internal to the enterprise remains important, but the data external to the enterprise is becoming just as important – web data, social media, open data and government datasets, text and sensor data, system logs and machine generated information.
As well as the location of the data, the type of data available for analysis has also changed. It is not just arriving in greater volumes and at lower latency; it is often non-relational, inconsistent and without metadata. This is data that doesn’t fit neatly into the structured, architected models of our usual relational databases: streaming and telemetry data, social media data, JSON, XML and textual data. New types of databases are being used to store this non-relational information – Key Value, Document and Graph databases – the ‘NoSQL’ databases.
3. A new paradigm for architecting and modelling BI solutions
So the data is not always relational, we won’t always know the exact format before we receive it, and the format may change – resulting in a shift in how we architect and model future DW/BI solutions.
Does unstructured data make data modelling impractical? Does NoSQL imply no data modelling? Do E-R and star schema models still matter? The answer to these questions is that data modelling is still an important process; data modelling for relational structures is not going away; but the data modelling process must change to keep pace with the rapidly evolving world of data.
In traditional BI, we model before storing data. For these new data sources, we can no longer model and architect all the data in advance - we will have to store the data first and then model it. This is the ‘Schema on Read’ paradigm versus the ‘Schema on Write’ process of traditional Data Warehousing.
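To make the distinction concrete, here is a minimal Python sketch. It is an illustration only, not something presented at the conference: the field names, the sample records and the helper functions `write_with_schema` and `read_with_schema` are all hypothetical, and plain Python lists stand in for a warehouse table and a raw data store.

```python
import json

# Hypothetical target schema: column name -> required Python type.
ORDER_SCHEMA = {"order_id": int, "customer": str, "amount": float}


def write_with_schema(table, record):
    """Schema on Write: validate against the model before storing."""
    for column, expected_type in ORDER_SCHEMA.items():
        if not isinstance(record.get(column), expected_type):
            raise ValueError(f"record rejected: bad or missing '{column}'")
    # Only the modelled columns are stored.
    table.append({column: record[column] for column in ORDER_SCHEMA})


def read_with_schema(raw_store, fields):
    """Schema on Read: parse raw text and apply a structure at query time."""
    for line in raw_store:
        document = json.loads(line)
        # Missing fields become None rather than blocking the load.
        yield {field: document.get(field) for field in fields}


# Raw events whose shape varies from record to record (made-up data).
incoming = [
    '{"order_id": 1, "customer": "Acme", "amount": 120.5}',
    '{"order_id": 2, "customer": "Globex", "channel": "web"}',
]

# Schema on Write: the second record fails validation and is never stored.
warehouse_table = []
rejected = []
for line in incoming:
    try:
        write_with_schema(warehouse_table, json.loads(line))
    except ValueError:
        rejected.append(line)

# Schema on Read: everything is stored as-is, then modelled when it is read.
raw_store = list(incoming)
by_channel = list(read_with_schema(raw_store, ["order_id", "channel"]))
```

In this sketch the Schema on Write path loses the second record (and its unexpected `channel` field) at load time, while the Schema on Read path keeps everything and lets the analyst decide later which fields matter.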
The process of architecting DW/BI solutions also needs to become more adaptive. This means no ‘one size fits all’ architecture with rigid ‘enterprise’ standards, but instead allowing greater access to the data at earlier stages of its processing: rather than waiting for the data to be transformed and loaded into the DW, users get access to it in its rawer forms.
While Data Architecture has usually involved control and centralisation, the requirements of Data Scientists are for empowerment and decentralisation. Architecture and governance remain hugely important, but some happy medium is required in relation to rigour and security so that, for example, sensitive customer data isn’t just dumped into the data lake for anyone to access.
4. Big Data – moving beyond the hype and into the mainstream
Big Data has justified the hype and it was clear at the conference that many attendees were from organisations that were already utilising the data, technologies and techniques of Big Data. Big Data is not the goal, nor is it the question – it’s not data for the sake of data. Instead, Big Data enables business strategies through Analytics.
However, to fully grasp the opportunities that Big Data may provide, businesses are moving beyond the 3 original “V”s of Big Data – Volume, Velocity and Variety. There are now additional “V”s that need to be taken into account:
Veracity – the data must be trustworthy and credible.
Visualisation – after the data has been processed, visualisation is what makes it comprehensible. This doesn’t simply mean graphs and charts, but visualisations that enable data discovery and exploration.
Value – Data in and of itself has no intrinsic value. The business must be able to use the data for Advanced Analytics in order to get value from the data.
5. Hadoop is still the elephant in the room
A number of the TDWI courses focused on the dizzying array of technologies that have sprung up to enable the delivery of Big Data solutions. It is clear that Hadoop (and its huge ecosystem of components) is still the major platform for Big Data applications across both analytical and operational implementations. This is due to its storage and processing capabilities, high performance and availability, and competitive pricing.
Rick van der Lans, in his course on New Data Storage Technologies, stated that there are a number of barriers to widespread Hadoop usage: the complexity of the APIs for accessing data on Hadoop, which require programming skills (making Hadoop unsuitable for non-technical users); the low productivity of developing against Hadoop APIs; and limited tool support, as many reporting and analytics tools don’t support interfaces to Hadoop via MapReduce and HBase.
This has led to the growth of ‘SQL-on-Hadoop’ engines which, according to Rick, offer the performance and scalability of Hadoop with a programming interface – SQL – that is known to non-technical users, requires a lot less coding and is supported by many more reporting and analytical tools. The first SQL-on-Hadoop engine was Hive, and since then demand has grown, as has the choice of engines – Drill, Impala, Presto, Jethro, Splice and a myriad of others.
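As a rough illustration of the ‘less coding’ point, the Python sketch below computes the same aggregate twice: once with hand-written map and reduce steps, and once as a single SQL statement. Everything in it is an assumption made for the example: the page-view records are made up, the `map_phase` and `reduce_phase` functions are hypothetical, and Python's built-in `sqlite3` module is only a local stand-in for a SQL-on-Hadoop engine, not a Hadoop API.

```python
import sqlite3
from collections import defaultdict

# Hypothetical page-view records: (country, views).
records = [("AU", 3), ("US", 5), ("AU", 4), ("NZ", 1), ("US", 2)]


def map_phase(rows):
    """Emit one (key, value) pair per input row."""
    for country, views in rows:
        yield country, views


def reduce_phase(pairs):
    """Group the pairs by key and sum the values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hand-coded style: the developer writes the map and reduce logic.
coded_totals = reduce_phase(map_phase(records))

# SQL style: the same aggregation expressed declaratively.
# sqlite3 is only a local stand-in for a SQL-on-Hadoop engine here.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE page_views (country TEXT, views INTEGER)")
connection.executemany("INSERT INTO page_views VALUES (?, ?)", records)
sql_totals = dict(
    connection.execute(
        "SELECT country, SUM(views) FROM page_views GROUP BY country"
    )
)
connection.close()
```

Both paths produce the same totals per country; the difference is that the SQL version is a single statement that an analyst or a reporting tool could issue without writing any procedural code.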
Whether it’s applications on Hadoop; alternatives to Hadoop such as Spark; high performing SQL-based appliances like Greenplum, HANA or Netezza; or the large number of NoSQL products out there, the choice is seemingly endless and confusing. The recommendation coming out of the conference is to research the technologies and undertake Proofs of Concept before settling on a Big Data technology. It’s very much a case of ‘caveat emptor’ – buyer beware. You don’t just want an application with a funky name: it has to meet the operational or analytical needs (or both) of your organisation, and there have to be people available with the skills to use it.
6. Stop being the Department of No and become the Department of Yes
So how does a BI department or team evolve to cope with the brave new world of Analytics? One of the complaints from analysts and data scientists is that the traditional method of BI delivery takes too long. In a world where we don’t always know the requirements, the model, or the solution until we have loaded the data, how does the BI team prevent itself from being the bottleneck?
‘Stop being the Department of No and become the Department of Yes’ was a great quote I heard from Mark Madsen, one of the TDWI course instructors. By giving people a place for their ungoverned data and providing access to everything, even in its raw form - the sandbox approach for quick data load and analytics - we become adaptable to change and disruption. This is the ‘Bi-Modal’ approach:
Mode 1 is the traditional, well governed, secure, end-to-end DW/BI solution. This is where the repeatable processes that have standards, rules and models applied will exist and these projects will still offer huge value.
Mode 2 is the ‘sandbox approach’ – an area where the data can be loaded quickly so that the data scientists can get to work and the business can play with the data and decide whether there is value in transitioning it to Mode 1.
This ‘two-stream’ approach to development and governance is certainly a theme coming through in the various courses. As well as allowing rapid access to the data for quick wins, this also allows for prototyping, PoCs and a ‘Fail Early’ approach for deciding on the best Big Data technology for an organisation.