5 Leading Data Lake Analytics Platforms and Services: Part 1

31st December 2021

By Michael A

In recent years there has been an insatiable hunger for data lakes because of their ability to store data regardless of whether it is structured, semi-structured, or unstructured. This capability is especially important because the volume of unstructured and semi-structured data is growing far faster than that of structured data. You will find that most data lakes are built on one of three cloud object stores: Amazon Simple Storage Service (Amazon S3), Azure Data Lake Storage Gen2 (ADLS Gen2), or Google Cloud Storage (GCS). It is also becoming increasingly common for data lakes to span multiple cloud providers and, as a result, more than one storage service.

Why Cloud Object Stores are Important


Cloud object stores are typically low cost, highly scalable, extremely secure, compliant with several international standards, and provide virtually unlimited storage capacity. Implementing an on-premises data lake with all these attributes would be impractical for all but the largest of companies, not only because of the associated operational costs, but also the exceptional degree of skill and experience that is required to configure, maintain, and support the data infrastructure.

One of the biggest challenges when creating and maintaining a data lake is to avoid it turning into a data swamp. Companies can avoid creating data swamps by carefully organising the data that is ingested, and ensuring there are effective data governance controls, policies, and procedures in place.

Another significant challenge, once a data lake has been successfully established, is enabling data engineers, data analysts, data scientists, machine learning (ML) or artificial intelligence (AI) engineers, and other members of the analytics team to unlock value from the data. They need flexible, scalable, and highly performant platforms and services that allow them to analyse and transform the data into solutions that provide value to their organisation.

Leading Data Lake Analytics Platforms and Services


This five-part blog series will introduce what are arguably the five best-of-breed data lake analytics platforms and services right now, and present some ideas about how an analytics team could use them to deliver business value.

The platforms and services that will be explored in this blog series are:

  1. Azure Synapse Analytics
  2. Amazon Athena
  3. Databricks
  4. Google BigQuery
  5. Dremio



Azure Synapse Analytics


Azure Synapse Analytics is a unified analytics platform for building end-to-end analytics solutions. It is a superset of several well-integrated services that enable members of the analytics team to work with the same set of data in a way that is aligned with their analytic workflows. The data lake is implemented using one or more ADLS Gen2 accounts, and the services require data to be ingested there first before the full spectrum of Azure Synapse Analytics capabilities can be used.

Significant Capabilities

Azure Synapse Pipelines is a data orchestration service that enables data engineers to build data pipelines at varying levels of complexity. It is based on the Azure Data Factory service, a mature cloud data orchestration service that has a proven track record.

Azure Synapse Serverless SQL Pools is a scalable data lake SQL query engine that makes it possible to flexibly project a schema over semi-structured and structured files, enabling them to be queried like relational database tables, with a comparable level of performance. There is also an Azure Synapse Dedicated SQL Pools service, which is a massively parallel relational database service. Because this blog post focuses on services that can query the data lake directly, it is mentioned here only to make you aware of it.
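To make the idea of querying files in place more concrete, here is a minimal sketch that builds the kind of T-SQL statement a serverless SQL pool can run against Parquet files using `OPENROWSET`. The helper name `build_openrowset_query`, the storage account `examplelake`, the container `refined`, and the `sales/*.parquet` path are all hypothetical and are used purely for illustration. The sketch only produces the query text; running it would still require a connection to a Synapse workspace.

```python
def build_openrowset_query(storage_account: str, container: str,
                           path_glob: str, top: int = 10) -> str:
    """Return a T-SQL query that reads Parquet files in place via OPENROWSET.

    All arguments are illustrative placeholders, not values from a real
    workspace.
    """
    if not isinstance(top, int) or top <= 0:
        raise ValueError("top must be a positive integer")

    url = f"https://{storage_account}.dfs.core.windows.net/{container}/{path_glob}"
    # Reject quotes so the URL cannot break out of the SQL string literal.
    if "'" in url:
        raise ValueError("storage location must not contain single quotes")

    return (
        f"SELECT TOP {top} *\n"
        f"FROM OPENROWSET(\n"
        f"    BULK '{url}',\n"
        f"    FORMAT = 'PARQUET'\n"
        f") AS result;"
    )


if __name__ == "__main__":
    # Hypothetical account, container, and folder names.
    print(build_openrowset_query("examplelake", "refined", "sales/*.parquet"))
```

The generated statement could be pasted into a serverless SQL script in Synapse Studio, or sent from any client that can connect to the serverless SQL endpoint.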

Azure Synapse Spark Pools is based on a special version of Apache Spark that enables data in the data lake to be handled using powerful programming languages including Scala, Python, R, and SQL. What makes this version of Apache Spark unique is that it also includes support for C#.
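As a rough illustration of working with data lake files from a Spark pool in Python, the sketch below builds an `abfss://` URI for an ADLS Gen2 location and reads Parquet files from it with the standard `spark.read.parquet` call. It assumes it runs in a Synapse notebook, where a `spark` session is already available. The function names, the storage account `examplelake`, the container `refined`, the folder `sales/2021`, and the column `region` are hypothetical.

```python
def abfss_uri(storage_account: str, container: str, path: str) -> str:
    """Build an ADLS Gen2 URI of the form abfss://container@account.dfs.core.windows.net/path."""
    return (
        f"abfss://{container}@{storage_account}.dfs.core.windows.net/"
        f"{path.lstrip('/')}"
    )


def count_rows_by_column(spark, uri: str, column: str):
    """Read Parquet files from the data lake and count the rows per value of `column`.

    `spark` is expected to be a SparkSession, such as the one predefined
    in a Synapse notebook. The result is a Spark DataFrame.
    """
    df = spark.read.parquet(uri)
    return df.groupBy(column).count()


# Example usage inside a notebook (hypothetical names throughout):
# uri = abfss_uri("examplelake", "refined", "sales/2021")
# count_rows_by_column(spark, uri, "region").show()
```

Passing the session in as an argument, rather than relying on a global, keeps the helper easy to reuse outside a notebook.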

Power BI is one of the leading business intelligence software-as-a-service (SaaS) platforms, and this is integrated into the Azure Synapse Analytics service in a way that enables reports and dashboards to be quickly created from the refined data assets in the data lake.

Delivering Business Value

Azure Synapse Analytics caters to the many different roles found in a modern analytics team including the data engineer, data analyst, data scientist, and ML/AI engineer. Data engineers will spend most of their time ingesting data into the data lake (i.e. one or more ADLS Gen2 storage accounts) using Azure Synapse Pipelines and then transforming it for downstream use using Azure Synapse serverless SQL pools, Azure Synapse serverless Spark pools, or a combination of both.

Data scientists will typically use Azure Synapse serverless Spark pools to perform iterative tasks on the ingested data such as feature engineering, ML/AI model training, and ML/AI model evaluation. This could be done using Spark MLlib, a machine learning library built into Apache Spark, or with Azure Machine Learning, a cloud-based machine learning service that enables data scientists and ML/AI engineers to rapidly iterate on machine learning models using automated ML (AutoML). Once the models have been trained and validated, the ML/AI engineers can take these production-ready ML/AI models and integrate them into their data pipelines using Azure Synapse Pipelines.

Data analysts can use the integrated Power BI experience to create business intelligence semantic data models and build reports and dashboards on top of them, making it possible to share insights across their organisation. The Power BI data model would use refined data from the data lake as a primary source and augment this with external data from other systems and data services, enabling the data analyst to stay on top of the ever-evolving reporting requirements. They could also use the Power BI composite models and recently announced hybrid tables features to combine an in-memory analytics cache, which provides consistently quick query response times, with massive data volumes that are queried directly in the data lake. These features can also be used to satisfy real-time reporting requirements.

Coming Up Next

The next instalment of this five-part blog series will explore how an analytics team could use Amazon Athena and other complementary AWS services to deliver business value from a data lake implemented with Amazon S3.

Follow Us and Stay Up to Date

Keep up to date with Open Data Blend by following us on Twitter and LinkedIn. Be among the first to know when there's something new.

Blog hero image by Joshua Sortino on Unsplash.

Copyright © 2019-2023 Nimble Learn Ltd. All rights reserved unless otherwise stated. Company Registration Number 08637310. VAT Number 174 9728 60.

Open Data Blend®, the Open Data Blend® logo, Nimble Learn®, and the Nimble Learn® logo are registered trademarks of Nimble Learn Ltd. All other product names, logos, and brands are the property of their respective owners, and their use does not imply endorsement.
