Analytics on Open Data
11th June 2021
By Michael Amadi
Analytical workloads can generally be divided into three main areas: data engineering, business intelligence, and advanced analytics.
Data engineers are responsible for creating data pipelines that enable data consumers, such as data analysts, data scientists, and machine learning engineers, to deliver insightful reports and machine learning or artificial intelligence models. It could easily be argued that, without some form of data engineering, getting significant value from complex or large data quickly becomes an inefficient and overwhelming task.
The ideal scenario for a data engineer acquiring data is to be provided with a frictionless and consistent bulk data API that can be used to quickly ingest the required data into a data lake or analytics platform. To support this scenario, the API must provide metadata about when the data was last updated and include versioned data file endpoints.
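To illustrate why last-updated metadata and versioned endpoints matter, here is a minimal sketch of an incremental ingestion check. The metadata document, its field names (`name`, `last_updated`, `versioned_url`), and the `example.com` URLs are all hypothetical and do not describe any particular API.

```python
from datetime import datetime, timezone

# Hypothetical metadata document, shaped the way a bulk data API might
# describe a dataset. Field names and URLs are assumptions for illustration.
metadata = {
    "dataset": "example-dataset",
    "files": [
        {
            "name": "fact_sales.parquet",
            "last_updated": "2021-06-01T00:00:00+00:00",
            "versioned_url": "https://example.com/data/v3/fact_sales.parquet",
        },
        {
            "name": "dim_date.parquet",
            "last_updated": "2021-01-15T00:00:00+00:00",
            "versioned_url": "https://example.com/data/v1/dim_date.parquet",
        },
    ],
}


def files_to_ingest(metadata, last_ingested):
    """Return the versioned URLs of files updated after the previous run."""
    pending = []
    for entry in metadata["files"]:
        updated = datetime.fromisoformat(entry["last_updated"])
        if updated > last_ingested:
            pending.append(entry["versioned_url"])
    return pending


# Only files that changed since the last pipeline run need to be downloaded.
last_run = datetime(2021, 3, 1, tzinfo=timezone.utc)
pending = files_to_ingest(metadata, last_run)
print(pending)
```

Because each URL points at a specific version of a file, a pipeline built this way can skip unchanged files and re-run safely against the same inputs.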
In addition to a bulk data API, the data would ideally be modelled for analytical workloads (i.e. as star schemas) and made available in one or more open data file formats that are optimised for interactive querying patterns, such as Apache ORC and Apache Parquet. Once ingested, queries could be executed directly against the data files using open source data lake query engines such as Apache Drill, Apache Spark, Dremio, and Trino. An experience of this nature would significantly reduce the time to data insights.
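As a rough illustration of what a query against a star schema looks like, the sketch below uses Python's built-in `sqlite3` module as a stand-in for a data lake query engine. The table names, column names, and sample rows are invented for the example; a real engine would read the ORC or Parquet files directly rather than in-memory tables.

```python
import sqlite3

# An in-memory SQLite database stands in for the query engine.
# All table names, column names, and rows are invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE dim_region (region_key INTEGER PRIMARY KEY, region_name TEXT);
    CREATE TABLE fact_incident (date_key INTEGER, region_key INTEGER, incident_count INTEGER);
    INSERT INTO dim_date VALUES (20210101, 2021, 1), (20210201, 2021, 2);
    INSERT INTO dim_region VALUES (1, 'North'), (2, 'South');
    INSERT INTO fact_incident VALUES
        (20210101, 1, 5), (20210101, 2, 3), (20210201, 1, 7), (20210201, 2, 4);
""")

# A typical star schema query: join the fact table to its dimensions,
# group by descriptive dimension attributes, and aggregate the measure.
query = """
    SELECT r.region_name, d.year, SUM(f.incident_count) AS total_incidents
    FROM fact_incident AS f
    JOIN dim_date AS d ON d.date_key = f.date_key
    JOIN dim_region AS r ON r.region_key = f.region_key
    GROUP BY r.region_name, d.year
    ORDER BY r.region_name
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The joins are simple key lookups from one fact table to small dimension tables, which is what makes this modelling style convenient for interactive analysis.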
Data analysts are responsible for surfacing insights through business intelligence (BI) reports and dashboards. Before they can do this, they must first acquire the relevant domain knowledge and transition through different phases of data exploration and data understanding. The less friction that they face, the faster they can deliver the valuable, impactful, and actionable data insights that their organisation needs.
Recent trends have seen an increase in the number of data analysts who can transform, analyse, and report on data using languages like Python, R, and Julia. These analysts can use such languages to query data in data lakes or on their local machines. Most data analysts, however, would arguably still prefer to use BI tools like Excel, Power BI Desktop, and Tableau Desktop to analyse data and build interactive reports and dashboards for their organisation. This is primarily because these tools offer simple drag-and-drop user interfaces that make performing these tasks very efficient.
Data scientists typically ask for data at the most granular level available. Like the data analyst, they must have or acquire the relevant domain knowledge and progress through phases of data exploration and data understanding. Once a sufficient level of data understanding has been achieved, a data scientist will transition to a data interpretation phase where they perform hypothesis testing. Their conclusions inform the choice of the most appropriate machine learning or artificial intelligence models for their business problem.
It is not uncommon to hear that data scientists spend over 80% of their time acquiring and preparing data, and less than 20% of their time performing actual data science work. A better scenario would be for nearly all of this time to be spent on the tasks that deliver real business value.
Use Our Open Data Services
Open Data Blend Datasets
Data engineers can use the Open Data Blend Dataset API to programmatically ingest our datasets into their analytics platform. They can choose to get the data in CSV, ORC, or Parquet format, or all three, depending on their downstream data consumption requirements. Each of our datasets is accompanied by details of when the data was last updated, the data type of each column, and a description of each column to provide additional context.
Data analysts can use the Open Data Blend Dataset UI to download data files in CSV, ORC, or Parquet format onto their local machines and analyse them in tools like RStudio, Azure Data Studio, Power BI Desktop, and Tableau Desktop.
Data scientists can use the Open Data Blend Dataset UI to download the data locally, or the Open Data Blend Dataset API to programmatically pull data into their analytics platform. Once the data has been acquired, they can use notebooks like Jupyter to explore the data, test their hypotheses, and train machine learning models. Because our datasets have already been optimised for analytical workloads, the data scientist only needs to perform trivial joins and light data transformations (e.g. feature engineering) to prepare the data for machine learning models.
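A trivial join followed by light feature engineering might look something like the sketch below. It uses only the Python standard library, and the file contents, column names (`date_key`, `amount`, `full_date`), and derived features are invented for illustration; they are not taken from any actual Open Data Blend dataset.

```python
import csv
import io
from datetime import date

# Invented sample data standing in for a downloaded fact file and date dimension.
fact_csv = "date_key,amount\n20210605,10.0\n20210607,4.5\n"
dim_date_csv = "date_key,full_date\n20210605,2021-06-05\n20210607,2021-06-07\n"

# Index the dimension by its key so each fact row needs a single lookup.
dim_date = {
    row["date_key"]: row for row in csv.DictReader(io.StringIO(dim_date_csv))
}

features = []
for row in csv.DictReader(io.StringIO(fact_csv)):
    # The join: look up the matching dimension row by key.
    day = date.fromisoformat(dim_date[row["date_key"]]["full_date"])
    # Light feature engineering: derive model inputs from the joined date.
    features.append(
        {
            "amount": float(row["amount"]),
            "month": day.month,
            "is_weekend": day.weekday() >= 5,  # Saturday is 5, Sunday is 6
        }
    )

print(features)
```

In practice a data scientist would more likely do this with a dataframe library in a notebook, but the shape of the work is the same: one key lookup per dimension, then a handful of derived columns.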
Experience our open data catalogue first-hand.
Open Data Blend Analytics
Data analysts can connect to the Open Data Blend Analytics model from BI tools like Excel, Power BI Desktop, and Tableau Desktop, and dive straight into their data analysis. Because there is no need to download or model any of the data upfront, data analysts can begin creating reports and dashboards as soon as they have a good understanding of the data.
Like data analysts, data scientists can connect to the Open Data Blend Analytics model from any of the supported BI tools and start their exploratory data analysis (EDA). This means they can reach the required level of data understanding more quickly. After the EDA, data scientists can download the data files from the corresponding Open Data Blend Datasets and use languages like Python, R, and Julia to build and train their machine learning models.
Learn more about our interactive analytics service.