Talking to data specialists is like spending an evening with doctors. We hear a lot of complicated words whose structure can seem so convoluted that their meaning escapes us completely.
Because Wikipedia often presents these things in a way that is complete but rough on the reader, here is a simplified definition of the main terms used in Data, what they imply, and how what they represent can serve you and your business. With a little luck, you will also become a connoisseur, able to laugh naturally at your Data team's jokes...
To better explain the concepts, let's take a classic situation as an example: that of Alice, a young, dynamic entrepreneur full of enthusiasm. She recently decided to launch an international video streaming platform. Her users can upload videos, watch them, comment on them... Her success is now worldwide, and she is seriously thinking about taking an interest in Data, this supposed black gold everyone keeps talking about...
The Data Lake is nothing more than a massive (theoretically unlimited), inexpensive storage space for data of any format, organized "freely". To put it simply: your computer's hard disk is literally a Data Lake: it stores a potentially very large amount of raw data, and you can explore its structure with your file explorer. Its query performance is usually not extraordinary and depends on how you divide the data into directories (also called partitions). Thanks to the great diversity of the data it holds, it will also be the working base of Data Scientists, who will for example explore the data, work on Machine Learning, experiment with transformations, ...
To take our initial example: a Data Lake can be a good solution for storing videos. Since users upload in large quantities, Alice will need elastic storage whose cost will not explode with that quantity. Because the videos are indexed by user and Alice would like a minimum of performance, she will organize her Data Lake with one directory per user, each directory storing all of that user's videos.
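To make this concrete, here is a minimal sketch of how Alice might store a video under a per-user prefix on AWS S3, the prefix acting as her partitioning scheme. The bucket name and helper are hypothetical, used only for illustration:

```python
import boto3

# Hypothetical bucket name, for illustration only.
BUCKET = "alice-video-data-lake"

def upload_video(user_id: str, local_path: str, video_name: str) -> str:
    """Store a raw video file under a per-user prefix (one 'directory' per user)."""
    s3 = boto3.client("s3")
    # The key prefix acts as the partition: user_id=<id>/<video_name>
    key = f"user_id={user_id}/{video_name}"
    s3.upload_file(local_path, BUCKET, key)
    return f"s3://{BUCKET}/{key}"

# Example: upload_video("42", "/tmp/holiday.mp4", "holiday.mp4")
```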
Anecdotally, since the Data Lake imposes no structure by nature, it can turn into a Data Swamp for some: a mess of data in which order is only a distant memory...
Examples of Data Lake solutions: AWS S3, Azure Blob Storage, GCP Cloud Storage, Apache Hadoop, …
ETL stands for Extract, Transform, Load. It characterizes the actions of retrieving data from a stream (real-time or not), applying transformations, and then transmitting it to another location. These operations can correspond to any data-processing function: an aggregation over a time window, an enrichment of records, an extraction of a subset of the data, ...
Let's take Alice's situation again: her platform is heavily used, and to improve it she would like to extract some key information. For example, she would like to know, in real time, how many users are watching a video on her platform each passing minute.
To do this, she will need an ETL pipeline: her mobile application constantly sends events whenever a user starts a video. The pipeline steps in by aggregating these events over a one-minute time window. Very simply, it is nothing more than a computer program that retrieves one minute's worth of events, counts everything, and then transmits its result. That's it! A first simple metric that she will be able to reuse in her sublime dashboards.
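As a rough illustration of what such a job does, here is a minimal pure-Python sketch (with a made-up event format rather than a real streaming framework) that counts "video started" events per minute:

```python
from collections import Counter
from datetime import datetime

def count_views_per_minute(events):
    """Extract: iterate over raw events; Transform: aggregate by minute; Load: return the counts.

    Each event is assumed to be a dict like {"user_id": "42", "started_at": "2023-05-01T10:03:27"}.
    """
    counts = Counter()
    for event in events:
        started_at = datetime.fromisoformat(event["started_at"])
        # Truncate the timestamp to the minute to build the aggregation window.
        window = started_at.replace(second=0, microsecond=0)
        counts[window] += 1
    # In a real pipeline, this result would be written to a database or a dashboard.
    return counts

events = [
    {"user_id": "42", "started_at": "2023-05-01T10:03:27"},
    {"user_id": "7",  "started_at": "2023-05-01T10:03:51"},
    {"user_id": "42", "started_at": "2023-05-01T10:04:02"},
]
print(count_views_per_minute(events))
```

A production version would run continuously on the event stream (with a tool from the list below), but the logic per window is exactly this simple.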
Examples of ETL solutions: AWS Glue, Azure Data Factory, GCP Dataflow, Apache Spark, …
The Data Warehouse is a form of enhanced, specialized Data Lake. Everything is well organized and easily accessible, with a very high volume (though comparatively lower than a Data Lake). You can retrieve very large amounts of data at any time, and the information it contains keeps accumulating. Its main distinction lies in the temporal structuring of its storage and the particular form of its data.
More practically, a Data Warehouse is actually a database that has been enhanced to easily handle queries requiring the analysis of potentially gigabytes of data. To do this, it distributes the information over several disks working in parallel, allowing it to read several storage sectors at the same time at the slightest request.
In addition to its hardware structure, the denormalized form of its data also allows it to optimize its performance. Indeed, even though it takes up more space, the Data Warehouse stores records in which part of the information can be repeated, to avoid the costly join operations that normalization imposes in order to save space.
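To illustrate the difference with made-up fields: a normalized design stores only a reference to the user and looks the rest up elsewhere, while a denormalized warehouse record repeats that information on every row:

```python
# Normalized: the view only references the user; reading the age range requires a join.
normalized_view = {"video_id": "v1", "user_id": "42", "viewed_at": "2023-05-01T10:04:02"}
normalized_user = {"user_id": "42", "age_range": "25-34", "country": "FR"}

# Denormalized: the same record carries the repeated user and video attributes,
# taking more space but avoiding the join at query time.
denormalized_view = {
    "video_id": "v1",
    "user_id": "42",
    "viewed_at": "2023-05-01T10:04:02",
    "age_range": "25-34",
    "country": "FR",
    "video_category": "travel",
}
```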
To take our situation again: since Alice likes nice synthetic visualizations, she would like to see the distribution of video views over the year, by content category. To do this, each time one of her users finishes watching a video, an event is sent to her data architecture. An ETL pipeline enriches each record with information about the user and the video (age range, video category, ...). All of this amounts to a massive quantity of records, which will therefore be loaded into a Data Warehouse. Alice will then be able to innocently ask her Warehouse for the whole year's records to build her "elementary" visual.
To summarize, a data warehouse is simply a huge database that can read structured data from many SSDs concurrently and is optimized for mass querying.
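In practice, Alice's question to her warehouse boils down to a single analytical query. Here is a sketch, with hypothetical table and column names, issued through any DB-API-style connection (the actual connector and SQL dialect depend on the warehouse: redshift_connector, google-cloud-bigquery, snowflake-connector-python, ...):

```python
# Hypothetical denormalized table: one row per finished view,
# with the video category repeated on every record.
# Note: DATE_TRUNC syntax may vary slightly between warehouses.
YEARLY_VIEWS_BY_CATEGORY = """
    SELECT video_category,
           DATE_TRUNC('month', viewed_at) AS month,
           COUNT(*) AS views
    FROM video_views
    WHERE viewed_at >= '2023-01-01' AND viewed_at < '2024-01-01'
    GROUP BY video_category, month
    ORDER BY month, video_category
"""

def views_by_category(connection):
    """Run the aggregation on a DB-API compatible warehouse connection."""
    cursor = connection.cursor()
    cursor.execute(YEARLY_VIEWS_BY_CATEGORY)
    return cursor.fetchall()
```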
Examples of Data Warehouse solutions: AWS Redshift, Azure Synapse Analytics, GCP BigQuery, Apache Hive, Snowflake, …
The Data Hub is a virtual space in which it is possible to reference and query all of your data sources. Its objective is nothing more than to centralize the information that your systems are capable of providing you. Its advantage lies in its knowledge of the structure (or lack thereof) of each source's data, and in the possibility of querying them to obtain information quickly.
Let's go back to Alice's company, as lively as its concept is original: her data lives in a multitude of places. The video Data Lake, the business database (user and video information), the real-time event Data Warehouse, but also (for example) the files describing each country's rules on what may or may not be broadcast, the CRM referencing interested advertisers, the Google Analytics of the application's usage, ... All of this represents a potentially great diversity of sources, which the Data Hub is there to catalog. From now on, Alice will be able to determine automatically, from a customer's CRM information, the quantity of videos matching their expectations and the mass of users likely to be interested in them.
Examples of Data Hub solutions: Azure Purview, DataHub, …
Every organization necessarily generates data, whether it comes from the business domain, from interactions with devices, or even from the Internet of Things. All of this data is, in essence, a source of value. In our video streaming example, knowing more about user habits, but also about the platform's main metrics in general, would provide precise indicators that are essential for the future of the business. Whatever the associated revenue model, knowing more about your business can only be beneficial in a Data Driven approach to continuous improvement.
Everything in good time! You don't need to start with a huge Data Warehouse and hundreds of ETL jobs right away.
The most important thing is to collect the data you generate so you can run an exploratory study. The objective is, for example, to create or use connectors to export your data to a Data Lake, without planning to work on it just yet. This way, you will store a history of knowledge at low cost, ready to be used later.
Once enough data has been collected (that depends on you: three days, several weeks, a few months), it's up to you and your teams to look at what's going on. Using tools like Power BI, Apache Superset or Qlik (Business Intelligence tools), or simply a Python/SQL notebook, you will be able to put information together, transform it on the fly, and look for relationships that make sense for your business. The idea is simply to create the blueprint that will allow you to add value to the data you generate. Don't force yourself to use everything you could use; instead, use the data that will really benefit you on a daily basis.
Let's take an example: you can (from now on) export all of your users' interactions daily to your AWS Data Lake, S3. After a month, connect a Business Intelligence tool (Power BI, QuickSight) to this Data Lake so you can query it live and create new business value from your data without delay!
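A minimal notebook sketch of that first exploration, assuming the daily exports are Parquet files under a hypothetical bucket with made-up column names, and that pandas plus s3fs are installed:

```python
import pandas as pd

# Hypothetical location of the daily interaction exports.
interactions = pd.read_parquet("s3://alice-video-data-lake/interactions/")

# A first, very rough look: which actions do users perform the most?
print(interactions["action"].value_counts())

# Views per day, to spot trends worth investing in.
interactions["date"] = pd.to_datetime(interactions["event_time"]).dt.date
print(interactions.groupby("date").size())
```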
If your business generates small amounts of data, you have probably already reached the right balance between the value extracted from your data and infrastructure costs. That said, for more comfort or to go further, you can continue by creating ETL jobs that export the data you have selected (structuring it along the way) to your brand-new Data Warehouse. The same Business Intelligence tools will prefer to connect to it rather than to the Data Lake, and will benefit from much better performance.
Once your data pipelines are well defined, it may be time to think about using your data for something other than Business Intelligence: for example, making your data available to your users as Open Data, or building your own Artificial Intelligence and Machine Learning models. In short, the world is open to you...
Thank you for reading this article.