The leaders of organizations have to make important decisions daily to ensure the success and sustainability of their organizations. When doing so they tend to make decisions based on an analysis done using all sorts of data that are related to the organization without blindly making the decisions.
Nowadays organizations collect massive amounts of data and they are mostly not stored in a single repository. Therefore initially, we have to store all these scattered data in a single repository so that it will be more usable by the data analysts in preparing the analysis.
This is where we need Data Warehousing. We mostly talk about the ETL processes under the Data Warehousing concept. So let’s look at what is a Data Warehouse and Data Warehousing briefly, and then let’s get back to our main topic, the ETL processes.
What is a Data Warehouse (DW) and Data Warehousing?
A Data Warehouse is a central repository where we can store data from one or multiple data sources for reporting and data analysis.
Data Warehousing is the process of gathering data from one or multiple data sources and store it in a way that can be used for reporting and data analysis. This is where ETL comes into play.
What is ETL?
The three letters E, T, and L stands for Extract, Transform, and Load. Now let’s discuss these three terms in detail.
This is the initial step of the ETL process. In this step, the process extracts data from different data sources and stores them in a staging area. These data sources can be of different formats hence, we can’t store them directly in the DW. First, we have to transform the data in a way that can be stored in the DW. Therefore, the staging area plays an important role in an ETL process.
Transformation happens inside the staging area. This is the step where we apply different functions on the extracted data to transform them into a single format that can be stored in the DW. Filtering, merging splitting and many more functions can be applied here to ultimately have a meaningful dataset as per our requirements. A simple example of this can be given as follows.
In the extracted data, if the name data is available in two different columns as firstName and lastName, but you want it in a single column as full name, then you can merge these two in the transform step to store the full name in the DW.
From something simple like this to very complex transformations can be done in the Transform step, so that the data is much more useful for the analysts.
This is the final step of the ETL process. In this step, the transformed data will be simply loaded into the DW.
ETL vs ELT
There is another process called ELT. These two processes are not the same. As the names suggest, ETL is Extract, Transform, and Load while ELT is Extract, Load, and Transform. The key difference between these two is unlike in ETL, ELT does not have a staging area. In ELT, the extracted data will be directly loaded into the DW and simple transformations (not very complex transformations like in ETLs) will be done inside the DW. Most ELTs use cloud-based DW solutions for the DW to be compatible with different data formats.
ETL is best suited for a small amount of structured data while ELT is optimal for a large amount of unstructured data.
In this article, we have discussed the basic concepts of ETL processes without going into deep technical definitions to get a general understanding of these concepts. However, if someone is trying to learn about Data Warehousing and ETL in detail, there are valuable resources out there, which I suggest them to read and learn.