Have you ever stood up a datamart to build a handful of analytical reports, only to have that repository sit idle until the next time those reports need refreshing (which may be a week, a month or several months away)? At many points in my career I have built data warehouses and datamarts for that exact scenario and have been frustrated by how long the database sits idle. It seemed like such a waste of energy and technical resources.
For example, in my previous job at a large pharmaceutical company, we bought huge third-party files and loaded them into an operational data store (ODS) in order to mine those files for insights such as the top prescribed deciles or the most popular physicians. Once that was completed, the ODS and reports sat unused for several weeks until the next set of files arrived. The same is true in many other industries that purchase, for example, marketing data or security and access data.
Now, IMAGINE IF... you could just stop the underlying database of the datamart, or the processing engine behind the analytics, when you don’t need them to run specific reports or jobs? What IF you could even build ‘start and stop’ functionality for the database that automatically manages the data processing resources throughout the entire integration procedure?
With Talend Integration Cloud and AWS Services, you can now easily build the loading and processing of third-party files into a datamart, and stop and start the processing power when needed. Upon receipt of the third-party load files, a cloud component called ‘tAmazonRedshiftManage’ automatically ‘fires up’ your AWS Redshift database using the last snapshot so that it has all historical data for processing. Once the database is up and running, the integration starts to load any raw data. When the Redshift database starts, you can even start up an EMR cluster to perform other needed analytics whose results can be loaded into Redshift afterwards. The Talend processes needed to start and stop an AWS EMR cluster for processing, or an AWS Redshift database for the data warehouse, are as simple as the Talend job below.
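For comparison, and separate from the Talend job itself, here is a rough sketch of what "start from the last snapshot" and "stop" could look like if you scripted them by hand with the boto3 Redshift client. This is not how tAmazonRedshiftManage is implemented; the function names and the final-snapshot naming scheme are my own assumptions, and only the first page of snapshot results is read.

```python
from datetime import datetime, timezone


def latest_available_snapshot(snapshots):
    """Return the identifier of the newest snapshot whose status is 'available'."""
    usable = [s for s in snapshots if s.get("Status") == "available"]
    if not usable:
        raise ValueError("no available snapshot to restore from")
    newest = max(usable, key=lambda s: s["SnapshotCreateTime"])
    return newest["SnapshotIdentifier"]


def start_from_last_snapshot(client, cluster_id):
    """Restore the cluster from its most recent snapshot and wait for it to come up."""
    # Assumption: one page of results is enough; a real script would paginate.
    response = client.describe_cluster_snapshots(ClusterIdentifier=cluster_id)
    snapshot_id = latest_available_snapshot(response["Snapshots"])
    client.restore_from_cluster_snapshot(
        ClusterIdentifier=cluster_id,
        SnapshotIdentifier=snapshot_id,
    )
    # Block until the restored cluster reports itself as available.
    client.get_waiter("cluster_available").wait(ClusterIdentifier=cluster_id)
    return snapshot_id


def stop_with_final_snapshot(client, cluster_id):
    """Delete the cluster, keeping a final snapshot so the next start has all history."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    final_id = f"{cluster_id}-final-{stamp}"  # hypothetical naming scheme
    client.delete_cluster(
        ClusterIdentifier=cluster_id,
        SkipFinalClusterSnapshot=False,
        FinalClusterSnapshotIdentifier=final_id,
    )
    return final_id
```

With credentials configured, `client` would be `boto3.client("redshift")` and `cluster_id` the name of your own cluster. Even this small script has to handle snapshot selection, waiting and naming, which is the work the Talend component takes off your hands.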
Now I want to dive into each of the Talend Integration Cloud components and explain how easy it is to set up the process to start or stop each AWS service. Let’s start with the tAmazonEMRManage component, which is used to start or stop an EMR cluster. Elastic MapReduce (EMR) is Amazon’s Hadoop-as-a-service offering, and because you can now run Spark on EMR too, it’s not just for MapReduce! Without doing any AWS CLI programming or writing Python scripts, I can use a visual tool (Talend) to start or stop an EMR cluster by providing some simple parameters like those shown below.
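To give a sense of what those simple parameters stand in for, here is a minimal sketch of starting and stopping an EMR cluster by hand with the boto3 EMR client. It is only an illustration, not the component's implementation. The helper names, the release label, the instance type and the instance count are placeholder assumptions; the two role names are the AWS default EMR roles and may differ in your account.

```python
def build_cluster_request(name, release_label="emr-5.8.0",
                          instance_type="m4.large", instance_count=3):
    """Build the keyword arguments for the EMR run_job_flow call."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,  # placeholder EMR release
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # Keep the cluster running until it is stopped explicitly.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default instance profile
        "ServiceRole": "EMR_DefaultRole",      # assumed default service role
    }


def start_cluster(client, request):
    """Start the cluster and return its job flow (cluster) id."""
    return client.run_job_flow(**request)["JobFlowId"]


def stop_cluster(client, cluster_id):
    """Terminate the cluster so it stops accruing cost."""
    client.terminate_job_flows(JobFlowIds=[cluster_id])
```

Here `client` would be `boto3.client("emr")`. Every key in that request corresponds to something you would otherwise have to look up and type correctly, which is the effort the visual component saves.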
On the Advanced settings tab, you can provide other important parameters such as security groups, or whether the cluster needs to be in a VPC subnet. More importantly, on the Advanced settings tab you can specify “steps”, which are predefined programs or processes that need to be executed on the cluster after it is started.
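In hand-written terms, these advanced options roughly correspond to a few extra fields on the EMR `run_job_flow` request. The sketch below assumes a Spark step launched through `command-runner.jar`; the helper names and all identifiers passed to them (subnet, security groups, script location) are hypothetical, and this is not a description of how Talend builds its request.

```python
import copy


def spark_step(name, script_uri):
    """Describe one step that runs a Spark script once the cluster is up."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
        },
    }


def with_network_and_steps(request, subnet_id, security_group_ids, steps):
    """Return a copy of a run_job_flow request with subnet, extra security groups and steps."""
    updated = copy.deepcopy(request)  # leave the caller's request untouched
    instances = updated.setdefault("Instances", {})
    instances["Ec2SubnetId"] = subnet_id
    instances["AdditionalMasterSecurityGroups"] = list(security_group_ids)
    instances["AdditionalSlaveSecurityGroups"] = list(security_group_ids)
    updated["Steps"] = list(steps)
    return updated
```

The result can be passed to `run_job_flow` in the same way as a plain request, so the steps run on the cluster after it starts.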
All this gives you a fast and easy way to dynamically spin up and spin down a very powerful Hadoop data processing environment without writing a single line of code. I did not write one line of CLI to create the cluster, nor did I need to log in to AWS to manually start up or change any settings on the cluster. All of these capabilities are open and available in Talend Open Studio for Big Data. So check out your free version today by downloading a copy here.
In the next installment of this blog, we’ll take a look at the cool things you can do with AWS Redshift and EMR using Talend Integration Cloud.