Machine learning has become a major focus in recent years, and with data scientists, data miners, predictive modellers (or *whatever new job title may emerge*) operating at the cutting edge of technology, it must not be forgotten that machine learning needs to be implemented in a way that helps solve real business problems.
In day-to-day machine learning (ML), and in the quest to deploy the knowledge gained, we typically encounter three main problems (though they are not the only ones).
Data Quality – Data from multiple sources across multiple time frames can be difficult to collate into clean and coherent datasets that will yield the maximum benefit from machine learning. Typical issues include missing data, inconsistent data values, autocorrelation and so forth.
a. These issues matter because they affect the statistical properties of the datasets and interfere with the assumptions that algorithms make when run against dirty data. This reduces the effectiveness of the models and thus the potential return on investment.
Business Relevance – While the technology underpinning the machine learning revolution has been progressing more rapidly than ever, much of its application today occurs without much thought given to business value. The process of defining business problems and translating them into analytical problems (that might or might not be solved by machine learning) through rigorous frameworks such as CRISP-DM and SEMMA seems to have taken a back seat to the unstoppable gadgetry of technologies (R, Scikit-Learn, Spark ML, TensorFlow, SAS, SPSS, Julia, FlinkML, Java, etc.).
a. For example, a customer churn model built with deep learning techniques might provide fantastic prediction accuracy, but at the expense of interpretability and of understanding how the model derived its answer. The business may have originally wanted a high-accuracy model as well as an understanding of why customers churn. The original objective may have been to gain behavioral insight and improve interactions with the customer rather than to make critical decisions based on trust in a black box of code.
Operationalizing Models – This is relevant to the business because, once models have gone through the build and tuning cycle, it is critical to deploy the results of the machine learning process into the wider business. This is a difficult bridge to cross, as predictive modelers are typically not IT solution experts and vice versa.
a. Closing this gap between disparate skill sets is necessary so that the benefits of machine learning can be reused by business applications downstream. For example, think of an inbound customer feedback system that routes complaints and feedback to the correct channel and consultant with the explicit aim of churn prevention. This could be achieved through real-time interaction with an NLP + neural network pipeline, all wrapped in a neat REST API.
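To make the data quality problem above more concrete, here is a minimal sketch in plain Python of how missing values, inconsistent labels and lag-1 autocorrelation might be measured. The function names are hypothetical and chosen only for illustration; a real project would more likely rely on a dedicated profiling or data quality tool.

```python
from typing import Dict, List, Optional, Sequence


def missing_fraction(values: Sequence[Optional[float]]) -> float:
    """Share of entries that are missing (represented here as None)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def inconsistent_labels(labels: Sequence[str]) -> Dict[str, List[str]]:
    """Find labels that differ only by case or surrounding whitespace."""
    groups: Dict[str, set] = {}
    for raw in labels:
        groups.setdefault(raw.strip().lower(), set()).add(raw)
    # Keep only normalized labels that occur under more than one spelling.
    return {key: sorted(raws) for key, raws in groups.items() if len(raws) > 1}


def lag1_autocorrelation(series: Sequence[float]) -> float:
    """Sample autocorrelation between each value and the one that follows it."""
    n = len(series)
    if n < 2:
        return 0.0
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        return 0.0  # a constant series carries no usable signal here
    numer = sum((series[t] - mean) * (series[t + 1] - mean) for t in range(n - 1))
    return numer / denom
```

A missing fraction well above zero, several spellings of the same label, or an autocorrelation far from zero would each be a hint that the dataset needs attention before a model is trained on it.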
Moving Machine Learning into Business Application
In this blog, we will focus on the operational aspect of data science and on how Talend can assist this process by bringing together IT and data science, so that critical machine learning models can be deployed seamlessly to downstream business applications, bridging the skill gap between data scientists and IT developers.
To begin, the scenario will be a simple one involving three departments:
- Data science
The goal will be to provide movie recommendations to the users of WebFlicks (our nice fictional version of a real giant). Management has hypothesized that spreading consumption across a wider range of WebFlicks content will encourage customers to consume content that they ordinarily wouldn’t have, for various reasons (they weren’t aware of the content, they typically didn’t like the genre, etc.).
Using this tactic will ideally increase the daily average of minutes spent consuming content and thus drive loyalty in order to reduce the threat of churn to the competing service NetFlink. The centerpiece of this offering will be a recommendation engine that will provide movie recommendations to customers.
The IT developer coordinates with the data scientist, and together they agree that the best way forward is to develop the ALS (Alternating Least Squares) model using Spark and Apache Zeppelin and then persist the ML model to a Parquet file on the enterprise Big Data platform. The developer can then use Talend to consume this model file in a wider data pipeline that can be deployed downstream for the business to consume easily.
Please see below for the code snippet of a very basic ALS model written in PySpark, trained and persisted in Parquet format using Spark’s built-in persistence functionality. The dataset used is the MovieLens dataset (available at https://grouplens.org/datasets/movielens/).
Machine Learning Made Easy
In this scenario, the data scientist’s tool of choice is Apache Zeppelin, so they can leverage their expertise in their preferred technology. As long as the model can be persisted in a format that IT can consume, the requirements are satisfied.
Talend then simply uses this model file in a visual data pipeline that scores the input data (customer A). Talend generates the entire pipeline as transparent Spark code under the hood, without the IT developer having to write a single line of that code themselves.
In the screenshot below we can see that there are 3 recommended movies for Customer A (Plan 9 from Outer Space, Safe, and Spun). In this scenario, the recommendations are also being written to HDFS for archiving and to MapR DB to power a business application. These are just examples; the target could be virtually any relational database as well as many different file formats. In Talend, we also have the capability to build a REST endpoint over this flow so that other applications can connect to this recommendation pipeline in real time.
In contrast, had the IT developer not been using Talend, they would have needed in-depth knowledge of the Spark APIs (a time-consuming investment in a project with tight timelines) to implement a custom data pipeline in a similar fashion. Talend has not only increased developer productivity here by reducing barriers to entry and development time, but also increased maintainability through the powerful visual data pipeline, which can be inherited and enhanced by other developers over time without a costly handover or training process.
In this exercise, Talend’s extensibility has enabled IT and data science to collaborate by encouraging them to focus on their respective areas of expertise, delivering a critical capability to the business and powering its strategy going forward.
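To give a feel for what a REST endpoint over the recommendation flow offers a consuming application, here is a minimal hand-written sketch using only the Python standard library. It is not how Talend builds the endpoint. The URL layout, the `route` and `make_server` helpers and the in-memory lookup table are assumptions for illustration, with Customer A's three movies taken from the example above.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical stand-in for the output of the scoring pipeline, keyed by customer id.
RECOMMENDATIONS = {"A": ["Plan 9 from Outer Space", "Safe", "Spun"]}


def route(path):
    """Map a request path to an (HTTP status, JSON-serializable payload) pair."""
    prefix = "/recommendations/"
    if not path.startswith(prefix):
        return 404, {"error": "not found"}
    customer_id = path[len(prefix):]
    movies = RECOMMENDATIONS.get(customer_id)
    if movies is None:
        return 404, {"error": "unknown customer"}
    return 200, {"customer": customer_id, "movies": movies}


class RecommendationHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, payload = route(self.path)
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; a real service would log requests


def make_server(port=0):
    """Create a server bound to localhost; port 0 lets the OS pick a free port."""
    return ThreadingHTTPServer(("127.0.0.1", port), RecommendationHandler)
```

Calling `make_server(8080).serve_forever()` would start the service locally, after which a `GET` request to `/recommendations/A` would return Customer A's movies as JSON. The port number is arbitrary.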
About the Author
As a Solution Engineer at Talend, Marko brings years of experience working with customers across multiple industries and helping them on their journey to becoming data-driven businesses. In recent years Marko has focused on data science, but he now takes a holistic view of data as an imperative asset, from collection through to deployment and application in the wider business. Outside of the office, he likes to spend time with his family as well as playing soccer and table tennis and cycling.