Apache Spark Data Engineering Data Science Deep Learning Machine Learning Tech

Distributed Deep Learning on Apache Spark with Keras

Distributed Deep Learning on Apache Spark with Keras

Deep learning has been shown to produce highly effective machine studying fashions in a various group of fields. A number of the most fascinating are: pharmaceutical drug discovery [1]detection of illegal fishing cargo [2]mapping dark matter [3]tracking deforestation within the Amazon [4]taxi destination prediction [5]predicting carry and grasp movements from EEG recordings [6]and medical analysis for cancer [7, 8]. Distributed deep learning allows for web scale dataset sizes, as exemplified by corporations like Facebook, Google, Microsoft, and different large enterprises. This blog submit demonstrates how any organization of any measurement can leverage distributed deep learning on Spark because of the Qubole Knowledge Service (QDS).

This demonstration makes use of the Keras [9] framework for describing the construction of a deep neural network, and subsequently leverages the Dist-Keras [10] framework to realize knowledge parallel mannequin coaching on Apache Spark. Keras was chosen in large part resulting from it being the dominant library for deep studying on the time of this writing [12, 13, 14].

The Aim: Distributed Deep Learning Integrated With Spark ML Pipelines

Upon completion of this blog submit, an enterprising knowledge scientist ought to be capable of prolong this demonstration to their software particular modeling needs. Within a Qubole pocket book [19]you need to be capable of cross validate your deep neural networks utilizing the Spark ML Pipeline interface [18], with an software particular parameter grid just like the following

df_train = process_csv(“/absolutely/certified/path/to/training/knowledge”)
df_test = process_csv(“/absolutely/qualified/path/to/check/knowledge”)

param_grid = tuning.ParamGridBuilder()
.baseOn([‘regularizer’, regularizers.l1_l2])
.addGrid(‘activations’, [[‘tanh’, ‘relu’]])
.addGrid(‘initializers’, [[‘glorot_normal’,
.addGrid(‘layer_dims’, [[input_dim, 2000, 300, 1]])
.addGrid(‘metrics’, [[‘mae’]])
.baseOn([‘learning_rate’, 1e-2])
.baseOn([‘reg_strength’, 1e-2])
.baseOn([‘reg_decay’, 0.25])
.baseOn([‘lr_decay’, 0.90])
.addGrid(‘dropout_rate’, [0.20, 0.35, 0.50, 0.65, 0.80])
.addGrid(‘loss’, [‘mse’, ‘msle’])

estimator = DistKeras(trainers.ADAG,
‘batch_size’: 256,
‘communication_window’: three,
‘num_epoch’: 10,
‘num_workers’: 50,

evaluator = evaluation.RegressionEvaluator(metricName=’r2′)

cv_estimator = tuning.CrossValidator(estimator=estimator,
cv_model = cv_estimator.match(df_train)

df_pred_train = cv_model.rework(df_train)
df_pred_test = cv_model.rework(df_test)

In an effort to make your cross validation work as illustrated above, you have to first configure your Qubole cluster [20]and secondly arrange your Qubole pocket book [19]. The instructions for doing so are contained within the remainder of this weblog.

Configuring Your QDS Cluster

Add the following strains to your node_boostrap script

# mechanically installs latest model of Keras as dependency
pip install dist-keras
# for GPU clusters, swap out default dependency tensorflow
# with tensorflow for GPU nodes
pip uninstall tensorflow
pip set up tensorflow-gpu

and restart your cluster. Word the default back-end for Keras is Tensorflow. It supports any of the following back-ends as properly: CNTK, MXNET, Theano [15, 16]. To use any of the opposite back-ends, you need to pip install them in the node_bootstrap script and subsequently inform Keras to which back-end to modify [17].

Setting Up Your QDS Pocket book

First import the required libraries

from keras import layers, fashions, optimizers, regularizers, utils
from pyspark.ml import analysis, function, tuning
from distkeras import predictors, trainers
from pyspark.sql import features, varieties
from pyspark import ml
import numpy as np
import matplotlib
import StringIO

after which you must outline these wrappers to tightly combine with Spark ML pipelines [18]. The wrappers are taken immediately from an open supply gist [11].

class DistKeras(ml.Estimator):

def __init__(self, *args, **kwargs):
self.__trainer_klass = args[0] self.__trainer_params = args[1] self.__build_trainer(**kwargs)
tremendous(DistKeras, self).__init__()

def __build_keras_model(klass, *args, **kwargs):
loss = kwargs[‘loss’] metrics = kwargs[‘metrics’] layer_dims = kwargs[‘layer_dims’] hidden_activation, output_activation = kwargs[‘activations’] hidden_init, output_init = kwargs[‘initializers’] dropout_rate = kwargs[‘dropout_rate’] alpha = kwargs[‘reg_strength’] reg_decay = kwargs[‘reg_decay’] reg = kwargs[‘regularizer’] keras_model = fashions.Sequential()
for idx in vary(1, len(layer_dims)-1, 1):
alpha *= reg_decay
return keras_model

def __build_trainer(self, *args, **kwargs):
loss = kwargs[‘loss’] learning_rate = kwargs[‘learning_rate’] lr_decay = kwargs[‘lr_decay’] keras_optimizer = optimizers.SGD(learning_rate, decay=lr_decay)
keras_model = DistKeras.__build_keras_model(**kwargs)
self._trainer = self.__trainer_klass(keras_model, keras_optimizer,
loss, **self.__trainer_params)

def _fit(self, *args, **kwargs):
data_frame = args[0] if len(args) > 1:
params = args[1] self.__build_trainer(**params)
keras_model = self._trainer.practice(data_frame)
return DistKerasModel(keras_model)

class DistKerasModel(ml.Mannequin):

def __init__(self, *args, **kwargs):
self._keras_model = args[0] self._predictor = predictors.ModelPredictor(self._keras_model)
super(DistKerasModel, self).__init__()

def _transform(self, *args, **kwargs):
data_frame = args[0] pred_col = self._predictor.output_column
preds = self._predictor.predict(data_frame)
return preds.withColumn(pred_col,

cast_to_double = features.udf(lambda row: float(row[0]), varieties.DoubleType())

Last but not least, you need to define some essential helper features, starting with the present() perform which displays arbitrary and generic matplotlib figures. This perform is tailored from the Qubole blog on integrating the alternate library Plotly into our notebooks [22].

# should do before importing pyplot or pylab
from matplotlib import pyplot as plt

def show(fig):
image = StringIO.StringIO()
fig.savefig(image, format=’svg’)

“+ picture.buf +”


One other necessary helper perform is process_csv() which automates the highly redundant activity of creating a knowledge body with renamed columns (corresponding to ‘label’ for the label column) and with excluded columns (resembling unused ID columns) from a CSV file in cloud storage [21].

def process_csv(fully_qualified_path, columns_renamed=tuple(),
excluded_columns=tuple(), num_workers=None):
if num_workers is None:
increase NotImplementedError

excluded_columns = frozenset(excluded_columns)
data_frame = sqlContext.read.format(‘com.databricks.spark.csv’)
.options(header=’true’, inferSchema=’true’)
for (old_name, new_name) in columns_renamed:
data_frame = data_frame.withColumnRenamed(old_name, new_name)
data_frame = data_frame.repartition(num_workers)

feature_columns = tuple(frozenset(data_frame.columns)
transformer = function.VectorAssembler(inputCols=feature_columns,
data_frame = transformer.rework(data_frame)

return data_frame

Now you’re ready to configure, practice, and consider any distributed deep studying mannequin described in Keras!

About The Writer

Horia Margarit is a profession knowledge scientist with business expertise in machine learning for digital media, shopper search, cloud infrastructure, life sciences, and shopper finance business verticals [23]. His experience is in modeling and optimization on internet scale datasets, particularly leveraging distributed deep studying among other methods. He earned dual Bachelors degrees in Cognitive and Pc Science from UC Berkeley, and a Grasp’s diploma in Statistics from Stanford University.


  1. Deep Learning How I Did It: Merck 1st place interview
  2. The Nature Conservancy Fisheries Monitoring Competition, 1st Place Winner’s Interview: Workforce ‘In the direction of Strong-Optimum Learning of Studying’
  3. DeepZot on Darkish Matter: How we gained the Mapping Darkish Matter challenge
  4. Planet: Understanding the Amazon from Area, 1st Place Winner’s Interview
  5. Taxi Trajectory Winners’ Interview: 1st place, Staff ?
  6. Grasp-and-Carry EEG Detection Winners’ Interview: third place, Workforce HEDJ
  7. Intel & MobileODT Cervical Most cancers Screening Competitors, 1st Place Winner’s Interview: Staff ‘In the direction of Empirically Secure Coaching’
  8. 2017 Knowledge Science Bowl, Predicting Lung Cancer: 2nd Place Answer Write-up, Daniel Hammack and Julian de Wit
  9. Keras Documentation
  10. Dist-Keras Documentation
  11. Dist-Keras Spark ML Pipelines
  12. Examine Keras and Pytorch’s reputation and exercise
  13. Examine Keras and Caffe’s reputation and exercise
  14. Examine Keras and Theano’s reputation and exercise
  15. Keras shoot-out: TensorFlow vs MXNet
  16. The Seek for the Quickest Keras Deep Studying Backend
  17. Keras: Switching from one backend to a different
  18. Spark ML Pipelines
  19. Introduction To Qubole Notebooks
  20. Configuring A Spark Cluster
  21. The Cloud Advantage: Decoupling Storage And Compute
  22. Creating Custom-made Plots In Qubole Notebooks
  23. Writer’s LinkedIn Profile