Tutorial 2: Machine Learning

How to use this tutorial

This tutorial is also available in Jupyter notebook format. To access and run the Jupyter notebook version of the tutorial, please sign up for free developer access by following instructions at https://github.com/juliustechco/juliusgraph.
Additional resources (video demos & blogs) are available at http://juliustech.co.
To report bugs or request new features, please raise an issue here. To schedule a live demo, please go to http://juliustech.co. Please check out this FAQ page or email us at info@juliustech.co for other general inquiries.
This tutorial is copyrighted by Julius Technologies, its use is governed by the terms of use.

This tutorial shows how to use the Julius Graph Engine to set up the training and validation of a machine learning model. We will compare several different ML models to predict (or postdict) the survival of Titanic passengers using the classic Titanic data set.

Julius provides a DataScience package, which contains a rich set of functionalities for data sourcing, cleansing, and machine learning. In this tutorial, we will show how to use the DataScience package to quickly build a transparent and sophisticated ML pipeline. This tutorial broadly follows the steps of a data scientist when building a new ML model.

1. Data Processing

1.1 Data Sourcing & Visualization

A data scientist typically starts their project by exploring and visualizing data of various sources. Julius provides a rich set of connectors to multiple data sources and formats, such as CSV, web url, relational databases, hadoop or other NoSQL Databases, etc. Julius also offers many data visualization tools in its interactive web UI.

We start by including the necessary Julia and Julius packages and set up some basic configurations.

# Julia packages
using Base.CoreLogging
using DataFrames, StatsBase

# Julius Packages
using GraphEngine: RuleDSL, GraphVM
using DataScience, AtomExt, GraphIO

# turn off informational logging output
disable_logging(CoreLogging.Info)

# extend the number of displayed columns in Jupyter notebooks
ENV["COLUMNS"] = 100;

# the project is used for web UI display
config = RuleDSL.Config(:project => "Titanic");

The dataset can be loaded from a url or a local CSV file via rules in the ds namespace, which is part of Julius' DataScience package. The line commented out is a rule to load the same data from a URL.

rawsrc = RuleDSL.@ref ds.csvsrc("../data/titanic.csv", true; label="raw csv");
# rawsrc = RuleDSL.@ref ds.urlsrc("https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/train.csv", true; label="raw url")

ds:csvsrc/raw csv

The first thing a data scientist often does is to get a summary of the dataset. The following cell shows how it can be done using the ds.datasummary rule in the DataScience package.

rawsummary = RuleDSL.@ref ds.datasummary(rawsrc; label="data summary")

gs1 = GraphVM.createlocalgraph(config, RuleDSL.GenericData());
GraphVM.calcfwd!(gs1, Set([rawsummary]));

The data summary results can be retrieved using the GraphVM.getdata method. The data cached in individual graph nodes are all vectors. The last argument 1 is optional, as it selects a given element from the data vector of the node. Without it, the entire vector will be returned.

RuleDSL.getdata(gs1, rawsummary, 1)

12 rows × 7 columns

	variable	mean	min	median	max	nmissing	eltype
	Symbol	Union…	Any	Union…	Any	Int64	Type
1	PassengerId	446.0	1	446.0	891	0	Int64
2	Survived	0.383838	0	0.0	1	0	Int64
3	Pclass	2.30864	1	3.0	3	0	Int64
4	Name		Abbing, Mr. Anthony		van Melkebeke, Mr. Philemon	0	String
5	Sex		female		male	0	String7
6	Age	29.6991	0.42	28.0	80.0	177	Union{Missing, Float64}
7	SibSp	0.523008	0	0.0	8	0	Int64
8	Parch	0.381594	0	0.0	6	0	Int64
9	Ticket		110152		WE/P 5735	0	String31
10	Fare	32.2042	0.0	14.4542	512.329	0	Float64
11	Cabin		A10		T	687	Union{Missing, String15}
12	Embarked		C		S	2	Union{Missing, String1}

1.2 Data Cleansing & Imputation

We observe that some columns in the raw data set have missing values. Data imputation and cleansing is the next step of the workflow. Julius' DataScience package provides common data imputation methods, which can be easily invoked using the ds.fillmissing rule with the desired imputation method for each missing field, i.e., we use median value of Age of all passengers for any missing Ages, and the mode value (which is true) for any missing Embarked. The rule ds.fillmissing is generic, it can use any Julia method to fill in missing values, e.g. the StatsBase.median and StatsBase.mode below.

After data imputation, we recompute the data summary, with all the missing values for both Age and Embarked features populated.

cleansrc = RuleDSL.@ref ds.fillmissing(
    rawsrc, Dict(:Age => StatsBase.median, :Embarked => StatsBase.mode); label="imputation"
);

cleansummary = RuleDSL.@ref ds.datasummary(cleansrc; label="clean summary")
GraphVM.calcfwd!(gs1, Set([cleansummary]))
RuleDSL.getdata(gs1, cleansummary, 1)

12 rows × 7 columns

	variable	mean	min	median	max	nmissing	eltype
	Symbol	Union…	Any	Union…	Any	Int64	Type
1	PassengerId	446.0	1	446.0	891	0	Int64
2	Survived	0.383838	0	0.0	1	0	Int64
3	Pclass	2.30864	1	3.0	3	0	Int64
4	Name		Abbing, Mr. Anthony		van Melkebeke, Mr. Philemon	0	String
5	Sex		female		male	0	String7
6	Age	29.3616	0.42	28.0	80.0	0	Float64
7	SibSp	0.523008	0	0.0	8	0	Int64
8	Parch	0.381594	0	0.0	6	0	Int64
9	Ticket		110152		WE/P 5735	0	String31
10	Fare	32.2042	0.0	14.4542	512.329	0	Float64
11	Cabin		A10		T	687	Union{Missing, String15}
12	Embarked		C		S	0	String1

1.3 Feature Engineering

Once the data scientist is happy with the results of the data cleansing and imputation, the next step is feature engineering, which is to add or remove columns from the data set.

In the Titanic data set, we want to drop the columns that should have no correlation to a passenger's survival outcome, such as a passenger's ticket id, name and IDs. Including irrelevant data in the training of a ML model may degrade its performance. The Cabin has also been dropped because it has too many missing values to be useful.

Here we create two additional features: 1) the z value of the ticket fare, which is the difference of a passenger's ticket price from the mean price in the unit of standard deviation of all ticket prices; 2) the total number of relatives onboard for a given passenger, which is the sum of the number of siblings (:SibSp) and parents/children (:Parch) onboard.

Feature engineering is supported generically by a rule ds.coltransform in the DataScience package. The following cell shows its usage. The feature engineering can be easily entered as formulae operating on the columns (named by those variables start with :).

newfeatures = quote
    :Zfare = (:Fare .- mean(:Fare)) ./ std(:Fare)
    :Relatives = :SibSp .+ :Parch
end

dropfeatures = [:Cabin, :Ticket, :PassengerId, :Name]
features = RuleDSL.@ref ds.coltransform(cleansrc, :feature, newfeatures, dropfeatures; label="feature eng")

featuresummary = RuleDSL.@ref ds.datasummary(features; label="feature summary")
GraphVM.calcfwd!(gs1, Set([featuresummary]));

The data summary results after feature engineering is therefore:

RuleDSL.getdata(gs1, featuresummary, 1)

10 rows × 7 columns

	variable	mean	min	median	max	nmissing	eltype
	Symbol	Union…	Any	Union…	Any	Int64	DataType
1	Survived	0.383838	0	0.0	1	0	Int64
2	Pclass	2.30864	1	3.0	3	0	Int64
3	Sex		female		male	0	String7
4	Age	29.3616	0.42	28.0	80.0	0	Float64
5	SibSp	0.523008	0	0.0	8	0	Int64
6	Parch	0.381594	0	0.0	6	0	Int64
7	Fare	32.2042	0.0	14.4542	512.329	0	Float64
8	Embarked		C		S	0	String1
9	Zfare	-1.76938e-17	-0.648058	-0.35719	9.66174	0	Float64
10	Relatives	0.904602	0	0.0	10	0	Int64

The entire data processing steps we performed so far can be visualized interactively in the Julius web UI by clicking the link below. All the intermediate data is accessible from the web UI.

# start data server for web UI
gss = Dict{String,RuleDSL.AbstractGraphState}()
port = GraphVM.drawdataport()
@async GraphVM.startresponder(gss, port)

svg = GraphIO.postlocalgraph(gss, gs1, port, true; key="data");
GraphIO.postsvg(svg, "titanic_1.svg")

view graph data at http://127.0.0.1:8080/ui/depgraph.html?dataurl=127.0.0.1:7321_data
starting data service at port 7321

Figure 1 - Data Sourcing, Cleansing & Feature Engineering

2. Experiment with multiple ML models

Once the data scientist is happy with the results of data cleansing, imputation and feature engineering, the next step is to try multiple ML models and see how they perform on the data set.

Julius Graph Engine can interop with existing Python, Java, C++, .Net and R libraries via the generic Atom interface, making it seamless to access the rich set of ML models in these ecosystems.

For example, the following rules leverage the Python ML libraries, such as sklearn and xgboost, by using the PyTrain atom provided in the DataScience package. The first parameter of the PyTrain atom is the full name of the Python ML class to use. The second parameter is a Dictionary with the corresponding parameters/options/arguments to that Python ML class.

@addrules ds begin
    classifiertrain(model::Val{:SVC}, options::Dict, traindat::NodeRef) = PyTrain["sklearn.svm.SVC", options](traindat...)
    classifiertrain(model::Val{:DecisionTree}, options::Dict, traindat::NodeRef) = PyTrain["sklearn.tree.DecisionTreeClassifier", options](traindat...)
    classifiertrain(model::Val{:RandomForest}, options::Dict, traindat::NodeRef) = PyTrain["sklearn.ensemble.RandomForestClassifier", options](traindat...)
    classifiertrain(model::Val{:AdaBoost}, options::Dict, traindat::NodeRef) = PyTrain["sklearn.ensemble.AdaBoostClassifier", options](traindat...)
    classifiertrain(model::Val{:MLPC}, options::Dict, traindat::NodeRef) = PyTrain["sklearn.neural_network.MLPClassifier", options](traindat...)
    classifiertrain(model::Val{:GaussianNB}, options::Dict, traindat::NodeRef) = PyTrain["sklearn.naive_bayes.GaussianNB", options](traindat...)
    classifiertrain(model::Val{:XGBoost}, options::Dict, traindat::NodeRef) = PyTrain["xgboost.XGBClassifier", options](traindat...)
    classifiertrain(model::Symbol, options::Dict, traindat::NodeRef; label = "$model-train") = Alias(classifiertrain(val(model), options, traindat))
end

We now proceed to train multiple ML models and compare their in-sample and out-of-sample performance using various performance metrics, such as Gini. We first define the list of models we want to compare and their hyperparameters.

The ML models are trained to predict the survival probability of Titanic passengers. The target variable name for ML prediction is also given below.

models = [
    :DecisionTree => Dict(:min_samples_leaf => 0.1),
    :LogisticRegression => Dict(:solver => "saga", :max_iter => 200),
    :AdaBoost => Dict(),
    :XGBoost => Dict(),
    :GradientBoost => Dict(:min_samples_leaf => 0.1),
    :RandomForest => Dict(:min_samples_leaf => 0.1),
    :GaussianNB => Dict(),
];

yname = :Survived;

To divide the input dataset for training and validation, we use the randrowsel rule from the DataScience package, which randomly selects a portion of the input data as the validation set, while the rest is used for training. The parameter 1/3 is the fraction of rows that are reserved for validation.

valind = RuleDSL.@ref ds.randrowsel(cleansrc, 1 / 3);

DataScience.ClassifierSpec is a generic struct that holds all the configurations for training and validating binary classifiers, such as those we have defined so far. It is more convenient and readable to pass a ClassifierSpec object to a rule than having to pass five separate parameters. The ClassifierSpec can be used for any binary classifier problems or data sets. The last parameter to the ClassifierSpec constructor is a tuple representing the feature engineering.

cspec = DataScience.ClassifierSpec(models, cleansrc, yname, valind, (:feature, newfeatures, dropfeatures));

Now we can proceed and use the ds.classifiermetrics rule, which is also part of DataScience, to compute in-sample and out-of-sample metrics for each model.

metrics = [:gini, :roc, :accuracyrate, :accuracygraph]
basem = RuleDSL.@ref ds.classifiermetrics(cspec, metrics)
gs2 = GraphVM.createlocalgraph(config, RuleDSL.GenericData())
@time GraphVM.calcfwd!(gs2, Set([basem]));

 27.113508 seconds (38.96 M allocations: 2.223 GiB, 4.10% gc time, 37.37% compilation time)

We can retrieve in-sample and out-of-sample performance metrics. For example, the GINIs:

giniref = RuleDSL.@ref ds.classifiermetric(cspec, :gini)
gini = GraphVM.getdata(gs2, hash(giniref), 1)
ginidf = DataFrame(model=gini[:InSample][!, :Model], InSample_GINI=gini[:InSample][!, 2], OutSample_GINI=gini[:OutSample][!, 2])

7 rows × 3 columns

	model	InSample_GINI	OutSample_GINI
	String	Float64	Float64
1	DecisionTree	0.72731	0.623504
2	LogisticRegression	0.517883	0.498775
3	AdaBoost	0.855168	0.621244
4	XGBoost	0.997058	0.673669
5	GradientBoost	0.851753	0.653556
6	RandomForest	0.706872	0.601507
7	GaussianNB	0.69204	0.60796

The entire data and logic can be visualized by clicking on the URL below.

svg = GraphIO.postlocalgraph(gss, gs2, port, false; key="ml");
GraphIO.postsvg(svg, "titanic_2.svg")

view graph data at http://127.0.0.1:8080/ui/depgraph.html?dataurl=127.0.0.1:7321_ml

Figure 2 - Machine Learning

The entire ML pipeline includes all the steps we have defined so far, such as data sourcing, imputation, feature engineering, training of multiple ML models and the computation and reporting of performance metrics. A data scientist only needs to invoke a few rules defined in DataScience package to construct this realistic ML pipeline, with a total of 83 nodes in the graph, as shown below.

dg = GraphVM.mygraph(gs2)
println(length(dg._items))

3. Hyperparameter Tuning

Once a data scientist narrows down the choice of ML models to a few, the next step is to select the optimal hyperparameters for these candidate ML models.

The Julius Graph Engine provides a generic rule hypertune for hyperparameter tuning of any ML model, which shows the power and expressiveness of high order rules.

For a given machine learning model, we can select a range for a set of hyperparameters and easily perform a grid search and report the corresponding metric results:

ht_1 = RuleDSL.@ref ds.hypertune(cspec, :XGBoost,       Dict(), :gini, :n_estimators => 50:50:200);
ht_2 = RuleDSL.@ref ds.hypertune(cspec, :AdaBoost,      Dict(), :gini, :n_estimators => 50:50:200);
ht_3 = RuleDSL.@ref ds.hypertune(cspec, :GradientBoost, Dict(), :gini, :n_estimators => 50:50:200, :min_samples_leaf => .05:.05:.2);
ht_4 = RuleDSL.@ref ds.hypertune(cspec, :RandomForest,  Dict(), :gini, :n_estimators => 50:50:200, :min_samples_leaf => .05:.05:.2);

The hypertune rule supports an arbitrary number of dimensions in parameters search. Additional search dimensions can be added to the ds.hypertune rule by appending extra pairs of hyperparameter => searchgrid to rule parameters. We can then wrap all the hyperparameter searches in a single node for convenience by means of an alias rule which uses the Alias atom:

tunings = RuleDSL.@ref ds.alias([ht_1, ht_2, ht_3, ht_4]; label="Hyperparameter Tuning")

ds:alias/Hyperparameter Tuning

Now proceed with the computation of all the defined hyperparameter tunings:

gs3 = GraphVM.createlocalgraph(config, RuleDSL.GenericData());
@time GraphVM.calcfwd!(gs3, Set([tunings]));

 16.385435 seconds (12.28 M allocations: 777.794 MiB, 2.56% gc time, 20.56% compilation time)

The following cell shows the resulting in-sample and out-of-sample GINI from the different hyperparameters for GradientBoost:

dat = GraphVM.getdata(gs3, hash(ht_3))
df = deepcopy(dat[1][:, 1:2])
df[!, :InSampleGINI] = dat[1][!, 3]
df[!, :OutSampleGINI] = dat[2][!, 3]
df

16 rows × 4 columns

	n_estimators	min_samples_leaf	InSampleGINI	OutSampleGINI
	Int64	Float64	Float64	Float64
1	50	0.05	0.795045	0.786926
2	50	0.1	0.763509	0.784488
3	50	0.15	0.726132	0.784183
4	50	0.2	0.688312	0.796119
5	200	0.05	0.905963	0.766254
6	200	0.1	0.84956	0.775599
7	200	0.15	0.815249	0.780933
8	200	0.2	0.747537	0.793377
9	100	0.05	0.850294	0.78718
10	100	0.1	0.806027	0.78012
11	100	0.15	0.767287	0.784742
12	100	0.2	0.716141	0.802976
13	150	0.05	0.887193	0.767574
14	150	0.1	0.832282	0.778241
15	150	0.15	0.794812	0.785301
16	150	0.2	0.729175	0.797288

A data scientist has to exercise sound judgment in selecting the optimal hyperparameter set, which may have to balance multiple objectives. The parameter set with the maximum out-of-sample GINI may not be the best choice. Often, it is better to choose the parameter set with similar in-sample and out-of-sample GINI to minimize the chance of overfitting.

The details of hyperparameter search can be visualized by clicking the url below.

svg = GraphIO.postlocalgraph(gss, gs3, port, false; key="hyper");
GraphIO.postsvg(svg, "titanic_3.svg")

view graph data at http://127.0.0.1:8080/ui/depgraph.html?dataurl=127.0.0.1:7321_hyper

Figure 2 - Machine Learning

4. Conclusions

It only takes a few lines of code in Julius to build a sophisticated Data and ML pipeline, by leveraging the existing rules and atoms provided by the DataScience package. Even though the titanic data set is small, the ML pipeline built in this tutorial is quite representative; it features the essential elements of a real world ML pipeline such as data cleansing, imputation, feature engineering, model performance monitoring and hyper parameter tuning.

The ML pipeline built by Julius is fully transparent, allowing data scientists to easily visualize and explore data in every intermediate step, all from Julius' web UI. Julius also offers full data lineage and explainability. A data scientist can easily query and trace how a piece of data is sourced, modified and used throughout the entire ML pipeline, making it easy to explain and audit the ML output.

In the next tutorial "distributed ML pipeline", we will show how to deal with very large data sets that do not fit into memory.

This page was generated using Literate.jl.