Starting a machine learning project is difficult because there is so much to think about. Most of the time there is little brain power left to think about code architecture. However, the machine learning pipeline does not differ much from one project to the next, so a sound code structure can be reused.
The goal of this post is to provide some insight into how we can structure code by looking at well-known machine learning libraries.
The Machine Learning Pipeline
Scikit Learn
Scikit Learn is probably the most well-known machine learning library out there. It is built on core Python tools such as NumPy, SciPy and matplotlib.
Key Design Points
- Scikit-Learn has three fundamental APIs: Estimator, Predictor, and Transformer.
  - Estimators: All learning algorithms implement the Estimator interface and expose a fit method. The instantiation of an Estimator (hyperparameters) is decoupled from the learning process (training data).
  - Predictors: Extend the Estimator and implement a predict method.
  - Transformers: Extend the Estimator and implement a transform method.
- All hyperparameters for estimators/transformers are public attributes.
  - This simplifies the interface by getting rid of get/set methods.
- Core data representations are based on NumPy multi-dimensional arrays.
  - This reduces the barrier to entry because there is no need to learn a new data class.
  - It ensures performance, since NumPy's core is implemented in optimized C.
- Easy composition through pipelines.
  - Uniform interfaces across core components allow chaining. This allows for code like the following:
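from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# X_train is assumed to be a numeric feature matrix defined elsewhere
my_pipeline = Pipeline([('imputer', SimpleImputer(strategy='median')),
                        ('std_scaler', StandardScaler())
                        ])
transformed_X_train = my_pipeline.fit_transform(X_train)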
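As a rough sketch of the design points above, the following defines a small custom transformer and chains it with a built-in scaler and a predictor. ClipOutliers, its lower and upper parameters, and the synthetic data are hypothetical and exist only for this illustration; the base classes, Pipeline, StandardScaler and LogisticRegression are standard scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ClipOutliers(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: clips each feature to learned percentiles."""

    def __init__(self, lower=1.0, upper=99.0):
        # Hyperparameters are set at instantiation, before any data is seen,
        # and stored as plain public attributes.
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        # Learning happens only here; learned state gets a trailing underscore.
        X = np.asarray(X, dtype=float)
        self.low_ = np.percentile(X, self.lower, axis=0)
        self.high_ = np.percentile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        # Apply the learned per-feature bounds to new data.
        return np.clip(np.asarray(X, dtype=float), self.low_, self.high_)


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Two transformers followed by a predictor, all chained through one interface.
clf = Pipeline([
    ("clip", ClipOutliers(lower=5.0, upper=95.0)),
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
clf.fit(X, y)
predictions = clf.predict(X)

# Hyperparameters stay readable as ordinary attributes.
clip_step = clf.named_steps["clip"]
print(clip_step.lower, clip_step.upper)
```

Because every step follows the same fit/transform/predict convention, the custom step slots into the pipeline without special handling.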
Skorch
The goal of skorch is to make it possible to use PyTorch with sklearn. skorch abstracts away the training loop, making a lot of boilerplate code obsolete. A simple net.fit(X, y) is enough.
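from skorch import NeuralNetClassifier
# The arguments (PyTorch module, training options) are elided here
net = NeuralNetClassifier(...)
net.fit(X_train, y_train)
net.predict(X_test)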
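To make concrete what "abstracting away the training loop" means, here is a minimal sketch that hides a hand-written gradient descent loop behind fit and predict. This is not skorch code and does not use PyTorch: TinyLogisticNet, its lr and max_epochs parameters, and the synthetic data are all hypothetical, and the model is a single logistic layer written with NumPy only.

```python
import numpy as np


class TinyLogisticNet:
    """Hypothetical single-layer classifier with a scikit-learn-style API."""

    def __init__(self, lr=0.1, max_epochs=200):
        # Hyperparameters only; no data is touched at construction time.
        self.lr = lr
        self.max_epochs = max_epochs

    def _forward(self, X):
        # Sigmoid applied to a linear layer.
        return 1.0 / (1.0 + np.exp(-(X @ self.weights_ + self.bias_)))

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        self.weights_ = np.zeros(X.shape[1])
        self.bias_ = 0.0
        # The training loop that the caller never has to write.
        for _ in range(self.max_epochs):
            error = self._forward(X) - y  # gradient of log loss w.r.t. logits
            self.weights_ -= self.lr * (X.T @ error) / len(y)
            self.bias_ -= self.lr * error.mean()
        return self

    def predict(self, X):
        # Threshold the predicted probability at 0.5.
        return (self._forward(np.asarray(X, dtype=float)) >= 0.5).astype(int)


# Synthetic, linearly separable data for illustration only.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(300, 2))
y_train = (X_train[:, 0] - X_train[:, 1] > 0).astype(int)

net = TinyLogisticNet(lr=0.5, max_epochs=300)
net.fit(X_train, y_train)
train_accuracy = (net.predict(X_train) == y_train).mean()
print(train_accuracy)
```

The calling code is reduced to the same fit and predict calls shown above, while the epoch loop and parameter updates stay inside the class.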