Home

This repo is deprecated; check out the latest Nailing Machine Learning Concepts.

The purpose of this repo is twofold:

The focus is on knowledge breadth, so this is more of a quick reference than in-depth study material. If you want to learn a specific topic in detail, please refer to other content, or reach out and I'd be happy to point you to materials I found useful.

I might add some topics from time to time but hey, this should also be a community effort, right? Any pull request is welcome!

Here are the categories:

Resume

The only advice I can give about resumes is to describe your past data science / machine learning projects in a specific, quantifiable way. Consider the following two statements:

Trained a machine learning system

and

Designed and deployed a deep learning model to recognize objects using Keras, TensorFlow, and Node.js. The model has 1/30 the size, 1/3 the training time, and 1/5 the inference time of traditional neural networks (e.g., ResNet), and converges 2x faster.

The second is much better because it quantifies your contribution and also highlights specific technologies you used (and therefore have expertise in). This would require you to log what you've done during experiments. But don't exaggerate.

Spend some time going over your resume / past projects to make sure you explain them well.

SQL

Difference between joins

back to top

Tools and Frameworks

The resources here are only meant to help you brush up on the topics rather than to make you an expert.

Spark

Using the PySpark API.

back to top

Statistics and ML In General

Project Workflow

Given a data science / machine learning project, what steps should we follow? Here's how I would tackle it:

back to top

Cross Validation

Cross-validation is a technique to evaluate predictive models by partitioning the original sample into a training set to train the model and a validation set to evaluate it. For example, k-fold cross-validation divides the data into k folds (or partitions), trains on k-1 folds, and evaluates on the remaining fold; this is repeated so that each fold serves as the validation set once. The result is k models/evaluations, which can be averaged to get an overall measure of model performance.
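
As a minimal sketch with scikit-learn (the dataset and estimator below are just placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cv=5 splits the data into 5 folds: train on 4, evaluate on the remaining 1,
# repeated so that every fold serves as the validation set exactly once.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```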

back to top

Feature Importance

back to top

Mean Squared Error vs. Mean Absolute Error

back to top

L1 vs L2 regularization

back to top

Correlation vs Covariance

back to top

Would adding more data address underfitting?

Underfitting happens when a model is not complex enough to learn from the data well. It is a problem with the model rather than with the data size, so adding more data generally does not help. A better way to address underfitting is to increase the model complexity (e.g., add higher-order terms to a linear model, increase the depth of tree-based methods, add more layers / neurons to a neural network, etc.).
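
As a rough sketch of this idea, adding polynomial terms to a linear regression on toy data (the data and the degree are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

linear = LinearRegression().fit(X, y)                 # a straight line underfits the sine curve
cubic = make_pipeline(PolynomialFeatures(degree=3),   # higher-order terms increase model complexity
                      LinearRegression()).fit(X, y)
print(linear.score(X, y), cubic.score(X, y))          # the more complex model fits noticeably better
```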

back to top

Activation Function

For neural networks

back to top

Bagging

To address overfitting, we can use an ensemble method called bagging (bootstrap aggregating), which reduces the variance of the underlying learning algorithm. Bagging can be applied to decision trees or other algorithms.

Here is a great illustration of a single estimator vs. bagging.
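
A minimal sketch with scikit-learn's `BaggingClassifier` (assuming a recent scikit-learn version; the dataset and settings are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # the base learner (parameter is `estimator` in scikit-learn >= 1.2)
    n_estimators=50,                     # number of bootstrap samples / trees to aggregate
    bootstrap=True,                      # sample instances with replacement
    random_state=0,
).fit(X, y)
print(bagging.score(X, y))
```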

back to top

Stacking

back to top

Generative vs discriminative

Given a training set, an algorithm like logistic regression or the perceptron algorithm (basically) tries to find a straight line—that is, a decision boundary—that separates the elephants and dogs. Then, to classify a new animal as either an elephant or a dog, it checks on which side of the decision boundary it falls, and makes its prediction accordingly.

Here’s a different approach. First, looking at elephants, we can build a model of what elephants look like. Then, looking at dogs, we can build a separate model of what dogs look like. Finally, to classify a new animal, we can match the new animal against the elephant model, and match it against the dog model, to see whether the new animal looks more like the elephants or more like the dogs we had seen in the training set.
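
A small sketch of the contrast using scikit-learn, with logistic regression as the discriminative model and Gaussian naive Bayes as the generative one (the dataset is a placeholder):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=4, random_state=0)

discriminative = LogisticRegression().fit(X, y)  # learns the decision boundary P(y | x) directly
generative = GaussianNB().fit(X, y)              # models P(x | y) and P(y) per class, then applies Bayes' rule
print(discriminative.score(X, y), generative.score(X, y))
```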

back to top

Parametric vs Nonparametric

back to top

Recommender System

back to top

Supervised Learning

Linear regression

back to top

Logistic regression

back to top

Naive Bayes

back to top

KNN

back to top

SVM

back to top

Decision tree

back to top

Random forest

Random forest improves on bagging by adding randomness: when constructing each tree, only a random subset of features is considered at each split (instances are often not subsampled). The benefit is that random forest decorrelates the trees.

For example, suppose we have a dataset with one very predictive feature and a couple of moderately predictive features. With bagged trees, most of the trees will use the very predictive feature in the top split, making most of the trees look similar and highly correlated. Averaging many highly correlated results does not reduce variance as much as averaging uncorrelated results. In random forest, each split considers only a subset of the features, which introduces more uncorrelated trees and therefore reduces the variance further.

I wrote a notebook to illustrate this point.

In practice, tuning a random forest means using a large number of trees (the more the better, subject to computation constraints) and setting min_samples_leaf (the minimum number of samples at a leaf node) to control tree size and overfitting. Always cross-validate the parameters.
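
A hedged sketch of that tuning advice with scikit-learn (the parameter grid and data are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=500, random_state=0),  # many trees
    param_grid={
        "max_features": ["sqrt", 0.5],    # size of the random feature subset considered per split
        "min_samples_leaf": [1, 5, 20],   # larger values -> smaller trees, less overfitting
    },
    cv=5,                                 # cross-validate the parameters
)
grid.fit(X, y)
print(grid.best_params_)
```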

back to top

Boosting Tree

How it works

Boosting builds an ensemble of weak learners in an iterative fashion. In each iteration, a new learner is added while all existing learners are kept unchanged. All learners are weighted based on their performance (e.g., accuracy), and after a weak learner is added the data are re-weighted: examples that are misclassified gain weight, while examples that are correctly classified lose weight. Thus, future weak learners focus more on the examples that previous weak learners misclassified.
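
The description above matches AdaBoost-style boosting; here is a minimal sketch with scikit-learn (assuming a recent version; settings are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

boost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # a weak learner (decision stump)
    n_estimators=100,                               # boosting iterations: one new learner each round
    learning_rate=0.5,                              # shrinks each learner's contribution
    random_state=0,
).fit(X, y)
print(boost.score(X, y))
```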

Difference from random forest (RF)

XGBoost (Extreme Gradient Boosting)

XGBoost uses a more regularized model formalization to control overfitting, which gives it better performance
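
A hedged sketch of those regularization knobs, assuming the `xgboost` Python package is installed (parameter values are illustrative only):

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, random_state=0)

model = XGBClassifier(
    n_estimators=200,
    max_depth=4,         # limit tree depth
    learning_rate=0.1,   # shrinkage applied to each tree
    reg_lambda=1.0,      # L2 regularization on leaf weights
    reg_alpha=0.0,       # L1 regularization on leaf weights
    subsample=0.8,       # row subsampling per tree
)
model.fit(X, y)
print(model.score(X, y))
```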

back to top

MLP

A feedforward neural network with multiple layers. Each layer can have multiple neurons, and each neuron in the next layer is a linear/nonlinear combination of all the neurons in the previous layer. To train the network, we backpropagate the errors layer by layer. In theory, an MLP can approximate any function.
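
A minimal sketch in Keras (layer sizes and input dimension are arbitrary placeholders):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(20,)),              # 20 input features (placeholder)
    layers.Dense(64, activation="relu"),    # hidden layer: combination of all previous neurons
    layers.Dense(64, activation="relu"),    # second hidden layer
    layers.Dense(1, activation="sigmoid"),  # binary output
])
# compile/fit handle backpropagation: errors flow backward layer by layer during training
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```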

back to top

CNN

The Conv layer is the building block of a convolutional network. It consists of a set of learnable filters (e.g., 5 * 5 * 3, width * height * depth). During the forward pass, we slide (or more precisely, convolve) each filter across the input and compute the dot products. Learning again happens when the network backpropagates the error layer by layer.

Initial layers capture low-level features such as edges and angles, while later layers learn combinations of the low-level features from the previous layers and can therefore represent higher-level features, such as shapes and object parts.
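
A minimal sketch in Keras of stacked Conv layers (input shape, filter counts, and class count are placeholders):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(32, 32, 3)),                            # width * height * depth
    layers.Conv2D(32, kernel_size=(5, 5), activation="relu"),   # 32 learnable 5 * 5 * 3 filters
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),   # later layers combine low-level features
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),                     # e.g., 10 classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```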

back to top

RNN and LSTM

RNN is another paradigm of neural network where we have different layers of cells, and each cell takes as input not only the cell from the previous layer but also the previous cell within the same layer. This gives RNN the power to model sequences.

This seems great, but in practice RNN barely works due to the exploding/vanishing gradient problem, which is caused by a series of multiplications of the same matrix. To solve this, we can use a variation of RNN called long short-term memory (LSTM), which is capable of learning long-term dependencies.

The math behind LSTM can be pretty complicated, but intuitively LSTM introduces gates (input, forget, and output) and an internal memory cell (state) that control how information flows through the sequence.

LSTM resembles human memory: it forgets old stuff (old internal state * forget gate) and learns from new input (input node * input gate).
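
A minimal sketch of an LSTM sequence classifier in Keras (vocabulary size, sequence length, and layer sizes are placeholders); the gates are internal to the LSTM layer:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(100,)),                         # sequences of 100 token ids (placeholder)
    layers.Embedding(input_dim=10000, output_dim=64),   # map token ids to dense vectors
    layers.LSTM(64),                                    # gated cell that learns long-term dependencies
    layers.Dense(1, activation="sigmoid"),              # binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```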

back to top

Unsupervised Learning

Clustering

scikit-learn implements many clustering algorithms. Below is a comparison adapted from its documentation.
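
As one concrete example, a minimal k-means sketch with scikit-learn on toy data (the other algorithms in the comparison share the same fit/predict API):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])        # cluster assignment for each sample
print(kmeans.cluster_centers_)    # learned centroids
```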

back to top

Principal Component Analysis

Here is a visual explanation of PCA
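
A minimal sketch with scikit-learn, projecting data onto its top principal components (the dataset is a placeholder):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)            # data projected onto the 2 directions of largest variance
print(pca.explained_variance_ratio_)   # fraction of variance each component captures
```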

back to top

Autoencoder

Generative Adversarial Network

back to top

Reinforcement Learning

[TODO]

Natural Language Processing

Tokenization

back to top

Stemming and lemmatization

back to top

N gram

back to top

Bag of Words

back to top

word2vec

back to top

System

Cron job

The software utility cron is a time-based job scheduler in Unix-like computer operating systems. People who set up and maintain software environments use cron to schedule jobs (commands or shell scripts) to run periodically at fixed times, dates, or intervals. It typically automates system maintenance or administration -- though its general-purpose nature makes it useful for things like downloading files from the Internet and downloading email at regular intervals.

Tools:

back to top

Linux

Using Ubuntu as an example.

back to top

Confession: some images are adapted from the internet without proper credit. If you are the author and this is an issue for you, please let me know.