Home

Awesome

Machine Learning

Matlab implementation of Machine Learning algorithms

Author

Rishi Dua http://github.com/rishirdua

Disclaimer

Problem 1: Logistic Regression

Problem 2: Locally Weighted Linear Regression

Consider a linear regression problem in which we want to weight different training examples differently. In this setting, the cost function can be written $J(\theta) = (X\theta - y)^T W (X\theta - y)$ for an appropriate diagonal matrix $W$. By taking the derivative and setting it to zero, generalize the normal equation to this weighted setting, and find the value of $\theta$ that minimizes the cost function in closed form as a function of $X$, $W$ and $y$. The files q2x.dat and q2y.dat contain the inputs $x^{(i)}$ and outputs $y^{(i)}$, respectively, for a regression problem, with one training example per row.
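Setting the gradient $\nabla_\theta J = 2X^T W (X\theta - y)$ to zero gives the weighted normal equation $\theta = (X^T W X)^{-1} X^T W y$. Although the repository is in Matlab, the closed form can be sketched in NumPy as below; the Gaussian weighting scheme and the bandwidth `tau` are common choices for locally weighted regression, not values fixed by the problem statement.

```python
import numpy as np

def lwlr_theta(X, y, query, tau=0.8):
    # Gaussian weights centered at the query point (a common choice;
    # the bandwidth tau is an assumption, not given by the problem)
    w = np.exp(-np.sum((X - query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equation: theta = (X^T W X)^{-1} X^T W y
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
```

A separate `theta` is computed for each query point, since the weights depend on where the prediction is made.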

Problem 3: Linear Regression with Polynomial Basis Functions

We ignore the last feature (car name) for this problem. The goal is to predict miles per gallon (mpg), given the values of remaining seven features. We will use the first 100 points as the training data and the remainder as the test data. You can ignore any examples with missing values for any of the features.

Implement linear regression with polynomial basis functions. Given the input feature vector $x = (x_1, x_2, \ldots, x_7)$, define a polynomial basis function of degree $d$ in which each variable appears only independently (no cross terms). In other words, define the basis function as $\phi(x) = (x_0, x_1, x_1^2, \ldots, x_1^d, x_2, x_2^2, \ldots, x_2^d, \ldots, x_7, x_7^2, \ldots, x_7^d)$, where $x_0$ denotes the intercept term. For each of the parts below, normalize your data (training and testing together) so that each feature has zero mean and unit variance. Remember to redo the normalization each time you learn a model for a different $d$.
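The basis expansion and normalization can be sketched as follows (in NumPy rather than Matlab, for illustration). One detail the text leaves open is whether to standardize the raw features or the expanded columns; the sketch standardizes each basis column and leaves the intercept column alone, which is one reasonable reading.

```python
import numpy as np

def poly_basis(X, d):
    # X: (n, 7) raw features -> design matrix [1, x1..x1^d, ..., x7..x7^d]
    cols = [np.ones((X.shape[0], 1))]
    for j in range(X.shape[1]):
        for p in range(1, d + 1):
            cols.append(X[:, j:j + 1] ** p)
    return np.hstack(cols)

def standardize(Phi):
    # zero mean, unit variance per column; the intercept column is left alone
    out = Phi.astype(float).copy()
    mu, sd = out[:, 1:].mean(axis=0), out[:, 1:].std(axis=0)
    sd[sd == 0] = 1.0  # guard against constant columns
    out[:, 1:] = (out[:, 1:] - mu) / sd
    return out
```

For 7 features and degree $d$, the design matrix has $1 + 7d$ columns.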

Problem 4: Spam classification

In this problem, we will use the perceptron and SVM training algorithms to build a spam classifier. The dataset is a subset of the 2005 TREC Public Spam Corpus. It contains a training set and a test set. Both files use the same format: each line represents the space-delimited properties of an email, with the first field being the email ID, the second being whether it is spam or ham (non-spam), and the rest being words and their occurrence counts in the email. The dataset presented to you is a processed version of the original, in which non-word characters have been removed and some basic feature selection has been done.
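The line format described above and a classic perceptron update can be sketched as below (Python for illustration; the repo itself is Matlab). The exact label strings (`spam`/`ham`) are an assumption about the processed files, not confirmed by the description.

```python
import numpy as np

def parse_line(line, vocab):
    # format per the dataset description: id label w1 c1 w2 c2 ...
    parts = line.split()
    label = parts[1]  # 'spam'/'ham' strings are an assumption
    x = {}
    for w, c in zip(parts[2::2], parts[3::2]):
        x[vocab.setdefault(w, len(vocab))] = int(c)
    return x, (1 if label == 'spam' else -1)

def perceptron(data, n_features, epochs=10):
    # classic perceptron on sparse {feature index: count} examples
    w, b = np.zeros(n_features), 0.0
    for _ in range(epochs):
        for x, y in data:
            score = b + sum(w[i] * c for i, c in x.items())
            if y * score <= 0:  # misclassified (or on the boundary): update
                for i, c in x.items():
                    w[i] += y * c
                b += y
    return w, b
```

Keeping the examples as sparse dicts avoids building a dense document-term matrix, which matters when the vocabulary is large.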

Problem 5: Decision Trees for Classification

Build a decision tree that learns a model to predict whether a US congressman is a Democrat or a Republican based on their voting record on various issues.
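The core of decision-tree learning is choosing the attribute (here, a vote) that best splits the examples. A minimal sketch of entropy and information gain, in Python for illustration (attribute and label names below are hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    # H = -sum p log2 p over class frequencies
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attr):
    # reduction in label entropy from splitting on attr's values
    base = entropy(labels)
    remainder = 0.0
    for v in set(e[attr] for e in examples):
        subset = [l for e, l in zip(examples, labels) if e[attr] == v]
        remainder += len(subset) / len(labels) * entropy(subset)
    return base - remainder
```

The tree is grown greedily: at each node, split on the attribute with the highest gain, then recurse on each branch until the labels are pure or no attributes remain.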

Problem 6: Naive Bayes for Newsgroup Classification

The data and its description are available through the UCI data repository. We have processed the data further to remove punctuation symbols, stopwords, etc. The processed dataset contains the subset of the articles in the newsgroups rec.* and talk.*, a total of 7230 articles across 8 different newsgroups. The processed data is made available to you in a single file, with each row representing one article. Each row contains the class of the article followed by the list of words appearing in it.
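For word-list rows like these, multinomial Naive Bayes with Laplace (add-one) smoothing is the standard fit. A minimal Python sketch, assuming each article arrives as a list of word tokens with its class label:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    # docs: list of word lists; multinomial NB with add-one smoothing
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, c in zip(docs, labels):
        word_counts[c].update(words)
        vocab.update(words)
    priors = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    V = len(vocab)
    loglik = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        # P(w|c) = (count(w,c) + 1) / (total_c + |V|)
        loglik[c] = {w: math.log((word_counts[c][w] + 1) / (total + V))
                     for w in vocab}
    return priors, loglik

def predict_nb(words, priors, loglik):
    def score(c):
        return priors[c] + sum(loglik[c][w] for w in words if w in loglik[c])
    return max(priors, key=score)
```

Working in log space avoids underflow when an article contains hundreds of words; unseen words are simply skipped at prediction time.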

Problem 7: K-means for Digit Recognition

In this problem, you will work with a subset of the OCR (optical character recognition) dataset from the Kaggle website. Each image in the dataset is represented by 784 (28 × 28) grayscale pixel values in the range [0, 255]. The data was further processed so that it could easily be used to experiment with K-means clustering. The processed dataset contains 1000 images of four different handwritten digits (1, 3, 5, 7). Each image is represented by a sequence of 157 grayscale pixel values (a subset of the original 784). A separate file contains the actual digit value for each image.
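Lloyd's algorithm for K-means alternates between assigning each image to its nearest center and recomputing each center as the mean of its cluster. A NumPy sketch (for illustration; the repo is Matlab), using a naive first-k-points initialization rather than the k-means++ seeding one would prefer in practice:

```python
import numpy as np

def kmeans(X, k, iters=100):
    # naive init: first k points (k-means++ is a better choice in practice)
    centers = X[:k].astype(float).copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # assign each point to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute centers; keep the old center if a cluster empties
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels
```

For this problem, `k = 4` and each row of `X` is one image's 157 pixel values; the separate label file is only used afterward, to check how well clusters line up with the true digits.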

Contribute

License

This project is licensed under the terms of the MIT license. See LICENCE.txt for details.