In this project I implement a simple spam classifier in Matlab/Octave.
Support vector machines (SVMs) can often capture complex decision boundaries more effectively than classic logistic regression.
- Normalize the email by reducing each word to its stem (e.g. hosting, host, hosted, ... are reduced to 'host')
- Each stem corresponds to an entry in a vocabulary file, where every word is linked with an id
- For each email, a feature vector with one entry per word in the vocabulary file is built. If the i-th vocabulary word is present in the email, the i-th row of the vector is 1, otherwise 0.
- A training set of these vectors is fed to the SVM
- Evaluate the trained classifier on a test set
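
As a rough illustration of the steps above, here is a minimal Python sketch (it is not the project's Octave code). The toy vocabulary, the suffix-stripping `stem` function, the sample emails, and the subgradient-descent trainer are all assumptions made for illustration; the actual project reads a vocabulary file and relies on its own stemming and SVM training routines.

```python
import re

# Hypothetical toy vocabulary: stem -> 1-based id (stand-in for the vocabulary file).
VOCAB = {"host": 1, "free": 2, "money": 3, "meet": 4, "project": 5}
SUFFIXES = ("ing", "ed", "s")  # crude stand-in for a real stemmer


def stem(word):
    """Strip at most one common suffix; a simplification, not a full stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def email_to_features(text):
    """Return a 0/1 vector with one entry per vocabulary word."""
    features = [0] * len(VOCAB)
    for token in re.findall(r"[a-z]+", text.lower()):
        word_id = VOCAB.get(stem(token))
        if word_id is not None:
            features[word_id - 1] = 1
    return features


def score(w, b, x):
    """Linear decision value w.x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, epochs=200, lr=0.1, reg=0.01):
    """Subgradient descent on the L2-regularised hinge loss (labels: +1 spam, -1 ham)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            violated = yi * score(w, b, xi) < 1  # inside the margin or misclassified
            for j in range(len(w)):
                grad = reg * w[j] - (yi * xi[j] if violated else 0.0)
                w[j] -= lr * grad
            if violated:
                b += lr * yi
    return w, b


def predict(w, b, x):
    return 1 if score(w, b, x) > 0 else -1


# Made-up emails, for illustration only.
train_emails = [
    ("Free money! Hosting offers, free hosted servers", 1),
    ("Get free money now", 1),
    ("Project meeting is tomorrow", -1),
    ("The project needs review", -1),
]
test_emails = [("Hosted for free, send money", 1), ("Project meeting", -1)]

X = [email_to_features(text) for text, _ in train_emails]
y = [label for _, label in train_emails]
w, b = train_linear_svm(X, y)

correct = sum(predict(w, b, email_to_features(t)) == label for t, label in test_emails)
accuracy = correct / len(test_emails)
print("test accuracy:", accuracy)
```

A linear model is used here only to keep the sketch short; an SVM with a non-linear kernel would follow the same feature pipeline.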
Simply run the spam_classifier.m script and the output will be displayed in the console.
To classify one of your own emails, copy and paste its text content into a file (say, my_email.txt) under the code\samples directory. Then modify line 70 of spam_classifier.m to point to your file:
filename = 'samples/my_email.txt';
Then, run the spam_classifier.m script.
Look into honeypot projects, which try to gather as many spam emails as possible, to build a better vocabulary file or other types of features.
This project was part of Andrew Ng's MOOC on machine learning, which I strongly recommend. This project is no longer updated.