Spam or Ham: Orange to the Rescue

Igor Kleiner
4 min read · Jan 18, 2020

On Sunday, I taught a lesson for students in the “Data Science for All” course.

Here I will show how to use Orange’s text mining widgets for spam/ham discrimination.

We will use SMS data from the SMS Spam Collection: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection

The data includes the text of SMS messages, each with a label: spam or ham.

read and inspect the data

The Data Table widget displays the SMS messages and their type:

sms data + labels

As we can see, there are 5559 instances and no missing data.
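
If you prefer scripting to widgets, a minimal pandas sketch of the same loading and inspection step could look like the snippet below. I assume the collection is downloaded as a tab-separated file with two columns (label, message text) and saved locally as SMSSpamCollection.txt; the exact file name and format of your copy may differ.

```python
import pandas as pd

# Assumption: the collection is a tab-separated file with two columns
# (label, message text); adjust the path to your own copy.
sms = pd.read_csv(
    "SMSSpamCollection.txt",
    sep="\t",
    header=None,
    names=["label", "text"],
)

print(sms.shape)                    # number of instances and columns
print(sms["label"].value_counts())  # how many ham vs. spam messages
print(sms.isna().sum())             # check for missing values
```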

Looking at the preceding data, do you notice any interesting characteristics of spam? Words such as free, win, and present could be good candidates for distinguishing spam from ham.

Preparing and exploring the data

Text is raw data, so we need to prepare it before feeding it to the Naive Bayes widget.

1. We transform the textual labels ham/spam to 0/1, since many algorithms won’t work with textual values. Here we use the Edit Domain widget.
transform ham/spam to 0/1: Edit Domain widget

2. Let us define the target variable with the Select Columns widget:

define target variable: Select Columns widget

3. The first step in pre-processing textual data is creating a corpus, that is, a collection of text documents.

4. Now we need to clean up and prepare the SMS messages for further analysis. Here the Preprocess Text widget helps us:

Here we execute a number of steps:

a. We transform all letters to lowercase and also remove accents.

b. We ask Orange to split the text into words and keep the punctuation marks, since spam messages are often thought to contain a lot of punctuation.

c. We use the Snowball stemmer. Stemming is a common standardization for text data: it takes words like learned, learn, and learns and transforms them into the base form learn.

d. We keep only the 200 most frequent words for further analysis.

e. We decide not to remove stop words such as a, the, I, …, assuming that these words can be important for spam/ham discrimination.

5. We use the Bag of Words widget. The widget represents each SMS in a new feature space: for each SMS we get a vector that counts how many times each of the 200 frequent words appears in the message. A rough plain-Python sketch of these five steps follows the screenshot below.

The Bag of Words widget
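
For readers who prefer code, here is a minimal sketch of steps 1 to 5 outside Orange, using pandas, NLTK, and scikit-learn. The exact tokens and vocabulary will not match the Orange widgets one-to-one; the sketch only mirrors the idea of each step and reuses the sms DataFrame from the loading snippet above.

```python
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize  # requires: nltk.download("punkt")
from sklearn.feature_extraction.text import CountVectorizer

# Steps 1-2: map the textual labels ham/spam to 0/1 and keep them
# as the target variable y.
y = sms["label"].map({"ham": 0, "spam": 1})

# Steps 3-4: lowercase, strip accents, split the text into word and
# punctuation tokens (word_tokenize keeps punctuation as separate tokens),
# and stem every token with the Snowball stemmer.
stemmer = SnowballStemmer("english")

def tokenize(text):
    return [stemmer.stem(token) for token in word_tokenize(text)]

# Step 5: bag of words over the 200 most frequent tokens.
# stop_words=None keeps words like "a", "the", "I" in the vocabulary.
vectorizer = CountVectorizer(
    lowercase=True,
    strip_accents="unicode",
    tokenizer=tokenize,
    token_pattern=None,   # a custom tokenizer replaces the default pattern
    stop_words=None,
    max_features=200,
)
X = vectorizer.fit_transform(sms["text"])  # one count vector per SMS
print(X.shape)                             # (number of messages, 200)
```

Note that CountVectorizer builds the 200-word vocabulary and the count vectors in one go, which roughly corresponds to the combination of the Preprocess Text and Bag of Words widgets.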

Good job!

The pre-processing is almost finished! We can have coffee or eat an orange.

Data Exploration: Word Cloud

Word Cloud is a nice visual widget that presents words and their frequencies in an easily perceived way.

Let us investigate two word clouds: one for spam and one for ham. Ready?

Word Cloud for ham messages
Word Cloud for spam messages

Can you spot the difference?

Words such as free, call, award, and prize appear mainly in spam.
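
Outside Orange, a similar pair of clouds can be drawn with the third-party wordcloud package. This is only a sketch and again reuses the sms DataFrame from the loading snippet above.

```python
from wordcloud import WordCloud  # third-party package: pip install wordcloud

# Build one cloud per class by joining all messages of that class.
for label in ("ham", "spam"):
    text = " ".join(sms.loc[sms["label"] == label, "text"])
    cloud = WordCloud(background_color="white").generate(text)
    cloud.to_file(f"wordcloud_{label}.png")  # one image per class
```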

Naive Bayes to the rescue

The last steps are:

a. divide the data into train and test sets

b. use Naive Bayes as the model

c. evaluate the quality of the model on the test data

d. use a confusion matrix for a more precise analysis of the results
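
In Orange these steps are a matter of wiring a few more widgets together. For completeness, here is a rough scikit-learn sketch of the same four steps, reusing X (the bag-of-words counts) and y (the 0/1 labels) from the preprocessing snippet above.

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

# a. divide the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# b. use Naive Bayes as the model (MultinomialNB suits word-count features)
model = MultinomialNB().fit(X_train, y_train)

# c. evaluate the quality of the model on the test data
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))

# d. confusion matrix for a more precise analysis of the results
print(confusion_matrix(y_test, y_pred))
```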

The final step: Model building and Evaluation

As you can see, the results seem promising.
