Spam or Ham: Orange to the Rescue

Igor Kleiner
4 min read · Jan 18, 2020

On Sunday, I taught a lesson for students in the “Data Science for All” course.

Here I will show how to use Orange’s text mining widgets for spam/ham discrimination.

We will use SMS data from the SMS Spam Collection: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection

The data includes the text of SMS messages, each with a label: spam or ham.

read and inspect the data

The Data Table widget displays the SMS messages and their type:

sms data + labels

As we can see, there are 5559 instances and no missing data.
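
If you prefer scripting to widgets, a minimal pandas sketch of the same loading and inspection step could look like the snippet below. I assume the collection is downloaded as a tab-separated file with two columns (label, message text) and saved locally as SMSSpamCollection.txt; the exact file name and format of your copy may differ.

```python
import pandas as pd

# Assumption: the collection is a tab-separated file with two columns
# (label, message text); adjust the path to your own copy.
sms = pd.read_csv(
    "SMSSpamCollection.txt",
    sep="\t",
    header=None,
    names=["label", "text"],
)

print(sms.shape)                    # number of instances and columns
print(sms["label"].value_counts())  # how many ham vs. spam messages
print(sms.isna().sum())             # check for missing values
```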

Looking at the preceding data, do you notice any interesting characteristics of spam? Words such as free, win, and present could be good candidates for distinguishing spam from ham.

Preparing and exploring the data

Text is raw data, so we need to prepare it before feeding it to the Naive Bayes widget.

1. We transform the textual labels ham/spam to 0/1, since many algorithms won’t work with textual values. Here we use the Edit Domain widget.
transform ham/spam to 0/1: Edit Domain widget

2. Let us define the target variable with the Select Columns widget:

define target variable: Select Columns widget

3. The first step in pre-processing textual data is creating a corpus, that is, a collection of text documents.

4. Now we need to clean up and prepare the SMS messages for further analysis. Here the Preprocess Text widget helps us:

Here we execute a number of steps:

a. We transform all letters to lowercase and also remove accents.

b. We ask Orange to split the text into words and keep the punctuation marks, since spam messages are often thought to contain a lot of punctuation.

c. We use the Snowball stemmer. Stemming is a common standardization for text data: it takes words like learned, learn, and learns and transforms them into the base form learn.

d. We keep only the 200 most frequent words for further analysis.

e. We decide not to remove stop words such as a, the, I, …, assuming that these words can be important for spam/ham discrimination.

5. We use the Bag of Words widget. The widget represents each SMS in a new feature space: for each SMS we get a vector that counts how many times each of the 200 frequent words appears in the message. A rough plain-Python sketch of these five steps follows the screenshot below.

The Bag of Words widget
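
For readers who prefer code, here is a minimal sketch of steps 1 to 5 outside Orange, using pandas, NLTK, and scikit-learn. The exact tokens and vocabulary will not match the Orange widgets one-to-one; the sketch only mirrors the idea of each step and reuses the sms DataFrame from the loading snippet above.

```python
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize  # requires: nltk.download("punkt")
from sklearn.feature_extraction.text import CountVectorizer

# Steps 1-2: map the textual labels ham/spam to 0/1 and keep them
# as the target variable y.
y = sms["label"].map({"ham": 0, "spam": 1})

# Steps 3-4: lowercase, strip accents, split the text into word and
# punctuation tokens (word_tokenize keeps punctuation as separate tokens),
# and stem every token with the Snowball stemmer.
stemmer = SnowballStemmer("english")

def tokenize(text):
    return [stemmer.stem(token) for token in word_tokenize(text)]

# Step 5: bag of words over the 200 most frequent tokens.
# stop_words=None keeps words like "a", "the", "I" in the vocabulary.
vectorizer = CountVectorizer(
    lowercase=True,
    strip_accents="unicode",
    tokenizer=tokenize,
    token_pattern=None,   # a custom tokenizer replaces the default pattern
    stop_words=None,
    max_features=200,
)
X = vectorizer.fit_transform(sms["text"])  # one count vector per SMS
print(X.shape)                             # (number of messages, 200)
```

Note that CountVectorizer builds the 200-word vocabulary and the count vectors in one go, which roughly corresponds to the combination of the Preprocess Text and Bag of Words widgets.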

Good job!

The pre-processing is almost finished! We can have coffee or eat an orange.

Data Exploration: Word Cloud

Word Cloud is a nice visual widget that presents words and their frequencies in an easily perceived way.

Let us investigate two word clouds: one for spam and one for ham. Ready?

Word Cloud for ham messages
Word Cloud for spam messages

Can you spot the difference?

Words such as free, call, award, and prize appear mainly in spam.
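
Outside Orange, a similar pair of clouds can be drawn with the third-party wordcloud package. This is only a sketch and again reuses the sms DataFrame from the loading snippet above.

```python
from wordcloud import WordCloud  # third-party package: pip install wordcloud

# Build one cloud per class by joining all messages of that class.
for label in ("ham", "spam"):
    text = " ".join(sms.loc[sms["label"] == label, "text"])
    cloud = WordCloud(background_color="white").generate(text)
    cloud.to_file(f"wordcloud_{label}.png")  # one image per class
```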

Naive Bayes to the rescue

The last steps are:

a. divide the data into train and test sets

b. use Naive Bayes as the model

c. evaluate the quality of the model on the test data

d. use a confusion matrix for a more precise analysis of the results
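
In Orange these steps are a matter of wiring a few more widgets together. For completeness, here is a rough scikit-learn sketch of the same four steps, reusing X (the bag-of-words counts) and y (the 0/1 labels) from the preprocessing snippet above.

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

# a. divide the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# b. use Naive Bayes as the model (MultinomialNB suits word-count features)
model = MultinomialNB().fit(X_train, y_train)

# c. evaluate the quality of the model on the test data
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))

# d. confusion matrix for a more precise analysis of the results
print(confusion_matrix(y_test, y_pred))
```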

The final step: Model building and Evaluation

As you can see, the results seem promising.
