SMOTE  Synthetic Minority Oversampling Technique
Codes described throughout are placed in this GitHub Repository.
For theoretical information and pseudo code you may refer : https://www.jair.org/media/953/live9532037jair.pdf
Imbalance in the dataset is mitigated using SMOTE. suppose we have a dataset one class is nearly 85% and other class is nearly 15%. This imbalance is not good for machine learning. Machine learning generally prefers same number of sample for all classes for effective balanced learning and prediction. I have worked with dataset having one class upto 97% and other is about 3% only. I know the headache if dealing with such dataset. Certain techniques can be used to balance such biased dataset by synthesizing new samples identical to minor class. For this tutorial we will be using iris dataset, Iris dataset is having 3 class of flower (Irissetosa, Irisvirginica and Irisversicolor). Each class is having 50 samples. Our goal is to make unbalance dataset balanced so I have deleted 45 samples of Irisvirginica. Now in new dataset we have only 5 sample of Irisvirginica (minor class).
Our goal for this tutorial includes:
1) To multiply samples of Irisvirginica 2) To visualize all sample of Irisvirginica and to ascertain that newly synthesized sample are similar to original Irisvirginica samples.
The core idea behind entire algorithm is very simple

Lets say we had a iris data with N = 105 samples.

Out of N, we have a class Irisvirginica having n = 5 samples, And other two classes having 50 samples each.

We need to increase number of samples for class Irisvirginica (minor class)

Each sample in iris dataset having 4 attribute, one sample can be represented in array as [6.3, 3.3, 6.0, 2.5, ‘Irisvirginica’]

First we will find out samples (nearest neighbors) which are very similar to given minor sample, lets say we got following nearest neighbors [[6.5, 3.0, 5.8, 2.2, ‘Irisvirginica’],[6.3, 2.9, 5.6, 1.8, ‘Irisvirginica’],[6.4, 2.8, 5.4, 1.9, ‘Irisvirginica’]]

Out of this nearest neighbor one sample is choose at random let say we choose

[6.5, 3.0, 5.8, 2.2, ‘Irisvirginica’] to be the nearest one

Calculation:
 One minor sample (m)  [6.3, 3.3, 6.0, 2.5, ‘Irisvirginica’]
 Random nearest neighbor (r)  [6.5, 3.0, 5.8, 2.2, ‘Irisvirginica’]
 Positive Difference between two (d)  [0.2, 0.3, 0.2, 0.3]
 Random number between 0 and 1 (i)  [0.2, 0.8, 0.3, 0.5]
 Synthetic minor sample = m + d * i  [6.34, 3.54, 6.06, 2.65, ‘Irisvirginica’] –> New synthetic sample
Function 1. Below mentioned function perform a sole function to load data
def loadDataset(filename, numattrs):
"""
loads data from file
:param filename:
:param numattrs: number of column in file, Excluding class column
:return:
"""
csvfile = open(filename, 'r')
lines = csv.reader(csvfile)
dataset = list(lines)
for x in range(len(dataset)):
for y in range(numattrs):
dataset[x][y] = float(dataset[x][y])
return dataset
Function 2. Another function we would require is to calculate euclidean distance between attributes of two samples:
def euclideanDistance(instance1, instance2, length):
"""
calculate euclidean distance between two
:param instance1:[6.5, 3.0, 5.8, 2.2]
:param instance2:[6.3, 2.9, 5.6, 1.8]
:param length: 4 length of array
:return:
"""
distance = 0
for x in range(length):
distance += pow((instance1[x]  instance2[x]), 2)
return math.sqrt(distance)
If we execute above given function with set of array $[6.5, 3.0, 5.8, 2.2]$ and $[6.3, 2.9, 5.6, 1.8]$ then it will give output as 0. $euclideanDistance([6.5, 3.0, 5.8, 2.2],[6.3, 2.9, 5.6, 1.8],4)$ Euclidean distance starts from 0.0 (Exactly similar) to any number indicating how far two objects are.
Function 3. Next function is getNeighbors. getNeighbors function will take three inputs
 Entire dataset
 A minorsample and
 Number of neighbors to be found out
The mechanism behind this function is as follows:

For a given minor sample find out euclidean distance with all samples in dataset

Sort all samples in ascending order of euclidean distance so that most similar will come first

From beginning take required number of neighbors to be found out except first sample ( The first sample will be the same minor sample from entire dataset  most similar )
def getNeighbors(trainingSet, eachMinorsample, k):
"""
will give top k neighbors for given minor sample (eachMinorsample) in dataset (trainingSet)
:param trainingSet: here entire dataset is a training set
:param eachMinorsample:
:param k: number of nearest neighbors to search for each minor sample value, using Euclidean distance
:return: top k neighbors output: # Minor Sample: [6.3, 3.3, 6.0, 2.5, 'Irisvirginica']  Neighbors: [[6.5, 3.0, 5.8, 2.2, 'Irisvirginica'],
# [6.3, 2.9, 5.6, 1.8, 'Irisvirginica'],
"""
distances = []
length = len(eachMinorsample)  1
for x in range(len(trainingSet)):
dist = euclideanDistance(eachMinorsample, trainingSet[x], length) # get euclidean distance for all matches
distances.append((trainingSet[x], dist))
distances.sort(key=operator.itemgetter(1)) # sort as per distance and get top 3 excluding first one
neighbors = []
for x in range(k):
neighbors.append(distances[x + 1][0]) # X+1 in the sorted list ensure that the minor sample itself is not selected as neighbors
return neighbors
Function 4. seperateMinority will separates all minor samples as per class specified from dataset. This will return an array of all minor samples.
def seperateMinority(dataSet, MinorClassName, classColumnNumber):
"""
will separate given minor class from the entire dataset
:param dataSet: Entire dataset
:param MinorClassName: name of minor class, e.g. MinorClassName = "Irisvirginica"
:param classColumnNumber: column number where class is present [zero indexed]
:return:
"""
minorSamples = []
for eachSample in dataSet:
if (eachSample[classColumnNumber] == MinorClassName):
minorSamples.append(eachSample)
return minorSamples
Function 5. populate is the main function which takes following parameters:

N: factor by which sample needs to be increase, e.g. 2 means twice

minorSample: all minor samples

nnarray: nearest neighbor array [[2.4, 2.5, ‘a’],[2.3, 2.2, ‘a’],[2.5, 2.5, ‘a’]]

numattrs: equals to number of feature (3) , in 0 based index it iterates from
This function perform actual SMOTE algorithm on datset. It perform following steps

Take minorsample and nearest neighbors.

Find out difference between attributes of minorsample and nearest neighbors as diff

Generate a positive random float number between 0 and 1 known as gap

New attributes of minorsample = (attributes of minorsample) + gap * difference

As described above, all attribute for a particular sample are generated

Such controlled randomness is added to N number of samples to generate synthetic ones
def populate(N, minorSample, nnarray, numattrs):
"""
perform actual algorithm
1) take minorsample and nearest neighbours.
2) find out difference between attributes of minorsample and nearest neighbours as diff
3) generate a positive random float number between 0 and 1 known as gap
4) new attributes of minorsample = (attributes of minorsample) + gap * difference
5) As described above, all attribute for a particular sample are generated
6) such controlled randomness is added to N number of samples to generate synthetic ones
:param N: factor by which sample needs to be increase, e.g. 2 means twice
:param minorSample: all minor samples
:param nnarray: nearest neighbour array [[2.4, 2.5, 'a'],[2.3, 2.2, 'a'],[2.5, 2.5, 'a']]
:param numattrs: equals to number of feature (3) , in 0 based index it iterates from 0,1,2,3
:return:
"""
while (N > 0):
nn = randint(0, len(nnarray)  2)
eachUnit = []
for attr in range(0, numattrs+1): #[0,1,2,3] iterate over each attribute (feature)
diff = float(nnarray[nn][attr])  (minorSample[nn][attr]) # difference between nearest neighbour and actual minor sample
gap = random.uniform(0, 1) # generate a random number between 0 and 1
eachUnit.append(minorSample[nn][attr] + gap * diff) # multiply difference with random number and add this to original attribute value
for each in eachUnit:
syntheticData.write(str(each)+",")
syntheticData.write("\\n")
N = N  1
Function 6. The last function is SMOTE function , which call all these above functions and serve as an entry point for the workflow. k is number of nearest neighbors to be taken in to consideration to generate synthetic samples.
def SMOTE(T, N, minorSamples, numattrs, dataSet, k):
"""
:param T = Number of minority class Samples # here we have 5
:param k = k mean (clustering value)
:param minorSample: all minor samples
:param N = "Number of sample to be generated should be more than 100%"
Amount of smoted sample required N%
"""
if (N <= 100):
print "Number of sample to be generated should be more than 100%"
raise ValueError
N = int(N / 100) * T # N = number of output samples required
nnarray = []
for eachMinor in minorSamples:
# nnarray all nearest neighbour [[2.4, 2.5, 'a'],[2.3, 2.2, 'a'],[2.5, 2.5, 'a']]
nnarray = (getNeighbors(dataSet, eachMinor, k))
populate(N, minorSamples, nnarray, numattrs)
I am not yet done, I have generated synthetic data but who knows I am right or not. If this would be 2D data I would directly plot on graph and can visualize but this is 4D data, A technique called tsne (tDistributed Stochastic Neighbor Embedding ) is used to visualize multi  dimensional data. If you are unaware about tsne, go through my previous tutorial.
After running file visualize.py I got below given Image:
Figure. 1 Samples of Irisvirginica before and after application of SMOTE algorithm
Increase in Dark blue samples clearly represents that we have successfully synthesized sample for IrisvirginicaI am currency working on making this implement multi threaded, Soon will update this tutorial.