- Train the logistic regression algorithm using SGD
- Draw a plot depicting the method in action
- Make a plot in which the higher a point's a posteriori probability, the brighter its color
- Demonstrate the operation of the algorithms implemented above on a single plot
The gradient descent technique is a powerful optimisation method. Specifically, let's introduce the concept of a loss function: a function that takes the points of our dataset together with our predicted values and returns a numeric measure of how bad those predictions are.
Naturally, we want the loss to be as small as possible, ideally zero. In some simple cases we can take the derivative of the loss function, solve the resulting equation and obtain the optimal weights right away. However, that is not always possible, and that is when gradient descent comes in handy.
Instead of solving the equation, we run an iterative process: at each step we improve the weights by computing the gradient of the loss function and moving against it, towards the optimum. The step size depends on the gradient and on the learning rate, a coefficient that usually changes over the iterations. The process ends once the step becomes small enough or the step limit is reached.
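A rough sketch of plain (batch) gradient descent under these assumptions; lossGradient, learningRate, maxSteps and eps are illustrative placeholders, not part of the code used below:

# Plain gradient descent sketch: lossGradient(w) is assumed to return the
# gradient of the total loss over the whole dataset at the current weights w.
gradientDescent <- function(w, lossGradient, learningRate = 0.1,
                            maxSteps = 1000, eps = 1e-6) {
  for (step in 1:maxSteps) {
    grad <- lossGradient(w)
    w <- w - learningRate * grad          # step against the gradient
    if (sqrt(sum(grad^2)) < eps) break    # stop once the step is small enough
  }
  return(w)
}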
Remember the loss function from the previous section? Well, at each step we had to plug our whole dataset in it. Now imagine we have some really huge dataset and it takes us a lot of steps to reach the optimal point. That's when the regular gradient descent becomes too slow.
The approach can be sped up if, instead of considering the whole dataset at each step, we take only a single random point and calculate the loss based on it. That's exactly what stochastic gradient descent does.
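The only change is the gradient estimate: one random sample drives each step. A minimal per-step sketch, where sampleGradient is a placeholder for the per-sample gradient:

# Stochastic variant: each step estimates the gradient from a single random
# row of the dataset instead of the whole of it.
stochasticStep <- function(w, dset, sampleGradient, step) {
  i <- sample(nrow(dset), 1)          # a single random sample
  learningRate <- 1 / step            # decaying learning rate
  w - learningRate * sampleGradient(w, dset[i, ])
}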
Logistic regression is a statistical model that, in its basic form, uses the logistic function to model a binary dependent variable, i.e. the probability of a certain class or event, such as pass/fail or win/lose. In this task, logistic regression is built on top of SGD.
# The logistic (sigmoid) function, the logistic loss of the margin, and the
# single-sample weight update rule used by SGD.
sigmoid <- function(z) 1 / (1 + exp(-z))
logisticRegressionLossFunction <- function(margin) log2(1 + exp(-margin))
logisticRegressionUpdateRule <- function(w, learningRate, sample, class) w + learningRate * sample * class * sigmoid(-sum(w * sample) * class)
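For a sample x with true class y in {-1, +1}, the margin is the scalar product of w and x multiplied by y: it is positive when the sample is classified correctly and negative otherwise. The loss above is the logistic loss of that margin, and the sigmoid of the raw score can be read as the a posteriori probability of class +1. A throwaway check with made-up numbers:

# Toy check of the pieces above; all values are arbitrary.
w <- c(0.5, -0.2, 0.1)                   # some weight vector
x <- c(1.0, 2.0, -1.0)                   # a sample (last feature is the constant -1)
y <- 1                                   # its true class, either -1 or +1
margin <- sum(w * x) * y                 # positive margin = correct classification
loss <- logisticRegressionLossFunction(margin)
prob <- sigmoid(sum(w * x))              # a posteriori probability of class +1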
The logistic regression classifier is, in fact, an optimal Bayesian classifier. Classification is as simple as taking the scalar product of the weight vector and the feature vector:
result <- sum(c(i, j, -1) * w)  # scalar product of the extended feature vector and the weights
Here we introduce a constant artificial feature equal to -1. The sign of this expression determines the class to which the point belongs, and its sigmoid gives the a posteriori probability used for the brightness plot.
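One possible way of producing the brightness plot is to walk over a grid, classify every node and shade it by the sigmoid of its score; the grid bounds, the colours and the alpha mapping below are illustrative assumptions, not part of the original code:

# Sketch: classify a grid of points and shade each one by its a posteriori probability.
plot(NULL, xlim = c(0, 7), ylim = c(0, 7), xlab = "x1", ylab = "x2")
for (i in seq(0, 7, 0.1)) {
  for (j in seq(0, 7, 0.1)) {
    score <- sum(c(i, j, -1) * w)       # scalar product with the weights
    prob <- sigmoid(score)              # a posteriori probability of class +1
    cls <- ifelse(score >= 0, "red", "blue")
    # the more confident the prediction, the brighter (more opaque) the point
    points(i, j, pch = 16, col = adjustcolor(cls, alpha.f = abs(prob - 0.5) * 2))
  }
}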
- Initialize the weights with small random numbers.
- Calculate the initial loss over the whole dataset.
- Until the loss stabilizes, the step limit is reached, or no misclassified samples remain, do:
  - Pick a random misclassified sample from the dataset.
  - Calculate the loss on that sample.
  - Update the weights.
  - Update the running loss estimate.
computeWeightsWithStochasticGradient <- function(dset, classes, lossFunction, updateRule) {
  dsetLength <- dim(dset)[1]
  featuresCount <- dim(dset)[2]
  # Initialize the weights with small random numbers
  w <- runif(featuresCount, -1 / (2 * featuresCount), 1 / (2 * featuresCount))
  lambda <- 1 / dsetLength  # smoothing factor for the running loss estimate
  # Initial loss over the whole dataset
  currentLoss <- 0
  for (i in 1:dsetLength) {
    margin <- sum(w * dset[i, ]) * classes[i]
    currentLoss <- currentLoss + lossFunction(margin)
  }
  prevLoss <- currentLoss
  step <- 1
  repeat {
    # Margins of all samples with the current weights
    margins <- array(0, dim = dsetLength)
    for (i in 1:dsetLength) {
      margins[i] <- sum(w * dset[i, ]) * classes[i]
    }
    # Stop if every sample is classified correctly
    errorIndices <- which(margins <= 0)
    if (length(errorIndices) == 0) break
    # Pick a random misclassified sample; indexing through length() avoids
    # sample()'s special case when errorIndices contains a single value
    randomErrorSampleIndex <- errorIndices[sample(length(errorIndices), 1)]
    x <- dset[randomErrorSampleIndex, ]
    class <- classes[randomErrorSampleIndex]
    margin <- sum(w * x) * class
    loss <- lossFunction(margin)
    learningRate <- 1 / step  # decaying learning rate
    w <- updateRule(w, learningRate, x, class)
    # Exponentially smoothed running loss
    currentLoss <- (1 - lambda) * currentLoss + lambda * loss
    # Stop when the loss change is negligible or the step limit is reached
    if (abs(prevLoss - currentLoss) / abs(max(prevLoss, currentLoss)) < 1e-5) break
    if (step > 25000) break
    prevLoss <- currentLoss
    step <- step + 1
  }
  return(w)
}
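A usage sketch on made-up data, assuming the dataset matrix carries the constant -1 feature in its last column and the classes are coded as -1/+1:

# Two made-up Gaussian blobs plus the constant -1 feature.
n <- 100
blob1 <- cbind(rnorm(n, mean = 2), rnorm(n, mean = 2))
blob2 <- cbind(rnorm(n, mean = 5), rnorm(n, mean = 5))
dset <- cbind(rbind(blob1, blob2), -1)
classes <- c(rep(-1, n), rep(1, n))

w <- computeWeightsWithStochasticGradient(dset, classes,
                                          logisticRegressionLossFunction,
                                          logisticRegressionUpdateRule)

# Plot the data and the resulting separating line w1*x1 + w2*x2 - w3 = 0
plot(dset[, 1], dset[, 2], pch = 16,
     col = ifelse(classes == 1, "red", "blue"), xlab = "x1", ylab = "x2")
abline(w[3] / w[2], -w[1] / w[2], col = "red", lwd = 2)  # the final line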
Plot: each point's saturation matches its a posteriori probability; the red line is the final separating line.