Linear regression is simple and easy to understand, but it is limited in what it can model. Consider a scenario in which you have a large set of patient data (the training set) and need to predict whether a patient has a certain disease. This is a typical binary classification problem. Let’s state it formally:
We are given a training set \(\{(x^{(i)},y^{(i)})\}\). For each training case, \(x^{(i)}\) holds the patient’s biological features and \(y^{(i)}\) indicates whether the patient had the disease: 0 means not having the disease and 1 means having it. We are asked to develop a model that predicts whether a new patient has the disease, given the same biological features used in the training set.
If we still apply linear regression to this scenario, two problems arise:
- The \(y^{(i)}\) in the training set are 0 or 1, but the output of linear regression is unbounded and may be far greater than 1 or far less than 0.
- Linear regression is sensitive to outliers: even a few erroneous or extreme training cases can shift the fitted line noticeably.
To solve these problems, we first wrap the original linear model in a sigmoid function:
\(f(x)=\frac{1}{1+e^{-x}}\)
so the hypothesis function becomes:
\(h_\theta(x)=\frac{1}{1+e^{-t}}\)
\(t=\theta^Tx\)
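As a minimal sketch of the hypothesis above, the following Python code computes \(h_\theta(x)\) for plain lists. The function names, the sample values of `theta` and `x`, and the convention that `x[0] = 1` acts as an intercept feature are all assumptions made for illustration.

```python
import math


def sigmoid(t):
    """Logistic function 1 / (1 + e^(-t)), arranged to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def hypothesis(theta, x):
    """h_theta(x) = sigmoid(theta^T x) for two equal-length lists."""
    t = sum(th * xi for th, xi in zip(theta, x))
    return sigmoid(t)


theta = [0.5, -1.0, 2.0]  # hypothetical parameters
x = [1.0, 3.0, 0.25]      # hypothetical features; x[0] = 1 is the intercept term
print(hypothesis(theta, x))  # theta^T x = -2 here, so the output is below 0.5
```

Whatever the value of \(\theta^Tx\), the result is squeezed between 0 and 1, which is what lets us read it as a score for the positive class.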
One may wonder why we use the sigmoid function rather than something else. The sigmoid function has two desirable properties:
- It maps any real value into the bounded interval \((0,1)\).
- It has a very convenient derivative:
\(f'(x) = f(x)(1-f(x))\) (1)
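The derivative identity (1) can be checked numerically. The sketch below compares the closed form \(f(x)(1-f(x))\) with a central finite difference at a few arbitrarily chosen points; the helper names and the step size `h` are assumptions for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x):
    """Closed form from identity (1): f'(x) = f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def central_difference(f, x, h=1e-5):
    """Numerical estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for x in [-3.0, -0.5, 0.0, 1.0, 4.0]:
    analytic = sigmoid_prime(x)
    numeric = central_difference(sigmoid, x)
    print(f"x={x:+.1f}  analytic={analytic:.6f}  numeric={numeric:.6f}")
```

The two columns should agree to many decimal places. The derivative peaks at \(x=0\), where it equals \(\frac{1}{4}\).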
The first property ensures that the output of our model lies within the \((0,1)\) interval. The second is even more useful, as we’ll see shortly. We can now predict according to the following rule:
\[ y = \begin{cases}
0 & h_\theta(x)<0.5\\
1 & h_\theta(x)\geq0.5
\end{cases} \]
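The decision rule can be sketched in a few lines of Python. The `predict` function, its `threshold` parameter, and the sample parameters are hypothetical names and values chosen for illustration.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def predict(theta, x, threshold=0.5):
    """Return 1 if h_theta(x) >= threshold, otherwise 0."""
    t = sum(th * xi for th, xi in zip(theta, x))
    return 1 if sigmoid(t) >= threshold else 0


theta = [-1.0, 2.0]                # hypothetical parameters
print(predict(theta, [1.0, 0.2]))  # theta^T x = -0.6, so h < 0.5
print(predict(theta, [1.0, 0.9]))  # theta^T x = 0.8, so h >= 0.5
```

Because the sigmoid equals 0.5 exactly at 0 and is increasing, thresholding \(h_\theta(x)\) at 0.5 is the same as checking whether \(\theta^Tx\geq0\).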
Changing the model alone is not enough; we also need to change the cost function. In linear regression the cost function is defined as:
\(C=(h_{\theta}(x^{(i)})-y^{(i)})^2\)
(for simplicity, we only consider one training case)
Recall identity (1). Using it, we can easily compute the gradient:
\(\frac{\partial C}{\partial \theta_j} = \frac{\partial C}{\partial z} \frac{\partial z}{\partial \theta_j} = 2(z-y^{(i)})z(1-z)x_j^{(i)}\)
where \(z=h_\theta(x^{(i)})\).
Since \(|z-y^{(i)}|<1\) and \(z(1-z)\leq\frac{1}{4}\), it follows that:
\(\left|\frac{\partial C}{\partial \theta_j}\right|\leq\frac{1}{2}|x_j^{(i)}|\) (2)
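To make the gradient formula and the bound (2) concrete, here is a small sketch that evaluates the squared-error gradient for one training case, compares it with a finite-difference estimate, and checks the bound. The sample values of `theta`, `x`, and `y` are hypothetical.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def squared_error_cost(theta, x, y):
    z = sigmoid(sum(th * xi for th, xi in zip(theta, x)))
    return (z - y) ** 2


def squared_error_gradient(theta, x, y):
    """dC/dtheta_j = 2 (z - y) z (1 - z) x_j, with z = h_theta(x)."""
    z = sigmoid(sum(th * xi for th, xi in zip(theta, x)))
    return [2.0 * (z - y) * z * (1.0 - z) * xj for xj in x]


theta = [0.3, -0.7]  # hypothetical parameters
x = [1.0, 2.5]       # hypothetical training case
y = 1
grad = squared_error_gradient(theta, x, y)

h = 1e-6
for j in range(len(theta)):
    up = list(theta)
    down = list(theta)
    up[j] += h
    down[j] -= h
    numeric = (squared_error_cost(up, x, y) - squared_error_cost(down, x, y)) / (2.0 * h)
    within_bound = abs(grad[j]) <= 0.5 * abs(x[j])
    print(f"j={j}  analytic={grad[j]:.6f}  numeric={numeric:.6f}  bound holds: {within_bound}")
```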
Recall the gradient descent approach:
Repeat {
\(\theta_j=\theta_j-\alpha\frac{\partial C}{\partial \theta_j},j=0,1\)
}
So if we are unlucky enough to have many training cases whose features have a very small magnitude, e.g. \(|x_j^{(i)}|<0.0001\), then by (2) each iteration changes \(\theta_j\) by less than \(0.00005\alpha\). This is disastrous, because the method will learn at a snail’s pace!
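The slowdown can be observed directly. The following sketch runs gradient descent with the squared-error cost on a single hypothetical training case whose features are tiny, and records the largest change made to any \(\theta_j\) in one step. The learning rate `alpha = 1.0`, the step count, and the feature values are assumptions chosen to match the numbers in the text.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def squared_error_gradient(theta, x, y):
    z = sigmoid(sum(th * xi for th, xi in zip(theta, x)))
    return [2.0 * (z - y) * z * (1.0 - z) * xj for xj in x]


def gradient_descent(theta, x, y, alpha, steps):
    """Return the final theta and the largest single-step change seen."""
    theta = list(theta)
    largest_step = 0.0
    for _ in range(steps):
        grad = squared_error_gradient(theta, x, y)
        for j, g in enumerate(grad):
            step = alpha * g
            theta[j] -= step
            largest_step = max(largest_step, abs(step))
    return theta, largest_step


x = [0.0001, -0.00005]  # hypothetical features with very small magnitude
theta, largest_step = gradient_descent([0.0, 0.0], x, y=1, alpha=1.0, steps=1000)
print("theta after 1000 steps:", theta)
print("largest single update: ", largest_step)
```

Even after 1000 iterations the parameters have barely moved from their starting point, and no single update exceeds the bound of 0.00005.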
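So we need to modify the cost function. One appropriate candidate is:
\(C=-(1-y)\log(1-h_\theta(x))-y\log(h_\theta(x))\)
The gradient of this function behaves much better: \(\frac{\partial C}{\partial \theta_j}=(h_\theta(x)-y)x_j\). The factor \(z(1-z)\), which is at most \(\frac{1}{4}\) and shrinks toward 0 as the sigmoid saturates, has cancelled out, so the slowdown caused by that factor does not occur for this function.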
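To compare the two cost functions, the sketch below evaluates both gradients for one training case where the model is confidently wrong (\(h_\theta(x)\) close to 0 while \(y=1\)). It uses the standard result that the gradient of the logarithmic cost with respect to \(\theta_j\) is \((h_\theta(x)-y)x_j\). The sample values are hypothetical, and the cost function assumes \(h_\theta(x)\) is strictly between 0 and 1 in floating point.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def dot(theta, x):
    return sum(th * xi for th, xi in zip(theta, x))


def cross_entropy_cost(theta, x, y):
    """C = -(1 - y) log(1 - h) - y log(h); assumes 0 < h < 1."""
    h = sigmoid(dot(theta, x))
    return -(1 - y) * math.log(1.0 - h) - y * math.log(h)


def cross_entropy_gradient(theta, x, y):
    """dC/dtheta_j = (h - y) x_j."""
    h = sigmoid(dot(theta, x))
    return [(h - y) * xj for xj in x]


def squared_error_gradient(theta, x, y):
    """dC/dtheta_j = 2 (z - y) z (1 - z) x_j."""
    z = sigmoid(dot(theta, x))
    return [2.0 * (z - y) * z * (1.0 - z) * xj for xj in x]


theta = [-6.0, 0.0]  # hypothetical parameters giving h close to 0
x = [1.0, 2.0]       # hypothetical training case
y = 1                # the true label, so the prediction is badly wrong
ce = cross_entropy_gradient(theta, x, y)
se = squared_error_gradient(theta, x, y)
print("logarithmic cost gradient:  ", ce)
print("squared-error cost gradient:", se)
```

In this situation the logarithmic cost yields a gradient more than a hundred times larger than the squared-error cost, so gradient descent corrects the mistake much faster.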
By improving the linear model and the cost function, we arrive at another important model in machine learning: the logistic regression model. This model works well for binary classification problems, and by combining several logistic models we can even handle multi-class classification problems such as digit recognition!
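One common way to combine logistic models is the one-vs-rest scheme: train one binary model per class, then pick the class whose model gives the highest score. The sketch below illustrates this on a tiny made-up dataset of three well-separated clusters. The data, the function names, the learning rate, and the step count are all assumptions for illustration, and the post does not say which combination scheme it has in mind.

```python
import math


def sigmoid(t):
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def score(theta, x):
    return sigmoid(sum(th * xi for th, xi in zip(theta, x)))


def train_binary(xs, ys, alpha=0.1, steps=2000):
    """Batch gradient descent on the mean logarithmic cost."""
    theta = [0.0] * len(xs[0])
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for x, y in zip(xs, ys):
            h = score(theta, x)
            for j, xj in enumerate(x):
                grad[j] += (h - y) * xj / len(xs)
        theta = [th - alpha * g for th, g in zip(theta, grad)]
    return theta


def train_one_vs_rest(xs, labels, classes):
    """One binary model per class: class k against all other classes."""
    return {k: train_binary(xs, [1 if lab == k else 0 for lab in labels])
            for k in classes}


def predict_class(models, x):
    """Pick the class whose binary model gives the highest score."""
    return max(models, key=lambda k: score(models[k], x))


# Made-up 2D points; the leading 1.0 is the intercept feature.
xs = [[1.0, 0.0, 3.0], [1.0, 0.5, 2.5], [1.0, -0.5, 3.5],
      [1.0, -3.0, -2.0], [1.0, -2.5, -2.5], [1.0, -3.5, -1.5],
      [1.0, 3.0, -2.0], [1.0, 2.5, -2.5], [1.0, 3.5, -1.5]]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

models = train_one_vs_rest(xs, labels, classes=[0, 1, 2])
print([predict_class(models, x) for x in xs])
```

For digit recognition the same idea would use ten binary models, one per digit.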




