Machine Learning Note II — Logistic Regression

Linear regression is simple and easy to understand, but not powerful. Let’s consider a scenario in which you have a lot of patient data(training set) and you are required to predict whether a patient has certain disease. This is a typical binary classification problem. Let’s write it in a formal way:

we are given a training set \((x^{(i)},y^{(i)})\), for each training case, \(x\) stands for the patient’s biological feature, and \(y\) indicates whether s/he had the disease. 0 indicates not having the disease and 1 indicates having the disease. We are asked to develop a model to predict whether a new patient has the disease given the same biological features in the training set.

If we still apply linear regression to this scenario, there will be problems:

  • The \(y^{(i)}\) in the training set are 0 or 1, but the result linear regression produces is unbounded, may be greatly larger than 1 or greatly less than 0.
  • Linear regression is sensitive to only a few error training cases.

So to solve the problem, firstly, we wrap the original linear model with a sigmoid function:

\(f(x)=\frac{1}{1+e^{-x}}\)

so the hypothesis function would be:

 \(h_\theta(x)=\frac{1}{1+e^{-t}}\)

\(t=\theta^Tx\)

One may wonder why use the sigmoid function, not something else. Well, the sigmoid function has desirable features:

  • It can map unbounded real value into a bounded interval (0,1)
  • It has a very nice derivative:

\(f^{‘}(x) = f(x)(1-f(x))\)   (1)

The first feature ensure that the result our model can produce is within the \((0,1)\) interval, and the second feature is more useful, which we’ll see shortly, now we can predict following the criteria below:

\(

 \[ y = \begin{cases}
0 & h_\theta<0.5\\
1 & h_\theta\geq0.5
\end{cases}  \]

\)

 Only changing the model is not enough. We also need to make some changes to the cost function, in linear regression the cost function is defined as:

 \(C=(h_{\theta}(x^{(i)})-y^{(i)})^2\)

 (for simplicity, we only consider one training case)

recall we have (1), so we can easily compute the gradient as:

\(\frac{\partial C}{\partial \theta_j} = \frac{\partial C}{\partial z} \frac{\partial z}{\theta_j} = 2(z-y^{(i)})z(1-z)x_j^{(i)}\)

\(z=h_\theta(x)\)

we can easily derive that:

\(|\frac{\partial C}{\partial \theta_j}|<=\frac{1}{2}|x_j^{(i)}|\)  (2)

Recall the gradient descent approach:

Repeat {

\(\theta_j=\theta_j-\alpha\frac{\partial C}{\partial \theta_j},j=0,1\)

}

So if unfortunately we come across many training cases for which \(x\) has a very small magnitude, e.g. \(x_j^{(i)}<0.0001\), we know (2) is true, so in every iteration, the \(\theta_j\) will be updated in a difference less than 0.00005! This is disastrous, because it will make the method learn like a snail!

So we need to modify the cost function, one appropriate candidate would be:

\(C=-(1-y)*log(1-h_\theta(x))-y*log(h_\theta(x))\)

The gradient of the function above is much better. The same problem will not occur for this function.

By improving the linear model and cost function, we now have another important model in machine learning — logistic regression model. This model works well for binary classification problems, and by using combinations of logistic models, we can even deal with multi-label classification problems such as digit recognition!

 

Machine Learning note I

Machine learning is currently a hot field, it has a variety of applications — recommendation system, face recognition software, stock market predicting. Machine learning is powerful or somewhat like a magic, because it can make computer accomplish tasks that require great intelligence.

Motivated by the great power, I decided to learn machine learning. While I read introductory books and online materials on the Internet, I found most of them are much too mathematical without practical examples, which makes it hard for me to carry on. Luckily, I found a wonderful open course regarding machine learning on coursera, it was taught by Standford professor Andrew Ng. He combined theory and practice pretty well, which made his lectures a kind of enjoyment. I learned a lot from this course, so here I shared my understanding of machine learning from this course and hopefully it would be useful to you.

At first, let’s review linear regression. Most people may be familiar with it in high school, and this is a simple case of machine learning. In a typical linear regression problem, one may be asked to use a straight line to fit the scattered points best.

Let’s represent the coordinates of the scattered point as \((x^{(i)},y^{(i)})\), the linear model is \(h_{\theta}(x) = \theta_0x_0+\theta_1x_1=\theta^Tx\quad,x_0=1\),  \(x\) is two dimensional because we consider the constant term 1, as a part of the data.

Normally, we have a cost function showing the average distance between all the points and the line represented by the linear model:

\(C=\frac{1}{m}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2\)

We may calculate the solution using a formula,

\(\theta=(x^Tx)^{-1}x^Ty\)

However in a machine learning approach, the solution is calculated in an iterative manner:

Set \(\theta_j\) to an initial value

Repeat {

\(\theta_j=\theta_j-\alpha\frac{\partial C}{\partial \theta_j}=\theta_j-\frac{\alpha}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)},j=0,1\)

//update the \(\theta_j\) simultaneously

//\(\alpha\)is a positive constant set by the user, it is formally called learning rate

}

By repeatedly updating \(\theta_j\) following the procedure above, it is ensured that we will get to the right answer closer and closer, no matter what the initial values are . The mathematical proof for this would be tough, but we can assume it is right. The cost function \(C\) is a function of \(\theta\), we may notice that in every iteration, the gradient of \(C\) with respect to \(\theta\) is computed. So this method is called gradient descent. It is a very useful method in machine learning, which can be found throughout Andrew’s course.

Why would I consider linear regression as a simple case of machine learning? Because it shares the same key components with all the other machine learning methods, these key components are listed as follows, and the words in the brackets shows the correspondence into linear regression:

  • Data that machine can learn from, formally called training set in machine learning (the scattered data points)
  • a model represented as a parameterized formula (a linear model represented by two parameters)
  • A cost function which set the objective (the sum of Euclidean distance between all the points and the line)
  • An iterative approach in which we can get the best solution (gradient descent)

As for more advanced machine learning paradigms such as neural network and support vector machine, they all have these key components. The only difference would be that they have more sophisticated models (non-linear models) and accordingly more sophisticated cost functions.

 

 p.s: It is the first time I type in mathematical formulas in my blog, I used a wordpress plugin which can parse LaTeX script. Writing formulas in LaTeX is easy and fast!

 

Hand Gesture Recognizer — An early project

This project was completed on March 2012, right after I learned fundamentals about image processing. It was a hand gesture recognizer based on OpenCV which could recognize 3 hand gestures. It was my first project in computer vision which triggered my great interest in this area.

I used skin color as the detecting feature. My hand gesture recognition scheme mainly contained four steps:

  • representation of skin color
  • skin detection
  • removal of non-hand skin colored regions
  • recognition of hand contour

To represent the skin color, first I loaded an image which contained a hand exclusively, then I extracted the Hue histogram in HSI color space of the image. The histogram was what I used as the skin color representation. I used the Hue histogram because I wanted the representation to be invariant to illumination.

To detect skin in a video frame, I also extracted the Hue histogram. In the previous step, we got a histogram-based skin color representation. This histogram gave a probability distribution of the skin color on Hue dimension. So given the probability distribution, I could readily compute the probability of each pixel being a skin pixel by applying the Bayesian Theorem. In fact, OpenCV provided a powerful method which can did what I discussed above, given a probability distribution and the Hue histogram of the image to be processed, it will return the result as a probability map:

 

 

 

 

Above is a sample return of the method. Despite the scattered false positives, the hand region is preserved very well.

To remove the false positives, i.e. the non-hand skin pixels. To do this, I applied image erosion and dilation. If given the right parameter, this approach can produce fairly good results under various conditions(different scales, translations, rotations)

 

 

 

 

 

You see, the result is much better. The hand region is still intact. Then by applying a threshold and extracting the largest connected component, we will be able to retrieve an accurate hand contour.

     

The left is the result after applied a threshold, the right is the extracted largest component, i.e the hand region.

To recognize the hand contour, I used the Hu moment, OpenCV has also provided a group of methods which could did the task. I defined 3 gestures: the palm(stretching all fingers), only stretching the index and thumb, and only stretching the index. My recognizer worked well with the three gestures.

Finally, the Presentation Video! You will see in the video that I applied my gestures recognition scheme to controlling a virtual aircraft. At first I intended to use it to play aircraft games such as demonstar, but I failed to find such an open source project without compile error.