Hello All,

I have been going over the Naive Bayes algorithm as presented by Professor Andrew Ng in his CS229 lectures, and I wanted to make sure I understand how it is implemented (here I am talking about the simplest case, without Laplace smoothing and without the multinomial event model).

So as I understand it, the goal is ultimately to calculate $$ P(y \mid x) $$ by building models for $$ P(x \mid y) $$ and $$ P(y) $$ (essentially applying Bayes' Rule, where $$ P(x \mid y) $$, $$ P(y) $$, and $$ P(x) $$ supply the quantities we need).

In the case where the feature vector $$ x = [x_1, x_2, \ldots, x_n]^T $$ is binary-valued, with $$ x_j \in \{0, 1\} $$ for $$ j = 1, \ldots, n $$, the model is parameterized by $$ \phi_{j \mid y=0} = p(x_j=1 \mid y=0) $$, $$ \phi_{j \mid y=1} = p(x_j=1 \mid y=1) $$, and $$ \phi_y = p(y=1) $$. Under the Naive Bayes (conditional independence) assumption, we can then model the class-conditional distributions as:

$$ P(x \mid y=0) = \prod_{j=1}^n p(x_j \mid y=0) = \prod_{j=1}^n (\phi_{j \mid y=0})^{x_j} (1 -\phi_{j \mid y=0})^{(1-x_j)} $$

$$ P(x \mid y=1) = \prod_{j=1}^n p(x_j \mid y=1) = \prod_{j=1}^n (\phi_{j \mid y=1})^{x_j} (1 -\phi_{j \mid y=1})^{(1-x_j)} $$

$$ P(y) = (\phi_{y})^{y} (1 -\phi_{y})^{(1-y)} $$
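To make this concrete for myself, here is how I picture computing $$ P(x \mid y) $$ in code. This is just my own minimal NumPy sketch of the product above (the function name and array layout are my assumptions, not from the lecture):

```python
import numpy as np

def class_conditional_likelihood(x, phi_given_y):
    """P(x | y) under the Naive Bayes assumption: the product over features j
    of phi_{j|y}^{x_j} * (1 - phi_{j|y})^{1 - x_j}."""
    x = np.asarray(x, dtype=float)              # binary feature vector, shape (n,)
    phi = np.asarray(phi_given_y, dtype=float)  # phi_{j|y} for each feature, shape (n,)
    return np.prod(phi ** x * (1.0 - phi) ** (1.0 - x))
```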

And we can plug these models back into Bayes' Rule:

$$ P(y=1 \mid x) = \frac{P(x \mid y=1)\, P(y=1)}{P(x)} $$

$$ P(y=1 \mid x) = \frac{P(x \mid y=1)\, P(y=1)}{P(x \mid y=1)\, P(y=1) + P(x \mid y=0)\, (1 - P(y=1))} $$

$$ P(y=1 \mid x) = \frac{\prod_{j=1}^n (\phi_{j \mid y=1})^{x_j} (1 - \phi_{j \mid y=1})^{(1-x_j)} \; \phi_{y}}{\prod_{j=1}^n (\phi_{j \mid y=1})^{x_j} (1 - \phi_{j \mid y=1})^{(1-x_j)} \; \phi_{y} + \prod_{j=1}^n (\phi_{j \mid y=0})^{x_j} (1 - \phi_{j \mid y=0})^{(1-x_j)} \; (1 - \phi_{y})} $$
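If my derivation is right, the posterior is just a ratio of two such products weighted by the class prior. A sketch reusing the helper above (again, my own naming); in a real implementation I would probably sum logs instead of multiplying probabilities to avoid underflow when there are many features:

```python
def posterior_y1(x, phi_given_y1, phi_given_y0, phi_y):
    """P(y=1 | x) via Bayes' rule, with the law of total probability
    expanding P(x) in the denominator."""
    num = class_conditional_likelihood(x, phi_given_y1) * phi_y
    den = num + class_conditional_likelihood(x, phi_given_y0) * (1.0 - phi_y)
    return num / den
```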

We then choose the parameters that maximize the joint likelihood of the training data, which gives the maximum-likelihood estimates:

$$ \phi_{j \mid y=1} = \frac{\sum_{i=1}^m 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1 \}}{\sum_{i=1}^m 1\{y^{(i)} = 1 \}} $$

$$ \phi_{j \mid y=0} = \frac{\sum_{i=1}^m 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0 \}}{\sum_{i=1}^m 1\{y^{(i)} = 0 \}} $$

$$ \phi_{y} = \frac{\sum_{i=1}^m 1\{y^{(i)} = 1 \}}{m} $$
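If I understand these estimates correctly, fitting the model is just counting. Here is a sketch of how I would implement the three formulas above (the array layout and function name are my assumptions):

```python
def fit_naive_bayes(X, y):
    """Maximum-likelihood estimates, no Laplace smoothing.
    X: (m, n) array of 0/1 features; y: (m,) array of 0/1 labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    phi_y = np.mean(y == 1)                                 # fraction of positive examples
    phi_given_y1 = X[y == 1].sum(axis=0) / np.sum(y == 1)   # count(x_j=1 and y=1) / count(y=1)
    phi_given_y0 = X[y == 0].sum(axis=0) / np.sum(y == 0)   # count(x_j=1 and y=0) / count(y=0)
    return phi_given_y1, phi_given_y0, phi_y
```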

Going back to the model, we plug in the estimated parameter values to find $$ P(y=1 \mid x) $$ for a new example; if we were still using the spam filter example, a value of $$ P(y=1 \mid x) \geq 0.5 $$ would indicate that the email is spam.
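Putting the pieces together on a tiny made-up example (the numbers below are invented purely to exercise the sketches above; without Laplace smoothing, an unseen feature/class combination would of course produce a zero probability):

```python
# toy "emails" over a 3-word vocabulary; labels 1 = spam, 0 = not spam
X = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]])
y = np.array([1, 1, 0, 0])

phi1, phi0, phi_y = fit_naive_bayes(X, y)
p_spam = posterior_y1(np.array([1, 1, 0]), phi1, phi0, phi_y)
print("P(y=1 | x) =", p_spam, "spam" if p_spam >= 0.5 else "not spam")
```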

Is this correct reasoning? Thanks! (Please forgive me if the LaTeX formatting is poor - this is my first time using it here)

Cheers

asked 02 Mar '12, 12:11 by jinho, edited 02 Mar '12, 18:38
I didn't know naive Bayes was this complex! Which lecture are you referring to? Or maybe his book?
(02 Mar '12, 15:31) comptrol
Haha, it's very possible I'm making it overly complicated. This is from Professor Andrew Ng's lecture (I believe #4 or #5); while he goes over the theory pretty much in depth, he sort of glossed over the implementation. I just wanted to see whether my idea of how it might be implemented is right, or at least on the right path. Thanks! Just a thought, but perhaps I should have asked whether this applies to generative algorithms in general. Eager to hear your thoughts.
(02 Mar '12, 18:37) jinho