Hello All, I have been going over the Naive Bayes algorithm as presented by Professor Andrew Ng in his CS229 lectures, and I wanted to make sure I understand how it is implemented (here I am talking about the simplest case, without Laplace smoothing, and not dealing with the multivariate Naive Bayes). As I understand it, the goal is ultimately to calculate $$ P(y=1 \mid x) $$ by building the models $$ P(x \mid y) $$ and $$ P(y) $$ (essentially applying Bayes' rule by finding the values of $$ P(x \mid y) $$, $$ P(y) $$, and $$ P(x) $$).

In the case where the feature vector $$ x = [x_1, x_2, \ldots, x_n]^T $$ is binary-valued, i.e. $$ x_j \in \{0, 1\} $$ for $$ j = 1, \ldots, n $$, the model is parameterized by $$ \phi_{j \mid y=0} = p(x_j=1 \mid y=0) $$, $$ \phi_{j \mid y=1} = p(x_j=1 \mid y=1) $$, and $$ \phi_y = p(y=1) $$. Under the Naive Bayes conditional-independence assumption we can therefore model the distributions as:

$$ P(x \mid y=0) = \prod_{j=1}^n p(x_j \mid y=0) = \prod_{j=1}^n (\phi_{j \mid y=0})^{x_j} (1 - \phi_{j \mid y=0})^{(1 - x_j)} $$

$$ P(x \mid y=1) = \prod_{j=1}^n p(x_j \mid y=1) = \prod_{j=1}^n (\phi_{j \mid y=1})^{x_j} (1 - \phi_{j \mid y=1})^{(1 - x_j)} $$

$$ P(y) = (\phi_y)^{y} (1 - \phi_y)^{(1 - y)} $$

And we can plug this back into Bayes' rule (noting that $$ P(y=1) = \phi_y $$ and $$ P(y=0) = 1 - \phi_y $$):

$$ P(y=1 \mid x) = \frac{P(x \mid y=1) \, P(y=1)}{P(x)} $$

$$ P(y=1 \mid x) = \frac{P(x \mid y=1) \, P(y=1)}{P(x \mid y=1) \, P(y=1) + P(x \mid y=0) \, (1 - P(y=1))} $$

$$ P(y=1 \mid x) = \frac{\prod_{j=1}^n (\phi_{j \mid y=1})^{x_j} (1 - \phi_{j \mid y=1})^{(1 - x_j)} \, \phi_y}{\prod_{j=1}^n (\phi_{j \mid y=1})^{x_j} (1 - \phi_{j \mid y=1})^{(1 - x_j)} \, \phi_y + \prod_{j=1}^n (\phi_{j \mid y=0})^{x_j} (1 - \phi_{j \mid y=0})^{(1 - x_j)} \, (1 - \phi_y)} $$

We then solve for the parameters that maximize the likelihood function, which gives us:

$$ \phi_{j \mid y=1} = \frac{\sum_{i=1}^m 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\}}{\sum_{i=1}^m 1\{y^{(i)} = 1\}} $$

$$ \phi_{j \mid y=0} = \frac{\sum_{i=1}^m 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\}}{\sum_{i=1}^m 1\{y^{(i)} = 0\}} $$

$$ \phi_y = \frac{\sum_{i=1}^m 1\{y^{(i)} = 1\}}{m} $$

We then go back to the model and plug in the estimated parameter values to find $$ P(y=1 \mid x) $$; if we were still using the spam filter example, a value of $$ P(y=1 \mid x) \geq 0.5 $$ would indicate that the email is spam. Is this correct reasoning? Thanks!

(Please forgive me if the LaTeX formatting is poor -- this is my first time using it here)

Cheers

asked 02 Mar '12, 12:11

jinho

I didn't know Naive Bayes was this complex! Which lecture are you addressing? Or maybe his book?
(02 Mar '12, 15:31)
comptrol
Haha, it's very possible I'm making it overly complicated. This is from Professor Andrew Ng's lecture (I believe #4 or 5); while he goes over the theory in depth, he sort of glossed over the implementation. I just wanted to see whether my idea of how it might be implemented is right, or at least on the right path. Thanks!
Just a thought, but perhaps I should have asked whether this applies to generative algorithms in general. Eager to hear your thoughts.
(02 Mar '12, 18:37)
jinho
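The reasoning in the question can be checked directly in code. Below is a minimal NumPy sketch of the MLE estimates and the posterior as derived above; the function and variable names (`fit`, `predict_proba`, `phi_j_y1`, etc.) and the toy data are illustrative, not from the lecture. Note that without Laplace smoothing a feature count of zero produces hard 0 or 1 posteriors, which is exactly the problem smoothing fixes.

```python
import numpy as np

def fit(X, y):
    """MLE estimates phi_{j|y=1}, phi_{j|y=0}, phi_y from binary data."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    phi_j_y1 = X[y == 1].mean(axis=0)  # fraction of y=1 examples with x_j = 1
    phi_j_y0 = X[y == 0].mean(axis=0)  # fraction of y=0 examples with x_j = 1
    phi_y = y.mean()                   # P(y = 1)
    return phi_j_y1, phi_j_y0, phi_y

def predict_proba(x, phi_j_y1, phi_j_y0, phi_y):
    """P(y=1 | x) via Bayes' rule, multiplying per-feature Bernoulli terms."""
    x = np.asarray(x, dtype=float)
    p_x_y1 = np.prod(phi_j_y1 ** x * (1 - phi_j_y1) ** (1 - x))
    p_x_y0 = np.prod(phi_j_y0 ** x * (1 - phi_j_y0) ** (1 - x))
    num = p_x_y1 * phi_y
    return num / (num + p_x_y0 * (1 - phi_y))

# Tiny illustrative dataset: two binary features, two examples per class.
phi1, phi0, py = fit([[1, 1], [1, 0], [0, 0], [0, 1]], [1, 1, 0, 0])
print(predict_proba([1, 1], phi1, phi0, py))  # high P(y=1 | x)
print(predict_proba([0, 0], phi1, phi0, py))  # low P(y=1 | x)
```

A practical implementation would work with log-probabilities instead, since the product of many small per-feature probabilities underflows for realistic vocabulary sizes.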
