|
Hello All, I have been going over the Naive Bayes algorithm as presented by Professor Andrew Ng in his CS229 lectures, and I wanted to make sure I understand how it is implemented (here I am talking about the simplest case without Laplace Smoothing, and not dealing with the multivariate Naive Bayes). So as I understand it, the goal is to ultimately calculate $$ P(y \mid x=1) $$ by creating the models $$ P(x=1 \mid y) $$ and $$ P(y) $$ (essentially solving for Bayes' Rule by finding the values for $$ P(x=1 \mid y), P(Y), and P(X) $$). In the case where the feature vector $$ x $$ is binary-valued, such that $$ x_j, j = 1, ..., n, x_j \in \{0, 1\}, where x = [ x_1, x_2, ..., x_n ]^T $$, we can say that the model is parameterized by $$ \phi_{j \mid y=0} = p(x_j=1 \mid y=0), \phi_{j \mid y=1} = p(x_j=1 \mid y=1) $$ and $$ \phi = p(y=1) $$. Therefore we can model the joint-distributions as: $$ P(x \mid y=0) = \prod_{j=1}^n p(x_j \mid y=0) = \prod_{j=1}^n (\phi_{j \mid y=0})^{x_j} (1 -\phi_{j \mid y=0})^{(1-x_j)} $$ $$ P(x \mid y=1) = \prod_{j=1}^n p(x_j \mid y=1) = \prod_{j=1}^n (\phi_{j \mid y=1})^{x_j} (1 -\phi_{j \mid y=1})^{(1-x_j)} $$ $$ P(y) = (\phi_{y})^{y} (1 -\phi_{y})^{(1-y)} $$ And we can plug this back into Bayes' Rule as: $$ P(y \mid x=1) = \frac{P(x=1 \mid y) P(y=1)}{P(x)} $$ $$ P(y \mid x=1) = \frac{P(x=1 \mid y) P(y=1)}{P(x=1 \mid y=1) P(y=1) + P(x=1 \mid y=0) (1-P(y=1)} $$ $$ P(y \mid x=1) = \frac{\prod_{j=1}^n (\phi_{j \mid y=1})^{x_j} (1 -\phi_{j \mid y=1})^{(1-x_j)} (\phi_{y})^{y} (1 -\phi_{y})^{(1-y)}} {\prod_{j=1}^n (\phi_{j \mid y=1})^{x_j} (1 -\phi_{j \mid y=1})^{(1-x_j)} (\phi_{y})^{y} (1 -\phi_{y})^{(1-y)} + \prod_{j=1}^n (\phi_{j \mid y=0})^{x_j} (1 -\phi_{j \mid y=0})^{(1-x_j)} (1 - (\phi_{y})^{y} (1 -\phi_{y})^{(1-y)})} $$ We then solve for the parameters to maximize the likelihood function, which gives us: $$ \phi_{j \mid y=1} = \frac{\sum_{i=1}^m 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1 \}}{\sum_{i=1}^m 1\{y^{(i)} = 1 \}} $$ $$ \phi_{j \mid y=0} = \frac{\sum_{i=1}^m 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0 \}}{\sum_{i=1}^m 1\{y^{(i)} = 0 \}} $$ $$ \phi_{y} = \frac{\sum_{i=1}^m 1\{y^{(i)} = 1 \}}{m} $$ Going back to the model and plug in the values calculated of the parameters to find $$ P(y=1 \mid x) $$, which if we were still using the spam filter example, a value of $$ P(y=1 \mid x) \geq 0.5 $$ would indicate that the email is spam. Is this correct reasoning? Thanks! (Please forgive me if the LaTeX formatting is poor - this is my first time using it here) Cheers I didn't know naive bayes this complex! Which lecture are you addressing? or maybe his book?
(02 Mar '12, 15:31)
comptrol
Haha, it's very possible I'm making it overly complicated. This is from Professor Andrew Ng's lecture (I believe #4 or 5), and while he goes over the theory pretty much in depth, he sort of glossed over the implementation. I just wanted to see whether my idea of how it might be implemented is right, or along the right path. Thanks!
Just a thought, but perhaps I should have asked whether this applies to Generative Algorithms in general. Eager to hear your thoughts.
(02 Mar '12, 18:37)
jinho
|