[pLog-svn] r6088 - plog/branches/lifetype-1.2/class/security
Mark Wu
markplace at gmail.com
Fri Nov 30 00:02:49 EST 2007
Hi Jon:
For CJK sites the Bayesian filter does not work well, but I believe it works well for Western users. The problem is not in the Bayesian filter itself; it is in tokenize().
Especially if we ask BayesianFilter to learn what spam is from the comment text and topic.
I am not sure whether you can see Chinese or not, but here is an example:
我是一個程式開發者 <== Means "I am a developer"
In English, you can easily split a sentence into words just by splitting on white space, but in CJK we can't. The whole sentence should be separated into
我(I) 是(am) 一個(a) 程式開發者(developer)
This is a natural language processing problem, and it is the most difficult part.
It is a side topic; that's why I said "For your information".
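To illustrate the tokenization problem described above, here is a minimal sketch in Python. It is not LifeType's tokenize(); the function names are hypothetical. It shows that splitting on white space returns the whole Chinese sentence as a single token, and that overlapping character bigrams (a common workaround, not real word segmentation) at least yield smaller tokens a Bayesian filter could count.

```python
import re

def whitespace_tokens(text):
    """Split on white space, which is adequate for most Western text."""
    return text.split()

def cjk_bigrams(text):
    """Hypothetical workaround: overlapping two-character tokens taken from
    each run of CJK ideographs. This is NOT true word segmentation."""
    tokens = []
    for run in re.findall(r"[\u4e00-\u9fff]+", text):
        if len(run) == 1:
            tokens.append(run)
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens

sentence = "我是一個程式開發者"

# White space splitting cannot break the sentence apart.
print(whitespace_tokens(sentence))

# Bigrams produce smaller tokens, some of which are real words (e.g. 程式).
print(cjk_bigrams(sentence))
```

The bigrams do not match the correct segmentation 我 / 是 / 一個 / 程式開發者; they only give the filter something smaller than the whole sentence to learn from.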
> Does that fix everything? It is certainly the easiest (coding and
> performance) wise.
> With my thinking it seems like that fixes it - at least
> for now, because we don't have any other plugins that would
> use the inputs of others. And we can maybe do Mark's
> priority idea if we ever need that sort of thing.
> As long as it works for Paul's stuff, I think that sounds good.
> So, then we should take Mark's rev 6088 or whatever it is and
> use that, but modify it to pass in the previouslyRejected
> flag, and then put the bayesian at the end.
>
> > BTW, most lifetype installations in CJK site does rely on Bayesian
> > Filter to protect the spam attack. Because the tokenize algorithm
> > can't separate CJK into each atomic token. We don't use stop words and
> > "white space" to separate a paragraph into "word".
> I am not sure what you are saying. It seems like you
> are saying the tokenizer doesn't work, so then it seems that
> the bayesian filter wouldn't be very good at all...
>
> Well, it's been 10 minutes since I read your idea of
> simply putting the bayesian filter at the end, and haven't
> come up with a reason why it won't work. So, probably good.
> Do you want to do it, or me?
I will keep your commit in 1.2; it seems we have already agreed to do it this way.
But I will try to implement the $filter order, or say priority, in 2.0.
That way we can make sure the Bayesian filter runs last and users have a chance to unmark them.
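As a rough sketch of the ordering we agreed on (plain Python, not the actual LifeType classes; every name here is hypothetical), the other filters run first and the Bayesian filter runs at the end, receiving a previously-rejected flag. What the Bayesian filter does with that flag is an assumption on my part; the toy version below only records it.

```python
from typing import Callable, List

# Hypothetical types: an ordinary filter returns True to reject a comment;
# the Bayesian filter also receives the previously_rejected flag.
Filter = Callable[[str], bool]
BayesianFilter = Callable[[str, bool], bool]

def run_pipeline(comment: str, filters: List[Filter],
                 bayesian: BayesianFilter) -> bool:
    """Return True if the comment ends up rejected."""
    # Ordinary filters run first; any() stops at the first rejection.
    previously_rejected = any(f(comment) for f in filters)
    # The Bayesian filter always runs last and is told what the others decided.
    bayes_rejected = bayesian(comment, previously_rejected)
    return previously_rejected or bayes_rejected

def link_filter(comment: str) -> bool:
    """Toy filter: reject any comment that contains a link."""
    return "http://" in comment

seen = []  # records what the toy Bayesian filter was told

def toy_bayesian(comment: str, previously_rejected: bool) -> bool:
    """Toy stand-in: remembers the flag and never rejects by itself."""
    seen.append((comment, previously_rejected))
    return False

print(run_pipeline("buy now http://example.com", [link_filter], toy_bayesian))
print(run_pipeline("nice post", [link_filter], toy_bayesian))
```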
Mark