[pLog-svn] r6088 - plog/branches/lifetype-1.2/class/security

Mark Wu markplace at gmail.com
Fri Nov 30 00:06:27 EST 2007


Just in case you cannot see the Chinese, I am sending a screenshot for you.



Mark

> -----Original Message-----
> From: Mark Wu [mailto:markplace at gmail.com]
> Sent: Friday, November 30, 2007 1:03 PM
> To: 'LifeType Developer List'
> Subject: RE: [pLog-svn] r6088 -
> plog/branches/lifetype-1.2/class/security
>
> Hi Jon:
>
> For CJK sites the Bayesian filter does not work well, but I
> believe it works well for Western users. The problem is not in the
> Bayesian filter itself; it is because of tokenize().
>
> Especially if we ask BayesianFilter to learn what spam is
> from the comment text and topic.
>
> I am not sure whether you can see Chinese or not, but here is an example:
>
> 我是一個程式開發者 <== Means "I am a developer"
>
> In English, you can easily separate the sentence just by
> splitting it on white space, but in CJK we can't. The whole
> sentence should be separated into
>
> 我(I) 是(am) 一個(a) 程式開發者(developer)
>
> It is about natural language processing. It is the most difficult part.
>
> It is a side topic; that's why I said "For your information".
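>
> For example, here is a rough sketch of why a whitespace-based
> tokenizer breaks down on CJK (this is illustrative only, not the
> real LifeType tokenize() code; the function name is made up):
>
>     <?php
>     // Hypothetical whitespace tokenizer, similar in spirit to what a
>     // Bayesian filter might use -- not the actual LifeType code.
>     function naiveTokenize( $text )
>     {
>         // split the text on runs of whitespace (UTF-8 aware)
>         return preg_split( '/\s+/u', trim( $text ));
>     }
>
>     print_r( naiveTokenize( "I am a developer" ));
>     // four useful tokens: "I", "am", "a", "developer"
>
>     print_r( naiveTokenize( "我是一個程式開發者" ));
>     // one single token (the whole sentence), because there is no
>     // whitespace, so the filter never learns per-word spam
>     // probabilities for CJK text
>     ?>
>
> A real fix would need a CJK word segmenter or n-gram tokenization,
> which is why I said it is a natural language processing problem.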
>
>
> >     Does that fix everything?  It is certainly the easiest
> > (coding and performance) wise.
> >     With my thinking it seems like that fixes it - at least for
> > now, because we don't have any other plugins that would use the
> > inputs of others.  And we can maybe do Mark's priority idea if we
> > ever need that sort of thing.
> >     As long as it works for Paul's stuff, I think that sounds good.
> > So, then we should take Mark's rev 6088 or whatever it is and use
> > that, but modify it to pass in the previouslyRejected flag, and
> > then put the Bayesian at the end.
> >
> > > BTW, most LifeType installations on CJK sites do rely on the
> > > Bayesian filter to protect against spam attacks, because the
> > > tokenize algorithm can't separate CJK into atomic tokens. We
> > > don't use stop words and "white space" to separate a paragraph
> > > into "words".
> >     I am not sure what you are saying.  It seems like you are
> > saying the tokenizer doesn't work, so then it seems that the
> > Bayesian filter wouldn't be very good at all...
> >
> >     Well, it's been 10 minutes since I read your idea of simply
> > putting the Bayesian filter at the end, and I haven't come up with
> > a reason why it won't work.  So, probably good.
> > Do you want to do it, or me?
>
> I will keep your commit in 1.2; it seems we have already reached a
> conclusion to do it this way.
>
> But I will try to implement the $filter order, or priority, in 2.0.
>
> So we can make sure the Bayesian filter runs last, and the user
> still has a chance to unmark comments that were wrongly flagged.
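>
> Roughly what I have in mind is something like this (just a sketch;
> the class and method names are hypothetical, not the real LifeType
> pipeline API):
>
>     <?php
>     // Hypothetical priority-ordered pipeline: filters with a lower
>     // priority run first, and the Bayesian filter registers with the
>     // highest priority so it always runs last.
>     function comparePriority( $a, $b )
>     {
>         return $a['priority'] - $b['priority'];
>     }
>
>     class FilterPipeline
>     {
>         var $filters = array();
>
>         function addFilter( $filter, $priority )
>         {
>             $this->filters[] = array( 'filter'   => $filter,
>                                       'priority' => $priority );
>         }
>
>         function filter( $comment )
>         {
>             usort( $this->filters, 'comparePriority' );
>             $previouslyRejected = false;
>             foreach( $this->filters as $entry ) {
>                 // every filter sees whether an earlier filter already
>                 // rejected the comment, like the previouslyRejected
>                 // flag Jon mentioned
>                 if( $entry['filter']->filter( $comment, $previouslyRejected ))
>                     $previouslyRejected = true;
>             }
>             return $previouslyRejected;
>         }
>     }
>     ?>
>
> With explicit priorities we would not have to hard-code the Bayesian
> filter's position, and other plugins could order themselves too.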
>
> Mark 
[Attachment: screenshot (image/jpeg, 19679 bytes): http://limedaley.com/pipermail/plog-svn/attachments/20071130/d1d67669/attachment-0001.jpeg]

