[pLog-svn] r6088 - plog/branches/lifetype-1.2/class/security

Jon Daley plogworld at jon.limedaley.com
Fri Nov 30 17:27:31 EST 2007

 	Thanks.  My son was interested in the symbols, though he asked me 
to pronounce them too.  We are actually planning dinner right now - he is 
going to make "Chinese salad", and I'll bet you have never had anything 
like it, at least by that name.


On Fri, 30 Nov 2007, Mark Wu wrote:
> Just in case you cannot see the Chinese, I am sending a screenshot for you.
> Mark
>> -----Original Message-----
>> From: Mark Wu [mailto:markplace at gmail.com]
>> Sent: Friday, November 30, 2007 1:03 PM
>> To: 'LifeType Developer List'
>> Subject: RE: [pLog-svn] r6088 - plog/branches/lifetype-1.2/class/security
>> Hi Jon:
>> For CJK sites the Bayesian filter does not work well, but I
>> believe it works well for Western users. The problem is not in
>> the Bayesian filter itself; it is because of tokenize(),
>> especially if we ask BayesianFilter to learn what spam is
>> from the comment text and topic.
>> I am not sure whether you can see Chinese or not, but here is an example:
>> 我是一個程式開發者 <== Means "I am a developer"
>> In English, you can easily separate a sentence just by
>> splitting it on white space, but in CJK we can't. The whole
>> sentence should be separated into
>> 我(I) 是(am) 一個(a) 程式開發者(developer)
>> It is about natural language processing. It is the most difficult part.
>> It is a side topic; that's why I said "For your information".
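
The tokenization problem described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not LifeType's tokenize() (LifeType itself is PHP). The function names and the regular expression are assumptions made for illustration, and overlapping character bigrams are a common stand-in for real word segmentation, not something proposed in this thread.

```python
import re

# Runs of CJK Unified Ideographs (basic block only; an assumption for brevity).
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def whitespace_tokens(text):
    """Naive tokenizer: split on white space only."""
    return text.split()


def bigram_tokens(text):
    """Split on white space, then break each CJK run into overlapping bigrams."""
    tokens = []
    for chunk in text.split():
        pos = 0
        for match in CJK_RUN.finditer(chunk):
            if match.start() > pos:
                # Non-CJK text before the run is kept as one token.
                tokens.append(chunk[pos:match.start()])
            run = match.group()
            if len(run) == 1:
                tokens.append(run)
            else:
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
            pos = match.end()
        if pos < len(chunk):
            tokens.append(chunk[pos:])
    return tokens


sentence = "我是一個程式開發者"
print(whitespace_tokens(sentence))  # the whole sentence comes back as one token
print(bigram_tokens(sentence))      # eight overlapping two-character tokens
```

With white-space splitting alone, the filter sees the entire sentence as a single "word" and rarely sees it twice, so it has little to learn from. Bigrams do not recover the true word boundaries shown above, but they give the filter repeatable units to count.
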
>>>     Does that fix everything?  It is certainly the easiest (coding and
>>> performance) wise.
>>>     With my thinking it seems like that fixes it - at least for now,
>>> because we don't have any other plugins that would use the inputs of
>>> others.  And we can maybe do Mark's priority idea if we ever need that
>>> sort of thing.
>>>     As long as it works for Paul's stuff, I think that sounds good.
>>> So, then we should take Mark's rev 6088 or whatever it is and use
>>> that, but modify it to pass in the previouslyRejected flag, and then
>>> put the bayesian at the end.
>>>> BTW, most LifeType installations on CJK sites do not rely on the
>>>> Bayesian filter for protection against spam attacks, because the
>>>> tokenize algorithm can't separate CJK text into atomic tokens. We
>>>> don't use stop words and "white space" to separate a paragraph
>>>> into "words".
>>>     I am not sure what you are saying.  It seems like you are saying the
>>> tokenizer doesn't work, so then it seems that the bayesian filter
>>> wouldn't be very good at all...
>>>     Well, it's been 10 minutes since I read your idea of simply putting
>>> the bayesian filter at the end, and haven't come up with a reason why
>>> it won't work.  So, probably good.
>>> Do you want to do it, or me?
>> I will keep your commit in 1.2; it seems we have already reached a
>> conclusion to do it this way.
>> But I will try to implement the $filter order, or say priority, in 2.0,
>> so we can make sure the Bayesian filter runs at the very end
>> and users have a chance to unmark them.
>> Mark
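
The ordering idea discussed in this thread (run the Bayesian filter last and pass it a previouslyRejected flag) can be sketched roughly as follows. This is a hypothetical Python illustration, not LifeType code: the Filter class, the numeric priority convention, the stub checks, and the assumption that the flag simply records whether an earlier filter rejected the input are all invented here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Filter:
    name: str
    priority: int  # hypothetical convention: lower numbers run first
    check: Callable[[str, bool], bool]  # (text, previously_rejected) -> rejected?


def run_pipeline(filters: List[Filter], text: str) -> Tuple[bool, List[str]]:
    """Run filters in priority order, telling each one whether an earlier one rejected."""
    previously_rejected = False
    order = []
    for f in sorted(filters, key=lambda f: f.priority):
        order.append(f.name)
        if f.check(text, previously_rejected):
            previously_rejected = True
    return previously_rejected, order


# Stub filters, for illustration only.
seen_by_bayesian = []


def link_check(text, previously_rejected):
    return "http://" in text


def bayesian_check(text, previously_rejected):
    # A real filter could use the flag when learning; the stub only records it.
    seen_by_bayesian.append((text, previously_rejected))
    return False


filters = [
    Filter("bayesian", priority=100, check=bayesian_check),
    Filter("links", priority=10, check=link_check),
]
rejected, order = run_pipeline(filters, "cheap pills http://example.invalid")
print(rejected, order, seen_by_bayesian)
```

Because the filters are sorted by priority before they run, the registration order does not matter: the Bayesian stub is listed first here but still runs last and sees the result of the earlier check.
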

Jon Daley

Any concern too small to be turned into a prayer
is too small to be made into a burden.
-- Corrie Ten Boom
