[pLog-svn] r6088 - plog/branches/lifetype-1.2/class/security
Jon Daley
plogworld at jon.limedaley.com
Fri Nov 30 17:27:31 EST 2007
Thanks. My son was interested in the symbols, though he asked me
to pronounce them too. We are actually planning dinner right now - he is
going to make "Chinese salad", which I'll bet you have never had anything
like, at least with that name.
http://jon.limedaley.com/plog/archives/2007/05/06/chef-jonathan
On Fri, 30 Nov 2007, Mark Wu wrote:
> Just in case you cannot see the Chinese, I am sending a screenshot for you.
>
>
>
> Mark
>
>> -----Original Message-----
>> From: Mark Wu [mailto:markplace at gmail.com]
>> Sent: Friday, November 30, 2007 1:03 PM
>> To: 'LifeType Developer List'
>> Subject: RE: [pLog-svn] r6088 -
>> plog/branches/lifetype-1.2/class/security
>>
>> Hi Jon:
>>
>> For CJK sites the Bayesian filter does not work well, but I
>> believe it works well for Western users. The problem is not in
>> the Bayesian filter itself; it is because of tokenize().
>>
>> Especially if we ask BayesianFilter to learn what spam is
>> from the comment text and topic.
>>
>> I am not sure whether you can see Chinese or not, but here is an example:
>>
>> 我是一個程式開發者 <== Means "I am a developer"
>>
>> In English, you can easily split a sentence into words just by
>> breaking it at white space, but in CJK we can't. The whole
>> sentence should be separated into
>>
>> 我(I) 是(am) 一個(a) 程式開發者(developer)
>>
>> This is a natural language processing problem, and it is the most difficult part.
>>
>> It is a side topic; that's why I said "For your information".
>>
>>
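To make the tokenization point concrete, here is a minimal Python sketch. It is not LifeType's tokenize(); the function names are made up for illustration, and overlapping character bigrams are a swapped-in dictionary-free fallback, not the word segmentation described in the message. It only shows that splitting on white space returns the whole Chinese sentence ("I am a developer") as a single token, while a bigram split at least gives a filter smaller units to count.

```python
import re

# Hypothetical helpers for illustration; not LifeType's tokenizer.
# Matches runs of characters in the basic CJK Unified Ideographs block.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def whitespace_tokens(text):
    """Split on white space only, which is enough for most Western text."""
    return text.split()


def bigram_tokens(text):
    """Split on white space, then cut each CJK run into overlapping bigrams."""
    tokens = []
    for chunk in text.split():
        pos = 0
        for match in CJK_RUN.finditer(chunk):
            if match.start() > pos:
                # Keep any non-CJK text before the run as one token.
                tokens.append(chunk[pos:match.start()])
            run = match.group()
            if len(run) == 1:
                tokens.append(run)
            else:
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
            pos = match.end()
        if pos < len(chunk):
            # Keep any non-CJK text after the last run.
            tokens.append(chunk[pos:])
    return tokens


sentence = "我是一個程式開發者"  # "I am a developer"
print(whitespace_tokens("I am a developer"))
print(whitespace_tokens(sentence))
print(bigram_tokens(sentence))
```

Bigrams do not recover real words such as 程式開發者; proper segmentation needs a dictionary or a statistical model, which is the hard natural language processing part.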
>>> Does that fix everything? It is certainly the easiest
>>> (coding and performance) wise.
>>> With my thinking it seems like that fixes it - at least for now,
>>> because we don't have any other plugins that would use the inputs of
>>> others. And we can maybe do Mark's priority idea if we ever need that
>>> sort of thing.
>>> As long as it works for Paul's stuff, I think that sounds good.
>>> So, then we should take Mark's rev 6088 or whatever it is and use
>>> that, but modify it to pass in the previouslyRejected flag, and then
>>> put the bayesian at the end.
>>>
>>>> BTW, most LifeType installations on CJK sites do rely on the Bayesian
>>>> filter for protection against spam attacks. The tokenize algorithm
>>>> can't separate CJK text into atomic tokens: we don't use stop words and
>>>> "white space" to separate a paragraph into "words".
>>> I am not sure what you are saying. It seems like you are saying the
>>> tokenizer doesn't work, so then it seems that the bayesian filter
>>> wouldn't be very good at all...
>>>
>>> Well, it's been 10 minutes since I read your idea of simply putting
>>> the bayesian filter at the end, and haven't come up with a reason why
>>> it won't work. So, probably good.
>>> Do you want to do it, or me?
>>> Do you want to do it, or me?
>>
>> I will keep your commit in 1.2; it seems we have already
>> agreed to do it this way.
>>
>> But I will try to implement the $filter order, or say priority, in 2.0.
>>
>> That way we can make sure the Bayesian filter runs last
>> and users have a chance to unmark them.
>>
>> Mark
>
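For reference, the ordering idea discussed in the quoted thread (run the Bayesian filter last and pass it a previouslyRejected flag) could look roughly like the Python sketch below. Every class, field, and function name here is hypothetical and is not taken from the LifeType code. The sketch also assumes that the chain keeps running after a rejection so that the last filter still sees the comment.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Filter:
    """One entry in the chain. All fields are hypothetical."""
    name: str
    priority: int  # lower numbers run earlier
    check: Callable[[str, bool], bool]  # (text, previously_rejected) -> rejected?


def run_filters(text: str, filters: List[Filter]) -> bool:
    """Run filters in priority order and report whether any rejected the text."""
    previously_rejected = False
    for f in sorted(filters, key=lambda item: item.priority):
        # Assumption: keep going after a rejection so later filters can learn.
        if f.check(text, previously_rejected):
            previously_rejected = True
    return previously_rejected


def too_many_links(text: str, previously_rejected: bool) -> bool:
    """Toy rule-based filter: reject comments with more than two links."""
    return text.count("http://") > 2


class ToyLearningFilter:
    """Stand-in for a learning filter; it only records what it is told is spam."""

    def __init__(self) -> None:
        self.spam_seen: List[str] = []

    def check(self, text: str, previously_rejected: bool) -> bool:
        if previously_rejected:
            self.spam_seen.append(text)  # learn from the earlier rejection
        return False  # this toy never rejects on its own


learner = ToyLearningFilter()
chain = [
    Filter("learning", 100, learner.check),  # highest number, so it runs last
    Filter("links", 10, too_many_links),
]
print(run_filters("http://a http://b http://c", chain))
print(learner.spam_seen)
```

Giving the learning filter the highest priority number is what places it at the end, regardless of the order in which the filters were registered.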
--
Jon Daley
http://jon.limedaley.com/
Any concern too small to be turned into a prayer
is too small to be made into a burden.
-- Corrie Ten Boom