<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<HTML><HEAD><TITLE></TITLE>

<META http-equiv=Content-Type content="text/html; charset=big5">

<META content="MSHTML 6.00.6000.16544" name=GENERATOR></HEAD>

<BODY><!-- Converted from text/plain format -->

<P><FONT size=2>Just in case you cannot see the Chinese characters, I am sending a screenshot 

for you.</FONT></P>

<DIV><FONT size=2><IMG 

src="cid:580170605@30112007-2C47"><BR><BR>Mark<BR><BR>&gt; -----Original 

Message-----<BR>&gt; From: Mark Wu [<A 

href="mailto:markplace@gmail.com">mailto:markplace@gmail.com</A>]<BR>&gt; Sent: 

Friday, November 30, 2007 1:03 PM<BR>&gt; To: 'LifeType Developer List'<BR>&gt; 

Subject: RE: [pLog-svn] r6088 -<BR>&gt; 

plog/branches/lifetype-1.2/class/security<BR>&gt;<BR>&gt; Hi 

Jon:<BR>&gt;<BR>&gt; For CJK sites the Bayesian filter does not work well, but 

I<BR>&gt; believe it works well for Western users. The problem is not in the<BR>&gt; 

Bayesian filter itself; it is because of tokenize().<BR>&gt;<BR>&gt; 

This is especially true if we ask BayesianFilter to learn what spam is<BR>&gt; from the 

comment text and topic.<BR>&gt;<BR>&gt; I am not sure whether you can see Chinese or 

not, but here is an example:<BR>&gt;<BR>&gt; 我是一個程式開發者 &lt;== means "I am a 

developer"<BR>&gt;<BR>&gt; In english, you can easily seperate the sentense just 

by<BR>&gt; seperate them by white space, but in CJK, we can't. The whole<BR>&gt; 

sentense should seperate to<BR>&gt;<BR>&gt; 我(I) 是(am) 一個(a) 

程式開發者(developer)<BR>&gt;<BR>&gt; This is a natural language processing problem. It is 

the most difficult part.<BR>&gt;<BR>&gt; It is a side topic; that's why I said 

"For your information"<BR>&gt;<BR>&gt;<BR>&gt; &gt;&nbsp; &nbsp;&nbsp; Does that 

fix everything?&nbsp; It is certainly the easiest<BR>&gt; (coding and<BR>&gt; 

&gt; performance) wise.<BR>&gt; &gt;&nbsp; &nbsp;&nbsp; With my thinking it 

seems like that fixes it - at least<BR>&gt; for now,<BR>&gt; &gt; because we 

don't have any other plugins that would use the<BR>&gt; inputs of<BR>&gt; &gt; 

others.&nbsp; And we can maybe do Mark's priority idea if we<BR>&gt; ever need 

that<BR>&gt; &gt; sort of thing.<BR>&gt; &gt;&nbsp; &nbsp;&nbsp; As long as it 

works for Paul's stuff, I think that sounds good.<BR>&gt; &gt; So, then we 

should take Mark's rev 6088 or whatever it is and use<BR>&gt; &gt; that, but 

modify it to pass in the previouslyRejected flag,<BR>&gt; and then<BR>&gt; &gt; 

put the bayesian at the end.<BR>&gt; &gt;<BR>&gt; &gt; &gt; BTW,&nbsp; most 

LifeType installations on CJK sites do rely<BR>&gt; on the Bayesian<BR>&gt; &gt; 

&gt; filter to protect against spam attacks. The tokenize algorithm<BR>&gt; 

&gt; &gt; can't separate CJK text into atomic tokens. We don't use<BR>&gt; &gt; 

stop words and<BR>&gt; &gt; &gt; "white space" to separate a paragraph into 

"words".<BR>&gt; &gt;&nbsp; &nbsp;&nbsp; I am not sure what you are saying.&nbsp; 

It seems like you<BR>&gt; are saying the<BR>&gt; &gt; tokenizer doesn't work, so 

then it seems that the bayesian filter<BR>&gt; &gt; wouldn't be very good at 

all...<BR>&gt; &gt;<BR>&gt; &gt;&nbsp; &nbsp;&nbsp; Well, it's been 10 minutes 

since I read your idea of<BR>&gt; simply putting<BR>&gt; &gt; the bayesian 

filter at the end, and I haven't come up with a<BR>&gt; reason why<BR>&gt; &gt; it 

won't work.&nbsp; So, probably good.<BR>&gt; &gt; Do you want to do it, or 

me?<BR>&gt;<BR>&gt; I will keep your commit in 1.2; it seems we have already 

reached<BR>&gt; a conclusion to do it this way.<BR>&gt;<BR>&gt; But I will try to 

implement the $filter order, or priority, in 2.0.<BR>&gt;<BR>&gt; That way, we can 

make sure the Bayesian filter runs at the very<BR>&gt; end and users can 

have a chance to unmark them.<BR>&gt;<BR>&gt; Mark</FONT> </DIV></BODY></HTML>