<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE></TITLE>
<META http-equiv=Content-Type content="text/html; charset=big5">
<META content="MSHTML 6.00.6000.16544" name=GENERATOR></HEAD>
<BODY><!-- Converted from text/plain format -->
<P><FONT size=2>Just in case you cannot see the Chinese characters, here is a screenshot
for you.</FONT></P>
<DIV><FONT size=2><IMG
src="cid:580170605@30112007-2C47"><BR><BR>Mark<BR><BR>> -----Original
Message-----<BR>> From: Mark Wu [<A
href="mailto:markplace@gmail.com">mailto:markplace@gmail.com</A>]<BR>> Sent:
Friday, November 30, 2007 1:03 PM<BR>> To: 'LifeType Developer List'<BR>>
Subject: RE: [pLog-svn] r6088 -<BR>>
plog/branches/lifetype-1.2/class/security<BR>><BR>> Hi
Jon:<BR>><BR>> For CJK sites the Bayesian filter does not work well, but
I<BR>> believe it works well for Western users. The problem is not in<BR>>
the Bayesian filter itself; it is because of tokenize().<BR>><BR>>
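A minimal sketch of the tokenize() problem, assuming a plain whitespace tokenizer
(whitespace_tokenize is a hypothetical helper for illustration, not LifeType's
actual tokenize() code):
<PRE>
```python
# Hypothetical stand-in for a whitespace-based tokenizer; this is an
# assumption for illustration, not LifeType's actual tokenize() code.
def whitespace_tokenize(text):
    # str.split() with no arguments splits on runs of whitespace.
    return text.split()

english = "I am a developer"
chinese = "我是一個程式開發者"  # the same sentence in Chinese, written without spaces

print(whitespace_tokenize(english))  # four separate word tokens
print(whitespace_tokenize(chinese))  # one token holding the whole sentence
```
</PRE>
> Under that assumption the English sentence yields four tokens, while the<BR>>
Chinese sentence comes back as a single token, so a Bayesian filter gets no<BR>>
word-level features to learn from.<BR>><BR>>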
Especially if we ask BayesianFilter to learn what is<BR>> spam from the
comment text and topic.<BR>><BR>> I am not sure whether you can see Chinese or
not, but here is an example:<BR>><BR>> 我是一個程式開發者 <== Means "I am a
developer"<BR>><BR>> In English, you can easily separate a sentence into words just
by<BR>> splitting it on white space, but in CJK we can't. The whole<BR>>
sentence should be separated into<BR>><BR>> 我(I) 是(am) 一個(a)
程式開發者(developer)<BR>><BR>> That is a natural language processing problem, and it is
the most difficult part.<BR>><BR>> It is a side topic; that's why I said
"For your information".<BR>><BR>><BR>> > Does that
fix everything? It is certainly the easiest<BR>> (coding- and<BR>>
> performance-wise).<BR>> > With my thinking it
seems like that fixes it - at least<BR>> for now,<BR>> > because we
don't have any other plugins that would use the<BR>> inputs of<BR>> >
others. And we can maybe do Mark's priority idea if we<BR>> ever need
that<BR>> > sort of thing.<BR>> > As long as it
works for Paul's stuff, I think that sounds good.<BR>> > So, then we
should take Mark's rev 6088 or whatever it is and use<BR>> > that, but
modify it to pass in the previouslyRejected flag,<BR>> and then<BR>> >
put the bayesian at the end.<BR>> ><BR>> > > BTW, most
LifeType installations on CJK sites do rely<BR>> on the Bayesian<BR>> >
> Filter to protect against spam attacks. Because the tokenize algorithm<BR>>
> > can't separate CJK into atomic tokens. We don't use<BR>> >
stop words and<BR>> > > "white space" to separate a paragraph into
"words".<BR>> > I am not sure what you are saying.
It seems like you<BR>> are saying the<BR>> > tokenizer doesn't work, so
then it seems that the bayesian filter<BR>> > wouldn't be very good at
all...<BR>> ><BR>> > Well, it's been 10 minutes
since I read your idea of<BR>> simply putting<BR>> > the bayesian
filter at the end, and haven't come up with a<BR>> reason why<BR>> > it
won't work. So, probably good.<BR>> > Do you want to do it, or
me?<BR>><BR>> I will keep your commit in 1.2; it seems we have already
reached<BR>> a conclusion to do it this way.<BR>><BR>> But I will try to
implement the $filter order, or rather priority, in 2.0.<BR>><BR>> That way we can
make sure the Bayesian filter runs<BR>> last and users
have a chance to unmark them.<BR>><BR>> Mark</FONT> </DIV></BODY></HTML>