nfilter users

- J
- Jim Wilson
  
posted
20 years ago

Fri, Feb 20, 2004 8:18 PM

Here is yet another improvement.

rec.woodworking Drop subject:.*[Yy]*[^A-Za-z]*[Uu]*[^A-Za-z]*[Kk].*

This version eliminates messages with YUK in the subject, regardless of whether any non-alphabetic junk is inserted between the letters. You'll need additional similar lines in your filter file, one for each different objectionable word.

Regular expressions must be enabled. To do this, go to Edit-Configuration, and check the "Enable regular expressions" box on the General tab.

Jim

- G
- Gordon Airport
  
posted
20 years ago

Fri, Feb 20, 2004 8:39 PM

Trying to keep up with a subject filter arms-race seems like a bad idea. It's too easy to make a slight change and get more crap through, especially if you're posting the filters. It looks to me like filtering on "nym.alias.net" in the header might be the way to go. There may be some false positives, but I don't think many people here are using remailers (or wherever that string comes from.)

- Doug

- D
- Doug Miller
  
posted
20 years ago

Fri, Feb 20, 2004 8:58 PM

That would be the best -- for those of us whose newsservers supply the Path or Organization headers when Nfilter asks for them. Mine doesn't, unfortunately, so I have to filter by subject keywords or by Message-ID. Fortunately, the latter works pretty well until this bozo gets kicked off and finds another ISP.

-- Regards, Doug Miller

For a copy of my TrollFilter for NewsProxy/Nfilter, email me at filterinfo-at-milmac-dot-com

- A
- alexy
  
posted
20 years ago

Fri, Feb 20, 2004 9:32 PM

Actually, wouldn't that get anything with the letters y, u, and k in that order? You really don't want to miss messages about firetrucks!

- D
- Doug Miller
  
posted
20 years ago

Fri, Feb 20, 2004 10:38 PM

Not in my opinion. It does not appear that you tested this very carefully, if at all. Look at the test posts I put in alt.test.cztery. Then run this filter on them and see which ones it drops.

It does a lot more than that -- it eliminates all messages with the letter K anywhere in the subject.

Examine it piece by piece:

.*          = any string of ZERO or more characters
[Yy]*       = ZERO or more upper or lower case Y
[^A-Za-z]*  = ZERO or more non-alphabetic characters
[Uu]*       = ZERO or more upper or lower case U
[^A-Za-z]*  = ZERO or more non-alphabetic characters
[Kk]        = exactly ONE upper or lower case K
.*          = any string of ZERO or more characters

Thus, any string that has an upper or lower case K anywhere in it will be removed by this filter. Probably not what you want.
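To see the reduction concretely, here is a minimal sketch in Python. It assumes Python's `re` module treats these simple constructs the same way nfilter's regexp engine does, which nothing in this thread confirms. The sample subjects are made up for illustration.

```python
import re

# The original pattern, copied from the filter line without the "subject:" prefix.
ORIGINAL = re.compile(r".*[Yy]*[^A-Za-z]*[Uu]*[^A-Za-z]*[Kk].*")

# Every starred piece can match the empty string, so only [Kk] is mandatory.
STARRED_ONLY = re.compile(r"[Yy]*[^A-Za-z]*[Uu]*[^A-Za-z]*")

# Hypothetical subject lines, for illustration only.
subjects = ["Y-U-K", "firetruck", "Walnut desk finish", "Maple bowl", "Router table"]


def dropped(subject):
    """True if the original pattern matches, i.e. the article would be dropped."""
    return ORIGINAL.search(subject) is not None


# The starred prefix is satisfied by nothing at all.
print("starred pieces match empty string:", STARRED_ONLY.fullmatch("") is not None)

for s in subjects:
    # Compare the filter's verdict with a plain "contains a K" test.
    print(f"{s!r:25} dropped={dropped(s)}  contains K={'k' in s.lower()}")
```

On these samples the two columns agree line for line, which is the "anything containing a K" behaviour described above.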

Either remove the asterisks after the characters of the objectionable word(s)

.*[Yy][^A-Za-z]*[Uu][^A-Za-z]*[Kk].*

or replace them with plus signs (match ONE or more)

.*[Yy]+[^A-Za-z]*[Uu]+[^A-Za-z]*[Kk]+.*

and it will work a *lot* better.
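The two corrected forms are not identical in what they catch. The sketch below compares them in Python, on the assumption (not verified against nfilter) that Python's `re` behaves the same for these constructs. The subjects are invented test cases.

```python
import re

# Asterisks removed: exactly one of each letter, junk allowed between them.
NO_STAR = re.compile(r".*[Yy][^A-Za-z]*[Uu][^A-Za-z]*[Kk].*")
# Asterisks replaced by plus signs: one or more of each letter.
PLUS = re.compile(r".*[Yy]+[^A-Za-z]*[Uu]+[^A-Za-z]*[Kk]+.*")

# Hypothetical subject lines, for illustration only.
subjects = [
    "yuk",
    "Y.U.K",
    "YYY-UUU-KKK",
    "firetruck",
    "Walnut desk finish",
    "Yukon cedar",
]


def matches(pattern, subject):
    """True if the compiled pattern is found anywhere in the subject."""
    return pattern.search(subject) is not None


for s in subjects:
    print(f"{s!r:25} no-star={matches(NO_STAR, s)}  plus={matches(PLUS, s)}")
```

Only the plus-sign version catches repeated letters such as "YYY-UUU-KKK". Both leave "firetruck" alone, and both still hit an innocent word like "Yukon" that happens to contain the letters consecutively.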

-- Regards, Doug Miller

For a copy of my TrollFilter for NewsProxy/Nfilter, email me at filterinfo-at-milmac-dot-com

- D
- Doug Miller
  
posted
20 years ago

Fri, Feb 20, 2004 10:42 PM

As I noted in another post in this thread, it's not.

Perhaps you missed the ^ character? That *negates* what follows; thus, "[^A-Za-z]*" means "any sequence of zero or more consecutive NON-alphabetic characters".

However, this filter will in fact kill "firetruck" as a subject, but that's because it kills anything with a K in it anywhere. See my other post in this thread for a full description.

-- Regards, Doug Miller

For a copy of my TrollFilter for NewsProxy/Nfilter, email me at filterinfo-at-milmac-dot-com

- R
- Robert Bonomi
  
posted
20 years ago

Fri, Feb 20, 2004 11:41 PM

Please *don't* publish things when you "don't know what you're doing".

That filter is *grossly* defective. The '*' modifier character means "match ZERO OR MORE" of the preceding token.

As written, your filter will catch *anything*. "(zero or more 'y') (zero or more 'anything') (zero or more 'u') (zero or more 'anything') (zero or more 'k')" which reduces to 'match anything'.

Yes, it will get rid of the troll posts, at the price of getting rid of _everything_else_! This qualifies as a "BAD IDEA(TM)"

I've expanded/updated the filter files at so one can apply filters in 'groups', by simply appending individual files together.

- R
- Robert Bonomi
  
posted
20 years ago

Sat, Feb 21, 2004 1:02 AM

I've updated the filter rules at:

They're broken out into several separate files, for:

  hosts/domains/IP-addresses the troll posts from
  inappropriate and/or excessive cross-posting
  vulgarities
  politics

Simply download the rule-sets you want to use, append together to create a single file, and install.

Note: The main web-page now shows the date/time that each filter set was last modified.

- D
- Doug Miller
  
posted
20 years ago

Sat, Feb 21, 2004 1:10 AM

Well, not quite. There's no asterisk after the K, so it reduces to 'match anything containing a K'.

-- Regards, Doug Miller

For a copy of my TrollFilter for NewsProxy/Nfilter, email me at filterinfo-at-milmac-dot-com

- J
- Jim Wilson
  
posted
20 years ago

Sat, Feb 21, 2004 2:18 AM

I'm sorry for this error. There is apparently a shortcoming in nfilter's regular expression interpreter. The regular expression [^A-Za-z] means "any character that is not a letter."

Well, you are right that it does not work correctly with nfilter, which is the important point, of course, and I did fail to detect that before posting. When I tested it, it *did* effectively filter the posts that I wanted it to. However, I didn't make sure that it left the good stuff alone. For that, I do apologize. It's a beginner's programming error for which there is no excuse.

Actually, both of these interpretations overlook the leading caret, which is the negation modifier for a list (or class, for you PERL users). The fact that it doesn't work in nfilter doesn't make it an incorrect regular expression. Impractical in this case, though, to be sure.

Well, this is a rather strong admonishment. I don't mean to get uppity, but programming regular expressions is something that I do know a thing or two about, having been using them "regularly" for a couple of decades as a professional software developer. I admit to poorly testing this example on this implementation, as well as my slash typo in an earlier message. If these two careless mistakes are severe enough to warrant banning further posts from me on the subject, then so be it, but I think that's going a little far.

Cheers!

Jim

- T
- Thomas Kendrick
  
posted
20 years ago

Sat, Feb 21, 2004 3:25 AM

Jim, Thanks for trying to help the group develop useful filters.

Newsproxy DOES actually produce a suppression log which I have used to see which rule deleted a message. My take is that if I copy the results of your labors, it will be my responsibility to validate that it works for me, either via understanding the filter logic or testing it myself. All that I would assume is that you have given best effort to test the code in a few situations. Certainly expecting a full system regression test would be excessive. Thanks, Tom

- D
- Doug Miller
  
posted
20 years ago

Sat, Feb 21, 2004 3:59 AM

That's not the problem -- the problem is that "[Yy]*" matches Y, y, or *nothing*.

Missing the point. The point is that it's an incorrect regexp. Period. Whether it works with nfilter or not is irrelevant.

[snip]

I'm not overlooking that at all, as my posts in this thread have made abundantly clear.

It doesn't work in nfilter precisely because it IS an incorrect regexp. As I posted earlier, you must either (a) omit the asterisk, or (b) replace it with a plus sign, for it to do what you want. Regexps so constructed work.

I tested mine.

-- Regards, Doug Miller

For a copy of my TrollFilter for NewsProxy/Nfilter, email me at filterinfo-at-milmac-dot-com

- J
- Jim Wilson
  
posted
20 years ago

Sat, Feb 21, 2004 5:24 AM

You're absolutely right. I did miss the point, and it wouldn't work as intended for the very reason you were trying to point out. I compounded my error with a mis-diagnosis of the problem and then still further by misunderstanding what you were trying to say. Pretty laughable, even to me. All I can say is, egg on my face.

Has this sort of cascading blunderfest ever happened in the shop? Nah. (G)

Anyway, I finally have a filter that does what I want, nothing more and nothing less, and I have tested it extensively -- live, using nfilter in the alt.test newsgroup. It doesn't look like yours or Robert's, and it takes care of some conditions not covered by the other filter expressions I've seen so far here. Although we haven't seen all these "tricks" from our local trolls, I *have* seen them in spam email, so I figure it is only a matter of time before these variants start showing up here. Because of that, I am going to paste an example below with the caveat that it's from a guy who's been screwing up royally on this subject all day long. Take it or leave it.

Filter for "YUCK" or some variant of it in the subject:

rec.woodworking Drop subject:[Yy]([Yy]|[^A-Za-z])*[Uu]([Uu]|[^A-Za-z])*[Cc]([Cc]|[^A-Za-z])*[Kk]

Note that it should be only one line of text. Note also that the initial and final ".*" everyone's been using are not present as they are unnecessary.
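For anyone who wants to try the expression outside nfilter first, here is a minimal sketch in Python. It assumes Python's `re.search` (an unanchored search, which is why no leading or trailing ".*" is needed) behaves like nfilter's matcher for these constructs; that is an assumption, and the subjects are invented.

```python
import re

# The subject expression from the filter line above, as a single line.
YUCK = re.compile(
    r"[Yy]([Yy]|[^A-Za-z])*[Uu]([Uu]|[^A-Za-z])*[Cc]([Cc]|[^A-Za-z])*[Kk]"
)

# Hypothetical subject lines, for illustration only.
subjects = [
    "YUCK",
    "y.u.c.k",
    "YYY uuu ccc k",
    "yuk",
    "Walnut desk finish",
    "Kentucky truck",
]


def dropped(subject):
    """True if the expression is found anywhere in the subject."""
    return YUCK.search(subject) is not None


for s in subjects:
    print(f"{s!r:25} dropped={dropped(s)}")
```

The first three samples are dropped, including the padded and repeated-letter variants. The last three are kept, because other letters are not allowed between the target letters.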

Cheers!

Jim

- J
- Jim Wilson
  
posted
20 years ago

Sat, Feb 21, 2004 5:27 AM

Tom,

Thanks for your kind words of encouragement. Robert and Doug were absolutely right, though, and I don't blame them for ripping me a new one for my errors. I would have done the same if I could have reached around me that far. (G)

Cheers!

Jim

- R
- Robert Bonomi
  
posted
20 years ago

Sat, Feb 21, 2004 11:27 AM

Now you begin to understand why I was "encouraging" you not to publish.

"Regular expression" construction -- once you get much beyond the 'trivial' cases -- is a *very* arcane art. Unfortunately, there are many, *many*, ways to accomplish complex matches -- frequently with the more 'obvious' ways being far, *far*, less efficient than some less-obvious alternatives. I'm not talking about a factor of 2-3, but sometimes 1000x or more. I've seen a _one-line_ regexp (about 60 characters) that would require _centuries_ on a fast computer to find a match (*if* it existed) in one line (of roughly 100 characters) of text. That particular "exercise in futility" is a pathological case, unlikely to occur in the real world -- it was designed to showcase the potential problems in the design of the regular expression "language".

Yeah, that does 'pretty close' to what you say. Of course, it trips on that 4-letter sequence 'in the middle of' another word -- like "Pennsyucky".
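One possible way to avoid the "Pennsyucky" case is to require a non-letter (or the start or end of the subject) on either side of the match. The sketch below shows the idea in Python only. Whether nfilter accepts "^" and "$" inside a group like this is not established anywhere in this thread, so treat the guarded form as an untested assumption there. The subjects are invented.

```python
import re

CORE = r"[Yy]([Yy]|[^A-Za-z])*[Uu]([Uu]|[^A-Za-z])*[Cc]([Cc]|[^A-Za-z])*[Kk]"

# The expression as posted: matches the sequence anywhere, even inside a word.
UNGUARDED = re.compile(CORE)

# Hypothetical guarded variant: a non-letter or the subject boundary must
# come immediately before the Y and immediately after the K.
GUARDED = re.compile(r"(^|[^A-Za-z])" + CORE + r"([^A-Za-z]|$)")

# Hypothetical subject lines, for illustration only.
subjects = ["YUCK", "what a y-u-c-k!!", "Pennsyucky", "yucky"]

for s in subjects:
    print(
        f"{s!r:22} unguarded={UNGUARDED.search(s) is not None}"
        f"  guarded={GUARDED.search(s) is not None}"
    )
```

The trade-off is that the guarded form also lets through suffixed variants such as "yucky", so it is only worth using if false kills inside longer words are the bigger problem.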

Also, while it _does_ work, it is a _seriously_inefficient_ expression.

The construct "([Yy]|[^A-Za-z])*" results in a lot of 'thrashing' inside the code that does matching. "[^A-XZa-xz]*" is _literally_ orders of magnitude more efficient. Admittedly, you have to tailor the 'not' list depending on what the preceding character was. Also, characters that really are doubled in the 'targeted' word require special consideration.
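Applied to the whole "YUCK" expression, the single-class rewrite would look something like the sketch below, with the 'not' list tailored after each letter. The three tailored classes are worked out here for illustration and are not taken from any posted filter. The check only confirms that both forms accept and reject the same strings in Python's `re`; it does not measure speed, and it says nothing about nfilter's own engine.

```python
import itertools
import re

# Alternation form, as posted earlier in the thread.
ALTERNATION = re.compile(
    r"[Yy]([Yy]|[^A-Za-z])*[Uu]([Uu]|[^A-Za-z])*[Cc]([Cc]|[^A-Za-z])*[Kk]"
)

# Single negated class after each letter: "anything except letters other
# than the one just matched". Tailored for Y, then U, then C.
SINGLE_CLASS = re.compile(
    r"[Yy][^A-XZa-xz]*[Uu][^A-TV-Za-tv-z]*[Cc][^ABD-Zabd-z]*[Kk]"
)


def agree_on_all(alphabet, max_len):
    """Compare both patterns on every string over `alphabet` up to `max_len`."""
    for n in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(ALTERNATION.search(s)) != bool(SINGLE_CLASS.search(s)):
                return False
    return True


# Small alphabet: the target letters, one other letter, one non-letter.
print("patterns agree:", agree_on_all("yYuck.a", 5))
```

"YUCK" has no doubled letters, so the special consideration mentioned above for doubled characters does not arise in this example.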

Yup, you're ABSOLUTELY RIGHT about that. I actually use "something completely different" for my filtering, which gets _entire_ header lines to check against. Thus, when I want to check _only_ the "Subject: " line my regexp _has_ to start with "^Subject:.*".

The 'cost' of ".*" is -- for all practical purposes -- _ZERO_ when it occurs at the beginning of an expression. The only difference is _where_ the 'start of match' occurs in the string being checked. Without the leading ".*", the match starts at the first character of the targeted word/phrase. But with the ".*", the match begins at the beginning of the line. If you use the 'results of the match' for other processing (which you cannot do in nFilter/NewsProxy, but *can* do in other types of software) the difference can be important.

At the end of a regexp, the 'cost' of ".*" is small, but it is utterly wasted effort. The one that's even funnier is a trailing ".*$" -- `match zero or more of anything, up to end-of-line'. *exactly* the same functionality as a trailing ".*", _but_ introduces the additional overhead of an explicit check after _every_ character, to see if end-of-line has been reached, or not.
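The effect of a leading or trailing ".*" on where the match starts and ends can be seen in software that does expose match results. A small sketch using Python's `re`, with an invented subject line:

```python
import re

# Hypothetical subject line, for illustration only.
subject = "Walnut desk finish"

bare = re.search(r"[Kk]", subject)        # no ".*" on either side
leading = re.search(r".*[Kk]", subject)   # leading ".*"
trailing = re.search(r"[Kk].*", subject)  # trailing ".*"

# All three find a match; only the reported span differs.
for name, m in [("bare", bare), ("leading", leading), ("trailing", trailing)]:
    print(f"{name:9} matched={m is not None}  span={m.span()}")
```

All three succeed on the same subjects. The leading ".*" moves the start of the match to the beginning of the line, and the trailing ".*" stretches the end to the end of the line, without changing the yes/no answer that a drop rule depends on.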

Anyway, lazy me, I just strip the '^' from the beginning of the regexp I use, and export to an nFilter rule set.

- D
- Doug Miller
  
posted
20 years ago

Sat, Feb 21, 2004 3:02 PM

In article , snipped-for-privacy@host122.r-bonomi.com (Robert Bonomi) wrote: [snip]

Elegant.

Just curious, Robert ... how do the efficiencies of these two expressions compare?

[Yy][^A-XZa-xz]*
[Yy]+[^A-Za-z]*

-- Regards, Doug Miller

For a copy of my TrollFilter for NewsProxy/Nfilter, email me at filterinfo-at-milmac-dot-com

- D
- Doug Miller
  
posted
20 years ago

Sat, Feb 21, 2004 3:22 PM

Jim,

I'm sorry my tone was harsh; I didn't mean to be "ripping you a new one". It's important to bear in mind that, although a good number of the readers of this ng are computer-savvy (including more than a few professional programmers), there are many more who are not. And therefore, for their benefit, we must make sure that any code that we post is as close to perfect as we can make it.

You've just bumped into the single hardest facet of software testing. Almost any fool can verify that a piece of software does what it is intended to do. It's infinitely more difficult to make sure that it does _not_ do what it is _not_ intended to do. :-)
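One way to cover both halves is to keep two lists, subjects that must be dropped and subjects that must survive, and run every candidate expression against both before posting it. The sketch below does that in Python; `check_filter` and both sample lists are hypothetical, and it assumes Python's `re` agrees with nfilter for the constructs used, so it complements a live test rather than replacing one.

```python
import re


def check_filter(pattern, must_drop, must_keep):
    """Return (missed, false_kills) for a candidate subject expression.

    missed      -- subjects that should be dropped but are not matched
    false_kills -- subjects that should be kept but are matched
    """
    rx = re.compile(pattern)
    missed = [s for s in must_drop if rx.search(s) is None]
    false_kills = [s for s in must_keep if rx.search(s) is not None]
    return missed, false_kills


# Hypothetical test subjects, for illustration only.
must_drop = ["YUK", "Y.U.K", "y - u - k"]
must_keep = ["Walnut desk finish", "firetruck", "Maple bowl"]

candidates = {
    "with asterisks": r".*[Yy]*[^A-Za-z]*[Uu]*[^A-Za-z]*[Kk].*",
    "asterisks removed": r".*[Yy][^A-Za-z]*[Uu][^A-Za-z]*[Kk].*",
}

for name, pattern in candidates.items():
    missed, false_kills = check_filter(pattern, must_drop, must_keep)
    print(f"{name}: missed={missed} false_kills={false_kills}")
```

The version with asterisks passes the first half (nothing missed) and fails the second, which is exactly the kind of error that testing only the intended behaviour does not reveal.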

-- Regards, Doug Miller

For a copy of my TrollFilter for NewsProxy/Nfilter, email me at filterinfo-at-milmac-dot-com

- W
- Wood Butcher
  
posted
20 years ago

Sat, Feb 21, 2004 8:53 PM

How do *you* test your nfilter rules? This is what I've been doing and if there is a better way I'd sure appreciate knowing what it is.

First close nfilter & restart it to load the new nfilter.dat file.

Send test msgs to alt.test & see what happens. [being sure to substitute alt.test for rec.woodworking in the rules for this test.]

In an alternate account (in OE) subscribe to the wreck and download all message headers (about 12K at this time) and see what leaked thru. This tests for all the actual spam msgs to the wreck. In nfilter's "Dropped Articles" window, copy everything into an Excel worksheet. Manipulate the data to sort on the source in the message-id field (i.e. what's after the @) and see what rules killed those sources. The overwhelming bulk of the kills are for vulgarity and 3 or more x-posts so I figure those are good kills.
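The sort-by-source step could also be scripted. The sketch below is in Python and assumes the dropped-article data has already been turned into (message-id, rule) pairs; the actual layout of nfilter's "Dropped Articles" window is not described here, so that input shape, the rule names, and the sample message-ids are all hypothetical.

```python
from collections import Counter, defaultdict


def source_of(message_id):
    """Return the part of a Message-ID after the '@', without angle brackets."""
    return message_id.strip().strip("<>").rpartition("@")[2].lower()


def rules_by_source(dropped):
    """Count, per source, how many kills each rule was responsible for."""
    table = defaultdict(Counter)
    for message_id, rule in dropped:
        table[source_of(message_id)][rule] += 1
    return table


# Hypothetical (message-id, rule) pairs, for illustration only.
dropped = [
    ("<a1@news.example.com>", "vulgarity"),
    ("<a2@news.example.com>", "vulgarity"),
    ("<b7@News.Example.com>", "3+ x-posts"),
    ("<c3@isp.example.net>", "subject keyword"),
]

for source, counts in sorted(rules_by_source(dropped).items()):
    print(source, dict(counts))
```

Sources whose kills come only from the vulgarity or cross-post rules can then be set aside, leaving a short list to check by hand.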

For the remaining few sources that could be bad kills how does one xref the msg-id to a subject line?

Art

- J
- Jim Wilson
  
posted
20 years ago

Sat, Feb 21, 2004 9:52 PM

These aren't equivalent expressions, so an efficiency comparison doesn't seem pertinent.
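A few short strings are enough to show the difference. This is a minimal Python sketch, again assuming Python's `re` reads these classes the way nfilter does. `fullmatch` is used so that each expression has to account for the whole string.

```python
import re

# Single tailored class: further Y's and non-letters may be interleaved.
TAILORED = re.compile(r"[Yy][^A-XZa-xz]*")
# Plus form: all the Y's must come first, then only non-letters.
PLUS = re.compile(r"[Yy]+[^A-Za-z]*")

for s in ["y", "yy", "y..", "y.y", "yy.y"]:
    a = TAILORED.fullmatch(s) is not None
    b = PLUS.fullmatch(s) is not None
    print(f"{s!r:8} tailored={a}  plus={b}")
```

The two agree on "y", "yy" and "y..", but only the tailored class accepts "y.y" and "yy.y", where a Y reappears after the punctuation.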

Jim

- J
- Jim Wilson
  
posted
20 years ago

Sat, Feb 21, 2004 9:53 PM

Hey, no sweat. My skin is thick, and I had it coming.

Not the first time with over 20 years in the field. (G)

Jim