
I have a problem with spam messages whose Subject field is encoded as UTF-8 base64 and contains look-alike characters meant to fool the filter rules.

example:

raw subject of incoming email

Subject: =?UTF-8?B?UklGSVVU0J4gREkgUklOTtCeVtCe?=#821538

The subject as decoded by SpamAssassin contains the characters О instead of O:

__SUBJ_NOT_SHORT ======> got hit: "RIFIUTО DI RINNOVO"

so the rule does not trigger:

header     __SUBJECT_PHISHING_3     Subject=~ /(RIFIUTО DI RINNОVО)/i

However, the email client (Outlook or Thunderbird) displays these characters as an O, so the subject reads as correct Italian and fools the user:

RIFIUTО DI RINNОVО

So the spammer inserts look-alike characters, knowing that the client will display them as correct Italian while SpamAssassin will not trigger the rule.

Is there a way to match these characters, or to decode them the way the email client does, without having to create a new rule every time the spammer inserts a special character to bypass the filter?

I found the same problem discussed, with some hints, at https://users.spamassassin.apache.narkive.com/LhGDKXkm/utf-8-spam-rules

  • what do you mean, correctly - the email header unambiguously instructs to treat the base64-encoded payload as UTF-8?
    – anx
    Commented Nov 8, 2022 at 2:38
  • Hi, the spammer uses these special characters in RIFIUTО DI RINNOVO (О instead of O), which the mail client displays as Rifiuto di rinnovo, correct Italian. So if I create a rule to block emails with the subject Rifiuto di rinnovo, the spammer manages to bypass it. I would like to understand whether SpamAssassin can decode special characters for a defined language (Italian), so that I don't have to create ad hoc rules every time a new modified subject arrives.
    – hcomputer
    Commented Nov 9, 2022 at 7:59

1 Answer

I don't think there is an easy solution for this.

The problem here is that the email client decodes the base64-encoded text correctly: the text does not contain an "O" (Latin capital letter O, U+004F) but a Cyrillic one (Cyrillic capital letter O, U+041E).
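
To see this for yourself, here is a minimal Python sketch (Python is just my choice for illustration, it is not part of SpamAssassin) that decodes the base64 payload from the Subject header in the question and lists the non-ASCII characters it contains:

```python
import base64
import unicodedata

# Base64 payload copied from the Subject header in the question.
payload = "UklGSVVU0J4gREkgUklOTtCeVtCe"
subject = base64.b64decode(payload).decode("utf-8")

# Print every non-ASCII character with its position, code point and name.
for index, char in enumerate(subject):
    if ord(char) > 127:
        print(index, f"U+{ord(char):04X}", unicodedata.name(char))

# Output:
# 6 U+041E CYRILLIC CAPITAL LETTER O
# 15 U+041E CYRILLIC CAPITAL LETTER O
# 17 U+041E CYRILLIC CAPITAL LETTER O
```

All three "O"s in the subject are Cyrillic; every other letter is plain ASCII.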

So your regexp will not match, simply because for the regexp parser (and for programs in general), those two characters are not the same. For a human, they are, since they look exactly like one another, so it doesn't really matter which one is displayed. I'm not aware of any simple solution which allows you to match texts based on appearance.
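
To illustrate what "matching by appearance" would involve, here is a rough Python sketch that folds a few Cyrillic look-alikes to Latin letters before applying the regexp. The `CONFUSABLES` table and the `fold_confusables` helper are hypothetical names of mine, the table is hand-picked and far from complete, and none of this is a SpamAssassin feature:

```python
import re

# Hypothetical, deliberately tiny table: Cyrillic letters mapped to the
# Latin letters they resemble. A real filter would need a much larger set.
CONFUSABLES = str.maketrans({
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
    "\u0415": "E",  # CYRILLIC CAPITAL LETTER IE
    "\u041e": "O",  # CYRILLIC CAPITAL LETTER O
    "\u0420": "P",  # CYRILLIC CAPITAL LETTER ER
    "\u0421": "C",  # CYRILLIC CAPITAL LETTER ES
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
})


def fold_confusables(text: str) -> str:
    """Replace the listed look-alike characters with their Latin twins."""
    return text.translate(CONFUSABLES)


# Same phrase as the rule in the question, written with Latin letters only.
PHISHING = re.compile(r"RIFIUTO DI RINNOVO", re.IGNORECASE)

# The decoded subject from the question, with Cyrillic O written as escapes.
spoofed = "RIFIUT\u041e DI RINN\u041eV\u041e"

print(PHISHING.search(spoofed) is not None)                    # False
print(PHISHING.search(fold_confusables(spoofed)) is not None)  # True
```

The catch is that such a table has to cover every look-alike the spammer might pick, and folding would also mangle legitimate Cyrillic mail, which is why I don't consider it a simple solution.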

By the way, SpamAssassin should recognize the Cyrillic character and should have displayed that instead of the garbage "О" (though, truth be told, that would have been even more confusing). The garbage suggests that the UTF-8 bytes are being interpreted in a single-byte encoding somewhere along the way, so you should check the server's default character encoding.
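
As a hedged illustration of where that garbage can come from: the two UTF-8 bytes of the Cyrillic O, read as Windows-1252, give exactly the two characters shown in the debug output. That the server or terminal is really using Windows-1252 is my assumption; this Python snippet only shows that the bytes are consistent with it:

```python
# CYRILLIC CAPITAL LETTER O, written as an escape to avoid encoding issues.
cyrillic_o = "\u041e"

# In UTF-8 this single character takes two bytes: 0xD0 0x9E.
utf8_bytes = cyrillic_o.encode("utf-8")

# Decoding those bytes with a single-byte code page yields two characters.
garbled = utf8_bytes.decode("cp1252")

print(utf8_bytes.hex())                       # d09e
print([f"U+{ord(c):04X}" for c in garbled])   # ['U+00D0', 'U+017E']
```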
