How to detect non-Latin scripts in email subject?

Question

On my mail server running Postfix, I want to reject mail whose subject uses certain non-Latin scripts, in particular Arabic and Cyrillic, since none of my users (family) speak languages written in them.

I tried it with a PCRE header check like this:

/Subject:.*\p{Arabic}/i WARN Arabic detected

Unfortunately, that doesn't trigger a warning when I send a test mail to my server. I have verified that WARN messages do appear in the system log using a /Subject:.*Test/i WARN Test rule, which does trigger.

How can I detect Arabic and Cyrillic in the subject using Postfix?

For completeness, in my main.cf I include the header_checks like this:

header_checks = pcre:/etc/postfix/header_checks

Esa Jokinen · Accepted Answer · 2024-02-13 22:19:49Z

According to RFC 5322, 3.6.5, the Subject header is defined as

subject = "Subject:" unstructured CRLF

and RFC 5322, 2.2.1 defines "unstructured":

Some field bodies in this specification are defined simply as "unstructured" (which is specified in section 3.2.5 as any printable US-ASCII characters plus white space characters) with no further restrictions.

Because only US-ASCII characters are allowed in the Subject header, any non-US-ASCII character must be encoded as US-ASCII. RFC 2047, a Draft Standard, defines a widely used mechanism for that. E.g., with the "Q" encoding (section 4.2), which is similar to Quoted-Printable,

Cyrillic Тест becomes Subject: =?UTF-8?Q?=D0=A2=D0=B5=D1=81=D1=82?=
Arabic متحان becomes Subject: =?UTF-8?Q?=D9=85=D8=AA=D8=AD=D8=A7=D9=86?=
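As a quick check of the two examples above, the encoded subjects can be decoded with Python's standard `email.header` module. This is only an illustrative sketch run outside Postfix; the helper name `decode_subject` is made up for this example.

```python
from email.header import decode_header, make_header

# The two encoded subjects quoted above (header name stripped).
encoded_subjects = [
    "=?UTF-8?Q?=D0=A2=D0=B5=D1=81=D1=82?=",
    "=?UTF-8?Q?=D9=85=D8=AA=D8=AD=D8=A7=D9=86?=",
]


def decode_subject(raw: str) -> str:
    """Decode RFC 2047 encoded-words in a raw header value to text."""
    return str(make_header(decode_header(raw)))


for raw in encoded_subjects:
    print(raw, "->", decode_subject(raw))
```

Postfix header_checks only ever see the left-hand side of each printed line.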

The matching in PCRE header_checks has to be done against that encoded form. However, matching whole Unicode blocks this way is pretty hard, because each block maps to a range of multi-byte UTF-8 sequences rather than to a single fixed prefix, as the following table demonstrates.

Unicode block	Range	Q-P start	Q-P end
Cyrillic	U+0400..U+04FF	`=D0=80`	`=D3=BF`
Arabic	U+0600..U+06FF	`=D8=80`	`=DB=BF`
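The table values follow directly from the UTF-8 encoding of the first and last code point of each block. A small Python sketch that recomputes them (the function name `qp_utf8` is hypothetical, chosen for this example):

```python
def qp_utf8(code_point: int) -> str:
    """Return the UTF-8 bytes of a code point written in =XX form."""
    return "".join(f"={byte:02X}" for byte in chr(code_point).encode("utf-8"))


# First and last code point of each Unicode block from the table.
blocks = {"Cyrillic": (0x0400, 0x04FF), "Arabic": (0x0600, 0x06FF)}

for name, (first, last) in blocks.items():
    print(f"{name}: {qp_utf8(first)} .. {qp_utf8(last)}")
```

Each block spans four different lead bytes (D0 to D3 for Cyrillic, D8 to DB for Arabic), and these byte values only apply when the sender used UTF-8 with the Q encoding.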

This limitation is also mentioned in the Postfix Built-in Content Inspection documentation:

Limitations of Postfix header/body checks

Header/body checks do not decode message headers or message body content. For example, if text in the message body is BASE64 encoded (RFC 2045) then your regular expressions will have to match the BASE64 encoded form. Likewise, message headers with encoded non-ASCII characters (RFC 2047) need to be matched in their encoded form.

I would suggest using SpamAssassin rules instead. Its TextCat language guesser plugin even has an ok_languages setting for flagging unwanted languages in the message body.
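To illustrate what any filter that works on decoded headers has to do, here is a minimal Python sketch: decode the subject first, then look at the script of each character. It is not SpamAssassin's or Postfix's implementation; the names `BLOCKED_SCRIPTS` and `blocked_scripts_in` are hypothetical, and the check relies on the assumption that Unicode character names begin with the script name, which holds for Arabic and Cyrillic letters.

```python
import unicodedata
from email.header import decode_header, make_header

# Hypothetical policy: scripts that nobody on this server reads.
BLOCKED_SCRIPTS = ("ARABIC", "CYRILLIC")


def blocked_scripts_in(raw_subject: str) -> set[str]:
    """Decode an RFC 2047 subject and return the blocked scripts it contains."""
    text = str(make_header(decode_header(raw_subject)))
    found = set()
    for char in text:
        # Character names start with the script,
        # e.g. "CYRILLIC CAPITAL LETTER TE" or "ARABIC LETTER MEEM".
        name = unicodedata.name(char, "")
        for script in BLOCKED_SCRIPTS:
            if name.startswith(script):
                found.add(script)
    return found


# Works the same for the Q and the B (Base64) encoding.
print(blocked_scripts_in("=?UTF-8?Q?=D0=A2=D0=B5=D1=81=D1=82?="))
print(blocked_scripts_in("=?UTF-8?B?0KLQtdGB0YI=?="))
print(blocked_scripts_in("Test"))
```

Because the check runs on decoded text, it does not matter which charset or transfer encoding the sender chose. Malformed encoded-words are not handled here and would need error handling in real use.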

That would already be hard to pull off without false positives, but the mails I'm seeing are using the B encoding… Base64. You can't do regular expressions on those. So I guess both you and @Zac67 are telling me "it can't be done with header_checks", but surely there must be some way to reject such mails early, without accepting them and passing them to something like SpamAssassin first. – DarkDust, Commented Feb 13 at 22:11
That's true. I added a citation from the official documentation stating the same. If you want to reject these mails with SpamAssassin rules, you could run it as a Postfix before-queue Milter. That adds some delay before messages are accepted, which might cause problems with some MTAs. On the other hand, SpamAssassin could run only a few rules in that phase and be used again later with a different set of rules. – Esa Jokinen, Commented Feb 13 at 22:26
Good idea, using a "fast checks only" SpamAssassin setup would indeed solve my question and help with rejecting some other stuff early as well. – DarkDust, Commented Feb 13 at 22:46

Zac67 · Answer · 2024-02-13 21:46:11Z

RFC 2047 defines how to encode non-ASCII character sets into the Subject header.

Essentially, it uses the form =?charset?encoding?encoded-text?=, where charset can be any character set defined for MIME, such as UTF-8, encoding is B for Base64 or Q for a quoted-printable-like encoding, and encoded-text is the actual subject line. Just look at the source of one of those encoded messages and you'll get the idea.
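The three parts can be pulled apart with a few lines of Python. This is a simplified sketch for a single encoded-word: the pattern `ENCODED_WORD` and the function `split_encoded_word` are made up for this example, and real headers may contain several encoded-words mixed with plain text.

```python
import base64
import quopri
import re

# =?charset?encoding?encoded-text?=  (simplified pattern for one encoded-word)
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([BbQq])\?([^?]*)\?=")


def split_encoded_word(word: str) -> tuple[str, str, str]:
    """Split one encoded-word into (charset, encoding, decoded text)."""
    match = ENCODED_WORD.fullmatch(word)
    if match is None:
        raise ValueError(f"not an encoded-word: {word!r}")
    charset, encoding, payload = match.groups()
    if encoding.upper() == "B":
        raw = base64.b64decode(payload)
    else:
        # header=True makes "_" decode to a space, as the Q encoding requires.
        raw = quopri.decodestring(payload.encode("ascii"), header=True)
    return charset, encoding.upper(), raw.decode(charset)


# The same subject in both encodings.
print(split_encoded_word("=?UTF-8?Q?=D0=A2=D0=B5=D1=81=D1=82?="))
print(split_encoded_word("=?UTF-8?B?0KLQtdGB0YI=?="))
```

Both calls yield the same decoded text, although the raw header values have nothing in common beyond the `=?UTF-8?` prefix.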

Do you mean that Postfix only passes the encoded subject to the header_checks, and there's no way to check the decoded subject here? – DarkDust, Commented Feb 13 at 22:02
I'm pretty sure Postfix only handles the encoded headers and leaves decoding to the email client. – Zac67, Commented Feb 13 at 22:03
