According to RFC 5322, section 3.6.5, the Subject header is defined as
subject = "Subject:" unstructured CRLF
and RFC 5322, section 2.2.1, describes "unstructured" as follows:
Some field bodies in this specification are defined simply as
"unstructured" (which is specified in section 3.2.5 as any printable
US-ASCII characters plus white space characters) with no further
restrictions.
Because only US-ASCII characters are allowed in the Subject header, any non-US-ASCII character must be encoded as US-ASCII. RFC 2047 defines a draft standard for that, which is widely in use. For example, with the "Q" encoding (section 4.2), which is similar to Quoted-Printable:
- Cyrillic: Тест
  becomes Subject: =?UTF-8?Q?=D0=A2=D0=B5=D1=81=D1=82?=
- Arabic: متحان
  becomes Subject: =?UTF-8?Q?=D9=85=D8=AA=D8=AD=D8=A7=D9=86?=
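As a rough illustration of how such an encoded-word is formed, here is a minimal Python sketch. The helper name q_encode is hypothetical, and the sketch ignores the 75-character limit that RFC 2047 places on an encoded-word, so it only suits short subjects.

```python
import string

# Bytes left unencoded: ASCII letters and digits only. This is a conservative
# assumption; RFC 2047 permits a few more characters in some contexts.
SAFE = frozenset((string.ascii_letters + string.digits).encode("ascii"))


def q_encode(text, charset="UTF-8"):
    """Return text as a single RFC 2047 "Q" encoded-word (hypothetical helper)."""
    parts = []
    for byte in text.encode(charset):
        if byte in SAFE:
            parts.append(chr(byte))
        elif byte == 0x20:
            parts.append("_")  # the Q encoding represents a space as "_"
        else:
            parts.append("=%02X" % byte)  # everything else as =XX, upper-case hex
    return "=?%s?Q?%s?=" % (charset, "".join(parts))


print("Subject: " + q_encode("Тест"))
# prints: Subject: =?UTF-8?Q?=D0=A2=D0=B5=D1=81=D1=82?=
```

Every Cyrillic or Arabic letter turns into two =XX groups, because UTF-8 encodes those code points as two bytes each.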
Matching in PCRE header_checks must therefore be done against that encoded form. However, matching whole Unicode blocks this way is pretty hard, because a regular expression has to cover the entire range of encoded byte sequences, as the following table demonstrates.
| Unicode block | Range          | Q-P start | Q-P end |
|---------------|----------------|-----------|---------|
| Cyrillic      | U+0400..U+04FF | =D0=80    | =D3=BF  |
| Arabic        | U+0600..U+06FF | =D8=80    | =DB=BF  |
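The boundaries in the table can be reproduced, and turned into patterns, with a short Python sketch. The names qp_form, CYRILLIC_Q and ARABIC_Q are made up for this illustration, and Python's re module stands in for PCRE here; the character classes used are plain enough that they should carry over, but that is an assumption rather than something tested against Postfix. The patterns only cover the one case of UTF-8 combined with the Q encoding.

```python
import re


def qp_form(code_point):
    """Hex-escaped UTF-8 bytes of one code point, e.g. 0x0400 -> "=D0=80"."""
    return "".join("=%02X" % b for b in chr(code_point).encode("utf-8"))


# Both blocks are encoded as two UTF-8 bytes: a lead byte followed by a
# continuation byte in the range 80..BF.
CYRILLIC_Q = re.compile(r"=D[0-3]=[89AB][0-9A-F]", re.IGNORECASE)  # lead D0..D3
ARABIC_Q = re.compile(r"=D[89AB]=[89AB][0-9A-F]", re.IGNORECASE)   # lead D8..DB

print(qp_form(0x0400), qp_form(0x04FF))  # prints: =D0=80 =D3=BF
print(qp_form(0x0600), qp_form(0x06FF))  # prints: =D8=80 =DB=BF

subject = "Subject: =?UTF-8?Q?=D0=A2=D0=B5=D1=81=D1=82?="
print(bool(CYRILLIC_Q.search(subject)), bool(ARABIC_Q.search(subject)))
# prints: True False
```

The same text sent in another charset (for example KOI8-R) or in the "B" encoding produces completely different bytes, so each combination would need its own pattern. That is where the approach stops being practical.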
This limitation is also mentioned in the Postfix Built-in Content Inspection documentation:
Limitations of Postfix header/body checks
Header/body checks do not decode message headers or message body content. For example, if text in the message body is BASE64 encoded (RFC 2045) then your regular expressions will have to match the BASE64 encoded form. Likewise, message headers with encoded non-ASCII characters (RFC 2047) need to be matched in their encoded form.
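To show why decoding first makes the problem much easier, here is a small Python sketch using the standard library's email.header module. The function name decoded_subject is hypothetical. It illustrates the general idea of matching on decoded text, not how any particular filter is implemented.

```python
import re
from email.header import decode_header, make_header

# On decoded text, the Unicode block can be matched directly by code point.
CYRILLIC = re.compile("[\u0400-\u04FF]")


def decoded_subject(raw_value):
    """Decode RFC 2047 encoded-words in a header value into a str."""
    return str(make_header(decode_header(raw_value)))


# The same subject, once in "Q" and once in "B" (Base64) encoding.
q_form = "=?UTF-8?Q?=D0=A2=D0=B5=D1=81=D1=82?="
b_form = "=?UTF-8?B?0KLQtdGB0YI=?="

for raw in (q_form, b_form):
    print(bool(CYRILLIC.search(decoded_subject(raw))))
# prints True for both forms
```

A pattern written against the Q-encoded bytes would miss the Base64 form entirely, whereas one pattern on the decoded string covers both.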
I would suggest using SpamAssassin rules instead. The TextCat language guesser plugin even has an ok_languages option, which acts on the language detected from the message body.