
MailHeader Class Reference

#include <MailHeader.h>


Public Member Functions

 MailHeader (SpamParameters &p, HeaderInfo &headInfo)
MailFilter::classification parseContentType (FILE *fp, char *buf)
const char * getBoundaryStr ()
MailFilter::classification checkHeader (FILE *fp)

Private Member Functions

void saveBoundary (const char *pBound)
void fillInSections (FILE *fp)
MailFilter::classification checkReceived (const char *buf, FILE *fp, size_t line)
MailFilter::classification checkSubject (const char *buf, FILE *fp)
MailFilter::classification checkFrom (const char *buf)
bool addrContinues (const char *buf)
MailFilter::classification checkDomainAddrs (const char *domainName, const char *pBuf)
MailFilter::classification checkAddressSection (const char *buf, FILE *fp)

Private Attributes

Logger log
SpamParameters & mParams
HeaderInfo & mHeadInfo
bool foundValidAddress
char boundaryStr [128]
char pushBackBuf [1024]
char * pPushBack


Detailed Description

Support for processing the email header (for example the From:, To:, and Subject: fields).

The pushBackBuf

When processing the email header it is sometimes necessary to read the next line to know what to do: for example, to know whether the header has come to an end with a blank line, or whether the subject or another part of the header continues on the next line. Reading the next line may mean that we have read too far and consumed a line that still needs to be processed. To deal with this issue, the pushBackBuf is used. The next line is read into the pushBackBuf. When subsequent logic needs another line, the pushBackBuf is checked first (in the code, by testing whether pPushBack is non-null) before another line is read.
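The push-back idea can be sketched as a small reader with a one-line look-ahead slot. This is an illustration only: LineReader, skipContinuations and demoPushBack are hypothetical names that do not appear in MailHeader, the sketch reads from a std::istream rather than a FILE * so that it is self-contained, and it assumes that a continuation line starts with a space or a tab (MailHeader itself looks for a colon or a blank line instead).

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper (not part of MailHeader): a line reader with a
// one-line push-back slot, playing the role of pushBackBuf and pPushBack.
class LineReader {
public:
    explicit LineReader(std::istream &in) : in_(in), hasPushBack_(false) {}

    // Return the pushed-back line if there is one, otherwise read a new
    // line. Returns false when the input is exhausted.
    bool next(std::string &line) {
        if (hasPushBack_) {
            line = pushBack_;
            hasPushBack_ = false;
            return true;
        }
        return !std::getline(in_, line).fail();
    }

    // Hand a line back so that the following call to next() returns it.
    void pushBack(const std::string &line) {
        pushBack_ = line;
        hasPushBack_ = true;
    }

private:
    std::istream &in_;
    std::string pushBack_;
    bool hasPushBack_;
};

// Skip the continuation lines of the current field. Assumption for this
// sketch: a continuation line starts with a space or a tab.
std::size_t skipContinuations(LineReader &reader) {
    std::size_t skipped = 0;
    std::string line;
    while (reader.next(line)) {
        const bool continuation =
            !line.empty() && (line[0] == ' ' || line[0] == '\t');
        if (!continuation) {
            reader.pushBack(line);  // read one line too far: hand it back
            break;
        }
        skipped++;
    }
    return skipped;
}

// Usage on an in-memory header: returns the line that was pushed back.
std::string demoPushBack() {
    std::istringstream in("Received: from a\n\tby b\nSubject: hi\n");
    LineReader reader(in);
    std::string line;
    reader.next(line);          // the "Received:" line
    skipContinuations(reader);  // skips "\tby b", pushes back "Subject: hi"
    reader.next(line);          // served from the push-back slot
    return line;
}
```

The design point is that the caller that reads too far does not need to know who will consume the line next; it only has to hand the line back.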

The boundaryStr

A boundary string is used to separate the sections of a MIME formatted email. Email software (like this email filter) can skip between sections by looking for the boundary string. The boundary string is defined in the email header. The MailHeader code saves the boundary string (if it exists) in the boundaryStr buffer. The boundary string is then used in processing the email body.
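As an illustration of how a saved boundary string can be used to find section breaks, the sketch below classifies a single line of the email body. It relies on the MIME convention that a delimiter line is "--" followed by the boundary string, with a further trailing "--" on the final delimiter. The names BoundaryKind and classifyBoundaryLine are hypothetical; the mail filter's own body-processing code is not shown on this page and may work differently.

```cpp
#include <string>

// Hypothetical result type for this sketch.
enum BoundaryKind { NOT_BOUNDARY, SECTION_START, MESSAGE_END };

// Decide whether a body line is a MIME delimiter for the given boundary.
BoundaryKind classifyBoundaryLine(const std::string &line,
                                  const std::string &boundary) {
    if (boundary.empty()) {
        return NOT_BOUNDARY;  // no boundary was defined in the header
    }
    const std::string delim = "--" + boundary;
    if (line.compare(0, delim.size(), delim) != 0) {
        return NOT_BOUNDARY;
    }
    // The line starts with the delimiter; a trailing "--" closes the message.
    if (line.compare(delim.size(), 2, "--") == 0) {
        return MESSAGE_END;
    }
    return SECTION_START;
}
```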

Definition at line 57 of file MailHeader.h.


Member Function Documentation

bool MailHeader::addrContinues (const char *buf) [private]
 

addrContinues

Return true if it looks like the "To:" field is spread across multiple lines.

The "To:" field continues on another line when the line ends with either a comma or a single quote followed by a double quote.

This function looks at the end of the line, ignoring trailing white space. The field continues if the line ends with:

  • "," (a comma)

  • "'\"" (a single quote followed by a double quote)

Note that in the code below the search is done in reverse, so the '"' is encountered before the '\'' (single quote).
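The rule can be restated as a small standalone function. This is a sketch using std::string rather than the member function itself; the name lineContinues is hypothetical.

```cpp
#include <cctype>
#include <string>

// Returns true when an address line appears to continue on the next line:
// once trailing white space is ignored, the line ends in ',' or in a
// single quote followed by a double quote.
bool lineContinues(const std::string &line) {
    std::string::size_type end = line.size();
    while (end > 0 &&
           std::isspace(static_cast<unsigned char>(line[end - 1]))) {
        end--;
    }
    if (end == 0) {
        return false;  // empty or all white space
    }
    if (line[end - 1] == ',') {
        return true;
    }
    return end >= 2 && line[end - 1] == '"' && line[end - 2] == '\'';
}
```

Working with an index that stops at zero avoids ever looking at a character before the start of the string, which matters for lines that contain only white space.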

Definition at line 188 of file MailHeader.C.

References Logger::log().

Referenced by checkAddressSection().

00189 {
00190   bool rslt = false;
00191 
00192   log.log(Logger::DEBUG, "addrContinues", "enter");
00193   if (buf) {
00194     int end = strlen( buf );
00195 
00196     if (end > 0) {
00197       end--;
00198       const char *endPtr;
00199       for (endPtr = &buf[end]; endPtr > buf && isspace((unsigned char)*endPtr); endPtr--)
00200         /* nada */;
00201       if (*endPtr == ',') {
00202         rslt = true;
00203       }
00204       else if (*endPtr == '"') {
00205         if (endPtr > buf && *(endPtr-1) == '\'') {
00206           rslt = true;
00207         }
00208       }
00209     }
00210   }
00211 
00212   char msgbuf[128];
00213   sprintf( msgbuf, "returns %s", (rslt) ? "TRUE" : "FALSE" );
00214   log.log(Logger::DEBUG, "addrContinues", msgbuf );
00215 
00216   log.log(Logger::DEBUG, "addrContinues", "exit");
00217   return rslt;
00218 } // addrContinues

MailFilter::classification MailHeader::checkAddressSection (const char *buf, FILE *fp) [private]
 

Check either the To: or Cc: sections.

The "to_list" addresses, defined in SpamFilterParams, will usually be mailing lists (for example, the ANTLR antlr-interest mailing list that is distributed via Yahoo). If one of these addresses (or part of an address) is found, then the function will return EMAIL and no further content checking will be done by the mail filter.

This function checks for the domain name specified in the my_domain section of SpamFilterParams. If the domain is found it then checks to see if the user (i.e., the string to the left of the @) is listed in the valid_users section. This function limits user names to strings consisting of 'a'..'z' and '0'..'9' (case insensitive) plus the underscore character. Note that only one domain is allowed in the my_domain section.

Before moving to my current ISP I would get any e-mail addressed to bearcave.com. This was a problem when the site bearcave.org existed since a number of bearcave.org users made the mistake of using .com when they should have used .org. Checking for valid users marks as garbage any email to a user that is not valid.

Definition at line 334 of file MailHeader.C.

References addrContinues(), checkDomainAddrs(), SpamParameters::getSection(), and Logger::log().

Referenced by checkHeader().

00335 {
00336   log.log(Logger::DEBUG, "checkAddressSection", "enter");
00337 
00338   MailFilter::classification klass = MailFilter::UNKNOWN;
00339   vector<const char *> toAddrs = mParams.getSection(SpamParameters::to_list);
00340   vector<const char *> myDomain = mParams.getSection(SpamParameters::my_domain);
00341 
00342   const char *domainName = 0;
00343   if (myDomain.size() > 0)
00344     domainName = myDomain[0];
00345   
00346   const char *pBuf = buf;
00347   const size_t toAddrLen = toAddrs.size();
00348 
00349   char localBuf[1024];
00350   size_t i;
00351   bool done;
00352   do {
00353     done = true;
00354 
00355     SpamUtil().toLower(localBuf, pBuf, sizeof(localBuf));
00356     for (i = 0; i < toAddrLen; i++) {
00357       if (strstr(localBuf, toAddrs[i]) != 0) {
00358         const char *hit = toAddrs[i];
00359         char msg[128];
00360         snprintf(msg, sizeof(msg), "found \"%s\", marked as EMAIL", hit );
00361         log.log(Logger::DEBUG, "checkAddressSection", msg );
00362         klass = MailFilter::EMAIL;
00363         break;
00364       }
00365     } // for
00366 
00367     if (klass == MailFilter::UNKNOWN && domainName != 0) {
00368       klass = checkDomainAddrs( domainName, localBuf );
00369     }
00370 
00371     if (klass == MailFilter::UNKNOWN) {
00372       if (addrContinues(localBuf)) {
00373         if ((pBuf = fgets(localBuf, sizeof(localBuf), fp)) != 0) {
00374           done = false;
00375         }
00376       }
00377     }
00378 
00379   } while (!done);
00380 
00381   log.log(Logger::DEBUG, "checkAddressSection", "exit");
00382 
00383   return klass;
00384 } // checkAddressSection

MailFilter::classification MailHeader::checkDomainAddrs (const char *domainName, const char *pBuf) [private]
 

Check the string for a user name associated with domainName.

The domain name is defined in the my_domain section of SpamFilterParams. Valid user names for this domain are defined in the valid_users section.

If a valid user name is found then the foundValidAddress flag is set to true. If a user name is found that is not in the valid_users list then the classification GARBAGE is returned. Otherwise, UNKNOWN is returned (UNKNOWN is returned when only valid users are found as well).
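The check can be sketched as follows. This is an illustration rather than the member function: Verdict, DomainCheck, isUserChar and checkUsers are hypothetical names, the text is assumed to be lower-cased already, and the user name characters follow the description of checkAddressSection() (letters, digits and underscore), whereas the listing below tests only isalnum().

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the two MailFilter::classification values used.
enum Verdict { UNKNOWN, GARBAGE };

struct DomainCheck {
    Verdict verdict;      // GARBAGE if a user outside validUsers was found
    bool foundValidUser;  // plays the role of foundValidAddress
};

static bool isUserChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Look at every "user@domain" occurrence in text and check the user name.
DomainCheck checkUsers(const std::string &text, const std::string &domain,
                       const std::vector<std::string> &validUsers) {
    DomainCheck result = { UNKNOWN, false };
    const std::string needle = "@" + domain;
    std::string::size_type at = text.find(needle);
    while (at != std::string::npos) {
        // Walk left from the '@' to find where the user name starts.
        std::string::size_type begin = at;
        while (begin > 0 && isUserChar(text[begin - 1])) {
            begin--;
        }
        const std::string user = text.substr(begin, at - begin);
        bool valid = false;
        for (std::size_t i = 0; i < validUsers.size(); i++) {
            if (validUsers[i] == user) {
                valid = true;
            }
        }
        if (!valid) {
            result.verdict = GARBAGE;  // stop at the first unknown user
            return result;
        }
        result.foundValidUser = true;
        at = text.find(needle, at + needle.size());
    }
    return result;
}
```

As in the description above, finding a valid user does not by itself settle the classification; it only records that the mail was addressed to someone real.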

Definition at line 235 of file MailHeader.C.

References SpamParameters::getSection(), Logger::log(), and HeaderInfo::reason().

Referenced by checkAddressSection().

00237 {
00238   log.log(Logger::DEBUG, "checkDomainAddrs", "enter");
00239 
00240   assert( ((domainName != 0) && (pBuf != 0)) );
00241 
00242   vector<const char *> validUsers = mParams.getSection(SpamParameters::valid_users);
00243   const size_t numUsers = validUsers.size();
00244   MailFilter::classification klass = MailFilter::UNKNOWN;
00245   
00246   bool done = false;
00247   size_t domainNameLen = strlen( domainName );
00248   const char *domainPtr = strstr(pBuf, domainName );
00249   while (domainPtr) {
00250     if (domainPtr > pBuf+2) {
00251       domainPtr--;
00252       if (*domainPtr == '@') {
00253         // find the start and end of the user name
00254         const char *endPtr = domainPtr;
00255         domainPtr--;
00256         const char *beginPtr = domainPtr;
00257         while (beginPtr >= pBuf && isalnum( *beginPtr ))
00258           beginPtr--;
00259         if (beginPtr < pBuf || !isalnum(*beginPtr)) {
00260           beginPtr++;
00261         }
00262 
00263         // Now check to see if the user name is in the valid_users list
00264         // Note that this function is used for both To: and Cc:, so 
00265         // foundValidAddress could have been set in a previous call.
00266         bool foundInList = false;
00267         for (size_t i = 0; i < numUsers; i++) {
00268           const char *word = validUsers[i];
00269           if (SpamUtil().match(beginPtr, endPtr, word)) {
00270             foundValidAddress = true;
00271             foundInList = true;
00272           }
00273         } // for
00274 
00275         if (!foundInList) {
00276           char msg[128];
00277           char user[128];
00278           size_t ix = 0;
00279           for (const char *pCh = beginPtr; pCh < endPtr && ix < sizeof(user) - 1; pCh++, ix++) {
00280             user[ix] = *pCh;
00281           }
00282           user[ix] = '\0';
00283           snprintf(msg, sizeof(msg), "Non-valid user \"%s\", email marked as GARBAGE", 
00284                   user );
00285           mHeadInfo.reason( msg );
00286           log.log(Logger::DEBUG, "checkDomainAddrs", msg );
00287           klass = MailFilter::GARBAGE;
00288         }
00289         // endPtr points to the '@'
00290         pBuf = (endPtr + 1);
00291       }
00292     }
00293     if (klass == MailFilter::UNKNOWN) {
00294       pBuf = pBuf + domainNameLen;
00295       domainPtr = strstr(pBuf, domainName);
00296     }
00297     else {
00298       break;  // exit the while loop
00299     }
00300   } // while
00301 
00302   log.log(Logger::DEBUG, "checkDomainAddrs", "exit");
00303 
00304   return klass;
00305 } // checkDomainAddrs

MailFilter::classification MailHeader::checkFrom (const char *buf) [private]
 

Check to see if an e-mail address in the from_address section of the SpamParameters is in the "From:" field. If a "from_address" string is found then it is valid email and no further checking will be done by the mail filter. This allows people you know to send you e-mail that may have spam or kill words in it.

The "From" is also checked against "from_kill" strings. This allows you to mark as garbage email from frequent spammers. For example, when I developed this software there was a spammer who used YoDude in the from line and another that used "TailWaggingOffers".
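The order of the two checks matters: the kill list is consulted first, so a "from_kill" match wins even if a "from_address" string also appears. A minimal sketch of that precedence follows; Verdict, toLower, containsAny and classifyFrom are hypothetical names, and the list entries are assumed to be stored in lower case already.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the MailFilter::classification values used.
enum Verdict { UNKNOWN, EMAIL, GARBAGE };

static std::string toLower(std::string s) {
    for (std::size_t i = 0; i < s.size(); i++) {
        s[i] = static_cast<char>(
            std::tolower(static_cast<unsigned char>(s[i])));
    }
    return s;
}

// True if any of the needles occurs somewhere in haystack.
static bool containsAny(const std::string &haystack,
                        const std::vector<std::string> &needles) {
    for (std::size_t i = 0; i < needles.size(); i++) {
        if (haystack.find(needles[i]) != std::string::npos) {
            return true;
        }
    }
    return false;
}

// Kill strings are checked before trusted addresses.
Verdict classifyFrom(const std::string &fromLine,
                     const std::vector<std::string> &fromKill,
                     const std::vector<std::string> &fromAddress) {
    const std::string from = toLower(fromLine);
    if (containsAny(from, fromKill)) {
        return GARBAGE;
    }
    if (containsAny(from, fromAddress)) {
        return EMAIL;
    }
    return UNKNOWN;
}
```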

Definition at line 123 of file MailHeader.C.

References SpamParameters::getSection(), Logger::log(), and HeaderInfo::reason().

Referenced by checkHeader().

00124 {
00125   MailFilter::classification klass = MailFilter::UNKNOWN;
00126   log.log(Logger::DEBUG, "checkFrom", "enter");
00127 
00128   vector<const char *> fromAddrs = mParams.getSection(SpamParameters::from_address);
00129   vector<const char *> killAddrs = mParams.getSection(SpamParameters::from_kill);
00130 
00131   char msg[128];
00132   char from[256];
00133 
00134   SpamUtil().toLower(from, buf, sizeof(from));
00135 
00136   size_t len;
00137   len = killAddrs.size();
00138   for (size_t i = 0; i < len; i++) {
00139     if (strstr(from, killAddrs[i]) != 0) {
00140       snprintf(msg, sizeof(msg), "Found address \"%s\", email marked as GARBAGE", 
00141               killAddrs[i] );
00142       mHeadInfo.reason( msg );
00143       log.log(Logger::DEBUG, "checkFrom", msg );
00144       klass = MailFilter::GARBAGE;
00145       break;
00146     }
00147   }
00148 
00149   if (klass == MailFilter::UNKNOWN) {
00150     len = fromAddrs.size();
00151     for (size_t i = 0; i < len; i++) {
00152       if (strstr(from, fromAddrs[i]) != 0) {
00153         snprintf(msg, sizeof(msg), "Found \"from address\" \"%s\", email marked as EMAIL", 
00154                 fromAddrs[i] );
00155         log.log(Logger::DEBUG, "checkFrom", msg );
00156         klass = MailFilter::EMAIL;
00157         break;
00158       }
00159     } // for
00160   }
00161 
00162   log.log(Logger::DEBUG, "checkFrom", "exit");
00163   return klass;
00164 } // checkFrom

MailFilter::classification MailHeader::checkHeader (FILE *fp)
 

Rules for processing the e-mail header:

The "To:" and "Cc:"

  • Check for items in the "to_list". This is where mailing list addresses go. If a "to_list" item is found the email is classified as "EMAIL".

  • Check for a "my_address" address. In some cases spammers do not include your e-mail address in the "To:" or "Cc:" lines since they are using mailing lists or direct SMTP connections. Of course, if your address is found it may still be spam.

The "From:" and "From"

At least in the case of e-mail on Linux there is a "From" line which leads the e-mail file. This line has the following format:

From

Note that this "From" has no colon. A "From:" line follows which may or may not have the same e-mail address. In the case of SPAM it frequently does not, since SPAMmers forge the email address.
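The distinction between the two kinds of line can be sketched as below. FromKind and classifyFromLine are hypothetical names. Like the listing further down, the sketch only looks at the first four characters and the character that follows them, so any line that starts with "from" but is not followed by a colon is treated as the leading "From" line.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch.
enum FromKind { NOT_FROM, FROM_ENVELOPE, FROM_HEADER };

// Classify a header line as the leading "From" line (no colon), a
// "From:" field, or neither. The comparison ignores case.
FromKind classifyFromLine(const std::string &line) {
    static const char FROM[] = "from";
    const std::size_t len = sizeof(FROM) - 1;
    if (line.size() < len) {
        return NOT_FROM;
    }
    for (std::size_t i = 0; i < len; i++) {
        if (std::tolower(static_cast<unsigned char>(line[i])) != FROM[i]) {
            return NOT_FROM;
        }
    }
    if (line.size() > len && line[len] == ':') {
        return FROM_HEADER;
    }
    return FROM_ENVELOPE;
}
```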

  • Check for an entry from the "from_address" section in the "From:" part of the header. The "from_address" section contains the email addresses (user name, domain, or both user and domain) of people you know. If a "from_address" item is found the email is marked as valid email.

Check the subject line for spam and kill words (e.g., penis, xanax).

Processing of the email header ends when a blank line is found (all email headers must end with a blank line).

When it comes to recognizing "spam_words" and "kill_words" in the subject line, the code below relies on the fact that the "From:" line precedes the "Subject:" line. Because header processing stops at the first line that produces a classification, a trusted "From:" address is accepted before the subject is ever examined. This allows your lover, whose address will presumably be in the from_address part of the SpamFilterParams, to send you e-mail with the word "penis" in the subject, without having the mail discarded if "penis" is in the kill_words list.
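The effect of line order can be seen in a reduced sketch where the first line that yields a classification ends the scan. Verdict, lower, hasPrefix and classifyHeader are hypothetical names, and the sketch handles a single trusted sender and a single kill word (both given in lower case) rather than the full parameter lists.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the MailFilter::classification values used.
enum Verdict { UNKNOWN, EMAIL, GARBAGE };

static std::string lower(std::string s) {
    for (std::size_t i = 0; i < s.size(); i++) {
        s[i] = static_cast<char>(
            std::tolower(static_cast<unsigned char>(s[i])));
    }
    return s;
}

static bool hasPrefix(const std::string &s, const std::string &prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}

// Scan header lines in order; the first decisive line wins.
Verdict classifyHeader(const std::vector<std::string> &headerLines,
                       const std::string &trustedSender,
                       const std::string &killWord) {
    for (std::size_t i = 0; i < headerLines.size(); i++) {
        const std::string line = lower(headerLines[i]);
        if (line.empty()) {
            break;  // a blank line ends the header
        }
        if (hasPrefix(line, "from:") &&
            line.find(trustedSender) != std::string::npos) {
            return EMAIL;
        }
        if (hasPrefix(line, "subject:") &&
            line.find(killWord) != std::string::npos) {
            return GARBAGE;
        }
    }
    return UNKNOWN;
}
```

With the "From:" line first the message is accepted; if a mailer emitted the "Subject:" line first, the same message would be discarded.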

The subject line and other parts of the email header are copied into a HeaderInfo object. This information is used in generating debug trace information, the garbage trace (for discarded email), and error messages.

Many emails (especially those that are MIME formatted) will have a boundary line (which usually follows the "Content-Type:" line). The boundary line has the format

boundary=""

The boundary string is used to demarcate the bounds of the various sections. This string is saved in the class variable boundaryStr.

Header processing may terminate before the header is completely read, since it may be determined at an early point that the email is either valid or SPAM. In these cases the rest of the mail header will be read to initialize the HeaderInfo object.

Definition at line 804 of file MailHeader.C.

References checkAddressSection(), checkFrom(), checkReceived(), checkSubject(), HeaderInfo::date(), fillInSections(), HeaderInfo::from(), HeaderInfo::fromNoColon(), HeaderInfo::klass(), Logger::log(), parseContentType(), HeaderInfo::subject(), and HeaderInfo::to().

Referenced by MailFilter::checkMail().

00805 {
00806   log.log(Logger::DEBUG, "checkHeader", "enter");
00807   MailFilter::classification klass = MailFilter::BAD_VALUE;
00808   if (!feof(fp)) {
00809     klass = MailFilter::UNKNOWN;
00810     char *pBuf;
00811 
00812     // Skip any blank lines which start the e-mail message
00813     while ((pPushBack = fgets(pushBackBuf, sizeof(pushBackBuf), fp)) != 0) {
00814       if (! SpamUtil().isBlankLine( pPushBack )) {
00815         break;
00816       }
00817     } // while
00818 
00819     if (pPushBack != 0) {  // Loop through the e-mail header
00820       size_t line = 1;
00821       const char *RECEIVED = "received";
00822       static const size_t RECEIVED_LEN = strlen(RECEIVED);
00823       const char *SUBJECT = "subject";
00824       static const size_t SUBJECT_LEN = strlen(SUBJECT);
00825       const char *FROM = "from";
00826       static const size_t FROM_LEN = strlen(FROM);
00827       const char *CONTENT_TYPE = "content-type";
00828       const char *TO = "to";
00829       static const size_t TO_LEN = strlen(TO);
00830       const char *CC = "cc";
00831       const char *DATE = "date";
00832       const char *pBound = 0;
00833       const char *pColon = 0;
00834       char buf[1024];
00835       do { // DO
00836         if (pPushBack == 0) {
00837           pushBackBuf[0] = '\0';
00838           pBuf = fgets(buf, sizeof(buf), fp);
00839         }
00840         else {
00841           pBuf = pPushBack;
00842           pPushBack = 0;
00843         }
00844         if (!pBuf) {
00845           break;
00846         }
00847 
00848         if (! SpamUtil().isBlankLine( pBuf )) {
00849           pColon = SpamUtil().findColon( pBuf );
00850           if (SpamUtil().match(pBuf, FROM_LEN, FROM)) {
00851             pColon = pBuf + FROM_LEN;
00852             if (*pColon == ':') {
00853               pColon++;
00854               mHeadInfo.from( pColon );
00855             }
00856             else {
00857               mHeadInfo.fromNoColon( pColon );
00858             }
00859             klass = checkFrom( pColon );
00860           }
00861           else if (pColon != 0) {
00862             if (SpamUtil().match(pBuf, RECEIVED_LEN, RECEIVED)) {
00863               pColon = pBuf + RECEIVED_LEN + 1;
00864               klass = checkReceived( pColon, fp, line );
00865             } else if (SpamUtil().match(pBuf, SUBJECT_LEN, SUBJECT)) {
00866               pColon = pBuf + SUBJECT_LEN + 1;
00867               mHeadInfo.subject(pColon);
00868               klass = checkSubject( pColon, fp );
00869             }
00870             else if (SpamUtil().match(pBuf, pColon, CONTENT_TYPE)) {
00871               pColon++;
00872               klass = parseContentType(fp, pBuf);
00873             }
00874             else if (SpamUtil().match(pBuf, TO_LEN, TO)) {
00875               pColon = pBuf + TO_LEN + 1;
00876               mHeadInfo.to( pColon );
00877               klass = checkAddressSection(pColon, fp);
00878             }
00879             else if (SpamUtil().match(pBuf, pColon, CC)) {
00880               pColon++;
00881               klass = checkAddressSection(pColon, fp);
00882             }
00883             else if (SpamUtil().match(pBuf, pColon, DATE)) {
00884               pColon++;
00885               mHeadInfo.date(pColon);
00886             }
00887           } // has a colon (pColon != 0)
00888         } // is blank line
00889         else {
00890           // found a blank line
00891           break;
00892         }
00893       } while (klass == MailFilter::UNKNOWN && pBuf != 0);
00894 
00895       // if we have not finished on a blank line, fill in any sections that
00896       // have not been encountered yet.
00897       if (pBuf != 0 && ! SpamUtil().isBlankLine( pBuf )) {
00898         fillInSections(fp);
00899       }
00900 
00901       // If the email was not addressed to a known mailing list and
00902       // an address in the SpamFilterParams section my_address is
00903       // not found, then it is classified as SPAM.
00904       if (klass == MailFilter::UNKNOWN && (! foundValidAddress)) {
00905         log.log(Logger::DEBUG, "checkHeader", "Did not find a valid To: or Cc: address");
00906         klass = MailFilter::SUSPECT;
00907       }
00908     } // if pBuf
00909   }
00910 
00911   mHeadInfo.klass( klass );
00912 
00913   char msg[128];
00914   sprintf(msg, "return value = %s", SpamUtil().classificationToStr( klass ));
00915   log.log(Logger::DEBUG, "checkHeader", msg );
00916   log.log(Logger::DEBUG, "checkHeader", "exit");
00917   return klass;
00918 } // checkHeader

MailFilter::classification MailHeader::checkReceived (const char *buf, FILE *fp, size_t line) [private]
 

The received line in the email header spans multiple lines. The end is determined by the next line that contains a colon. Something like "Message-ID:" or "From:" (or, perhaps, another "Received:").

Right now not much is done with this section except to search for the word "forged". If a section is added to the SpamFilterParams file for spammer addresses, then this function could recognize these. Right now it is not clear that this would be very profitable, since spammers move around so much. For use in future checking, the buf pointer points to the character that follows the colon.

One complexity introduced by this function is that it reads the next line to see whether that line has a colon header in it. There is no way to "unget" a line. So as a hack around this there is a "pushBackBuf" in the class (can you say global variable by another name) which contains this line. If the pPushBack pointer points at this buffer (i.e., it is not NULL) then the pushBackBuf line will be used rather than reading a new line from the input file.

Some mailers add a "may be forged" note on one of the received lines. This seems to happen when the address is given via SMTP, rather than in the "To:" line. In this case, the mail should be marked as SPAM (i.e., SUSPECT).
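The scan over a multi-line "Received:" field can be sketched on a vector of lines. ReceivedScan, startsNewField and scanReceived are hypothetical names; the test for "a line that starts a new field" is an assumption standing in for SpamUtil::findColon(), whose exact rule is not shown on this page. Unlike the listing below, the sketch also looks for "forged" on the first line of the field and stops at a blank line.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// True if the line begins with a field name: a run of non-blank
// characters ended by ':' (for example "Message-ID:").
static bool startsNewField(const std::string &line) {
    for (std::size_t i = 0; i < line.size(); i++) {
        if (line[i] == ':') {
            return i > 0;
        }
        if (std::isspace(static_cast<unsigned char>(line[i]))) {
            return false;
        }
    }
    return false;
}

struct ReceivedScan {
    bool forged;       // "forged" was seen somewhere in the field
    std::size_t next;  // index of the first line after the field
};

// Scan the "Received:" field that starts at lines[start].
ReceivedScan scanReceived(const std::vector<std::string> &lines,
                          std::size_t start) {
    ReceivedScan scan = { false, start };
    while (scan.next < lines.size()) {
        const std::string &line = lines[scan.next];
        const bool firstLine = (scan.next == start);
        if (!firstLine && (line.empty() || startsNewField(line))) {
            break;  // this line belongs to the next field
        }
        if (line.find("forged") != std::string::npos) {
            scan.forged = true;
        }
        scan.next++;
    }
    return scan;
}
```

Returning the index of the next unread line serves the same purpose as the push-back buffer: the caller resumes with the line that ended the field.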

The "line" argument is for debugging.

Definition at line 416 of file MailHeader.C.

References Logger::log().

Referenced by checkHeader().

00419 {
00420   log.log(Logger::DEBUG, "checkReceived", "enter");
00421   MailFilter::classification klass = MailFilter::UNKNOWN;
00422 
00423   const char *pColon = 0;
00424   do {
00425     pPushBack = fgets(pushBackBuf, sizeof(pushBackBuf), fp);
00426     if (pPushBack != 0) {
00427       if ((klass == MailFilter::UNKNOWN) && (strstr(pPushBack, "forged") != 0)) {
00428         klass = MailFilter::SUSPECT;
00429       }
00430       pColon = SpamUtil().findColon( pPushBack );
00431     }
00432   } while (pPushBack != 0 && !pColon);
00433 
00434   log.log(Logger::DEBUG, "checkReceived", "exit");
00435   return klass;
00436 } // checkReceived

MailFilter::classification MailHeader::checkSubject (const char *buf, FILE *fp) [private]
 

Check the email header subject line for spam or kill words or phrases.

Apparently some emails may have multi-line subjects. So after the subject line is found, we check to see if the next line has a colon (something like "Reply-To:" for example) or is blank (indicating an end to the header). In both cases we "push back" the line. If the line has no colon and is not blank, we skip it (since this is the subject line continuing on the next line).

The subject line should not continue on more than one line after the "Subject:" line or something is really wrong with the email format.

Definition at line 68 of file MailHeader.C.

References Logger::log(), and HeaderInfo::reason().

Referenced by checkHeader().

00069 {
00070   log.log(Logger::DEBUG, "checkSubject", "enter");
00071 
00072   char msg[256];
00073   char subject[256];
00074   char foundStr[128];
00075 
00076   // convert to lower case
00077   SpamUtil().toLower(subject, buf, sizeof(subject));
00078 
00079   foundStr[0] = '\0';
00080   MailFilter::classification klass = SpamUtil().checkLine(subject, 
00081                                                           mParams, 
00082                                                           foundStr, 
00083                                                           sizeof(foundStr));
00084   
00085   if (klass == MailFilter::SUSPECT || klass == MailFilter::GARBAGE) {
00086     if (klass == MailFilter::SUSPECT) {
00087       sprintf(msg, "Found \"spam\" word \"%s\", email marked as SUSPECT", 
00088               foundStr );
00089     }
00090     else if (klass == MailFilter::GARBAGE) {
00091       mHeadInfo.reason( foundStr );
00092       sprintf(msg, "Found \"kill\" word \"%s\", email marked as GARBAGE", 
00093               foundStr );
00094     }
00095     log.log(Logger::DEBUG, "checkSubject", msg );
00096   }
00097 
00098   pPushBack = 0;
00099   char *pBuf;
00100   if ((pBuf = fgets(pushBackBuf, sizeof(pushBackBuf), fp)) != 0) {
00101     if (SpamUtil().isBlankLine(pBuf) || SpamUtil().findColon(pBuf) != 0) {
00102       pPushBack = pBuf;
00103     }
00104   }
00105   
00106   log.log(Logger::DEBUG, "checkSubject", "exit");
00107   return klass;
00108 } // checkSubject

void MailHeader::fillInSections (FILE *fp) [private]
 

Fill in the email header in the mHeadInfo class variable.

The mHeadInfo object is used to encapsulate header information that is used in generating debug log messages and in generating the garbage trace (if the trace_garbage flag is set).

Processing the email stops as soon as it can be determined that the email is valid, suspect or garbage. In some cases (for example an invalid domain address) the complete header will not have been processed and some fields in mHeadInfo have not been filled in. This function is called to read the rest of the header and fill in these fields.
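A reduced version of this fill-in pass is sketched below. HeaderSummary, takeField and summarizeHeader are hypothetical names standing in for HeaderInfo and the member function; the sketch works on a vector of lines and trims leading blanks from each value, which the listing below does not do.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the fields kept in HeaderInfo.
struct HeaderSummary {
    std::string to;
    std::string from;
    std::string subject;
    std::string date;
};

// If line starts with the lower-case field name (compared without regard
// to case), store the rest of the line, minus leading blanks, in value.
static bool takeField(const std::string &line, const std::string &name,
                      std::string &value) {
    if (line.size() < name.size()) {
        return false;
    }
    for (std::size_t i = 0; i < name.size(); i++) {
        if (std::tolower(static_cast<unsigned char>(line[i])) != name[i]) {
            return false;
        }
    }
    std::size_t start = name.size();
    while (start < line.size() &&
           (line[start] == ' ' || line[start] == '\t')) {
        start++;
    }
    value = line.substr(start);
    return true;
}

// Read header lines up to the first blank line and record the fields.
HeaderSummary summarizeHeader(const std::vector<std::string> &lines) {
    HeaderSummary info;
    for (std::size_t i = 0; i < lines.size(); i++) {
        const std::string &line = lines[i];
        if (line.empty()) {
            break;  // a blank line ends the header
        }
        if (takeField(line, "to:", info.to)) continue;
        if (takeField(line, "from:", info.from)) continue;
        if (takeField(line, "subject:", info.subject)) continue;
        takeField(line, "date:", info.date);
    }
    return info;
}
```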

Definition at line 659 of file MailHeader.C.

References HeaderInfo::date(), HeaderInfo::from(), Logger::log(), HeaderInfo::subject(), and HeaderInfo::to().

Referenced by checkHeader().

00660 {
00661   static const char *TO = "to:";
00662   static const size_t TO_LEN = strlen( TO );
00663   static const char *FROM = "from:";
00664   static const size_t FROM_LEN = strlen( FROM );
00665   static const char *SUBJECT = "subject:";
00666   static const size_t SUBJECT_LEN = strlen( SUBJECT );
00667   static const char *DATE = "date:";
00668   static const size_t DATE_LEN = strlen( DATE );
00669   char buf[1024];
00670   char *pBuf = 0;
00671   size_t bufSize = 0;
00672 
00673   log.log(Logger::DEBUG, "fillInSections", "enter");
00674 
00675   do {
00676     if (pPushBack == 0) {
00677       pushBackBuf[0] = '\0';
00678       bufSize = sizeof(buf);
00679       pBuf = fgets(buf, bufSize, fp);
00680     }
00681     else {
00682       pBuf = pPushBack;
00683       bufSize = sizeof( pushBackBuf );
00684       pPushBack = 0;
00685     }
00686     if (pBuf != 0) {
00687       if (! SpamUtil().isBlankLine( pBuf )) {
00688         char *pCopy = 0;
00689         if (SpamUtil().match(pBuf, TO_LEN, TO)) {
00690           pCopy = pBuf + TO_LEN;
00691           mHeadInfo.to( pCopy );
00692         }
00693         else if (SpamUtil().match(pBuf, FROM_LEN, FROM)) {
00694           pCopy = pBuf + FROM_LEN;
00695           mHeadInfo.from( pCopy );
00696         }
00697         else if (SpamUtil().match(pBuf, SUBJECT_LEN, SUBJECT)) {
00698           pCopy = pBuf + SUBJECT_LEN;
00699           mHeadInfo.subject( pCopy );
00700         }
00701         else if (SpamUtil().match(pBuf, DATE_LEN, DATE)) {
00702           pCopy = pBuf + DATE_LEN;
00703           mHeadInfo.date( pCopy );
00704         }
00705       }
00706       else {
00707         // found a blank line which follows the mail header
00708         break;
00709       }
00710     }
00711   } while (pBuf != 0);
00712 
00713   if (pBuf == 0) {
00714     log.log(Logger::DEBUG, "fillInSections", "end-of-file reached");
00715   }
00716   log.log(Logger::DEBUG, "fillInSections", "exit");
00717 } // fillInSections

MailFilter::classification MailHeader::parseContentType (FILE *fp, char *contentBuf)
 

Return the content type for the email.

Many emails (especially those which are MIME encoded, but others as well) include a Content-Type section whose format is:

"Content-Type:" type

where examples of type include "text/html", "text/plain", "multipart/alternative", "multipart/mixed", and "image/jpg".

This spam filter marks all email that _starts_ with an HTML section, instead of a text section, as SUSPECT, which will result in placing the email in the junk_mail file.

This spam filter also attempts to identify email with base64 encoded sections. If the "kill_base64" flag is set, email with base64 encoded sections will be discarded. This tends to weed out viruses and spam that attempts to hide behind the base64 encoding. If the "kill_base64" flag is not set, the email with base64 encoded data will be placed in the junk_mail file.
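The mapping from content type to classification can be written as a small function. ContentKind, Verdict and classifyContent are hypothetical stand-ins for SpamUtil::contentType, MailFilter::classification and the tail end of parseContentType(); the treatment of image and audio content follows the listing below rather than the description.

```cpp
// Hypothetical stand-ins for SpamUtil::contentType and
// MailFilter::classification, reduced to the values used here.
enum ContentKind {
    KIND_TEXT, KIND_HTML, KIND_IMAGE, KIND_AUDIO, KIND_BASE64, KIND_OTHER
};
enum Verdict { UNKNOWN, SUSPECT, GARBAGE };

// killBase64 corresponds to the "kill_base64" flag.
Verdict classifyContent(ContentKind kind, bool killBase64) {
    switch (kind) {
    case KIND_BASE64:
        // Discard when the flag is set, otherwise send to junk mail.
        return killBase64 ? GARBAGE : SUSPECT;
    case KIND_HTML:
    case KIND_IMAGE:
    case KIND_AUDIO:
        return SUSPECT;
    default:
        return UNKNOWN;  // plain text and anything unrecognized
    }
}
```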

The Content-Type section may be followed by a charset or boundary definition. The boundary definition is discussed below.

The charset definition may be on the same line as the Content-Type definition (separated by a semicolon) or it may be on the following line. The case that causes difficulty is the one where it is on the following line:

Content-Type: text/html;
    charset="iso-8859-1"
Content-Transfer-Encoding: base64

In this case the charset line is skipped, setting up processing for the Content-Transfer-Encoding line, which in this case is base64. This avoids having an email marked as HTML when we really want to classify it as base64 encoded.

Multipart email is separated by boundary strings. This allows email programs (and this spam filter) to skip to a section by looking for the boundary string.

A boundary string definition may follow the Content-Type. This may be either on the same line, separated by a semicolon:

Content-Type: multipart/mixed;boundary="--SpammersAreScum--"

or on the following line:

Content-Type: multipart/mixed;
    boundary="--SpammersAreScum--"

The boundary definition is saved in a class variable.

Note that the boundary definition is not saved if the Content-Type is text. This is because the following sometimes appears:

Content-Type: text/plain; boundary="--09064944530622531466"

Here, even though a boundary section is defined, it is unused because the email Content-Type is text.

Definition at line 561 of file MailHeader.C.

References SpamParameters::hasFlag(), Logger::log(), HeaderInfo::reason(), and saveBoundary().

Referenced by checkHeader().

00563 {
00564 
00565 
00566   const char *BOUNDARY = "boundary";
00567   static size_t BOUNDARY_LEN = strlen( BOUNDARY );
00568   const char *CONTENT_ENCODE = "Content-Transfer-Encoding";
00569   static size_t CONTENT_ENCODE_LEN = strlen(CONTENT_ENCODE);
00570 
00571   log.log(Logger::DEBUG, "parseContentType", "enter");
00572 
00573   MailFilter::classification klass = MailFilter::UNKNOWN;
00574 
00575   SpamUtil::contentType type = SpamUtil().classifySection( contentBuf );
00576 
00577   {
00578     char *pBound = strstr(contentBuf, BOUNDARY);
00579     if (pBound && type != SpamUtil::TEXT) {
00580       saveBoundary( pBound + BOUNDARY_LEN );
00581     }
00582   }
00583   
00584   pPushBack = 0;
00585   char *pBuf;
00586   if ((pBuf = fgets(pushBackBuf, sizeof(pushBackBuf), fp)) != 0) {
00587     // If the line that follows the Content-Type does not have a colon
00588     // (e.g., findColon() does not return a pointer) and it is not a 
00589     // blank line, then it may be a boundary definition or a charset
00590     // definition.  If it is a boundary we want to pick it up.  Otherwise
00591     // we want to skip it.
00592     if (SpamUtil().isBlankLine(pBuf) || SpamUtil().findColon(pBuf) != 0) {
00593       pPushBack = pBuf;
00594     }
00595     else {
00596       char *pBound = strstr(pBuf, BOUNDARY);
00597       if (pBound && type != SpamUtil::TEXT) {
00598         saveBoundary( pBound + BOUNDARY_LEN );
00599       }
00600       // get the next line
00601       pBuf = fgets(pushBackBuf, sizeof(pushBackBuf), fp);
00602       pPushBack = pBuf;
00603     }
00604 
00605     // Check for the Content-Transfer-Encoding line and see if it is base64
00606     char *pEncode;
00607     if (pBuf != 0 && (pEncode = strstr(pBuf, CONTENT_ENCODE)) != 0) {
00608       pPushBack = 0;
00609       if (strstr(pEncode + CONTENT_ENCODE_LEN, "base64")) {
00610         type = SpamUtil::BASE64;
00611       }
00612     }
00613 
00614   }
00615 
00616   if (type == SpamUtil::BASE64) {
00617     if (mParams.hasFlag("kill_base64")) {
00618       mHeadInfo.reason("found base64 encoded information");
00619       klass = MailFilter::GARBAGE;
00620     }
00621     else {
00622       klass = MailFilter::SUSPECT;
00623     }
00624   }
00625   else if (type == SpamUtil::HTML) {
00626       klass = MailFilter::SUSPECT;
00627   }
00628   else if (type == SpamUtil::IMAGE || type == SpamUtil::AUDIO) {
00629       klass = MailFilter::SUSPECT;
00630   }
00631 
00632   const char *typeName;
00633   typeName = SpamUtil().typeToStr( type );
00634 
00635   char msg[128];
00636   sprintf(msg, "mail type = %s", typeName );
00637   log.log(Logger::DEBUG, "parseContentType", msg );
00638 
00639   log.log(Logger::DEBUG, "parseContentType", "exit");
00640   return klass;
00641 } // parseContentType

void MailHeader::saveBoundary (const char *pBound) [private]
 

Save the boundary string which may follow the "Content-Type:" line. The format for the boundary definition is

boundary=""

The pBound argument should point to the '=' character.

There are times when the boundary string does not start with a quote. There are also times when there is white space between the "=" and the quote.

The boundary string is used in processing the mail body to move between the various sections of an email.
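The parsing rules above (an "=" that may be followed by white space, an optional opening quote, and a value that ends at a quote or at white space) can be sketched with std::string. The function name extractBoundary is hypothetical, and unlike saveBoundary() the sketch locates the word "boundary" itself and returns the value instead of storing it in a fixed-size buffer.

```cpp
#include <cctype>
#include <string>

// Return the boundary value found on a header line, or "" if the line
// holds no well-formed boundary definition.
std::string extractBoundary(const std::string &line) {
    static const std::string KEY = "boundary";
    std::string::size_type pos = line.find(KEY);
    if (pos == std::string::npos) {
        return "";
    }
    pos += KEY.size();
    if (pos >= line.size() || line[pos] != '=') {
        return "";  // '=' expected
    }
    pos++;
    // Skip any white space between the '=' and the quote.
    while (pos < line.size() &&
           std::isspace(static_cast<unsigned char>(line[pos]))) {
        pos++;
    }
    // The opening quote is optional.
    if (pos < line.size() && line[pos] == '"') {
        pos++;
    }
    // The value runs up to the closing quote or the next white space.
    std::string::size_type end = pos;
    while (end < line.size() && line[end] != '"' &&
           !std::isspace(static_cast<unsigned char>(line[end]))) {
        end++;
    }
    return line.substr(pos, end - pos);
}
```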

Definition at line 456 of file MailHeader.C.

References Logger::log().

Referenced by parseContentType().

00457 {
00458   log.log(Logger::DEBUG, "saveBoundary", "enter");
00459 
00460   if (*pBound == '=') {
00461     pBound++;
00462     // skip any white space between the "=" and the quote
00463     pBound = SpamUtil().skipWhiteSpace( pBound );
00464     if (*pBound == '"') {
00465       pBound++;
00466     }
00467     const size_t len = sizeof(boundaryStr) - 1;
00468     size_t ix = 0;
00469     while (*pBound && 
00470            ix < len && 
00471            (! isspace(*pBound)) &&
00472            *pBound != '"') {
00473       boundaryStr[ix] = *pBound;
00474       pBound++;
00475       ix++;
00476     }
00477     boundaryStr[ix] = '\0';
00478 
00479     if (ix > 0) {
00480       char msg[128];
00481       snprintf(msg, sizeof(msg), "boundary str. = \"%s\"", boundaryStr );
00482       log.log(Logger::DEBUG, "saveBoundary", msg);
00483     }
00484   }
00485   else {
00486     log.log(Logger::ERROR, "saveBoundary", "'=' expected");
00487   }
00488 
00489   log.log(Logger::DEBUG, "saveBoundary", "exit");
00490 } // saveBoundary


The documentation for this class was generated from the following files:
Generated on Sat Mar 27 13:07:38 2004 for Mail Filter by doxygen 1.3.3