I thought I’d give some insight into how we fixed this emailing issue and what had been causing it.
At WBB, we do not send mail directly when the forum needs to send an email. Instead, the email is serialized and saved into a “buffer” database. A script then runs in the background, going through the “buffer” database and sending out the emails.
Now that you know the basics of how our email system works, let’s look at why our members were receiving duplicate emails. Our email script is simple: it reads the latest 20 emails from the buffer database and sends them out through the mail() function. However, this design has one flaw. If two instances of the email script run at the same time, one instance may read the same emails that the other instance has just read, and each of those emails gets mailed twice.
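To make the overlap concrete, here is a rough Python sketch of two instances reading the same batch. It is not our actual script: the in-memory SQLite table, the `email_buffer` name and the assumption that a run deletes each email after sending it are all made up for illustration.

```python
import sqlite3

BATCH_SIZE = 20  # the post says each run reads the latest 20 emails


def make_buffer(count):
    """Build an in-memory stand-in for the "buffer" database (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE email_buffer (id INTEGER PRIMARY KEY, body TEXT)")
    db.executemany(
        "INSERT INTO email_buffer (body) VALUES (?)",
        [(f"message {i}",) for i in range(count)],
    )
    db.commit()
    return db


def read_batch(db):
    """First half of a run: read the latest batch of queued email ids."""
    rows = db.execute(
        "SELECT id FROM email_buffer ORDER BY id DESC LIMIT ?", (BATCH_SIZE,)
    ).fetchall()
    return [row[0] for row in rows]


def send_and_delete(db, ids, sent_log):
    """Second half of a run: "send" each email, then drop it from the buffer.

    Deleting after sending is an assumption; the real script may differ.
    """
    for email_id in ids:
        sent_log.append(email_id)  # stand-in for the real mail() call
        db.execute("DELETE FROM email_buffer WHERE id = ?", (email_id,))
    db.commit()


def simulate_overlap():
    """Interleave two runs so that both read before either one deletes."""
    db = make_buffer(30)
    sent_log = []
    batch_a = read_batch(db)  # instance A reads its batch
    batch_b = read_batch(db)  # instance B reads before A has deleted anything
    send_and_delete(db, batch_a, sent_log)
    send_and_delete(db, batch_b, sent_log)
    return len(sent_log) - len(set(sent_log))  # number of duplicate sends
```

Calling `simulate_overlap()` counts how many emails were sent more than once. In this interleaving both instances pick up the same 20 rows, so every email in the batch goes out twice.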
So far, our Admins have not designed this script to use a locking mechanism. A locking mechanism basically allows only one instance to run at a time. Only when the running instance has finished executing does it release the lock and allow another instance to run.
This may sound like a good idea, but it becomes a nightmare to handle when the script fails before the lock is released. Because of this, our Admins have relied on a simple balance between the average execution time, the number of emails batched up per instance, and the interval before the next instance runs. This has one major flaw: the balance is fragile, and if anything unforeseen upsets it, we get the duplication issue.
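For anyone curious what a locking mechanism could look like, here is a minimal Python sketch using an advisory file lock. It is not something we run: the lock file path and the function names are made up, and `fcntl` only exists on Unix-like systems.

```python
import fcntl
import os
import tempfile

# Hypothetical lock file location, chosen only for this sketch.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "email_script.lock")


def try_acquire_lock(path=LOCK_PATH):
    """Return an open file holding an exclusive lock, or None if it is taken."""
    handle = open(path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting for the lock.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def release_lock(handle):
    """Release the lock and close the lock file."""
    fcntl.flock(handle, fcntl.LOCK_UN)
    handle.close()


def run_once(send_batch):
    """Run send_batch only if no other instance holds the lock."""
    lock = try_acquire_lock()
    if lock is None:
        return False  # another instance is running, so skip this run
    try:
        send_batch()
        return True
    finally:
        release_lock(lock)
```

One property of this particular kind of lock is that the operating system drops it when the process holding it exits, even after a crash. That may soften the stale-lock worry, but it does nothing for a script that hangs without exiting.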
So, what caused our latest problem? After probing around, we found two issues.
One, our /var/spool/clientmqueue folder contained around 900MB of files. Considering these files are small (about 2KB each), we’re looking at about 460,800 files. This may not be the per-folder file limit of ext3, but it is surely the limit for sendmail. FYI, sendmail is the program that we use to send email.
[root@srv6 clientmqueue]# du --max-depth=1 -h
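The file count above is just a back-of-the-envelope division, which can be reproduced in a few lines of Python. The function name is made up, and the figure is only an estimate, because du reports disk usage rather than an exact count of files.

```python
def estimate_file_count(total_mb, average_file_kb):
    """Estimate how many files fit in total_mb if each is about average_file_kb."""
    total_kb = total_mb * 1024  # 1MB = 1024KB
    return total_kb // average_file_kb


# 900MB of files at roughly 2KB each
estimated_files = estimate_file_count(900, 2)
```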
This leads to the question: why did the folder become so big? The reason is that /var/spool/clientmqueue is a folder designed for sendmail to queue files before they’re sent. An email sent using sendmail is first copied to /var/spool/clientmqueue, and whatever requested the email assumes that it was sent successfully. In many cases, it does get sent. In the second step, a sendmail daemon listens for new files in that folder and copies them to /var/spool/mqueue, from where the server actually sends them out. In essence, these folders act as temporary buffers to improve performance, similar to our “buffer” database.
To prune the /var/spool/clientmqueue folder, we cannot use the conventional “rm” command. Instead, we use “find . -type f -delete”, which we found to be the fastest of all the solutions for deleting a large number of files.
[root@srv6 clientmqueue]# find . -type f -delete
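For comparison, here is a rough Python sketch of what that find command does: walk the folder and delete regular files one at a time, without ever building one giant argument list. We used find itself, not this. The function name is made up, and I have not timed it against find.

```python
import os


def delete_regular_files(directory):
    """Delete every regular file under directory, keeping the folders themselves.

    Roughly mirrors `find . -type f -delete`: symlinks and directories are left alone.
    Returns the number of files removed.
    """
    deleted = 0
    subdirectories = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                os.unlink(entry.path)
                deleted += 1
            elif entry.is_dir(follow_symlinks=False):
                subdirectories.append(entry.path)
    # Recurse only after the directory handle above has been closed.
    for subdirectory in subdirectories:
        deleted += delete_regular_files(subdirectory)
    return deleted
```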
Now comes our second problem and our second question. We are not in fact supposed to be using sendmail to send emails. sendmail is supposed to be replaced with something superior called Exim. We trick the OS by deleting the sendmail binary and replacing it with a symbolic link (a shortcut) to Exim. Naturally, the Exim installation should have removed sendmail automatically and put the symbolic link in its place. It turns out this was not the case. Sadly, this meant that all this time we were never actually using Exim but sendmail.
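A check like the following Python sketch would have caught this much earlier. The /usr/sbin/sendmail default is an assumption about where the binary lives (adjust it for your server), and the function names are made up.

```python
import os

# Assumed location of the sendmail binary; it may differ on your system.
DEFAULT_SENDMAIL_PATH = "/usr/sbin/sendmail"


def symlink_target(path=DEFAULT_SENDMAIL_PATH):
    """Return the resolved target if path is a symbolic link, otherwise None."""
    if os.path.islink(path):
        return os.path.realpath(path)
    return None


def points_to_exim(path=DEFAULT_SENDMAIL_PATH):
    """True if path is a symlink whose final target has 'exim' in its file name."""
    target = symlink_target(path)
    if target is None:
        return False  # a real binary (or nothing at all), not a shortcut
    return "exim" in os.path.basename(target).lower()
```

If `points_to_exim()` returns False, the sendmail path is either the real sendmail binary or a link to something other than Exim.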
So after all of this reading, the solution to this email problem was to create a shortcut from sendmail to Exim… Anyone want to buy me a beer?