Server silently fails on messages with a huge To: header; any ideas?

Our incoming relay (sendmail) occasionally receives messages which were sent to many recipients
(sometimes it is spam, sometimes valid mailing lists to which our users have subscribed). The messages
in question have a To: header which is typically over 6 KB in size and over 80 lines long (and since
several recipients with short names/addresses may be grouped on one line, about a hundred
recipients are listed).
The relay fails when trying to pass these messages on to our backend Sun Messaging Server (6.3-6.03 x64),
and it fails silently. I am not certain whether the flaw is in SMS or in Sendmail, but perhaps someone
can shed light on the matter? :)
SMS's mail.log_current gets entries like these (here xxx.xxx.xxx.100 is the relay and xxx.xxx.xxx.73
is the backend server):
04-Dec-2008 16:54:44.62 tcp_local    +            O TCP|xxx.xxx.xxx.73|25|xxx.xxx.xxx.100|33728 SMTP
04-Dec-2008 16:59:44.62 tcp_intranet ims-ms       VE 0 [email protected] rfc822;[email protected] ouruser@ims-ms-daemon relay.domain.ru ([xxx.xxx.xxx.100]) '' Timeout after 5 minutes trying to read SMTP packet
04-Dec-2008 16:59:44.62 tcp_local    +            C TCP|xxx.xxx.xxx.73|25|xxx.xxx.xxx.100|33728 SMTP Timeout after 5 minutes trying to read SMTP packet
Sendmail logs a broken connection:
Dec  4 17:01:27 relay sendmail[14689]: [ID 801593 mail.crit] mB47gCN4014672: SYSERR(root): timeout writing message to sunmail.domain.ru.: Broken pipe
Dec  4 17:01:27 relay sendmail[14689]: [ID 801593 mail.info] mB47gCN4014672: to=<[email protected]>, delay=00:07:01, xdelay=00:06:58, mailer=esmtp, pri=329059, relay=sunmail.domain.ru. [xxx.xxx.xxx.73], dsn=4.0.0, stat=Deferred
Sniffing the wire gives strange results: the SMTP dialog part seems okay, and the message is submitted
(relayed) only for our local user's address. But the message data is not transferred until sendmail dies.
When the sendmail process dies (due to the timeout or a manual kill), about three packets appear in the
sniffer's output, starting with the usual "Received: from" lines and other header parts. The last packet
contains text from the middle of the To: header, often broken mid-word. Perhaps it is some buffering error
in either the sending Sendmail or the receiving Sunmail, or some TCP/networking glitch on the server, or a sniffer artifact.
If I manually edit the queue file (/var/spool/mqueue/qfmB47gCN4014672 for the sample above) and delete
most of the To: header's lines, the message goes through okay.
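For reference, that manual test looks roughly like this (a sketch; /usr/lib/sendmail and /var/spool/mqueue are the Solaris defaults, and -qI restricts the queue run to this single queue ID):

# trim most of the To: continuation lines in the queue control file
# (make sure no queue runner is holding the message while editing)
vi /var/spool/mqueue/qfmB47gCN4014672
# then re-attempt delivery of just this message, verbosely
/usr/lib/sendmail -v -qImB47gCN4014672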
This just does not seem logical: the message header appears to be compliant (each physical line is short,
even though the folded To: lines concatenate into a rather large text, and not an extremely large one at that).
Neither Sendmail nor Sun mail reports any error other than the network socket failure.
MTUs are the same on both servers (1500), and any other large message (e.g. one with attachments)
relays okay.
Are there any known issues with Sun Messaging Server (or Sendmail, for that matter) which look like
this and ring a bell for a casual reader? :) Perhaps Sieve filters, etc.?
Since sendmail successfully receives this message from the internet, and none of our several
incoming milters break along the way, I don't think it should have a huge problem forwarding it to
another server (I'll try experimenting, though). This is why I think it's possible that Sun mail is
at fault.
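If it helps, the experiment I have in mind is roughly this (a sketch: take sendmail out of the picture and speak SMTP to sunmail by hand; mconnect(1) is the stock Solaris helper, and plain "telnet sunmail.domain.ru 25" works just as well):

# paste the offending headers straight to sunmail and see whether DATA hangs
mconnect sunmail.domain.ru
HELO relay.domain.ru
MAIL FROM:<[email protected]>
RCPT TO:<[email protected]>
DATA
[paste the saved message headers and body here, end with a line containing only "."]
QUIT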
# imsimta version
Sun Java(tm) System Messaging Server 6.3-6.03 (built Mar 14 2008; 64bit)
libimta.so 6.3-6.03 (built 17:15:08, Mar 14 2008; 64bit)
SunOS sunmail 5.10 Generic_127112-07 i86pc i386 i86pc

Hello all, thanks for your suggestions.
In short, I debugged following Shane's suggestions. Apparently tcp_smtp_server did not receive
a byte for 5 minutes, so the read() call was blocked. At least there is no specific failing routine
in Sunmail, so I'm back to researching Sendmail, networking, buffering and so on.
As I mentioned, when the relay's sendmail process is killed, the system spits out about three
packets of header data onto the network...
Details follow...
By "silently failing" i meant that no obvious SMTP error is issued. The connection hangs
until it's aborted and both servers only complain on that - a failed network connection.
The resulting problem is that the sendmail relay marks sunmail as "Deferring connections"
in its hoststatus table, and valid messages are not even attempted for submission. At the
moment we fixed that brutally but effectively - by removing the hoststatus file for our sunmail
via cron every minute.
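For the record, the cron workaround is roughly this (a sketch; the path assumes sendmail's default HostStatusDirectory of .hoststat under /var/spool/mqueue, so check sendmail.cf, and "purgestat" / "sendmail -bH" would flush the whole persistent host status database instead):

# root crontab entry on the relay: drop the cached host status for sunmail every minute
* * * * * find /var/spool/mqueue/.hoststat -name '*sunmail*' -exec rm -f {} \; 2>/dev/null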
Concerning Mark's post: these servers are in the same DMZ, on a Cisco 2960G switch
which has caused no specific problems. I mentioned that the MTUs are the same and standard
because a few weeks back we did have LDAP replication problems due to experiments
with jumbo frames, but we solved them internally (I posted about this in the DSEE forum, also
asking how to compare LDAPs: [http://forums.sun.com/thread.jspa?threadID=5349017]).
We have been using this relay/backend tandem for half a year now (and before we deployed
Sun Messaging Server, this sendmail relayed mail to our old server for many years).
So far this (a large To: header) is the only type of message I have seen cause such behavior;
for any other large mail the size does not matter, or at least one of the SMTP engines generates
some explanation for the rejection.
Shane, thanks for your help over and over ;)
I tried enabling the options you mentioned, ran "imsimta cnbuild" and reloaded the services.
Then I fired up the sniffer on the relay server, "tail -f mail.log_current" on the sunmail, and
submitted a "bad message" from the Sendmail queue.
In the sniffer the SMTP dialog went ok until submission of message data, where it hung as
before:
# ngrep "" tcp port  25 and host sunmail
T xxx.xxx.xxx.73:25 -> xxx.xxx.xxx.100:53200 [AP]
  220 sunmail.domain.ru -- Server ESMTP (Sun Java(tm) System Messaging Server 6.
  3-6.03 (built Mar 14 2008; 64bit))..                                      
T xxx.xxx.xxx.100:53200 -> xxx.xxx.xxx.73:25 [AP]
  EHLO relay.domain.ru..                                                         
T xxx.xxx.xxx.73:25 -> xxx.xxx.xxx.100:53200 [AP]
  250-sunmail.domain.ru..250-8BITMIME..250-PIPELINING..250-CHUNKING..250-DSN..25
  0-ENHANCEDSTATUSCODES..250-EXPN..250-HELP..250-XADR..250-XSTA..250-XCIR..25
  0-XGEN..250-XLOOP 4A70E733A15FFE33EF3564BD522B1348..250-STARTTLS..250-ETRN.
  .250-NO-SOLICITING..250 SIZE 20992000..                                   
T xxx.xxx.xxx.100:53200 -> xxx.xxx.xxx.73:25 [AP]
  MAIL From:<[email protected]> SIZE=200312..                                    
T xxx.xxx.xxx.73:25 -> xxx.xxx.xxx.100:53200 [AP]
  250 2.5.0 Address and options OK...                                       
T xxx.xxx.xxx.100:53200 -> xxx.xxx.xxx.73:25 [AP]
  RCPT To:<[email protected]> NOTIFY=SUCCESS,FAILURE,DELAY..DATA..                
T xxx.xxx.xxx.73:25 -> xxx.xxx.xxx.100:53200 [AP]
  250 2.1.5 [email protected] and options OK...                                   
T xxx.xxx.xxx.73:25 -> xxx.xxx.xxx.100:53200 [AP]
  354 Enter mail, end with a single "."...                                  
#
In mail.log_current, just one line appeared:
05-Dec-2008 10:51:18.46 tcp_local    +            O TCP|xxx.xxx.xxx.73|25|xxx.xxx.xxx.100|53200 SMTP
Since it also mentions the tcp_local channel, I decided to enable slave_debug on that as well.
I rebuilt the configs and ran stop-msg to see whether the processes actually died. When I checked
the "netstat -an | grep -w 25" and "ps -ef" outputs, there was indeed still a tcp_smtp_server
process running:
mailsrv 23594   656   0 10:50:08 ?           0:00 /opt/SUNWmsgsr/messaging64/lib/tcp_smtp_server
Both sunmail and the sendmail relay kept the socket ESTABLISHED. I took a pstack
of the tcp_smtp_server (below) and killed it with SIGSEGV, so I have a core dump if
needed.
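In case it is useful, the capture was along these lines (a sketch; gcore(1) is the gentler alternative that snapshots a core without terminating the process, kill -SEGV is what I actually used, and 23594 is the stuck tcp_smtp_server above):

# save the stack and a core image of the hung tcp_smtp_server
pstack 23594 > /var/tmp/tcp_smtp_server.pstack
gcore -o /var/tmp/tcp_smtp_server 23594   # writes /var/tmp/tcp_smtp_server.23594
kill -SEGV 23594                          # forces the process to dump core and exit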
Then I started the services and submitted the message from the queue again.
The SMTP dialog log was actually from tcp_local, and it ended with lines like these
(note that even in this detailed log it just died with "network read failed" after 5 minutes;
I inserted an empty line to make that more visible):
11:21:18.26: Good address count 1 defer count 0
11:21:18.26: Copy estimate after address addition is 2
11:21:18.26: mmc_rrply: Return detailed status information.
11:21:18.26: mmc_rrply: Returning
11:21:18.26: Sending    : "250 2.1.5 [email protected] and options OK."
11:21:18.26: Received   : "DATA"
11:21:18.26: mmc_waend(0x00749cc0) called.
11:21:18.26:   Copy estimate is 2
11:21:18.26:   Queue area size 35152252, temp area size 2785988
11:21:18.26:   8788063 blocks of effective free queue space available; setting disk limit accordingly.
11:21:18.26:   1392994 blocks of free temporary space available; setting disk limit accordingly.
11:21:18.26: Sending    : "354 Enter mail, end with a single "."."

11:26:18.27: os_smtp_read: [9] network read failed with error 145
11:26:18.27:     Error: Connection timed out
11:26:18.27:   Generating V records for all addresses on channel ims-ms                          .
11:26:18.27: mmc_flatten_address: Flattening address tree into a list.
11:26:18.27:   Tree prior to flattening:
11:26:18.27: Level/Node/Left/Right Address
11:26:18.27: 0/0x0072ea30/0x00000000/0x00866050
11:26:18.27: 1/0x00866050/0x00751ef8/0x00751ef8 ouruser@ims-ms-daemon
11:26:18.27: Zero address: 0x00751ef8
11:26:18.27: smtpc_enqueue returning a status of 137 (Timeout)
11:26:18.27: SMTP routine failure from SMTPC_ENQUEUE
11:26:18.27: pmt_close: [9] status 0
Apparently, tcp_smtp_server did not receive a byte for 5 minutes, so a read() call was blocked,
and perhaps this is what prevented stop-msg from killing this process...
At least there is no specific failing routine in Sunmail, so I'm back to researching Sendmail,
networking, buffering and so on. As I mentioned, when the relay's sendmail
process is killed, the system spits out about three packets of header data onto the network...
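The next thing I plan to check on the buffering theory, as a sketch (Solaris netstat on both boxes; the interesting columns are Send-Q on the relay and Recv-Q on sunmail for the hung connection, and the grep patterns below just reuse the masked addresses from the logs):

# on the relay: is sendmail's header data sitting unsent in the socket's Send-Q?
netstat -an -f inet -P tcp | grep 'xxx.xxx.xxx.73.25'
# on sunmail: has the data arrived but been left unread in Recv-Q by tcp_smtp_server?
netstat -an -f inet -P tcp | grep 'xxx.xxx.xxx.100'

If Send-Q stays non-zero on the relay while Recv-Q stays empty on sunmail, the data never leaves the relay's kernel; if Recv-Q grows on sunmail, then tcp_smtp_server simply is not reading what has already arrived.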
The pstack output for the waiting tcp_smtp_server process follows, for completeness' sake:
23594:  /opt/SUNWmsgsr/messaging64/lib/tcp_smtp_server
-----------------  lwp# 1 / thread# 1  --------------------
fffffd7ffd830007 lwp_park (0, 0, 0)
fffffd7ffd829c14 cond_wait_queue () + 44
fffffd7ffd82a1a9 _cond_wait () + 59
fffffd7ffd82a1d6 cond_wait () + 26
fffffd7ffd82a219 pthread_cond_wait () + 9
fffffd7ffededf3e dispatcher_initialize () + 66e
0000000000404078 main () + 768
00000000004036fc ???????? ()
-----------------  lwp# 2 / thread# 2  --------------------
fffffd7ffd830007 lwp_park (0, fffffd7ffc5fdda0, 0)
fffffd7ffd829c14 cond_wait_queue () + 44
fffffd7ffd82a012 cond_wait_common () + 1c2
fffffd7ffd82a286 _cond_timedwait () + 56
fffffd7ffd82a310 cond_timedwait () + 30
fffffd7ffd82a359 pthread_cond_timedwait () + 9
fffffd7ffd520ff4 PR_WaitCondVar () + 264
fffffd7ffd529854 PR_Sleep () + 74
fffffd7ffd62d5d8 LockPoller () + 88
fffffd7ffd5289e7 _pt_root () + f7
fffffd7ffd82fd5b _thr_setup () + 5b
fffffd7ffd82ff90 _lwp_start ()
-----------------  lwp# 3 / thread# 3  --------------------
fffffd7ffd830007 lwp_park (0, fffffd7ffc3fdda0, 0)
fffffd7ffd829c14 cond_wait_queue () + 44
fffffd7ffd82a012 cond_wait_common () + 1c2
fffffd7ffd82a286 _cond_timedwait () + 56
fffffd7ffd82a310 cond_timedwait () + 30
fffffd7ffd82a359 pthread_cond_timedwait () + 9
fffffd7ffd520ff4 PR_WaitCondVar () + 264
fffffd7ffd529854 PR_Sleep () + 74
fffffd7ffd62d5d8 LockPoller () + 88
fffffd7ffd5289e7 _pt_root () + f7
fffffd7ffd82fd5b _thr_setup () + 5b
fffffd7ffd82ff90 _lwp_start ()
-----------------  lwp# 4 / thread# 4  --------------------
fffffd7ffd830007 lwp_park (0, 0, 0)
fffffd7ffd829c14 cond_wait_queue () + 44
fffffd7ffd82a1a9 _cond_wait () + 59
fffffd7ffd82a1d6 cond_wait () + 26
fffffd7ffd82a219 pthread_cond_wait () + 9
fffffd7ffedf5fe8 pmt_refresh_stats () + d8
fffffd7ffd82fd5b _thr_setup () + 5b
fffffd7ffd82ff90 _lwp_start ()
-----------------  lwp# 5 / thread# 5  --------------------
fffffd7ffedecf10 dispatcher_read(), exit value = 0x0000000000000000
        ** zombie (exited, not detached, not yet joined) **
-----------------  lwp# 6 / thread# 6  --------------------
fffffd7ffd830007 lwp_park (0, fffffd7ffc1fded0, 0)
fffffd7ffd829c14 cond_wait_queue () + 44
fffffd7ffd82a012 cond_wait_common () + 1c2
fffffd7ffd82a286 _cond_timedwait () + 56
fffffd7ffd82a310 cond_timedwait () + 30
fffffd7ffd82a359 pthread_cond_timedwait () + 9
fffffd7ffeded829 dispatcher_housekeeping () + 1e9
fffffd7ffd82fd5b _thr_setup () + 5b
fffffd7ffd82ff90 _lwp_start ()
-----------------  lwp# 14 / thread# 14  --------------------
fffffd7ffd83319a lwp_wait (d, fffffd7ffbdfdf24)
fffffd7ffd82c9de _thrp_join () + 3e
fffffd7ffd82cbbc pthread_join () + 1c
fffffd7ffedece66 dispatcher_joiner () + 36
fffffd7ffd82fd5b _thr_setup () + 5b
fffffd7ffd82ff90 _lwp_start ()
-----------------  lwp# 13 / thread# 13  --------------------
fffffd7ffd832caa pollsys  (fffffd7ffc1b9860, 1, fffffd7ffc1b97a0, 0)
fffffd7ffd7d9dc2 poll () + 52
fffffd7ffee6d7e8 pmt_recvfrom () + 868
0000000000405a3f os_smtp_read () + 1ff
0000000000404e3d smtp_get () + 9d
fffffd7ffec0fda7 big_smtp_read () + 797
fffffd7ffec36798 data () + a28
fffffd7ffec460ad smtpc_enqueue () + f9d
0000000000405343 tcp_smtp_slave () + 223
00000000004038a4 tcp_smtp_slave_pre () + 54
fffffd7ffedeccbc dispatcher_newtcp () + 46c
fffffd7ffd82fd5b _thr_setup () + 5b
fffffd7ffd82ff90 _lwp_start ()
