The Exchange 2007 Wiki

Troubleshooting Exchange 2007 Transport Queuing Problems

When messages don’t “flow” through the Exchange 2007 transport server, the Exchange administrator has a variety of tools at their disposal to diagnose the problem: queue viewer, protocol logging, connectivity logging, message tracking, etc. This article explains how to use the queue viewer to diagnose mail flow interruptions.

Prerequisites
Here are a few notions that should be understood before reading this article:

-          The queue viewer is referred to here as a set of Monad (Exchange Management Shell) tasks. These tasks are also available through the Exchange 2007 management UI, but the UI will not be our focus here.
-          A solution is a subset of the recipients of a message which are all routed to a single destination (or next hop). Solutions are built during routing and the recipients are partitioned into solutions based on their target domain (the address part after the “@” sign), the routing configuration (connectors, AD sites) and other restrictions.
-          Destinations (next hops) are basically of two kinds: Exchange 2007 mailbox servers and any SMTP servers. Consequently, delivery queues are of two kinds: “local” – to a mailbox server (note: “local” doesn’t have to mean the same machine) and “remote” – to an SMTP server. A distinction between the two categories will be made in this article, when relevant. If not specified, it should be assumed that local and remote delivery queues share a common behavior.
-          In the following examples, all strings containing names delimited by “<” and “>” must be replaced with actual values (e.g.: <QueueIdentity>, <server name>)
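
As a rough illustration of the “solution” notion above, here is a small Python sketch that partitions the recipients of a message by target domain only. It is a simplification made for illustration: it ignores connectors, AD sites and the other restrictions mentioned above, and the function name is hypothetical, not part of Exchange.

```python
from collections import defaultdict


def partition_by_domain(recipients):
    """Group recipient addresses by the part after the last '@' (case-insensitive).

    Illustration only: real routing also takes connectors, AD sites and
    other restrictions into account.
    """
    solutions = defaultdict(list)
    for address in recipients:
        domain = address.rsplit("@", 1)[1].lower()
        solutions[domain].append(address)
    return dict(solutions)
```

Under this simplified view, each key of the returned dictionary would correspond to one next hop and each value to the recipients delivered together.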


Queue viewer tasks
The queue viewer is a simple set of tasks which can be used to manipulate queues and messages on a server. Knowledge of the tasks, their parameters and the objects they return or manipulate is assumed.
 

The get-queue task returns information about existing transport queues and the get-message task provides info about the messages queued by the server. These objects have properties that can help us identify them (like Identity and Subject for messages and Identity and NextHopDomain for queues) and other properties which can help diagnose the state of the object (like Status, LastError and RetryCount).


Diagnosing queuing problems
Most mail flow interruption scenarios can be classified into one of the following categories:

1)    Cannot connect to destination machine to deliver mail
2)    Messages stuck in a delivery queue, but destination is working properly
3)    Messages rejected by destination
4)    Cannot route one or more recipients of a message
5)    Messages stuck in the submission queue
6)    Messages show up in the poison queue


When faced with these situations, the administrator has to answer the following two questions:
        What caused this situation?
        How can I fix it so that mail flow is restored?


Below, we go over the common scenarios and attempt to answer those questions for each one.


1.   Cannot connect to destination machine to deliver mail
This is the most common problem. Some of the causes may be:

a)       Destination machine is down or there are network connectivity problems
b)       Destination is being debugged
c)       No MX records can be retrieved for the destination
d)       Destination rejects connections because of certain limitations
 

The administrator may spring into action at the hint of a MOM alert saying that the total number of messages in the transport queues on some server has exceeded the established threshold. Since that information is rather vague (the MOM counter only provides an aggregate message count for all delivery queues), the administrator must first identify which queue seems to accumulate messages without delivering them.

To get to this information run the command:

get-queue -SortOrder:-MessageCount

Sample output:

Identity                DeliveryType Status MessageCount NextHopDomain
--------                ------------ ------ ------------ -------------
testserver2\5           SmartHost... Retry  2            localhost
testserver2\Unreacha... Unreachable  Ready  1            Unreachable Domain
testserver2\Submissi... Undefined    Ready  0            Submission

 

Note: all commands that do not pass a full queue identity (prefixed with the server name) assume the administrator is logged on to the transport server. To run these tasks remotely, add the “-Server:<server name>” parameter.

The get-queue command above returns all the queues on the server in descending order of message count, so the largest queue comes first (change -MessageCount to +MessageCount to sort ascending by the same field).

The first queue will likely be the queue that creates the problem (let’s assume it’s a delivery queue, that is, its DeliveryType member is not Submission or Unreachable). To figure out if this is the case let’s get all the data about the queue with the largest number of messages. The “-Results:<desired result count>” argument can be passed to the task to limit the number of queues returned.

get-queue -SortOrder:-MessageCount -Results:1 | fl

Sample output (you can see that the smarthost delivery queue testserver2\5 whose next hop domain is ‘localhost’ indicates by its last error that it cannot connect to the destination):

Identity         : testserver2\5
DeliveryType     : SmartHostConnectorDelivery
NextHopDomain    : localhost
NextHopConnector : d94a0d6f-f07c-47fe-af9c-013ae4520f8b
Status           : Retry
MessageCount     : 2
LastError        : 451 4.4.0 Primary target IP address responded with: "421 4.2
                   .1 Unable to connect." Attempted failover to alternate host,
                    but that did not succeed. Either there are no alternate hos
                   ts, or delivery failed to all alternate hosts.
LastRetryTime    : 7/27/2006 5:18:54 PM
NextRetryTime    : 7/27/2006 6:18:55 PM
IsValid          : True
ObjectState      : Unchanged

The Status and LastError queue fields will help diagnose the problem. Usually, if the queue cannot establish a connection with the destination, its Status will be Retry and the LastError will be an error message in the form of an SMTP response. Active queues should be delivering messages as expected.
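
Because LastError usually starts with an SMTP response, its first digit already says a lot: by SMTP convention 4xx replies are transient and 5xx replies are permanent. The Python sketch below shows one way to pull the reply code and the optional enhanced status code out of a LastError string copied from the queue viewer. The function name and the returned labels are assumptions made for illustration, and strings that do not start with a reply code are simply reported as unrecognized.

```python
import re


def classify_last_error(last_error):
    """Return (reply_code, enhanced_code, kind), or None if no SMTP code leads the text.

    kind is "transient" for 4xx, "permanent" for 5xx and "other" otherwise.
    enhanced_code is None when the text has no x.y.z status after the reply code.
    """
    match = re.match(r"\s*(\d{3})(?:[ -](\d\.\d{1,3}\.\d{1,3}))?", last_error)
    if not match:
        return None
    code, enhanced = match.group(1), match.group(2)
    kind = {"4": "transient", "5": "permanent"}.get(code[0], "other")
    return code, enhanced, kind
```

Applied to the sample LastError shown above, this yields the reply code 451 with enhanced status 4.4.0, i.e. a transient failure, which matches the queue being in Retry rather than the messages being rejected outright.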

First make sure the queue status is not Suspended (it may have been suspended by somebody else and left in this state accidentally). A suspended queue will not attempt to establish connections with the next hop. If the queue is suspended, resume it by running the following command, which will automatically attempt to connect to the destination:

resume-queue <QueueIdentity>

From the LastError field you may be able to tell whether the remote machine actively rejects connections (e.g. it has reached some maximum number of incoming connections), whether a connection cannot be made because nothing is listening on the remote IP address/port combination, or whether the store driver cannot connect to a mailbox server. DNS failure responses are also reflected in this field. You won’t be able to tell whether a machine is being debugged, but since that is only relevant inside the same organization, we assume the admin knows about it.

What can be done to fix these problems? The LastError queue field should be descriptive enough to show whether the problem lies with the server being diagnosed, with the remote server, or somewhere in between (MX records). The actions differ according to whether the destination belongs to the same org (AD site, routing group) or not (some remote domain on the internet).

If the destination is internal, besides the obvious checks that it is up and working it is also useful to check the routing information. If some send connectors are left enabled pointing to a smarthost that doesn’t exist anymore or to an AD site which doesn’t have any Hub servers, messages may be routed to a queue that acts like a dead-end. It is important that this scenario be detected early so that the messages don’t stay in that queue forgotten until they expire.

Note: Most of the time the NextHopDomain property of a queue is the name of either another SMTP server or a mailbox machine. In these situations the destination can be promptly identified and action can be taken. But in the case of delivery to another AD site, the next hop domain is the name of the AD site, and the actual machine our server is trying to connect to is chosen dynamically by Enhanced DNS. One approach so far is to bring the connectivity log and the protocol log or message tracking log into the picture and identify the IP of the bridgehead that responded with an error.

In case d) above (destination rejects connections because of certain limitations) a wealth of information is available in the protocol log, which has the advantage of showing the history of our attempts to connect and deliver, whereas the queue viewer only exposes the current situation and the last error. This situation may also uncover other problems. If the destination machine that rejects messages is under the control of the same administrator, it may be useful to inspect the event log on the destination, which can give additional information about why the messages were rejected: tar-pitting, connection rate or message rate exceeded, etc.

2.   Messages stuck in a delivery queue but destination seems OK
Sometimes, after trying to diagnose the state of a queue, the admin realizes that the destination is OK, i.e. it can be contacted via TCP/IP on port 25 or, if the queue is a local delivery queue, the mailbox store is up and running. The queue status may be Ready or even Active, but the message count won’t go down. In this situation we want to look at the messages in more depth. It may be that the queue simply has a large backlog and the system is just slow. In that case the admin should first make sure that some messages are being delivered.

The first thing to do is call:

get-queue <QueueIdentity>

As in the previous scenario, first make sure the queue status is not Suspended. Next, check if the queue status is Retry. After the “glitch retry” period (4 retries at 15-second intervals) the queue will not attempt a connection for the next hour (local delivery queues, however, have a flat 5-minute retry time). The time of the next connection attempt is given by the NextRetryTime queue field. If the destination is known to have had a problem that has been fixed in the meantime, run:

retry-queue <QueueIdentity>

to attempt to establish a connection.
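
To make the retry timing described above concrete, the following Python sketch encodes one reading of it: the first four retries of a remote queue come 15 seconds apart, later ones an hour apart, and local delivery queues always wait 5 minutes. The constant and function names are hypothetical, the numbers are the defaults quoted in this article, and the actual server behavior depends on its configuration.

```python
GLITCH_RETRIES = 4         # quick retries after the first failure
GLITCH_INTERVAL_S = 15     # seconds between quick retries
REMOTE_INTERVAL_S = 3600   # one hour once the quick retries are used up
LOCAL_INTERVAL_S = 300     # flat five minutes for local delivery queues


def next_retry_delay(failed_attempts, local_delivery=False):
    """Seconds until the next connection attempt after `failed_attempts` (>= 1) failures."""
    if local_delivery:
        return LOCAL_INTERVAL_S
    if failed_attempts <= GLITCH_RETRIES:
        return GLITCH_INTERVAL_S
    return REMOTE_INTERVAL_S
```

Under this reading, a remote queue that has just failed for the fifth time will sit idle for an hour, which is the window in which a manual retry-queue is worth running once the destination is known to be fixed.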

If the queue is Active and the message count fluctuates up and down then the queue is working as expected: messages are being delivered but they can also be rejected – in both cases they disappear from the queue. If the queue message count seems to be increasing constantly, it’s time to take a look at the messages:

get-message -Queue <QueueIdentity> 

will return all messages from the queue identified by “QueueIdentity”.
Sample output for “get-message -Queue testserver2\5” with one message suspended and one ready:

Identity         FromAddress     Status          Queue           Subject
--------         -----------     ------          -----           -------
testserver2\5\1  <>              Suspended       testserver2\5   Undeliverabl...
testserver2\5\2  me@a.com        Ready           testserver2\5   party!


If there are too many messages this can be slow and spew too much information, so limiting the number of results may help. Here’s how to get the first 10 results:

get-message -Queue <QueueIdentity> -Results:10

To diagnose the problem further we must check the status of the messages.

Note: the order in which the messages are returned by get-message is not related to the order in which the messages are delivered.

If the status of some messages is Active it means the queue is doing its job delivering messages. To get the active messages run the following command a few times and see if any results are returned:

get-message -Filter:{Queue -eq '<QueueIdentity>' -and Status -eq 'Active'}

(Run this a few times, since messages being delivered disappear quickly and a single run may return no messages.)

If many or all messages are Suspended, as in the case of the queues, no delivery will be attempted. Messages may have been suspended by somebody else and forgotten in this state.

If the status of many messages is Retry (you can obtain them using a query filter like the one for Active messages), check their LastError. This should explain why they were put in Retry. Messages in Retry are usually accompanied by a 400-level SMTP error, which should be enough to diagnose the problem.

3.   Messages rejected by a remote machine
The last delivery error is either a per-message error received during a command like “mail from” or an aggregate of the recipient errors. The admin may see a 400-level recipient-specific error when looking at the message but would not know which recipient caused the problem. To get even more detailed information on a particular message run:

$m=get-message <MessageIdentity> -IncludeRecipientInfo
$m.Recipients | fl

This will also retrieve the recipients and then display them in a detailed view (the message object will only print the email addresses). Each recipient has its own Status and LastError fields, which can help identify which recipient caused the error and take action (maybe the recipient wasn’t found in AD or the mailbox is full, etc.). In addition, the RetryCount property displays the number of times delivery has been attempted for that message.

When messages are rejected, NDRs are usually generated. If NDR messages start queuing up, we may have two problems: delivering to recipients and delivering the NDR to the sender. NDRs are easy to identify in the queue viewer: their FromAddress field is “<>” and the subject usually starts with “Undeliverable:”. It is useful in this case to take a look at the NDRs themselves. To do so we can export the NDR messages and look at their body. To export a message, it must first be suspended:

suspend-message <message identity>
export-message <message identity> -Path: <path to directory or file>
resume-message <message identity>

4.   Cannot route one or more recipients of a message
When some recipients cannot be routed, the message with the subset of un-routable recipients will end up in the unreachable queue. The main question an admin must answer is “why did this message end up in the unreachable queue?” As in the previous cases, the LastError field will help diagnose the problem: all messages in the unreachable queue will have the LastError field populated. The value of this field is a concatenation of all the errors encountered when routing all recipients. There are errors like “A matching connector cannot be found to route the external recipient” or “The mailbox recipient does not have a MDB”.

Of course, this doesn’t help much until we realize what recipient caused what error. To do so, run the same sequence of commands as above by adding the IncludeRecipientInfo parameter to the get-message task to dump the recipients. All the recipients of the messages in this queue should have a last error string that describes the reason why they couldn’t be routed.

In most cases the actions needed to fix these issues follow from the error description. For example, if the error on the message was “A matching connector cannot be found to route the external recipient” and the recipient is known to be valid, then it is likely that the send connectors are misconfigured (e.g. missing address space, missing connector). After fixing the connectors, the unreachable queue will be automatically resubmitted too; this will result in those messages being drained and routed to the appropriate delivery queues.

To prevent automatic resubmission (e.g., when a few connectors need to be changed or added and we don’t want the unreachable queue to be resubmitted after each configuration change because many messages may end up right back in the same queue) the queue can be suspended first, later resumed, and then resubmit can be performed manually:

suspend-queue Unreachable
 … do some work with the connectors
resume-queue Unreachable
retry-queue Unreachable -Resubmit:$true

5.   Messages stuck in the submission queue
The administrator will be alerted by MOM when the size of the Submission queue grows over the accepted limit. Sometimes this simply means that we have a spike in incoming mail and the queue drains more slowly than usual, but at other times we may see that no messages are going through the categorizer and into the delivery queues. This is usually an indication that something is wrong inside the categorizer component.

As usual, before attempting to investigate further, make sure the submission queue is not suspended (the status must be Ready).

get-queue Submission

One case in which the above behavior happens is when the messages reach the categorizer but are being deferred and put back in the submission queue because of AD errors: the recipients cannot be resolved (this will only happen in the Hub role, as the Edge role doesn’t have a resolver). The first step is to look at the messages in Retry:

get-message -Filter:{Queue -eq 'Submission' -and Status -eq 'Retry'}

Note: there is no way to know exactly how long a message will be deferred for, but generally, 400-level errors in remote delivery will defer a message for the time span configured in the transport server property MessageRetryInterval (default is 1 minute). The messages in the submission queue are usually deferred for 30 minutes (non-configurable) if errors like AD connectivity failures occur. In addition, a categorizer agent could defer a message for any duration. Unlike in the case of queues, there isn’t a way for the admin to change or reset the retry time for messages.

If AD is unavailable the last error on those messages will say something like “AD transient failure during resolve.” The AD connectivity problem must then be investigated.

All message retries that are triggered by the categorizer have a reason associated with them. There are LastError messages indicating that the message has been deferred by an agent (“Message deferred by categorizer agent.”), that a failure happened during content conversion (“A storage transient failure has occurred during content conversion.”), etc. These errors do not always give an exact indication of what the problem is, but they make a good starting point. For example, there won’t be any indication in the last error field about which agent deferred the message and why, but if get-message returns many messages in Retry with the same “deferred by agent” last error, it likely means that one of the categorizer agents has encountered a problem. The next steps may be trying to identify the agent by disabling all categorizer agents or rules, then enabling them one by one. Ultimately, debugging the agent or analyzing the tracing log of that agent (if available) may be needed.

Another case that we have encountered is when a categorizer agent is stuck (say, in a deadlock) and cannot finish processing a message. By default, the categorizer can only process 20 messages at once (in various stages of categorization). If all those 20 jobs are stuck, no more messages will be picked up from the submission queue for processing, and as a result, the submission queue will grow continuously until a MOM alert is triggered. To figure out which messages are currently being processed by the categorizer, run:

get-message -Filter:{Queue -eq 'Submission' -and Status -eq 'Active'} | ft Identity

Active messages are those currently in various stages of categorization (routing, resolving, content conversion). This query should return at most 20 results. Run the query a few times in a row. If the same messages are returned each time the query is run (you can tell because the identities are the same), then it is very likely that we have a stuck categorizer agent problem. The quick fix to restore mail flow is to disable the agents. The in-depth fix would be to attach a debugger and identify the call stack on all the stuck threads, which will point to the “stuck” code.
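
Comparing the identities by eye across several runs is error-prone when up to 20 messages come back each time. As a minimal sketch, assuming the Identity values from each run have been copied into a list of strings, the following Python helper (a hypothetical name, not an Exchange task) returns the identities that appear in every run. Those are the candidates for messages held by a stuck agent.

```python
def persistently_active(snapshots):
    """Return, sorted, the message identities present in every snapshot.

    `snapshots` is a list of lists, one list of Identity strings per run
    of the Active-messages query. An empty input yields an empty result.
    """
    if not snapshots:
        return []
    common = set(snapshots[0])
    for snapshot in snapshots[1:]:
        common &= set(snapshot)
    return sorted(common)
```

An empty result after a few runs spaced some seconds apart suggests the categorizer is making progress. A result that stays at the full set of active messages points to the stuck-agent case described above.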

6.   Messages show up in the poison queue
The poison queue does not show up when get-queue is run unless there are messages in it. Messages in the poison queue mean that the server has crashed at least twice while processing them. This can have many causes: bugs in our code not knowing how to deal with certain kinds of input, bugs in agents, mis-configurations, etc. In general, if poison messages exist they have uncovered a bug somewhere. It is important to realize that the messages in the poison queue are usually not invalid or malicious. They would only become malicious if attackers discover that they can crash Exchange 2007 servers this way.

An admin who has discovered poison messages after a crash (say, there is a MOM alert on the counter that monitors the length of the poison queue, set to fire when the queue is larger than 0 messages) can get more information using the following commands.

First, take a look at the poison queue:

get-queue Poison

or

get-queue <ServerName>\Poison

This shows how many poison messages exist. The messages must then be examined individually and decisions made on a case-by-case basis.

get-message -Queue:Poison

All those messages are considered suspended. The admin now has a couple of options:

-          Resume the messages one by one and figure out if they still make the transport service crash

resume-message <poison message identity>


-          Export the poison messages to files and have the content looked at by Microsoft developers (the following command exports all poison messages to the temp directory)

get-message -Queue:Poison | export-message -Path:"C:\temp"
 

-          Delete messages if they are indeed poison (and choose to send or not to send NDR)

remove-message <poison message identity> -withNDR:$false


-          If the problem is suspected to be happening because of some agent, disable the agent and resume the poison messages.

The messages in the Poison queue never expire; they have to be either resumed or deleted by an admin.

Last Modified 8/10/06 1:14 PM