Wednesday, February 17, 2010

WebSphere MQ FFST

What is FFST?
FFST stands for First Failure Support Technology. It is a technology within WebSphere MQ that creates detailed reports for IBM Service, describing the current state of a part of a queue manager together with historical data.
What are they for?
They are used to report unexpected events or states encountered by WebSphere MQ. (Alternatively, they can be generated upon request.)


Note that return codes are the mechanism for informing application programmers of expected states or errors in a WebSphere MQ application. There are exceptions, but as a rule of thumb, FFSTs report something that will need to be actioned by:
• system administrators - for example, where FFSTs report resource issues such as running low on disk space
• IBM - where FFSTs report a potential code error in WebSphere MQ that (unless already identified and corrected in existing maintenance) may need correcting
Where are they?
On UNIX systems, they are written to /var/mqm/errors
They are contained in files with the extension .FDC
The file name begins with AMQ, followed by the process ID of the process which reported the error. For example, /var/mqm/errors/AMQ12345.0.FDC is the first FFST file produced by a process with ID 12345.
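As a rough illustration of that naming convention, here is a minimal Python sketch that pulls the process ID out of an FDC file name. The regular expression and the names parse_fdc_name and FdcName are my own assumptions for illustration, and I am treating the number after the process ID (0 in the example above) as a per-process file counter; none of this is part of WebSphere MQ.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern, based only on the example above: AMQ<pid>.<number>.FDC
FDC_NAME = re.compile(r"^AMQ(\d+)\.(\d+)\.FDC$")


class FdcName(NamedTuple):
    pid: int       # process ID of the process that reported the error
    sequence: int  # assumed to be a per-process file counter (0 = first file)


def parse_fdc_name(filename: str) -> Optional[FdcName]:
    """Return the pid and counter from an FDC file name, or None if it does not match."""
    match = FDC_NAME.match(filename)
    if match is None:
        return None
    return FdcName(pid=int(match.group(1)), sequence=int(match.group(2)))


print(parse_fdc_name("AMQ12345.0.FDC"))
```

Anything that does not look like an FDC file name (an error log, for instance) simply returns None.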

What do they contain?
FFST files are text files containing error reports for a single process.
If a single process produces more than one error report, these are all included within the same FFST file for that process, in the order in which they were generated.
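If you want a quick count of how many reports a single FDC file holds, something like the following Python sketch may help. It assumes that every report starts with the title line shown in the sample report later in this post; the function name count_reports is hypothetical.

```python
# Title line taken from the sample report in this post; assumed to appear once per report.
REPORT_TITLE = "WebSphere MQ First Failure Symptom Report"


def count_reports(fdc_text: str) -> int:
    """Count the reports in the text of one FDC file by counting title lines."""
    return sum(1 for line in fdc_text.splitlines() if REPORT_TITLE in line)
```

Because the reports are stored in the order they were generated, the first title line found belongs to the earliest report from that process.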
How should I look at these files?
FFST files are just text files, so your favorite text editor is normally the best place to start.
The tool ffstsummary is also useful - it produces a summary of FFST reports in the current directory, sorted into time order. This can be a good place to start to see the errors reported in your errors directory.
For example:
[mqm@test ~]$ cd /var/mqm/errors
[mqm@test errors]$ ffstsummary
AMQ21433.0.FDC 2007/04/10 10:05:45 amqzdmaa 21433 2 XC338001 xehAsySignalHandler xecE_W_UNEXPECTED_ASYNC_SIGNAL OK
AMQ21429.0.FDC 2007/04/10 10:05:45 amqzmur0 21429 2 XC338001 xehAsySignalHandler xecE_W_UNEXPECTED_ASYNC_SIGNAL OK
AMQ21469.0.FDC 2007/04/10 10:05:45 runmqlsr 21469 2 XC338001 xehAsySignalHandler xecE_W_UNEXPECTED_ASYNC_SIGNAL OK
AMQ21422.0.FDC 2007/04/10 10:05:45 amqzfuma 21422 2 XC338001 xehAsySignalHandler xecE_W_UNEXPECTED_ASYNC_SIGNAL OK
AMQ21424.0.FDC 2007/04/10 10:05:45 amqzmuc0 21424 2 XC338001 xehAsySignalHandler xecE_W_UNEXPECTED_ASYNC_SIGNAL OK
AMQ21431.0.FDC 2007/04/10 10:05:45 amqrrmfa 21431 2 XC338001 xehAsySignalHandler xecE_W_UNEXPECTED_ASYNC_SIGNAL OK
AMQ21449.0.FDC 2007/04/10 10:05:45 amqzlaa0 21449 2 XC338001 xehAsySignalHandler xecE_W_UNEXPECTED_ASYNC_SIGNAL OK
AMQ21434.0.FDC 2007/04/10 10:05:45 amqzmgr0 21434 2 XC338001 xehAsySignalHandler xecE_W_UNEXPECTED_ASYNC_SIGNAL OK
AMQ21452.0.FDC 2007/04/10 10:05:45 runmqchi 21452 2 XC338001 xehAsySignalHandler xecE_W_UNEXPECTED_ASYNC_SIGNAL OK
AMQ21417.0.FDC 2007/04/10 10:05:45 amqzxma0 21417 4 XC338001 xehAsySignalHandler xecE_W_UNEXPECTED_ASYNC_SIGNAL OK
[mqm@test errors]$
The columns in the output above show:
• filename - which FDC file contains the FFST report
• date and time of the report
• process name - name of the process which produced the report
• process and thread IDs - for the process which produced the report
• probe ID
• component - part of WebSphere MQ where the report was produced
• error code - major error code and minor error code
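If you want to post-process this output, the columns can be split on whitespace. The following Python sketch assumes every line has exactly the ten whitespace-separated fields seen in the output above; the names SummaryRow and parse_summary_line are mine, not part of ffstsummary.

```python
from typing import NamedTuple


class SummaryRow(NamedTuple):
    filename: str
    date: str
    time: str
    process_name: str
    pid: int
    tid: int
    probe_id: str
    component: str
    major_code: str
    minor_code: str


def parse_summary_line(line: str) -> SummaryRow:
    """Split one line of ffstsummary output into named fields."""
    fields = line.split()
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}: {line!r}")
    return SummaryRow(fields[0], fields[1], fields[2], fields[3],
                      int(fields[4]), int(fields[5]), *fields[6:])


row = parse_summary_line(
    "AMQ21417.0.FDC 2007/04/10 10:05:45 amqzxma0 21417 4 XC338001 "
    "xehAsySignalHandler xecE_W_UNEXPECTED_ASYNC_SIGNAL OK"
)
print(row.process_name, row.probe_id)
```

A line with a different number of fields raises ValueError rather than being guessed at, since I have only seen the layout shown here.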
What does an FFST report contain?
I’ve added some numbers on the left to mark out points worth noting…
Sample FFST Report:
+-----------------------------------------------------------------------------+
| |
| WebSphere MQ First Failure Symptom Report |
| ========================================= |
| |
(1) | Date/Time :- Wednesday Feb 02 13:25:56 IST 2008 |
(2) | Host Name :- joseph.joseph.com (Linux 2.6.9-42.0.10.EL) |
| PIDS :- 5724H7207 |
(3) | LVLS :- 6.0.2.0 |
| Product Long Name :- WebSphere MQ for Linux (POWER platform) |
| Vendor :- IBM |
(4) | Probe Id :- XC338001 |
| Application Name :- MQM |
(5) | Component :- xehAsySignalHandler |
(6) | SCCS Info :- lib/cs/unix/amqxerrx.c, 1.214.1.4 |
| Line Number :- 737 |
| Build Date :- Sep 21 2007 |
| CMVC level :- p600-200-060921 |
| Build Type :- IKAP - (Production) |
(7) | UserID :- 00011243 (mqm ) |
(8) | Program Name :- runmqlsr |
| Addressing mode :- 64-bit |
(9) | Process :- 16337 |
| Thread-Process :- 16337 |
(10) | Thread :- 2 |
| ThreadingModel :- PosixThreads |
(11) | Major Errorcode :- xecE_W_UNEXPECTED_ASYNC_SIGNAL |
| Minor Errorcode :- OK |
| Probe Type :- MSGAMQ6209 |
| Probe Severity :- 3 |
(12) | Probe Description :- AMQ6209: An unexpected asynchronous signal (2 : |
| SIGINT) has been received and ignored. |
| FDCSequenceNumber :- 0 |
| Arith1 :- 2 2 |
(13) | Comment1 :- SIGINT |
| Comment2 :- Signal sent by pid 0 |
| |
+-----------------------------------------------------------------------------+

(14) MQM Function Stack
xehAsySignalMonitor
xehHandleAsySignal
xcsFFST

(15) MQM Trace History
{ xppInitialiseDestructorRegistrations
} xppInitialiseDestructorRegistrations rc=OK
{ xehAsySignalMonitor
-{ xcsGetEnvironmentInteger
--{ xcsGetEnvironmentString
--} xcsGetEnvironmentString rc=xecE_E_ENV_VAR_NOT_FOUND

(16) Process Control Block
0x80006ad890 58494850 000029E8 00003FD1 00000004 XIHP..)...?.....
0x80006ad8a0 00000000 10029F70 00000000 10033A50 .......p......:P
0x80006ad8b0 00000000 00000000 00000000 00000000 ................
0x80006ad8c0 to 0x80006ad900 suppressed, 5 lines same as above
0x80006ad910 00000000 00000001 00000000 00000000 ................
0x80006ad920 00000000 00000000 00000000 00000000 ................
0x80006ad930 to 0x80006ad9d0 suppressed, 11 lines same as above
0x80006ad9e0 00000000 00000000 00000001 00568001 .............V..
0x80006ad9f0 00FB8000 00000000 00000080 00760000 .............v..
0x80006ada00 00000000 00000000 00000000 00000000 ................
0x80006ada10 to 0x80006ae9f0 suppressed, 255 lines same as above
0x80006aea00 00000000 FFFFFFFF FFFFFFFF 00000000 ................
0x80006aea10 00000000 00000000 00000001 FFFFFFFE ................
0x80006aea20 00000001 00000000 00000000 00000000 ................
0x80006aea30 00000080 0069A380 00000000 00000000 .....i..........
0x80006aea40 00000000 00000000 00000000 00000000 ................

etc

(17) Environment Variables:
MANPATH=/opt/csm/man:
HOSTNAME=joseph.joseph.com
TERM=xterm
SHELL=/bin/bash
HISTSIZE=1000
SSH_CLIENT=::ffff:9.20.94.90 2625 22
QTDIR=/usr/lib/qt-3.3
OLDPWD=/root
SSH_TTY=/dev/pts/1
USER=root
LS_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:cd=40…
KDEDIR=/usr
MAIL=/var/spool/mail/root
PATH=/usr/kerberos/sbin:/usr/kerberos/bin:/opt/csm/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:…
INPUTRC=/etc/inputrc
PWD=/var/mqm/errors
LANG=en_GB.UTF-8
SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
SHLVL=1
HOME=/root
LOGNAME=root
SSH_CONNECTION=::ffff:9.20.94.90 2625 ::ffff:9.20.63.20 22
LESSOPEN=|/usr/bin/lesspipe.sh %s
G_BROKEN_FILENAMES=1
_=/usr/bin/runmqlsr
1. Date and time that this report was produced
For many problems, this is the most useful piece of information, because it allows an error report to be correlated with other known events.
2. Host name of the machine where this report was produced
3. Version and maintenance level of WebSphere MQ
This is useful when comparing an error report against a documented known problem.
4. Probe ID
This is an internal method of identifying the error report. It identifies a single point in the WebSphere MQ source code where the report was produced, and consists of a two-letter component code, a three-digit function code, and a three-digit probe identifier.
This often makes it the best way to uniquely identify the error that the report is describing. More on this a bit later…
5. Component
This is the part of WebSphere MQ which produced the report. As with the source information below, it is generally more useful to us than it is to users, although the name can sometimes give a useful hint as to the nature of the error report. For example, in this case, where the report is the result of my using Control-C to generate an interrupt signal, you can see that the component which produced the report was a signal handler.
6. Source information
Although this information isn’t useful to users, I thought it might be interesting to highlight that an FFST will identify exactly where it was produced, down to the source code file, line number and version.
7. User ID that was running the process which produced the report
This is useful to confirm whether a problem was the result of insufficient user privileges.
8. Process name of the process which produced the report
9. Process ID of the process which produced the report
10. Thread ID within the process which produced the report
11. Error codes for the report
12. A longer description of the error code for the report
This is a textual (English) description containing information that a WebSphere MQ developer thought might be helpful if the situation were to occur. Sometimes this information may be useful to users, such as messages identifying an operating system function which has failed and what the error code was. Other times, it will only be useful to IBM Service.
13. Additional comment information
14. Function stack for the process at the time of the report
15. A history of function calls made by the process leading up to the report
16. A series of dumps
In the WebSphere MQ source code, functions can register data items that may be of interest. If a function has something that could be useful (such as in diagnosing or debugging a problem), it can register it with the engine that produces FFST reports. This means that in the event of an FFST being produced, this data will be included. These items are deregistered when the function completes.
This is normally of more use to IBM Service than to users; however, there may be times - such as when some message data is included - when you will recognise some of the data here.
17. Environment variables for the environment of the process which produced the report
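If you want to pick out header fields such as the date, probe ID or probe description programmatically, the "name :- value" layout of the boxed header makes that fairly easy. Here is a minimal Python sketch, assuming the layout matches the sample report above (fields separated by ":-", long values wrapped onto a following line inside the box). The function name parse_header and the regular expression are my own.

```python
import re
from typing import Dict, Optional

# Matches "| Name :- value |"; tolerates a missing trailing "|".
FIELD = re.compile(r"\|\s*(?P<name>[^|]+?)\s*:-\s*(?P<value>.*?)\s*\|?\s*$")


def parse_header(report_text: str) -> Dict[str, str]:
    """Collect the 'name :- value' pairs from the boxed header of one FFST report."""
    fields: Dict[str, str] = {}
    last: Optional[str] = None
    for line in report_text.splitlines():
        match = FIELD.search(line)
        if match:
            last = match.group("name")
            fields[last] = match.group("value")
            continue
        stripped = line.strip()
        if last is not None and stripped.startswith("|"):
            # A boxed line without ":-" is treated as a wrapped continuation
            # of the previous field (for example a long Probe Description).
            extra = stripped.strip("|").strip()
            if extra:
                fields[last] = fields[last] + " " + extra
    return fields
```

Lines before the first field (the title and its underline) are ignored, and parsing effectively stops contributing once the box ends, because later sections such as the function stack do not start with "|".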

What can I do if I have an FFST report?
Monitoring for the production of FDC files is an important part of handling the occurrence of errors in a WebSphere MQ system. Prompt handling of a problem can be key to a timely resolution.
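One simple way to monitor for new FDC files is to check the errors directory periodically for files modified since the last check. The Python sketch below is only an illustration of that idea; the function name fdc_files_newer_than is hypothetical, and how you schedule the check and raise alerts is up to you.

```python
from pathlib import Path
from typing import List


def fdc_files_newer_than(errors_dir: str, since_epoch: float) -> List[str]:
    """Return names of AMQ*.FDC files modified after since_epoch, oldest first."""
    found = [
        path for path in Path(errors_dir).glob("AMQ*.FDC")
        if path.stat().st_mtime > since_epoch
    ]
    found.sort(key=lambda path: path.stat().st_mtime)
    return [path.name for path in found]
```

A monitoring job could call this with the time of its previous run, for example fdc_files_newer_than("/var/mqm/errors", last_check), and alert if the list is not empty. Because a process appends further reports to its existing file, using the modification time also catches new reports in old files.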
If an FDC file is created, the next step is probably to determine whether this is something that requires you to take action, and if so, how urgent it is. A number of factors will influence this, including:
• Are queue managers running?
• Are applications still working?
• Does the probe description give any insight into why the FFST was generated?
• Does the time and date of the FFST correspond with any other known events or occurrences at the same time which may explain the error?
If the FFST identifies a resource issue, such as low disk space, then this will normally give enough information for a system administrator to identify and correct the source of the problem.
If you are unable to determine an explanation for the FFST, then a useful next step is to look to see if others have seen this FFST before, and if so what they found it to mean and needed to do.
This is where the probe id from the FFST is very useful. In the majority of cases (for one notable exception, see my discussion on signals below), this will be a unique eye-catcher for the issue being reported. This means that you can search for this short string on the WebSphere MQ support site on ibm.com or in the IBM Support Assistant. Often, this will reveal cases where someone has encountered this FFST before and the fix that resulted.
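To make the structure of a probe id concrete, here is a small Python sketch that splits one into the three parts described earlier (two letters for the component code, three digits for the function code, three digits for the probe identifier). The names ProbeId and split_probe_id are mine, and the pattern is based only on that description and the XC338001 example.

```python
import re
from typing import NamedTuple

# Two letters, then two groups of three digits, as described in this post.
PROBE_ID = re.compile(r"^([A-Z]{2})(\d{3})(\d{3})$")


class ProbeId(NamedTuple):
    component_code: str
    function_code: str
    probe: str


def split_probe_id(probe_id: str) -> ProbeId:
    """Split a probe id such as XC338001 into its three parts."""
    match = PROBE_ID.match(probe_id.strip())
    if match is None:
        raise ValueError(f"not a recognised probe id: {probe_id!r}")
    return ProbeId(*match.groups())


print(split_probe_id("XC338001"))
```

The parts are kept as strings so that leading zeros (as in 001) are preserved. When searching the support site, it is still the full eight-character string that you want to search for.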
Beyond this point, you will most likely need to raise a PMR with IBM Service. It is useful to send all FFSTs from your system (rather than just the one that you believe to be of interest), as following the history can be key to resolving an issue. It is also useful to send the WebSphere MQ system error logs (/var/mqm/errors/AMQ*.LOG) and the queue manager error logs (/var/mqm/qmgrs/QMNAME/errors/AMQ*.LOG, where QMNAME is the name of your queue manager), together with a clear description of what you are seeing and the impact on the system and your business.
Signal handling
You will generally find the probe id to be a unique identifier for a specific problem. While this is usually true, one notable exception is FFSTs produced by the signal and exception handlers.
The signal handler component produces FFSTs to report signals sent to WebSphere MQ processes. This means that the information in the FFST (such as the probe id and source code file, line number, etc.) is about the signal handler which caught the signal, not whatever it was that caused or created the signal.
This is less of a problem if the signal was generated externally to WebSphere MQ, such as the SIGINT that I generated with Ctrl-C in the example above. The FFST contains information about the process which was sent the signal and the time and date of the signal.
It can be more complex if the signal is generated from elsewhere within WebSphere MQ, such as a SIGSEGV from a segmentation fault in another WebSphere MQ process. The exception handler will generate an FFST to record the SIGSEGV; however, it is important to bear in mind that any such FFST contains a report about where the SIGSEGV was caught, not where it was generated. This doesn’t mean that the cause cannot be found, but it does mean that the FFST information such as the probe id is not necessarily the sort of unique eye-catcher described above.
Generating FFSTs on request
I mentioned above that it is possible to generate FFSTs manually. This can be done using the following commands:
amqldbgn -p PID (on Windows)
or
kill -USR2 PID (on UNIX platforms)
where PID is the process ID for a WebSphere MQ process. FFST reports generated in this way will have a probe id that ends in 255.
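When sifting through a directory of reports, it can be handy to separate FFSTs that were requested in this way from ones that WebSphere MQ produced on its own. A minimal Python sketch of that check, relying only on the "ends in 255" rule above (the function name is hypothetical, and XC338255 in the usage line is a made-up probe id for illustration):

```python
def looks_like_requested_ffst(probe_id: str) -> bool:
    """Return True if the probe id ends in 255, as FFSTs generated on request do."""
    return probe_id.strip().endswith("255")


print(looks_like_requested_ffst("XC338255"), looks_like_requested_ffst("XC338001"))
```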
