|
Author |
Message |
shankarm
Active User
Joined: 17 May 2010 Posts: 175 Location: India
|
|
|
|
Hello,
My production job, which has been running for more than 20 years, is failing and I am not able to locate the exact issue.
I have been trying different options for the past 10 hours. I am missing something and I am not able to locate the issue.
Below is what is available in the logs:
Code: |
IGD104I RFSAIAS.D164.T2020328.RFSAIAS2 RETAINED, DDNAME=SYS00001
IEA822I COMPLETE TRANSACTION DUMP WRITTEN TO RFSAIAS.D164.T2020328.RFSAIAS2
+CEE3797I LANGUAGE ENVIRONMENT HAS DYNAMICALLY CREATED A DUMP.
+CEE0374C CONDITION=CEE3204S TOKEN=00030C84 59C3C5C5 00000000 933
WHILE RUNNING PROGRAM IGG019BP
AT THE TIME OF INTERRUPT
PSW 078D0000 00CA422E
GPR 0-3 00000000 000BF038 000BF038 002CBDC0
GPR 4-7 002C3FFE 002C4C8C 00007E58 00000030
GPR 8-B 00CA40B8 00404040 002CBDF0 0000FEF0
GPR C-F 00007B78 000BD730 501D7948 00CA40B8
FLT 0-2 4410810000000000 4A6F33F5FD9E0000
FLT 4-6 0000000000000000 0000000000000000
+CEE0374C CONDITION=CEE3206S TOKEN=00030C86 59C3C5C5 00000000 934
WHILE RUNNING PROGRAM CEEBINIT
AT THE TIME OF INTERRUPT
PSW 078D2000 8002472C
GPR 0-3 00000006 0002B240 002CB958 002CB890
GPR 4-7 0002B240 00000076 00000079 0000000C
GPR 8-B 0002B1E0 0002B1C0 404040B9 00D9060C
GPR C-F 0002CB88 0003FC00 800246FA 00D90618
FLT 0-2 4410810000000000 4A6F33F5FD9E0000
FLT 4-6 0000000000000000 0000000000000000
IEA995I SYMPTOM DUMP OUTPUT 941
USER COMPLETION CODE=4087 REASON CODE=00000007
TIME=20.20.32 SEQ=15804 CPU=0000 ASID=0091
PSW AT TIME OF ERROR 078D1000 86BDFDBC ILC 2 INTC 0D
NAME=UNKNOWN
DATA AT PSW 06BDFDB6 - 00181610 0A0DA7F4 001C1811
AR/GR 0: 00000000/84000000 1: 00000000/84000FF7
2: 00000000/00000007 3: 00000000/00031038
4: 00000000/06C1C128 5: 00000000/06C1C2A0
6: 00000000/0002B340 7: 00000000/0002B7F0
8: 00000000/80000000 9: 00000000/00041F9E
A: 00000000/00000001 B: 00000000/86BDFCE8
C: 00000000/0002CB88 D: 00000000/0003FFA0
E: 00000000/8003204A F: 01000002/00000007
END OF SYMPTOM DUMP
IEC915I 219-03,RFSAIAS2,AIAANAL2,********
IEC999I IFG0TC0A,IFG0TC0B,RFSAIAS2,AIAANAL2
IEC999I IGC00020,RFSAIAS2,AIAANAL2
IEC999I IFG0TC0A,IFG0TC0B,RFSAIAS2,AIAANAL2,DEB ADDR=8B5480 ,DSN = UNKNOWN
CC
IEC205I ERRORS,RFSAIAS2,AIAANAL2,FILESEQ=1, COMPLETE VOLUME LIST, 946
DSN=AIA.RFSAIA.ERRORS.G0788V00,VOLS=710825,TOTALBLOCKS=350
IEF450I RFSAIAS2 AIAANAL2 STEP1 - ABEND=S000 U4087 REASON=00000007 947
TIME=20.20.47
IEF234E K 552D,234229,PVT,RFSAIAS2,AIAANAL2
TMS014 IEF234E K 552D,234229,PVT,RFSAIAS2,AIAANAL2
IEF234E K 5566,710825,PVT,RFSAIAS2,AIAANAL2
TMS014 IEF234E K 5566,710825,PVT,RFSAIAS2,AIAANAL2
-STEP1 AIAANAL2 U4087 80116 63293 7.60 .00 15.9 13759K BATCH
|
Code: |
IEC915I 219-03,RFSAIAS2,AIAANAL2,********
IEC999I IGC00020,RFSAIAS2,AIAANAL2
IEC999I IFG0TC0A,IFG0TC0B,RFSAIAS2,AIAANAL2,DEB ADDR=8B5480 ,DSN = UNKNOWN
IEC205I ERRORS,RFSAIAS2,AIAANAL2,FILESEQ=1, COMPLETE VOLUME LIST,
DSN=AIA.RFSAIA.ERRORS.G0788V00,VOLS=710825,TOTALBLOCKS=350
IEF472I RFSAIAS2 AIAANAL2 STEP1 - COMPLETION CODE - SYSTEM=000 USER=4087 REASON=00000007
|
People in the forum have helped me many times in difficult situations. I am hoping that I will again get recommendations about what to do here. Please help. |
|
|
|
shankarm
Active User
Joined: 17 May 2010 Posts: 175 Location: India
|
|
|
|
I also tried to recompile the program, as I thought this could be an S0C4.
The data below is available in the sysdump:
PSW AT ENTRY TO ABEND 078D1000 86BDFDBC ILC 02 INTC 000D
PSW LOAD MODULE ADDRESS = 06B5F000 OFFSET = 00080DBC
NAME=CEEPLPKA
ASCB: 00F6B200
I recompiled with the LIST option but was not able to find the address or offset.
There is nothing in CAIprint. Do I have to use some CA optimizer options to locate this? |
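The offset in the sysdump can be reproduced by hand, which also suggests why it does not show up in the compile listing: it is relative to the load module named in the dump (CEEPLPKA, which appears to be a Language Environment module), not to the recompiled program. A minimal Python sketch of the arithmetic, using only the values shown above. The function name is illustrative, and the masking assumes the high-order bit of the PSW address word is the AMODE 31 flag rather than part of the address.

```python
AMODE31_BIT = 0x80000000  # assumed: high-order bit flags 31-bit addressing mode


def offset_in_module(psw_address_word: int, load_point: int) -> int:
    """Return the offset of the PSW address within the load module."""
    address = psw_address_word & ~AMODE31_BIT  # strip the addressing-mode flag
    return address - load_point


# Values copied from the sysdump lines above.
offset = offset_in_module(0x86BDFDBC, 0x06B5F000)
print(f"{offset:08X}")  # compare with OFFSET = 00080DBC in the dump
```

An offset of this size into a system module will not match any offset in the LIST output of the application program, so the listing is not the place to look for it.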
|
|
|
Robert Sample
Global Moderator
Joined: 06 Jun 2008 Posts: 8700 Location: Dubuque, Iowa, USA
|
|
|
|
Quote: |
My production job, which has been running for more than 20 years, is failing |
Who cares how long the program ran without failing? What matters now is that SOMETHING CHANGED and the program is now failing.
The Messages and Codes manual states what to do for the 219 message:
Quote: |
System Action: The system issues an SVC Dump, writes a software error record to the logrec data set, and the task is ended.
Operator Response: Start a generalized trace facility (GTF) trace, and re-create the problem. Reply to message AHL100A with:
TRACE=SYS,USR,SLIP
On the DD statement for the data set in error, specify:
DCB=DIAGNS=TRACE
Application Programmer Response: Make sure that your program does not alter the DCB or IOB during processing of SVC 25.
System Programmer Response: If the error recurs and the program is not in error, look at the messages in the job log for more information. Search problem reporting data bases for a fix for the problem. If no fix exists, contact the IBM Support Center. Provide the JCL, the program listing for the job, and the logrec data set error record. |
while the CEE3204S message in the manual indicates
Quote: |
CEE3204S The system detected a protection exception (System Completion Code=0C4).
Explanation: Your program attempted to access a storage location to which it was not authorized.
Programmer Response: Check your application for these common errors:
Using the wrong AMODE to reference storage
Trying to use a pointer that has not been set
Trying to store data into storage reserved for the system
Using an invalid index to an array
See a Principles of Operation manual for a full list of protection exceptions.
System Action: The thread is terminated.
Symbolic Feedback Code: CEE344 |
If you have tried the diagnostics in the 219 message, then the next step is to contact IBM and open a PMR. |
|
|
|
shankarm
Active User
Joined: 17 May 2010 Posts: 175 Location: India
|
|
|
|
Thanks Robert. I have a question,
If this is an S0C4, then where can I find the address and offset? No data is available in the sysout.
As I mentioned in the previous post, I tried to use the LIST compiler option to find the exact location, but the offset and address shown in the sysdump do not appear in the compile listing. |
|
|
|
dick scherrer
Moderator Emeritus
Joined: 23 Nov 2006 Posts: 19243 Location: Inside the Matrix
|
|
|
|
Hello,
Keep in mind that when "things" change on a system, old modules may fail.
Was the JCL changed recently?
Suggest you check with the system support or Configuration Management group (if there is one) to learn if there have been any upgrades or fixes applied since the program last ran successfully.
Suggest you re-compile the program into a test loadlib using a different load module name and see if the newly created test module will:
a. compile/link successfully
b. execute successfully |
|
|
|
dick scherrer
Moderator Emeritus
Joined: 23 Nov 2006 Posts: 19243 Location: Inside the Matrix
|
|
|
|
Hello,
Quote: |
If this is an S0C4, then where can I find the address and offset? No data is available in the sysout. |
Quite possibly the program has "walked on storage" and generated an invalid address. |
|
|
|
shankarm
Active User
Joined: 17 May 2010 Posts: 175 Location: India
|
|
|
|
When you say "walked on storage", does it mean that the program has used up all the memory allocated to it and generated an invalid address?
If yes, what are the possible fixes for this?
I already tried running the program with REGION=0K, but it is still failing. |
|
|
|
dick scherrer
Moderator Emeritus
Joined: 23 Nov 2006 Posts: 19243 Location: Inside the Matrix
|
|
|
|
Hello,
"Walked on storage" means the code caused data to be moved to some unintended address that was still valid, so nothing failed at the time of the move. When the corrupted data (which was supposed to contain an address) is later used, the 0C4 can occur.
Again, you need to identify what has changed since the last successful execution.
Have you made the test program, compiled it, and run a test?
If not, suggest you do so now. |
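To make "walked on storage" concrete, here is a small Python simulation. It is only a sketch under invented assumptions: the layout (a 4-entry table followed directly by an address field), the names, and the values are hypothetical and are not taken from the failing program. It shows that the bad store succeeds quietly and only the later use of the overwritten address would fail.

```python
# Hypothetical layout: a 4-entry table of 4-byte items, followed directly by
# a 4-byte field that the program later uses as an address.
TABLE_START = 0
ENTRY_SIZE = 4
TABLE_ENTRIES = 4
POINTER_OFFSET = TABLE_START + ENTRY_SIZE * TABLE_ENTRIES  # byte 16


def store_entry(storage: bytearray, index: int, value: bytes) -> None:
    """Store one table entry with no bounds check, like unchecked subscripting."""
    start = TABLE_START + index * ENTRY_SIZE
    storage[start:start + ENTRY_SIZE] = value


def read_pointer(storage: bytearray) -> int:
    """Read the 4-byte big-endian address that follows the table."""
    return int.from_bytes(storage[POINTER_OFFSET:POINTER_OFFSET + 4], "big")


storage = bytearray(POINTER_OFFSET + 4)
storage[POINTER_OFFSET:] = (0x0002B340).to_bytes(4, "big")  # an invented address

# A fifth entry in a 4-entry table (index 4) lands on the address field.
# The store itself works, because the target is still storage the program owns.
store_entry(storage, 4, b"\x40\x40\x40\x40")  # four EBCDIC blanks

corrupted = read_pointer(storage)
print(f"{corrupted:08X}")  # the address field now holds blanks, not an address
```

Nothing fails at the moment of the overwrite, which is why the abend can show up many records later and in a module that did nothing wrong.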
|
|
|
shankarm
Active User
Joined: 17 May 2010 Posts: 175 Location: India
|
|
|
|
I have compiled and run the job. It successfully processes 2.3 million records, and then the job fails. |
|
|
|
shankarm
Active User
Joined: 17 May 2010 Posts: 175 Location: India
|
|
|
|
I thought this could be an issue with one particular record, so I skipped that record in the input, but the job fails again. |
|
|
|
shankarm
Active User
Joined: 17 May 2010 Posts: 175 Location: India
|
|
|
|
There have been no changes to this for at least 2 years. I am sure about that. |
|
|
|
dbzTHEdinosauer
Global Moderator
Joined: 20 Oct 2006 Posts: 6966 Location: porcelain throne
|
|
|
|
Instead of skipping the record (after 2.3 million), skip the 2.3 million and then run. |
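The idea is to separate "bad record near the abend point" from "state built up over millions of records". If the shortened input still abends, the data near the end is suspect. If it runs clean, the failure probably depends on something accumulated from the earlier records, such as a table filling up. On z/OS this would normally be a utility copy step that skips the leading records, not a program. The Python sketch below only illustrates the logic, with an invented function name and ten stand-in records in place of 2.3 million.

```python
from itertools import islice
from typing import Iterable, Iterator


def records_from(records: Iterable[bytes], skip: int) -> Iterator[bytes]:
    """Yield the input records after discarding the first `skip` of them."""
    return islice(records, skip, None)


# Stand-in input: ten numbered records instead of 2.3 million.
sample = [f"REC{n:04d}".encode() for n in range(1, 11)]
remaining = list(records_from(sample, skip=7))
print(remaining)
```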
|
|
|
dbzTHEdinosauer
Global Moderator
Joined: 20 Oct 2006 Posts: 6966 Location: porcelain throne
|
|
|
|
Is this one COBOL program,
tape in, tape out?
Any COBOL internal tables?
No CALLs to other modules? |
|
|
|
dick scherrer
Moderator Emeritus
Joined: 23 Nov 2006 Posts: 19243 Location: Inside the Matrix
|
|
|
|
Hello,
Quote: |
There have been no changes to this for at least 2 years. I am sure about that. |
Possibly not, but that has nothing to do with what I mentioned earlier.
Something HAS changed somewhere. It could be the data or ANY of the other possibilities mentioned above. There is also the chance that the problem has been in the code all along and just never caused a problem until now.
Be suspicious of any arrays or called modules (as DBZ mentioned). |
|
|
|
shankarm
Active User
Joined: 17 May 2010 Posts: 175 Location: India
|
|
|
|
This is a COBOL program and it calls many modules. It has 5 internal tables loaded. |
|
|
|
rajesh1183
New User
Joined: 07 Jan 2008 Posts: 98 Location: Hyderabad
|
|
|
|
I hope you have gone through QW for U4087. If not:
Code: |
Explanation: A recursive error was detected. A condition was raised,
causing the number of nested conditions to exceed the limit set by the
DEPTHCONDLMT option. The reason code indicates which subcomponent or
process was active when the exception was detected.
|
Code: |
Reason code Explanation
X'07' (7) While Language Environment was trying to output a message, a
subsequent condition was raised.
|
Code: |
Programmer Response: In the case of CEEHDLR routine, recursion can occur
when you use the DEPTHCONDLMT run-time option. It may be helpful to
generate a system dump of the original error by using run-time options
TERMTHDACT(UAIMM) and TRAP(ON,NOSPIE).
|
Not sure how much it helps you, though. |
|
|
|
Nic Clouston
Global Moderator
Joined: 10 May 2007 Posts: 2454 Location: Hampshire, UK
|
|
|
|
Maybe one of your internal tables has overflowed. Do you have bounds checking on them? |
|
|
|
Bill Woodger
Moderator Emeritus
Joined: 09 Mar 2011 Posts: 7309 Location: Inside the Matrix
|
|
|
|
Code: |
+CEE0374C CONDITION=CEE3204S TOKEN=00030C84 59C3C5C5 00000000 933
WHILE RUNNING PROGRAM IGG019BP
+CEE0374C CONDITION=CEE3206S TOKEN=00030C86 59C3C5C5 00000000 934
WHILE RUNNING PROGRAM CEEBINIT |
You have failed in a system module, via CEEBINIT.
CEEBINIT is used when you call a Cobol program.
As others have suggested, something has done your storage in somewhere, sufficiently to knock over the IGG019BP.
The current data that you are using or possibly the immediately previous data or, if you are unlucky, data some time earlier, has caused your problem.
If possible, try to run with SSRANGE on, which should check the tables for overflow. Otherwise it is likely one of the called modules that is doing something.
dbz's suggestion is useful to you. If you shorten your file but leave about 1000 before the abend, it may help you to track it down.
I would go back through the calling chain and see what program was called to get to the abend. Fortunately even the IBM routines follow the call/save conventions, so it is just some work going back through the dump.
When you find the module call, that might not be the one causing the problem. Storage has been overwritten at some point, but that might be earlier than immediately previous to that, as the abend won't occur until you happen to have something try to use the corrupted storage as instructions. |
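Going back through the calling chain relies on the standard save area layout: the word at offset 4 points to the caller's save area, offset X'0C' holds the saved register 14 (the return address into the routine that owns the save area) and offset X'10' holds the saved register 15 (the entry point of the routine it called). A rough Python sketch of that walk follows. The storage contents are invented for illustration and do not come from this dump, and a real dump would also need the AMODE bit masked off the saved addresses.

```python
BACK_CHAIN = 0x04  # offset of the caller's save area address
SAVED_R14 = 0x0C   # offset of the saved return address (register 14)
SAVED_R15 = 0x10   # offset of the saved entry point (register 15)


def word(storage, address):
    """Read one big-endian fullword from a dict of address -> 4 bytes."""
    return int.from_bytes(storage[address], "big")


def walk_save_areas(storage, save_area, limit=50):
    """Follow back-chain pointers, returning (save area, R14, R15) per level."""
    chain = []
    while save_area != 0 and len(chain) < limit:  # limit guards against loops
        chain.append((save_area,
                      word(storage, save_area + SAVED_R14),
                      word(storage, save_area + SAVED_R15)))
        save_area = word(storage, save_area + BACK_CHAIN)
    return chain


def fake_save_area(storage, address, back, r14, r15):
    """Plant an invented save area in the simulated storage."""
    for offset, value in ((BACK_CHAIN, back), (SAVED_R14, r14), (SAVED_R15, r15)):
        storage[address + offset] = value.to_bytes(4, "big")


storage = {}
fake_save_area(storage, 0x3000, back=0x2000, r14=0x00012346, r15=0x00015000)
fake_save_area(storage, 0x2000, back=0x1000, r14=0x00010122, r15=0x00012000)
fake_save_area(storage, 0x1000, back=0, r14=0x00FD0000, r15=0x00010000)

for area, r14, r15 in walk_save_areas(storage, 0x3000):
    print(f"save area {area:08X}  return {r14:08X}  entry {r15:08X}")
```

Each entry point found this way can then be matched against the load module map to name the module at that level of the chain.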
|
|
|
dbzTHEdinosauer
Global Moderator
Joined: 20 Oct 2006 Posts: 6966 Location: porcelain throne
|
|
|
|
What are the considerations (reasons) for adding a new item to any one of the several tables?
Has any business definition changed recently?
For example, new departments, or additional somethings that could affect the way an incoming record would be stored in your internal tables. |
|
|
|
Anuj Dhawan
Superior Member
Joined: 22 Apr 2006 Posts: 6248 Location: Mumbai, India
|
|
|
|
You need to give us something concrete to work on, as you're getting a user abend, which might mean anything. There is a distant possibility that USER COMPLETION CODE=4087 REASON CODE=00000007 was issued by IMS, if IMS is involved. Having said that, given what Robert indicates about CEE3204S, I suspect it can be as trivial as not using an index properly.
Do you have ODO in your COBOL program?
Suggest, as one of the options, you compile the program with the SSRANGE option and execute it again. DISPLAY all the INDEXes and SUBSCRIPTs and check if they are within the permissible range. |
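The point of range checking is that the run stops at the statement that uses the bad subscript, not at some later victim of the overwrite. The Python sketch below imitates that behaviour. It is not COBOL and not how SSRANGE is implemented; the class name, the table size of 3, and the department codes are all invented for illustration.

```python
class CheckedTable:
    """Fixed-size table that refuses out-of-range subscripts (1-based, as in COBOL)."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.entries = [None] * max_entries

    def store(self, subscript, value):
        # Reject the store up front instead of letting it land on other storage.
        if not 1 <= subscript <= self.max_entries:
            raise IndexError(f"subscript {subscript} outside 1..{self.max_entries}")
        self.entries[subscript - 1] = value


table = CheckedTable(max_entries=3)  # hypothetical table size
failed_at = None
for subscript, code in enumerate(["D01", "D02", "D03", "D04"], start=1):
    try:
        table.store(subscript, code)
    except IndexError as error:
        failed_at = subscript  # the first subscript that did not fit
        print(error)
```

With a check like this, the fourth code is reported as the failure, which points straight at the table that has outgrown its definition.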
|
|
|
Bill Woodger
Moderator Emeritus
Joined: 09 Mar 2011 Posts: 7309 Location: Inside the Matrix
|
|
|
|
Looking a bit further, the CEEBINIT and the IGG are probably "artefacts", the original abend causing subsequent abends as LE tries to "clear up".
Code: |
IEA995I SYMPTOM DUMP OUTPUT 941
USER COMPLETION CODE=4087 REASON CODE=00000007
TIME=20.20.32 SEQ=15804 CPU=0000 ASID=0091
PSW AT TIME OF ERROR 078D1000 86BDFDBC ILC 2 INTC 0D
NAME=UNKNOWN
DATA AT PSW 06BDFDB6 - 00181610 0A0DA7F4 001C1811
AR/GR 0: 00000000/84000000 1: 00000000/84000FF7
2: 00000000/00000007 3: 00000000/00031038
4: 00000000/06C1C128 5: 00000000/06C1C2A0
6: 00000000/0002B340 7: 00000000/0002B7F0
8: 00000000/80000000 9: 00000000/00041F9E
A: 00000000/00000001 B: 00000000/86BDFCE8
C: 00000000/0002CB88 D: 00000000/0003FFA0
E: 00000000/8003204A F: 01000002/00000007 |
The above is probably what you want to concentrate on.
The PSW looks "odd".
Register 8 "looks like" the last parameter passed to a Cobol program had no address. But with storage overwriting this needn't matter necessarily (as anything could be happening).
You dropped your current record and still abended, so it is a prior record causing the problem. "Eyeball" the file (see if records "look" consistent, check any occurrence values) and the dump (look for "repeating" storage, if you find it, follow it backwards to where it starts and work out what module that is).
As Anuj said, we have little really to go on. If still stuck (after some sleep/someone else taken over) tell us about the "files" the program is reading, how many modules are called and any unanswered questions from above. I think we can ignore the "recursive" bit :-) |
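One way to read the PSW and the DATA AT PSW line in the symptom dump is to pick out the instruction that ended at the PSW address. This is a sketch in Python using only the values shown above. The function name is illustrative, and it assumes the ILC is the instruction length in bytes and that the high-order bit of the PSW address word is the AMODE 31 flag.

```python
def instruction_before_psw(data_address, data_hex, psw_address, ilc):
    """Return the bytes of the instruction that ended at the PSW address."""
    data = bytes.fromhex(data_hex)
    start = (psw_address - ilc) - data_address  # position inside the data area
    return data[start:start + ilc]


# Values copied from the symptom dump above.
instruction = instruction_before_psw(
    data_address=0x06BDFDB6,
    data_hex="00181610 0A0DA7F4 001C1811",
    psw_address=0x86BDFDBC & 0x7FFFFFFF,  # drop the assumed AMODE 31 flag
    ilc=2,
)
opcode, operand = instruction
print(f"{opcode:02X}{operand:02X}")
```

X'0A' is the SVC opcode and SVC 13 is ABEND, which agrees with INTC 0D. So the PSW most likely points just past the place where the abend was requested, not at the code that damaged storage in the first place.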
|
|
|
Robert Sample
Global Moderator
Joined: 06 Jun 2008 Posts: 8700 Location: Dubuque, Iowa, USA
|
|
|
|
This particular problem is one of the class of problems that is rarely possible for an application programmer to solve alone. Usually resolution requires generating a trace and reading that trace to determine what stepped on storage and when -- and that could have happened before the first record was processed.
You need to talk to your site support group and get a system programmer involved. Otherwise, the only realistic option to resolve the issue is to contact IBM and open a PMR -- and IBM will certainly want to see the trace as mentioned in the 219 error message text. |
|
|
|
Bill Woodger
Moderator Emeritus
Joined: 09 Mar 2011 Posts: 7309 Location: Inside the Matrix
|
|
|
|
Robert Sample wrote: |
This particular problem is one of the class of problems that is rarely possible for an application programmer to solve alone. [...]
|
I eat 'em for breakfast :-)
Good advice, though, Robert. After 10 hours there must have been some sort of "escalation".
Takes our fun away, but Production waits for no-one... |
|
|
|
enrico-sorichetti
Superior Member
Joined: 14 Mar 2007 Posts: 10888 Location: italy
|
|
|
|
I wonder....
after 20 years ( according to the TS ) the criteria used for the original design should be reviewed
the amount of data processed in a reasonably healthy organization must have grown significantly after 20 years.
if I had money to spend I would bet on an internal table overflow!
and consequent program/working storage corruption |
|
|
|
|