This is my first post in the forum. I have spent a month analyzing an issue in my new role without any luck, and I hope to find some assistance here.
We have a batch job hosted at two different sites, both running on zEC12 machines. Every day we recycle the job at 15:00 GMT, so the job is down from 11:00 to 15:00. When we restart it, one site is able to process ~15 million transactions per 15-minute interval while the other site can only process ~3 million. By the end of the day, though, both are able to process the same number of transactions (~200 million): the slow LPAR catches up after about 2 hours, at which point it is able to process ~10 million per 15 minutes.

So I looked at a few things. System loads at that time: both are running under 50%. WLM policies: the same. The job: exactly the same (per the application team). I changed the LPAR weights to get more Vertical Highs, without any improvement in performance. I pulled numbers from the SMFINTRV member, which shows both consume the same CPU time, but the LPAR that processes slowly has higher I/O time than the other one. As one more attempt, we made a WLM change to the slow LPAR's service class to raise its I/O priority to high. The one unsolved puzzle is that when the job is processing fewer transactions, it goes into DW status and does nothing for at least 10 minutes out of every 15 when I watch it in real time from SDSF.
Joined: 06 Jun 2008 Posts: 8700 Location: Dubuque, Iowa, USA
If the LPARs are defined the same (same memory, same processor weight, and so forth), I'd look at the I/O situation. Look at the SMF type 70 and type 72 records for each LPAR to see their I/O and channel stats.
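For anyone scripting that comparison: a minimal sketch, assuming the relevant fields have already been extracted from the SMF/RMF data with a reporting tool (MXG, MICS, or the RMF Postprocessor). The LPAR names, interval labels, and `chan_busy_pct` figures below are hypothetical placeholders, not real data.

```python
# Sketch: compare average channel-busy percentage between two LPARs.
# The record layout (lpar, interval, chan_busy_pct) is a hypothetical
# stand-in for fields you would pull out of SMF type 70/72 (and type 73
# channel) data with MXG, MICS, or the RMF Postprocessor.
from collections import defaultdict

records = [
    {"lpar": "LPARA", "interval": "15:00", "chan_busy_pct": 22.0},
    {"lpar": "LPARA", "interval": "15:15", "chan_busy_pct": 24.0},
    {"lpar": "LPARB", "interval": "15:00", "chan_busy_pct": 61.0},
    {"lpar": "LPARB", "interval": "15:15", "chan_busy_pct": 67.0},
]

def avg_busy(rows):
    """Average channel-busy percentage per LPAR across all intervals."""
    totals, counts = defaultdict(float), defaultdict(int)
    for r in rows:
        totals[r["lpar"]] += r["chan_busy_pct"]
        counts[r["lpar"]] += 1
    return {lpar: totals[lpar] / counts[lpar] for lpar in totals}

for lpar, pct in sorted(avg_busy(records).items()):
    print(f"{lpar}: {pct:.1f}% average channel busy")
```

A persistent gap like the one in this made-up sample would say the two sides are not driving their channels the same way, which is worth explaining before touching the application.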
Hi Robert,
Thanks for the response. The weights are different; we have fewer LPARs in the CEC where we get the faster response, but both CECs are running under 50%. I looked into the type 70 and 72 records: the channel stats look fine. In the I/O stats, over a day the CPU time remains the same but the I/O time is 50% higher at the slow-processing site. We are using the same number of flash drives on both sides.
I ran a Strobe report to see if there were any issues on the application side. It shows IEAVEWAT (wait service) at 88.48% in a 10-minute sample. Again I am stuck at this module, which is unknown to me, with no way to go further; Google didn't help much beyond some general information about cross-memory references, linkage, and I/O interrupts.
Joined: 06 Jun 2008 Posts: 8700 Location: Dubuque, Iowa, USA
Quote:
The weights are different; we have fewer LPARs in the CEC where we get the faster response, but both CECs are running under 50%.
If the CECs are the same machine/model, then different weights automatically imply that performance will be different between the two LPARs. If the machine/model is different for the two CECs, then you'd have to look at the weighted LPAR share on each machine to make any kind of valid comparison.
And the CEC running less than 50% means what? For example, if the LPAR in question is capped and running at 100% CPU utilization while the other LPARs are running very low utilizations, then the CEC utilization being under 50% would mean absolutely nothing, since the 100% LPAR utilization is what would matter.
I think you're going down the wrong path looking at application performance with STROBE. The difference in I/O rates (a 5:1 ratio between the two LPARs) at the system-configuration level is significant -- application performance is not likely to be relevant with such a difference. Something is not the same between the LPARs -- WLM policy, channels, I/O paths, or whatever -- to have such an impact on performance. You may have to start with the IODF for each LPAR and compare everything to find the reason for the difference, but it seems extremely likely that there is something making a difference.
Quote:
The weights are different; we have fewer LPARs in the CEC where we get the faster response, but both CECs are running under 50%.
If the CECs are the same machine/model, then different weights automatically imply that performance will be different between the two LPARs. If the machine/model is different for the two CECs, then you'd have to look at the weighted LPAR share on each machine to make any kind of valid comparison.
And the CEC running less than 50% means what? For example, if the LPAR in question is capped and running at 100% CPU utilization while the other LPARs are running very low utilizations, then the CEC utilization being under 50% would mean absolutely nothing, since the 100% LPAR utilization is what would matter.
None of the LPARs in either CEC are capped, and when I say the CECs are running at 50%, I mean we have 50% headroom in CPU terms for the LPARs to expand if they have demand. I have also adjusted weights on both sides to get the same number of Vertical Highs and maintain the same polarization. With all of that, the only difference I see between the two sides is the I/O time. I am pretty new to storage performance tuning. Is there something you can suggest about where to start with storage performance stats, and what to look at to find the smoking gun? I will start working through both systems from the bottom.
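One concrete place to start with storage stats is the RMF DASD Activity report (built from SMF type 74 data), where device response time breaks down as IOSQ + PEND + DISC + CONN; comparing those components between the two sides shows which layer differs. A minimal sketch of that comparison, with hypothetical millisecond values:

```python
# In RMF's DASD Activity report, device response time decomposes as
#   IOSQ (queuing in z/OS before the I/O starts)
# + PEND (delay reaching the device through channel / control unit)
# + DISC (disconnect time, e.g. cache miss or synchronous remote copy)
# + CONN (connect time, the actual data transfer).
# The millisecond values below are hypothetical, purely for illustration.
SAMPLES = {
    "fast_site": {"IOSQ": 0.1, "PEND": 0.2, "DISC": 0.3, "CONN": 0.4},
    "slow_site": {"IOSQ": 0.1, "PEND": 0.3, "DISC": 2.5, "CONN": 0.4},
}

def response_time(components):
    """Total device response time in milliseconds."""
    return sum(components.values())

for site, parts in SAMPLES.items():
    worst = max(parts, key=parts.get)
    print(f"{site}: {response_time(parts):.1f} ms total, "
          f"largest component {worst} ({parts[worst]:.1f} ms)")
```

Which component dominates points at different suspects: high IOSQ at z/OS-side queuing, high PEND at channel or control-unit contention, and high DISC at the disk subsystem itself (cache hit ratio, remote copy), rather than at the application.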
Joined: 06 Jun 2008 Posts: 8700 Location: Dubuque, Iowa, USA
Quote:
we have 50% headroom in CPU terms for the LPARs to expand if they have demand
I am not sure what you mean by this. CEC utilization can be important in a heavily used system, but almost all of the time the LPAR utilization is VASTLY more important to batch job performance. Does your site have MXG or MICS or another SMF analysis tool? If so, look at that data rather than the raw SMF records, since the raw data needs a lot of work to be usable. What is the LPAR utilization during this time period?
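As a worked illustration of why a CEC under 50% busy can still hide a saturated LPAR (all LPAR names and numbers below are hypothetical):

```python
# Hypothetical CEC with 10 physical CPs over one 900-second (15-minute)
# RMF interval. cpu_seconds is each LPAR's consumed physical-CP time.
INTERVAL_SEC = 900
PHYSICAL_CPS = 10

lpars = {
    "PROD": {"logical_cps": 3, "cpu_seconds": 2650},  # nearly saturated
    "TEST": {"logical_cps": 4, "cpu_seconds": 400},
    "DEV":  {"logical_cps": 3, "cpu_seconds": 300},
}

def lpar_util(lpar):
    """Utilization of an LPAR relative to its own logical CPs."""
    return lpar["cpu_seconds"] / (lpar["logical_cps"] * INTERVAL_SEC)

# CEC utilization: total consumed CP time over total physical capacity.
cec_util = sum(l["cpu_seconds"] for l in lpars.values()) / (PHYSICAL_CPS * INTERVAL_SEC)

print(f"CEC : {cec_util:.0%} busy")            # ~37%: looks comfortable
for name, l in lpars.items():
    print(f"{name}: {lpar_util(l):.0%} busy")  # PROD is ~98%: the real story
```

The CEC figure on its own looks comfortable, yet the batch job's home LPAR has essentially no headroom, which is exactly the situation the quoted paragraph warns about.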
There has to be something different in the CEC / LPAR definitions to see such a radical difference in I/O performance. The hard part is figuring out what that difference is!
When you say the jobs on the different LPARs are exactly the same, what does that mean? Are they creating or updating their own individual datasets, or are they sharing access to common datasets? What type of datasets are you dealing with? Please explain in more detail what the jobs actually do; otherwise we can only speculate about what the issues might be. Are both LPARs part of the same physical CEC and SMSplex, sharing common DASD, tape, etc.? Is the linklist the same on both LPARs? Is the COFVLFxx parmlib member the same on both LPARs? What other competing workload is there on the 'slow' LPAR when it is slow?