IBM Mainframe Forum Index
 
Batch job tuning


IBM Mainframe Forums -> Testing & Performance
sgandhla

New User


Joined: 23 Mar 2017
Posts: 3
Location: USA

PostPosted: Fri Mar 24, 2017 9:41 pm

Hi Everyone,

This is my first post in the forum. I have spent a month analyzing an issue in my new role without any luck, and I hope to find some assistance here.

We have a batch job hosted at two different locations, both running on EC12 machines. Every day we recycle the job at 15:00 GMT, so the job is down from 11:00 to 15:00. When we restart it, one site is able to process ~15 million txn per 15-minute interval while the other site processes ~3 million. By the end of the day both have processed the same number of txn (~200 million); the slow-running LPAR catches up after about 2 hours, at which point it is able to process ~10 million in 15 minutes.

So far I have looked at the system loads at that time (both are running at less than 50%) and at the WLM policies (they are the same). The job is exactly the same on both sides, as per the application team. I changed the weights of the LPAR to get more vertical highs, without any improvement in performance. I pulled numbers from the SMFINTRV member, which show that both consume the same CPU time, but the LPAR that processes slowly has a higher I/O time than the other one. As one more attempt, we made a WLM change to the slow-running LPAR's service class to increase I/O priority to high.

The one unsolved puzzle: while the job is processing the lower number of txn, it goes to DW status and does nothing for at least 10 minutes out of every 15 when I watch it in real time in SDSF.
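For reference, the restart-window figures above can be put on a common per-second footing. The snippet below is only a back-of-the-envelope sketch using the numbers quoted in this post. The function name is made up for illustration, and the last calculation assumes the job sits idle in DW for exactly 10 of every 15 minutes, which is only the lower bound seen in SDSF.

```python
INTERVAL_SECONDS = 15 * 60  # one 15-minute reporting interval


def txn_per_second(txn_count, seconds=INTERVAL_SECONDS):
    """Average transactions per second over the given period."""
    return txn_count / seconds


# Figures quoted in the post for the first intervals after restart.
fast_site = txn_per_second(15_000_000)
slow_site = txn_per_second(3_000_000)
ratio = fast_site / slow_site

# Assumption: the slow job is idle (DW) for 10 of the 15 minutes,
# so all 3 million txn are processed in the remaining 5 minutes.
active_seconds = 5 * 60
slow_site_while_active = txn_per_second(3_000_000, active_seconds)

print(f"fast site: {fast_site:,.0f} txn/s")
print(f"slow site: {slow_site:,.0f} txn/s (ratio {ratio:.1f}:1)")
print(f"slow site while active: {slow_site_while_active:,.0f} txn/s")
```

Under that assumption the slow site would be running at about 10,000 txn/s whenever it is actually working, against roughly 16,700 txn/s at the fast site, so the waiting would explain much of the gap but not all of it.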
Robert Sample

Global Moderator


Joined: 06 Jun 2008
Posts: 8700
Location: Dubuque, Iowa, USA

PostPosted: Thu Mar 30, 2017 12:41 am

IF the LPARs are defined the same (same memory, same processor weight, and so forth) I'd look at the I/O situation. Look at the SMF record type 70 and 72 records for each LPAR to see their I/O and channel stats.
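One way to make that comparison systematic, once the interval data for both LPARs has been reduced to a handful of numbers, is to compute the slow-to-fast ratio for each metric and sort by it. This is a minimal sketch only: the function name, metric names and values are placeholders invented for illustration, not fields from any SMF record layout and not measurements from this thread. Note also that RMF writes channel path activity to SMF type 73 and device activity to type 74, so those may be worth pulling alongside types 70 and 72.

```python
def flag_differences(fast, slow, threshold=1.5):
    """Return (metric, fast, slow, ratio) for metrics where slow/fast >= threshold.

    `fast` and `slow` map a metric name to a number for the same interval.
    Metrics missing from either side, or zero on the fast side, are skipped.
    """
    flagged = []
    for name in sorted(fast.keys() & slow.keys()):
        if fast[name] <= 0:
            continue
        ratio = slow[name] / fast[name]
        if ratio >= threshold:
            flagged.append((name, fast[name], slow[name], ratio))
    # Largest relative difference first.
    return sorted(flagged, key=lambda row: row[3], reverse=True)


# Placeholder values purely to show the call; not real measurements.
fast_lpar = {"metric_a_ms": 1.0, "metric_b_ms": 0.5, "metric_c_ms": 0.2}
slow_lpar = {"metric_a_ms": 1.1, "metric_b_ms": 2.0, "metric_c_ms": 0.4}

for name, f, s, r in flag_differences(fast_lpar, slow_lpar):
    print(f"{name}: fast={f} slow={s} ratio={r:.1f}")
```

The threshold of 1.5 is arbitrary; the point is simply to surface the metrics that differ most between the two sides so they can be investigated first.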
sgandhla

New User


Joined: 23 Mar 2017
Posts: 3
Location: USA

PostPosted: Thu Mar 30, 2017 1:28 am

Hi Robert,
Thanks for the response. The weights are different: we have fewer LPARs in the CEC where we get the faster response, but both CECs are running at less than 50%. I looked at the type 70 and 72 records, and the channel stats look fine. In the I/O stats, over a day the CPU time remains the same, but the I/O time is 50% higher at the slow-processing site. We are using the same amount of flash drives on both sides.
I also ran a Strobe report to see if there are any issues on the application side. It shows IEAVEWAT (wait service) at 88.48% in a 10-minute sample. I am stuck again, since this module is unknown to me. Google didn't help much beyond giving some information about cross-memory references, linkage, or I/O interrupts.
Robert Sample

Global Moderator


Joined: 06 Jun 2008
Posts: 8700
Location: Dubuque, Iowa, USA

PostPosted: Thu Mar 30, 2017 2:33 am

Quote:
weights are different we have less number of LPARS in the CEC where we get faster response, but both the CEC'S are running less than 50%
If the CEC are the same machine / model, then different weights automatically imply performance will be different between the two LPARs. If the machine / model are different for the two CEC's then you'd have to look at the weighted LPAR for each machine to make any kind of valid comparison.

And the CEC running less than 50% means what? For example, if the LPAR is capped and the other LPARs are running very low utilizations while the LPAR in question is running 100% CPU utilization then the CEC utilization being under 50% would mean absolutely nothing since the 100% LPAR utilization is what would matter.
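To put numbers on that example, here is a small sketch. All LPAR names, processor counts and busy figures are hypothetical, chosen only to show how a CEC can look lightly loaded while one capped LPAR has nothing left to give.

```python
def cec_utilization(busy_cps_by_lpar, physical_cps):
    """Fraction of the physical processors in use across all LPARs."""
    return sum(busy_cps_by_lpar.values()) / physical_cps


def lpar_utilization(busy_cps, capacity_cps):
    """Fraction of the LPAR's own available capacity in use."""
    return busy_cps / capacity_cps


# Hypothetical CEC: 10 physical CPs shared by three LPARs.
physical_cps = 10
busy = {"LPARA": 2.0, "LPARB": 1.0, "LPARC": 0.5}  # CPs' worth of work

# Hypothetical cap: LPARA is limited to 2 CPs' worth of capacity.
lpara_cap = 2.0

cec = cec_utilization(busy, physical_cps)
lpara = lpar_utilization(busy["LPARA"], lpara_cap)

print(f"CEC utilization:   {cec:.0%}")
print(f"LPARA utilization: {lpara:.0%}")
```

With these made-up figures the CEC is well under 50% busy while LPARA is at 100% of what it is allowed to use, which is why the CEC number alone says little about a single LPAR.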

I think you're going down the wrong path looking at application performance with STROBE. The difference in I/O rates (a 5:1 ratio between the 2 LPARs) in the system configuration is significant; application performance is not likely to be relevant with such a difference. Something is not the same between the LPARs (WLM policy, channels, I/O paths, or whatever) to have such an impact on performance. You may have to start with the IODF for each LPAR and look at everything to find the reason for the difference, but it seems extremely likely that there is something making a difference.
sgandhla

New User


Joined: 23 Mar 2017
Posts: 3
Location: USA

PostPosted: Thu Mar 30, 2017 3:19 am

Robert Sample wrote:
Quote:
weights are different we have less number of LPARS in the CEC where we get faster response, but both the CEC'S are running less than 50%
If the CEC are the same machine / model, then different weights automatically imply performance will be different between the two LPARs. If the machine / model are different for the two CEC's then you'd have to look at the weighted LPAR for each machine to make any kind of valid comparison.

And the CEC running less than 50% means what? For example, if the LPAR is capped and the other LPARs are running very low utilizations while the LPAR in question is running 100% CPU utilization then the CEC utilization being under 50% would mean absolutely nothing since the 100% LPAR utilization is what would matter.



None of the LPARs are capped in either CEC. When I say the CECs are running at 50%, I mean we have 50% headroom in terms of CPU for the LPARs to expand into if they have demand. I also adjusted the weights on both sides to get the same number of vertical highs and so maintain the same polarization. With all of this, the only difference I see between the two sides is in I/O time. I am pretty new to storage performance tuning. Is there something you can suggest on where to start with storage performance stats, and what to look for to find smoking guns? I will start looking from the bottom up on both systems.
Robert Sample

Global Moderator


Joined: 06 Jun 2008
Posts: 8700
Location: Dubuque, Iowa, USA

PostPosted: Thu Mar 30, 2017 4:31 am

Quote:
we have 50% wide space in terms of CPU for the LPARS to expand if they have demand
I am not sure what you mean by this. CEC utilization can be important in a heavily used system, but almost all the time the LPAR utilization is VASTLY more important to batch job performance. Does your site have MXG or MICS or another SMF analysis tool? If so, look at this data rather than the raw SMF records since the raw data needs a lot of work to be usable. What is the LPAR utilization during this time period?

There has to be something different in the CEC / LPAR definitions to see such a radical difference in I/O performance. The hard part is figuring out what that difference is!
Pete Wilson

Active Member


Joined: 31 Dec 2009
Posts: 592
Location: London

PostPosted: Thu Aug 19, 2021 12:42 pm

When you say the jobs on the different LPARs are exactly the same, what does that mean? Are they creating or updating their own individual datasets, or are they sharing access to common datasets? What type of datasets are you dealing with? Please explain in more detail what the jobs actually do, otherwise we can only speculate about what the issues might be. Are both LPARs part of the same physical CEC and SMSplex, sharing common DASD, tape, etc.? Is the linklist the same on both LPARs? Is the COFVLFnn parmlib member the same on both LPARs? What other competing workload is there on the 'slow' LPAR when it is slow?
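For the "is it the same on both LPARs" questions, one low-tech approach is to copy the member from each system to a text file and diff the two. The sketch below assumes the members have already been downloaded as plain text. The function name, labels and sample lines are placeholders, and the optional `data_columns` truncation (for example, to drop sequence numbers at the end of each record) is an assumption you would need to check against the member in question.

```python
import difflib


def diff_members(text_a, text_b, name_a="LPAR A", name_b="LPAR B",
                 data_columns=None):
    """Return unified-diff lines between two members; empty list if identical.

    Trailing blanks are ignored. If data_columns is set, each record is
    truncated to that many columns before comparing.
    """
    def normalise(text):
        lines = []
        for line in text.splitlines():
            if data_columns is not None:
                line = line[:data_columns]
            lines.append(line.rstrip())
        return lines

    return list(difflib.unified_diff(
        normalise(text_a), normalise(text_b),
        fromfile=name_a, tofile=name_b, lineterm=""))


# Placeholder member contents, only to show the call.
member_fast = "PLACEHOLDER STATEMENT 1\nPLACEHOLDER VALUE(100)\n"
member_slow = "PLACEHOLDER STATEMENT 1\nPLACEHOLDER VALUE(200)\n"

for line in diff_members(member_fast, member_slow, "fast LPAR", "slow LPAR"):
    print(line)
```

An empty result means the two copies match after normalisation; anything else is a candidate for the "something is not the same" that the earlier replies point to.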
All times are GMT + 6 Hours