This is my first post in the forum. I have spent a month analyzing an issue in my new role without any luck, and I hope to find some assistance here.
We have a batch job hosted at two different sites, both running on zEC12 machines. Every day we recycle the job at 15:00 GMT, so the job is down from 11:00 to 15:00. When we restart it, one site is able to process ~15 million transactions per 15-minute interval while the other site can only process ~3 million. By the end of the day, though, both are able to process the same number of transactions (~200 million): the slow LPAR catches up after about 2 hours, at which point it is able to process ~10 million per 15 minutes.

So I looked at a few things. System loads at that time: both are running under 50%. WLM policies: the same. The job: exactly the same (per the application team). I changed the LPAR weights to get more Vertical Highs, without any improvement in performance. I pulled numbers from the SMFINTRV member, which shows both consume the same CPU time, but the LPAR that processes slowly has higher I/O time than the other one. As one more attempt, we made a WLM change to the slow LPAR's service class to raise its I/O priority to high. The one unsolved puzzle is that when the job is processing fewer transactions, it goes into DW status and does nothing for at least 10 minutes out of every 15 when I watch it in real time from SDSF.
Joined: 06 Jun 2008 Posts: 8700 Location: Dubuque, Iowa, USA
If the LPARs are defined the same (same memory, same processor weight, and so forth), I'd look at the I/O situation. Look at the SMF type 70 and type 72 records for each LPAR to see their I/O and channel stats.
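For anyone scripting that comparison: a minimal sketch, assuming the relevant fields have already been extracted from the SMF/RMF data with a reporting tool (MXG, MICS, or the RMF Postprocessor). The LPAR names, interval labels, and `chan_busy_pct` figures below are hypothetical placeholders, not real data.

```python
# Sketch: compare average channel-busy percentage between two LPARs.
# The record layout (lpar, interval, chan_busy_pct) is a hypothetical
# stand-in for fields you would pull out of SMF type 70/72 (and type 73
# channel) data with MXG, MICS, or the RMF Postprocessor.
from collections import defaultdict

records = [
    {"lpar": "LPARA", "interval": "15:00", "chan_busy_pct": 22.0},
    {"lpar": "LPARA", "interval": "15:15", "chan_busy_pct": 24.0},
    {"lpar": "LPARB", "interval": "15:00", "chan_busy_pct": 61.0},
    {"lpar": "LPARB", "interval": "15:15", "chan_busy_pct": 67.0},
]

def avg_busy(rows):
    """Average channel-busy percentage per LPAR across all intervals."""
    totals, counts = defaultdict(float), defaultdict(int)
    for r in rows:
        totals[r["lpar"]] += r["chan_busy_pct"]
        counts[r["lpar"]] += 1
    return {lpar: totals[lpar] / counts[lpar] for lpar in totals}

for lpar, pct in sorted(avg_busy(records).items()):
    print(f"{lpar}: {pct:.1f}% average channel busy")
```

A persistent gap like the one in this made-up sample would say the two sides are not driving their channels the same way, which is worth explaining before touching the application.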
Hi Robert,
Thanks for the response. The weights are different; we have fewer LPARs in the CEC where we get the faster response, but both CECs are running under 50%. I looked into the type 70 and 72 records: the channel stats look fine. In the I/O stats, over a day the CPU time remains the same but the I/O time is 50% higher at the slow-processing site. We are using the same number of flash drives on both sides.
I ran a Strobe report to see if there were any issues on the application side. It shows IEAVEWAT (wait service) at 88.48% in a 10-minute sample. Again I am stuck at this module, which is unknown to me, with no way to go further; Google didn't help much beyond some general information about cross-memory references, linkage, and I/O interrupts.
Joined: 06 Jun 2008 Posts: 8700 Location: Dubuque, Iowa, USA
Quote:
The weights are different; we have fewer LPARs in the CEC where we get the faster response, but both CECs are running under 50%.
If the CECs are the same machine/model, then different weights automatically imply that performance will be different between the two LPARs. If the machine/model is different for the two CECs, then you'd have to look at the weighted LPAR share on each machine to make any kind of valid comparison.
And the CEC running less than 50% means what? For example, if the LPAR in question is capped and running at 100% CPU utilization while the other LPARs are running very low utilizations, then the CEC utilization being under 50% would mean absolutely nothing, since the 100% LPAR utilization is what would matter.
I think you're going down the wrong path looking at application performance with STROBE. The difference in I/O rates (a 5:1 ratio between the two LPARs) at the system-configuration level is significant -- application performance is not likely to be relevant with such a difference. Something is not the same between the LPARs -- WLM policy, channels, I/O paths, or whatever -- to have such an impact on performance. You may have to start with the IODF for each LPAR and compare everything to find the reason for the difference, but it seems extremely likely that there is something making a difference.
Quote:
The weights are different; we have fewer LPARs in the CEC where we get the faster response, but both CECs are running under 50%.
If the CECs are the same machine/model, then different weights automatically imply that performance will be different between the two LPARs. If the machine/model is different for the two CECs, then you'd have to look at the weighted LPAR share on each machine to make any kind of valid comparison.
And the CEC running less than 50% means what? For example, if the LPAR in question is capped and running at 100% CPU utilization while the other LPARs are running very low utilizations, then the CEC utilization being under 50% would mean absolutely nothing, since the 100% LPAR utilization is what would matter.
None of the LPARs in either CEC are capped, and when I say the CECs are running at 50%, I mean we have 50% headroom in CPU terms for the LPARs to expand if they have demand. I have also adjusted weights on both sides to get the same number of Vertical Highs and maintain the same polarization. With all of that, the only difference I see between the two sides is the I/O time. I am pretty new to storage performance tuning. Is there something you can suggest about where to start with storage performance stats, and what to look at to find the smoking gun? I will start working through both systems from the bottom.
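One concrete place to start with storage stats is the RMF DASD Activity report (built from SMF type 74 data), where device response time breaks down as IOSQ + PEND + DISC + CONN; comparing those components between the two sides shows which layer differs. A minimal sketch of that comparison, with hypothetical millisecond values:

```python
# In RMF's DASD Activity report, device response time decomposes as
#   IOSQ (queuing in z/OS before the I/O starts)
# + PEND (delay reaching the device through channel / control unit)
# + DISC (disconnect time, e.g. cache miss or synchronous remote copy)
# + CONN (connect time, the actual data transfer).
# The millisecond values below are hypothetical, purely for illustration.
SAMPLES = {
    "fast_site": {"IOSQ": 0.1, "PEND": 0.2, "DISC": 0.3, "CONN": 0.4},
    "slow_site": {"IOSQ": 0.1, "PEND": 0.3, "DISC": 2.5, "CONN": 0.4},
}

def response_time(components):
    """Total device response time in milliseconds."""
    return sum(components.values())

for site, parts in SAMPLES.items():
    worst = max(parts, key=parts.get)
    print(f"{site}: {response_time(parts):.1f} ms total, "
          f"largest component {worst} ({parts[worst]:.1f} ms)")
```

Which component dominates points at different suspects: high IOSQ at z/OS-side queuing, high PEND at channel or control-unit contention, and high DISC at the disk subsystem itself (cache hit ratio, remote copy), rather than at the application.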
Joined: 06 Jun 2008 Posts: 8700 Location: Dubuque, Iowa, USA
Quote:
we have 50% headroom in CPU terms for the LPARs to expand if they have demand
I am not sure what you mean by this. CEC utilization can be important in a heavily used system, but almost all of the time the LPAR utilization is VASTLY more important to batch job performance. Does your site have MXG or MICS or another SMF analysis tool? If so, look at that data rather than the raw SMF records, since the raw data needs a lot of work to be usable. What is the LPAR utilization during this time period?
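As a worked illustration of why a CEC under 50% busy can still hide a saturated LPAR (all LPAR names and numbers below are hypothetical):

```python
# Hypothetical CEC with 10 physical CPs over one 900-second (15-minute)
# RMF interval. cpu_seconds is each LPAR's consumed physical-CP time.
INTERVAL_SEC = 900
PHYSICAL_CPS = 10

lpars = {
    "PROD": {"logical_cps": 3, "cpu_seconds": 2650},  # nearly saturated
    "TEST": {"logical_cps": 4, "cpu_seconds": 400},
    "DEV":  {"logical_cps": 3, "cpu_seconds": 300},
}

def lpar_util(lpar):
    """Utilization of an LPAR relative to its own logical CPs."""
    return lpar["cpu_seconds"] / (lpar["logical_cps"] * INTERVAL_SEC)

# CEC utilization: total consumed CP time over total physical capacity.
cec_util = sum(l["cpu_seconds"] for l in lpars.values()) / (PHYSICAL_CPS * INTERVAL_SEC)

print(f"CEC : {cec_util:.0%} busy")            # ~37%: looks comfortable
for name, l in lpars.items():
    print(f"{name}: {lpar_util(l):.0%} busy")  # PROD is ~98%: the real story
```

The CEC figure on its own looks comfortable, yet the batch job's home LPAR has essentially no headroom, which is exactly the situation the quoted paragraph warns about.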
There has to be something different in the CEC / LPAR definitions to see such a radical difference in I/O performance. The hard part is figuring out what that difference is!
When you say the jobs on the different LPARs are exactly the same, what does that mean? Are they creating or updating their own individual datasets, or are they sharing access to common datasets? What type of datasets are you dealing with? Please explain in more detail what the jobs actually do; otherwise we can only speculate about what the issues might be. Are both LPARs part of the same physical CEC and SMSplex, sharing common DASD, tape, etc.? Is the linklist the same on both LPARs? Is the COFVLFxx parmlib member the same on both LPARs? What other competing workload is there on the 'slow' LPAR when it is slow?