vasanthz
Global Moderator
Joined: 28 Aug 2007 Posts: 1742 Location: Tirupur, India
Hi,
Quote:
DASD Response time = CONN + DISC + PEND + IOSQ
IOSQ time is a delay time accumulated while the I/O is still in MVS and is waiting for a UCB to allow the I/O against the device. (UCB is an MVS control block)
Could you please explain in layman's terms what IOSQ means.
Does it mean that the UCB is unavailable to a particular request for the IOSQ period of time?
An example of IOSQ in a real-time scenario would also help me understand.
We are seeing more than 30 ms response times with a moderate I/O activity rate. The IOSQ value is high for some volumes (10-20 ms).
Thanks & Regards,
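The decomposition quoted at the top of this post is just a sum, so a tiny sketch (Python, with invented component values for illustration only) shows how a high IOSQ component can dominate the total:

```python
# DASD response time decomposition: RT = IOSQ + PEND + DISC + CONN.
# The millisecond values below are made up for illustration.

def dasd_response_time(iosq_ms, pend_ms, disc_ms, conn_ms):
    """Total device response time in milliseconds."""
    return iosq_ms + pend_ms + disc_ms + conn_ms

# A volume with healthy hardware-side times but heavy UCB queueing:
rt = dasd_response_time(iosq_ms=15.0, pend_ms=0.5, disc_ms=2.0, conn_ms=1.5)
print(rt)         # 19.0 ms total
print(15.0 / rt)  # fraction of the response time spent queueing in z/OS (~0.79)
```

With numbers like these, most of the 19 ms is spent before the I/O ever reaches the device, which is exactly the symptom described above.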
Paul Voyner
New User
Joined: 26 Nov 2012 Posts: 52 Location: UK
Vasanthz, you could probably have found the answer yourself if you'd Googled it.
Here's one of many clear explanations: "IOS Queue represents the average time that an I/O waits because the device is already in use by another task on this system, signified by the device's UCBBUSY bit being on".
In layman's terms, more than one address space* is trying to access the disk, so the requests have to queue until the disk - or more accurately the UCB for that disk - is free. Just like when only one checkout is open in the supermarket, everyone queues up waiting their turn.
(* @pedants, yes, it could also be one address space and multiple TCBs, but the guy wants a simple explanation)
BTW, those response times are very bad. You'll need to investigate with a monitoring tool, e.g. RMF or Omegamon.
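The checkout analogy can be put in code: with a single "UCB" (one checkout lane), any I/O that arrives while the device is busy accumulates wait time, and that wait is the IOSQ component. A minimal sketch, with invented arrival and service times:

```python
def iosq_waits(arrivals_ms, service_ms):
    """Single-UCB model: each I/O must wait until the previous one finishes.
    Returns the queueing delay (the IOSQ-like part) for each request, in ms."""
    waits = []
    device_free_at = 0.0
    for arrival in arrivals_ms:
        start = max(arrival, device_free_at)  # wait while UCBBUSY is on
        waits.append(start - arrival)
        device_free_at = start + service_ms   # device busy for the service time
    return waits

# Three I/Os arrive almost together; each takes 4 ms at the device:
print(iosq_waits([0.0, 1.0, 2.0], 4.0))  # [0.0, 3.0, 6.0] -> the queue builds up
```

Note how the third request waits longer than the second: queueing delay compounds under contention, which is why IOSQ can dwarf the hardware service time on a busy volume.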
vasanthz
Global Moderator
Joined: 28 Aug 2007 Posts: 1742 Location: Tirupur, India
Hi Paul,
Thank you very much for the clear explanation. I now have a mental picture of what it is.
Quote:
You'll need to investigate with a monitoring tool, e.g. RMF or Omegamon
We are using them to investigate.
Thanks & Regards
Ed Goodman
Active Member
Joined: 08 Jun 2011 Posts: 556 Location: USA
Is...is that a velociraptor with a machine gun and a bomb...riding a great white shark???
A case study in the old STROBE manual described a travel agency that had a spike in DASD wait time. They discovered that the hot spot on the disk held the Disneyland reservation information, so everyone was banging against the same spot on the disk.
If you are adding cases/invoices/customer records at the end of a file area, then doing most of your work against those new entries, this can happen.
If you are using mammoth buffers to try to speed things up, but one program is locking down records for updates, this can happen.
Akatsukami
Global Moderator
Joined: 03 Oct 2009 Posts: 1788 Location: Bloomington, IL
Ed Goodman wrote:
Is...is that a velociraptor with a machine gun and a bomb...riding a great white shark???
Indeed it is.
vasanthz
Global Moderator
Joined: 28 Aug 2007 Posts: 1742 Location: Tirupur, India
Hello Ed,
Quote:
Is...is that a velociraptor with a machine gun and a bomb...riding a great white shark???
Yes, correct, except that's an Uzi SMG :-)
Quote:
If you are using mammoth buffers to try to speed things up, but one program is locking down records for updates, this can happen.
My understanding was that large buffers mean lengthy transfers, and that this time shows up as CONN time, not IOSQ time.
Or do large buffers affect both IOSQ and CONN?
I have a lot of reading to do :S
Thanks & Regards,
steve-myers
Active Member
Joined: 30 Nov 2013 Posts: 917 Location: The Universe
vasanthz wrote:
... Or do large buffers affect both IOSQ and CONN?
I have a lot of reading to do :S
Sometimes in performance analysis you run into (sorry about the language on a family-oriented web site, but I don't think too many will be offended) situations where you're damned if you do and damned if you don't.
Most performance analysts regard large buffers, typically half-track for 3380 and 3390 devices and full-track for older devices, as a Good Thing for batch processing of basically sequential data. BUT (there's usually a BUT): if the device holding the sequential data shares an I/O path with a device holding data used by an online process like CICS, the longer path-busy time required to transfer the large blocks for the sequential job can interfere with the online process.
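The trade-off described above is easy to put rough numbers on. A sketch (the 100 MB/s path speed is an assumed, illustrative figure, not a real channel rating; 27,998 bytes is the usual half-track block size quoted for 3390):

```python
def sequential_io_profile(dataset_bytes, blksize, path_bytes_per_ms=100_000):
    """Rough model: bigger blocks mean fewer I/Os for the same data,
    but each transfer holds the path longer.  The path speed is an
    invented illustrative number, not a real device figure."""
    n_ios = -(-dataset_bytes // blksize)             # ceiling division
    ms_per_transfer = blksize / path_bytes_per_ms    # time the path is held per block
    return n_ios, ms_per_transfer

# Reading ~100 MB with 4 KB blocks vs. 3390 half-track (27,998-byte) blocks:
small = sequential_io_profile(100_000_000, 4_096)
large = sequential_io_profile(100_000_000, 27_998)
print(small)  # many short transfers
print(large)  # far fewer I/Os, but each one occupies the path ~7x longer
```

The second configuration finishes the batch job with far fewer I/Os, which is the "Good Thing"; the longer per-transfer path-busy time is the "BUT" that can hurt an online workload sharing the path.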
expat
Global Moderator
Joined: 14 Mar 2007 Posts: 8797 Location: Welsh Wales
Have you also thought about using PAV?
Especially for the most utilised volumes.
Maybe SDS (Sequential Data Striping) could also help, by spreading the naughty dataset over multiple volumes.
vasanthz
Global Moderator
Joined: 28 Aug 2007 Posts: 1742 Location: Tirupur, India
Quote:
Have you also thought about using PAV?
I am not sure if PAV is enabled for these volumes; I will check. Thank you.
Quote:
Sequential Data Striping
Is dataset striping still required? I remember reading that with the latest DASD boxes like the DS8000, with their RAID and rank architecture, the box by default stripes a dataset at the hardware level across multiple physical disks.
For example, 64 physical volumes for RAID 5 with 8 ranks.
Regards,
expat
Global Moderator
Joined: 14 Mar 2007 Posts: 8797 Location: Welsh Wales
SDS is for logical volumes. I used this about 10 years ago for poor response times and it worked rather well.
If you stripe the dataset over, say, XX logical volumes, you can have up to XX simultaneous accesses to the dataset, one access to each of the logical volumes it has been spread over. SMS sort of knows where all of the data is, so it's pretty easy to implement and use.
It's been quite a while since I've been heavily involved with DASD farming, so things may well have changed.
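The layout described above, chunks dealt out across the stripes and wrapping back to the first volume, can be sketched as a simple round-robin mapping (a toy model, not the actual SMS map):

```python
def stripe_location(chunk_number, stripe_count):
    """Round-robin striping: chunk 0 goes to volume 0, chunk 1 to volume 1,
    ... wrapping back to volume 0 after stripe_count chunks.
    Returns (volume_index, chunk_position_on_that_volume)."""
    return chunk_number % stripe_count, chunk_number // stripe_count

# A 4-way stripe: eight consecutive chunks land one per volume, twice around.
print([stripe_location(c, 4) for c in range(8)])
# -> [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (1, 1), (2, 1), (3, 1)]
```

Because consecutive chunks sit on different volumes (and so behind different UCBs), a sequential reader can drive up to stripe_count I/Os concurrently, which is where the speed-up comes from.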
Paul Voyner
New User
Joined: 26 Nov 2012 Posts: 52 Location: UK
Striping works very well when the need is to improve performance for specific datasets. I did a test a while ago with a dataset which sustained an I/O rate of 5,000 on a single disk, but 20,000 as a 6-way stripe.
But I'd guess that Vasanth's problem with high IOSQ is more likely caused by high contention for a volume by a large number of users, e.g. a TSO work volume. That won't be helped by striping.
expat
Global Moderator
Joined: 14 Mar 2007 Posts: 8797 Location: Welsh Wales
Paul, I've not yet come across a situation where SDS hasn't been beneficial.
Even if the volumes are heavily used, splitting the dataset over a number of volumes spreads, and so reduces, the contention too.
vasanthz
Global Moderator
Joined: 28 Aug 2007 Posts: 1742 Location: Tirupur, India
Hi,
Thank you Paul & Expat for your views.
Please feel free to correct me if the statements below are wrong. I've been doing some reading, and below is my understanding.
In modern DASD boxes, whether a dataset resides on a single volume or on multiple volumes does not matter much, due to hardware striping with RAID and ranks.
Even a dataset on a single logical volume may actually reside on multiple physical volumes inside the box, so the response time of a single logical volume is really the combined response time of a number of physical volumes.
The logical volume comes into the picture only when there is contention for the UCB. Since the dataset is spread over multiple physical volumes, or available straight from cache, the hardware can perform multiple concurrent reads of a dataset, but z/OS cannot issue concurrent reads/writes even though the hardware supports it. So the concept of PAV was introduced to make z/OS think it is writing to multiple UCBs while it is actually writing to a single logical volume. HyperPAV gives the best performance by dynamically assigning multiple UCBs (aliases) to a particular logical volume.
IOSQ delay occurs only when there is a shortage of UCBs.
If we enable HyperPAV, then we could possibly mitigate the IOSQ delay.
Is that correct, or a bunch of nonsense? :S
In our shop we have HyperPAV enabled. Could you please let me know if it is possible to determine the utilization of UCBs, so we can tell whether there is a shortage of UCBs in HyperPAV's alias pool?
Thanks & Regards,
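The alias idea in the summary above can be modelled: a base UCB plus its PAV aliases behave like extra checkout lanes, so queueing delay shrinks as aliases are added. A toy multi-server sketch (timings invented; this models the queueing effect only, not real HyperPAV internals):

```python
import heapq

def total_iosq(arrivals_ms, service_ms, n_ucbs):
    """Model a base UCB plus (n_ucbs - 1) aliases as parallel servers.
    Returns the total queueing (IOSQ-like) delay in milliseconds."""
    free_at = [0.0] * n_ucbs          # when each UCB next becomes free
    heapq.heapify(free_at)
    total_wait = 0.0
    for arrival in arrivals_ms:
        earliest = heapq.heappop(free_at)      # first UCB to come free
        start = max(arrival, earliest)
        total_wait += start - arrival
        heapq.heappush(free_at, start + service_ms)
    return total_wait

burst = [0.0, 0.5, 1.0, 1.5]          # four I/Os in quick succession, 4 ms each
print(total_iosq(burst, 4.0, 1))      # 21.0 -> one UCB: queueing piles up
print(total_iosq(burst, 4.0, 4))      # 0.0  -> base + 3 aliases: no queueing
```

This is the sense in which IOSQ is a "shortage of UCBs": the delay appears only when concurrent I/Os outnumber the UCBs available to the volume.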
Paul Voyner
New User
Joined: 26 Nov 2012 Posts: 52 Location: UK
Vasanth - 10/10 for your summing up. You've nailed it. The only quibble I have is with the wording "IOSQ delay occurs only when there is a shortage of UCB"; I think it should be something like "IOSQ delay occurs when the UCB is busy because an I/O is already active to the device".
Of course, enabling PAV isn't something you can do overnight. Striping is easier, and can be implemented in the SMS routines with minimal risk. Or, easier still, you could simply move some of the most heavily used datasets to another volume, if you know which they are (and that's not easy to find out without a nice tool like Omegamon).
expat
Global Moderator
Joined: 14 Mar 2007 Posts: 8797 Location: Welsh Wales
Hi Vasanthz,
Hardware striping and Sequential Data Striping are two completely different beasties. SDS stripes the dataset across a number of volumes, giving a larger UCB range for accessing it. SMS keeps a map of what data is where; it isn't as straightforward as putting nn GB on one volume and then the next nn GB on the next volume. It writes a chunk on the first volume, then a chunk on the next, and so on until it reaches the specified stripe count, and then starts again from volume 1 through volume nn.
It really is worth investigating with your DASD farmers.
As for HyperPAV - it may be installed at your site but possibly not available to the volumes that you are having grief with. That's something else you will need to find out from your shop.
It might also be beneficial to take a look at the volume(s) that are causing the problems to see what else is on the volume and how heavily used it is. In the past I've done this analysis and moved a few datasets about and improved the situation greatly.
I know how much you love sifting through SMF and RMF data.
Good luck
vasanthz
Global Moderator
Joined: 28 Aug 2007 Posts: 1742 Location: Tirupur, India
Quote:
you could simply move some of the most heavily used datasets to another volume
Quote:
It might also be beneficial to take a look at the volume(s) that are causing the problems to see what else is on the volume and how heavily used it is. In the past I've done this analysis and moved a few datasets about and improved the situation greatly.
Thank you, I will do that study.
Quote:
I know how much you love sifting through SMF and RMF data
SMF - definitely fun :-)
Pete Wilson
Active Member
Joined: 31 Dec 2009 Posts: 580 Location: London
You need to speak to your hardware support and/or vendor people to establish whether there's a HyperPAV/UCB issue, whether channels are busy or degraded, or whether your PPRC/XRC mirrors are struggling to keep up due to link/network errors etc., all of which can contribute to response times. The GRS setup can also have an effect if volume reserves are not all converted to global enqueues.
In the meantime, expat's suggestion of striping and/or moving contentious data around other volumes/pools probably wouldn't go amiss, but that would be your call, based on your better knowledge of the data.