Rolling Stack Upgrade (RSU): Process + Caveats

RSU allows a near-zero-downtime firmware upgrade of all members in the stack. This is achieved with a scripted process that doesn't require running the upgrade on each stack member individually. At the time of writing, there are a few caveats with the RSU process that require some special steps to make it work. I address those issues here.


Problem
The currently documented process for RSU results in either an extended outage or an incomplete upgrade.


Environment

-Two 3750-X switches
-IOS 15.0(2)SE: this version has proven the most stable for me on the 15.x train and contains some relevant bug fixes (referenced below)
-IP Services license
-Two /30 hand-offs from the ISP (link and box redundancy)
-Multiple LACP cross-stack EtherChannels to the LAN
-LAN consists of redundant Nexus switches with dual-attached hosts

References
-Catalyst 3750-X and 3560-X Switch Software Configuration Guide 
-Release Notes 
-Bugs (CSCts07947 and CSCtx05704)


Useful commands
-Verify health of the switch stack
show switch detail
show switch stack-ring speed


-Monitor status of the upgrade
show switch stack-upgrade status
show switch stack-upgrade sequence



RSU Process + Caveats

In an effort not to reinvent the wheel, I would start with the “Rolling Stack Upgrade” section of the config guide (referenced above). However, there are three key differences that took some digging and troubleshooting on my part:
-Manually removing the current image
-Being on version 15.0(2)SE or higher
-Using the /reload option with the archive command, which is a little unclear in the config guide

Note: Extracting the images (the archive command) took about 15 minutes, and then the staggered reload process took another 15 minutes.

Enable the persistent stack MAC address...if not already enabled :)
stack-mac persistent timer 0
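With the timer set to 0, the stack keeps the original master's MAC address indefinitely after a master switchover, so upstream devices don't have to relearn ARP/MAC entries while members reload during the rolling upgrade. A minimal sketch, from global configuration mode:
configure terminal
stack-mac persistent timer 0
end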

Define the redundant uplinks to the "network", in my case the Internet
interface interface-id                      <-connection on the Master switch
rsu active
interface interface-id                      <-connection on the Member switch
rsu passive
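For illustration, assuming the master's uplink is Gi1/0/1 and the member's uplink is Gi2/0/1 (hypothetical interface IDs, substitute your own):
interface GigabitEthernet1/0/1
 rsu active
interface GigabitEthernet2/0/1
 rsu passive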

Remove current image from all stack members 
delete /force /recursive flash1:image-tar-folder
delete /force /recursive flash2:image-tar-folder
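For example, if the old image folder were c3750e-universalk9-mz.122-55.SE3 (a hypothetical folder name; check "dir flash1:" for the actual one):
delete /force /recursive flash1:c3750e-universalk9-mz.122-55.SE3
delete /force /recursive flash2:c3750e-universalk9-mz.122-55.SE3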

Execute RSU 
archive download-sw /reload /rolling-stack-upgrade tftp://ipaddress/image.tar 
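For example, with a TFTP server at 10.1.1.100 and the 15.0(2)SE tar image (hypothetical address and filename; substitute your own):
archive download-sw /reload /rolling-stack-upgrade tftp://10.1.1.100/c3750e-universalk9-tar.150-2.SE.tar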

Force master switch (optional)
After the RSU process is complete, the master switch will have changed due to the staggered upgrade process.  If you wish to force a specific switch to become master, you need to reload only the current master switch.  This is done with the "reload slot" command below.  DO NOT execute the plain "reload" command, as it will reload the whole stack and cause a 7-minute outage.
reload slot switch-number
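For example, if switch 2 ended up as master after the upgrade and you want switch 1 to take over again (hypothetical member numbers):
reload slot 2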

Results and Conclusions
With this RSU process, downtime to the environment was reduced from 7 minutes to sub-second.  To gauge the availability of the environment (in my case a SaaS solution) I monitored these connections:
Inbound SSH, HTTPS, RDP, ping
Outbound ping
With the traditional firmware upgrade process, both stack members had to be reloaded at the same time to avoid a version mismatch, hence the need for a maintenance window.  With RSU I recorded 0-1 packets lost, and any loss was not noticeable to the end user.

Thanks to TAC for their help in uncovering the correct process.  Also, please share your experiences, comments, etc. below.