ASR1000 Box to Box HA with Zone-based Firewall: Config

The ASR 1000 series routers can be deployed in a highly available design, a solution called Box to Box HA or Inter-chassis Redundancy. One benefit of the ASR is that you can stack Zone-based Firewall (ZBFW) on top of it. Deploying a highly available ASR with ZBFW in a WAN to LAN topology requires asymmetric routing support, due to the nature of a stateful firewall. At the time of writing, there is a lack of information regarding design and config options for this scenario. I plan to address those issues here.

Problem
The documentation regarding the design and configuration required to deploy Box to Box HA with ZBFW on the ASR platform is unclear. In addition, there is a bug that requires a configuration workaround.

Environment 
-Two Cisco ASR1001-4x1GE (Total of 8 x 1GE SFP ports per ASR)
-IOS XE 3.6.1.S (IOS 15.2(2)S1)
-Advanced IP Services License, Firewall License, Firewall/NAT Stateful Inter-Chassis Redundancy License
-Two /30 hand-offs from ISP
-LAN consists of redundant Cisco Nexus switches using vPC technology

References
-Security Configuration Guide: Zone-Based Policy Firewall, Cisco IOS XE Release 3S (ASR 1000)

-Bug CSCub03015

Design Options
The first gotcha is that (at this point in time) the LAN has to use a switch stack to get a full mesh design. You cannot do full mesh with Cisco's vPC technology, as seen on the Nexus line, or with plain old redundant switches on your LAN. The good news is that, with the help of Cisco (TAC, SEs, TMEs, and DEs), we came up with the solution described here. In the process we uncovered a bug (for which they have a workaround).

Next, in addition to purchasing the Firewall Feature License, you will need to purchase the Firewall/NAT Stateful Inter-Chassis Redundancy License. Also note that the firewall won't work with IP Base; Advanced IP Services or higher is required.

ASR_B2BHA-ZBFW.jpg

Config
To get a baseline of the configuration concepts, I recommend reading the "Interchassis Asymmetric Routing Support for Zone-Based Firewall and NAT" section of the config guide referenced above. Achieving redundancy similar to a full mesh design with non-stackable switches requires adding another link between the ASRs (we will call it the Routed link). The diagram above shows the key configuration and wiring required for this to work. Attached here is the detailed config for both ASRs.

Note: there are a handful of places where you can add link redundancy with Port-channels. However, how far you can take this is limited by the number of ports on your ASR model. I chose to show the option that uses the fewest ports. The key point is to keep the same breakdown of links and not to stack any traffic types on the same link (the exception is Control and Data).

In the end, the only difference between this design and using stackable switches on the LAN is that the full mesh design can survive two kitty-corner device failures. For example, you could lose ASR1K-A and SWITCH-B, or vice versa, and still be up. This is because with a switch stack for the LAN we would just run a cross-stack EtherChannel from each ASR and life would be grand (in theory, per Cisco; I haven't tested it).
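For orientation only, here is a minimal sketch of the kind of redundancy group configuration that section of the config guide walks through. It is not the attached config: the interface names, group name, RII value, and priority numbers are all assumptions for illustration, and it covers neither the Routed link nor the bug workaround.

```
! Hypothetical interface names and values throughout
redundancy
 application redundancy
  protocol 1
   timers hellotime 3 holdtime 10
  group 1
   name RG1
   priority 150 failover threshold 100
   ! Control and Data may share a link; the other traffic types get their own
   control GigabitEthernet0/0/1 protocol 1
   data GigabitEthernet0/0/1
   ! Dedicated link used to divert asymmetrically routed packets to the peer
   asymmetric-routing interface GigabitEthernet0/0/2
!
interface GigabitEthernet0/0/0
 description WAN hand-off from ISP
 ! The RII must match on the corresponding interface of the peer ASR
 redundancy rii 100
 redundancy asymmetric-routing enable
```

The second ASR would carry the same structure with its own priority value, so that one box is preferred as active for the group.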

Results and Conclusions
With this configuration we are able to achieve a level of redundancy similar to a full mesh design without the use of stackable switches.

Rolling Stack Upgrade (RSU): Process +Caveats

RSU allows for a near-zero downtime firmware upgrade of all members in the stack. This is achieved with a scripted process that doesn't require running the upgrade from each stack member. At the time of writing this, there are a few caveats with the RSU process that require some special steps to make it work. I plan to address those issues here.


Problem
The currently documented process for RSU results in either an extended outage or an incomplete upgrade.


Environment

-Two 3750-X switches
-IOS 15.0(2)SE: this version has proven the most stable for me on the 15.x train and contains some relevant bug fixes (referenced below)
-IP Services License
-Two /30 hand-offs from ISP (link and box redundancy)
-Multiple LACP Cross-stack EtherChannels to LAN
-LAN consists of redundant Nexus switches with Dual-attached hosts 

References
-Catalyst 3750-X and 3560-X Switch Software Configuration Guide 
-Release Notes 
-Bugs (CSCts07947 and CSCtx05704)


Useful commands
-Verify health of the switch stack
show switch detail
show switch stack-ring speed


-Monitor status of the upgrade
show switch stack-upgrade status
show switch stack-upgrade sequence



RSU Process +Caveats

So as not to reinvent the wheel, I would definitely start with the “Rolling Stack Upgrade” section of the config guide (referenced above). However, there are three key differences that cost me some digging and troubleshooting:
-Manually remove the current image
-Being on version 15.0(2)SE or higher
-Using the /reload command, which is a little unclear in the config guide

Note: Extracting the images (archive command) took about 15 minutes, and the staggered reload process then took another 15 minutes.

Enable persistent MAC, if not already enabled :)
stack-mac persistent timer 0

Define redundant uplinks to "network", in my case the internet
! Uplink connected to the Master switch
interface interface-id
 rsu active
! Uplink connected to the Member switch
interface interface-id
 rsu passive

Remove current image from all stack members 
delete /force /recursive flash1:image-tar-folder
delete /force /recursive flash2:image-tar-folder

Execute RSU 
archive download-sw /reload /rolling-stack-upgrade tftp://ipaddress/image.tar 
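
Putting the steps above together, a full run might look roughly like the sketch below. The interface IDs, the old image folder name, the TFTP server address, and the image file name are all placeholders I made up for illustration; substitute your own. The show commands are the ones listed under "Useful commands".

```
! Hypothetical interface IDs, folder name, TFTP address, and image name
configure terminal
 stack-mac persistent timer 0
 ! Uplink on the Master switch
 interface GigabitEthernet1/0/1
  rsu active
 ! Redundant uplink on the Member switch
 interface GigabitEthernet2/0/1
  rsu passive
 end
!
! Confirm the stack is healthy before starting
show switch detail
show switch stack-ring speed
!
! Remove the current image from every stack member
delete /force /recursive flash1:old-image-folder
delete /force /recursive flash2:old-image-folder
!
! Kick off the rolling upgrade
archive download-sw /reload /rolling-stack-upgrade tftp://192.0.2.10/new-image.tar
!
! Watch progress while the members reload one at a time
show switch stack-upgrade status
show switch stack-upgrade sequence
```

Checking stack health first matters because the rolling upgrade depends on the other member carrying traffic while each switch reloads.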

Force master switch (optional)
After the RSU process is completed, the master switch will have changed due to the staggered upgrade process. If you wish to force a specific switch to become master, you will need to reload only the current master switch. This is done with the “reload slot” command shown below. DO NOT execute the plain “reload” command, as it will reload the whole stack and cause a 7-minute outage.
reload slot switch-number

Results and Conclusions
With this RSU process, downtime to the environment was reduced from 7 minutes to sub-second. To determine the availability of the environment (in my case a SaaS solution) I monitored these connections:
-Inbound: SSH, HTTPS, RDP, ping
-Outbound: ping
With the traditional firmware upgrade process, both stack members had to be reloaded at the same time to avoid a version mismatch, hence the need for a maintenance window. With RSU I recorded 0-1 packets lost, and any loss was not noticeable to the end user.

Thanks to TAC's help in uncovering the correct process.  Also, please share your experiences, comments, etc. below.