Declaratively deploy Leaf and Spine fabric

This playbook will deploy a leaf and spine fabric and its related services in a declarative manner. You only have to define a few key values such as the naming convention, number of devices and address ranges; the playbook is smart enough to do the rest for you.

This came from my project for the IPSpace Building Network Automation Solutions course and was used in part when we were deploying Cisco 9k leaf and spine fabrics in our Data Centers. The playbook is structured in a way that it should hopefully not be too difficult to add templates to deploy leaf and spine fabrics for other vendors. My plan was to add Arista and Juniper but that is unlikely to happen now.

I am now done with building DCs (bring on the :cloud:) and with this being on the edge of the limit of my programming knowledge I don't envisage making any future changes. If any of it is useful to you please do take it and mold it to your own needs.

This README is intended to give enough information to understand the playbook's structure and run it. The variable files hold examples of a deployment with more information on what each variable does. For more detailed information about the playbook have a look at the series of posts I did about it on my blog.

<hr>

The playbook deployment is structured into the following 5 roles with the option to deploy part or all of the fabric.

If you wish to have a more custom build the majority of the settings in the variable files (unless specifically stated) can be changed as none of the scripting or templating logic uses the actual contents (dictionary values) to make decisions.

This deployment will scale up to a maximum of 4 spines, 4 borders and 10 leafs, which is how it is deployed with the default values (see the net_top network topology diagram).

The default ports used for inter-switch links are in the table below; these can be changed within fabric.yml (fbc.adv.bse_intf).

| Connection | Start Port | End Port |
|------------|------------|----------|
| SPINE-to-LEAF | Eth1/1 | Eth1/10 |
| SPINE-to-BORDER | Eth1/11 | Eth1/14 |
| LEAF-to-SPINE | Eth1/1 | Eth1/4 |
| BORDER-to-SPINE | Eth1/1 | Eth1/4 |
| MLAG Peer-link | Eth1/5 | Eth1/6 |
| MLAG keepalive | mgmt | n/a |

This playbook is based on 1U Nexus devices, therefore using the one linecard module for all the connections. I have not tested how it will work with multiple modules; the role intf_cleanup is likely not to work. This role ensures interface configuration is declarative by defaulting non-used interfaces, so it could be excluded without breaking the playbook.

As Python is a lot more flexible than Ansible, the dynamic inventory_plugin and filter_plugins (within the roles) do the manipulation of the data in the variable files to create the data models that are used by the templates. This abstracts a lot of the complexity out of the Jinja2 templates, making it easier to create new templates for different vendors as you only have to deal with the device configuration rather than data manipulation.

Fabric Core Variable Elements

These core elements are the minimum requirements to create the declarative fabric. They are used for the dynamic inventory creation as well as by the majority of the Jinja2 templates. All variables are prefixed with ans, bse or fbc to make it easier to identify within the playbook, roles and templates which variable file the variable came from. From the contents of these var_files a dynamic inventory is built containing host_vars of the fabric interfaces and IP addresses.

ansible.yml (ans)

dir_path: Base directory location on the Ansible host that stores all the validation and configuration snippets
device_os: Operating system of each device type (spine, leaf and border)
creds_all: hostname (got from the inventory), username and password
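
A minimal sketch of what this could look like in ansible.yml (the key names under device_os and creds_all are assumptions here; the variable file in the repo is the authoritative reference):

ans:
  dir_path: ~/device_configs        # base directory on the Ansible host
  device_os:                        # OS per device type, key names illustrative
    spine_os: nxos
    border_os: nxos
    leaf_os: nxos
  creds_all:                        # hostname comes from the inventory
    username: admin
    password: ansible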

base.yml (bse)

The settings required to onboard and manage a device, such as hostname format, IP address ranges, AAA, syslog, etc.

device_name: Naming format that the automatically generated 'Node ID' (double decimal format) is added to and the group name created from (in lowercase). The name must contain a hyphen (-) and the characters after that hyphen must be either letters, digits or underscore as that is what the group name is created from. For example using DC1-N9K-SPINE would mean that the device is DC1-N9K-SPINE01 and the group is spine

| Key | Value | Information |
|-----|-------|-------------|
| spine | xx-xx | Spine switch device and group naming format |
| border | xx-xx | Border switch device and group naming format |
| leaf | xx-xx | Leaf switch device and group naming format |
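
Using the example name from above, a hedged sketch of the device_name section:

bse:
  device_name:
    spine: DC1-N9K-SPINE
    border: DC1-N9K-BORDER
    leaf: DC1-N9K-LEAF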

addr: Subnets from which the device specific IP addresses are generated based on the device-type increment and the Node ID. The majority of subnets need to be at least /27 to cover a maximum network size of 4 spines, 10 leafs and 4 borders (18 addresses)

| Key | Value | Min size | Information |
|-----|-------|----------|-------------|
| lp_net | x.x.x.x/26 | /26 | The range routing (OSPF/BGP), VTEP and vPC loopbacks are from (mask will be /32) |
| mgmt_net | x.x.x.x/27 | /27 | Management network, by default will use .11 to .30 |
| mlag_peer_net | x.x.x.x/26 | /26 or /27 | Range for OSPF peering between MLAG pairs, is split into /30 per-switch pair. Must be /26 if using same range for keepalive |
| mlag_kalive_net | x.x.x.x/27 | /27 | Optional keepalive address range (split into /30). If not set uses mlag_peer_net range |
| mgmt_gw | x.x.x.x | n/a | Management interface default gateway |

mlag_kalive_net is only needed if not using the management interface for the keepalive or you want separate ranges for the peer-link and keepalive interfaces. The keepalive link is created in its own VRF so it can use duplicate IPs or be kept unique by offsetting it with the fbc.adv.addr_incre.mlag_kalive_incre fabric variable.
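
As an illustrative sketch (subnets chosen to match the example addressing used later in this README, the values are yours to define):

bse:
  addr:
    lp_net: 192.168.101.0/26
    mgmt_net: 10.10.108.0/24
    mlag_peer_net: 192.168.202.0/26
    # mlag_kalive_net: 10.10.10.0/27    # only needed for a separate keepalive range
    mgmt_gw: 10.10.108.1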

There are a lot of other system wide settings in base.yml such as AAA, NTP, DNS, usernames and management ACLs. Anything under bse.services is optional (DNS, logging, NTP, AAA, SNMP, SYSLOG) and will use the management interface and VRF as the source unless specifically set. More detailed information can be found in the variable file.

fabric.yml (fbc)

Variables used to determine how the fabric will be built: the network size, interfaces, routing protocols and address increments. At a bare minimum you only need to declare the size of the fabric, the total number of switch ports and the routing options.

network_size: How many of each device type make up the fabric. Can range from 1 spine and 2 leafs up to a maximum of 4 spines, 4 borders and 10 leafs. The border and leaf switches are MLAG pairs so must be in increments of 2.

| Key | Value | Information |
|-----|-------|-------------|
| num_spines | 2 | Number of spine switches in increments of 1 up to a maximum of 4 |
| num_borders | 2 | Number of border switches in increments of 2 up to a maximum of 4 |
| num_leafs | 4 | Number of leaf switches in increments of 2 up to a maximum of 10 |

num_intf: The total number of interfaces per-device-type is required to make the interface assignment declarative by ensuring that non-defined interfaces are reset to their default values

| Key | Value | Information |
|-----|-------|-------------|
| spine | 1,64 | The first and last interface for a spine switch |
| border | 1,64 | The first and last interface for a border switch |
| leaf | 1,64 | The first and last interface for a leaf switch |
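
Using the default values from the two tables above, the fabric size declaration would look along these lines (a sketch, not the full fabric.yml):

fbc:
  network_size:
    num_spines: 2
    num_borders: 2
    num_leafs: 4
  num_intf:
    spine: "1,64"
    border: "1,64"
    leaf: "1,64"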

adv.bse_intf: Interface naming formats and the 'seed' interface numbers used to build the fabric

| Key | Value | Information |
|-----|-------|-------------|
| intf_fmt | Ethernet1/ | Interface naming format |
| intf_short | Eth1/ | Short interface name used in interface descriptions |
| mlag_fmt | port-channel | MLAG interface naming format |
| mlag_short | Po | Short MLAG interface name used in MLAG interface descriptions |
| lp_fmt | loopback | Loopback interface naming format |
| sp_to_lf | 1 | First interface used for SPINE to LEAF links (1 to 10) |
| sp_to_bdr | 11 | First interface used for SPINE to BORDER links (11 to 14) |
| lf_to_sp | 1 | First interface used for LEAF to SPINE links (1 to 4) |
| bdr_to_sp | 1 | First interface used for BORDER to SPINE links (1 to 4) |
| mlag_peer | 5-6 | Interfaces used for the MLAG peer-link |
| mlag_kalive | mgmt | Interface for the keepalive. If it is not an integer uses the management interface |
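
In fabric.yml this roughly equates to the following sketch (values are the defaults from the table):

fbc:
  adv:
    bse_intf:
      intf_fmt: Ethernet1/
      intf_short: Eth1/
      mlag_fmt: port-channel
      mlag_short: Po
      lp_fmt: loopback
      sp_to_lf: 1
      sp_to_bdr: 11
      lf_to_sp: 1
      bdr_to_sp: 1
      mlag_peer: 5-6
      mlag_kalive: mgmt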

adv.address_incre: Increments added to the 'Node ID' and subnet to generate unique device IP addresses. Uniqueness is enforced by using different increments for different device-types and functions

| Key | Value | Information |
|-----|-------|-------------|
| spine_ip | 11 | Spine mgmt and routing loopback addresses (default .11 to .14) |
| border_ip | 16 | Border mgmt and routing loopback addresses (default .16 to .19) |
| leaf_ip | 21 | Leaf mgmt and routing loopback addresses (default .21 to .30) |
| border_vtep_lp | 36 | Border VTEP (PIP) loopback addresses (default .36 to .39) |
| leaf_vtep_lp | 41 | Leaf VTEP (PIP) loopback addresses (default .41 to .50) |
| border_mlag_lp | 56 | Shared MLAG anycast (VIP) loopback addresses for each pair of borders (default .56 to .57) |
| leaf_mlag_lp | 51 | Shared MLAG anycast (VIP) loopback addresses for each pair of leafs (default .51 to .55) |
| border_bgw_lp | 58 | Shared BGW MS anycast loopback addresses for each pair of borders (default .58 to .59) |
| mlag_leaf_ip | 1 | Start IP for leaf OSPF peering over peer-link (default LEAF01 is .1, LEAF02 is .2, LEAF03 is .5, etc) |
| mlag_border_ip | 21 | Start IP for border OSPF peering over peer-link (default BORDER01 is .21, BORDER03 is .25, etc) |
| mlag_kalive_incre | 28 | Increment added to leaf/border increment (mlag_leaf_ip/mlag_border_ip) for keepalive addresses |

If the management interface is not being used for the keepalive link either specify a separate network range (bse.addr.mlag_kalive_net) or use the peer-link range and define an increment (mlag_kalive_incre) that is added to the peer-link increment (mlag_leaf_ip or mlag_border_ip) to generate unique addresses.
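
As a worked illustration of how the increments combine with the 'Node ID' and the bse.addr ranges (using the example addressing that appears in the dynamic inventory output later in this README; the inventory plugin does this arithmetic for you):

DC1-N9K-LEAF01 (Node ID 01):
  mgmt address       = mgmt_net + leaf_ip (21)       -> 10.10.108.21
  routing loopback   = lp_net   + leaf_ip (21)       -> 192.168.101.21/32
  VTEP loopback (PIP)= lp_net   + leaf_vtep_lp (41)  -> 192.168.101.41/32
  MLAG loopback (VIP)= lp_net   + leaf_mlag_lp (51)  -> 192.168.101.51/32 (shared with DC1-N9K-LEAF02)

DC1-N9K-LEAF02 (Node ID 02) gets .22 and .42 for its unique addresses and shares the same .51 MLAG VIP.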

route: Settings related to the fabric routing protocols (OSPF and BGP). BFD is not supported on unnumbered interfaces so the routing protocol timers have been shortened (OSPF 2/8, BGP 3/9); these are set under the variable file advanced settings (adv.route)

| Key | Value | Mandatory | Information |
|-----|-------|-----------|-------------|
| ospf.pro | string or integer | Yes | Can be numbered or named |
| ospf.area | x.x.x.x | Yes | Area this group of interfaces is in, must be in dotted decimal format |
| bgp.as_num | integer | Yes | Local BGP Autonomous System number |
| authentication | string | No | Applies to both BGP and OSPF. Hash out if you don't want to set authentication |

acast_gw_mac: The distributed gateway anycast MAC address for all leaf and border switches in the format xxxx.xxxx.xxxx
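
A minimal routing declaration could therefore look like the following sketch (the OSPF process name, ASN and anycast MAC are illustrative values):

fbc:
  route:
    ospf:
      pro: UNDERLAY          # named or numbered process
      area: 0.0.0.0
    bgp:
      as_num: 65001
    # authentication: my_password    # applies to both OSPF and BGP if unhashed
  acast_gw_mac: 0000.2222.3333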

Dynamic Inventory

The ansible, base and fabric variables are passed through the inv_from_vars.py inventory_plugin to create the dynamic inventory and host_vars of all the fabric interfaces and IP addresses. By doing this in the inventory the complexity is abstracted from the base and fabric role templates making it easier to expand the playbook to other vendors in the future.

With the exception of intf_mlag and mlag_peer_ip (not on the spines) the following host_vars are created for every host.

The devices (host-vars) and groups (group-vars) created by the inventory plugin can be checked using the graph flag. It is the inventory config file (.yml) not the inventory plugin (.py) that is referenced when using the dynamic inventory.

ansible-inventory --playbook-dir=$(pwd) -i inv_from_vars_cfg.yml --graph
@all:
  |--@border:
  |  |--DC1-N9K-BORDER01
  |  |--DC1-N9K-BORDER02
  |--@leaf:
  |  |--DC1-N9K-LEAF01
  |  |--DC1-N9K-LEAF02
  |  |--DC1-N9K-LEAF03
  |  |--DC1-N9K-LEAF04
  |--@spine:
  |  |--DC1-N9K-SPINE01
  |  |--DC1-N9K-SPINE02
  |--@ungrouped:

--host shows the host_vars for that specific host whereas --list shows everything, all host_vars and group_vars.

ansible-inventory --playbook-dir=$(pwd) -i inv_from_vars_cfg.yml --host DC1-N9K-LEAF01
ansible-inventory --playbook-dir=$(pwd) -i inv_from_vars_cfg.yml --list

An example of the host_vars created for a leaf switch.

{
    "ansible_host": "10.10.108.21",
    "ansible_network_os": "nxos",
    "intf_fbc": {
        "Ethernet1/1": "UPLINK > DC1-N9K-SPINE01 - Eth1/1",
        "Ethernet1/2": "UPLINK > DC1-N9K-SPINE02 - Eth1/1"
    },
    "intf_lp": [
        {
            "descr": "LP > Routing protocol RID and peerings",
            "ip": "192.168.101.21/32",
            "name": "loopback1"
        },
        {
            "descr": "LP > VTEP Tunnels (PIP) and MLAG (VIP)",
            "ip": "192.168.101.41/32",
            "mlag_lp_addr": "192.168.101.51/32",
            "name": "loopback2"
        }
    ],
    "intf_mlag_kalive": {
        "Ethernet1/7": "UPLINK > DC1-N9K-LEAF02 - Eth1/7 < MLAG Keepalive"
    },
    "intf_mlag_peer": {
        "Ethernet1/5": "UPLINK > DC1-N9K-LEAF02 - Eth1/5 < Peer-link",
        "Ethernet1/6": "UPLINK > DC1-N9K-LEAF02 - Eth1/6 < Peer-link",
        "port-channel1": "UPLINK > DC1-N9K-LEAF02 - Po1 < MLAG Peer-link"
    },
    "mlag_kalive_ip": "10.10.10.29/30",
    "mlag_peer_ip": "192.168.202.1/30",
    "num_intf": "1,64"
}

To use the inventory plugin in a playbook reference the inventory config file in place of the normal hosts inventory file (-i).

ansible-playbook PB_build_fabric.yml -i inv_from_vars_cfg.yml

Services - Tenant (svc_tnt)

Tenants, SVIs, VLANs and VXLANs are created based on the variables stored in the service_tenant.yml file (svc_tnt.tnt).

tnt: A list of tenants, each of which contains a list of VLANs (Layer2 and/or Layer3)

| Key | Value | Mandatory | Information |
|-----|-------|-----------|-------------|
| tenant_name | string | Yes | Name of the VRF |
| l3_tenant | True or False | Yes | Does it need SVIs or is routing done off the fabric (i.e. an external router) |
| bgp_redist_tag | integer | No | Tag used to redistribute SVIs into BGP, by default uses the tenant SVI number |
| vlans | list | Yes | List of VLANs within this tenant (see the table below) |

vlans: A list of VLANs within a tenant which at a minimum need the Layer2 values of name and num. VLANs and SVIs can only be created on all leafs and/or all borders; you can't selectively say which individual leaf or border switches to create them on

| Key | Value | Mand | Information |
|-----|-------|------|-------------|
| num | integer | Yes | The VLAN number |
| name | string | Yes | The VLAN name |
| ip_addr | x.x.x.x/x | No | Adding an IP address automatically makes the VLAN L3 (not set by default) |
| ipv4_bgp_redist | True or False | No | Dictates whether the SVI is redistributed into the BGP VRF address family (default True) |
| create_on_leaf | True or False | No | Dictates whether this VLAN is created on the leafs (default True) |
| create_on_border | True or False | No | Dictates whether this VLAN is created on the borders (default False) |
| vxlan | True or False | No | Whether it is a VXLAN or normal VLAN. Only needed if you don't want it to be a VXLAN |
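
Tying the two tables together, a hedged sketch of a tenant definition (the values mirror the data-model example further down, only the exact layout of service_tenant.yml is assumed):

svc_tnt:
  tnt:
    - tenant_name: RED
      l3_tenant: True
      bgp_redist_tag: 99
      vlans:
        - num: 99
          name: red_inet_vl99
          ip_addr: 10.99.99.1/24
          create_on_leaf: False
          create_on_border: True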

The redistribution route-map name can be changed in the advanced (adv) section of services-tenant.yml or services-routing.yml. If defined in both places the setting in services-routing.yml takes precedence.

L2VNI and L3VNI numbers

The L2VNI and L3VNI values are automatically derived and incremented on a per-tenant basis based on the start and increment seed values defined in the advanced section (svc_tnt.adv) of services_tenant.yml.

adv.bse_vni: Starting VNI numbers

| Key | Value | Information |
|-----|-------|-------------|
| tnt_vlan | 3001 | Starting VLAN number for the transit L3VNI |
| l3vni | 10003001 | Starting L3VNI number |
| l2vni | 10000 | Starting L2VNI number, the VLAN number will be added to this |

adv.vni_incre: Number by which VNIs are incremented for each tenant

| Key | Value | Information |
|-----|-------|-------------|
| tnt_vlan | 1 | Value by which the transit L3VNI VLAN number is increased for each tenant |
| l3vni | 1 | Value by which the transit L3VNI number is increased for each tenant |
| l2vni | 10000 | Value by which the L2VNI range (range + vlan) is increased for each tenant |
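
In services_tenant.yml these defaults sit under the advanced section, roughly as follows (a sketch of the structure):

svc_tnt:
  adv:
    bse_vni:
      tnt_vlan: 3001
      l3vni: 10003001
      l2vni: 10000
    vni_incre:
      tnt_vlan: 1
      l3vni: 1
      l2vni: 10000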

For example, a two-tenant fabric each with a VLAN 20 using the above values would have L3 tenant SVIs of 3001 and 3002, L3VNIs of 10003001 and 10003002, and L2VNIs of 10020 and 20020.

A new data-model is created from the services_tenant.yml variables by passing them through the format_dm.py filter_plugin method create_svc_tnt_dm along with the BGP route-map name (if it exists) and the ASN (from fabric.yml). The result is a per-device-type (leaf and border) list of tenants, SVIs and VLANs which is used to render the svc_tnt_tmpl.j2 template and create the config snippet.

Below is an example of the data model format for a tenant and its VLANs.

{
    "bgp_redist_tag": 99,
    "l3_tnt": true,
    "l3vni": 100003004,
    "rm_name": "RM_CONN->BGP65001_RED",
    "tnt_name": "RED",
    "tnt_redist": true,
    "tnt_vlan": 3004,
    "vlans": [
        {
            "create_on_border": true,
            "create_on_leaf": false,
            "ip_addr": "10.99.99.1/24",
            "ipv4_bgp_redist": true,
            "name": "red_inet_vl99",
            "num": 99,
            "vni": 40099
        },
        {
            "ip_addr": "l3_vni",
            "ipv4_bgp_redist": false,
            "name": "RED_L3VNI",
            "num": 3004,
            "vni": 100003004
        }
    ]
}

Services - Interface (svc_intf)

The service_interface.yml variables define single or dual-homed interfaces (including port-channel) either statically or dynamically.

There are 7 pre-defined interface types that can be deployed: access, stp_trunk, stp_trunk_non_ba, non_stp_trunk, layer3, loopback and svi.

The intf.single_homed and intf.dual-homed dictionaries hold a list of all single-homed or dual-homed interfaces using any of the attributes in the table below. If there are no single-homed or dual-homed interfaces on the fabric hash out the relevant dictionary.

| Key | Value | Mand | Information |
|-----|-------|------|-------------|
| descr | string | Yes | Interface or port-channel description |
| type | intf_type | Yes | Either access, stp_trunk, stp_trunk_non_ba, non_stp_trunk, layer3, loopback or svi |
| ip_vlan | vlan or ip | Yes | Depends on the type, either ip/prefix, vlan or multiple vlans separated by , and/or - |
| switch | list | Yes | List of switches created on. If dual-homed needs to be odd numbered switch from MLAG pair |
| tenant | string | No | Layer3, svi and loopbacks only. If not defined the default VRF is used (global routing table) |
| po_mbr_descr | list | No | PO member interface description, [odd_switch, even_switch]. If undefined uses PO descr |
| po_mode | string | No | Set the port-channel mode, 'on', 'passive' or 'active' (default is 'active') |
| intf_num | integer | No | Only specify the number, the name and module are got from fbc.adv.bse_intf.intf_fmt |
| po_num | integer | No | Only specify the number, the name is got from fbc.adv.bse_intf.mlag_fmt |
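
A loose sketch of a single-homed and a dual-homed interface built from these attributes (the values mirror the data-model example further down; the exact dictionary layout in service_interface.yml may differ):

svc_intf:
  intf:
    single_homed:
      - descr: UPLINK > DC1-BIP-LB01 - Eth1.1
        type: access
        ip_vlan: 30
        switch: [DC1-N9K-LEAF01]
    dual_homed:
      - descr: UPLINK > DC1-SWI-BLU01 - Gi0/0
        type: stp_trunk
        ip_vlan: 10,20,30
        po_mode: 'on'
        po_num: 18
        intf_num: 18
        switch: [DC1-N9K-LEAF01]      # odd numbered switch of the MLAG pair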

The playbook has the logic to recognize if statically defined interface numbers overlap with the dynamic interface range and exclude them from dynamic interface assignment. For simplicity it is probably best to use separate ranges for the dynamic and static assignments.

adv.single_homed: Reserved range of interfaces to be used for dynamic single-homed and loopback assignment

| Key | Value | Information |
|-----|-------|-------------|
| first_intf | integer | First single-homed interface to be dynamically assigned |
| last_intf | integer | Last single-homed interface to be dynamically assigned |
| first_lp | integer | First loopback number to be dynamically used |
| last_lp | integer | Last loopback number to be dynamically used |

adv.dual-homed: Reserved range of interfaces to be used for dynamic dual-homed and port-channel assignment

| Key | Value | Information |
|-----|-------|-------------|
| first_intf | integer | First dual-homed interface to be dynamically assigned |
| last_intf | integer | Last dual-homed interface to be dynamically assigned |
| first_po | integer | First port-channel number to be dynamically used |
| last_po | integer | Last port-channel number to be dynamically used |

The format_dm.py filter_plugin method create_svc_intf_dm is run for each inventory host to produce a list of all interfaces to be created on that device. In addition to the services_interface.yml variables it also passes in the interface naming format (fbc.adv.bse_intf) to create the full interface name and hostname to find the interfaces relevant to that device. This is saved to the fact flt_svc_intf which is used to render the svc_intf_tmpl.j2 template and create the config snippet.

Below is an example of the data model format for a single-homed and dual-homed interface.

{
    "descr": "UPLINK > DC1-BIP-LB01 - Eth1.1",
    "dual_homed": false,
    "intf_num": "Ethernet1/9",
    "ip_vlan": 30,
    "stp": "edge",
    "type": "access"
},
{
    "descr": "UPLINK > DC1-SWI-BLU01 - Gi0/0",
    "dual_homed": true,
    "intf_num": "Ethernet1/18",
    "ip_vlan": "10,20,30",
    "po_mode": "on",
    "po_num": 18,
    "stp": "network",
    "type": "stp_trunk"
},
{
    "descr": "UPLINK > DC1-SWI-BLU01 - Po18",
    "intf_num": "port-channel18",
    "ip_vlan": "10,20,30",
    "stp": "network",
    "type": "stp_trunk",
    "vpc_num": 18
}

Interface Cleanup - Defaulting Interfaces

The interface cleanup role is required to make sure any interfaces not assigned by the fabric or the services (svc_intf) role have a default configuration. Without it, if an interface was changed (for example a server moved to a different interface) the old interface would not have its configuration put back to the default values.

This role goes through the interfaces assigned by the fabric (from the inventory) and the service_interface role (from the svc_intf_dm method), producing a list of used physical interfaces which are then subtracted from the list of all the switch's physical interfaces (fbc.num_intf). It has to be run after the fabric or service_interface role as it needs to know what interfaces have been assigned, therefore tags are used to ensure it is run anytime either of these roles is run.

Services - Route (svc_rte)

BGP peerings, non-backbone OSPF processes, static routes and redistribution (connected, static, bgp, ospf) are configured based on the variables specified in the service_route.yml file. The naming convention of the route-maps and prefix-lists used by OSPF and BGP can be changed under the advanced section (adv) of the variable file.

I am undecided about this role as it goes against the simplistic principles used by the other roles. By its very nature routing is very configurable which leads to complexity due to the number of options and inheritance. In theory all these features should work but due to the number of options and combinations available I have not tested all the possible variations of configuration.

Static routes (svc_rte.static_route)

Routes are added per-tenant with the tenant being the top-level dictionary that routes are created under.

| Parent dict | Key | Value | Mand | Information |
|-------------|-----|-------|------|-------------|
| n/a | tenant | list | Yes | List of tenants to create the routes in. Use 'global' for the global routing table |
| n/a | switch | list | Yes | List of switches to create all routes on (alternatively can be set per-route) |
| route | prefix | list | Yes | List of routes that all have same settings (gateway, interface, switch, etc) |
| route | gateway | x.x.x.x | Yes | Next hop gateway address |
| route | interface | string | No | Next hop interface, use interface full name (Ethernet), Vlan or Null0 |
| route | ad | integer | No | Set the admin distance for this group of routes (1 - 255) |
| route | next_hop_vrf | string | No | Set the VRF for next-hop if it is in a different VRF (route leaking between VRFs) |
| route | switch | list | Yes | Switches to create this group of routes on (overrides static_route.switch) |
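
A hedged sketch of a static route entry (tenant, switches, prefix and gateway are all illustrative and the exact nesting may differ from service_route.yml):

svc_rte:
  static_route:
    - tenant: [global]
      switch: [DC1-N9K-BORDER01, DC1-N9K-BORDER02]
      route:
        - prefix: [0.0.0.0/0]
          gateway: 10.99.99.254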

OSPF (svc_rte.ospf)

An OSPF process can be configured for any of the tenants or the global routing table.

| Key | Value | Mand | Information |
|-----|-------|------|-------------|
| process | integer or string | Yes | The process can be a number or word |
| switch | list | Yes | List of switches to create the OSPF process on |
| tenant | string | No | The VRF OSPF is enabled in. If not defined uses the global routing table |
| rid | list | No | List of RIDs, must match number of switches (if undefined uses highest loopback) |
| bfd | True | No | Enable BFD globally for all interfaces (disabled by default) |
| default_orig | True, always | No | Conditionally (True) or always advertise default route (disabled by default) |

Interface, summary and redistribution are child dictionaries of lists under the ospf parent dictionary. They inherit process.switch unless switch is specifically defined under that child dictionary.

ospf.interface: Each list element is a group of interfaces with the same set of attributes (area number, interface type, auth, etc)

| Key | Value | Mand | Information |
|-----|-------|------|-------------|
| name | list | Yes | List of one or more interfaces. Use interface full name (Ethernet) or Vlan |
| area | x.x.x.x | Yes | Area this group of interfaces are in, must be in dotted decimal format |
| switch | list | No | Which switches to enable OSPF on these interfaces (inherits process.switch if not set) |
| cost | integer | No | Statically set the interfaces OSPF cost, can be 1-65535 |
| authentication | string | No | Enable authentication for the area and a password (Cisco type 7) for this interface |
| area_type | string | No | By default is normal. Can be set to stub, nssa, stub/nssa no-summary, nssa default-information-originate or nssa no-redistribution |
| passive | True | No | Make the interface passive. By default all configured interfaces are non-passive |
| hello | integer | No | Interface hello interval (deadtime is x4), automatically disables BFD for this interface |
| type | point-to-point | No | By default all interfaces are broadcast, can be changed to point-to-point |

ospf.summary: All summaries with the same attributes (switch, filter, area) can be grouped in a list within the one prefix dictionary value

| Key | Value | Mandatory | Information |
|-----|-------|-----------|-------------|
| prefix | list | Yes | List of summaries to apply on all the specified switches |
| switch | list | No | What switches to summarize on, inherits process.switch if not set |
| area | x.x.x.x | No | By default it is LSA5. For LSA3 add an area to summarize from that area |
| filter | not-advertise | No | Stops advertisement of the summary and subordinate subnets (is basically filtering) |

ospf.redist: Each list element is the redistribution type (ospf_xx, bgp_xx, static or connected). Redistributed prefixes can be filtered (allow) or weighted (metric) with the route-map order being metric and then allow. If the allow list is not set it will allow any (empty route-map)

| Key | Value | Mand | Information |
|-----|-------|------|-------------|
| type | string | Yes | Redistribute either OSPF process, BGP AS, static or connected |
| switch | list | No | What switches to redistribute on, inherits process.switch if not set |
| metric | dict | No | Add metric to redistributed prefixes. Keys are metric value and values a list of prefixes or keyword ('any' or 'default'). Can't use metric with a type of connected |
| allow | list, any, default | No | List of prefixes (connected is list of interfaces) or keyword ('any' or 'default') to redistribute |
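
Pulling the process, interface, summary and redistribution tables together, an OSPF definition could be sketched like this (names, prefixes and switches are illustrative and the exact nesting may differ from service_route.yml):

svc_rte:
  ospf:
    - process: 99
      tenant: RED
      switch: [DC1-N9K-BORDER01, DC1-N9K-BORDER02]
      interface:
        - name: [Vlan99]
          area: 0.0.0.0
      summary:
        - prefix: [10.99.99.0/24]
      redist:
        - type: bgp_65001
          allow: [default]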

BGP

Uses the concept of groups and peers, with the majority of the settings able to be configured under either the group or the individual peer ('Set in' in the table below).

| Set in | Key | Value | Mand | Information |
|--------|-----|-------|------|-------------|
| group | name | string | Yes | Name of the group, no whitespaces or duplicate names (group or peer) |
| peer | name | string | Yes | Name of the peer, no whitespaces or duplicate names (group or peer) |
| peer | peer_ip | x.x.x.x | Yes | IP address of the peer |
| peer | descr | string | Yes | Description of the peer |
| both | switch | list | Yes | List of switches (even if is only 1) to create the group and peers on |
| both | tenant | list | No | List of tenants (even if is only 1) to create the peers under |
| both | remote_as | integer | Yes | Remote AS of this peer or if group all peers within that group |
| both | timers | [kl,ht] | No | List of [keepalive, holdtime], if not defined uses [3, 9] seconds |
| both | bfd | True | No | Enable BFD for an individual peer or all peers in group (disabled by default) |
| both | password | string | No | Plain-text password to authenticate a peer or all peers in group (default none) |
| both | default | True | No | Advertise default route to a peer or all peers in the group (default False) |
| both | update_source | string | No | Set the source interface used for peerings (default not set) |
| both | ebgp_multihop | integer | No | Increase the number of hops for eBGP peerings (2 to 255) |
| both | next_hop_self | True | No | Set the next-hop to itself for any advertised prefixes (default not set) |

inbound or outbound: Optionally set under the group or peer to filter BGP advertisements and/or manipulate BGP attributes

| Key | Value | Direction | Information |
|-----|-------|-----------|-------------|
| weight | dict | inbound | Keys are the weight and the value a list of prefixes or keyword ('any' or 'default') |
| pref | dict | inbound | Keys are the local preference and the value a list of prefixes or keyword |
| med | dict | outbound | Keys are the MED value and the values a list of prefixes or keyword |
| as_prepend | dict | outbound | Keys are the number of times to add the ASN and values a list of prefixes or keyword |
| allow | list, any, default | both | Can be a list of prefixes or a keyword to advertise just the default route or anything |
| deny | list, any, default | both | Can be a list of prefixes or a keyword to not advertise the default route or anything |

bgp.tnt_advertise: Optionally advertise prefixes on a per-tenant basis (list of VRFs) using network, summary and redistribution. The switch can be set globally for all network/summary/redist in a VRF and be overridden on an individual per-prefix basis

| Set in | Key | Value | Mand | Information |
|--------|-----|-------|------|-------------|
| tnt_advertise | name | string | Yes | A single VRF that is being advertised into (use 'global' for the global routing table) |
| all | switch | list | Yes | What switches to redistribute on, inherits process.switch if not set |
| network/summary | prefix | list | Yes | List of prefixes to advertise |
| summary | filter | summary-only | No | Only advertise the summary, suppress all prefixes within it (disabled by default) |
| redist | type | string | Yes | Redistribute ospf_process (whitespace before process), static or connected |
| redist | metric | dict | No | Add metric to redistributed prefixes. Keys are the MED value and values a list of prefixes or keyword ('any' or 'default'). Can't use metric with connected |
| redist | allow | list, any, default | No | List of prefixes (can use 'ge' and/or 'le'), interfaces (for connected) or keyword ('any' or 'default') to redistribute |
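
As a loose structural sketch only (the nesting of groups, peers and tenant advertisement here is an assumption based on the tables above; check service_route.yml for the authoritative layout):

svc_rte:
  bgp:
    group:
      - name: INET
        remote_as: 65535
        switch: [DC1-N9K-BORDER01, DC1-N9K-BORDER02]
        peer:
          - name: ISP1
            peer_ip: 10.99.99.254
            descr: UPLINK > ISP1 - Internet
    tnt_advertise:
      - name: RED
        switch: [DC1-N9K-BORDER01, DC1-N9K-BORDER02]
        network:
          prefix: [10.99.99.0/24]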

Advanced settings (svc_rte.adv) allow the changing of the default routing protocol timers and naming format of the route-maps and prefix-lists used for advertisement and redistribution.

The filter_plugin method create_svc_rte_dm is run for each inventory host to produce a data model of the routing configuration for that device. The outcome is a list of seven per-device data models that are used by the svc_rte_tmpl.j2 template.

Passwords

There are four main types of passwords used within the playbooks.

Input validation

Pre-task input validation checks are run on the variable files with the goal of highlighting any problems with the variables before any of the fabric build tasks are started: fail fast based on logic rather than failing halfway through a build. Pre-validation checks for things such as missing mandatory variables, variables being of the correct type (str, int, list, dict), IP addresses being valid, duplicate entries, dependencies (VLANs assigned but not created), etc. It won't catch everything but will eliminate a lot of the needless errors that would break a fabric build.

A combination of Python assert within a filter plugin (to identify any issues) and Ansible assert within the playbook (to return user-friendly information) is used to achieve the validation. All the error messages returned by input validation start with the nested location of the variable to make it easier to find.

It is run using the pre_val tag and will conditionally only check variable files that have been defined under var_files. It can be run using the inventory plugin but will fail if any of the values used to create the inventory are wrong, so it is better to use a dummy hosts file.

ansible-playbook playbook.yml -i hosts --tag pre_val
ansible-playbook playbook.yml -i inv_from_vars_cfg.yml --tag pre_val

A full list of what variables are checked and the expected input can be found in the header notes of the filter plugin input_validate.py.

Playbook Structure

The main playbook (PB_build_fabric.yml) is divided into 3 sections with roles used to do the data manipulation and templating

The post-validation playbook (PB_post_validate.yml) uses the validation role to do the majority of the work

Directory Structure

The directory structure is created within ~/device_configs to hold the configuration snippets, output (diff) from applied changes, validation desired_state files and compliance reports. The parent directory is deleted and re-added at each playbook run.
The base location for this directory can be changed using the ans.dir_path variable.

~/device_configs/
├── DC1-N9K-BORDER01
│   ├── config
│   │   ├── base.conf
│   │   ├── config.cfg
│   │   ├── dflt_intf.conf
│   │   ├── fabric.conf
│   │   ├── svc_intf.conf
│   │   ├── svc_rte.conf
│   │   └── svc_tnt.conf
│   └── validate
│       ├── napalm_desired_state.yml
│       └── nxos_desired_state.yml
├── diff
│   ├── DC1-N9K-BORDER01.txt
└── reports
    ├── DC1-N9K-BORDER01_compliance_report.json

Prerequisites

The deployment has been tested on NXOS 9.2(4) and NXOS 9.3(5) (in theory should be fine with 9.3(6) & 9.3(7)) using Ansible 2.10.6 and Python 3.6.9. See the Caveats section for the few nuances when running the different versions of code.

git clone https://github.com/sjhloco/build_fabric.git
mkdir ~/venv/venv_ansible2.10
python3 -m venv ~/venv/venv_ansible2.10
source ~/venv/venv_ansible2.10/bin/activate
pip install -r build_fabric/requirements.txt

Once the environment has been set up with all the packages installed, run napalm-ansible to get the location of the napalm-ansible paths and add them to ansible.cfg under [defaults].
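
The napalm-ansible command prints the two paths to add; the end result in ansible.cfg will look something like the below (the exact paths depend on where your virtual environment lives):

[defaults]
library = ~/venv/venv_ansible2.10/lib/python3.6/site-packages/napalm_ansible/modules
action_plugins = ~/venv/venv_ansible2.10/lib/python3.6/site-packages/napalm_ansible/plugins/action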

Before any configuration can be deployed using Ansible a few things need to be manually configured on all N9K devices:

interface mgmt0
  ip address 10.10.108.11/24
vrf context management
  ip route 0.0.0.0/0 10.10.108.1
feature nxapi
feature scp-server
boot nxos bootflash:/nxos.9.3.5.bin sup-1
hardware access-list tcam region racl 512
hardware access-list tcam region arp-ether 256 double-wide
copy run start
reload

The default username/password for all devices is admin/ansible and is stored in the variable bse.users.password. Swap this out for the encrypted type5 password got from the running config. The username and password used by Napalm to connect to devices is stored in ans.creds_all and will also need changing to match (is plain-text or use vault).

Before the playbook can be run the devices' SSH keys need to be added on the Ansible host. ssh_key_playbook.yml (in the ssh_keys directory) can be run to add these automatically; you just need to populate the devices' management IPs in the ssh_hosts file.

sudo apt install ssh-keyscan
ansible-playbook ssh_keys/ssh_key_add.yml -i ssh_keys/ssh_hosts

Running playbook

The device configuration is applied using Napalm with the differences always saved to ~/device_configs/diff/device_name.txt and optionally printed to screen. Napalm commit_changes is set to True, meaning that Ansible check-mode is used for dry-runs. It can take up to 6 minutes to deploy the full configuration when including the service roles, so the Napalm default timeout has been increased to 360 seconds. If it takes longer (N9Kv running 9.2(4) is very slow) Ansible will report the build as failed, but it is likely the process is still running on the device, so give it a minute and run the playbook again; it should pass with no changes needed.

Due to the declarative nature of the playbook and inheritance between roles there are only a certain number of combinations that the roles can be deployed in.

| Ansible tag | Playbook action |
|-------------|-----------------|
| pre_val | Checks that the var_file contents are of a valid format |
| bse_fbc | Generates, joins and applies the base, fabric and intf_cleanup config snippets |
| bse_fbc_tnt | Generates, joins and applies the base, fabric, intf_cleanup and tenant config snippets |
| bse_fbc_intf | Generates, joins and applies the base, fabric, tenant, interface and intf_cleanup config snippets |
| full | Generates, joins and applies the base, fabric, tenant, interface, intf_cleanup and route config snippets |
| rb | Reverses the last applied change by deploying the rollback configuration (rollback_config.txt) |
| diff | Prints the differences between the current_config (on the device) and desired_config (applied by Napalm) to screen |

pre-validation: Validates the contents of variable files defined under var_files. Best to use a dummy hosts file instead of the dynamic inventory
ansible-playbook PB_build_fabric.yml -i hosts --tag pre_val

Generate the complete config: Creates config snippets, assembles them in config.cfg, compares against device config and prints the diff
ansible-playbook PB_build_fabric.yml -i inv_from_vars_cfg.yml --tag 'full, diff' -C

Apply the config: Replaces current config on the device with changes made automatically saved to ~/device_configs/diff/device_name.txt
ansible-playbook PB_build_fabric.yml -i inv_from_vars_cfg.yml --tag full

All roles can be deployed individually to just create the config snippet files; no connections are made to devices or changes applied. The merge tag can be used in conjunction with any combination of these role tags to non-declaratively merge the config snippets with the current device config rather than replacing it. As the L3VNIs and interfaces are generated automatically, at a bare minimum the variable files will still need the current tenants and interfaces as well as the advanced variable sections.

| Ansible tag | Playbook action |
|-------------|-----------------|
| bse | Generates the base configuration snippet saved to device_name/config/base.conf |
| fbc | Generates the fabric and intf_cleanup configuration snippets saved to fabric.conf and dflt_intf.conf |
| tnt | Generates the tenant configuration snippet saved to device_name/config/svc_tnt.conf |
| intf | Generates the interface configuration snippet saved to device_name/config/svc_intf.conf |
| rte | Generates the route configuration snippet saved to device_name/config/svc_rte.conf |
| merge | Non-declaratively merges the new and current config, can be run with any combination of role tags |

Generate the fabric config: Creates the fabric and interface cleanup config snippets and saves them to fabric.conf and dflt_intf.conf
ansible-playbook PB_build_fabric.yml -i inv_from_vars_cfg.yml --tag fbc

Apply tenants and interfaces non-declaratively: Add additional tenant and routing objects by merging their config snippets with the device's config. The diffs for merges are simply the lines in the merge candidate config so won't be as true as the diffs from declarative deployments
ansible-playbook PB_build_fabric.yml -i inv_from_vars_cfg.yml --tag tnt,rte,merge,diff

Post Validation checks

A declaration of how the fabric should be built (desired_state) is created from the values of the variable files and validated against the actual_state. napalm_validate can only perform a compliance check against anything it has a getter for; for anything not covered by this the custom_validate filter plugin is used. This plugin uses the same napalm_validate framework but the actual state is supplied through a static input file (got using napalm_cli) rather than a getter. Both validation engines are within the same validate role with separate template and task files.

The results of the napalm_validate (nap_val.yml) and custom_validate (cus_val.yml) tasks are joined together to create the one combined compliance report. Each getter or command has a complies dictionary (True or False) to report its state which feeds into the compliance report's overall complies dictionary. It is based on this value that a task in the post-validation playbook will raise an exception.

napalm_validate

As Napalm is vendor agnostic the jinja template file used to create the validation file is the same for all vendors. The following elements are validated by napalm_validate with the roles being validated in brackets.

An example of the desired and actual state file formats.

- get_bgp_neighbors:
    global:
      router_id: 192.168.101.16
      peers:
        _mode: strict
        192.168.101.11:
          is_enabled: true
          is_up: true

custom_validate

custom_validate requires a per-OS-type template file and per-OS-type method within the custom_validate.py filter_plugin. The command output is collected in JSON format using napalm_cli, passed through the nxos_dm method to create a new actual_state data model and, along with the desired_state, is fed into napalm_validate using the compliance_report method.

The following elements are validated by custom_validate with the roles being validated in brackets.

An example of the desired and actual state file formats

cmds:
  - show ip ospf neighbors detail:
      192.168.101.11:
        state: FULL
      192.168.101.12:
        state: FULL
      192.168.101.22:
        state: FULL

To aid with creating new validations the custom_val_builder directory is a stripped down version of custom_validate to use when building new validations. The README has more detail on how to run it, the idea being to walk through each stage of creating the desired and actual state ready to add to the validate roles.

Running Post-validation

Post-validation is hierarchical as the addition of elements in the later roles affects the validation outputs of the earlier roles. For example, extra VLANs added in tenant_service will affect the bse_fbc post-validate output of show vpc (peer-link_vlans). For this reason post-validation must be run for the current role and all applied roles before it. This is done automatically by Jinja template inheritance as calling a template with the extends statement will also render the inheriting templates.

| Ansible tag | Playbook action |
|-------------|-----------------|
| bse_fbc | Validates the configuration applied by the base and fabric roles |
| bse_fbc_tnt | Validates the configuration applied by the base, fabric and tenant roles |
| bse_fbc_tnt_intf | Validates the configuration applied by the base, fabric, tenant and interfaces roles |
| full | Validates the configuration applied by the base, fabric, tenant, interfaces and route roles |

Run fabric validation: Runs validation against the desired state got from all the variable files. There is no differentiation between napalm_validate and custom_validate, both are run as part of the validation tasks
ansible-playbook PB_post_validate.yml -i inv_from_vars_cfg.yml --tag full

Viewing compliance report: When viewing the validation report piping it through json.tool makes it more human readable
cat ~/device_configs/reports/DC1-N9K-SPINE01_compliance_report.json | python -m json.tool

Caveats

When starting this project I used N9Kv on EVE-NG and later moved onto physical devices when we were deploying the data centers. vPC fabric peering does not work on the virtual devices so this was never added as an option in the playbook.

As deployments are declarative and there are differences with physical devices, you will need a few minor tweaks to the bse_tmpl.j2 template as different hardware can have slightly different hidden base commands. An example is the command system nve infra-vlans, which is required on physical devices (the command doesn't exist on N9Kv) in order to use an SVI as an underlay interface (one that forwards/originates VXLAN-encapsulated traffic). Therefore on physical devices unhash this line in bse_tmpl.j2; it is used for the OSPF peering over the vPC link (VLAN2).

{# system nve infra-vlans {{ fbc.adv.mlag.peer_vlan }} #}

The same applies for NXOS versions, it is only the base commands that will change (feature commands stay the same across versions), so if statements are used in bse_tmpl.j2 based on the bse.adv.image variable.

Although N9Kvs work on EVE-NG, it is not perfect for running them. I originally started on nxos.9.2.4 and although it is fairly stable in terms of features and uptime, the API can be very slow at times, taking up to 10 minutes to deploy a device config. Sometimes after a deployment the API would stop responding (couldn't telnet on 443) even though NXOS CLI said it was listening. To fix this you have to disable and re-enable the nxapi feature. Removing the command nxapi use-vrf management seems to have helped to make the API more stable.

I moved on to NXOS nxos.9.3.5 and although the API is faster and more stable, there is a different issue around the interface module. When the N9Kv went to 9.3 the interfaces were moved to a separate module.

Mod Ports             Module-Type                      Model           Status
--- ----- ------------------------------------- --------------------- ---------
1    64   Nexus 9000v 64 port Ethernet Module   N9K-X9364v            ok
27   0    Virtual Supervisor Module             N9K-vSUP              active *

With 9.3(5), 9.3(6) and 9.3(7) on EVE-NG up to 5 or 6 N9Ks is fine, however when you add any more N9Ks (other device types are fine) things start to become unstable. New devices take an age to boot up and when they do their interface linecards normally fail and go into the pwr-cycld state.

Mod Ports             Module-Type                      Model           Status
--- ----- ------------------------------------- --------------------- ---------
1    64   Nexus 9000v 64 port Ethernet Module                         pwr-cycld
27   0    Virtual Supervisor Module             N9K-vSUP              active *

Mod  Power-Status  Reason
---  ------------  ---------------------------
1    pwr-cycld      Unknown. Issue show system reset mod ...

This in turn makes other N9Ks unstable, some freezing and others randomly having the same linecard issue. Rebooting sometimes fixes it but due to the load times it is unworkable. I have not been able to find a reason for this, it doesn't seem to be related to resources for either the virtual device or the EVE-NG box.

On N9Kv 9.2(4) there is a bug whereby you can't have '>' in the name of the prefix-list in the route-map match statement. This name is set in the service_route.yml variables svc_rte.adv.pl_name and svc_rte.adv.pl_metric_name. The problem has been fixed in 9.3.

DC1-N9K-BGW01(config-route-map)# match ip address prefix-list PL_OSPF_BLU100->BGP_BLU
Error: CLI DN creation failed substituting values. Path sys/rpm/rtmap-[RM_OSPF_BLU100-BGP_BLU]/ent-10/mrtdst/rsrtDstAtt-[sys/rpm/pfxlistv4-[PL_OSPF_BLU100->BGP_BLU]]

If you are running these playbooks on MAC you may get the following error when running post-validations:

objc[29159]: +[__NSPlaceholderDictionary initialize] may have been in progress in another thread when fork() was called.
objc[29159]: +[__NSPlaceholderDictionary initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.

It is the same behaviour as an older Ansible bug; the solution of adding export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES before running the post-validation playbook solved it for me.