Friday, 26 June 2020

Doing YANG Wrong: Part 3 - Using the python bindings

Part 3: Using the python bindings to push a config

Given we generated that Python file locally in the previous part, I'll assume here that you are still in that subdirectory.

The below code was stolen fully from the YANG Book. Much love and all credit to them for their work.

from interface_setup import openconfig_interfaces
from pyangbind.lib.serialise import pybindIETFXMLEncoder
from ncclient import manager

# device settings
username = 'yangconf'
password = 'my_good_password.'
device_ip = '192.168.70.21'

# config settings
inside_interface = 'GigabitEthernet4'
inside_ip_addr = '109.109.109.2'
inside_ip_prefix = 24


def send_to_device(**kwargs):
    # wrap the serialised model in a <config> element for the edit-config RPC
    rpc_body = '<config>' + pybindIETFXMLEncoder.serialise(kwargs['py_obj']) + '</config>'
    with manager.connect_ssh(host=kwargs['dev'], port=830, username=kwargs['user'],
                             password=kwargs['password'], hostkey_verify=False) as m:
        try:
            m.edit_config(target='running', config=rpc_body)
            print('Successfully configured IP on {}'.format(kwargs['dev']))
        except Exception as e:
            print('Failed to configure interface: {}'.format(e))


if __name__ == '__main__':
    # instantiate the openconfig model
    ocintmodel = openconfig_interfaces()
    # create an instance of the interfaces container
    ocinterfaces = ocintmodel.interfaces
    # create a new interface entry in that parent object
    inside_if = ocinterfaces.interface.add(inside_interface)
    # even a routed interface requires a subinterface, it's just at index 0
    inside_if.subinterfaces.subinterface.add(0)
    # grab a reference to that subinterface object to edit
    inside_sub_if = inside_if.subinterfaces.subinterface[0]
    # apply an IP to that object
    inside_sub_if.ipv4.addresses.address.add(inside_ip_addr)
    # read that entry back into an ip object
    ip = inside_sub_if.ipv4.addresses.address[inside_ip_addr]
    # set the IP and the prefix length properly
    ip.config.ip = inside_ip_addr
    ip.config.prefix_length = inside_ip_prefix
    send_to_device(dev=device_ip, user=username, password=password, py_obj=ocinterfaces)


When I run this, it fails.

<pyangbind.lib.yangtypes.YANGBaseClass object at 0x7f3e5fac6170>
<pyangbind.lib.yangtypes.YANGBaseClass object at 0x7f3e5f9a69e0>
Traceback (most recent call last):
  File "<stdin>", line 19, in <module>
  File "<stdin>", line 2, in send_to_device
NameError: global name 'pybindIETFXMLEncoder' is not defined


Turns out the serialiser that the book code uses relies on a library function that isn't in the pip version (0.8.1) of pyangbind. A bit of googling says to build from the github repo here. Visiting that repo, alarm bells are ringing - the last commits are 2 years old, and somehow the pip version is still out of date? Why? Anyways.
pip install --upgrade git+https://github.com/robshakir/pyangbind.git
...

python ./push_inside_if.py 
Traceback (most recent call last): 
  File "./push_inside_if.py", line 43, in <module>
    send_to_device(dev=device_ip, user=username, password=password, py_obj=ocinterfaces)
  File "./push_inside_if.py", line 16, in send_to_device 
    rpc_body = '<config>' + pybindIETFXMLEncoder.serialise(kwargs['py_obj']) + '</config>' 
  File "/home/gns3/.local/lib/python2.7/site-packages/pyangbind/lib/serialise.py", line 380, in serialise
    doc = cls.encode(obj, filter=filter)
  File "/home/gns3/.local/lib/python2.7/site-packages/pyangbind/lib/serialise.py", line 375, in encode 
    return cls.generate_xml_tree(obj._yang_name, obj._yang_namespace, preprocessed) 
AttributeError: 'YANGBaseClass' object has no attribute '_yang_namespace'
What fresh hell is this?

So a github issue tells us that we generated the binding against the old version of pyangbind, which means we have to redo our export for the ENV var and then rebuild the Python module...
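For reference, the ENV var in question points pyang at the pyangbind plugin directory; per the pyangbind README it is set with something like this (one line):

export PYBINDPLUGIN=`/usr/bin/env python -c 'import pyangbind; import os; print("{}/plugin".format(os.path.dirname(pyangbind.__file__)))'`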

pyang --plugindir $PYBINDPLUGIN -f pybind -o interface_setup.py *.yang                          
[email protected]05.yang:346: warning: node "openconfig-interfaces::state" is config false and is not part of the accessible tree                          
[email protected]:84: warning: the escape sequence "\." is unsafe in double quoted strings - pass the flag --lax-quote-checks to avoid this warning  
[email protected]:100: warning: the escape sequence "\." is unsafe in double quoted strings - pass the flag --lax-quote-checks to avoid this warning 
[email protected]:102: warning: the escape sequence "\*" is unsafe in double quoted strings - pass the flag --lax-quote-checks to avoid this warning 
[email protected]:121: warning: the escape sequence "\." is unsafe in double quoted strings - pass the flag --lax-quote-checks to avoid this warning 
[email protected]:123: warning: the escape sequence "\." is unsafe in double quoted strings - pass the flag --lax-quote-checks to avoid this warning 
[email protected]:125: warning: the escape sequence "\*" is unsafe in double quoted strings - pass the flag --lax-quote-checks to avoid this warning 
[email protected]:130: warning: the escape sequence "\*" is unsafe in double quoted strings - pass the flag --lax-quote-checks to avoid this warning 
[email protected]:131: warning: the escape sequence "\." is unsafe in double quoted strings - pass the flag --lax-quote-checks to avoid this warning 
[email protected]:133: warning: the escape sequence "\." is unsafe in double quoted strings - pass the flag --lax-quote-checks to avoid this warning 
INFO: encountered (<pyang.error.Position object at 0x7fdbf28fb820>, 'XPATH_REF_CONFIG_FALSE', (u'openconfig-interfaces', u'state'))                                      
FATAL: pyangbind cannot build module that pyang has found errors with.

Oh my word - so much rage.

I tried the --lax-quote-checks flag and that didn't work, so I edited each of those lines in the openconfig-vlan YANG file to swap the double quotes around the regexes to single quotes. These warnings went away.
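For anyone following along, the edit is of this flavour (an illustrative pattern, not the exact lines from the openconfig-vlan file). In YANG, double-quoted strings get backslash-escape processing, which is what pyang is complaining about; single-quoted strings are taken literally, so the regex survives intact and the warning goes away:

// before - pyang warns about the unsafe escape sequences
pattern "10\.0\.0\.\*";
// after - single quotes, no escape processing, no warning
pattern '10\.0\.0\.\*';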

pyang --plugindir $PYBINDPLUGIN -f pybind -o interface_setup.py *.yang
[email protected]:346: warning: node "openconfig-interfaces::state" is config false and is not part of the accessible tree
INFO: encountered (<pyang.error.Position object at 0x7fe9b551f2d0>, 'XPATH_REF_CONFIG_FALSE', (u'openconfig-interfaces', u'state'))
FATAL: pyangbind cannot build module that pyang has found errors with.
This one had me stumped. Google had nothing. I was going around in circles until I broke the cycle by working on my laptop instead of my workstation. During the first-time setup of the tools I found myself looking at all the repos again on github, and so I thought I would take a look at the blame on the affected file here. The error stood out like a sore thumb.

In my downloaded model it referred to oc-if:state, and in the repo model it referred to oc-if:config. The error now stands to reason, since the state tree is more for telemetry - it's a read-only view of the interface state, not the config. I edited the field and we now have a compiled module again.

Back to running the script...
python push_inside_if.py
Failed to configure interface: expected tag: name, got tag: subinterfaces
WAT? Let's dump out what we generated prior to sending...

We add print(pybindIETFXMLEncoder.serialise(ocinterfaces)) just above the send_to_device call, and then run again.
python push_inside_if.py

<interfaces xmlns="http://openconfig.net/yang/interfaces">
  <interface>
    <subinterfaces>
      <subinterface>
        <ipv4 xmlns="http://openconfig.net/yang/interfaces/ip">
          <addresses>
            <address>
              <config>
                <prefix-length>24</prefix-length>
              </config>
              <ip>109.109.109.2</ip>
            </address>
          </addresses>
        </ipv4>
        <index>0</index>
        <config>
          <description>Inside IP Address</description>
        </config>
      </subinterface>
    </subinterfaces>
    <config>
      <enabled>true</enabled>
      <description>Inside Interface</description>
    </config>
    <name>GigabitEthernet4</name>
  </interface>
</interfaces>
Well, it looks correct, but maybe it doesn't like the fact that the name tag is at the bottom? Seems like a dumb complaint to have - it's a machine-readable structure and the positioning in that structure is technically accurate (it's in the correct layer of the XML?)

Only way to prove this is to make a manual copy of this as a string var and then push it directly instead of rendering it with this tool.

First I comment out the existing rendering of the rpc_body in the function so it just uses kwargs['py_obj'] verbatim (I provide valid XML in my string), and then I make a multiline string in the main function with a human-ordered XML envelope, as sketched below.
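For clarity, the temporary hack looks roughly like this (a sketch, abridged - the full hand-ordered XML is the 'ordered' blob printed further down):

# in send_to_device: bypass the serialiser and push whatever string we were given
# rpc_body = '<config>' + pybindIETFXMLEncoder.serialise(kwargs['py_obj']) + '</config>'
rpc_body = '<config>' + kwargs['py_obj'] + '</config>'

# in __main__: a hand-ordered copy of the generated XML (abridged here), pushed verbatim
ordered_xml = '''<interfaces xmlns="http://openconfig.net/yang/interfaces">
  <interface>
    <name>GigabitEthernet4</name>
    <config>
      <enabled>true</enabled>
      <description>Inside Interface</description>
    </config>
    <subinterfaces>
      ...
    </subinterfaces>
  </interface>
</interfaces>'''
send_to_device(dev=device_ip, user=username, password=password, py_obj=ordered_xml)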

python push_inside_if.py 

original
<interfaces xmlns="http://openconfig.net/yang/interfaces">
  <interface>
    <subinterfaces>
      <subinterface>
        <ipv4 xmlns="http://openconfig.net/yang/interfaces/ip">
          <addresses>
            <address>
              <config>
                <prefix-length>24</prefix-length>
              </config>
              <ip>109.109.109.2</ip>
            </address>
          </addresses>
        </ipv4>
        <index>0</index>
        <config>
          <description>Inside IP Address</description>
        </config>
      </subinterface>
    </subinterfaces>
    <config>
      <enabled>true</enabled>
      <description>Inside Interface</description>
    </config>
    <name>GigabitEthernet4</name>
  </interface>
</interfaces>


ordered
<interfaces xmlns="http://openconfig.net/yang/interfaces">
  <interface>
    <name>GigabitEthernet4</name>
    <config>
      <enabled>true</enabled>
      <description>Inside Interface</description>
    </config>
    <subinterfaces>
      <subinterface>
        <index>0</index>
        <config>
          <description>Inside IP Address</description>
        </config>
        <ipv4 xmlns="http://openconfig.net/yang/interfaces/ip">
          <addresses>
            <address>
              <ip>109.109.109.2</ip>
              <config>
                <prefix-length>24</prefix-length>
              </config>
            </address>
          </addresses>
        </ipv4>
      </subinterface>
    </subinterfaces>
  </interface>
</interfaces>
  
Successfully configured IP on 192.168.70.21


Ugh. That's so lame. Clearly the problem here is the XML Serialiser is not rendering the objects in an order that the netconf agent on the CSR likes. Kill me now.

But wait. It gets better.

Having returned to my workstation, I decide to send commands straight from VSCode to the GNS3-simulated CSR via the GNS3-simulated Ubuntu box, with a simple NAT on the Ubuntu VM:
sudo sysctl -w net.ipv4.ip_forward=1
sudo iptables -t nat -A POSTROUTING -o ens3 -j MASQUERADE
I then fire up VSCode, pull in the changes from the laptop via my git repo, and fire off the request as-is to see what's what.
python3 ./models/interface/push_inside_if.py
Successfully configured IP on 192.168.70.21
Eh? This is not the hand-ordered code; this is the standard generated XML blob. The only difference is that Python 3.8 is the default on my workstation.

If nothing else, what this has taught me is that when it comes to YANG modelling in Python, environment matters - a lot. I get the feeling this is also why the developer of pyangbind let it die on the vine a bit, since moving over to Golang in his day job probably translates better to this use case as well. Golang, for the uninitiated, generates C-like (in speed and performance) binaries that are all-inclusive - no dependencies, no libraries. Build an app in Go, and it's ready to rock and roll anywhere.

At this point, I have been able to build a model saying what I want, and push it to the box, and it "made it so". What happens if I make a change out of band and then push something back to the box?

in1rt001#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
in1rt001(config)#int gi4
in1rt001(config-if)#ip address 109.109.109.3 255.255.254.0
in1rt001(config-if)#^Z

in1rt001#sh run int gi 4 
Building configuration...

Current configuration : 203 bytes
!
interface GigabitEthernet4
 description Inside IP Address
 ip address 109.109.111.3 255.255.254.0 secondary
 ip address 109.109.109.3 255.255.254.0
 negotiation auto
 no mop enabled
 no mop sysid
end

So I hacked up the subnet mask, oh and btw there is a secondary IP there too...

python3 ./models/interface/push_inside_if.py
Failed to configure interface: /native/interface/GigabitEthernet[name='4']/ip/address/secondary[address='109.109.109.2']/secondary is not configured
woooommmmpp whomp...

Maybe I need to make sure that secondary isn't confusing things?

in1rt001#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
in1rt001(config)#int gi 4
in1rt001(config-if)#no  ip address 109.109.110.3 255.255.254.0 secondary
in1rt001(config-if)#^Z
in1rt001#sh run int gi 4
Building configuration...

Current configuration : 153 bytes
!
interface GigabitEthernet4
 description Inside IP Address
 ip address 109.109.109.3 255.255.254.0
 negotiation auto
 no mop enabled
 no mop sysid
end

Try again then...
python3 ./models/interface/push_inside_if.py
Failed to configure interface: /native/interface/GigabitEthernet[name='4']/ip/address/secondary[address='109.109.109.2']/secondary is not configured
 big fat nope.

At this point, I think we need to consider the use of config candidates and the push many, apply once concept. Time for a new post...
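(As a teaser, the rough shape of that with ncclient is something like the below - assuming the target box advertises the :candidate capability in its hello; the payload variables are just placeholders for rendered <config> bodies.)

from ncclient import manager

# minimal sketch of the stage-then-commit flow against the candidate datastore
interface_cfg = '<config>...</config>'   # placeholder - e.g. rendered via pybindIETFXMLEncoder
bgp_cfg = '<config>...</config>'         # placeholder - another staged change

with manager.connect_ssh(host=device_ip, port=830, username=username,
                         password=password, hostkey_verify=False) as m:
    # stage several edits without touching the running config...
    m.edit_config(target='candidate', config=interface_cfg)
    m.edit_config(target='candidate', config=bgp_cfg)
    # ...then apply them all in one go, or m.discard_changes() to throw them away
    m.commit()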

Doing YANG Wrong: Part 2 - Python Bindings

Part 2: Python bindings for models

Having a model is one thing.  Using it requires you to ingest that model somewhere, apply values to its elements (leaves) as necessary/appropriate and then submit that completed model to the appliance for application to the config.

I'm a Python guy; you might like Go, or Ruby, or whatever. That's up to you, but I use Python right now, which means I use pyangbind and pyang to create pythonic modules I can import into a script, and then interact with the model attributes like I would any other object in Python. We can then push that out to a device from that script.

I will assume you have pyangbind working, if not, use the first few steps from here.

So, using the collection of model files we extracted for openconfig interfaces on our CSR1000v, let's try to make a Python module for interacting with this model set.

pyang --plugindir $PYBINDPLUGIN -f pybind -o interface_setup.py *.yang

[email protected]:13: warning: imported module iana-if-type not used
[email protected]:186: warning: prefix "ianaift" is not defined
[email protected]:12: warning: imported module iana-if-type not used
[email protected]:346: warning: prefix "ianaift" is not defined
[email protected]:15: warning: imported module iana-if-type not used
[email protected]:370: warning: prefix "ianaift" is not defined
INFO: encountered (<pyang.error.Position object at 0x7f04005cc9d0>, 'UNUSED_IMPORT', u'iana-if-type')
INFO: encountered (<pyang.error.Position object at 0x7f0400519a90>, 'WPREFIX_NOT_DEFINED', u'ianaift')
FATAL: pyangbind cannot build module that pyang has found errors with.
Ugh. So we have an import somewhere in the model that isn't actually needed, and another one that isn't actually defined. End result: bad dog - no biscuit.

This one had me up against a wall for ages. I tried a pyang flag called --yang-remove-unused-imports, but that didn't work either.

I then decided to look at these warnings properly. Turns out the iana-if-type module is imported with the prefix "ift", but later on it is referred to as "ianaift". Someone changed one part of the module, but not the other. In other words, the module on the CSR is technically broken.


I changed the line in the three affected openconfig modules to
import iana-if-type { prefix ianaift; }
and boom, we have a Python module.

Now that we have that basic module in place, we can build a script to deliver our first two requirements: an interface and an IP.
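Before writing the full script, a quick sanity check that the generated binding actually imports is worth doing - something like this, from the same directory:

python -c 'from interface_setup import openconfig_interfaces; print(openconfig_interfaces().interfaces)'

If that comes back without a traceback, the binding is usable and we can get on with the real script in Part 3.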


Thursday, 25 June 2020

Doing YANG Wrong - Part 1: Getting started

This is more of a discovery post than anything particularly useful. It charts my adventure from a problem statement, through discovery of a solution, to a dead end. I post it here to help people see the logic I followed, and why, against all wisdom, it didn't work.

Note that if you are familiar with YANG, I know why this is NOT the right way to do this, but I only know that having done this as shown here. Please don't waste your time in the comments with nerdsplaining that to me. This is to show people a process and to help people who feel stupid when this stuff doesnt work, to see that they're not alone. I learn by making mistakes.

Firstly, the build environment. This is all done in GNS3.



Up top, we have a management switch (dumb GNS3 default L2 device), all ports in a single VLAN and a management host. This is an Ubuntu 18.04 VMDK with two NICs, one on the MGMT switch, and one attached to a NAT bubble on my host. The NAT NIC is DHCP with default route enabled and the other NIC is in the MGMT L3 as .254 with no routes obvs.

Underneath I have a Cumulus VX node for backend routing on the lower left, an Arista vEOS for switching and a bit of VRF local routing tests in the middle, and the frontend on the right is a CSR1000v. We then pretend to have a BGP session with a supplier on the far right. This I manually configured to just default-originate for now. That internet cloud is just an image... Each box has an interface in the MGMT L2 configured into a MGMT VRF on that box, so we don't taint any default FIB/RIB stuff we do with the BGP processes.

The idea is that the supplier sends a default route to the front end via eBGP, and all frontends (if we had lots) peer with the backends so they get all the paths they need to make a route calculation decision. The workloads live behind these backends. The switch in the middle only does L2 work in the primary use case. The idea of putting an Arista box in was to try out the VRF capabilities and maybe try to use VRFs as a sort of built-in backend tier, eliminating the VMs in that backend. More on that in another post, I'm sure.

So the aim is to configure the CSR1000v interface that faces the vendor from nothing to everything. For reasons it's not worth going into in detail yet, I will try to use openconfig models for everything as well.

I should need, at the least, the following:
  1. interface config
  2. interface IP address
  3. prefix-list outbound
  4. as-path filter inbound
  5. bgp peer-group
  6. bgp peer

Part 1: The fetching of the models

I started with what I picked up from the YANG book. I can use netconf-console to query the supported models from the hello packet, fetch these down to my machine and then render them on the screen with pyang. This sort of worked. I SSHed from my machine into the management box in the simulation first, then installed all the tools I knew I needed.


sudo apt install python-pip python3-pip build-essential libncurses5-dev xml-twig-tools
pip install lxml netconf-console pyang pyangbind

(Depending on what base OS you have, you might need other things; Google can help you.)

I setup netconf-yang on the CSR1000v as well.

aaa new-model
aaa authentication login default local
aaa authentication enable default none
aaa authorization console
aaa authorization exec default local
aaa authorization network default local

username yangconf priv 15 sec 0 my_good_password.

ip vrf mgmt
 description management
 rd 900:1

interface GigabitEthernet1
 ip vrf forwarding mgmt
 ip address 192.168.70.21 255.255.255.0

netconf-yang

To then validate that the system was working, I used netconf-console to query the capabilities from the CSR.

netconf-console --host 192.168.70.21 --port 830 -u yangconf -p my_good_password. --hello
<bla bla bla bla>
     <nc:capability>urn:ietf:params:xml:ns:yang:smiv2:CISCO-RF-MIB?module=CISCO-RF-MIB&amp;revision=2005-09-01</nc:capability>
    <nc:capability>urn:ietf:params:xml:ns:yang:smiv2:CISCO-VLAN-MEMBERSHIP-MIB?module=CISCO-VLAN-MEMBERSHIP-MIB&amp;revision=2007-12-14</nc:capability>
    <nc:capability>urn:ietf:params:xml:ns:yang:smiv2:BRIDGE-MIB?module=BRIDGE-MIB&amp;revision=2005-09-19</nc:capability>
    <nc:capability>urn:ietf:params:xml:ns:yang:smiv2:CISCO-IP-TAP-MIB?module=CISCO-IP-TAP-MIB&amp;revision=2004-03-11</nc:capability>
  </nc:capabilities>
</nc:hello>
Yay. So what I need to do now is find a model that will let me do an interface IP (part 1 of my problem). The easy way to make sure you have a supported model is to filter the hello.

netconf-console --host 192.168.70.21 --port 830 -u yangconf -p my_good_password. --hello | grep openconfig | grep ip

    <nc:capability>http://openconfig.net/yang/interfaces/ip?module=openconfig-if-ip&amp;revision=2018-01-05&amp;deviations=cisco-xe-openconfig-if-ip-deviation,cisco-xe-openconfig-interfaces-deviation</nc:capability>
    <nc:capability>http://openconfig.net/yang/cisco-xe-openconfig-if-ip-deviation?module=cisco-xe-openconfig-if-ip-deviation&amp;revision=2017-03-04</nc:capability>
    <nc:capability>http://openconfig.net/yang/interfaces/ip-ext?module=openconfig-if-ip-ext&amp;revision=2018-01-05</nc:capability>

So we can see there is a module called openconfig-if-ip. Let's pull it down. We learned in the book that we can read that model from the box, pipe it through an XML extractor and then redirect that model to a file on our machine like this:

netconf-console --host 192.168.70.21 --port 830 -u yangconf -p my_good_password. --get-schema openconfig-if-ip | xml_grep 'data' --text_only > openconfig-if-ip@2018-01-05.yang
You can see we took the module name and put it into the --get-schema argument and also into the prefix of the redirected output, and we appended @2018-01-05 to the redirected filename to denote the revision the device is using. This is a good habit to get into, since these models update a lot and you can track which ones you are using where.

Note also, there is a deviation listed for this module. The deviations could be additional things added by the vendor in question or things replaced by the vendor. I am going to avoid this until I see I really need it.

I can now visualise this model file on my terminal with pyang.

pyang -f tree openconfig-if-ip@2018-01-05.yang

openconfig-if-ip@2018-01-05.yang:11: error: module "openconfig-inet-types" not found in search path
openconfig-if-ip@2018-01-05.yang:12: error: module "openconfig-interfaces" not found in search path
openconfig-if-ip@2018-01-05.yang:13: error: module "openconfig-vlan" not found in search path
openconfig-if-ip@2018-01-05.yang:13: warning: imported module openconfig-vlan not used
openconfig-if-ip@2018-01-05.yang:14: error: module "openconfig-yang-types" not found in search path
openconfig-if-ip@2018-01-05.yang:15: error: module "openconfig-extensions" not found in search path
Ok. No I can't. This module refers to other modules that I haven't fetched yet. Off we go to find them in the hello. Repeat the ... grep openconfig | grep thing command, replacing thing with the missing module name, then run the same --get-schema with the correct name and revision id for your device.

netconf-console --host 192.168.70.21 --port 830 -u yangconf -p my_good_password. --hello | grep openconfig | grep inet-types
    <nc:capability>http://openconfig.net/yang/types/inet?module=openconfig-inet-types&amp;revision=2017-08-24</nc:capability>
netconf-console --host 192.168.70.21 --port 830 -u yangconf -p my_good_password. --get-schema openconfig-inet-types | xml_grep 'data' --text_only > openconfig-inet-types@2017-08-24.yang
Rinse and repeat until you get a full model on screen. Realise that as you pull a new dependent model in, you might grow new dependencies.
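If the dependency chase gets tedious, a throwaway loop along these lines saves some typing - entirely optional, and the module list below is just an example; swap in whatever pyang says is missing. The revision is scraped out of the hello so each file gets the right @revision suffix:

for mod in openconfig-yang-types openconfig-types openconfig-extensions; do
  rev=$(netconf-console --host 192.168.70.21 --port 830 -u yangconf -p my_good_password. --hello \
        | grep "module=$mod&" | sed 's/.*revision=\([0-9-]*\).*/\1/')
  netconf-console --host 192.168.70.21 --port 830 -u yangconf -p my_good_password. \
    --get-schema $mod | xml_grep 'data' --text_only > "$mod@$rev.yang"
done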

Weirdly, I ran the pyang -f tree command after fetching this inet-types module, and it looked like it worked.

pyang -f tree *.yang

module: openconfig-interfaces
   +--rw interfaces
      +--rw interface* [name]
         +--rw name             -> ../config/name
         +--rw config
         |  +--rw name?            string
         |  +--rw type             identityref
         |  +--rw mtu?             uint16
         |  +--rw loopback-mode?   boolean
         |  +--rw description?     string
         |  +--rw enabled?         boolean
<snip>
Notice, however, that the first module rendered is now openconfig-interfaces. If you scroll down you will see the openconfig-vlan one is still missing and firing an error later on...
pyang -f tree * | grep error:
[email protected]:7: error: module "openconfig-extensions" not found in search path
[email protected]:11: error: module "ietf-interfaces" not found in search path
[email protected]:12: error: module "openconfig-yang-types" not found in search path
[email protected]:13: error: module "openconfig-types" not found in search path
[email protected]:11: error: module "openconfig-vlan-types" not found in search path
[email protected]:13: error: module "openconfig-if-ethernet" not found in search path
[email protected]:14: error: module "openconfig-if-aggregate" not found in search path
[email protected]:15: error: module "iana-if-type" not found in search path
[email protected]:15: warning: imported module iana-if-type not used
[email protected]:370: warning: prefix "ianaift" is not defined
This is getting mental... Let's follow along though. After fetching all these openconfig ones, I still have some ietf/iana ones outstanding...

pyang -f tree * | grep error:
[email protected]:13: warning: imported module iana-if-type not used
[email protected]:186: warning: prefix "ianaift" is not defined
[email protected]:12: error: module "iana-if-type" not found in search path
[email protected]:12: warning: imported module iana-if-type not used
[email protected]:346: warning: prefix "ianaift" is not defined
[email protected]:11: error: module "ietf-interfaces" not found in search path
[email protected]:15: warning: imported module iana-if-type not used
[email protected]:370: warning: prefix "ianaift" is not defined
Let's tweak our capabilities filter and get them too:

.
├── [email protected]
├── ietf-interfaces[email protected]
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
└── [email protected]
We should now be able to render the various bits and bobs we need to show us a full model without errors.



Yay. This is now a completed model we could use to make a request to this CSR1000V to deploy something.

Monday, 22 June 2020

Network Automation: from spreadsheets to YANG and everything between

Over the last few years I have spent a significant amount of time in YAML and Ansible. I'm not an expert by any means, but I can probably get anything I need done in maybe a few days worth of tinker time. I'm what you would call a fair weather scripter.

One thing I learned from building about 30,000 lines of YAML whilst orchestrating ACI, was that Ansible is a square peg, and everything networking is a round hole. Sometimes it fits, sometimes it doesn't, but it is never perfect. Never.

This has led some in the trade to poopoo Ansible as a networking automation tool. That in my opinion is vastly over-dramatic and in many cases counter productive.

The saying goes, "don't let perfect be the enemy of good." Ansible is good. Its not perfect, but who is?

When someone asks me what should they learn as a networking person looking to grow, my first answer is 100% Ansible. It works in most of the major platforms and it can help you deliver changes in a structured way, with minimum inputs and usually a reliable output. That is then closely followed by some sort of CI Pipelining tool. Gitlab makes sense cos you can deploy it onprem easily enough and not have to worry about "data protection" by accidental repo exposure. You can just as easily use a hosted version or github even now they changed their billing model, but ultimately you probably don't want to be installing Jenkins or Concourse as a relative beginner.

So, as someone who isn't a relative beginner, why start a post about Ansible with a subject of YANG? Because it's about progression - both mine and the industry's.

When I started on the DC automation drive about 3 years ago, our active platform of the time (ACI) was in v2, and the REST API was pretty well documented. There was also a lot of information available on the internal Object Model, and some Python modules that allowed you to interact with that directly. Ansible was nowhere in networking really, and most of the Cisco SEs were using CSV files and Postman to build fabrics. I spoke to a couple about what I was thinking to do, and they all said the CSV/Postman thing is a bootstrap tool - it stands up a fabric to be managed independently thereafter.

I decided I knew best (thanks, Dunning-Kruger) and spent about 3 months taking the CSV file and Postman approach, to initial great success, before begrudgingly realising that it was absurd to try and make small additive changes in this way. I then started clicking the "generate code" button in Postman to dump out python scripts that would complete each task, adapting the computed output into something more generic, first using command line args and then later pulling in the variables from a SQLite DB. Very devops-y. I felt the beard grow. I then tried to put a Flask GUI over that DB, which is when the wheels came off. That was another 4 months down the pan.

I then started to mangle the parts of those scripts into more modular scripts that followed a workflow component role (think add new collection of things rather than add this, then that and then something else individually), so that we could feed these from a well known orchestrator like vRO. With every day my python chops grew, but comments and git commits or not, I quietly and internally started to worry that only I knew how this worked. I think I was about half way through this when the ACI modules for Ansible got a serious upgrade by the Network to Code folks. I think on that day I must have looked like I had a stroke, because I was half happy and half sad. Visibly pained for all the lost time, but happy since I would be able to get something more maintainable into production. These python scripts would be intrinsically linked to me forever, whereas these Ansible playbooks were vastly more accessible.

At this point I had been telling everyone that we could, would and must have a reliable orchestration workflow in place before the first major deployment of our new platform. However, by now we were in fact one semi automated deployment down, the second of six was looming, and we were still surrounded by lots of components with "no glue". Ansible was that glue. In maybe 2 months I had not only reproduced all my work with the generic modules, but I had also finished off all the missing parts. We deployed our second site into Prod within I think 3 months, and I back-filled the first site with the critical automation a few months later (more due to time rather than any technical problem).

Having lived with that over the last few years the first cut of that Ansible playbook work is nowhere to be seen. Somewhere like a month or so after the first site was redeployed we hired a new engineer with some thirst for automation and over the following months he took over the lead role from me on the Ansible front. That was a godsend as I was also "managing" a team of 12 engineers around the world and the quotes there speak for themselves.

His first drive was to use AWX (the newly open sourced version of Ansible Tower) as a driving force rather than executing playbooks off a jumphost. In the learning of that, he designed a new structure to the git repo that broke the components up into pieces that were more manageable from a change and risk perspective, and mated that with Templates to execute specific playbooks with more comfort. Having had that go well, we bought proper tower and network device support for Ansible to backstop me and that guy. Enterprise support from RedHat keeps the CIO happy obviously. On completion of the first refactor, we then did some more work on repo variable structure, since there was still no CMDB in sight and the state in the repo was getting quite cumbersome. Today, all our sites are managed off this one git repo and maybe 8 or 9 Ansible Tower templates (playbooks). We still have a lot of state in the repo itself, and a drive to pull much of that repo state into Netbox instead of flat YAML. What we have works, but it won't last forever.

Above all else, the biggest issue we have come across with Ansible as a tool for ACI specifically is not so much rolling things into the config, but pulling things out. Today, we have no reliable delete tooling. We can add, we can change, but we only ever delete by hand. Thankfully for us, we rarely delete anything, which I guess is a compounding factor for why we don't have much of a solution here either - necessity is the mother of all invention, and there is no major necessity for us to delete things. You would think that state: absent was all you needed, but you soon learn that for reliable rollback, you can't live on idempotency alone. Often you need to put logic into the hierarchical removal of things; i.e. you remove what you added, in reverse order. Thus the safest thing to do is, for each playbook rolling in, you do an equal and opposite directional one to roll back out. Again, all fine, providing your platform doesn't spit endless errors for things being removed that don't yet exist. ignore_errors: True is rarely a good idea, and tags might work instead, but we found it all very, very cumbersome, and it never got out of the dev branch.

Technically, the right answer is the target platform needs to be more stable and consistent in its application of idempotent actions, or live closer to true promise theory for its RESTful APIs, but I should technically be able to eat what I want and lose weight. In the real world things don't often work the way we want them to.

So having lived with a very fine and functional automation framework, delivering hand crafted configs via generically accessible tooling for a few years, I have nothing major to complain about in terms of the final product. I can do in a matter of hours what would take a week at least to do in the past. I know that when I ask the system to deliver that config from the normalised YAML values, it will either work or fail cleanly. I also know that I can adapt or change small things within a running config and not risk major disasters either.

Whilst I have little to complain about, being British, I do not find it hard to try.

What changed for me is that I was seconded into a new group in the business focusing on Public Cloud Transformation. Historically the public cloud was off limits to us due to commercial and financial reasons. The pay as you go setup didn't marry with the way we ran our business financially, which preferred capital purchases of assets. A change of leadership and some keen interest from a cloud friendly major shareholder, meant the shackles were off. I was instantly thrust into AWS and Azure and specifically I was knee deep in Terraform. I finally got to see real promise theory and started to envy the ability to force the running state to exactly match the described config. I immediately revisited the Networking modules and poked about the ACI ones, and was significantly underwhelmed. They were all community owned and seemed to be written to scratch a few itches rather than as a serious attempt at providing full support for the tool. I was transported 2 years back to my horrible hacky python. I was sad.

Over the course of the last 9 months or so I have come to really like the hashicorp tooling like terraform, packer and vagrant as an approach to immutable infrastructure, but I have also realised these are ultimately provisioning tools, and they have obvious limitations in day 2 operations. I can provision an asset today, I can bootstrap its config as well, but if something changes it outside of terraform, and then you rerun the tf apply, you can cause some significant drama. This wouldn't be so bad if the tf modules covered all the operations you need, but they don't.

I then see a lot of people chaining TF and Ansible together in CI/CD pipelines, as a sort of treatment to this limitation. I'm one of them too now. It works. It's using the right tool for the job rather than a leatherman for everything. It is just such a shame to look at the terraform workflow being so tight and conformist, then having the point and shoot (and pray) approach in Ansible following behind. It feels wrong.

So finally, we arrive at the juiciest meat on the barbeque today - YANG.

Funnily enough, YANG has nothing to do with networking itself. Read any YANG book (and I strongly suggest you read this one), and all the examples for how YANG structures work together don't use anything networking related at all. YANG itself is a modelling language sort of like UML in programming. It allows you to create strongly typed structures that hold information about a thing, and its inter-relationships within a collection, known as a module.

What the IETF and OpenConfig (to name a few) have then done is use that modelling language to describe all the pieces and parts of networking entities, like a Physical Interface, IPv4 Routing and BGPv4 Unicast or BFD sessions. They went away and built up all the components of each protocol or service or component, and exposed that in a collection of YANG files in a big git repo.

Now of course we all know that whilst OSPF is OSPF, OSPF on Brocade doesn't always play well with OSPF on Cisco straight out of the box. Just because an RFC exists, doesn't mean that all vendors follow these to the letter. So to avoid the pain of having so many of these models out there with vendor specific augmentation, YANG allows you to load up a generic model, and then apply "deviations" over the top which overlay changes or append additional items to the generated model. All you have to do is load that model up, insert your data where your implementation or the model itself requires values, and then this finished model is ready to be ingested by a platform, and applied. Each vendor has created a small engine as part of its NETCONF support that will take this model, convert it to the configuration parlance of that vendor and then apply that vendor-specific config to the platform for you. In other words, you make a model saying you want an OSPF process with id 100 and redistribute-connected enabled, and when you send that to a Cisco box, it will generate:

router ospf 100
 redistribute connected subnets

You could then send that to a juniper or an extreme box and it will generate the equivalent parlance and apply it for you. This is why YANG modelling is so powerful. You say what you want in the abstract based on a domain specific knowledge and leave it to the receiving platform to apply the correct config.

At this point things get optional. YANG is just a blob of structured data in memory and that input needs to go to a device to be parsed and applied. For many, this takes the form of NETCONF, but could also be RESTCONF or gRPC for example.
RESTCONF is usually a point and shoot approach to a single device config or a single model deployment. This is due to its nature of hitting endpoints relating to the leaf of the model one call at a time. Think Ansible with multiple modules being hit in series to deliver a total output. Typically people use json as a request encapsulation scheme, often converted from the originally generated model XML. Due to this endpoint driven design, and the clunky movement between formats, it's actually not a common use case for most people; further reading is left to the reader as appropriate. 
The Google approach is gRPC and this is apparently quite popular in Brocade/Extreme environments, using Protobuf as a request encapsulation (serialisation) scheme. gRPC itself is super lightweight and is very effective for use in the other side of YANG, which is streaming telemetry, but I won't go there today. Protobufs themselves have a sort of handshake scheme meaning all participants share the schema of the data in transit first, and then the rest is binary data meeting that schema. As it is sort of unicorn snowflakey, again further reading is left to the reader.
What I will cover more here is YANG models rendered into python objects, serialised on the wire as XML and sent to the device using NETCONF (over ssh tcp/830) as the transport.

So as discussed we start out with a problem to solve. We want to configure a BGP peering session on a device we own with a device we don't. To do this, we know we need a few key bits of information:
  1. Remote AS
  2. Remote IP
  3. Local IP
  4. Local Interface for Local L3
  5. Networks expected to receive
  6. Networks to announce
  7. Any BGP session settings (timers/graceful restart/bfd etc)
  8. Any local BGP policies we need to apply (filters/route-maps etc)
The first 5 items are directly significant to the BGP peering session itself.
The 6th item is likely router specific, but could be session specific.
The 7th is going to be locally standard, with perhaps some session-specific overrides.
The 8th is almost certainly going to be locally standard, with some specific overrides (always filter bogons, but maybe additional filters on an IXP peer vs transit, or application of a peer-significant internal community on learn, for example).

Ultimately, you could easily build a YANG model once for all your BGP peering sessions, based on the IETF BGP model, supplemented by the router model and running version deviations, and use that as the base of all BGP sessions. You then create your common values for timers and BFD, use the route-map model to build your standard route maps, and then the access-list and prefix-list models to create your filters and associate them all together with the session itself. At the end, you have a YANG model with all the settings for the BGP session. If you haven't used anything in the vendor specific space, your model is also vendor independent. You can send this config to a Juniper MX, or a Cisco ASR and you get the same result - a BGP session.

Let that sink in a second.

You built a model in code, with all the settings you need, and you can send it to any NETCONF enabled device and it will do the needful for you. ZOMG YAY!
What happens under the hood is that the netconf agent in the device receives the model from your script, renders the model as its native configuration, and places it into a config candidate. You receive a confirmation back from the agent that the config candidate has been accepted and is ready for commit. You then have the choice to drop or apply the config to the running config. ZOMGWAT? Yes, you can get Juniper style commit and rollback features, on devices that don't necessarily expose that functionality to you in the same way.

Brilliant. Everything I always wanted in life. Give it to me! There is a catch, however. YANG models are super, super janky to set up.

I have been chasing this dragon now (in personal time) for about 2 months. It took the above mentioned book to loosen the lid.

Big things for me that I would recommend to others: definitely buy this book. Read it fully. I have not put an affiliate link in there, just buy it. Secondly, start small. Get a vEOS or a CSR1000v up in GNS3 and point a MGMT VRF NIC at the NAT bubble and another one at a BGP container. Start with Python on your box (or an Ubuntu box as a jump if you prefer), and get pyang and netconf-console installed. The latter is a major gamechanger for figuring this stuff out IMHO. In the book they cover a series of commands that have you pulling the YANG models the device supports off the unit directly, saving them to your disk, then generating config in XML to push back as a change. This then opens the door to using blog content like that from NetworkOp. Prior to reading this book, and even with some podcast content and other blogs, I never really got what they were on about in the NetworkOp blogs - all the compiling things and command-line work went over my head. The book fixed this for me.

Once you figure out how to make and utilise those models as python bindings, or maybe ruby or go modules if that's your thing, the benefits start to open up really really fast. Being able to go from a few fields of data to a validated model, and then onto a vendor specific running config without touching those middle things, really is powerful.

Unfortunately, as I saw in my early adventures a few years back in Python, and Ansible, I guess we are up at the leading edge here, and so experience is mostly word of mouth and beard to beard. Over time, this barrier of entry will come down.

So in summary, my journey from nothing to something in the last 3 or so years has been interesting, but for those pushing into YANG now, I leave this high level plan with you:
  • let the book teach you the difference between YANG and NETCONF. Even if you know that, do it anyway; it's not a big book
  • setup your environment to talk to something in a sandbox, and get used to pulling supported models out of the capabilities (hello packet), and then extracting and rendering them with pyang on the screen.
  • get used to rendering python bindings from the extracted models and script up a basic config of something like a port.
  • build out a full script to build a from scratch BGP session with lots of models, using minimal input data. 
  • put said data into a CMDB (like netbox) and then call the netconf using ansible (e.g. the napalm-netconf module).
  • break your elements up to atomic units that match netbox CIs and use the netbox workflows to create, read, update and delete (CRUD) the atomic objects 
  • celebrate.

Software Defined Waffle with a gitops topping

Over the last two years or so, I have been on an adventure with Data Centre Infrastructure renewal. As past posts may allude to, ACI was a big part of what we did, but before anyone gets all dogmatic about it, know that we didn't go "All in" with that one product, since I personally don't subscribe to the "DC Fabrics cure all ills" mantra.

CLOS fabrics and the various approaches to overlays within them are great at providing stable platforms with predictable properties for speed, latency and scale. Unsurprisingly, they go on to do a great job in server farms that can make the best use of that flexibility. During recent conversations on DC refresh, our Arista friends have been extremely keen to try and get us to run our Internet BGP border on the fabric as well. The 7280SR2K can handle 2M routes in FIB they say, just lob stuff into a VRF, bit of policy and voila. Yeah.

Just because you can, doesn't mean you should.

In the end we blended a lot of the old school Block model, with elements of new tech, and a heavy dose of gitops for config management and deployment. At a high level, the layout is something like this.

Our Dual DC Block Diagram

In general, the idea was that in each position, the equipment selected would do one job and do that job well. As the old saying goes, when you try to be the jack of all trades, you end up being the master of none.

There are Pros and Cons to this idea, with many of them coming back to the premise that the segmentation means a clean, simplified deployment strategically, but operationally leaves you with much more hardware to buy and maintain.

As Russ White famously says, it's not about right and wrong, it's about the trade-offs and what works for your environment.

For us, our key driving aim was a stable, scalable environment. Our experience of early versions of ACI had taught us that SDN as a concept delivers significant benefits of operational efficiency, with equal parts risk. The faster and easier it is to make a change, the faster and easier it is to break everything. To solve that risk vs reward conundrum, we turned to automation. It's harder to break things if you have a structured, known good template that you feed with the variables. It's also quick to send that out to multiple places with an orchestrator like Ansible Tower.

It also gave us our own experience in the technology sector hype-cycle.  Software Defined hype is everywhere, and the cycle is the same every time.


Throughout each iteration of that loop, more people make it past that 10% completion rate and end up customers of said product. Ultimately, as things get through a number of version iterations, the products stabilise, the number of blogs explaining how they work improves, and the number of people who get to try them out successfully goes up exponentially.

ACI is a good example of this awkward cycle (we were a very early customer back in v1.1 days). That first deployment was "stable" in so far as what you configured would mostly work fine, but it was so hard to configure, you didn't do that very often, and it rarely worked first time. If and when it did break, you would be on to TAC and you would have to wait for one of the maybe 10 people worldwide who knew how it worked. Version 2 was a marked improvement, making it more accessible as an engineer, but version 3 was a very stable product. We, as veteran operators are only now planning our move to Version 4, and whilst version 5 was just released, the rule of thumb is to only ever run the Long Term Support (LTS) versions which are the .2 releases (3.2 is currently community recommended unless you need 4.2 for the new fancy FX2 switches).

The same is true of SDWAN as it lives today. It started off with DMVPN in a frock, and had a few cycles of integrations with various things like SaaS in the cloud and whatnot to now, essentially losing its product status and becoming a feature in your edge device, or in some cases, becoming your new edge device.

In the first versions it was clunky and whilst the outcome was what you signed up for, the effort and the risk entailed to get there was not really well understood. As early Viptela users (pre-Cisco acquisition), we know full well how the sausage is made. Nowadays they're almost entirely turnkey in terms of "time to value", but to get the very best you still need to care and feed more than you expected when you signed up.

Ultimately, across the enterprise sector, getting into these new technologies is seen as a must in order to survive. That urgency to get into the tech leaves a lot of teams being reactive rather than proactive, and that back-foot setting means you don't get time to do proper due diligence, planning and/or training. At which point this turnkey capability becomes a double-edged sword. This is why, when I see the "network engineers must be programmers" line, I am so deeply conflicted. Yes, you need to have scripting knowledge, because, yes, the only way to keep up is to automate. Sadly, we don't have the time to do that well; hell, we don't even have time to learn our own discipline well half the time, let alone a whole other one. This is the danger of that Copy/Paste bubble in the hype cycle. When we don't have time to be an expert, we ask another expert for help - it's human nature. In enterprise that takes one of two forms - Consultancy or Plagiarism. In my opinion, neither is better or worse than the other, since they result in the same outcome - people don't know how things work.

So here is where I come to the meat of this post. Having done quite a few of these deployments over this last few years, I have experienced both ends of the spectrum in terms of under-preparedness and or nonchalance towards the level of effort required over time, and at the other end, excessive automation and overly officious policies about use thereof.

Being able to look back now I can say without any question that investing time in modeling your consuming environment logically, is time well spent x1000. That is to say, you should spend at least one week in a room with a whiteboard talking about how your workloads will attach to the network, both in abstractions, and then down to the lowest detail possible. This time spent now will allow you to at least build solid process workflows for manual admin of the platform, and at best allow you to script it so with minimal input, you get a reliable, repeatable output. Always strive for the latter, but accept that your skills might not permit for that now, and ensure you have no less than the former fully documented, and such manual processes are peer reviewed against that document, and change controlled.

Perhaps contentiously, one of my biggest regrets in our first deployment was being very anal about "Automation only" in change management. We wrote it in stone with blood that no person would manually configure the fabric from the GUI. All changes were done in YAML and pushed out via Ansible Tower. I had listened to so many people who were already in this hallowed place and every one of them regretted every manual change, since back-filling the automation later was so awkward.

The issue that I had was that lack of flexibility, coupled with engineers that were still finding their feet with automation, led to resentment from some, and made our deployments take longer because those people that didn't universally understand the approach, either avoided it or couldn't find time to do their work among their other tasks.

I can tell you that back-filling that automation is indeed hard, but it's not impossible, and it is unreasonable to automate all the things on day 1. Try to do that and day 1 will never come, and you'll never deploy anything.

My best advice to people now is to use that modeling information you created to pick your battles, and automate tasks that you know you will repeat at some regular interval. Find the breakeven point that works for your business in terms of effort to automate and iterations of execution - we said anything we do at least once a week. You then set a very strict change process around the manual tasks outside of that range, so that when you do need to back-fill automation, it shouldn't be destructive.

If you're not sure how to do this, XKCD has you covered:


Choosing the framework to base your automation around is probably another thing you should spend a respectable amount of time on, and most critically, ONLY make a decision following a completed, successful POC. I can look at the market today and see perhaps 20 products that claim to be end-to-end orchestrators, and of those maybe 15 are very pricey things that expect you to "buy in" to them effectively being your CMDB. That is fine, but you're almost certainly vendor locked-in at both a platform and a fabric level thereafter. Good luck in year 3 with your maintenance renewal as well. Of the remaining ones, Tower is top of the pile since it's very generic, but you'll spend a lifetime in YAML - be forewarned. vRO is probably next, assuming you are in a virtualised world based on VMware; it's very good, but very clicky-clicky - make sure you have a spare keyboard and mouse. Atlantis is good if you have a very modern platform with terraform modules in the wild and have to co-exist with the public cloud.

Ultimately we learned the hard way that it's not realistic to aim for a single tool for changes - there are many tools out there with different pros and cons. As we matured in our approach, we have been moving more and more towards pipelines in git, which in our world means gitlab-ci.

Today, what we found to work well is to have gitlab host all our process code/scripts, and then pipelines execute on a change of that state to then chain together ansible, terraform, python code, and or random rest calls, such that a workflow gets from change request, to delivered via a merge request which is appropriately approved by leadership, and a change execution process which doesn't require platform specific compromises. The pipeline executes the right tool for the job, and the qualified engineers can maintain their pieces and parts, leveraging their focused expertise, and not necessarily have to be a full stack engineer genius unicorn.

Finally, the elephant in the room has to be where you host your state. When we started with the gitops adventure, we did not have a business wide CMDB. We had a bunch of localised ones, but we didn't have one central thing. We considered what we should do about this, but very quickly it spiraled from something we could use to something we needed everyone else to use too. In parallel a wider project was created to solve this problem, and so we resolved to use YAML in git for our state. Two years later and our git repos are huge, and unruly. It was a bad call since it didn't scale. That project to backfill a CMDB across a multi billion dollar company is not really much further forward either, so we are now building out a netbox based solution for our Infrastructure operations. Only time will tell if that was a good idea or not, but if you find yourself at the beginning of a big deployment, build from the start with a CMDB, even if you don't use that to trigger your automation like Netbox was designed to do. Have the state somewhere relational and programmatic. If you don't you'll regret it just like me.



