How Do We Solve the Cloud Visibility Problem?

Cables2Clouds

Join Chris and Tim as they delve into the Cloud Networking world! The goal of this podcast is to help Network Engineers with their Cloud journey. Follow us on Twitter @Cables2Clouds | Co-Hosts Twitter Handles: Chris - @bgp_mane | Tim - @juangolbez

January 08, 2025 • Cables2Clouds • Episode 50

Ready to unravel the complexities of cloud networking and observability? Join us for an enlightening discussion with Craig Johnson from Forward Networks, a pivotal figure in cloud network innovation. Craig's journey from an on-premises networking background to spearheading Forward's public cloud practice offers a treasure trove of insights. Discover how integrated APIs are revolutionizing the landscape of cloud network monitoring and delve into the world of digital twins and cloud observability, where Craig's extensive Cisco experience and passion for Dungeons & Dragons memorabilia weave through the conversation, making it both informative and enjoyable.

Explore the Herculean challenges of achieving network observability across multi-cloud environments like AWS, Azure, and GCP. We unpack the intricacies of navigating these diverse platforms, highlighting the contrast between traditional on-prem networking and the dynamic, ever-shifting cloud landscapes. Learn how Forward's platform paves the way for enhanced network visibility by simplifying network modeling and management, leveraging free APIs for data collection without hefty observability costs. This episode also tackles the frustrations with multi-cloud tools and third-party integrations, shedding light on the efforts required to maintain seamless network performance and troubleshooting capabilities.

As we wrap up, we reflect on our experience at the recent re:Invent event and the exciting anticipation for the new book co-authored by Tim and me, set to release in January. We express our heartfelt gratitude to our listeners, encouraging you to stay connected by subscribing to our podcast and following us on social media for exclusive content and updates. Whether you're a cloud networking aficionado or just curious about the latest in cloud observability, this episode promises a wealth of knowledge and a touch of fun.

Connect with the Guest:
Twitter: https://x.com/captainpacket
Bluesky: https://bsky.app/profile/captainpacket.bsky.social

Guest Content:
https://youtu.be/lpA7A6FYMqc?si=diboYPvvbQJBDsLP
https://youtu.be/_GjGsfM-N2Q?si=2RMx-iZkwBHeZm9D

Purchase Chris and Tim's book on AWS Cloud Networking: https://www.amazon.com/Certified-Advanced-Networking-Certification-certification/dp/1835080839/

Check out the Monthly Cloud Networking News
https://docs.google.com/document/d/1fkBWCGwXDUX9OfZ9_MvSVup8tJJzJeqrauaE6VPT2b0/

Visit our website and subscribe: https://www.cables2clouds.com/
Follow us on BlueSky: https://bsky.app/profile/cables2clouds.com
Follow us on YouTube: https://www.youtube.com/@cables2clouds/
Follow us on TikTok: https://www.tiktok.com/@cables2clouds
Merch Store: https://store.cables2clouds.com/
Join the Discord Study group: https://artofneteng.com/iaatj

Cloud Observability and Digital Twins

Chris 0:00

The Human Torch was denied a bank loan. I haven't heard that one.

Craig 0:04

Really, I haven't heard that vocal exercise, yeah.

Chris 0:07

Quote from Anchorman.

Tim 0:09

Oh, that's right, yes. This burrito is delicious, but it is filling. Welcome to the Cables to Clouds podcast, your one-stop shop for all things hybrid and multi-cloud networking. Now here are your hosts Tim, Chris and Alex.

Chris 0:30

Hello and welcome back to another episode of the Cables to Clouds podcast. My name is Chris Miles, at bgp_mane on Bluesky. I actually went and changed the domain, Tim, so you'll be happy.

Tim 0:41

Excellent.

Chris 0:42

Kept it consistent everywhere. And yes, as you heard, as always I'm joined by my good friend Tim McConaughey, my co-host, my partner in crime, my heterosexual life mate, whatever you want to call each other. But yeah, we have a special one today. This is a post-holiday recording. We're all fat, dumb, and happy: we filled our guts at Christmas, and now we want to chat about some cloud observability stuff. So we've brought on a good pal of ours, Craig Johnson from Forward Networks. If you've been involved in the podcast and technical media circuit at all, you've probably seen Craig out there. I think you've been on A1 as well as The Cloud Gambit, right?

Chris 1:26

So, yeah, he's making his rounds, as you may know. But no, we had a good chat at AWS re:Invent this year, and we thought it'd be a good chance to come on here and talk about some of the stuff that's happening at Forward, the concept of digital twins, and how that applies to cloud network observability. So with that, Craig, tell us a little bit about who you are and what you do.

Craig 1:49

Yeah, thanks for having me. My name is Craig Johnson, and I live down here in the great nation of Texas. I've been at Forward for a little over five years. I'm a technical solution architect, and I also lead our public cloud practice, but I didn't start out in cloud; I really only moved to the public cloud about three years ago. In the back left corner behind me you'll see my multiple expired CCIE, so most of my time before that was completely on-prem, working with service providers. I spent a decade or so at Cisco working on the operations side, mostly in data center, storage area networking, or campus networking. For the past five, five and a half years at Forward, I've been focused on network modeling and digital twins, and for the last three or so years it's been purely cloud modeling.

Chris 2:39

I will say, at least you've got one of the old-school CCIE plaques. In Tim's and my day, I feel like the plaque we got... I call it the ugly baby, because it's a hideous thing that I don't want to hang on my wall because I hate the way it looks.

Tim 2:52

Yeah, you see, I don't have mine behind me. I didn't have space for it after I put all the posters up, but yeah, the plaques definitely degraded over the years.

Craig 2:59

I've got the old metal one, from like 2001, and then they started doing just kind of a simple one, and by the 2010s it was like, come on. It's like when I was there and they took the sodas away. It's the same thing, you know.

Tim 3:12

I'm going to completely derail this for a second. I just noticed that you have the Red Dragon. I have that exact same Red Dragon. I was looking at my bookshelf like, holy shit, we have the same Red Dragon.

Craig 3:28

I had no idea. Yes, yes, there is a plethora. I just redecorated this room, so the color and the flooring are all new. Below the Red Dragon over there are several D&D third edition books that...

Tim 3:42

I still have. They're not particularly useful anymore, but yeah, absolutely.

Chris 3:46

Yeah, good stuff, man. Well, Craig, thanks for that, and thanks for coming on the show. So let's start with an overview of the problem. When we had the idea for the show, we wanted to talk about what the problem was, and then we can venture into how to solve it. Network monitoring and network observability have been a thing in the on-premises world for a very long time.

Challenges of Cloud Network Observability

Chris 4:12

And I'm curious to get your take: how has that made its way into cloud? Cloud is this new paradigm where there are integrated APIs everywhere. You should be able to get as much visibility and observability as you need, but that doesn't seem to be the case in day-to-day operations. So, from your perspective, what is broken about cloud network monitoring and observability?

Craig 4:37

So what I'll say is particularly broken, in my opinion, is, one, the very obvious. On-prem, if I want to swap out a Cisco router for a Juniper router or an Extreme one, they all basically do the same thing: they're going to have RIBs, FIBs, forwarding tables, and input and output ports, and when you go inside, you can see all of it. If I want to switch between AWS, Azure, GCP, or OCI, obviously there's something underneath all of those doing the same work, but it's completely opaque to us, and the visibility I get is very limited. For administrative-control reasons and just by nature, if I want any level of visibility in AWS, I'm limited by which account I'm in and which region I'm in, because they made the decision to separate the control planes.

Craig 5:30

Azure's a little bit different: I'm separated by tenants and subscriptions, and similarly with GCP and projects. So the scope of what I can actually see is quite limited, and the way the forwarding function works is very different. As you probably know, the way I forward in AWS, and the tools I have for it, are drastically different from Azure and from GCP, so it's much more difficult. And with many analysts saying you should go multi-cloud, or you should have many different accounts for separation, that sounds nice, but actually figuring out where my packets are going, whether they are moving correctly, and whether they're taking the best path is extremely difficult. There are a lot of tools out there that will give you some visibility into your instances and your databases, but very few that I've seen will actually do it at the network and transport layer.

Tim 6:26

Yeah, definitely. It's interesting what you mentioned about how AWS separates control planes, and also visibility, between accounts. A lot of people say, oh yeah, observability is tough, but you have CloudWatch and they give you all these tools. But then they split them all out from each other, even within the same cloud, and it becomes a Herculean effort to stitch it all together across the different accounts and whatnot.

Craig 6:53

Yeah, and it's just a very different paradigm from how we do things... oops, I should hit myself for using that word... on-prem. Because on-prem, I don't care how many applications, clients, and tenants I have, it's all going through the same set of infrastructure. In the cloud, the separation just makes it extremely difficult for me to figure out anything. And even setting aside dealing with people who may not be as familiar with how forwarding works, it changes a lot more often than I would like to see versus on-prem. Since I started, 25 years or so ago, speeds and feeds have changed, but forwarding is roughly the same. If I were still doing AWS the way I did five years ago, I'd be woefully out of date, and things wouldn't work anywhere near the same way.

Chris 7:37

Yeah, I mean, to your point about the way forwarding works differently between the clouds, the monitoring piece is all different as well. In AWS you're a network engineer, but you have to come up with this complicated Athena query to actually see what your traffic's doing on a day-to-day basis, and that doesn't apply to Azure, GCP, or any other cloud you could be in.
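For context on what that Athena query typically does: it aggregates VPC flow log records. The Python below is a minimal sketch of the same aggregation, assuming records in the default (version 2) space-separated flow log format; custom log formats use different fields. The function names and sample records are illustrative only, not part of any AWS tooling.

```python
from collections import Counter
from typing import Iterable

# Field order of the default (version 2) VPC flow log record format.
FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)

def parse_record(line: str) -> dict:
    """Split one space-separated flow log record into a dict keyed by FIELDS."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))

def top_talkers(lines: Iterable[str], action: str = "ACCEPT") -> Counter:
    """Sum bytes per (srcaddr, dstaddr, dstport) for records with the given action."""
    totals: Counter = Counter()
    for line in lines:
        rec = parse_record(line)
        # NODATA/SKIPDATA records carry "-" in the numeric fields, so skip them.
        if rec["log_status"] != "OK" or rec["action"] != action:
            continue
        totals[(rec["srcaddr"], rec["dstaddr"], rec["dstport"])] += int(rec["bytes"])
    return totals
```

Even this small example only answers "what traffic was seen on which interface"; it says nothing about why a path was chosen, which is the gap the conversation keeps returning to.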

Craig 8:04

It doesn't apply, and I don't get anywhere near the same tools. I will applaud the cloud providers for starting to get a little better on this, but it's not like I can log into a transit gateway and do a traceroute to figure out where my path is going, and hope I don't have things blocking my standard ICMP tools. That's where you do get other data plane observability tools, which are great for figuring out latency and whether things are working, but if something isn't working, they don't actually tell you where the problem happens to be.

Tim 8:31

Well, and if you use something like, say, VPC Reachability Analyzer, God forbid you have any non-AWS architecture anywhere in the way; you're done. Right? That's it. You're not going to use that tool at all.

Craig 8:43

Yeah, I mean, that's of course the pitch. It's like, oh, you want to use something besides AWS Network Firewall? I'm sorry, you're out of luck.

Chris 8:53

It's kind of like the Apple approach. I feel like when you're fully in the ecosystem, things work, and otherwise they're shrugging their shoulders like, well, you could buy our shit, and then we know it will work. But the majority of network people in the cloud are using some type of third-party firewall for integration and stuff like that, and that visibility is completely lost.

Craig 9:18

Exactly right. Like, you can see, oh, I'm hitting this ENI here with my traffic, and you cross your fingers that whatever that thing is doing, it's doing correctly.

Tim 9:26

Yeah, I mean, people open tickets all the time for forwarding problems. Chris is killing something over there, but yeah, all sorts of forwarding tickets where you have to get support involved, because they're literally the only ones who can see what's happening to the packets. Your VPC flow logs might say everything's great, but there's still a problem, and you find out that, oh well, this availability zone was having an outage at the time. So the tools they give you are opaque, and also, I don't know, I don't feel like they're truly real-time, or they don't dig far enough to really give you the visibility you need.

Craig 10:07

Yeah, I think that's exactly right. It reminds me of when I used to work in operations: the answer to those sorts of problems in a data center or a campus was, let's get the sniffer out and see. And it's the same sort of issue. You can look at flow logs, you can try to generate packets, and that's going to tell you something, but it's not going to tell you whether you're taking a good path or whether you're hitting any problems along the way. It might tell me whether it's working or not, but only if that traffic is actually flowing at that time. As an engineer, when I'm troubleshooting, I don't always have the application running while I'm doing it, or I'm trying to validate the path before the application is even onboarded. And that's really the problem statement we set out to solve: not just to observe what's going on in the cloud network right now, but to model all possible flows as they might traverse the environment.

Tim 11:00

Yeah, and one more thing that we didn't really have to think about much on-prem, but absolutely have to think about in the cloud, is the cost of troubleshooting: the observability costs of VPC flow logs, Reachability Analyzer, port mirroring, all of these things. Those all have costs associated with them, and you're paying them in every cloud you exist in. If you're troubleshooting in AWS, you're probably also doing it in Azure if you're there as well, so you're paying for that too.

Tim 11:38

You're double- or triple-paying for a lot of these tools, and it's something you never would have thought about on-prem. On-prem, you stroke a check to, I don't know, SolarWinds or whatever the hell tool you're using, and then you just keep it. Once a year you pay the fee and you're good. But here it's consumption-based, so every time you have a problem, you're essentially paying money to solve your own problems.

Craig 12:02

That's exactly right. I remember, before I was at Forward, I worked for a network packet broker company, and that was the line there as well.

Craig 12:10

Put taps and SPANs on every place in your network. As you're probably aware, that's completely infeasible, and it's exactly the same thing in the cloud: one, it's far too expensive, and two, it's far too much data to do anything with. Even if I want to look at all north-south and all east-west traffic, that's a massive amount of data, and it's just too much to be actionable in any way.

Chris 12:33

Well, it's funny, too, because with the introduction of cloud, one of the big selling points was, hey, you get to start from scratch, pretty much. You get to build an ideal architecture that's built for the cloud. But it's never, ever that way, because of costs. We've inherited that problem tenfold in the cloud, just as much as on-prem: we've had to put things in these centralized architectures, and then if it's a multi-cloud flow, you're doing it centralized in two different ways. You're paying for the storage, you're paying for the monitoring on everything. It's exacerbated like crazy.

Craig 13:14

Yeah, you're precisely right. To the account example, I've seen everything from people getting started with simple architectures and a handful of VPCs, to some customers I work with that have literally 800 accounts, with a different transit gateway per account, all peered to each other and shared, and it's like, this is a nightmare.

Tim 13:35

Yeah, that sounds truly awful, yeah.

Craig 13:40

And AWS's solution, to your point, is like, oh, just use Cloud WAN, or just use VPC Lattice or something like that. But you're kind of just masking the complexity, which, to be fair, we've done the exact same thing on-prem. If I have too much complexity in my environment, put an overlay on top of it, make a fabric. I get it, 100%.

Chris 14:00

So we've obviously talked about this for about 10 or 15 minutes at this point, so we know there's a problem, right? There's something to solve. So let's pivot here to talk a little bit about Forward Networks and how you're addressing and solving some of these problems.

Craig 14:19

So, to the example I gave before: if you look at all of your devices on-prem, no matter what the vendor happens to be, they all basically do the same thing.

Cloud Network Traffic Modeling and Analysis

Craig 14:27

Whether it's a switch, router, firewall, or load balancer, a packet comes into a particular input port, the device does something with the packet, like a header rewrite or a MAC rewrite, or adds a header on top of it, whatever it happens to be, and sends it out an output port. That's all the device does, no matter what: it processes the packet through a number of tables. When we started at Forward, we all came from Cisco and places like that, so we were very familiar with that problem. Obviously, we don't want holes in the way we model the network, because once you take all of that data, you want it to be easily searchable and normalized, so you don't have to be a pure expert. I'm pretty good at most Cisco devices, but Junipers and Palos and Fortinets, I would really struggle with. So the key to understanding the forwarding characteristics is not necessarily "tell me the actual traffic going through that device," but "if I'm looking for a particular source and destination IP with certain port characteristics, tell me how it's going to pass through all those devices based on the current RIB, FIB, and all the other tables on each device." The innovation we had is that the public clouds essentially work the exact same way. Now, they all have their own little quirks.

Craig 15:45

But say I'm leaving my data center over an ExpressRoute or a Direct Connect. That's going to connect to a VGW, a TGW, or a VNet gateway. Once I hit that TGW, it has essentially the same things: it's going to have a number of attachments that go in and out of it, it's going to have peering connections and associations, and it's going to have a number of route tables, which function very much like VRFs. As you pass through each of those constructs, the TGW is going to connect to a VPC somewhere; more specifically, that VPC is going to have a route table associated with a subnet, and that subnet is going to have EC2 ENIs attached to it. And we saw that, okay, well, we already have the concept of doing this on-prem.
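The idea of stepping through a TGW route table, then a VPC route table, then a subnet, can be sketched as repeated longest-prefix-match lookups. This is a toy model under loud assumptions: every construct is reduced to a plain CIDR-to-next-hop table, the construct names are hypothetical, and real AWS behavior (the VPC "local" route, associations versus propagations, security groups, NACLs) is deliberately left out. It is not Forward's model.

```python
import ipaddress
from typing import Optional

# Hypothetical, simplified model: each forwarding construct (TGW route table,
# VPC route table, subnet) maps CIDR -> name of the next construct. A name with
# no table of its own (e.g. an ENI or a Direct Connect link) is an endpoint.
Tables = dict[str, dict[str, str]]

def lookup(table: dict[str, str], dst: str) -> Optional[str]:
    """Longest-prefix match: return the next hop of the most specific matching CIDR."""
    addr = ipaddress.ip_address(dst)
    best: Optional[tuple[int, str]] = None
    for cidr, next_hop in table.items():
        net = ipaddress.ip_network(cidr)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, next_hop)
    return best[1] if best else None

def trace(tables: Tables, start: str, dst: str, max_hops: int = 16) -> list[str]:
    """Walk from `start` toward `dst`, doing one route-table lookup per construct."""
    path = [start]
    node = start
    while node in tables:
        if len(path) > max_hops:
            raise RuntimeError(f"more than {max_hops} hops; possible routing loop")
        nxt = lookup(tables[node], dst)
        if nxt is None:
            path.append("DROP")  # no matching route in this construct
            break
        path.append(nxt)
        node = nxt
    return path

if __name__ == "__main__":
    tables = {
        "tgw-rt-prod": {"10.1.0.0/16": "vpc-a-rt", "10.0.0.0/8": "dx-onprem"},
        "vpc-a-rt": {"10.1.2.0/24": "subnet-app", "0.0.0.0/0": "tgw-rt-prod"},
        "subnet-app": {"10.1.2.10/32": "eni-app01"},
    }
    print(trace(tables, "tgw-rt-prod", "10.1.2.10"))
    # ['tgw-rt-prod', 'vpc-a-rt', 'subnet-app', 'eni-app01']
```

Note that no packet is sent: the path is derived entirely from table state, which is why the same walk works before an application exists.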

Craig 16:30

It's very easy for us to extend that, and at the same time we can take in everything, because I don't just need to collect from one account: I can collect from multiple accounts using a simple IAM role, and I can do it across multiple regions. The key insight we had, to your point earlier, is: wait, if I do have something like a Palo or a Fortinet or whatever firewall inside my cloud, that's just a collection of a couple of ENIs, maybe IPsec tunnels, whatever they happen to be. As the traffic hits that ENI, I can process that firewall just like any other device and see my end-to-end connectivity just by putting in the source and destination IP addresses. It's going to tell me how the traffic passes through each one of those constructs.

Craig 17:12

That was really the key innovation, and that's really where I got my start. When I started doing any sort of cloud work, probably around 2020 or 2021, I didn't really know very much at all, but I had the ability to collect from these environments. When I'm trying to configure anything on AWS, I want to see exactly how it's forwarding. Did I configure this route table correctly? Did I configure this thing? Did I configure the security group? There are a hundred different places to check inside the cloud. Being able to step through it step by step and, to the point earlier, not have to log into a bunch of different accounts and a bunch of different regions was a key insight we were able to use.

Tim 17:50

So, no, that's great. So what I'm hearing is basically: Forward got started on-prem, and now you've expanded into cloud as an add-on. The idea is that you can onboard your accounts to Forward's... I guess it's a platform, right? An appliance, or... yeah, okay. And then, because of that, Forward can go reach out to all of your accounts in whatever clouds. I assume there's also going to be login information or whatnot to hit any third-party ISVs, like firewalls, Cisco, whatever.

Craig 18:30

Yeah, API, CLI, whatever it happens to be.

Tim 18:32

However you can access it, right? That's going to help pull all of that in. And because you get all that data, and because the options for how traffic can be forwarded are limited, you know how it's going to work, so you can just model it. It doesn't actually send the traffic, but you can model the entire path because you know everything the packet's going to do.

Craig 18:52

Basically, yeah, and that's really the cool part. When you start looking into this level of modeling, it's not anything you couldn't do yourself. When I log into a router, I can look at the FIB, the route table, all the ACL tables, all the CAM tables, any MPLS forwarding, any labels getting pushed around. These are all things you could do, and probably have done many times. It's not really that much different in the cloud: I'm using EC2 APIs to grab the transit gateways, transit gateway attachments, and transit gateway route tables.
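As a rough illustration of pulling TGW route state through the EC2 APIs, here is a sketch that expects a boto3-style EC2 client. `describe_transit_gateway_route_tables` and `search_transit_gateway_routes` are real EC2 operations in boto3; the `collect_tgw_routes` wrapper and its output shape are hypothetical. Pagination and truncated route searches are not handled, and the code does not import boto3 itself, so it can be exercised with a stub client.

```python
from typing import Any

def collect_tgw_routes(ec2: Any) -> dict[str, list[tuple[str, str]]]:
    """Snapshot every TGW route table as {route_table_id: [(destination, target), ...]}.

    `ec2` is assumed to behave like a boto3 EC2 client, e.g.
    boto3.client("ec2", region_name="us-east-1"). Pagination (NextToken) and
    truncated route searches (AdditionalRoutesAvailable) are not handled here.
    """
    snapshot: dict[str, list[tuple[str, str]]] = {}
    resp = ec2.describe_transit_gateway_route_tables()
    for table in resp["TransitGatewayRouteTables"]:
        rt_id = table["TransitGatewayRouteTableId"]
        routes = ec2.search_transit_gateway_routes(
            TransitGatewayRouteTableId=rt_id,
            Filters=[{"Name": "state", "Values": ["active", "blackhole"]}],
        )["Routes"]
        entries = []
        for route in routes:
            dest = route.get("DestinationCidrBlock") or route.get("PrefixListId", "unknown")
            attachments = route.get("TransitGatewayAttachments", [])
            # Blackhole routes have no attachment, so record the route state instead.
            target = ",".join(a["TransitGatewayAttachmentId"] for a in attachments)
            entries.append((dest, target or route.get("State", "unknown")))
        snapshot[rt_id] = sorted(entries)
    return snapshot
```

Running this per account (for example, after assuming a read-only IAM role) and per region would give the kind of multi-account collection Craig describes, although the role handling is left out here.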

Craig 19:22

Now, the cool thing about this is that it doesn't have a cost. It's completely free to grab any of that from AWS; there's no observability cost. These are the same APIs that show you the exact same thing when you log into the console. We're just grabbing it through the API. So it's something you have right there, and it's also something you can track over time, which is one thing I really like. To your point, yes, you could look at CloudTrail logs and see what's changed, and you could look at CloudWatch and see if something is not working. But going back in time to ask, what was the state of the forwarding information in the cloud two days ago, a week ago, between changes, is really not easy to do, and not easy to do in a way that makes sense to a network engineer. I can grab API output from different points in time, but it's a lot of data to parse through, and not in a way that makes a lot of sense for a network engineer.

Chris 20:10

So it sounds like there are kind of two pieces here, the way I'm thinking about it. There's the predictive analysis, where, based on the existing control plane at a certain snapshot in time, you judge how a packet going through this network would get from A to B. You can predict that. Is there also a postmortem type of analysis, where you're looking at flow logs and saying, okay, this packet actually did come in, and you follow its trajectory in a postmortem manner rather than a predictive one? Is that possible?

Craig 20:44

So on the postmortem side, it's absolutely doing diffs between before and after, and saying, okay, between two points in time, here is not just what changed on a routing and security basis, but also on an intent basis. Say I have a particular application or flow: this application exists in my data center A, it hops to region B, and maybe it goes to a different cloud, whatever. That's an important application that we've imported from those kinds of flow logs.
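A before/after diff of route state can be as simple as a set difference over flattened route entries. The sketch below assumes each snapshot is a dict of route table ID to a list of (destination, target) pairs; that shape and the function names are illustrative assumptions, not Forward's storage format.

```python
Route = tuple[str, str, str]  # (route_table_id, destination, target)

def flatten(snapshot: dict[str, list[tuple[str, str]]]) -> set[Route]:
    """Turn {table: [(destination, target), ...]} into a flat set of routes."""
    return {(table, dest, target) for table, routes in snapshot.items() for dest, target in routes}

def diff_snapshots(before: dict[str, list[tuple[str, str]]],
                   after: dict[str, list[tuple[str, str]]]) -> dict[str, list[Route]]:
    """Routes only in `after` are added; routes only in `before` are removed.

    A route whose target changed shows up once in each list.
    """
    old, new = flatten(before), flatten(after)
    return {"added": sorted(new - old), "removed": sorted(old - new)}
```

A route diff only shows what moved; whether the move broke anything is the intent question, which needs the path evaluation on top.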

Craig 21:15

And then every time a digital twin like Forward takes a snapshot of the environment, it's checking those dozens or hundreds of things: Has this changed? Is it always taking the shortest path? Are there loops in the network? Is there anything that would stop the connection between all of those? And because it's analyzing everything along the path, layer two, layer three, overlays, firewalls, MPLS forwarding, all the way up to the cloud, it's going to tell you, one, if it's changed, and two, if you've done something to break the connectivity. So you get that postmortem view: this particular flow doesn't work anymore, and it's because somebody changed the security group, or somebody disassociated something from the transit gateway, or whatever happened.
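Per-snapshot intent checks like "does this flow still arrive, and is there a loop?" can be sketched on top of a simple route-table model. Everything here is a hypothetical simplification: `Intent`, the snapshot shape (construct name to CIDR-to-next-hop table), and the violation strings are invented for illustration. Loop detection relies on forwarding being deterministic for a fixed destination, so revisiting a construct means the packet would circle forever.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Intent:
    """Hypothetical intent: traffic to `dst` entering at `src` must end at `expect`."""
    name: str
    src: str
    dst: str
    expect: str

def next_hop(table: dict[str, str], dst: str) -> Optional[str]:
    """Longest-prefix match over a CIDR -> next-hop table."""
    addr = ipaddress.ip_address(dst)
    best: Optional[tuple[int, str]] = None
    for cidr, hop in table.items():
        net = ipaddress.ip_network(cidr)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, hop)
    return best[1] if best else None

def check_intents(snapshot: dict[str, dict[str, str]], intents: list[Intent]) -> list[str]:
    """Evaluate every intent against one snapshot and describe each violation."""
    violations = []
    for intent in intents:
        node, seen = intent.src, set()
        while node in snapshot:
            if node in seen:  # same node, same destination: a deterministic loop
                violations.append(f"{intent.name}: forwarding loop at {node}")
                break
            seen.add(node)
            hop = next_hop(snapshot[node], intent.dst)
            if hop is None:
                violations.append(f"{intent.name}: no route at {node}")
                break
            node = hop
        else:  # the walk ended at an endpoint without hitting a break
            if node != intent.expect:
                violations.append(f"{intent.name}: reached {node}, expected {intent.expect}")
    return violations
```

Running the same intent list against every new snapshot is what turns a one-off path query into the continuous check Craig describes.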

Tim 21:56

So, I guess, the big problem with cloud observability we were talking about is, of course, the cost of doing that analysis, which Forward takes care of because it's predictive modeling. But there's also storage. You're doing diffs of some kind, right? So what does the data storage model look like? How are you storing the data?

Craig 22:24

So there are two kinds of data you're dealing with. One is the collected data, which tends to be API data from any source. This is metadata. We're not talking about going into an instance and grabbing all of the instance storage or anything like that. It's metadata like: in my VPCs, these are the entries for all my route tables, these are the associations for all the subnets, these are all the ENIs attached to the subnets. So in a large account it can be dozens or hundreds of megabytes, but we're not talking gigabytes or terabytes of data that you're grabbing. Now, on the back end...

Craig 22:56

Once that data is gathered, there's the derived data. The IP is what they call a mathematical model: it crunches all of that and figures out all of the, literally, quadrillions of possible flows between all of the places. That's really the key to what Forward does a little differently. There are other things out there, like LocalStack, which tries to emulate some AWS services, and things on-prem that try to emulate what actual devices would do, but you run into a hard limit when you want to emulate a large environment. That's where modeling really comes in: because we're modeling, you can scale up to 50, 60, 70, 80,000 devices and many hundreds of accounts without an issue, because all of that data is derived through mathematical crunching. It's not based on each individual device and all of its quirks, because it's all normalized.
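One standard reason a model can cover "quadrillions of possible flows" without enumerating them is equivalence classes: addresses that every table treats identically only need to be evaluated once. The sketch below shows the idea for IPv4 destination addresses only, by cutting the address space at every prefix boundary. This is a generic textbook technique chosen for illustration; the episode does not describe Forward's actual math, and real header-space approaches consider many more fields than the destination.

```python
import ipaddress

def address_classes(prefixes: list[str]) -> list[tuple[ipaddress.IPv4Address, ipaddress.IPv4Address]]:
    """Partition IPv4 space into ranges whose addresses match exactly the same prefixes.

    Any longest-prefix-match table built from these prefixes forwards every
    address in one range identically, so a model needs one query per range.
    """
    bounds = {0, 2**32}
    for p in prefixes:
        net = ipaddress.IPv4Network(p)
        bounds.add(int(net.network_address))
        bounds.add(int(net.broadcast_address) + 1)  # first address after the prefix
    edges = sorted(bounds)
    return [
        (ipaddress.IPv4Address(lo), ipaddress.IPv4Address(hi - 1))
        for lo, hi in zip(edges, edges[1:])
    ]
```

For example, the four prefixes 0.0.0.0/0, 10.0.0.0/8, 10.1.0.0/16 and 10.1.2.0/24 split all 2^32 destinations into just 7 ranges.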

Chris 23:49

Yeah, so on this: you mentioned digital twin a moment ago, and the concept of modeling. Should we maybe define exactly what digital twin means to Forward in this context? I'm assuming this is something that originated in the on-prem product and has now moved into the cloud, right?

Craig 24:07

Yeah, it's a term you'll see a lot and people use it in a lot of different ways. The way we sort of define it is a way to what we said before to essentially take the exact forwarding characteristics of any device that you have and essentially turn it into a common model. So if you're familiar with what OpenConfig used to be which is still around, of course there's already this concept of all of the devices have a sort of common model. Now, some devices are better than some vendors are better than others about conforming to that model and, of course, the cloud providers aren't very. So we've taken that and extended this a little bit to say what's common across all of these and by turning it into this open config plus extensions kind of common model, then we can use that to figure out the forwarding characteristics and then you can simply query that digital twin to do what I said before.

Craig 24:57

Tell me this source IP and this destination IP, with these ports, protocols, app IDs, URL filtering, whatever you have, and it's going to tell you the exact path it takes, overlays, underlays, whatever it happens to be. On top of that, because you have that same common model, you can query it not just based on flows but also based on configurations: golden config checking, security group analysis, all sorts of things that get layered on top. That's what has changed over time. When I started, most of us were in the network troubleshooting business, and it's really expanded a lot, because a lot of people are trying to figure out inventory, compliance, things like that. When you have it all in one place, you can query it without dealing with all the individual vendor characteristics, so asking something simple across all of my clouds is very easy.
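A configuration query of the kind mentioned here, such as "which security groups allow the whole internet to reach SSH?", might look like the sketch below. It walks the real `IpPermissions` structure returned by EC2 `DescribeSecurityGroups`. The function name is hypothetical, and IPv6 rules (`Ipv6Ranges`) and prefix-list references are ignored to keep it short.

```python
def open_to_world(groups: list[dict], port: int) -> list[str]:
    """Return GroupIds with an ingress rule allowing 0.0.0.0/0 to reach `port` over TCP.

    `groups` is the "SecurityGroups" list from DescribeSecurityGroups.
    IPv6 ranges and prefix lists are ignored in this sketch.
    """
    hits = []
    for group in groups:
        for perm in group.get("IpPermissions", []):
            proto = perm.get("IpProtocol")
            if proto == "-1":  # all protocols, all ports
                covers = True
            elif proto in ("tcp", "6"):
                covers = perm.get("FromPort", 0) <= port <= perm.get("ToPort", 65535)
            else:
                covers = False
            world = any(r.get("CidrIp") == "0.0.0.0/0" for r in perm.get("IpRanges", []))
            if covers and world:
                hits.append(group["GroupId"])
                break  # one offending rule is enough to flag the group
    return hits
```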

Chris 25:47

I can definitely see that. There's probably, I don't know if I'd call it a greater level of difficulty, but on-premises this is obviously much more intricate, right? It's going to pull the control plane, it's going to look at forwarding information bases, things like that, and you've got several different vendors in the mix. At least, I would think, with the move to cloud a lot of this is API-based. You know, we've probably all heard about the struggles of the major networking vendors supporting APIs.

Tim 26:34

...really any constructs that you're able to use are purely limited by what the CSP exposes to you, right? And because of that, 99% of the time it's purely control plane.

Chris 26:40

You're not looking at actual data plane stuff. Does that change what you do with the concept of a digital twin or this kind of predictiveness at all?

Evolution of Cloud Network Modeling

Craig 26:48

So when you start looking at data plane, those kinds of things are really more of an enhancement on top of what you're looking at from the modeling standpoint. Once you figure out all of the possible flows, you can figure out what the actual load, drops, and things like that are on a per-link basis. What's handy about that is I'm not just seeing all hundred thousand links in my environment and what their load is. I can actually start to troubleshoot on a per-application or per-flow basis: if I know this one application is going to take these 15 hops in my network and traverse these links, I can see on a per-app basis what's slowed down, what's being dropped, anything that would cause issues there.
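The idea of overlaying per-link counters on one application's modeled path, instead of scanning every link in the environment, can be sketched as below. The hop names, the `link_stats` shape (`tx` and `drops` packet counters keyed by `(from, to)` links), and the function names are all hypothetical; in practice the path would come from the model and the counters from whatever telemetry system is already in place.

```python
from typing import Dict, List, Optional, Tuple

Link = Tuple[str, str]  # (from-hop, to-hop)


def links_on_path(path: List[str]) -> List[Link]:
    """Turn an ordered hop list from the model into the links it traverses."""
    return list(zip(path, path[1:]))


def app_link_health(path: List[str],
                    link_stats: Dict[Link, Dict[str, int]]) -> Dict[Link, Optional[float]]:
    """Drop rate for each link on one application's modeled path.

    link_stats maps (from, to) -> {"tx": packets sent, "drops": packets dropped}
    and may cover the whole fabric; only links on this path are reported.
    Links with no telemetry map to None so monitoring gaps stay visible.
    """
    health: Dict[Link, Optional[float]] = {}
    for link in links_on_path(path):
        stats = link_stats.get(link)
        if stats is None:
            health[link] = None
        else:
            health[link] = stats["drops"] / stats["tx"] if stats["tx"] else 0.0
    return health


def worst_link(health: Dict[Link, Optional[float]]) -> Optional[Link]:
    """Link with the highest measured drop rate, or None if nothing was measured."""
    measured = {link: rate for link, rate in health.items() if rate is not None}
    return max(measured, key=measured.get) if measured else None
```

A heavily dropping link elsewhere in the fabric never shows up in `app_link_health` unless the application's path actually crosses it, which is the per-app focus described above.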

Craig 27:33

You can also see it on an overlay/underlay basis, because if you're running tunnels on top of it, VXLAN or whatever, it's going to go VTEP to VTEP, and you need to be able to see that as well. So yeah, for those things, that's a pretty well-established vendor space, so there's no real need to reinvent that, but overlaying that data on top of what's possible in the network winds up being very useful. Where the digital twin becomes even more useful is when you're pre-provisioning things or, like you said before, doing that sort of post-mortem analysis, where this isn't my high-transaction-volume period right now, but I need to be able to see what the path is before I roll out the application, or afterwards to see what's going on.

Chris 28:10

Right. That makes sense. I imagine what a lot of people are using this for is, hey, I have an upcoming change, right? I need to move this VPC, or start advertising this route from on-prem, or something like that, and they just want to see what's going to change. Would you say that's where your customers are extracting the biggest amount of value from the product?

Craig 28:33

Yeah, that's where you get a huge amount of value: doing that sort of pre- and post-change analysis. Like I said before, the flows that are important to my network, those intent checks, I have predefined in the network or just create on the fly, and whenever I do my change, whenever I close it out, I can verify.

Craig 28:51

Here are all the checks I have, here are all the configuration checks, here are all the intent-based flow checks. Are all of those still passing, so that you can close out whatever change record you have in a more holistic way? We like to joke and call it mean time to innocence: the network is definitely not the problem, because I can mathematically verify what the flows are inside my network, and not just on a manual basis. We have hooks into Ansible, and I've written a Terraform provider that will let you do that on a pre/post-change basis, because we know most people in the cloud probably aren't doing manual changes like we're still doing on-prem.
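The pre/post-change workflow described here (run the same intent checks before and after a change, and only close the change record if nothing regressed) can be sketched as follows. `Snapshot`, `must_reach`, `must_not_reach`, `compare`, and the flow tuples are hypothetical names invented for illustration; this is not Forward's Terraform provider, its Ansible hooks, or its API, and a real snapshot would be derived from the modeled network rather than a hand-written set of reachable flows.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Tuple

Flow = Tuple[str, str, int]  # (source, destination, port), hypothetical shape


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for one collection of the network model."""
    reachable: FrozenSet[Flow]


Check = Callable[[Snapshot], bool]


def must_reach(src: str, dst: str, port: int) -> Check:
    """Intent: this flow must be delivered."""
    return lambda snap: (src, dst, port) in snap.reachable


def must_not_reach(src: str, dst: str, port: int) -> Check:
    """Intent: this flow must be blocked."""
    return lambda snap: (src, dst, port) not in snap.reachable


def compare(checks: Dict[str, Check], before: Snapshot, after: Snapshot) -> Dict[str, List[str]]:
    """Run every intent check on both snapshots and classify each outcome."""
    result: Dict[str, List[str]] = {"regressed": [], "fixed": [], "still_failing": []}
    for name, check in checks.items():
        ok_before, ok_after = check(before), check(after)
        if ok_before and not ok_after:
            result["regressed"].append(name)
        elif not ok_before and ok_after:
            result["fixed"].append(name)
        elif not ok_after:
            result["still_failing"].append(name)
    return result


def can_close_change(result: Dict[str, List[str]]) -> bool:
    """Close the change record only if no previously passing check now fails."""
    return not result["regressed"]
```

The design choice is to compare outcomes rather than just run the post-change checks: a check that was already failing before the change is reported separately, so it does not block closing a change that did not cause it.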

Chris 29:27

You'd be surprised. More than you think.

Craig 29:32

That's true, you're not wrong on that, but it's an ideal.

Tim 29:38

So tell us a little bit about the journey, if you will. This isn't new, right, but it's not where Forward started. You brought in cloud, so how did you go from "hey, we have no cloud whatsoever" to where you are now? Actually, I'm curious how far Forward has gotten in its cloud observability journey. Maybe that could be the last bit you tie up with.

Craig 30:03

So where it came from is, when you start out with just modeling the data center or modeling a campus, the question always comes up: okay, my applications span multiple places, so when are you going to be able to model not just that part, but my SD-WAN provider, or anything that's an overlay? So it was an incremental journey of adding more and more things we can model. Now we can do end-to-end everything in the data center. Okay, now we can add your wireless piece if you want to, we can add your SD-WAN to connect all of your sites, and SD-WAN providers started saying, hey, we can connect you to any one of your public clouds as well.

Craig 30:50

It became very obvious, and people were like, well, yeah, everyone's somewhere in a cloud migration, either going or coming in some way or another, and if they're doing multi-cloud, that's a whole other thing too. When you're trying to do end-to-end path modeling, having a hole in the middle becomes a very sore spot for people. It's like, well, I can go up to this hole here, and then I'm stuck. To your point earlier, if you're using native AWS tools, you get to that hole, whether it's a third-party firewall or a third-party overlay, and it's just, well, I can check to this point, but that's about it. So it became very obvious that that hole needed to be filled. We started on AWS, added Azure and GCP, and we're adding Oracle soon. So yeah, there are ways to add more and more things to get the most out of it.

Chris 31:38

It's just a natural progression to start filling gaps, right? No holes in the network.

Craig 31:43

Exactly right.

Networking Observability Challenges and Solutions

Tim 31:45

Actually, I'm curious about Cloud WAN. Cloud WAN came out, what, two, two and a half years ago now, when it went GA? This is just fascinating to me in general: how did it go from "hey, here comes Cloud WAN" to "okay, Forward has to support this"? What did you actually have to do to start supporting Cloud WAN? Did you have to go figure out all the APIs?

Craig 32:14

Basically, yeah, and that's really the tricky part, and that's really what separates us from most. Same with AWS or on-prem: trusting what the documentation says is just a recipe for disaster. You have to check every feature they have and packet test every bit of it, because you have to know exactly how it forwards. If I have a VGW connected to a transit gateway, do they actually forward between each other when they're connected to a Direct Connect gateway?

Craig 32:45

We've added things customers have that are even hard for us to get access to, like AWS Outposts. The APIs, fortunately, are very public. AWS is really good about giving you APIs; Azure slightly less so. They don't give you as good forwarding characteristics. If you want to figure out the routing table for a VNet, you have to have a VM attached to it and look at the effective routes and effective security groups, so it's a little more of a pain there. But yeah, it takes packet testing every bit of it to make sure we know exactly how it's going to forward in all use cases.

Chris 33:21

Not to go down any specific rabbit hole, but how does it fare with VMC? All the VMware Cloud stuff?

Craig 33:33

Yeah, I mean, that's just NSX-T. NSX-T is an overlay, and we support that just the same, whether it's on-prem or in the cloud. If it's on-prem ESX or whatever, that's going to use the vCenter APIs, and then the NSX-T Manager and NSX-T Edge are just another overlay on top of it. And when you have any other firewalls, we support Gateway Load Balancer connections to those, so you can model that as well.

Tim 33:57

We'll have to see if we can get a demo at some point, because we've done demos before and put them on our YouTube channel. I'm having trouble visualizing it. Well, I say that now, but you've actually shown me, when we were at RSA or one of the other shows, and it looked really cool. I think that's how we first started having the discussion about it.

Craig 34:18

When I talk to people about this, a lot of them are reminded of the old Packet Tracer and things like that, where you can do it at some level. So the idea behind this isn't particularly new. Almost everyone says, I wish I had this 10 or 15 years ago. And yeah, there have been small lab versions of this sort of technology with emulation or whatever. It's the scale that really makes a difference, and the support for pretty much everything out there changes the game, from what we've seen. Being able to see your entire network offline and tell what the path is for anything is hugely useful, because honestly, I spent much of my career doing exactly that with pings and traceroutes, logging in to this hop, then this hop, then this hop, figuring out where it is.

Craig 35:03

And now, with the cloud, I can't really do that. Like I said, I can't log into a transit gateway and look at the table. I can pull some APIs. Just recently, at Summit DC and at the Community Day in the Bay Area, I did a session on VPC Reachability Analyzer, and it's fairly painful. It's got some pretty stark limitations on what you can do.

Tim 35:25

Yeah, when we were writing the book Chris and I were working on, the Packt one, Reachability Analyzer was one of the tools I had to play with. My section of the book was observability, so I got really used to understanding the limitations on observability. And yeah, Reachability Analyzer is one of those things where it works when it works.

Craig 35:50

If you know the limitations, yeah, absolutely.

Chris 35:52

Yeah, because they have Reachability Analyzer, then there's also Route Analyzer, and then I think Transit Gateway Network Manager (TGNM).

Tim 36:01

Yeah.

Craig 36:02

They have several different analyzers and they don't really work together.

Chris 36:05

That's the thing. I think there's one for AWS Network Firewall too. There is now.

Tim 36:11

But, like you said, I think it's one of those things where AWS is shipping its org chart, in a way, because the way you got ahead at AWS was to come up with a new service. So you'd have a bunch of people creating new services, and I feel like all of these observability services are just little pet projects that became services, and that's why they don't work together. Basically, you need something like Q, or whatever they're offering, to stitch all that data together for you.

Craig 36:40

Yeah, not to get on a rant there, but every time I go to one of those sessions at re:Invent or something, I'm struggling to find networking content in any way. It's like, give me one. I get it, this is a developer conference, but give me something here, guys. I'm just trying to fill my schedule with any networking content, and it's not so easy.

Chris 37:03

Yeah, I couldn't agree more on that one. Maybe it's because people do networking talks and they put them at the other end of the Vegas Strip, on the top floor of the Mandalay. That was rough. Finding my room was hard.

Tim 37:15

I'll be completely honest. I've been to the Mandalay like 20 times from when I worked at Cisco; obviously, we used to have Impact or GSX there and everything. And it was only this time, going to find my room, that I realized there was a third floor to the convention center, because I'd never gone up there in all the years that I've been there.

Chris 37:34

I didn't know there was a third floor. It was the first time I'd heard about it.

Tim 37:39

You know, they have the little meeting rooms. To be fair, it was a nice room, but I'll be honest, it was hard to find, and I appreciate you coming out to support me for that. But yeah, of the three other networking talks that were there, it's funny.

Chris 37:51

There's only so many, and then one of them is always a repeat of last year's.

Tim 37:58

So it's just like.

Craig 37:59

So yeah, we're definitely bringing up the rear there. And I think, to your point, it's very much the way they're incentivized: there's always a new service they're coming out with, and something that's kind of orthogonal. You've got their minimum viable product out for everything else, so it's like, well, we'll bring something else, and it's like, okay, well, yeah.

Tim 38:26

Yeah, I am curious to see how much more, because every time I think they've gotten to the point where, all right, you've probably exposed about as much as you can without impacting Hyperplane, they manage to scrape a little bit closer to the skin. I'm curious to see how much further down the stack they can go to make things available to people before they impact themselves.

Craig 38:47

Yeah, I've noticed they're trying to move a little more into observability. They put out a flow-checking system that does a service-assurance sort of thing, so we'll see how far they go with that. But I tend to agree, they've only uncovered as much as I think they're probably able to at this point. It's definitely a "trust us, it works" sort of mindset, and they're definitely pushing the more native services, which I totally get.

Chris 39:16

Yeah, of course. Craig, thanks again for coming on. I think this was a very cool conversation. We might have to pull you back in and do a demo.

Tim 39:25

Yeah, I would love to see a demo.

Chris 39:27

Prepping for this, I actually watched one of the Forward Networks videos on YouTube about preparing for an audit, where you were showing the visibility piece between AWS Transit Gateway and a firewall VPC, and then looking for VPC peerings. It was really cool stuff.

Craig 39:43

Might have been me doing it. So yeah, yeah, it was.

Chris 39:47

I will say, whatever camera they were using when they were filming you guys talking, it was a very good quality camera. It looked very nice.

Craig 39:53

Oh, I'll give my regards to the videographer.

Chris 39:57

Yeah.

Tim 40:00

Actually, where can people find you? Go ahead and plug anything you want.

Craig 40:04

Yeah, so you can find me on most social media. I'm @captainpacket on Bluesky, X, LinkedIn, Xbox Live, PlayStation, so any one of those places. I'm also on the All About the Journey Discord and pretty active on there as well. So yeah, you can find me at most of those places, and I'm always available there.

Podcast Wrap-Up and Book Discussion

Chris 40:26

We'll have to put a link to your peer talk in there as well, because I remember watching that, and it was good as well.

Craig 40:32

Yeah, that's true. At re:Invent I did a couple of peer talks, and that was a lot of fun.

Chris 40:37

Well, that'll do it for today. Thanks again for joining us for another edition of the Cables to Clouds podcast. I think this is coming out in the first week of January, so hopefully you had a great New Year, and maybe by now you're starting to receive the book Tim and I wrote together.

Tim 40:54

Maybe we'll receive it at some point. Yeah, maybe we'll get one, who knows.

Chris 40:59

But with that, we'll wrap it up. Thanks again, and we'll talk to you next week. Bye-bye.

Hi everyone, it's Chris, and this has been the Cables to Clouds podcast. Thanks for tuning in today. If you enjoyed our show, please subscribe to us in your favorite podcatcher, as well as subscribe and turn on notifications for our YouTube channel to be notified of all our new episodes. Follow us on socials at Cables to Clouds. You can also visit our website for all of the show notes at cables2clouds.com. Thanks again for listening, and see you next time.

Chris Miles

Co-host

Tim McConnaughy

Co-host