
Trends and Improving IT

Networking, Security

PETER WELCHER | Solutions Architect 


Technology-focused companies often conduct annual surveys and publish reports about general trends in networking or IT. The intent behind these reports is to market the companies’ solutions and services, probably hoping to catch the eye of managers with questions like, “What else should I be doing or thinking about?”  

The reports also reflect what the companies are currently emphasizing. But hey, they’d be fools if they didn’t emphasize what their customers think is important, and they’re not fools.  

There’s lots of good information in these reports for non-managers as well.  

I recently downloaded and skimmed two such documents: 

  • Cisco’s 2023 Global Networking Trends Report
  • CatchPoint’s 2024 SRE (Site Reliability Engineering) Report

See below for links to get your own report copies. Food for thought!  

Another source of insights: Graphiant sponsored some brief Video Predictions (11 minutes).  

I’m going to flag a few key findings from those reports and use them as a launchpad for thoughts and opinions. I recommend obtaining the reports and reading/skimming them, since some of the items may be more pertinent in your situation than the ones I cover below! 

I used bold blue text below for items more or less straight out of the reports.  

Cisco Insights 

Cisco’s document contained leader insights about key networking issues. I found little to disagree with or expand upon. All good ideas! 

Key findings (or my slight rephrasing in some cases): 

  • Adopt cloud-first networking and security. 
  • Hybrid work continues to pose secure connectivity challenges.  
  • Cloud applications and distributed workforce make traditional security models obsolete.  
  • The transition to cloud/multi-cloud is accelerating. (I personally think CDNs are becoming rather important as part of the app/service delivery resource chain as well.)
  • Securing user access to cloud applications is the top 2023 networking challenge.  

Key recommendations: 

  • Pursue network and security convergence.  
  • Increase collaboration from access networking to cloud (between all ops teams). 
  • Standardized policies, shared telemetry, and streamlined workflows across security, networking, and cloud operations deliver better and faster IT and business outcomes than environments that operate in technology silos.  
  • Cloud professionals cite the need for better network ops and cloud ops alignment.  
  • SD-WAN is evolving to full SASE.  
  • Adopt cloud-first networking and security.  
  • Move from reactive to proactive operations to improve uptime and performance levels.  

CatchPoint Insights 

CatchPoint’s document is about Site Reliability Engineering, or SRE. It involves taking an engineering approach to improving reliability, reducing outages, and reducing performance problems.  

The document starts with seven insights: 

  • Loss of control creates new opportunities for relationships and learning. 

Specifically, monitor endpoints that can disrupt (item or app) productivity or the customer experience, even if they are outside your control. 

Monitor third-party services, for instance SaaS, cloud resources, SASE, BGP, CDN, DNS, APIs, etc. A minimal example of such checks follows below. 

Preparation for cross-vendor and cross-team events can help! Shared tools help as well.  
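
To make “monitor third-party services” concrete, here is a minimal Python sketch (my illustration, not anything from the report) that times DNS resolution and an HTTP health check for a couple of hypothetical external endpoints. The names and URLs are placeholders for whatever SaaS, CDN, or API dependencies matter to you.

```python
# Minimal sketch: time DNS resolution and an HTTP health check for third-party
# dependencies. The endpoint names and URLs are hypothetical placeholders.
import socket
import time
import urllib.request
from urllib.parse import urlparse

THIRD_PARTY_CHECKS = [
    ("saas-portal", "https://status.example-saas.com/health"),  # hypothetical SaaS health URL
    ("cdn-edge", "https://cdn.example.com/ping"),                # hypothetical CDN test object
]

def dns_lookup_ms(hostname: str) -> float:
    """Time a DNS resolution; failures raise an exception."""
    start = time.monotonic()
    socket.getaddrinfo(hostname, 443)
    return (time.monotonic() - start) * 1000

def http_check_ms(url: str, timeout: float = 5.0) -> float:
    """Time an HTTP GET; raises on HTTP errors or timeouts."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read(64)  # read a little of the body to confirm the service actually answers
    return (time.monotonic() - start) * 1000

if __name__ == "__main__":
    for name, url in THIRD_PARTY_CHECKS:
        host = urlparse(url).hostname
        try:
            print(f"{name}: DNS {dns_lookup_ms(host):.1f} ms, HTTP {http_check_ms(url):.1f} ms")
        except Exception as exc:  # one failing dependency shouldn't stop the sweep
            print(f"{name}: check failed: {exc}")
```

In practice you would feed results like these into whatever monitoring or alerting tool you already have rather than printing them, but the point stands: you can watch third-party dependencies even though you don’t control them. 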

  • SRE is not Platform Engineering, but they both develop capabilities. 

Teams are formed around a platform or a capability, typically more so in larger organizations.  

My observation: either way, people need to communicate across organizational boundaries, and tend not to.  

I personally think it is best to have one or two app/platform specialists per application, and to pair them with network, security, storage, etc. experts when there is an outage or performance problem. The app specialist needs to have a pretty good idea of which resources the app uses. Better yet: document that, e.g., all the app flows.  

  • Learning from incidents is a universal business opportunity. 

Respondents said learning from incidents has the most room for improvement among overall incident management activities, across organizations of any size.  

I really like this item: having a “post mortem” or “lessons learned” session after a problem is resolved can be very helpful, yielding process improvements, better documentation, and faster resolution of subsequent problems. 

Having staff with dedicated time to work on such matters is important. Otherwise, other tasks quickly bubble up to the top of the staffers’ to-do lists.  

  • AI is not replacing human intelligence anytime soon.  

I agree, but see my comments below. 

  • When it comes to service levels, ignorance is bliss (in smaller companies). 
  • No single monitoring tool does it all.  

Me: BINGO! But too often one tool is all the staff has, is comfortable using, or maintains well enough to be useful! 

Getting to the right toolset, with good maintenance and training on it, strikes me as key. And often there isn’t time/budget for this. You can spend money on tools, or (indirectly) on downtime and frustrated staff.  

  • Efficiency is the enemy of pride.  

This is in the sense that both are rated highly by respondents as motivating factors, but there is a trade-off between the two. Pride in their work can be important to staff! 

  • Without extending visibility beyond their own network to the Internet and cloud environments, IT teams cannot assure a consistent, high-quality user experience for cloud-based applications and services.  
  • Move from reactive to predictive operations to improve uptime and performance levels.  
  • Speed and agility have increasing importance.  

Graphiant Insights 

Six executives presented briefly. Here are a few insights I selected: 

  • SoftIron: Private cloud may become more rational and come closer to being on a par with public cloud, with a possible private cloud renaissance. The VMware acquisition and licensing changes may cause rethinking of the case for on-prem servers or present an opportunity.  
  • StorMagic: Less VMware use, more containers and KVM, with cost (and understanding total cost) as a significant related factor.  
  • Graphiant: The Internet may return to its roots, namely peer-to-peer communications.  
  • Keeper Security: More use of passkeys, although secure management of them is needed. SMB no longer as passive about security?  

My Thoughts  

Both reports are written with a manager-level focus, so they are somewhat idealized and represent goals. What is actually done in practice may vary to some extent.  

(No, that NEVER happens) 

Here are my observations; they might be dated or skewed based on (some of) my prior customer base: 

  • There is often little planning, preparation and thought given to organizing for incidents, forming cross-team relationships, sharing tool access and skills, etc.  
  • When there is an incident, the result is that too many people end up dropping everything, scrounging around for data, and chaotically comparing notes, and some days later (for nasty performance or protocol bug problems), the issue gets resolved. Let’s call that “FLAILEX mode” (Flailing Exercise mode). 
  • Granted: network and security tools are costly, complex, and hard to learn, and staff need time and practice to troubleshoot effectively with the tools and to broaden their skills. But without that, you don’t have complete data.  
  • Consequence: one or several people spend a day or two examining things they think might be the cause, then discussion happens, or a change is made, more days of exploration ensue … and senior management gets increasingly unhappy.  
  • Yet unhappiness rarely turns into an effort to improve the process. Incidents occur, and when they are resolved everyone breathes a sigh of relief, staff may get some downtime, and then life moves on to the next action item (crisis du jour) – nobody has the time to do incident response improvement tasks.  

Some thoughts not particularly tied to the documents: 

  • The CatchPoint report says that respondents indicated the hardest parts are knowing there IS a problem, diagnosing it, and coordinating the responsible parties.  
  • The first two of those sound to me like a preparation and tools problem. The last sounds like an organizational tools problem as well, in terms of having a constantly updated way to track who does what. It’s not an org chart so much as a mix of current areas of responsibility and/or prior ones/skills. Having someone in each of {network, security, server, storage, cloud} who is savvy about assignments and skills might be one way to tackle that, quickly locating the best-informed person in each area as needed. That should perhaps be the manager, but sometimes more tech depth might be needed? “Senior/most savvy” techie / “old hand” in each area? What else have people seen that works for this?  
  • Having comprehensive data gathering already in place is important for faster problem resolution. If staff must set up monitoring of suspect links or devices, that adds days to troubleshooting.  
  • I personally think cross-tool (cross-team shared access to tools) visibility will be essential going forward.  

This could be via log analysis / reporting tools, but I think AI also has a big role to play here. Cf. my blogs about Selector.AI. If you haven’t noticed: I see LLMs as having major challenges, especially “hallucinations”.  

What does work is AI (in the form of “advanced statistics”) for reporting anomalies and doing cross-category anomaly time correlation.  

If anomaly A happens in close time proximity to anomaly B, they may or may not be related. But knowing they are correlated in time suggests there might be a causal link, or that the two events indicate a broader problem. 

Similarly, knowing that some network or other performance factor is 2 or 3 sigmas/standard deviations from normal generally indicates something abnormal is going on, likely something that might be related to the problem you’re trying to solve (cause or another side effect). 
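
To illustrate the idea (this is just my toy sketch, not any vendor’s algorithm): flag samples that sit more than a chosen number of standard deviations away from a known-good baseline, then look for anomalies from different categories that land close together in time. The metric names and numbers below are made up.

```python
# Minimal sketch of the "advanced statistics" idea: flag samples far from a
# baseline (in standard deviations), then group anomalies from different
# categories that occur close together in time. All data here is illustrative.
from statistics import mean, stdev

def anomalies(times, samples, baseline, threshold=3.0):
    """Return (time, value) pairs more than `threshold` sigma from the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return [(t, x) for t, x in zip(times, samples) if x != mu]
    return [(t, x) for t, x in zip(times, samples) if abs(x - mu) / sigma > threshold]

def correlate(anoms_by_category, window_s=60):
    """Pair anomalies from different categories that occur within window_s seconds."""
    pairs = []
    cats = list(anoms_by_category)
    for i, a_cat in enumerate(cats):
        for b_cat in cats[i + 1:]:
            for ta, _ in anoms_by_category[a_cat]:
                for tb, _ in anoms_by_category[b_cat]:
                    if abs(ta - tb) <= window_s:
                        pairs.append((a_cat, b_cat, ta, tb))
    return pairs

if __name__ == "__main__":
    # Toy data, one sample per minute: an app latency spike and an interface
    # error burst about a minute apart, with quiet baselines from an earlier period.
    times = [0, 60, 120, 180, 240, 300]
    latency_ms = [21, 20, 180, 22, 20, 21]   # spike at t=120
    if_errors = [0, 0, 0, 17, 0, 0]           # burst at t=180
    found = {
        "app-latency": anomalies(times, latency_ms, baseline=[20, 21, 19, 22, 20, 21, 20, 19]),
        "if-errors": anomalies(times, if_errors, baseline=[0, 0, 1, 0, 0, 0, 0, 0]),
    }
    print(found)                          # per-category anomalies
    print(correlate(found, window_s=90))  # cross-category pairs within 90 seconds
```

Real tools do this at scale with smarter baselining (seasonality, percentiles, and so on), but the core idea of sigma-based anomaly flagging plus time correlation is the same. 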

  • The CatchPoint report indicates: 

Infrastructure, application, network, and front-end experience telemetry most often feeds the monitoring.  

Client device, SaaS application, business KPI, and other telemetry feed it less often.  

  • What I’ve seen is that we currently tend to troubleshoot from the device or interface upwards. That may be doing it backwards. Ideally, we should perhaps start with what is affected, and work from the general to more specific details. Doing so avoids fixating on one symptom, and knowing the full problem scope is useful in determining likely causes, or what data to examine.   
  • This happens naturally with outages or performance problems. What’s needed is a quick way to tie the observed app or user symptom to server hardware, network hardware, network links, security along the path, etc., or to the CDN provider, or the cloud provider’s servers, links, etc. Good up-to-date application documentation is part of that (and I’ve never seen it). This is where time correlation should be helping.  
  • On a related note, the whole team needs access and skills across the toolset. If there is an app/user slowness problem, and changes are being made to see if they help, being able to observe app or user performance directly means fewer people are tied up troubleshooting, and thus a MUCH faster hypothesize/change/measure/analyze loop (an “OODA loop” of sorts). A minimal example of such a probe follows below. 
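
As a trivial illustration of “observe app or user performance directly” (again my sketch, with a placeholder URL, not anything from the reports), a synthetic probe like the one below lets whoever is making a change watch response times in near real time, instead of tying up extra people to confirm whether things improved.

```python
# Minimal sketch: poll an application URL and print response times, so the person
# making a change can watch user-visible performance directly. The URL and polling
# interval are placeholders; real environments would feed an existing monitoring tool.
import time
import urllib.request

APP_URL = "https://app.example.com/login"  # hypothetical user-facing endpoint
INTERVAL_S = 10

def probe_once(url: str, timeout: float = 5.0) -> float:
    """Return response time in ms for one GET, raising on errors or timeouts."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read(1024)  # pull a bit of the body so we time more than just headers
    return (time.monotonic() - start) * 1000

if __name__ == "__main__":
    # Runs until interrupted (Ctrl-C); one line per probe.
    while True:
        try:
            print(f"{time.strftime('%H:%M:%S')}  {probe_once(APP_URL):.0f} ms")
        except Exception as exc:
            print(f"{time.strftime('%H:%M:%S')}  FAILED: {exc}")
        time.sleep(INTERVAL_S)
```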

Apropos of institutional learning, one CatchPoint finding that resonates with me: “Further, in today’s fast-paced environment, sharing best practice information gleaned from post-incident work as widely, clearly, and succinctly as possible (something a dedicated incident team will have the time and remit to do) will benefit the entire org.”  

Concerning the Cisco findings, I see little to disagree with.  

The one emphasis I’d add is that standardization within each tech specialty is one way to reduce effort and enhance security. If you secure one standard cloud app deployment model well, the result is likely better than spreading less time across securing 20 different variations that use different designs, tools, and technologies to deploy the app. Standardization also likely makes troubleshooting easier, without having to rediscover/research and accommodate per-app variations.  

The VMware space has apparently been shaken up by the Broadcom acquisition of VMware and the licensing changes. Will VMware become a declining product, supplanted by containers? My bet is yes, although the learning curve and the ease of managing and securing containers may still be issues.  

Concerning passkeys and peer-to-peer Internet traffic patterns, I’m not sure I want to bet against Khalid Raza of Graphiant. I do like how Graphiant avoids creating encrypted tunnel spaghetti, although I see it as sort of an agile MPLS-replacement provider and managed services story, with peer-to-peer “natural” traffic flowing over the Graphiant network.  

To me this raises the question of whether SSE-like technologies may eventually replace and greatly simplify what we’ve got now, and along the way obviate the need to force encrypted flows and push traffic through firewalls, etc.  

Links 

Conclusion 

Troubleshooting now encompasses all aspects of user experience and application delivery. Powerful new network, security, and application management tools can expand our visibility into performance problems and provide copious data to help resolve them quickly. AI can help us focus on the right data to resolve our problems.  

Not everyone has all that in place yet. Even when the tool side is present, focus is needed on improving the troubleshooting process, which includes the less-technical work of human skills building and team building, as well as practice and processes! 

 
