Neon Knight

Security and Network Consulting

Originally Published: 25 November 2016

The concept of Security Zoning, also known as Segmentation, is one of the most important architectural foundations within modern network security design. Security Zoning was first introduced back in the mid 90s when firewalls started to hit the market. In those days, firewalls were usually deployed at the Internet perimeter and the deployment principles were fairly simple (Outside, Inside and DMZ).

Over the last 20 years, the pervasiveness of security zoning has increased significantly, moving from its original use at the perimeter to common use inside the organisation, such as within data centres, cloud infrastructure, or controlling access to high value assets. Unfortunately, many zoned architecture deployments are driven by the goal of meeting compliance requirements rather than delivering the most effective security control.

The intention of this post is to show a new way of thinking about the security zoning design approach in an era of Big Data and Data Science. Security is a field that has many amazing and large data sets just waiting to be analysed.

Over the last decade we have seen huge growth in network size, speed, connectedness and application mix. Application architectures have both grown and become more mission critical at the same time. In response, the complexity of network security architectures, i.e. firewalls and the associated rule sets, has increased exponentially. Today, many deployments have become unmanageable: either the operational costs have blown out or organisations have simply given up trying to engineer an effective implementation. I still see many organisations that try to manage their firewall rule sets in a spreadsheet. In most cases, this approach (IMHO) just does not work effectively any more.

If we had to boil the problem down, we are dealing with a ‘management of complexity’ issue. This is a problem which is ripe for the application of Big Data tools, Data Science and Machine Learning principles.

Big Data tools are able to ingest massive data sets and process them to uncover common sets of characteristics. Let’s look at just two key potential data sources which could be leveraged to improve the design approach:

  • Endpoint information – A fingerprint of the endpoint to determine its open ports and application profile, and hence its potential role.
  • Network flow data – Conversations both within and external to the organisation. In other words, who talks to whom, how much, and with which applications.

To obtain endpoint information, Nmap is a popular, though often hard to interpret, port scanning tool. Nmap can scan large IP address ranges and gather data on the targets, for example open ports, services running on those ports, versions of the services, etc. Feature extraction is a key part of an unsupervised machine learning process, and each of these can be considered a ‘feature’, with each endpoint having a value for each feature. For example, an endpoint with port 80 open, acting as a web server and running Apache.
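As a minimal sketch of what that feature extraction could look like, the following Python snippet parses an Nmap XML report (produced with nmap -oX scan.xml <targets>) into a per-endpoint feature dictionary. The filename and the choice of features (open ports and reported service names) are illustrative assumptions, not a prescription.

    # Sketch: turn an Nmap XML scan into per-endpoint feature dictionaries.
    # Assumes the scan was saved with: nmap -oX scan.xml <targets>
    import xml.etree.ElementTree as ET

    def extract_features(xml_path):
        features = {}  # ip address -> {feature name: value}
        root = ET.parse(xml_path).getroot()
        for host in root.iter("host"):
            addr = host.find("address")
            if addr is None:
                continue
            ip = addr.get("addr")
            host_features = {}
            for port in host.iter("port"):
                state = port.find("state")
                if state is not None and state.get("state") == "open":
                    portid = port.get("portid")
                    host_features["port_%s_open" % portid] = 1
                    service = port.find("service")
                    if service is not None and service.get("name"):
                        host_features["service_%s" % service.get("name")] = 1
            features[ip] = host_features
        return features

    if __name__ == "__main__":
        for ip, feats in extract_features("scan.xml").items():
            print(ip, feats)

Each endpoint ends up as a bag of features such as port_80_open or service_http, which is exactly the kind of representation the clustering step below can work with.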

Machine Learning techniques can be used to process the large data sets which would be produced by an enterprise wide scan. Groups of endpoints with common, or closely matching, feature value sets can be ‘clustered’ using one of a number of machine learning algorithms. In this case, clusters are distinct groups of samples (IP addresses) which have been grouped together. Different algorithms, with different configurations, group these samples in different ways, with K-Means being one of the most commonly used.

Entry into the domain does not require a deep mathematical understanding (although it helps). Python-based machine learning toolkits like Scikit-Learn provide an easy entry point.
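To give a feel for how low that entry point is, here is a hedged sketch that clusters the endpoints from the extract_features() helper above with Scikit-Learn. The choice of five clusters is arbitrary and would need tuning (for example with silhouette scores) in a real engagement.

    # Sketch: cluster endpoints by their port/service features with scikit-learn.
    # Builds on extract_features() above; k=5 is an illustrative assumption.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.cluster import KMeans

    features = extract_features("scan.xml")              # {ip: {feature: value}}
    ips = list(features.keys())

    vec = DictVectorizer(sparse=False)
    X = vec.fit_transform([features[ip] for ip in ips])  # one row per endpoint

    kmeans = KMeans(n_clusters=5, random_state=0)
    labels = kmeans.fit_predict(X)

    # Group endpoints by cluster - candidate security zones to review by hand.
    for ip, label in sorted(zip(ips, labels), key=lambda t: t[1]):
        print("cluster %d: %s" % (label, ip))

The clusters are not zones in themselves; they are candidate groupings that a designer can sanity-check against what they know about the environment.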

Flow information can be output by many vendors’ networking equipment, as well as through probes, taps and host-based agents. There are a number of tools which can ingest network flow information and place it in a NoSQL data store such as MongoDB, or a columnar format such as Parquet.
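As one possible shape for that store, the sketch below loads simplified NetFlow-style records into MongoDB with pymongo and runs a small aggregation. The database and collection names, and the flat record schema, are assumptions for illustration; in practice a collector would be feeding the store continuously.

    # Sketch: load simplified NetFlow-style records into MongoDB with pymongo.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    flows = client["netflow"]["flows"]

    sample_flows = [
        {"src": "10.1.1.20", "dst": "10.2.5.7", "dst_port": 443, "proto": "tcp", "bytes": 18234},
        {"src": "10.1.1.20", "dst": "10.2.5.8", "dst_port": 1433, "proto": "tcp", "bytes": 90211},
    ]
    flows.insert_many(sample_flows)

    # Example aggregation: total bytes per talker pair ("who talks to whom, how much").
    pipeline = [
        {"$group": {"_id": {"src": "$src", "dst": "$dst"},
                    "total_bytes": {"$sum": "$bytes"}}}
    ]
    for row in flows.aggregate(pipeline):
        print(row["_id"]["src"], "->", row["_id"]["dst"], row["total_bytes"], "bytes")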

With flow information providing detailed information on conversations, graph databases like Neo4j can be used to construct a relationship map; that is, the relationships which exist between different endpoints on the network. Graph databases enable this capability in much the same way social media networks like LinkedIn and Facebook show relationships between people.
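A minimal sketch of that relationship map, using the official Neo4j Python driver and Cypher MERGE statements, might look like the following. The connection details, the Endpoint node label and the TALKS_TO relationship type are all illustrative choices, and the flow records are the same sample ones loaded into the store above.

    # Sketch: build a "who talks to whom" graph in Neo4j from flow records.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    def record_flow(tx, src, dst, dst_port, byte_count):
        tx.run(
            "MERGE (a:Endpoint {ip: $src}) "
            "MERGE (b:Endpoint {ip: $dst}) "
            "MERGE (a)-[r:TALKS_TO {port: $port}]->(b) "
            "ON CREATE SET r.bytes = $bytes "
            "ON MATCH SET r.bytes = r.bytes + $bytes",
            src=src, dst=dst, port=dst_port, bytes=byte_count,
        )

    with driver.session() as session:
        for flow in sample_flows:  # illustrative records from the store above
            session.write_transaction(record_flow, flow["src"], flow["dst"],
                                      flow["dst_port"], flow["bytes"])
    driver.close()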

Today, a variety of visualisation tools are available to present this information in a human-friendly display format.

The real power will emerge when the two sources are combined. Understanding the function of each endpoint, combined with information about its relationships with other endpoints, will be a very powerful capability in the design process.
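One hedged way to combine them, building on the earlier sketches, is to tag each Endpoint node in the graph with the cluster it was assigned to and then ask which clusters actually converse, and on which ports. The property names and query below are illustrative only.

    # Sketch: tag graph nodes with K-Means cluster labels, then summarise
    # cluster-to-cluster conversations as candidate zone-to-zone rules.
    with driver.session() as session:
        for ip, label in zip(ips, labels):
            session.run("MATCH (e:Endpoint {ip: $ip}) SET e.cluster = $cluster",
                        ip=ip, cluster=int(label))

        result = session.run(
            "MATCH (a:Endpoint)-[r:TALKS_TO]->(b:Endpoint) "
            "WHERE a.cluster <> b.cluster "
            "RETURN a.cluster AS src_zone, b.cluster AS dst_zone, "
            "collect(DISTINCT r.port) AS ports"
        )
        for row in result:
            print(row["src_zone"], "->", row["dst_zone"], "ports:", row["ports"])

The output is effectively a first draft of an inter-zone rule set, derived from observed behaviour rather than a spreadsheet.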

I’m not suggesting this is the only answer, as many other potential data sources exist. Additionally, I’ll admit I have probably oversimplified the situation. However, my point is that by utilising just these two data sources, coupled with some now commonly available Data Science tools, a new and far more effective security zoning design approach can be created. My key goal is to hopefully spawn some new thinking, discussion and projects in this direction.
