Addressing the shortcoming of machine-learning for security

Posted by Hagai Bar-El on Sunday, November 15, 2020 | Categories: Cyber security, IT security, Secure design, Security analysis

Defined tags for this entry: machine learning

In a previous post I wrote about cases in which machine-learning adds little to the reliability of security tools, because it often does not react well to novel threats. In this post I will share a thought about overcoming this limitation of machine-learning by properly augmenting it with other methods. The challenge we tackle is not that of finding additional methods of detection, as we assume such methods are already known and deployed in other systems. The challenge is how to combine traditional detection methods with those based on machine-learning in a way that yields the best overall results. As promising as machine-learning (and artificial intelligence) is, it is less effective when deployed in a silo (not in combination with existing technologies), hence the significance of properly marrying the two.

I propose to augment the data used in machine-learning with tags that come from other, i.e., traditional, classification algorithms. More importantly, I suggest distinguishing between the machine-learning-based assessment component and the decision component, and using the tagging in both components, independently.

My previous post on the subject made the point that machine-learning can be useful in security tools, but it has the inherent shortcoming of rewarding attackers’ novelty by insufficiently addressing new threats. A reasonable conclusion could be that we should combine machine-learning methods with other, traditional, ones. The key question is how to do so effectively.

The difficulty in combining methods of detection

The quality of detection is measured by two indicators of success, or to be more exact: by two indicators of failure. Those are the well-known ratios of:

false-negative (the ratio of events erroneously not triggered, such as events of malicious payload not being stopped), and
false-positive (the ratio of events erroneously triggered, such as events of benign payload being treated as malicious).

Detection systems try to minimize both, while having certain tolerances for each of the indicators. Such levels of tolerance are essential to define, because decreasing one indicator often comes at the cost of increasing the other. For example, you could configure your spam filter to detect and remove all spam, if only you did not mind how many legitimate messages are mistakenly removed in the process. (Deleting your entire mailbox is the obvious extreme, but there are several other ways that are less draconian and yet are still largely undesirable, such as deleting all messages from unknown senders.)

Our goal in combining methods of filtering, such as methods that utilize machine-learning with methods that don’t, for the purpose of improving overall detection, is essentially to decrease the false-negative ratio, while not increasing the false-positive ratio beyond a certain acceptable threshold. This is hard to accomplish by merely subjecting the payload to both methods and calculating a logical ‘or’ between their block-yes/no verdicts. Doing so will decrease the false-negative ratio, no doubt, but is likely to also increase the false-positive ratio in the process. On the other hand, subjecting the payload to both methods and computing the logical ‘and’ of their block-yes/no verdicts will decrease the false-positive ratio, but will also increase the false-negative ratio in the process: nothing that machine-learning alone would have let through will be blocked by the new combination.
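The trade-off between ‘or’ and ‘and’ can be illustrated with a small sketch. The eight labeled samples below are entirely made up for illustration (they are not measurements of any real product), and the function name `ratios` is my own; the point is only to show the direction in which each combination moves the two ratios.

```python
from typing import Callable, List, Tuple

# Hypothetical labeled samples: (ml_verdict, traditional_verdict, is_malicious).
# A verdict of True means "block". The numbers are invented for illustration.
SAMPLES: List[Tuple[bool, bool, bool]] = [
    (True,  True,  True),   # both methods catch the threat
    (True,  False, True),   # only machine-learning catches it
    (False, True,  True),   # only the traditional method catches it
    (False, False, True),   # both methods miss it
    (True,  False, False),  # machine-learning false alarm
    (False, True,  False),  # traditional false alarm
    (False, False, False),  # both correctly pass
    (False, False, False),  # both correctly pass
]

def ratios(combine: Callable[[bool, bool], bool]) -> Tuple[float, float]:
    """Return (false_negative_ratio, false_positive_ratio) for a combiner."""
    malicious = [s for s in SAMPLES if s[2]]
    benign = [s for s in SAMPLES if not s[2]]
    missed = sum(1 for ml, trad, _ in malicious if not combine(ml, trad))
    false_alarms = sum(1 for ml, trad, _ in benign if combine(ml, trad))
    return missed / len(malicious), false_alarms / len(benign)

ml_only = ratios(lambda ml, trad: ml)           # (0.5, 0.25)
either = ratios(lambda ml, trad: ml or trad)    # (0.25, 0.5)
both = ratios(lambda ml, trad: ml and trad)     # (0.75, 0.0)
```

On this toy data, ‘or’ halves the false-negative ratio of machine-learning alone but doubles the false-positive ratio, while ‘and’ eliminates false positives at the cost of missing more threats.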

Lucky simple combinations

Sometimes we get lucky and deal with detection methods that have negligible false-positive or false-negative ratios, and those can be freely combined with ‘or’ or with ‘and’, respectively. For example, filters that are based on blacklists of known evils have practically no false-positives (all listed evils can safely be assumed to really be evil) and hence blacklists are usually connected by ‘or’, that is, if the blacklist sounds the alarm, the system treats the subject as malicious, regardless of what other tests may have said.

Lucky clever combinations

Sometimes we can combine the filtering methods using models that are more sophisticated than a logical ‘and’ or ‘or’, but that are still straightforward. For example, one method could be used for detection and another for validation. This is useful if the former has a low false-negative ratio (but potentially a high false-positive one) whereas the latter has it the other way around. The example for this case is taken from my own residence:

My home is equipped with a few CCTV cameras covering its front and back yards. Some people asked me how those cameras help in preventing burglary, given that I do not sit in front of the screen seeking predators all day. The answer is that those cameras are not used for detection at all. Detection is accomplished by an array of 15 IR sensors that are scattered around the same area where the cameras are installed. When any one of those sensors detects anything, it calls me, and the cameras play a role in validating the event, that is, they allow me to properly identify the many false triggers that the IR sensors produce. Validation and detection capabilities are closely intertwined; were it not for the reliable ability to identify false alarms (provided by the cameras), I would never have been able to afford the detection capability provided by 15 sensitive sensors. Now using our terminology: we have sensors that have a low false-negative ratio (they trigger on anything) but a high false-positive ratio (they trigger on anything), combined with cameras that don’t detect a thing, but have a very low false-positive ratio (that of my own eyes). The combination is simple, and it wins.
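The detect-then-validate arrangement can be sketched in a few lines. This is only an illustration under my own assumptions: the event fields `motion` and `person_on_camera` are hypothetical stand-ins for the IR sensors and for what a human sees on the footage, and the function name is invented.

```python
from typing import Callable, Dict, Iterable, List

Event = Dict[str, object]

def detect_then_validate(
    events: Iterable[Event],
    detector: Callable[[Event], bool],
    validator: Callable[[Event], bool],
) -> List[Event]:
    """Keep only events that the detector raised and the validator confirmed.

    The validator is consulted only for events the detector raised, which is
    what makes an expensive validator (such as a human) affordable.
    """
    confirmed = []
    for event in events:
        # Short-circuit evaluation: validate only when the detector triggers.
        if detector(event) and validator(event):
            confirmed.append(event)
    return confirmed

# Hypothetical events, invented for illustration.
EVENTS: List[Event] = [
    {"id": 1, "motion": False, "person_on_camera": False},  # quiet yard
    {"id": 2, "motion": True, "person_on_camera": False},   # a cat: false trigger
    {"id": 3, "motion": True, "person_on_camera": True},    # a real intruder
]

alarms = detect_then_validate(
    EVENTS,
    detector=lambda e: bool(e["motion"]),
    validator=lambda e: bool(e["person_on_camera"]),
)
```

Here only event 3 ends up as an alarm, and the validator is never asked about event 1, which is why a trigger-happy detector remains tolerable.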

The need for clever combination methods

The combinations shown so far were called ‘lucky’ after the person who can actually use them… Combinations of methods using ‘and’, ‘or’, or detect vs. validate, apply to easier cases, where the detection methods come with clear false-positive and false-negative promises that are independent of one another and where one of the ratios is encouragingly low. Unfortunately, our case of combining filtering abilities is not such a case. The machine-learning method is not sufficiently low on the false-negative front, otherwise we would never have embarked on the journey to “improve” it by adding another method. The traditional method we plan to add is also not perfect in terms of false-negatives, as that would negate the need for the machine-learning part in the first place. Both the machine-learning method and the traditional methods are introduced to improve detection (i.e., to reduce the false-negative ratio). The false-positive ratio was not explicitly discussed in the BitDam whitepaper that I referenced in my previous post, but from experience, it is never negligible either (other than in systems that are blacklist-based, and which suffer from high false-negative ratios as a consequence).

Furthermore, the false-positive ratio is not an independent variable of which we can say that “either it is low and we are okay, or it is high and we have a challenge”. Detection systems are calibrated to offer the lowest false-negative ratio while maintaining a tolerable false-positive ratio (not too high). Therefore, the ability to combine detection methods in a way that keeps the false-positive ratio under control implies the ability to configure those detection methods for better detection (lower false-negative ratio). For example: a heuristic virus detection system may have a low false-positive ratio, but in return for that low ratio it may need to be configured to sound an alarm only on well-evident cases of malicious code. Combining such a detection system with a machine-learning one in a way that eliminates many of the false-positive cases may allow this same detection technique to be used in a more permissive mode that can result in much more malware being detected.

One tempting-but-undesirable approach

There is one approach to combining traditional methods of filtering with machine-learning methods that is particularly tempting. Instead of naively mixing the results, we tag the payload with the verdict of the traditional methods, and feed this tag data into the machine-learning logic itself. This approach essentially utilizes the power of artificial intelligence to autonomously determine how best to use the traditional methods.

The main difficulty with this approach is that it neglects the reason we convened in the first place. The need for additional methods that scan the payload was raised by the realization that machine-learning in itself is often insufficient. This assertion did not apply to particular machine-learning algorithms that are based on particular data, but referred to the overall notion that machine-learning, of whatever type and using whatever data, leaves a gap that can be exploited by creative attackers.

A proposed way of combining detection methods

The most effective way to combine machine-learning methods with traditional ones, in my opinion, is by combining the results of the two (or more) detection methods using a more complex logic that follows two guidelines:

treating the findings of the non-machine-learning method/s as part of the data that is used by the machine-learning logic, i.e., by tagging the input to the machine-learning logic with the verdict/s of the traditional method/s (as suggested by the “tempting approach” above), and yet more importantly:
treating the machine-learning logic as the originator of an assessment, but not as the originator of a decision, allowing the decision to be computed using logic that considers the tags created by the traditional method/s, regardless of the machine-learning assessment; this is in order to put the machine-learning assessment into a more accurate context, and to make the decision wiser all-in-all.

The first bullet is straightforward. If the traditional detection method “decided” that a certain payload is malicious because of this and that, then the data of that payload shall be tagged as “considered malicious by method A, because this and that.” We need this tag so that the intelligent part of our system, the part which is designed to learn the constantly evolving world for making better decisions tomorrow, gets every piece of useful information. The traditional method brings certain detection capabilities to the table, so it is unlikely that its verdict is entirely useless for improving the machine-learning context.
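As a rough sketch of what such tagging could look like, the snippet below adds the verdicts of traditional methods to a payload’s feature set before it reaches the learning logic. Everything here is an assumption made for illustration: the `TraditionalVerdict` type, the `tag:` key naming scheme, and the representation of features as a dictionary of numbers are mine, not part of any particular product.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class TraditionalVerdict:
    """Verdict of one traditional (non-machine-learning) method. Hypothetical type."""
    method: str      # e.g. "method_a"
    malicious: bool  # the method's block-yes/no verdict
    reason: str      # the "because this and that" part

def tag_features(
    features: Dict[str, float],
    verdicts: List[TraditionalVerdict],
) -> Dict[str, float]:
    """Return a copy of the features, extended with one tag per verdict."""
    tagged = dict(features)  # leave the caller's features untouched
    for verdict in verdicts:
        tagged[f"tag:{verdict.method}:malicious"] = 1.0 if verdict.malicious else 0.0
        if verdict.malicious:
            # Keep the reason as well, so the learner can weigh it separately.
            tagged[f"tag:{verdict.method}:reason:{verdict.reason}"] = 1.0
    return tagged

tagged = tag_features(
    {"size_kb": 12.0},
    [TraditionalVerdict("method_a", True, "embedded_macro")],
)
```

The same tags can later be handed to the decision logic directly, independently of whatever the learning component made of them.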

Machine-learning proponents may claim that if that traditional method is of any worth, then the learning logic will build its own version of it by itself over time. But this indeed takes time. Pragmatic machine-learning shall be designed to keep improving the machine’s understanding of the world on top of what we can readily teach it. After all, we also develop medication rather than trust evolution to solve our medical problems.

The second bullet is less straightforward, and requires understanding the role that machine-learning plays in the decision process.

The role of AI and machine-learning in making decisions

I apologize in advance for this section seeming too philosophical at first.

The system we have at hand, like many others, is one that is designed for making decisions. This system uses a machine-learning component that has the objective of collecting relevant pieces of data about the world, and using certain logic and state information to maintain a current model of certain aspects of this world. This model may, in our case, contain indicators such as: “files that start with A are usually benign, but files that start with A and which are very short are not necessarily so; files that have B in them are always up to no good…”, etc.

When the machine-learning (artificial intelligence) component is working, rather than learning, it is presented with some data and it voices an opinion, based on what it (thinks it) knows about the world. This machine-learning component provides an assessment that is based on what it knows; it does not make a decision. The decision is made by another part of the system, based also on this assessment. It is important to note that a system that makes a decision, even if it is largely based on a machine-learning component, makes up its mind using some decision logic which does not merely replicate the assessment provided to it. The assessment from the machine-learning component is often probabilistic (“85% that this file is malicious”), and yet the decision is always deterministic; either the file is flagged/deleted/quarantined, or not.

Consider spam-filtering as an example, a domain that has used machine-learning for decades in the form of Bayes classifiers. The Bayes classifier reads all your mails, spam or not, breaks them into their words and tokens, and maintains statistics on how often each token appeared in spam and in genuine messages. After much training, a Bayes classifier is among the most effective spam filtering mechanisms to date. But the spam filtering package you use has more than Bayes to it. The spam filter passes the message in question to the Bayes classifier, which will respond with an assessment like: “based on all past messages I’ve seen, this message is spam with probability of 70%.” The spam filtering package will note this input, and then make the decision. Often, it will determine whether to block the message or not according to the Bayes classifier, but not always. For example, if at some point you indicated that the message sender is legitimate for you, then the spam filter will let the message through regardless of the whims of the Bayes classifier. The Bayes classifier will humbly record its mistake so that perhaps it does not make it next time; perhaps. The AI Bayes classifier provides input to the decision logic; it is not the decision logic.
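The separation between the classifier’s assessment and the filter’s decision can be sketched as follows. This is a minimal illustration, not how any specific spam filtering package works: the function name, the `trusted_senders` set, and the 0.5 threshold are all assumptions of mine.

```python
from typing import Set

def should_block(
    spam_probability: float,
    sender: str,
    trusted_senders: Set[str],
    threshold: float = 0.5,  # assumed cut-off, chosen for illustration only
) -> bool:
    """Decide whether to block a message, given the classifier's assessment.

    The classifier supplies spam_probability; the decision also uses
    information that the classifier does not own.
    """
    if sender in trusted_senders:
        # The user vouched for this sender: let the message through,
        # regardless of what the classifier thinks.
        return False
    return spam_probability >= threshold

trusted = {"alice@example.com"}
blocked_stranger = should_block(0.7, "bob@example.com", trusted)   # True
blocked_friend = should_block(0.7, "alice@example.com", trusted)   # False
```

The same 70% assessment leads to two different decisions, because the decision logic weighs an input that lies outside the classifier’s model.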

Tagging data for wiser decisions

When we come to augment the AI assessment of a payload with other, traditional, methods, we normally cannot just ‘and’ or ‘or’ the verdicts made using those methods. We need to tag the input to the machine-learning component with the outcome of the assessment done by the traditional methods, so as to help it improve over time, but in the everlasting meantime, we shall use the tagging provided by the traditional methods to augment the decision logic, which must be treated as separate from machine-learning. It is this decision logic that is eventually responsible for putting the assessment made by artificial intelligence into its right context. That right context is determined, to a large extent, by the level to which that artificial intelligence assessment can be relied upon, on a case-specific basis. The fact that the AI component has its own assessment of its imperfection (e.g., “the file is malicious with 80% chance”) is not in itself sufficient to this end, because it only reflects the imperfection of the AI’s internal world model, and is not in itself indicative of the implication of this imperfection on the decision to be made, a decision for which the AI component is not responsible. Note the spam filtering example again: the Bayes classifier says “by what I know, this is spam with 80% chance”, confessing to its imperfect awareness of the world, but it does not know what the implication of this imperfection is on the decision for that message. The filtering logic may filter the message, e.g., if it has no other inputs, or it may just as well decide to keep the message, e.g., if all other non-AI signs are good. The AI machine-learning classifier is the expert witness, not the judge.

In our case as well, the tag that the payload received from the traditional classifier/s shall be considered by the decision logic in addition to the machine-learning component’s output. This holds even though the machine-learning component may have based its output on that same tag, because its evolution may be too slow (as shown by the BitDam whitepaper).

When the decision logic receives a payload that is tagged as “likely malicious by traditional method A”, it may decide to block the payload regardless of what the machine-learning component thinks, or it may decide to block it unless the machine-learning component reports that the payload is benign with more than 90% certainty. This way we can attempt to optimize the false-positive and false-negative ratios intelligently, while also overcoming the limitation of machine-learning models.
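One possible shape for such decision logic is sketched below. It is a sketch under stated assumptions rather than a recommended policy: the `Assessment` and `Decision` types are hypothetical, the 90% override comes from the example in the text, and the 0.5 threshold used when no traditional method raised a flag is an arbitrary value of my own.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable

class Decision(Enum):
    BLOCK = "block"
    ALLOW = "allow"

@dataclass(frozen=True)
class Assessment:
    """Output of the machine-learning component: an opinion, not a decision."""
    malicious_probability: float

def decide(
    assessment: Assessment,
    malicious_tags: Iterable[str],   # names of traditional methods that flagged the payload
    benign_override: float = 0.90,   # from the text: benign with more than 90% certainty
    block_threshold: float = 0.50,   # assumed value for the untagged case
) -> Decision:
    """Combine the machine-learning assessment with traditional tags."""
    benign_probability = 1.0 - assessment.malicious_probability
    if list(malicious_tags):
        # A traditional method flagged the payload: block it unless the
        # machine-learning component is very confident that it is benign.
        if benign_probability > benign_override:
            return Decision.ALLOW
        return Decision.BLOCK
    # No traditional flag: fall back on the machine-learning assessment alone.
    if assessment.malicious_probability >= block_threshold:
        return Decision.BLOCK
    return Decision.ALLOW

tagged_and_unsure = decide(Assessment(0.30), ["method_a"])   # Decision.BLOCK
tagged_but_cleared = decide(Assessment(0.05), ["method_a"])  # Decision.ALLOW
untagged = decide(Assessment(0.30), [])                      # Decision.ALLOW
```

Note that a 30% assessment is blocked when method A flagged the payload and allowed when it did not; the tag changes the context in which the same assessment is read. The stricter variant from the text, blocking regardless of the assessment, amounts to setting `benign_override` to 1.0.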

As the system runs, the machine-learning component will improve its models. As it improves, it becomes increasingly tempting to have the decision logic consider only inputs from AI. But learning is forever gradual, and the attacker is not acting at random but will always be attempting tricks believed to be slightly past the point that machine-learning has reached. Our only way of maintaining an edge over the creative attacker is by combining machine learning with other techniques, and our only way of combining them effectively is by tagging data not only to be included in the machine-learning model, but also to be directly considered by the decision logic. This could be a safe way to overcome the ever-moving, ever-present security gap caused by AI model imperfection.