The Danger of Over-Valuing Machine Learning

Machine learning has become the latest darling of the IT marketing space, a secret sauce that is supposed to turbo-charge computers and bring us closer to the nirvana of artificial intelligence dominance ... or something like that. Like so much of what comes out of IT marketing, most of it is hype, and deceptive hype at that. While there is a lot of power in what machine learning can do, by assigning it magical capabilities the mavens of marketing may actually be setting the whole field up for yet another artificial intelligence winter, as expectations continue to fall short of reality.

It's worth understanding what machine learning is, and isn't. At its core, the term refers to a family of mathematical processes, or algorithms, in which a set of data records with different attributes is normalized (each attribute converted to a value between zero and one) and then run through a statistical process to determine when certain clusters of attributes occur together. Such clustering usually signals that a record falls into a certain category.
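To make the normalize-then-cluster idea concrete, here is a minimal sketch in Python. The records, the centroids, and the nearest-centroid assignment are all invented for illustration; a real system would learn its cluster centres from training data rather than hard-coding them.

```python
import math

# Toy records: two attributes on very different scales.
records = [
    {"weight_kg": 4.0, "height_cm": 25.0},   # cat-like
    {"weight_kg": 30.0, "height_cm": 60.0},  # dog-like
    {"weight_kg": 5.0, "height_cm": 28.0},
    {"weight_kg": 25.0, "height_cm": 55.0},
]

def normalize(records):
    """Min-max scale every attribute into [0, 1]."""
    keys = records[0].keys()
    lo = {k: min(r[k] for r in records) for k in keys}
    hi = {k: max(r[k] for r in records) for k in keys}
    return [{k: (r[k] - lo[k]) / (hi[k] - lo[k]) for k in keys}
            for r in records]

def nearest_centroid(record, centroids):
    """Assign a normalized record to the closest cluster centre."""
    def dist(a, b):
        return math.sqrt(sum((a[k] - b[k]) ** 2 for k in a))
    return min(range(len(centroids)), key=lambda i: dist(record, centroids[i]))

normed = normalize(records)
# Hypothetical centroids that a training pass might have produced.
centroids = [{"weight_kg": 0.0, "height_cm": 0.0},   # "small animal" cluster
             {"weight_kg": 1.0, "height_cm": 1.0}]   # "large animal" cluster
labels = [nearest_centroid(r, centroids) for r in normed]
print(labels)  # → [0, 1, 0, 1]: records 0 and 2 cluster together, as do 1 and 3
```

The point of the normalization step is visible here: without it, the centimetre-scale attribute would swamp the kilogram-scale one in the distance calculation.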

A statistician can train the algorithms to recognize that certain clusters of attributes typically indicate something of a given type. For instance, regular shapes are almost always man-made and artificial (literally, "made by art"), irregular shapes that follow fractal patterns are usually plants, irregular shapes capable of unexpected movements are usually animals, and so forth. It is the set of attributes, rather than any single one, that becomes responsible for categorization.

There are four issues that arise with basic machine learning. The first is that you need to train the algorithms initially on a large enough (and representative enough) sample to help the system determine the shape of the envelope that identifies a given cluster. The more items that are identified, the better the mapping, but with a large enough set of clusters, the minimal useful sample size can grow geometrically, so that you might need thousands of samples to properly train a set.

The second problem has to do with categorical bias, either inadvertent or intentional. If a machine learning model looked at the credit ratings and similar factors for two neighborhoods, the sample may overweight the importance of one factor (such as race) because of the way that the sample was gathered. Building a clustering model that can overcome such bias is often difficult, because the modeler is at the mercy of the data available to them.

Another problem is that such clustering typically does not handle edge cases well. Because of the relationship between the number of clusters and the required sample size, if the right variables are not captured in the model, two things that appear superficially similar through the lens of the captured variables can turn out to be very different once the missing variables are included.

For instance, if you have a routine that recognizes different dogs and cats, a picture of a fox is almost invariably going to be miscategorized. It won't be rejected, because foxes share most of the characteristics that cats and dogs have in common; instead, it will be misclassified, because various characteristics of foxes fit into either the dog or the cat category.

This particular issue is especially important in areas such as determining whether a political post on a social network is "fake news". There are statistical tells that can indicate whether a given post was human- or machine-generated, though these are probabilistic determinations.

However, even in the former case, where the content was produced by a human, the only way you can really make that determination is to have enough contextual information (i.e., news) with a solid authentication trail to determine whether or not the facts are consistent. This is an extraordinarily complex problem even for humans, especially given that they have their own biases about specific issues, which tend to let one type of news slide while another comes up for scrutiny.

The final issue comes down to the evolution of interests over time. Let's say that you are an entertainment streaming company that uses machine learning to determine what a particular person likes. The training process will narrow down the set of candidate media the customer wants to watch based upon what they've already watched, but eventually you will run out of media that the customer has not already seen.

Typically, training is done at a fairly broad level (in other words, the customer is also treated as belonging to a certain set of categories), because training is expensive. Any one data record generally has only a small impact upon the overall categorization; it's only when you get to hundreds or even thousands of records signaling a similar change that most machine learning systems are likely to change their classification criteria.

Now, there are alternative approaches. One of them, reinforcement learning, treats every resource as an autonomous agent capable of holding its own state information and performing some of these reclassification processes "on the fly", so to speak. This comes at the cost of increased memory and computational cycles - sometimes a lot more of both. Many of these agents learn by tracking their state and feeding their filtered output back in as input to the next iteration. In effect, the feedback loop (which is now non-linear) means that signals that reinforce a new state carry more weight than older ones.
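One common way to make newer signals outweigh older ones is an exponential moving average of the agent's state. The tiny `Agent` class below is a hypothetical sketch of that feedback loop, not a full reinforcement learning implementation; the `decay` parameter and signal values are invented for illustration.

```python
class Agent:
    """Minimal stateful agent: new signals outweigh older ones via
    an exponential moving average (decay < 1)."""
    def __init__(self, decay=0.5):
        self.decay = decay
        self.state = 0.0

    def observe(self, signal):
        # Feedback loop: the current state is fed back, damped by `decay`,
        # so recent signals dominate the agent's classification state.
        self.state = self.decay * self.state + (1 - self.decay) * signal
        return self.state

agent = Agent(decay=0.5)
for s in [0.0, 0.0, 1.0, 1.0, 1.0]:   # the "interest" shifts mid-stream
    agent.observe(s)
print(round(agent.state, 3))  # → 0.875: dominated by the newer signals
```

Each older signal's influence is halved at every step here, which is exactly the "newer signals have more strength" behavior described above.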

The upside of this is that such agents' behavior can change in response to external stimuli. The downside is that (as with most non-linear operations) causality becomes harder and harder to ascertain. Was the behavior caused by a certain configuration of states, or was it emergent, coming from moving to a higher level of abstraction? Put another way, the more steps you build into a recursive process, the harder it is to determine why behavior changes; and since one of the goals of machine learning is to classify, this inability to determine causality limits its utility for classification.

This explainability requirement is now one of the big areas of machine learning theory - can you build a recursive processor that can nonetheless be used to determine causality? The task is made harder because recursion also has a tendency to let very minor differences in initial conditions balloon dramatically - a phenomenon most widely known as the butterfly effect. In theory, a weather system should be unaffected by the movement of a butterfly's wings - the tiny gust so generated gets damped down too quickly by the surrounding environment. However, in non-linear situations, such a movement may prove to be a tipping point in very rare circumstances.

A good example can be seen with a home calculator with scientific functions. Select a number at random and press the cos() (cosine) button. It will return a value between -1 and 1. Press it again, and it will generate another number. Press it enough times (and it doesn't take many) and the number will converge on a certain value (approximately 0.739, if the calculator is in radians mode).

This number is not magical. Rather, because most calculators are limited in the number of bits they can use to represent a floating point number, the rounding creates a bounding condition that is small initially but eventually dominates any subsequent calculation. The value the iteration settles on is called an attractor. In genuinely chaotic systems, such as the Lorenz weather model, the trajectory never settles on a single point, but if you plotted a graph, you'd see it tracing out the same (butterfly shaped) curve over and over, known as a strange attractor.
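The calculator experiment above can be reproduced in a few lines, assuming radians mode (which is what Python's `math.cos` uses):

```python
import math

x = 1.2345  # any starting value works
for _ in range(100):
    x = math.cos(x)  # repeatedly "press the cos() button"
print(x)  # converges to ~0.7390851, the fixed point where cos(x) == x
```

The limit is the unique solution of cos(x) = x; every starting value is pulled toward it, which is what makes it an attractor.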

Deep learning systems are highly recursive, though they typically use a kernel (a matrix of transformations) to determine values. These neural networks loosely imitate the way that the brain processes information. They are often highly successful at recognizing shapes and patterns, but do so by creating many layers of abstraction, each of which requires an order of magnitude more energy than the previous layer. Deep learning systems are great at discovering patterns, but often provide no clue as to why they do, so there is always the worrisome possibility that after running long enough, you're looking at a strange attractor that has no significance to your data.
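A stripped-down illustration of what "layers of abstraction" means: each layer re-represents the previous layer's output through a matrix of weights and a non-linearity. The weights here are random purely for demonstration; a real deep network would learn them from data, and would be vastly larger.

```python
import math
import random

random.seed(0)

def layer(inputs, weights):
    """One layer: weighted sums of the inputs, passed through tanh."""
    return [math.tanh(sum(w * v for w, v in zip(row, inputs)))
            for row in weights]

# Hypothetical 3-layer stack: each pass transforms the representation
# produced by the layer before it.
x = [0.5, -0.2, 0.9]
for _ in range(3):
    weights = [[random.uniform(-1, 1) for _ in x] for _ in range(len(x))]
    x = layer(x, weights)
print(x)  # the final, most abstract representation of the original input
```

Even in this toy version, nothing in the final numbers tells you *why* they came out as they did - the explainability problem described above.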

Thus, while machine learning is a powerful tool, it is not a magical box. Bias in the data or the model can train machine learning systems in inappropriate ways. Strange butterflies may be lurking within, bringing up artifacts that may be due to factors that have nothing to do with the data. The lack of transparency in the process can make it problematic to determine why a model returns what it does, because emergent behaviors may be lurking in the background that are difficult to ferret out. Finally, the deeper the learning, the more energy is required to maintain multiple levels of abstraction, and that energy can often be significant enough to make using such systems uneconomical.

This is not to say that such tools are useless - machine learning systems have evolved over the last decade into highly useful and effective tools for a wide variety of applications, and they form an integral part of the artificial intelligence toolkit. The danger comes in thinking that such systems are truly intelligent, rather than simply the clever application of high-speed, and occasionally non-linear, solutions.

Indeed, it's worth noting that even the largest, most complex deep-learning systems have only about 1/10,000th the number of connections in the human brain. Artificial systems are faster, but generate a lot more heat, and there are shortcuts that organic systems take that only really become feasible once you have a large enough number of connections that can also grow and die off over time.

Put another way, there is a tendency to believe that machine learning systems are potentially superior in decision-making ability. Yet the reality is more mundane: they are faster, and you can network them, but they are still best in an advisory role rather than a decision-making one. The machine-learning system on top of your head is still ultimately the one that can account for the variables that are too difficult to quantify in an AI setting, at least for the foreseeable future.

Kurt Cagle is a writer, data scientist and futurist focused on the intersection of computer technologies and society. He is the founder of Semantical, LLC, a smart data ...