How to get started with machine learning and AI

"It's a cookbook?!"
Enlarge / “It’s a cookbook?!”
Aurich Lawson | Getty Images

“Artificial Intelligence” as we know it today is, at best, a misnomer. AI is in no way intelligent, but it is artificial. It remains one of the hottest topics in industry and is enjoying a renewed interest in academia. This isn’t new—the world has been through a series of AI peaks and valleys over the past 50 years. But what makes the current flurry of AI successes different is that modern computing hardware is finally powerful enough to fully implement some wild ideas that have been hanging around for a long time.

Back in the 1950s, in the earliest days of what we now call artificial intelligence, there was a debate over what to name the field. Herbert Simon, co-developer of both the logic theory machine and the General Problem Solver, argued that the field should have the much more anodyne name of “complex information processing.” This certainly doesn’t inspire the awe that “artificial intelligence” does, nor does it convey the idea that machines can think like humans.

However, “complex information processing” is a much better description of what artificial intelligence actually is: parsing complicated data sets and attempting to make inferences from the pile. Some modern examples of AI include speech recognition (in the form of virtual assistants like Siri or Alexa) and systems that determine what’s in a photograph or recommend what to buy or watch next. None of these examples are comparable to human intelligence, but they show we can do remarkable things with enough information processing.

Whether we refer to this field as “complex information processing” or “artificial intelligence” (or the more ominously Skynet-sounding “machine learning”) is irrelevant. Immense amounts of work and human ingenuity have gone into building some absolutely incredible applications. As an example, look at GPT-3, a deep learning model for natural language that can generate text that is often indistinguishable from text written by a person (yet can also go hilariously wrong). It’s backed by a neural network that uses 175 billion parameters to model human language.

Built on top of GPT-3 is the tool named Dall-E, which will produce an image of any fantastical thing a user requests. The updated 2022 version of the tool, Dall-E 2, lets you go even further, as it can “understand” styles and concepts that are quite abstract. For instance, asking Dall-E to visualize “An astronaut riding a horse in the style of Andy Warhol” will produce a number of images such as this:

"An astronaut riding a horse in the style of Andy Warhol," an image generated by AI-powered Dall-E.
Enlarge / “An astronaut riding a horse in the style of Andy Warhol,” an image generated by AI-powered Dall-E.

Dall-E 2 does not perform a Google search to find a similar image; it creates a picture based on its internal model. This is a new image built from nothing but math.

Not all applications of AI are as groundbreaking as these. AI and machine learning are finding uses in nearly every industry, powering everything from recommendation engines in retail to pipeline safety in oil and gas to diagnosis and patient privacy in health care. Not every company has the resources to build tools like Dall-E from scratch, so there’s a lot of demand for affordable, attainable toolsets. The challenge of filling that demand has parallels to the early days of business computing, when computers and computer programs were quickly becoming the technology businesses needed. While not everyone needs to develop the next programming language or operating system, many companies want to leverage the power of these new fields of study, and they need similar tools to help them.

Overview of the tool landscape

What goes into building an ML/AI model in today’s world? Ars spoke with Dr. Ellen Ambrose, the director of AI at the Baltimore-area health care startup Protenus, about how to build a new ML model. As she explains it, three major factors go into the creation of a new model: “25 percent asking the right question, 50 percent data exploration and cleaning, feature engineering, and feature selection, [and] 25 percent training and evaluating the model.” While having a huge amount of data within reach is a boon, there are still questions you need to ask before diving in. According to Ambrose, companies need to understand the business problems that can be solved with machine learning and, more importantly, understand what questions they can answer with the data they have.

While available technology can’t necessarily tell you what questions to ask, it can help a team do some data exploration and then aid in the training and evaluation of a given ML model.

Currently, there are multiple companies that sell software packages that do just this—allow groups or individuals to create an artificial intelligence or machine learning model without needing to make a solution from scratch. In addition to these all-in-one packages, there are many freely available libraries that let developers take advantage of machine learning. In fact, in the vast majority of machine learning applications, a developer, data scientist, or data engineer would not be starting from scratch.

Training

Once a team has identified the right questions and has determined that the available data can answer those questions, the model needs to be configured. Some of your data will need to be set aside as a “validation set,” which is used to help ensure your model is doing what it’s supposed to be doing, while the remainder will be used as the basis on which to train your model.

On the surface, this sounds easy to do: Set aside some percentage of your data and never allow the training portion to see it so you can use that data as validation later. But the situation can quickly become complicated. What percentage should this be? What if the event you want to model is very rare? You’ll need some data from that event in both the training and the validation sets, so how do you chop things up? AI/ML tools can help determine how to break this separation down and possibly overcome structural issues with your data, but configuration is still a critical step to get right.
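For concreteness, here’s a minimal sketch of that split using scikit-learn’s train_test_split. The data file and column names are hypothetical; the key piece is the stratify argument, which keeps a rare event represented on both sides of the divide:

    # A minimal sketch of a stratified train/validation split using
    # scikit-learn. The file name and column names here are hypothetical.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("events.csv")      # hypothetical data set
    X = df.drop(columns=["label"])      # features
    y = df["label"]                     # 1 = rare event, 0 = everything else

    # stratify=y keeps the rare event at the same proportion in both halves,
    # so neither the training nor the validation set ends up without it.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )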

The next question is what type of machine learning system to use: a neural network (NN), a support vector machine (SVM), or a gradient-boosted forest. There is no universally perfect answer; in a purely theoretical sense, any method is just as good as any other. In the real world, however, practicality dictates that some algorithms are better suited to certain tasks, as long as you’re not looking for an absolutely optimal result. A quality tool for building an AI/ML model will let the user choose the type of algorithm that runs under the hood, taking care of the mathematics and bringing this kind of system to the average developer.
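As an illustration of what choosing the algorithm “under the hood” can look like, here’s a sketch using scikit-learn, whose estimators all share the same fit/predict interface. The data variables are assumed from the split sketch above, and this is one library’s approach, not the only one:

    # Sketch: scikit-learn's shared fit/predict interface makes it easy to
    # swap the underlying algorithm. X_train, y_train, X_val, and y_val are
    # assumed from the earlier split sketch.
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC
    from sklearn.ensemble import GradientBoostingClassifier

    candidates = {
        "neural network": MLPClassifier(max_iter=1000),
        "support vector machine": SVC(),
        "gradient-boosted trees": GradientBoostingClassifier(),
    }

    for name, model in candidates.items():
        model.fit(X_train, y_train)
        print(f"{name}: {model.score(X_val, y_val):.3f}")  # simple accuracy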

Once you have a set of data to train with, a set of data to validate your model with, and an underlying algorithm, you can try to balance the sensitivity and specificity of your model. At some point in the construction of your model, you will need to determine whether some condition is true or whether some value falls above or below a given threshold. These are binary choices, but things aren’t quite that simple. Because the real world, and the data that comes from it, is messy, your model has four possible outcomes: a true positive, a false positive, a true negative, and a false negative.

There. Are. Four. Outcomes.

A perfect model would report only true positives and true negatives, but that is often not mathematically possible. So finding a balance between sensitivity (the fraction of actual positives the model catches) and specificity (the fraction of actual negatives it correctly rejects) is paramount.
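Continuing the scikit-learn sketches above (and assuming a fitted model and validation data from them), counting the four outcomes and deriving the two rates takes only a few lines:

    # Sketch: counting the four outcomes and deriving sensitivity and
    # specificity. A fitted model, X_val, and y_val are assumed from the
    # earlier sketches.
    from sklearn.metrics import confusion_matrix

    y_pred = model.predict(X_val)
    tn, fp, fn, tp = confusion_matrix(y_val, y_pred).ravel()

    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    print(f"sensitivity={sensitivity:.3f}, specificity={specificity:.3f}")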

While there is nothing preventing an individual from doing all this by hand and continually re-training and re-testing, the inclusion of a nice graphical tool can make it a more approachable task. For the longtime devs out there, think of the difference between running a debugger on the command line 25 years ago versus running a full IDE-based debugger today.

Deployment

Once a model is built and trained, it can do what it was made to do: make recommendations about what to buy or watch next, find cats in pictures, or estimate housing prices. However, to actually make your model do stuff, it needs to be deployable. There are many ways to deploy models, and they differ depending on requirements and the environment in which your model will be used.

Those who are familiar with modern IT and cloud environments have undoubtedly heard of Docker and other similar containerization technologies. Using this type of technology, a completely trained model—along with a small web server running within a lightweight container—can be accessible from anywhere via the cloud. Standard web queries (or other equivalent external calls) can be used to pass information to the model, with the expectation that the response will contain the results (“this is a cat” or “this is not a cat”).
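As a rough illustration (not any particular product’s API), here is what such a containerized prediction service might look like as a tiny Flask app. The model file name, endpoint, and request format are all hypothetical:

    # Sketch: a tiny Flask app that answers "cat or not" queries, the sort
    # of thing you'd package in a container image. The model file and the
    # request format are hypothetical.
    from flask import Flask, request, jsonify
    import joblib

    app = Flask(__name__)
    model = joblib.load("cat_detector.joblib")   # trained model, loaded once

    @app.route("/predict", methods=["POST"])
    def predict():
        features = request.get_json()["features"]   # e.g., a list of numbers
        label = int(model.predict([features])[0])
        return jsonify({"is_cat": bool(label)})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)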

This method allows the trained model to exist in isolation so that it can be reused and redeployed in a known state. However, in a dynamic, real-world environment, data is constantly changing, so there is now a burgeoning field of companies that seek to deploy and keep track of models, monitor their accuracy, and provide everything needed to make a complete “ML life cycle.” This field is known as “MLOps,” for “machine learning ops.” (Think “devops,” but focused on this limited ML life cycle instead of the broader SDLC.)

Python notebooks are invaluable tools in data scientists’ and engineers’ arsenals. These mixtures of code and markup allow scientists, engineers, and developers to share their work in a format that can be viewed and used through a web browser. By coupling this technology with the widespread availability of libraries and systems, a user can simply download or import a trained model with a single Python call.

Say you want to determine the similarity of two sentences. Training and building a full natural language model from scratch would be a massive undertaking, but with a few lines of code, a developer can download a trained TensorFlow model that determines how similar two sentences are, without needing access to the full training set used to create it.
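One concrete (if simplified) example uses the Universal Sentence Encoder, a pre-trained model published on TensorFlow Hub; everything beyond the hub.load call is just standard cosine-similarity arithmetic:

    # Sketch: comparing two sentences with a pre-trained model from
    # TensorFlow Hub (the Universal Sentence Encoder). No training data
    # is required; the model downloads on first use.
    import numpy as np
    import tensorflow_hub as hub

    embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
    vectors = embed(["The cat sat on the mat.",
                     "A feline rested on the rug."]).numpy()

    # Cosine similarity of the two embeddings: closer to 1.0 means the
    # sentences are more similar in meaning.
    a, b = vectors
    similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    print(f"similarity: {similarity:.3f}")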

Combining and publicizing collections of trained ML/AI models in model zoos allows independent users to gain the benefit of the work that larger organizations have devoted to the training of a model. For example, a developer interested in tracking people’s movement in a given space could use a trained model that does the heavy lifting of image recognition and path prediction. The developer could then apply specific business logic to generate value from an idea without needing to worry about the details of how the model was built and trained.

Dr. Ambrose also mentioned a different method of model sharing and deployment. By breaking up a model and its parameters, she said, you can “save a persisted version of a trained model along with its metadata in a [known file] format.” Since a trained ML model is really just a set of fixed mathematical operations, the model can be “exported” and packaged in a way that allows it to be portable but still functional. A neural network model, for example, boils down to layers of weighted sums passed through simple activation functions, with a huge number of parameters; the exact number depends on the inputs and the details of each layer of the network.

Multiple formats for this type of representation exist, from the Predictive Model Markup Language (PMML, an XML-based schema) to the zip-based Spark ML pipeline representation. These files can be used to share and distribute fully trained models between users or workflows. As long as the other user or application knows what the proper underlying model is, it can reconstitute the fully trained setup from the information encoded within these files.
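To make the persist-and-reload idea concrete, here is the pattern in its simplest form, sketched with joblib and the scikit-learn model from the earlier examples. PMML or a Spark ML pipeline archive plays the same role when models need to move between different tools:

    # Sketch: the persist-and-reload pattern in its simplest form, using
    # joblib with a scikit-learn model from the earlier sketches. Formats
    # like PMML exist for sharing across entirely different toolchains.
    import joblib

    joblib.dump(model, "model.joblib")        # export the trained model

    # ...later, in another process or workflow...
    restored = joblib.load("model.joblib")
    print(restored.predict(X_val[:5]))        # same fitted parameters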

From theoretical to practical

As machine learning and artificial intelligence continue to increase in their usefulness, their adoption will only grow. This article should be enough to give you a basic understanding of the systems—or at least enough to leave you with a whole bunch of open browser tabs to read through.

If this article has piqued your interest, stay tuned—in a few weeks, Ars will be running an entire series on creating, evaluating, and running AI models. We’ll be taking the lessons learned from last year’s experiment with natural language processing and trying our hands at some different problems—and we hope you’ll come along for the ride.
