What is AutoML?
AutoML (Automated Machine Learning) refers to the methods, processes, and frameworks that automate some or all steps of the machine learning pipeline. It offers off-the-shelf components and tools to optimise and accelerate the machine learning process. Before jumping into AutoML, it is helpful to understand the machine learning pipeline.
Machine Learning Pipeline (Process)
Machine learning consists of a series of steps, including but not limited to:
- Data pre-processing and cleaning: converting raw real-world data into a clean, understandable format that can be fed to the model
- Predictive modelling: using data to make predictions
- Neural network optimisation (in the case of deep learning): improving the model by optimising the network design
- Hyperparameter tuning: choosing the hyperparameters, which define the configuration of the model and affect the outcome, to maximise model performance
- Data interpretation and analytical findings: explaining the outcome of models and delivering a system of action or insight
Machine learning involves training an algorithm on features so that it can make predictions. The algorithm learns a mathematical function that maps input variables to output classes or labels. The model’s performance is evaluated by how close its predictions are to the ground truth, which a loss function summarises as a single number.
Training the model means adjusting the algorithm’s parameters so that the loss is minimised. This process often has to be repeated many times to reach a good solution, and this is where AutoML steps in.
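To make the idea of training as loss minimisation concrete, here is a minimal sketch using plain gradient descent on a one-variable linear model. The toy data, the mean squared error loss, the learning rate, and the function names (`mse_loss`, `train`) are all assumptions chosen for illustration; real models and AutoML tools use far more elaborate optimisers.

```python
# Toy data generated from y = 2x + 1 (an assumption made for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]


def mse_loss(w, b):
    """Mean squared error between the predictions w*x + b and the true values."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(learning_rate=0.05, steps=2000):
    """Fit w and b by repeatedly stepping against the gradient of the loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = train()
print(f"w={w:.3f}, b={b:.3f}, loss={mse_loss(w, b):.6f}")
```

The learning rate and the number of steps are themselves hyperparameters: they are not learned from the data, and picking them is exactly the kind of repeated trial that AutoML aims to automate.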
Automating Machine Learning
Every dataset has its own characteristics, and a given dataset may be best served by a particular combination of model and hyperparameters. Finding that combination requires iteratively evaluating performance across different models and hyperparameter settings, because a model that performs well on one dataset may perform poorly on another.
Although heuristics and principles exist for deciding on the right combination of model and hyperparameters, a data scientist still spends considerable time experimenting and repeating steps to tune them. These repetitive steps can be automated, which constitutes the core principle of AutoML.
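The repetitive loop described above can be sketched in a few lines. The example below is a simplified illustration, not how any particular AutoML product works: it builds a synthetic two-class dataset, defines two hypothetical candidate models (a k-nearest-neighbour classifier with several values of k, and a nearest-centroid classifier), scores every candidate on a held-out validation split, and keeps the best one. All names, data, and candidate values are assumptions made for the example.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable


def make_point(label):
    """Draw a 2-D point around (0, 0) for class 0 or (2, 2) for class 1."""
    centre = 2.0 * label
    return (rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)), label


data = [make_point(i % 2) for i in range(200)]
train_set, valid_set = data[:150], data[150:]


def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def knn_predict(point, k):
    """Majority vote among the k nearest training points (k is odd, so no ties)."""
    nearest = sorted(train_set, key=lambda item: sq_dist(item[0], point))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def class_centroid(label):
    """Mean position of the training points that carry the given label."""
    pts = [p for p, lab in train_set if lab == label]
    return (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))


centroids = {0: class_centroid(0), 1: class_centroid(1)}


def centroid_predict(point):
    """Predict the class whose centroid is closest to the point."""
    return min(centroids, key=lambda lab: sq_dist(centroids[lab], point))


# Search space: one model without hyperparameters, one model with several k values.
candidates = [("nearest-centroid", centroid_predict)]
for k in (1, 3, 5, 11, 21):
    candidates.append((f"knn(k={k})", lambda point, k=k: knn_predict(point, k)))


def accuracy(predict):
    """Fraction of validation points that the candidate classifies correctly."""
    hits = sum(predict(p) == label for p, label in valid_set)
    return hits / len(valid_set)


results = {name: accuracy(predict) for name, predict in candidates}
best_name = max(results, key=results.get)
print(best_name, results[best_name])
```

This sketch tries every candidate in a small, fixed list. Practical AutoML systems search much larger spaces and typically use smarter strategies than trying everything, and they report final performance on a separate test set that played no part in the selection.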
Tech companies such as Google, Amazon, and Microsoft offer their own versions of AutoML. For example, Google’s AutoML is a suite of machine learning tools that enables users with little machine learning experience to train high-performing models, including deep neural networks.
In the Python ecosystem, auto-sklearn, which is built on the popular and widely used Scikit-learn library, automatically searches for the best-performing machine learning pipeline for a dataset. Rather than trying every option exhaustively, it uses Bayesian optimisation and meta-learning to search over algorithms and hyperparameters, and it can combine the resulting models into an ensemble. Similarly, Auto-PyTorch, developed by the AutoML groups at the universities of Freiburg and Hannover on top of the popular PyTorch library (which originated at Meta), jointly optimises hyperparameters and neural network architecture.
Many other similar tools and frameworks exist for automating entire machine learning pipelines, thereby easing the task for experts.
What Does AutoML Mean for Data Scientists?
As the adoption of Artificial Intelligence and Machine Learning accelerates, the demand for efficient, fast, and accurate ML models has surged. The rapid pace of development means that reliable, state-of-the-art machine learning pipelines need to be developed around the clock.
AutoML allows data scientists to focus on more complex tasks while automation takes over the burden of repetitive experimentation. It can also improve the performance and utility of traditional machine learning pipelines.
AutoML for Non-Professionals
The growth of technology has increased the demand for machine learning experts in recent years. That demand far exceeds the number of skilled people available, and so considerable research has gone into bridging the gap between tech and non-tech people by introducing user-friendly software. This has led to the development of AutoML, which aims to make the technology usable and implementable by non-experts.
Data scientists possess the expertise to detect and resolve conflicts deep within the code infrastructure, which can be difficult for a computer program to imitate. Even so, AutoML remains a viable option for basic implementation and collaboration on tech projects, allowing more extensive use of technology to meet increasing demands.
Will AutoML Replace Data Scientists?
Data science is a broad field, which means a data scientist needs a range of skills that cannot be entirely replicated by a program or set of tools. The work requires a good understanding of the domain, and the problem must be identified and formalised before possible solutions can be explored.
Real-world data is almost always noisy and messy. It consists of inconsistent labels, missing values, misspelt words, duplicates, different units, and outliers. It must be thoroughly pre-processed and prepared before applying any mathematical operations to the data.
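As a small illustration of what such preparation can look like, the sketch below cleans a handful of made-up records that show several of the problems just listed: inconsistent labels, a missing value, a duplicate, and mixed units. The records, the field names, the label mapping, and the choice to fill the missing value with the mean are all assumptions for the example; in practice these decisions depend on the domain.

```python
raw_records = [
    {"name": "Alice", "height": "170 cm", "label": "Yes"},
    {"name": "alice ", "height": "170 cm", "label": "yes"},  # duplicate entry
    {"name": "Bob", "height": "1.80 m", "label": "NO"},       # different unit
    {"name": "Carol", "height": None, "label": "no"},         # missing value
    {"name": "Dave", "height": "165 cm", "label": "y"},       # inconsistent label
]

# Map the label spellings seen in the data onto two numeric classes.
LABEL_MAP = {"yes": 1, "y": 1, "no": 0, "n": 0}


def parse_height_cm(text):
    """Convert '170 cm' or '1.80 m' to centimetres; return None if missing."""
    if text is None:
        return None
    value, unit = text.split()
    return float(value) * (100.0 if unit == "m" else 1.0)


def clean(records):
    """Normalise names, labels, and units, drop duplicates, fill missing heights."""
    seen, cleaned = set(), []
    for rec in records:
        name = rec["name"].strip().title()
        if name in seen:  # treat a repeated name as a duplicate (a simplification)
            continue
        seen.add(name)
        cleaned.append({
            "name": name,
            "height_cm": parse_height_cm(rec["height"]),
            "label": LABEL_MAP[rec["label"].strip().lower()],
        })
    # Fill missing heights with the mean of the known ones.
    known = [r["height_cm"] for r in cleaned if r["height_cm"] is not None]
    mean_height = sum(known) / len(known)
    for r in cleaned:
        if r["height_cm"] is None:
            r["height_cm"] = mean_height
    return cleaned


cleaned = clean(raw_records)
for row in cleaned:
    print(row)
```

Each rule here encodes a judgement about the data, such as deciding that two records with the same name are the same person, and judgements of that kind are hard to automate reliably.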
The AutoML tools developed so far are largely limited to specific supervised problems such as classification and regression. They are less effective for unsupervised machine learning, in which a model must find structure, such as groupings, in unlabelled data.
The intended aim of AutoML is to assist data scientists in their work and not replace them. It is a good option for building models and allowing non-experts to contribute to the machine learning domain. But unlike data scientists, AutoML cannot define business problems or apply domain knowledge to derive valuable features from the data.
Most importantly, data scientists can draw actionable insights from data and convert data to information, which is still a difficult task for AutoML. They are well-equipped with various skillsets, allowing them to be experts in their fields. Although AutoML is an efficient and helpful tool for speeding up machine learning development, it will not be replacing data scientists any time soon.