Framework to Evaluate Artificial Intelligence Systems

Yicong · Published in CodeX · Apr 9, 2021

(Header image: ODI, "Time to update the DAC evaluation criteria")

Introduction

Artificial Intelligence (AI) is a field of computer science in which algorithms are implemented to train machines to mimic human abilities such as object recognition, speech recognition, and language translation.

Due to this complexity, designing, training, and deploying AI solutions often requires deep technical expertise.

However, the rise of open-source software has made AI technologies highly accessible to the public. Now, anyone can download, train, and deploy advanced AI algorithms with relative ease.

Without an assessment framework, this makes it difficult for buyers to evaluate an Original Equipment Manufacturer's (OEM) ability to deliver a performant AI solution.

Aim

This article aims to provide readers with a framework to evaluate AI systems.

Guiding Principle

This framework draws on best practices adopted across various industries, so it is not domain-specific and can be applied generically.

Overview of an AI System

Building an AI system is different from building a traditional computer program, where the solution is largely rule-based and does not improve itself over time.

A comprehensive AI system comprises the following stages to learn and improve itself over time: Data Preparation, AI Modeling, Training & Evaluation, and Deployment.

Omitting any of these stages will inhibit the AI system's ability to learn and improve with new data points.

The following describes the various stages in detail:

(1) Data Preparation. This is the process where data is collected and processed into the right format to train and improve the AI model. It is arguably the most important stage in the AI workflow, as the performance of the AI system is highly dependent on the quality of the data available for training. To ensure data quality, this process is usually automated and layered with validation checks to minimise bad data entries, as sketched below.
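Here is a minimal sketch of such validation checks on a tabular dataset, using pandas; the column names and accepted values are hypothetical:

```python
# A minimal sketch of automated validation checks during data preparation.
# The columns "age" and "label" and their valid ranges are hypothetical.
import pandas as pd

def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows that fail basic quality checks before training."""
    df = df.dropna(subset=["age", "label"])    # reject missing values
    df = df[df["age"].between(0, 120)]         # reject out-of-range entries
    df = df[df["label"].isin(["cat", "dog"])]  # reject unknown labels
    return df.drop_duplicates()                # reject duplicate records

raw = pd.DataFrame({"age": [25, -3, None, 40],
                    "label": ["cat", "dog", "cat", "bird"]})
clean = validate(raw)  # only the first row passes every check
print(clean)
```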

(2) AI Modeling. This is the process of designing an AI model that can effectively learn from the collected data and make intelligent decisions and predictions. The model architecture greatly influences the type of training data it accepts (e.g. image, text, and audio) and the information it outputs (e.g. categorical output, numerical output, and bounding boxes). Since the architecture is largely immutable after deployment, it is important for the OEM to select the right AI model for the intended application; otherwise, the system will underperform even when ample data is available.
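As a toy illustration of how the output type drives the model choice, the sketch below pairs a categorical target with a classifier and a numerical target with a regressor; the data is a placeholder:

```python
# A hedged sketch: the intended output determines the model family.
# The feature and target values below are illustrative placeholders.
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

X = [[0.1, 1.2], [0.4, 0.8], [0.9, 0.3], [0.7, 0.5]]

# Categorical output (e.g. an object class) -> classification model
clf = RandomForestClassifier(n_estimators=50).fit(X, ["cat", "dog", "dog", "cat"])

# Numerical output (e.g. a price estimate) -> regression model
reg = RandomForestRegressor(n_estimators=50).fit(X, [1.0, 2.5, 3.1, 0.7])

print(clf.predict([[0.2, 1.0]]), reg.predict([[0.2, 1.0]]))
```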

(3) Training & Evaluation. This is the process of training the AI model to enhance its capability over time. The process is broadly similar to how humans learn: we receive a reward for doing things right and a penalty for doing things wrong. Adopting the same methodology, a metric (e.g. accuracy, precision, recall, or F1) must be selected to evaluate the AI model's performance during training so that the model can improve itself efficiently. As each metric has its pros and cons, it is important for the OEM to assess the intended application and select the right metrics to evaluate the model's performance.
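For instance, these metrics can be computed side by side with scikit-learn; the labels below are placeholders:

```python
# A minimal sketch of comparing evaluation metrics on held-out predictions.
# The ground-truth labels and predictions are illustrative placeholders.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # model predictions

# Each metric penalises a different kind of error, so the right choice
# depends on the application (e.g. recall for safety-critical detection).
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
```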

(4) Deployment. This is the process of implementing the trained AI model in a production environment. Production usually brings additional constraints, such as limited data storage, processing speed, and cooling, which may affect the model's expected performance. The OEM often needs to further optimise the AI model to ensure that it can function in the new environment and still meet the initial user requirements.
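As one small, assumed example of such a constraint, the sketch below persists a toy scikit-learn model with compression and measures its on-disk footprint:

```python
# A hedged sketch of preparing a trained model for a storage-constrained
# production environment; the model and training data are placeholders.
import os
import joblib
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit([[0], [1], [2], [3]], [0, 0, 1, 1])

joblib.dump(model, "model.joblib", compress=3)  # compress to save storage
size_kb = os.path.getsize("model.joblib") / 1024
print(f"serialised model size: {size_kb:.1f} KB")

restored = joblib.load("model.joblib")          # load in the target environment
print(restored.predict([[1.5]]))
```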

(Image: The MathWorks, Inc. The four steps that engineers should consider for a complete, AI-driven workflow.)

Evaluating AI Systems

In principle, a capable AI system is designed to be versatile, sustainable, performant, and explainable so that it can provide a seamless and intuitive experience to the user. This set of design principles can be applied to evaluate the overall capability of an AI system.

Versatility

Versatility refers to the AI system's ability to work with varied data types (e.g. images, numbers, and text) and sizes (e.g. image resolution and text length). This allows the AI system to be deployed across multiple domains, such as object detection, speech recognition, and sentiment analysis, thereby increasing its overall value and performance.

Sustainability

Sustainability refers to the AI system's ability to support continuous development, training, and deployment over a prolonged period of time. The system should be designed to be self-service to minimise downtime due to upgrades and maintenance, hence improving its overall cost performance.

Performance

Performance refers to the AI system's ability to process data, and to train and deploy the AI model, efficiently while producing credible results for users. It is largely dependent on the following factors: data processing algorithms, AI model architecture, and computational power. The OEM has to optimise these factors and use relevant metrics (e.g. accuracy, precision, recall, and F1) to validate the results, thereby improving the system's efficiency and the credibility of its results.
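One simple, illustrative way to quantify part of this is to benchmark inference throughput; the model and input batch below are placeholders:

```python
# A minimal sketch of benchmarking inference speed, one factor an OEM
# would optimise; the model and inputs are illustrative placeholders.
import time
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit([[0], [1], [2], [3]], [0, 0, 1, 1])
batch = [[0.5]] * 1000  # a batch of 1,000 samples

start = time.perf_counter()
model.predict(batch)
elapsed = time.perf_counter() - start
print(f"throughput: {len(batch) / elapsed:.0f} predictions/second")
```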

Explainability

Explainability refers to the AI system's ability to provide information that helps users understand and interpret its predicted results. This allows them to monitor, debug, and improve the system's performance efficiently, and increases their confidence in using the predicted results for decision-making.
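One common explainability technique (among many) is permutation importance: shuffle one feature at a time and measure how much the model's score drops. Below is a minimal sketch on a synthetic dataset:

```python
# A hedged sketch of permutation importance with scikit-learn.
# The synthetic dataset stands in for real application data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for i, importance in enumerate(result.importances_mean):
    print(f"feature_{i}: {importance:.3f}")  # larger = more influential
```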

Conclusion

The framework presented here is by no means complete or exhaustive and remains open to feedback. As AI technologies evolve, the framework will need to be updated to meet future contexts and needs, especially in areas related to ethics and privacy.

I hope this article gave you a better understanding of AI systems and offered some insight into evaluating them.
