Beyond garbage in, garbage out

Technology | 04 Jun 2026

The quiet power of feature engineering in the age of generative AI

Even those who aren’t data or AI savvy are likely to have heard the adage “garbage in, garbage out,” or GIGO. The phrase captures a fundamental truth about analytical models: no matter how sophisticated they are, their output depends on the quality of the data fed into them.

Data, however, is rarely pristine. It can be incomplete, inaccurate, biased or inconsistent, riddled with mismatched naming conventions or incompatible units of measurement. Before it can serve as an input, it must therefore go through a complex and cumbersome preparation process so that reliable predictions and informed decisions can be based on it.

Let’s talk about feature engineering

Yet, while GIGO is widely understood and quoted, a far less familiar concept does the heavy lifting behind the scenes: feature engineering. It is the process that converts raw data into informative inputs for machine learning models. Despite its relative obscurity outside technical circles, its importance can hardly be overstated.

Feature engineering is about creating the right environment for a model to learn within. Machine learning models do not “understand” data in any human sense; they rely on patterns encoded in what is known as feature space. If critical relationships in the data aren’t represented in the feature space, the model can’t learn them, which will negatively impact its performance.

This is why experts often emphasise that strong features can make or break a model. So much so that a relatively simple model equipped with well-engineered features can outperform a far more complex one built on weak inputs.

Traditionally, feature engineering has been a laborious multi-step process requiring a high level of technical expertise and domain knowledge. It begins with identifying and extracting the most significant characteristics from different types of raw data, ranging from internal corporate databases to vendor feeds, open-source repositories and real-time streaming inputs.

To enrich features, new data points are created from existing data. For instance, if price and quantity are available, revenue can be calculated and added as a new feature.
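As a minimal sketch of such a derived feature, the Python snippet below adds a revenue field to a couple of made-up order records. The field names and the add_revenue helper are illustrative assumptions, not part of any particular toolkit.

```python
def add_revenue(orders):
    """Return copies of the order records with a derived 'revenue' feature."""
    enriched = []
    for order in orders:
        row = dict(order)  # copy so the raw record is left untouched
        row["revenue"] = row["price"] * row["quantity"]
        enriched.append(row)
    return enriched


# Hypothetical raw records: unit price and quantity sold.
orders = [
    {"price": 4.0, "quantity": 3},
    {"price": 2.5, "quantity": 10},
]
enriched = add_revenue(orders)
print([row["revenue"] for row in enriched])  # [12.0, 25.0]
```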

But not all useful features are available in a model-ready format. Text labels, categories and values on very different scales, for example, must be transformed into numbers that are meaningful for the algorithm in use.
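Two of the most common transformations can be sketched in a few lines of plain Python: one-hot encoding, which turns a text category into 0/1 columns, and min-max scaling, which squeezes numbers into the range 0 to 1. The column names and values here are invented for illustration, and real projects would normally rely on a data library rather than hand-written helpers.

```python
def one_hot(values):
    """Encode a list of category labels as 0/1 columns, one per category."""
    categories = sorted(set(values))
    return [[1 if value == cat else 0 for cat in categories] for value in values]


def min_max_scale(values):
    """Rescale numbers to the range [0, 1]; a constant column maps to 0.0."""
    low, high = min(values), max(values)
    span = high - low
    return [(v - low) / span if span else 0.0 for v in values]


# Hypothetical raw columns: a sales channel and a purchase amount.
channels = ["web", "store", "web", "app"]
amounts = [20.0, 50.0, 80.0, 20.0]

encoded = one_hot(channels)      # columns ordered alphabetically: app, store, web
scaled = min_max_scale(amounts)

print(encoded)  # [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
print(scaled)   # [0.0, 0.5, 1.0, 0.0]
```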

Selecting, creating and transforming features is a highly iterative workflow. Engineers must continuously test different combinations of features against the model to determine which set yields the best performance. One of the central challenges lies in distinguishing signal from noise, for example when a genuine signal gets buried in a haystack of zeros.
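The loop of trying feature combinations against a model can be sketched as an exhaustive search, assuming a toy nearest-neighbour classifier and a tiny hand-made dataset. The feature names "signal" and "noise", the data and all helper functions are hypothetical; with more than a handful of features, trying every subset becomes impractical and engineers fall back on heuristics.

```python
from itertools import combinations


def nearest_label(point, train_rows, features):
    """Predict the label of the closest training row (1-nearest neighbour)."""
    def distance(row):
        return sum((row[f] - point[f]) ** 2 for f in features)
    return min(train_rows, key=distance)["label"]


def holdout_accuracy(train_rows, valid_rows, features):
    """Share of validation rows whose label is predicted correctly."""
    hits = sum(nearest_label(r, train_rows, features) == r["label"] for r in valid_rows)
    return hits / len(valid_rows)


def best_subset(train_rows, valid_rows, candidates):
    """Try every non-empty feature subset, smallest first, and keep the best score."""
    best, best_score = None, -1.0
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            score = holdout_accuracy(train_rows, valid_rows, subset)
            if score > best_score:  # strict '>' keeps the smaller subset on ties
                best, best_score = subset, score
    return best, best_score


# Hypothetical data: 'signal' tracks the label, 'noise' does not.
train_rows = [
    {"signal": 0.0, "noise": 5.0, "label": 0},
    {"signal": 1.0, "noise": 1.0, "label": 0},
    {"signal": 9.0, "noise": 4.0, "label": 1},
    {"signal": 10.0, "noise": 0.0, "label": 1},
]
valid_rows = [
    {"signal": 0.5, "noise": 0.4, "label": 0},
    {"signal": 9.5, "noise": 4.6, "label": 1},
    {"signal": 1.5, "noise": 4.2, "label": 0},
    {"signal": 8.5, "noise": 0.8, "label": 1},
]

best, best_score = best_subset(train_rows, valid_rows, ["signal", "noise"])
print(best, best_score)  # ('signal',) 1.0
```

In this contrived case the single informative feature scores as well as both features together, so the search settles on the smaller set.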

Consider a high-dimensional dataset representing customer purchases across hundreds of products. Most customers buy only a small subset of these products, leaving the majority of entries as zeros.

These zeros will inflate the dataset, adding only noise and no real value. Effective feature engineering involves reducing such redundancy without discarding meaningful information.
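One simple way to cut this kind of redundancy is to store only the non-zero entries and to drop products that almost nobody buys. The sketch below assumes a made-up purchase matrix and a hypothetical min_buyers cut-off; whether a rarely bought product is noise or a valuable signal is exactly the kind of call that needs domain knowledge.

```python
def to_sparse(rows):
    """Keep only non-zero entries: one {product_index: quantity} dict per customer."""
    return [{j: q for j, q in enumerate(row) if q != 0} for row in rows]


def drop_rare_products(sparse_rows, min_buyers):
    """Remove products bought by fewer than min_buyers customers."""
    buyers = {}
    for row in sparse_rows:
        for product in row:
            buyers[product] = buyers.get(product, 0) + 1
    keep = {p for p, n in buyers.items() if n >= min_buyers}
    return [{p: q for p, q in row.items() if p in keep} for row in sparse_rows]


# Hypothetical purchase matrix: 4 customers x 6 products, mostly zeros.
purchases = [
    [2, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 3, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 5, 0, 0, 0],
]
sparse = to_sparse(purchases)
trimmed = drop_rare_products(sparse, min_buyers=2)

print(sparse)   # [{0: 2, 4: 1}, {4: 3}, {0: 1}, {2: 5}]
print(trimmed)  # [{0: 2, 4: 1}, {4: 3}, {0: 1}, {}]
```

Note that the last customer ends up with no features at all, which shows how easily an aggressive cut-off can discard meaningful information.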

Failing to do so can lead to one of the most persistent problems in machine learning: overfitting. In this scenario, the model learns the training data too well, capturing its noise and random fluctuations rather than the underlying patterns.

When a model is overfit, it performs excellently on training data but struggles with new data in real-world applications.
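The gap between training and real-world performance can be illustrated with a toy example. In the sketch below, a "memoriser" (a 1-nearest-neighbour model) recalls every training example, including two deliberately mislabelled ones, while a much simpler model only compares class averages. The data and function names are invented for illustration.

```python
def memoriser(train):
    """1-nearest neighbour: predicts by recalling the closest training example."""
    def predict(x):
        return min(train, key=lambda pair: abs(pair[0] - x))[1]
    return predict


def centroid_model(train):
    """Nearest class mean: a simpler model that averages the noise away."""
    means = {}
    for label in {y for _, y in train}:
        xs = [x for x, y in train if y == label]
        means[label] = sum(xs) / len(xs)

    def predict(x):
        return min(means, key=lambda label: abs(means[label] - x))
    return predict


def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)


# Hypothetical data: the true rule is "label 1 when x is above 5".
# The training set contains two mislabelled points (x=3.0 and x=7.0).
train = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 0), (6.0, 1), (7.0, 0), (8.0, 1), (9.0, 1)]
test = [(1.4, 0), (2.9, 0), (3.2, 0), (4.4, 0), (5.6, 1), (6.8, 1), (7.3, 1), (8.7, 1)]

memo, simple = memoriser(train), centroid_model(train)
print(accuracy(memo, train), accuracy(memo, test))      # 1.0 0.5
print(accuracy(simple, train), accuracy(simple, test))  # 0.75 1.0
```

The memoriser is perfect on the data it has seen and no better than a coin toss on new data, because it has learned the noise along with the pattern.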

High-dimensional data is only one of many pitfalls that lead to overfitting. The list of others a feature engineer must guard against, including limited or noisy data, overly long training, outliers and data leakage, illustrates how demanding the job is.

Moreover, telling relevant data from superfluous data also requires deep knowledge of the domain in which the model operates.

How generative AI has revolutionised feature engineering

Understanding some of the complexities of feature engineering can help users appreciate the benefits of automation more. A Stanford RelBench study found that data scientists spend an average of 12.3 hours and write 878 lines of code on feature engineering per prediction task.

For organisations running dozens or hundreds of such tasks, the cumulative burden can be immense.

Thanks to two families of generative AI models that predate ChatGPT, the celebrity of the generative strain of AI, key aspects of feature engineering have already been automated for several years. One of the areas where their transformative power becomes tangible is anomaly detection.

In many real-world applications, such as payment fraud detection, medical diagnosis, predictive maintenance and cyber-security, anomalies are rare. They often make up less than five per cent of available data, making them difficult to model effectively.
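Why this rarity is such a problem can be shown with simple arithmetic. In the sketch below, which uses invented labels, a "model" that never raises an alert still looks highly accurate while catching no anomalies at all.

```python
def accuracy(predicted, actual):
    """Share of all cases that were labelled correctly."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def recall(predicted, actual, positive=1):
    """Share of true anomalies that were actually flagged."""
    flagged = sum(p == positive and a == positive for p, a in zip(predicted, actual))
    total = sum(a == positive for a in actual)
    return flagged / total if total else 0.0


# Hypothetical labels: 3 anomalies (1) among 100 transactions, i.e. 3 per cent.
actual = [1] * 3 + [0] * 97
always_normal = [0] * 100  # a "model" that never raises an alert

print(accuracy(always_normal, actual))  # 0.97
print(recall(always_normal, actual))    # 0.0
```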

Generative adversarial networks, or GANs, a subgroup of generative AI now widely used to create deepfakes, address this scarcity by generating synthetic data that mimics real-world anomalies.

Their architecture pairs a generator, a neural network that creates fake data, with a discriminator, a second network that evaluates its authenticity. This makes them well suited to creating realistic but synthetic anomaly data.

Through iterative feedback, the generator improves its outputs until they become indistinguishable from real data. What makes GANs extremely useful is that they can also work around the scarcity of abnormal data by training solely on normal data.
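For readers who want the formal version, the feedback loop between the two networks is commonly written as a minimax objective, following the original GAN formulation. The notation is standard rather than anything defined in this article: G is the generator, D the discriminator, x a real sample and z the random noise fed to the generator.

```latex
\min_{G} \max_{D} \;
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator tries to make this quantity as large as possible by scoring real samples high and generated ones low, while the generator tries to make it as small as possible by producing samples the discriminator accepts as real.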

Once they learn the distribution of what “normal” looks like, GANs struggle to accurately reconstruct anomalous inputs. These reconstruction errors become a signal, enabling the system to flag unusual patterns without ever having seen them before.
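The idea of reconstruction error as an alarm signal can be sketched without any neural network. In the example below, a one-component principal component analysis (a fitted straight line) stands in for a trained generative model: it learns what "normal" looks like, compresses each point to a single number, reconstructs it, and measures how far off the reconstruction is. The data, the function names and the threshold rule are all assumptions made for illustration.

```python
import math


def fit_line(points):
    """Fit the main direction of the 'normal' points (a one-component PCA)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)  # direction of largest variance
    return (mx, my), (math.cos(angle), math.sin(angle))


def reconstruction_error(point, model):
    """Distance between a point and its reconstruction on the fitted line."""
    (mx, my), (dx, dy) = model
    px, py = point[0] - mx, point[1] - my
    t = px * dx + py * dy              # compress the point to a single number
    rx, ry = mx + t * dx, my + t * dy  # reconstruct it from that number
    return math.hypot(point[0] - rx, point[1] - ry)


# Hypothetical "normal" observations: two readings that usually move together.
normal = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.0), (4.0, 8.1), (5.0, 9.9)]
model = fit_line(normal)

# Assumed rule of thumb: flag anything reconstructed three times worse
# than the worst-reconstructed normal point.
threshold = 3 * max(reconstruction_error(p, model) for p in normal)


def is_anomaly(point):
    return reconstruction_error(point, model) > threshold


print(is_anomaly((2.5, 5.0)))  # False: fits the learned pattern
print(is_anomaly((3.0, 1.0)))  # True: never seen before, poorly reconstructed
```

The model was fitted on normal data only, yet it flags the unusual point because it cannot reproduce it well.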

From feature engineering to representation learning

Another transformative approach comes from variational autoencoders (VAEs), which take feature engineering a step further into what is known as representation learning.

A VAE’s two neural networks, an encoder and a decoder, and the latent space between them allow these models to learn meaningful, low-dimensional features from complex data. They excel at capturing the structure of the data and generalising to new, unseen data points.

Because they capture the relationships between individual features, they handle context much better than rule-based models.

In online fraud detection, for example, a VAE can integrate diverse signals – time, location, transaction amount, device information – and combine them with behavioural biometrics such as typing patterns, mouse movements or swipe gestures into a coherent representation.

This makes them particularly effective at identifying novel threats. When confronted with previously unseen fraud patterns or zero-day cyber-attacks, VAEs produce higher reconstruction errors, signalling anomalies that might otherwise go undetected.

In high-complexity domains such as computer vision, natural language processing and advanced recommendation systems, deep learning models such as GANs and VAEs have already begun to replace manual feature engineering.

In these areas, deep generative models are used for the workflows they handle best: combinatorial tasks requiring a trial-and-error approach.

Their success is well demonstrated by the findings of Mastercard’s 2025 payment fraud prevention report, in which 83 per cent of respondents said AI has significantly reduced false positives and customer churn rates.

They have also played an important role in improving early fault detection accuracy in predictive maintenance and diagnostic accuracy in radiology scenarios. GANs also show great potential in detecting zero-day exploits through the synthesis of malicious patterns and adversarial learning.

For most people, these advances operate quietly in the background. Yet they shape everyday experiences, from making secure online payments and receiving accurate medical diagnoses to standing a better chance of avoiding faulty products.

Perhaps, when false alerts become less frequent as we make a purchase at an unusual time, when MRI results arrive much earlier than expected, or when we notice that the algorithmic recommendations we’re offered have become more accurate, we should pause, if only briefly, to give credit to all those human feature engineers and software developers who ensure that garbage never makes it through and that the output genuinely makes human lives better. Or, at least, more hassle-free.