Improving Anti-Money Laundering Models with Synthetic Data

As readers of this blog are well aware, an effective anti-money laundering (AML) regime is crucial for fighting grand corruption, as well as other organized criminal activity. A key part of the AML system is the requirement that banks and other financial institutions identify suspicious transactions and file so-called suspicious activity reports (SARs) with the appropriate government agencies. This is an enormous task, given the volume of financial transactions that banks need to monitor and the challenge of identifying which of those transactions ought to be considered suspicious. Banks spend billions on AML compliance every year, and have developed complex automated systems to assist them in flagging suspect transactions, but existing systems’ ability to efficiently sort suspicious from innocent transactions is limited by the sheer complexity of the task. (False positive rates with current systems, for example, frequently top 90%.)

Many believe that artificial intelligence (AI) systems, such as those employing machine learning (ML), hold enormous promise for improving AML compliance and reducing cost. ML algorithms scrutinize vast datasets to identify patterns that can be used to fashion predictive models. In the AML context, ML algorithms identify those transaction characteristics (or complex combinations of transaction characteristics) that are associated with money laundering, and use these patterns to more efficiently and effectively identify suspicious transactions.

But some commentators have suggested reasons for skepticism, or at least caution. For example, Mayze Teitler recently wrote on this blog about a number of challenges to operationalizing AI-derived algorithms in the AML context, primarily those arising from limitations in the data on which those algorithms are based. As Mayze correctly pointed out, ML algorithms require vast datasets from which to learn, and the data demands are compounded by the relative rarity of known money laundering cases in the existing datasets.

Despite these concerns, I am more bullish than Mayze regarding the promise of AI-based AML systems. Many of the challenges and concerns regarding the development of effective AI systems in the AML context can be overcome through the use of synthetic data.

Synthetic data is artificially created data based on, but distinct from, real data. Despite having “no one-to-one correspondence with” the data on which it is based, synthetic data retains the statistical properties of the original data such that analyses performed on the synthetic data yield results substantially similar to those performed on the original (real-world) data. Synthetic data is created by algorithms that first compute the intricate correlations between data fields in the real dataset, and then fashion artificial data that, though different from the real data, maintain those same correlations. As you might expect, these algorithms can be enormously complex. Further complicating matters, data synthesis often requires both a data-generating algorithm (a “generator”) and a purpose-built data-validating algorithm (a “discriminator”).
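
To make the idea of "different records, same correlations" concrete, here is a minimal sketch in Python. It is not a generator/discriminator system of the kind described above. It swaps in a far simpler technique: fit the means, spreads, and correlation of two fields, then draw fresh values from a bell-curve model with those same properties. The two fields (a log transaction amount and a related activity score) and all of the numbers are hypothetical, chosen only for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def fit_and_sample(xs, ys, n_samples, rng):
    """Fit means, standard deviations, and correlation; then draw new pairs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    rho = pearson(xs, ys)
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    synthetic = []
    for _ in range(n_samples):
        # Two independent standard normal draws, mixed so that the
        # resulting pair has (approximately) the fitted correlation rho.
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        y = my + sy * (rho * z1 + residual * z2)
        synthetic.append((x, y))
    return synthetic

rng = random.Random(0)

# Hypothetical "real" data (assumption): two correlated transaction fields.
real_x = [rng.gauss(5.0, 1.0) for _ in range(2000)]
real_y = [0.8 * x + rng.gauss(0, 0.6) for x in real_x]

synthetic = fit_and_sample(real_x, real_y, 2000, rng)
syn_x = [p[0] for p in synthetic]
syn_y = [p[1] for p in synthetic]

real_corr = pearson(real_x, real_y)
syn_corr = pearson(syn_x, syn_y)
print(f"correlation in real data:      {real_corr:.2f}")
print(f"correlation in synthetic data: {syn_corr:.2f}")
```

None of the synthetic pairs is copied from a real record, yet the two correlation figures should come out close to one another. Real transaction data has many more fields and far messier relationships than a bell curve can capture, which is why production systems rely on the more complex methods described above.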

Synthetic data can substantially improve the datasets on which AI-based AML systems are built in three ways: anonymization, expansion, and alteration.

First, synthetic data enables financial institutions to pool their transaction data without concern for privacy. Banks are constrained in sharing real customer data by government regulations and internal privacy policies. Simply removing customers’ names from a dataset is insufficient to ensure privacy because data can often be re-identified using ML. Entirely synthetic datasets avoid this problem because the data elements do not bear a one-to-one relationship with any real person. This level of privacy enables banks to share data, and such sharing of transaction data would give banks a larger and more comprehensive dataset on which to train their algorithms. Anonymizing the data would also make it easier for banks to partner with outside firms to develop new and better methods for detecting money laundering, because the anonymity of the data obviates the need for lengthy contractual agreements to protect privacy, cybersecurity audits, and the like.

Second, besides enabling data pooling, synthetic data can vastly increase the size of the datasets that individual banks use to train ML algorithms, and synthetic datasets can be deliberately designed to include a disproportionate number of “positive events” (in this context, transactions that involve money laundering). For these reasons, the use of synthetic data can substantially mitigate several of the key problems that Mayze and others have identified. Synthetic data cannot be concocted from thin air, of course, so it cannot be created where there is extremely limited real data on which to base it. But when there are a sufficient number of positive events, synthetic data algorithms can fortify training data by interpolating between the known positive events, a process sometimes described as vectorization and closely related to oversampling techniques such as SMOTE.
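
The interpolation step can be illustrated with a short Python sketch. It is a deliberately simplified version of the idea: each synthetic positive is placed at a random point on the line between two randomly chosen known positives. Established oversampling methods such as SMOTE are more selective and interpolate only between near neighbors. The function name, the two fields, and the sample transactions are all hypothetical.

```python
import random

def interpolate_positives(positives, n_new, rng):
    """Create n_new synthetic points, each lying on the line segment
    between two distinct, randomly chosen real positive events."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(positives, 2)
        t = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic

# Hypothetical known laundering transactions (assumption):
# (amount in dollars, number of transfers in the prior 24 hours)
known_positives = [
    (9500.0, 4.0),
    (9800.0, 6.0),
    (9900.0, 5.0),
    (9700.0, 7.0),
]

rng = random.Random(42)
new_positives = interpolate_positives(known_positives, 20, rng)

print(f"{len(known_positives)} known positives expanded with "
      f"{len(new_positives)} synthetic ones")
```

Every synthetic point falls within the range spanned by the real positives, which shows both the strength and the limit of the approach: it thickens the regions where laundering has already been observed, but it cannot reveal laundering patterns that are absent from the real data.
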

Third, synthetic data can improve AML AI algorithms by tweaking the composition of the datasets on which those algorithms train in order to customize the resulting AML system. For example, if an ML model was found to under-detect suspicious transactions with specific characteristics, the model could be retrained with a synthetic dataset infused with transactions that have those characteristics and are marked as involving money laundering. In a similar way, regulators could also use synthetic datasets to test the efficacy of banks’ AML models, by seeing how well those models detect the suspicious activity with which regulators are particularly concerned when given synthetic data created by the regulators.
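
A regulator's test of this kind might be sketched in Python as follows. The sketch measures what share of the known-suspicious transactions in a labelled synthetic test set a model actually flags. Everything here is hypothetical: `toy_bank_model` is a crude rule standing in for a bank's real ML model, and the test transactions are invented for illustration.

```python
def detection_rate(model, labelled_transactions):
    """Share of known-suspicious transactions that the model flags."""
    suspicious = [tx for tx, is_laundering in labelled_transactions if is_laundering]
    if not suspicious:
        raise ValueError("test set contains no suspicious transactions")
    flagged = sum(1 for tx in suspicious if model(tx))
    return flagged / len(suspicious)

def toy_bank_model(tx):
    """Hypothetical stand-in for a bank's model: flags large amounts
    that arrive alongside several transfers in a single day."""
    return tx["amount"] >= 9000 and tx["transfers_24h"] >= 3

# Hypothetical synthetic test set built by a regulator (assumption).
# Each entry pairs a transaction with a label: True means laundering.
synthetic_test_set = [
    ({"amount": 9500, "transfers_24h": 5}, True),
    ({"amount": 9800, "transfers_24h": 4}, True),
    ({"amount": 4000, "transfers_24h": 12}, True),  # small amounts, many transfers
    ({"amount": 3500, "transfers_24h": 15}, True),
    ({"amount": 120, "transfers_24h": 1}, False),
    ({"amount": 60, "transfers_24h": 2}, False),
]

rate = detection_rate(toy_bank_model, synthetic_test_set)
print(f"Detection rate on the synthetic test set: {rate:.0%}")
```

In this toy case the model catches the two large transactions but misses the two that move smaller amounts through many transfers, for a detection rate of 50%. That is the kind of blind spot that could then be addressed by retraining on a synthetic dataset enriched with the missed pattern.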

All that said, synthetic data has its limitations. For one thing, synthetic data is only useful to the extent that it is representative of the underlying data. Current methods of synthesizing data are not perfect, and, as a result, the results of analyses performed on synthetic data sometimes differ from those of identical analyses performed on the underlying real dataset. Nevertheless, synthetic data, though no cure-all, promises to address some of the most significant concerns about the expanded use of ML models in the AML context.

GAB | The Global Anticorruption Blog

Law, Social Science, and Policy
