Graph Feature Preprocessor: Real-time Subgraph-based Feature
Extraction for Financial Crime Detection
Jovan Blanuša
IBM Research Europe
Switzerland
jov@zurich.ibm.com
Maximo Cravero Baraja
Caltech
USA
mcravero@caltech.edu
Andreea Anghel
IBM Research Europe
Switzerland
aan@zurich.ibm.com
Luc von Niederhäusern
IBM Research Europe
Switzerland
lvn@zurich.ibm.com
Erik Altman
IBM Watson Research
USA
ealtman@us.ibm.com
Haris Pozidis
IBM Research Europe
Switzerland
hap@zurich.ibm.com
Kubilay Atasu
TU Delft
Netherlands
kubilay.atasu@tudelft.nl
Abstract
In this paper, we present
Graph Feature Preprocessor
, a software
library for detecting typical money laundering patterns in financial
transaction graphs in real time. These patterns are used to produce
a rich set of transaction features for downstream machine learning
training and inference tasks such as detection of fraudulent finan-
cial transactions. We show that our enriched transaction features
dramatically improve the prediction accuracy of gradient-boosting-
based machine learning models. Our library exploits multicore
parallelism, maintains a dynamic in-memory graph, and efficiently
mines subgraph patterns in the incoming transaction stream, which
enables it to be operated in a streaming manner. Our solution, which
combines our Graph Feature Preprocessor and gradient-boosting-
based machine learning models, can detect illicit transactions with
higher minority-class F1 scores than standard graph neural net-
works in anti-money laundering and phishing datasets. In addition,
the end-to-end throughput rate of our solution executed on a multi-
core CPU outperforms the graph neural network baselines executed
on a powerful V100 GPU. Overall, the combination of high accuracy,
a high throughput rate, and low latency of our solution demon-
strates the practical value of our library in real-world applications.
ACM Reference Format:
Jovan Blanuša, Maximo Cravero Baraja, Andreea Anghel, Luc von Nieder-
häusern, Erik Altman, Haris Pozidis, and Kubilay Atasu. 2024. Graph Feature
Preprocessor: Real-time Subgraph-based Feature Extraction for Financial
Crime Detection. In
5th ACM International Conference on AI in Finance
(ICAIF ’24), November 14–17, 2024, Brooklyn, NY, USA.
ACM, New York, NY,
USA, 9 pages. https://doi.org/10.1145/3677052.3698674
1 Introduction
Financial transactions serve as records documenting the movement
of financial funds between accounts. Typically, these transactions
are captured in a tabular format, where each row represents a dis-
tinct financial transaction, and columns represent basic transaction
features such as timestamp, source account, target account, amount
This work is licensed under a Creative Commons Attribution-Share Alike
International 4.0 License.
ICAIF ’24, November 14–17, 2024, Brooklyn, NY, USA
©
2024 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-1081-0/24/11
https://doi.org/10.1145/3677052.3698674
(c)
Smurfing
$$$$
$
$
$
$
$$$$
$
$
$
$
(a) Circular money laundering
MAY 3,
2023
MAY 26,
2023
JUN 2,
2023
$
$
$
(b)
Pump and dump
JUN 1, 2023
JUN 8, 2023
$$
DUMP
$
$
$
$
PUMP
$
$$
$$
Figure 1: Crime patterns in financial transaction graphs.
transferred, currency, and payment type [
1
]. While this tabular rep-
resentation offers a structured view of the data, a more insightful
approach emerges when financial transactions are represented as
graphs by treating transactions as edges and accounts as vertices
of a graph, as illustrated in Figure 1. Such a graph representation
enables analysts to uncover insights that may not be immediately
apparent in tabular formats. As a result, financial transaction graphs
facilitate the efficient analysis and interpretation of complex finan-
cial data, aiding in the detection of financial crime [19, 49].
Subgraph patterns in financial transaction graphs can often serve
as indicators of financial crime. A
simple cycle
[
48
], depicted in Fig-
ure 1a, is one such pattern and represents a sequence of transactions
that transfer funds from one bank account back to the same account.
Such a cycle can be an indicator of financial crimes such as money
laundering, tax avoidance [
28
,
69
], credit card frauds [
49
,
56
], or
circular trading used for stock price manipulation [
34
,
36
,
52
]. In ad-
dition, a
gather-scatter
pattern, illustrated in Figure 1b, can suggest
a
pump and dump
stock manipulation scheme [
49
]. In this scheme,
the stock price of a company is artificially increased through the
use of social media to attract other traders for investment. After
This work was performed when Maximo Cravero Baraja and Kubilay Atasu were with
IBM Research Europe, Zurich, Switzerland.
222
ICAIF ’24, November 14–17, 2024, Brooklyn, NY, USA
Blanuša et al.
Gradient
-
boosting
-
based ML model
Batch of financial transactions with
graph
-
based
features
Graph Feature Preprocessor
Payment type
Currency
Amount
Dest. account
Src. account
Timestamp
ID
Cheque
USD
1250
B
A
7 JUN 23, 12:45
100
Wire
EUR
34
C
B
7 JUN 23, 17:12
101
Credit card
CHF
648
C
D
8 JUN 23, 8:22
102
Graph
-
based features
Payment type
Currency
Amount
Dest. account
Src. account
Timestamp
ID
Cheque
USD
1250
B
A
7 JUN 23, 12:45
100
Wire
EUR
34
C
B
7 JUN 23, 17:12
101
Credit card
CHF
648
C
D
8 JUN 23, 8:22
102
Batch of financial transactions
Suspicious transactions
Figure 2: The overview of our graph ML pipeline for the
detection of suspicious financial transactions.
the stock price rises sufficiently, malicious traders sell the stocks.
Due to the artificially inflated stock price, its value drops, and other
traders suffer financial losses. Furthermore, a
scatter-gather
pat-
tern, depicted in Figure 1c, can represent a money laundering tactic
called
smurfing
[
19
,
39
,
42
,
43
,
62
,
67
], in which a malicious actor
employs several intermediary accounts (blue nodes in Figure 1c) to
integrate small sums of illicit funds into the legal banking system.
Similarly, in cryptocurrency transaction networks, criminals use
sophisticated mixing and shuffling schemes to obfuscate the trace
of their activities [
44
]. Such schemes can usually be represented by
subgraph structures [
15
,
64
,
73
,
76
]. The discovery of such suspi-
cious subgraph patterns may enable locating and stopping criminal
activities and their perpetrators.
Rapid detection and processing of suspicious financial trans-
actions are important to avoid financial losses. As financial data
is often represented in a tabular format [
1
], the fastest and most
accurate machine learning models [
27
] for this input format are
gradient-boosting-based models [
14
,
38
]. However, these models
cannot take into account the underlying graph structure and can-
not discover graph patterns that could be associated with financial
crime. Furthermore, a limited set of basic features associated with
financial transactions (see Figure 2) does not provide sufficient infor-
mation to gradient-boosting-based models for detecting suspicious
transactions with sufficient accuracy. As a result, the detection of
suspicious transactions using these methods poses a challenge.
To overcome the aforementioned limitations, we propose a solu-
tion shown in Figure 2. Specifically, we develop the
Graph Feature
Preprocessor
(GFP) library to produce a rich set of graph-based
features for financial transactions. Our library searches for typi-
cal financial crime patterns, such as money laundering cycles and
scatter-gather patterns (see Figure 1), and encodes these graph pat-
terns into additional columns (i.e., features) of the transaction table.
The transaction table enriched with the graph-based features is
then forwarded to a pre-trained gradient-boosting-based machine
learning model that performs the classification of financial transac-
tions and detects suspicious transactions. As a result, the machine
learning model is provided with additional transaction features
extracted from the financial transaction graph, which facilitates the
detection of transactions associated with financial crime.
fit
transform
Dynamic graph management
update graph
create new graph
Graph pattern mining
partial_fit
fan
-
in/out
scatter
-
gather
cycle
vertex
statistics
in
-
memory graph
Graph Feature Preprocessor
Input
t
ransactions
with
basic
features
T
ransactions
with
basic
and
graph
-
based
features
Figure 3: Our Graph Feature Preprocessor is offered as a
scikit-learn preprocessor with
fit
and
transform
methods.
Our contributions can be summarised as follows:
•
We present a graph-based feature extraction library called
Graph Feature Preprocessor for enriching the feature set of
edges in financial transaction graphs by enumerating suspi-
cious subgraph patterns in graphs as well as by computing
various statistical properties of graph vertices. We then use
this library to develop a graph machine learning (graph ML)
pipeline for monitoring financial transaction networks. Sec-
tion 2 introduces this library.
•
We conduct experiments that demonstrate an improvement of
up to
36%
in the minority-class F1 score compared to graph
neural network (GNN) baselines [
11
,
20
,
31
] for money laun-
dering detection tasks. In addition, we demonstrate that our
graph ML pipeline executed using 32 cores of an Intel Xeon
processor achieves higher throughput rates compared to those
GNN baselines executed on an NVIDIA Tesla V100 GPU. Our
experimental evaluation is presented in Section 4.
The GFP library is publicly available on PyPI as part of Snap
ML [
59
–
61
]. In addition, it is offered with IBM
1
mainframe software
products
Cloud Pak for Data on Z
[
33
] and
AI Toolkit for IBM Z and
LinuxONE
[
32
]. Furthermore, an
AI on IBM Z Anti-Money Launder-
ing Solution Template
[
63
], which demonstrates how to develop and
deploy a graph ML pipeline with GFP using an IBM Z environment,
is open-sourced and publicly available
2
.
2 Graph Feature Preprocessor
An overview of the Graph Feature Preprocessor (GFP) is given in
Figure 3. It operates in a streaming fashion, receiving as input a
batch of transactions with only basic features, such as in Figure 2,
and producing additional graph-based features as output. GFP stores
past financial transactions in an in-memory graph, which is dynam-
ically updated as new transactions are received. The graph-based
features are computed by enumerating subgraph patterns in the
graph and by generating statistical properties of the accounts stored
in that graph. GFP can compute the graph-based features across
several CPU cores in parallel, which, together with the dynamic
graph representation, enables real-time feature extraction.
We have implemented GFP as a scikit-learn preprocessor with the
fit/transform
interface [
66
] and made it publicly available on PyPI
as part of the Snap ML package [
59
–
61
]. The main functionality of
GFP is implemented by the
transform
function, which is illustrated
1
IBM, the IBM logo, and IBM Cloud Pak are trademarks or registered trademarks
of International Business Machines Corporation, in the United States and/or other
countries.
2
https://github.com/ambitus/aionz-st-anti-money-laundering
223
Graph Feature Preprocessor: Real-time Subgraph-based Feature Extraction for Financial Crime Detection
ICAIF ’24, November 14–17, 2024, Brooklyn, NY, USA
in Figure 3. This function inserts a batch of input transactions into
the in-memory graph and computes graph-based features for these
transactions. Creating the initial in-memory graph is performed
by providing some past transactions as an input to the
fit
function.
The existing in-memory graph can be updated without computing
any graph features by using the
partial_fit
function. Other stan-
dard preprocessor functions supported by GFP are described in the
publicly available documentation [
60
]. In the rest of this section,
we describe the dynamic graph management and graph pattern
mining components of GFP (see Figure 3), and we describe how the
graph-based features produced by the library are encoded.
2.1 Dynamic Graph Management
The dynamic graph management component in GFP uses an in-
memory graph to represent the financial transaction network. In
this scenario, each account is treated as a graph vertex, and each
transaction represents an edge from its source account to its destina-
tion account. As financial transactions typically include a
timestamp
indicating when a transaction was created (see Figure 2), financial
transaction graphs are considered
temporal graphs
[
30
]. Further-
more, financial transaction graphs are also
multigraphs
[
3
], as there
can be several
parallel edges
, i.e., edges that connect the same pair
of source and destination vertices. Hence, our in-memory graph
must be capable of representing temporal multigraphs.
To enable the seamless processing of transactions in a streaming
fashion, our in-memory graph must support the insertion of new
transactions and the removal of outdated transactions. We define
new transactions as those with timestamps greater than the times-
tamp of any transaction currently in the in-memory graph. Out-
dated transactions are identified as those with timestamps smaller
than a value
푡
now
−
훿
, where
푡
now
represents the largest times-
tamp among the transactions in the in-memory graph and
훿
de-
notes a user-defined time window. Consequently, the in-memory
graph retains only transactions that fall within the time window
[
푡
now
−
훿
:
푡
now
]
, effectively constraining its memory usage.
Our in-memory graph comprises two main data structures: a
transaction log
and an
index
. The transaction log, implemented as
a double-ended queue, maintains a list of edges sorted in ascend-
ing order of their timestamps. This data structure facilitates the
detection and removal of outdated edges by supporting an
푂
(
1
)
operation for removing the edge with the smallest timestamp. The
index data structure employs an
adjacency list
representation to
enable fast access to the neighbours of a vertex [
18
]. Implemented
as a vector of hash maps [
58
], each entry in the vector represents a
vertex
푣
, and the hash map associated with that vertex
푣
signifies
the adjacency list of
푣
. Vertices are internally mapped to integers
in the range of
0
,
1
, . . .,푛
−
1
, where
푛
is the number of vertices in
the graph. These integers are used to access the adjacency list of a
vertex
푣
in this vector. Furthermore, each edge can be accessed in
푂
(
1
)
time using the index, facilitating traversal through the graph,
as required by the graph pattern mining component.
To support the maintenance of parallel edges in the index, each
entry in an adjacency list of the vertex
푣
, representing a neighbour
푢
of the vertex
푣
, also contains a list of edges connecting
푣
with
푢
, referred to as the
parallel edge list
. The edges in this list, also
implemented as a double-ended queue, are represented with their
Algorithm 1:
ScatterGatherStream