Bachelor thesis demo

Thesis title: Generative methods applicable for data anonymization and test data creation in the banking industry


This notebook is part of a bachelor thesis at FIT, CTU in Prague. It contains a demonstration implementation of a CTGAN model trained on a production dataset provided by Komerční Banka. The dataset is used in practice as a training dataset for a classification model that predicts a binary flag (GOOD_BAD).

The notebook walks through preprocessing and analysis of the dataset. The preprocessed data are then used to train a CTGAN synthesizer, from which a fully synthetic dataset is sampled. The synthetic dataset is finally evaluated against the original dataset used to train the synthesizer.

Initial imports:

Config

Contains configuration for this notebook.

It can be used to configure paths, set different split ratios or skip certain parts of the demonstration.


Flags:

Data paths:

Training & ratios:

Other:

CTGAN hyperparams:

The default hyperparameters are used for now.

Imports:

Disabling warnings

Helper functions

Data load

For demonstration purposes, the data are loaded from a .csv file defined in the config (instead of through an ODBC driver).
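A minimal sketch of the load step; the path and the separator are placeholders, since the real values come from the config:

```python
import pandas as pd

DATA_PATH = "data/train.csv"  # placeholder; the real path comes from the config

# Decimal commas are handled later in preprocessing, so the affected
# columns arrive here as plain object (string) columns.
df = pd.read_csv(DATA_PATH, sep=";")  # the separator is an assumption
```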

Data down-sampling

Only simple selective down-sampling is implemented. Although the CTGAN model handles class imbalance on its own, the sampling allows experiments with different target variable ratios and makes it possible to run the notebook faster when necessary. The sampling ratios are defined in the config.
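A sketch of the idea, with the per-class fractions standing in for the ratios from the config:

```python
import pandas as pd

GOOD_FRAC, BAD_FRAC = 0.5, 1.0  # placeholders for the configured sampling ratios

# Keep a configured fraction of each target class, then shuffle.
parts = [
    df[df["GOOD_BAD"] == 0].sample(frac=GOOD_FRAC, random_state=42),
    df[df["GOOD_BAD"] == 1].sample(frac=BAD_FRAC, random_state=42),
]
df = pd.concat(parts).sample(frac=1, random_state=42).reset_index(drop=True)
```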

Data preprocessing

This part of the notebook explores and preprocesses the dataset so that it can be used to train a CTGAN synthesizer.

Dropping the ID column:

The column is a unique identifier of a row and has no value for the generator. Synthetic rows can later be assigned any IDs based on the task they are meant for.
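For example (the column name `ID` is an assumption):

```python
df = df.drop(columns=["ID"])
```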

Converting date column

There is one date column in the dataset. It needs to be converted into an integer so that the synthesizer can learn its distribution. Although there are many ways to represent dates as integers, a good approach here is to simply normalize the dates to the range 0 to n, where n is the difference in days between the first and the last date in the dataset. This representation has no structural gaps, so the column behaves just like any other integer.

Helper class for date conversion:
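A minimal sketch of such a converter; the thesis implementation may differ in details:

```python
import pandas as pd

class DateConverter:
    """Encodes dates as day offsets from the earliest date and decodes them back."""

    def fit(self, dates: pd.Series) -> "DateConverter":
        self.origin = pd.to_datetime(dates).min()
        return self

    def encode(self, dates: pd.Series) -> pd.Series:
        # 0 for the earliest date, n for the latest one
        return (pd.to_datetime(dates) - self.origin).dt.days

    def decode(self, days: pd.Series) -> pd.Series:
        return self.origin + pd.to_timedelta(days, unit="D")
```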

Conversion within the dataset:

A new integer column is added in place of the date column:

Converting categorical columns:

Finding a threshold for categorical columns:

Marking the columns as categorical / continuous:
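A sketch of both steps, with a placeholder value standing in for the threshold found above:

```python
CATEGORICAL_THRESHOLD = 20  # placeholder for the threshold found above

categorical_cols = [
    col for col in df.columns
    if df[col].nunique(dropna=True) <= CATEGORICAL_THRESHOLD
]
continuous_cols = [col for col in df.columns if col not in categorical_cols]

df[categorical_cols] = df[categorical_cols].astype("category")
```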

Incorrect formats in numeric columns

Some columns could not be converted automatically because they are incorrectly formatted.

All columns which contain a comma:

It can be seen that all object columns represent continuous numeric values, but they were not converted automatically because they use ',' instead of '.' as the decimal separator (most of these columns contain a ',' in every value; the rest hold whole numbers). Apart from that, a few categorical columns use the same notation.

In any case, all of them can be converted to float.

Conversion to the correct data types:
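A sketch of the conversion; the comma-containing columns are detected first, then cast:

```python
comma_cols = [
    col for col in df.select_dtypes(include="object").columns
    if df[col].str.contains(",", na=False).any()
]

for col in comma_cols:
    # Replace the decimal comma with a dot, then cast to float.
    df[col] = df[col].str.replace(",", ".", regex=False).astype(float)
```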

Null values

There is no need to impute the null values for the generation itself, as the CTGAN model can work with null values without any trouble. That said, there are constraints among the columns which have to be respected in order to generate completely realistic data.

Some of the null-value ratios appear more than once.

The function below compares the null values in a template column to those in every other column, telling us whether the null values only appear together.
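A sketch of such a check:

```python
import pandas as pd

def nulls_appear_together(df: pd.DataFrame, template_col: str, other_col: str) -> bool:
    """True when `other_col` is null exactly on the rows where `template_col` is null."""
    return df[template_col].isna().equals(df[other_col].isna())
```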

Comparison of each column pair:

Null values frequently appear together across multiple columns. This has to be reflected in the synthetic dataset, so additional flags and constraints must be defined for the CTGAN synthesizer.

Although there are 100+ matches, there is a pattern in the output. After dividing the columns into several groups, this pattern can be characterized by simple rules:

and

A column:

B column:

C column:

D column:

E column:

F column:

G column:

X is used as a value for all other columns.


Definition of the groups:

A check that all A columns and all D columns are correlated:

Other columns (which contain null values) do not necessarily need to be constrained with respect to the null values.

First, we need to define a new column that indicates which group the columns belong to.

Definition of the constraints:
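Since the constraints are later enforced by reject-sampling, a plain row-level validity check is enough for a sketch; the group and column names below are placeholders:

```python
import pandas as pd

NULL_GROUPS = {
    "A": ["col_a1", "col_a2"],  # placeholder column names
    "D": ["col_d1", "col_d2"],
}

def respects_null_groups(data: pd.DataFrame) -> pd.Series:
    """Boolean mask of rows where each group's columns are all null or all non-null."""
    mask = pd.Series(True, index=data.index)
    for cols in NULL_GROUPS.values():
        nulls = data[cols].isna()
        mask &= nulls.all(axis=1) | ~nulls.any(axis=1)
    return mask
```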

Data after the preprocessing

Final look at the data after the preprocessing is complete.

Model training

The model selected for the implementation is CTGAN from the Synthetic Data Vault framework (https://sdv.dev/).


Field transformers can be defined for each column to specify the conversion needed for that column.

(docs: https://sdv.dev/SDV/api_reference/tabular/api/sdv.tabular.ctgan.CTGAN.html#sdv.tabular.ctgan.CTGAN)

Definition of the field transformers:
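A sketch using the legacy `sdv.tabular` API linked above; the column names are placeholders:

```python
from sdv.tabular import CTGAN

field_transformers = {
    "DATE_AS_INT": "integer",  # the converted date column
    "SOME_AMOUNT": "float",    # placeholder for a continuous column
}

model = CTGAN(field_transformers=field_transformers)
```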

Data split

A split of the real dataset into a train set and a test set:
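A sketch using scikit-learn, with a placeholder test size instead of the configured ratio:

```python
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["GOOD_BAD"], random_state=42
)
```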

Training

Training of the CTGAN synthesizer:
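With the model instantiated above, the training itself reduces to a single call:

```python
model.fit(train_df)
```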

Data synthesis & handling constraints:

The code below samples a new synthetic dataset. The constraints are enforced by a reject-sampling method.

Reject-sampling

All rows which violate the constraints are dropped.

We want to end up with the same number of synthetic rows as real ones at the end of the process.
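A sketch of such a loop, reusing the `respects_null_groups` check defined earlier:

```python
import pandas as pd

target_rows = len(train_df)
parts, n_valid = [], 0
while n_valid < target_rows:
    batch = model.sample(target_rows - n_valid)
    batch = batch[respects_null_groups(batch)]  # drop rows violating the constraints
    parts.append(batch)
    n_valid += len(batch)
synthetic_df = pd.concat(parts, ignore_index=True)
```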

A check that all null-value constraints hold:

Converting data types of the synthetic data

The synthetic data might have been loaded from a CSV file and thus have ambiguous data types.

Removing unnecessary columns

The columns which are not useful for any part of the evaluation are removed from all datasets.

Evaluation

Three evaluation metrics are implemented.

Visual evaluation using table-evaluator library

(library docs: https://baukebrenninkmeijer.github.io/table-evaluator/)

Helper functions:

Real / synthetic datasets definition:

Changing the categorical type of all categorical columns back to float so that they can be compared by the library:

TableEvaluator initialization:

Visualizations:
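A sketch of the typical table-evaluator usage; passing the categorical columns explicitly is an assumption:

```python
from table_evaluator import TableEvaluator

te = TableEvaluator(real_df, synthetic_df, cat_cols=categorical_cols)
te.visual_evaluation()  # mean/std, distributions, correlations, PCA plots
```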

Mean and standard deviation:

Data distributions:

Comparison of marginal distributions.

Helper functions:


NOTE: The labels cannot be customized, but the table-evaluator library is the easiest way to compare the datasets. As a result, all dataset comparisons use the "Fake" and "Real" labels, even when they compare something else.


Real versus synthetic dataset:

Real dataset.

GOOD_BAD = 0 (plotted as Real) versus GOOD_BAD = 1 (plotted as Fake):

Synthetic dataset.

GOOD_BAD = 0 (plotted as Real) versus GOOD_BAD = 1 (plotted as Fake):

GOOD_BAD = 0 dataset.

Real versus synthetic dataset:

GOOD_BAD = 1 dataset.

Real versus synthetic dataset:

SDV evaluation

Changing the data types back:

Tests used for the evaluation:

Wrappers:

Complete statistical evaluation:

Different statistical tests results:

Aggregated score:
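A sketch using the legacy `sdv.evaluation` module that accompanies the `sdv.tabular` models used above:

```python
from sdv.evaluation import evaluate

# Per-test results (CSTest for categorical, KSTest for continuous columns)
details = evaluate(synthetic_df, test_df, metrics=["CSTest", "KSTest"], aggregate=False)

# Single aggregated score in [0, 1]
score = evaluate(synthetic_df, test_df, metrics=["CSTest", "KSTest"])
```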

ML efficacy evaluation

The implemented version of ML efficacy evaluation uses a simple decision tree as a benchmark.

All null values are imputed with a placeholder value.
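A sketch of the benchmark, assuming ROC AUC as the score and a placeholder imputation value; all feature columns are assumed to be numeric at this point:

```python
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

PLACEHOLDER = -999  # assumed imputation value for nulls

def ml_efficacy(train, test, target="GOOD_BAD"):
    """Fit a decision tree on `train` and score it on the real `test` set."""
    clf = DecisionTreeClassifier(random_state=42)
    clf.fit(train.drop(columns=[target]).fillna(PLACEHOLDER), train[target])
    proba = clf.predict_proba(test.drop(columns=[target]).fillna(PLACEHOLDER))[:, 1]
    return roc_auc_score(test[target], proba)
```

The score of a tree trained on the synthetic data can then be compared with the score of a tree trained on the real data, both evaluated on the same real test set.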

Get synthetic train set:

Compute the score:

Final conversion to the original format

The columns modified in the preprocessing need to be converted back.

Since the original dataset is very clean, only the date column has to be converted back to obtain the original dataset format. Although many float values originally used ',' instead of '.', they are not converted back in this part, because the values are meant to represent real numbers anyway.

Date column

The DateConverter object defined in the preprocessing part of this notebook is used to decode the date column back to its original format.
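For example, with the converter sketched earlier (the column names are placeholders):

```python
# `date_converter` is assumed to be the instance fitted during preprocessing.
synthetic_df["DATE"] = date_converter.decode(synthetic_df["DATE_AS_INT"])
```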

Final look at the data:

Saving generated data