syndat.scores

Scoring functions that normalize metrics for easier comparison.

syndat.scores.correlation(real, synthetic, method='spearman', cat_cols=None, ignore_cat=False)

Computes the Correlation Similarity Score (normalized to the range 0-100) between real and synthetic data by comparing their correlation matrices.

Parameters:
  • real (DataFrame) – The real data.

  • synthetic (DataFrame) – The synthetic data.

  • method (Literal['pearson', 'kendall', 'spearman']) – The correlation method to use. Options are ‘pearson’, ‘kendall’, or ‘spearman’.

  • cat_cols (Optional[List[str]]) – List of categorical column names.

  • ignore_cat (bool) – Whether to ignore categorical columns.

Return type:

float

Returns:

Correlation Similarity Score (norm quotient)
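As a rough illustration of what comparing correlation matrices can look like, the sketch below builds Pearson correlation matrices in plain Python and converts the relative Frobenius norm of their difference into a 0-100 value. The helper names (`pearson`, `corr_matrix`, `frobenius`, `correlation_score_sketch`) and the normalization itself are assumptions made for this example; the norm quotient used by `syndat.scores.correlation` may be defined differently.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def corr_matrix(columns):
    """Correlation matrix for a dict mapping column name -> list of numbers."""
    names = list(columns)
    return [[pearson(columns[a], columns[b]) for b in names] for a in names]

def frobenius(matrix):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(v * v for row in matrix for v in row))

def correlation_score_sketch(real, synthetic):
    # Assumed normalization (not taken from the library):
    # 100 * (1 - ||R - S|| / (||R|| + ||S||)); the quotient lies in [0, 1]
    # by the triangle inequality.
    r, s = corr_matrix(real), corr_matrix(synthetic)
    diff = [[a - b for a, b in zip(row_r, row_s)] for row_r, row_s in zip(r, s)]
    quotient = frobenius(diff) / (frobenius(r) + frobenius(s))
    return 100.0 * (1.0 - quotient)

# Hypothetical toy data: two numeric columns.
real = {"age": [20, 30, 40, 50], "income": [1.0, 2.1, 2.9, 4.2]}
flipped = {"age": [20, 30, 40, 50], "income": [4.2, 2.9, 2.1, 1.0]}

same_score = correlation_score_sketch(real, real)        # identical matrices
flipped_score = correlation_score_sketch(real, flipped)  # correlation sign reversed
```

Identical inputs give the maximum value under this normalization, while reversing the sign of the age/income correlation lowers it considerably.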

syndat.scores.discrimination(real, synthetic, n_folds=5, drop_na_threshold=0.9)

Computes the Discrimination Complexity Score (normalized to the range 0-100), which reflects how well a Random Forest classifier can differentiate between real and synthetic data.

Parameters:
  • real (DataFrame) – The real data.

  • synthetic (DataFrame) – The synthetic data.

  • n_folds – Number of folds (k) used for cross-validation.

  • drop_na_threshold – Minimum fraction of non-missing values required of any column.

Return type:

float

Returns:

Discrimination Complexity Score
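To make the idea of scoring classifier performance concrete, the sketch below skips the Random Forest training and cross-validation entirely and starts from hypothetical held-out predictions. It computes the ROC AUC by pairwise comparison and maps it to 0-100 so that chance-level performance scores highest. Both helper names (`roc_auc`, `discrimination_score_sketch`) and the AUC-to-score mapping are assumptions for illustration, not the formula used by `syndat.scores.discrimination`.

```python
def roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

def discrimination_score_sketch(auc):
    # Assumed mapping (not taken from the library):
    # AUC 0.5 (classifier is guessing) -> 100, AUC 1.0 or 0.0 (perfect separation) -> 0.
    return 100.0 * (1.0 - 2.0 * abs(auc - 0.5))

# Hypothetical held-out predictions: 1 = real row, 0 = synthetic row.
labels = [1, 1, 1, 0, 0, 0]
separable = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]  # classifier tells the rows apart
guessing = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]   # classifier cannot tell them apart

easy_score = discrimination_score_sketch(roc_auc(labels, separable))
hard_score = discrimination_score_sketch(roc_auc(labels, guessing))
```

Under this assumed mapping, synthetic data that a classifier separates perfectly from real data scores 0, and data it cannot separate at all scores 100.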

syndat.scores.distribution(real, synthetic, n_unique_threshold=10)

Computes a normalized score (0-100) quantifying the feature distribution similarity of real and synthetic data using the average Jensen-Shannon distance for all features.

Parameters:
  • real (DataFrame) – The real data.

  • synthetic (DataFrame) – The synthetic data.

  • n_unique_threshold – Number of unique values at which a feature's bins start to span several values each.

Return type:

float

Returns:

Distribution Similarity Score
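The averaging of Jensen-Shannon distances can be sketched in plain Python as follows. The example handles only discrete columns, so the binning controlled by `n_unique_threshold` is left out. The helper names and the final normalization, 100 * (1 - mean distance), are assumptions for this example rather than the library's implementation.

```python
import math
from collections import Counter

def js_distance(p, q):
    """Jensen-Shannon distance with base-2 logarithms, so the result is in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, v):
        # Kullback-Leibler divergence; terms with zero probability contribute 0.
        return sum(a * math.log2(a / b) for a, b in zip(u, v) if a > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(0.0, divergence))

def relative_freqs(values, categories):
    """Relative frequency of each category in `values`, in the given order."""
    counts = Counter(values)
    return [counts[c] / len(values) for c in categories]

def distribution_score_sketch(real, synthetic):
    # real / synthetic: dict mapping column name -> list of discrete values.
    distances = []
    for col in real:
        categories = sorted(set(real[col]) | set(synthetic[col]))
        p = relative_freqs(real[col], categories)
        q = relative_freqs(synthetic[col], categories)
        distances.append(js_distance(p, q))
    # Assumed normalization (not taken from the library).
    return 100.0 * (1.0 - sum(distances) / len(distances))

# Hypothetical toy data with a single categorical column.
real = {"sex": ["f", "m", "f", "m"]}
disjoint = {"sex": ["x", "x", "x", "x"]}

identical_score = distribution_score_sketch(real, real)
disjoint_score = distribution_score_sketch(real, disjoint)
```

Identical columns yield a distance of 0 for every feature, and columns with no shared values yield a distance of 1, which are the two extremes of the 0-100 range under this assumption.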