syndat.scores
Scoring functions that normalize metrics for easier comparison.
- syndat.scores.correlation(real, synthetic, method='spearman', cat_cols=None, ignore_cat=False)
Computes the Correlation Similarity Score (normalized between 0-100) of real and synthetic data. The score is calculated by comparing the correlation matrices of real and synthetic data.
- Parameters:
real (DataFrame) – The real data.
synthetic (DataFrame) – The synthetic data.
method (Literal['pearson','kendall','spearman']) – The correlation method to use. Options are ‘pearson’, ‘kendall’, or ‘spearman’.
cat_cols (Optional[List[str]]) – List of categorical column names.
ignore_cat (bool) – Whether to ignore categorical columns.
- Return type:
float
- Returns:
Correlation score / Norm Quotient
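To make the idea of comparing correlation matrices concrete, the following is a minimal sketch and not syndat's implementation. The function name `correlation_similarity_sketch` and the normalization (Frobenius norm of the matrix difference divided by the norm of the real matrix, mapped onto 0-100) are assumptions chosen for illustration. The sketch handles numeric columns only and does not model `cat_cols` or `ignore_cat`.

```python
import numpy as np
import pandas as pd


def correlation_similarity_sketch(real, synthetic, method="spearman"):
    """Hypothetical 0-100 score from the difference of two correlation matrices."""
    # Restrict to numeric columns and align the column order of both frames.
    cols = real.select_dtypes("number").columns
    corr_real = real[cols].corr(method=method).to_numpy()
    corr_syn = synthetic[cols].corr(method=method).to_numpy()

    # Assumed normalization: relative Frobenius distance, clipped to [0, 1].
    quotient = np.linalg.norm(corr_real - corr_syn) / np.linalg.norm(corr_real)
    return float(100.0 * max(0.0, 1.0 - quotient))


rng = np.random.default_rng(0)
x = rng.normal(size=500)
real = pd.DataFrame({"x": x, "y": 2.0 * x + rng.normal(scale=0.3, size=500)})

# A synthetic set that keeps the marginals but destroys the x-y dependence.
shuffled = real.copy()
shuffled["y"] = rng.permutation(shuffled["y"].to_numpy())

score_identical = correlation_similarity_sketch(real, real)
score_shuffled = correlation_similarity_sketch(real, shuffled)
print(score_identical, score_shuffled)
```

Identical data yields the maximum score under this normalization, while shuffling one column lowers it because the off-diagonal correlations no longer match.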
- syndat.scores.discrimination(real, synthetic, n_folds=5, drop_na_threshold=0.9)
Computes the Discrimination Complexity Score (normalized between 0-100), which measures how hard it is to tell real and synthetic data apart. The score is based on the performance of a Random Forest classifier trained to differentiate between real and synthetic data.
- Parameters:
real (DataFrame) – The real data.
synthetic (DataFrame) – The synthetic data.
n_folds – Number of folds for k-fold cross-validation.
drop_na_threshold – Proportion of non-missing values required of each column.
- Return type:
float
- Returns:
Discrimination Complexity Score
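The general approach of scoring data by how well a classifier separates it can be sketched as below. This is an illustration under stated assumptions, not syndat's code: the function name `discrimination_sketch`, the use of cross-validated ROC AUC as the performance measure, the mapping of AUC onto 0-100, and the way `drop_na_threshold` is applied (columns of the real data below the threshold are dropped, then rows with remaining missing values) are all hypothetical choices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def discrimination_sketch(real, synthetic, n_folds=5, drop_na_threshold=0.9):
    """Hypothetical 0-100 score: 100 means the classifier cannot separate the sets."""
    # Assumed handling: keep columns whose share of non-missing real values is high enough.
    keep = [c for c in real.columns
            if c in synthetic.columns and real[c].notna().mean() >= drop_na_threshold]
    data = pd.concat([real[keep], synthetic[keep]], ignore_index=True)
    labels = np.r_[np.ones(len(real)), np.zeros(len(synthetic))]

    # Drop rows that still contain missing values.
    mask = data.notna().all(axis=1).to_numpy()
    X, y = data.loc[mask].to_numpy(), labels[mask]

    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    auc = cross_val_score(clf, X, y, cv=n_folds, scoring="roc_auc").mean()

    # Assumed mapping: AUC 0.5 (chance level) -> 100, AUC 1.0 (perfect separation) -> 0.
    return float(100.0 * (1.0 - 2.0 * abs(auc - 0.5)))


rng = np.random.default_rng(0)
cols = ["a", "b", "c"]
real = pd.DataFrame(rng.normal(size=(200, 3)), columns=cols)
synthetic_close = pd.DataFrame(rng.normal(size=(200, 3)), columns=cols)
synthetic_far = pd.DataFrame(rng.normal(loc=10.0, size=(200, 3)), columns=cols)

score_close = discrimination_sketch(real, synthetic_close)
score_far = discrimination_sketch(real, synthetic_far)
print(score_close, score_far)
```

Synthetic data drawn from the same distribution as the real data should score high here, since the classifier performs near chance, whereas clearly shifted data is trivially separable and scores near zero.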
- syndat.scores.distribution(real, synthetic, n_unique_threshold=10)
Computes a normalized score (0-100) quantifying the feature distribution similarity of real and synthetic data using the average Jensen-Shannon distance for all features.
- Parameters:
real (DataFrame) – The real data.
synthetic (DataFrame) – The synthetic data.
n_unique_threshold – Threshold to determine at which number of unique values bins will span over several values.
- Return type:
float
- Returns:
Distribution Similarity Score
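As an illustration of averaging the Jensen-Shannon distance over features, here is a minimal sketch written with NumPy and pandas only. It is not syndat's implementation: the names `js_distance` and `distribution_sketch`, the `n_bins` parameter, the base-2 logarithm (which bounds the distance by 1), and the final mapping onto 0-100 are assumptions. The role of `n_unique_threshold` is interpreted as follows: columns with at most that many unique values get one bin per value, and other columns get shared histogram bins that each span several values.

```python
import numpy as np
import pandas as pd


def js_distance(p, q):
    """Jensen-Shannon distance (base 2, so in [0, 1]) between two count vectors."""
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        nz = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[nz] * np.log2(a[nz] / b[nz]))

    return float(np.sqrt(max(0.5 * kl(p, m) + 0.5 * kl(q, m), 0.0)))


def distribution_sketch(real, synthetic, n_unique_threshold=10, n_bins=10):
    """Hypothetical 0-100 score from the mean per-feature JS distance."""
    distances = []
    for col in real.columns:
        r, s = real[col].dropna(), synthetic[col].dropna()
        if r.nunique() <= n_unique_threshold:
            # Few unique values: one bin per observed value.
            values = sorted(set(r) | set(s))
            p = np.array([(r == v).sum() for v in values], dtype=float)
            q = np.array([(s == v).sum() for v in values], dtype=float)
        else:
            # Many unique values: shared bins that each span several values.
            both = np.concatenate([r.to_numpy(), s.to_numpy()])
            edges = np.histogram_bin_edges(both, bins=n_bins)
            p = np.histogram(r, bins=edges)[0].astype(float)
            q = np.histogram(s, bins=edges)[0].astype(float)
        distances.append(js_distance(p, q))
    return float(100.0 * (1.0 - np.mean(distances)))


rng = np.random.default_rng(0)
real = pd.DataFrame({"age": rng.normal(50, 10, 300), "group": rng.integers(0, 3, 300)})
synthetic = pd.DataFrame({"age": rng.normal(60, 10, 300), "group": rng.integers(0, 3, 300)})

score_same = distribution_sketch(real, real)
score_diff = distribution_sketch(real, synthetic)
print(score_same, score_diff)
```

Comparing a data set with itself gives a distance of zero for every feature and therefore the maximum score, while a shifted `age` distribution lowers the average.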