| Title: | Synthetic Tabular Data Generation with Gaussian Copulas |
|---|---|
| Description: | Generates synthetic tabular data from real datasets using Gaussian copula models, with parametric marginal selection for numerical columns and a cumulative-frequency embedding that brings categorical and boolean columns into the same joint copula. Includes a metadata system with column types and primary keys, declarative constraints enforced via rejection sampling, conditional sampling, and quality, validity and privacy reports modeled on those of the 'SDMetrics' library. Inspired by the Python 'SDV' (Synthetic Data Vault) library by 'DataCebo'; see Patki, Wedge and Veeramachaneni (2016) "The Synthetic Data Vault" <doi:10.1109/DSAA.2016.49>. |
| Authors: | Kailas Venkitasubramanian [aut, cre] |
| Maintainer: | Kailas Venkitasubramanian <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.2.0 |
| Built: | 2026-06-09 09:01:53 UTC |
| Source: | https://github.com/kvenkita/rsdv |
Add a constraint to metadata
add_constraint(meta, constraint)
meta |
An |
constraint |
A constraint object from |
Updated rsdv_metadata (for piping).
meta <- metadata() |>
  set_column_type("low", "numerical") |>
  set_column_type("high", "numerical") |>
  add_constraint(inequality_constraint("low", "high", type = "lt"))
A 500-row random sample of the UCI Adult Income dataset, used in package examples and vignettes.
adult_income
A tibble with 500 rows and 16 variables:
Row identifier (integer)
Age in years (integer)
Employment type (character)
Final weight, a census sampling weight (integer)
Highest level of education (character)
Education encoded as an integer (integer)
Marital status (character)
Occupation category (character)
Relationship to householder (character)
Race (character)
Sex (character)
Capital gains (integer)
Capital losses (integer)
Hours worked per week (integer)
Country of origin (character)
Income bracket: <=50K or >50K (character)
https://archive.ics.uci.edu/dataset/2/adult
Estimates the fraction of synthetic rows where a sensitive column value can be correctly inferred from known columns via a k-NN lookup in the real training data.
attribute_disclosure_risk(real, synthetic, sensitive_col, known_cols, k = 1L)
real, synthetic
|
Data frames. |
sensitive_col |
Name of the column to protect. |
known_cols |
Character vector of numeric columns assumed known to an adversary. Categorical columns are rejected with a clear error. |
k |
Number of nearest neighbors used in inference. |
known_cols must be numeric, because nearest-neighbour lookup operates on
Euclidean distance over the columns. If you want to use a categorical
column as a known attribute, one-hot encode it first (e.g. with
model.matrix(~ col - 1, data)).
A scalar in [0, 1]; lower = more private.
real <- data.frame(
  age = sample(20:60, 50, replace = TRUE),
  income = sample(c("low", "high"), 50, replace = TRUE),
  stringsAsFactors = FALSE
)
syn <- real[sample(50), ]
attribute_disclosure_risk(real, syn, sensitive_col = "income", known_cols = "age")
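The one-hot encoding step suggested for categorical known attributes can be sketched in base R as follows. The data frame and its column names (`age`, `job`) are made up for illustration; only the `model.matrix(~ col - 1, data)` idiom comes from the text.

```r
# Toy data: `job` is categorical, so it cannot be used directly in a
# Euclidean nearest-neighbour lookup.
df <- data.frame(
  age = c(25, 40, 31),
  job = c("clerk", "nurse", "clerk"),
  stringsAsFactors = FALSE
)

# One 0/1 indicator column per level; "- 1" drops the intercept so that
# no level is absorbed into a baseline.
onehot <- model.matrix(~ job - 1, data = df)

# Numeric-only frame whose columns could serve as the known attributes.
known <- cbind(df["age"], onehot)
known_cols <- colnames(known)  # "age" "jobclerk" "jobnurse"
```

The same encoding would need to be applied to both the real and the synthetic data frame so that the column sets match.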
Bar chart of per-column validity scores.
## S3 method for class 'rsdv_diagnostic_report'
autoplot(object, ...)
object |
An |
... |
Unused. |
A ggplot object.
meta <- metadata(adult_income)
syn <- gaussian_copula_synthesizer(meta) |> fit(adult_income)
synth <- sample(syn, n = 500)
ggplot2::autoplot(diagnostic_report(adult_income, synth, meta))
Plots the NNDR score as a gauge-style bar.
## S3 method for class 'rsdv_privacy_report'
autoplot(object, ...)
object |
An |
... |
Unused. |
A ggplot object.
syn <- gaussian_copula_synthesizer(metadata(adult_income)) |> fit(adult_income)
synth <- sample(syn, n = 500)
pr <- privacy_report(adult_income, synth)
ggplot2::autoplot(pr)
Produces a bar chart of per-column similarity scores, with a horizontal line at the overall score.
## S3 method for class 'rsdv_quality_report'
autoplot(object, ...)
object |
An |
... |
Unused. |
A ggplot object.
syn <- gaussian_copula_synthesizer(metadata(adult_income)) |> fit(adult_income)
synth <- sample(syn, n = 500)
qr <- quality_report(adult_income, synth, metadata(adult_income))
ggplot2::autoplot(qr)
Check a single constraint against each row of a data frame
check_constraint(data, constraint)
data |
A data frame. |
constraint |
An |
Logical vector of length nrow(data).
df <- data.frame(a = c(1, 2, 3), b = c(1, 2, 9))
check_constraint(df, equality_constraint("a", "b"))
Check all constraints in metadata against a data frame
check_constraints(data, meta)
data |
A data frame. |
meta |
An |
Logical vector of length nrow(data). TRUE = row passes all constraints.
meta <- metadata() |>
  set_column_type("x", "numerical") |>
  add_constraint(custom_constraint(function(row) row$x > 0))
check_constraints(data.frame(x = c(1, -1, 2)), meta)
For each pair of categorical columns, compares the joint (normalized
contingency) distributions of real and synthetic data via total variation
distance, scoring 1 - TVD (the SDMetrics ContingencySimilarity score).
This is the categorical analogue of correlation similarity and captures
categorical-vs-categorical dependence.
contingency_similarity(real, synthetic, meta)
real |
A data frame of real data. |
synthetic |
A data frame of synthetic data. |
meta |
An |
A list with pairs (a tibble of column_1, column_2, score) and
score (the mean over pairs). score is NA_real_ when there are fewer
than two categorical columns — there is no dependence to measure, so
propagating NA (rather than 1) avoids overstating fidelity in the
aggregated quality report.
meta <- metadata(adult_income)
syn <- gaussian_copula_synthesizer(meta) |> fit(adult_income)
synth <- sample(syn, n = 500)
contingency_similarity(adult_income, synth, meta)
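To make the 1 - TVD score for a single pair of categorical columns concrete, here is a minimal base-R sketch. `pair_contingency_score()` is a hypothetical helper written for this illustration, not a function of the package, and the two small data frames are made up.

```r
# Hypothetical helper: 1 - total variation distance between the joint
# (normalised contingency) distributions of two categorical columns.
pair_contingency_score <- function(real, synthetic, col_a, col_b) {
  # Shared level sets so both tables have the same shape.
  lev_a <- union(unique(real[[col_a]]), unique(synthetic[[col_a]]))
  lev_b <- union(unique(real[[col_b]]), unique(synthetic[[col_b]]))
  joint <- function(df) {
    tab <- table(factor(df[[col_a]], levels = lev_a),
                 factor(df[[col_b]], levels = lev_b))
    tab / sum(tab)  # counts -> proportions
  }
  tvd <- 0.5 * sum(abs(joint(real) - joint(synthetic)))
  1 - tvd
}

real <- data.frame(sex = c("F", "F", "M", "M"),
                   income = c("low", "high", "low", "high"))
syn <- data.frame(sex = c("F", "F", "M", "M"),
                  income = c("low", "low", "low", "high"))

score <- pair_contingency_score(real, syn, "sex", "income")  # 0.75
```

In the toy data one of four rows moved from the (F, high) cell to the (F, low) cell, so the TVD is 0.25 and the score is 0.75.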
For each pair of numerical columns, computes 1 - |corr_real - corr_syn| / 2
(the SDMetrics CorrelationSimilarity score), where corr is the Pearson
correlation. Returns one row per pair plus the mean.
correlation_similarity(real, synthetic, meta)
real |
A data frame of real data. |
synthetic |
A data frame of synthetic data. |
meta |
An |
A list with pairs (a tibble of column_1, column_2, score) and
score (the mean over pairs). score is NA_real_ when there are fewer
than two numerical columns — there is no dependence to measure, so
propagating NA (rather than 1) avoids overstating fidelity in the
aggregated quality report.
syn <- gaussian_copula_synthesizer(metadata(adult_income)) |> fit(adult_income)
synth_data <- sample(syn, n = 500)
correlation_similarity(adult_income, synth_data, metadata(adult_income))
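The per-pair formula 1 - |corr_real - corr_syn| / 2 can be checked by hand with a few lines of base R. `pair_correlation_score()` is a hypothetical helper for this sketch, and dropping incomplete rows via `use = "complete.obs"` is an assumption rather than documented package behaviour.

```r
# Hypothetical helper: correlation similarity for one pair of numeric columns.
pair_correlation_score <- function(real, synthetic, col_a, col_b) {
  r_real <- cor(real[[col_a]], real[[col_b]], use = "complete.obs")
  r_syn <- cor(synthetic[[col_a]], synthetic[[col_b]], use = "complete.obs")
  1 - abs(r_real - r_syn) / 2  # Pearson r lies in [-1, 1], so this lies in [0, 1]
}

real <- data.frame(x = 1:10, y = 1:10)  # perfectly positively correlated
syn <- data.frame(x = 1:10, y = 10:1)   # perfectly negatively correlated

worst <- pair_correlation_score(real, syn, "x", "y")   # correlations differ by 2
best <- pair_correlation_score(real, real, "x", "y")   # identical correlations
```

Dividing by 2 rescales the largest possible gap between two correlations (from -1 to 1) onto the unit interval, which is why `worst` comes out at 0 and `best` at 1.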
With vectorized = FALSE (default) fn is invoked once per row with a
one-row data frame and must return a single logical — easy to write but
slow on large frames. With vectorized = TRUE fn is invoked once
with the full data frame and must return a logical vector of length
nrow(data); use this when your predicate is vectorisable for substantial
speedups on large synthetic samples.
custom_constraint(fn, vectorized = FALSE)
fn |
A predicate function. If |
vectorized |
Logical. See above. Default |
An rsdv_constraint object.
custom_constraint(function(row) row$x > 0)
# Vectorised, usually much faster:
custom_constraint(function(data) data$x > 0, vectorized = TRUE)
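The difference between the two calling conventions can be illustrated without the package. `apply_predicate()` below is a hypothetical stand-in that mimics the described behaviour; it is not the package's internal code.

```r
# Hypothetical evaluator mirroring the two documented modes.
apply_predicate <- function(data, fn, vectorized = FALSE) {
  if (vectorized) {
    # One call with the full data frame; fn returns nrow(data) logicals.
    out <- fn(data)
  } else {
    # One call per row with a one-row data frame; fn returns a single logical.
    out <- vapply(
      seq_len(nrow(data)),
      function(i) fn(data[i, , drop = FALSE]),
      logical(1)
    )
  }
  stopifnot(is.logical(out), length(out) == nrow(data))
  out
}

df <- data.frame(x = c(1, -1, 2))
by_row <- apply_predicate(df, function(row) row$x > 0)
by_frame <- apply_predicate(df, function(data) data$x > 0, vectorized = TRUE)
same <- identical(by_row, by_frame)  # TRUE
```

Both modes give the same answer here; the row-wise path simply pays for one function call and one data-frame subset per row.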
Checks whether synthetic data is structurally valid against the real data
and metadata — independent of how closely it matches the real distributions
(that is the job of quality_report()). Mirrors the SDMetrics
DiagnosticReport two-property hierarchy:
diagnostic_report(real, synthetic, metadata)
real |
A data frame of real data. |
synthetic |
A data frame of synthetic data. |
metadata |
An |
Data Validity — per-column checks:
numerical: boundary adherence (fraction of values within the real min/max range),
categorical: category adherence (fraction of values whose category was seen in the real data),
boolean: always valid,
primary key: key uniqueness (all values unique and non-missing).
Data Structure — fraction of expected columns present in the synthetic data.
Missing (NA) values are excluded from adherence denominators, since
missingness is modeled separately.
An rsdv_diagnostic_report object.
meta <- metadata(adult_income)
syn <- gaussian_copula_synthesizer(meta) |> fit(adult_income)
synth <- sample(syn, n = 500)
diagnostic_report(adult_income, synth, meta)
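The two adherence checks described for numerical and categorical columns can be sketched in base R as below. Both function names are hypothetical, and the code only follows the verbal definitions given here (values within the real min/max range, categories seen in the real data, NA excluded from the denominator).

```r
# Hypothetical helper: fraction of non-missing synthetic values inside
# the [min, max] range observed in the real column.
boundary_adherence <- function(real_col, syn_col) {
  rng <- range(real_col, na.rm = TRUE)
  inside <- syn_col >= rng[1] & syn_col <= rng[2]
  mean(inside, na.rm = TRUE)  # NA comparisons are dropped from the denominator
}

# Hypothetical helper: fraction of non-missing synthetic values whose
# category also appears in the real column.
category_adherence <- function(real_col, syn_col) {
  seen <- unique(real_col[!is.na(real_col)])
  present <- syn_col[!is.na(syn_col)]
  mean(present %in% seen)
}

num_score <- boundary_adherence(c(1, 5, 10), c(0, 2, NA, 10, 11))    # 0.5
cat_score <- category_adherence(c("a", "b"), c("a", "c", NA, "b"))   # 2/3
```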
For continuous numerical columns, exact == is almost never satisfied by
the copula sampler; use the tolerance argument or
inequality_constraint() with a narrow band. With tolerance > 0, equality
is abs(a - b) <= tolerance for numeric columns and exact == otherwise.
equality_constraint(col_a, col_b, tolerance = 0)
col_a, col_b
|
Column names (character). |
tolerance |
Numeric. When non-zero, numeric columns compare with
|
An rsdv_constraint object.
equality_constraint("city", "city_copy")
equality_constraint("price_left", "price_right", tolerance = 1e-6)
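Why a tolerance matters for continuous columns is easy to see with ordinary floating-point values. The helper `rows_equal()` is hypothetical and only restates the documented rule: `abs(a - b) <= tolerance` for numeric columns when `tolerance > 0`, exact `==` otherwise.

```r
# Hypothetical helper restating the documented comparison rule.
rows_equal <- function(a, b, tolerance = 0) {
  if (is.numeric(a) && is.numeric(b) && tolerance > 0) {
    abs(a - b) <= tolerance
  } else {
    a == b
  }
}

left <- c(0.1 + 0.2, 1)
right <- c(0.3, 1)

exact <- rows_equal(left, right)                    # FALSE TRUE
loose <- rows_equal(left, right, tolerance = 1e-6)  # TRUE  TRUE
```

The first pair differs only by floating-point rounding, yet exact comparison rejects it; values drawn from a continuous sampler will collide exactly even less often.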
Constraint: only observed column combinations are valid
fixed_combinations_constraint(columns, reference_data)
columns |
Character vector of column names. |
reference_data |
Data frame containing the allowed combinations. |
An rsdv_constraint object.
ref <- data.frame(
  city = c("NY", "LA"),
  state = c("NY", "CA"),
  stringsAsFactors = FALSE
)
fixed_combinations_constraint(c("city", "state"), ref)
Fits a single Gaussian copula over all modeled columns. Numerical
columns use a fitted parametric marginal (see default_distribution);
categorical and boolean columns are embedded into the copula via their
cumulative-frequency intervals, so cross-column dependence (numeric vs.
categorical, categorical vs. categorical) is preserved.
gaussian_copula_synthesizer(
  metadata,
  enforce_min_max = TRUE,
  numerical_distributions = list(),
  default_distribution = "auto"
)
metadata |
An |
enforce_min_max |
Logical. Clamp sampled numerical values to the
observed range. Default |
numerical_distributions |
Optional named character vector/list mapping
numerical column names to a distribution in |
default_distribution |
Distribution used for numerical columns not named
in |
An unfitted gaussian_copula_synthesizer object.
meta <- metadata(adult_income) |>
  set_column_type("age", "numerical") |>
  set_column_type("occupation", "categorical")
syn <- gaussian_copula_synthesizer(meta, default_distribution = "auto")
syn <- fit(syn, adult_income)
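One common way to realise a cumulative-frequency embedding is sketched below in base R: each category owns a slice of (0, 1) proportional to its frequency, a value is drawn uniformly inside that slice, and `qnorm()` maps it to the Gaussian scale on which a copula operates. This is an assumption about the general technique; the package's exact procedure is not shown in this reference, and all variable names are illustrative.

```r
set.seed(1)
x <- c("a", "a", "b", "c", "c", "c", "a", "c")

# Relative frequencies and the interval [lower, upper) owned by each level.
tab <- table(x)
freq <- setNames(as.numeric(tab) / length(x), names(tab))
upper <- cumsum(freq)
lower <- upper - freq

# Forward: category -> uniform draw inside its interval -> standard normal.
idx <- match(x, names(freq))
u <- runif(length(x), min = lower[idx], max = upper[idx])
z <- qnorm(u)

# Inverse: standard normal -> uniform -> interval lookup -> category.
decoded <- names(freq)[findInterval(pnorm(z), lower)]
round_trip_ok <- identical(decoded, x)
```

Because the categorical column now lives on the same Gaussian scale as the transformed numerical columns, a single correlation matrix can describe dependence across both kinds of column.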
Constraint: col_a must be less than / greater than col_b
inequality_constraint(col_a, col_b, type = c("lt", "lte", "gt", "gte"))
col_a, col_b
|
Column names (character). |
type |
One of |
An rsdv_constraint object.
inequality_constraint("low", "high", type = "lt")
Check whether a synthesizer has been fitted
is_fitted(x)
x |
A synthesizer object. |
TRUE if fit() has been called; FALSE otherwise.
syn <- gaussian_copula_synthesizer(metadata())
is_fitted(syn) # FALSE before fitting
Kolmogorov-Smirnov similarity score per numerical column
ks_similarity(real, synthetic, meta)
real |
A data frame of real data. |
synthetic |
A data frame of synthetic data. |
meta |
An |
A tibble with columns column (chr) and score (dbl, 0–1, higher = better).
syn <- gaussian_copula_synthesizer(metadata(adult_income)) |> fit(adult_income)
synth <- sample(syn, n = 500)
ks_similarity(adult_income, synth, metadata(adult_income))
Load metadata from a JSON file
load_metadata(path)
path |
Path to a JSON file produced by |
An rsdv_metadata object.
meta <- metadata() |> set_column_type("age", "numerical")
tmp <- tempfile(fileext = ".json")
save_metadata(meta, tmp)
load_metadata(tmp)
Create a metadata object describing a dataset's column types
metadata(data = NULL)
data |
Optional data frame. If supplied, column types are
auto-detected. You can override them with |
An rsdv_metadata object.
meta <- metadata(adult_income) |>
  set_column_type("age", "numerical") |>
  set_column_type("occupation", "categorical")
Deserialize metadata from a JSON string
metadata_from_json(json)
json |
A JSON character string produced by |
An rsdv_metadata object. Constraints are reconstructed with their
original S3 classes so check_constraint() dispatches correctly.
meta <- metadata() |>
  set_column_type("a", "numerical") |>
  set_column_type("b", "numerical") |>
  add_constraint(inequality_constraint("a", "b", type = "lt"))
metadata_from_json(metadata_to_json(meta))
Round-trips column types, primary key, and the structural constraints
(equality, inequality, fixed_combinations). custom_constraint cannot
be serialized — it holds an R closure — and is dropped with a warning.
metadata_to_json(meta)
meta |
An |
A JSON character string. Inverse of metadata_from_json().
meta <- metadata() |>
  set_column_type("a", "numerical") |>
  set_column_type("b", "numerical") |>
  add_constraint(inequality_constraint("a", "b", type = "lt"))
json <- metadata_to_json(meta)
meta2 <- metadata_from_json(json)
Trains an rpart decision tree on synthetic data and on a real training
split, evaluates both on a real held-out test set, and returns the ratio
TSTR / TRTR. A score near 1 means synthetic data is as informative as
real data for this prediction task.
ml_efficacy(
  real,
  synthetic,
  meta,
  target_col,
  test_fraction = 0.2,
  seed = NULL
)
real |
A data frame of real data. |
synthetic |
A data frame of synthetic data. |
meta |
An |
target_col |
Name of a categorical column to use as the outcome. |
test_fraction |
Fraction of |
seed |
Optional integer seed. When supplied, the train/test split is reproducible across calls without affecting the caller's RNG stream. |
A list with elements tstr (accuracy), trtr (accuracy), and
score (ratio, capped at 1).
meta <- metadata(adult_income)
syn <- gaussian_copula_synthesizer(meta) |> fit(adult_income)
synth_data <- sample(syn, n = 500)
ml_efficacy(adult_income, synth_data, meta, target_col = "income", seed = 1)
For each synthetic row, computes the ratio of its distance to the nearest real row vs. its distance to the second-nearest real row. A high ratio (close to 1) means the synthetic row is not unusually close to any specific real row — low disclosure risk. Score = mean(ratio > 0.5).
nndr(real, synthetic, normalize = TRUE)
real, synthetic
|
Data frames; only numerical columns are used. |
normalize |
Logical. When |
By default columns are z-scored using the real-data mean and standard deviation before the Euclidean distance is computed; without this, a single large-scale column (e.g. income in dollars) dominates the distance and the score becomes a function of measurement units rather than of similarity.
A scalar score in [0, 1]; higher = more private.
real <- data.frame(x = rnorm(50), y = rnorm(50))
syn <- data.frame(x = rnorm(50), y = rnorm(50))
nndr(real, syn)
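The nearest-neighbour distance ratio described above can be reproduced on a tiny example in base R. `nndr_sketch()` is a hypothetical re-implementation for illustration only; in particular, scoring a 0/0 ratio as 0 (a synthetic row that coincides with duplicated real rows) is an assumption, and the sketch expects every real column to have a non-zero standard deviation when `normalize = TRUE`.

```r
# Hypothetical re-implementation of the described score.
nndr_sketch <- function(real, synthetic, normalize = TRUE, threshold = 0.5) {
  real <- as.matrix(real)
  synthetic <- as.matrix(synthetic)
  if (normalize) {
    # z-score both tables with the real-data mean and standard deviation.
    mu <- colMeans(real)
    sdev <- apply(real, 2, sd)
    real <- sweep(sweep(real, 2, mu, "-"), 2, sdev, "/")
    synthetic <- sweep(sweep(synthetic, 2, mu, "-"), 2, sdev, "/")
  }
  ratios <- apply(synthetic, 1, function(row) {
    # Sorted Euclidean distances from this synthetic row to every real row.
    d <- sort(sqrt(colSums((t(real) - row)^2)))
    if (d[2] == 0) 0 else d[1] / d[2]  # assumption: treat 0/0 as 0
  })
  mean(ratios > threshold)
}

real <- data.frame(x = c(0, 10), y = c(0, 0))
syn <- data.frame(x = c(0, 5), y = c(0, 0))

# Row 1 copies a real row (ratio 0); row 2 sits midway between both (ratio 1).
score <- nndr_sketch(real, syn, normalize = FALSE)  # 0.5
```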
Print method for a custom_constraint
## S3 method for class 'custom_constraint'
print(x, ...)
x |
A |
... |
Unused. |
x, invisibly.
Print method for an equality_constraint
## S3 method for class 'equality_constraint'
print(x, ...)
x |
An |
... |
Unused. |
x, invisibly.
Print method for a fixed_combinations_constraint
## S3 method for class 'fixed_combinations_constraint'
print(x, ...)
x |
A |
... |
Unused. |
x, invisibly.
Print method for an inequality_constraint
## S3 method for class 'inequality_constraint'
print(x, ...)
x |
An |
... |
Unused. |
x, invisibly.
Print method for rsdv_diagnostic_report
## S3 method for class 'rsdv_diagnostic_report'
print(x, ...)
x |
An |
... |
Unused. |
x, invisibly.
Print method for rsdv_metadata
## S3 method for class 'rsdv_metadata'
print(x, ...)
x |
An |
... |
Unused. |
x, invisibly.
print(metadata())
Print method for rsdv_privacy_report
## S3 method for class 'rsdv_privacy_report'
print(x, ...)
x |
An |
... |
Unused. |
x, invisibly.
syn <- gaussian_copula_synthesizer(metadata(adult_income)) |> fit(adult_income)
synth <- sample(syn, n = 500)
pr <- privacy_report(adult_income, synth)
print(pr)
Print method for rsdv_quality_report
## S3 method for class 'rsdv_quality_report'
print(x, ...)
x |
An |
... |
Unused. |
x, invisibly.
Generate a privacy report comparing real and synthetic data
privacy_report(real, synthetic, sensitive_col = NULL, known_cols = NULL)
real |
A data frame of real data. |
synthetic |
A data frame of synthetic data. |
sensitive_col |
Optional. Column name for attribute disclosure risk. |
known_cols |
Optional. Column names known to an adversary (required if
|
An rsdv_privacy_report object.
syn <- gaussian_copula_synthesizer(metadata(adult_income)) |> fit(adult_income)
synth <- sample(syn, n = 500)
pr <- privacy_report(adult_income, synth)
print(pr)
Aggregates metrics into the two-property hierarchy used by SDMetrics:
quality_report(real, synthetic, metadata, target_col = NULL)
real |
A data frame of real data. |
synthetic |
A data frame of synthetic data. |
metadata |
An |
target_col |
Optional. Name of a categorical column for ML efficacy. Reported alongside the score but excluded from the overall. |
Column Shapes — per-column marginal fidelity: KS similarity for numerical columns and TVD similarity for categorical columns.
Column Pair Trends — pairwise dependence: correlation similarity for numerical pairs and contingency similarity for categorical pairs.
The overall score is the mean of the two property scores, so a table with many categorical columns and few numerical ones is not weighted by raw column counts. ML efficacy, when requested, is reported separately and does not enter the overall score (matching SDMetrics).
An rsdv_quality_report object.
meta <- metadata(adult_income) |>
  set_column_type("age", "numerical") |>
  set_column_type("occupation", "categorical")
syn <- gaussian_copula_synthesizer(meta) |> fit(adult_income)
synth <- sample(syn, n = 500)
qr <- quality_report(adult_income, synth, meta)
print(qr)
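The effect of averaging the two property scores, rather than pooling every individual score, can be seen with a little arithmetic. The numbers below are invented purely for illustration and do not come from any real report.

```r
# Invented scores: six column-shape scores and a single pair-trend score.
column_shapes <- c(0.9, 0.8, 0.8, 0.8, 0.8, 0.7)
pair_trends <- c(0.6)

shapes_score <- mean(column_shapes)  # 0.8
trends_score <- mean(pair_trends)    # 0.6

# Documented rule: the overall score is the mean of the two property scores.
overall <- mean(c(shapes_score, trends_score))  # 0.7

# For contrast: pooling all seven scores lets the larger group dominate.
pooled <- mean(c(column_shapes, pair_trends))
```

Here `pooled` is about 0.77, noticeably higher than `overall`, because the six column-shape scores outnumber the single pair-trend score.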
Dispatches to the synthesizer-specific method when x is an
rsdv_synthesizer. For plain R vectors, integers, or characters it
falls back to base::sample(), preserving backward compatibility.
sample(x, n = NULL, ...)
x |
A fitted synthesizer object, or a vector for |
n |
Number of synthetic rows to generate (synthesizer path), or sample size (base::sample path). |
... |
Additional arguments passed to the method or to |
When x inherits from rsdv_synthesizer, a data frame of n
synthetic rows whose columns match the metadata. When x is any other
object, the value returned by base::sample() — typically a vector of
the same type as x and length n.
# Falls back to base::sample for non-synthesizer objects:
sample(1:10, 3)
meta <- metadata(adult_income) |>
  set_column_type("age", "numerical") |>
  set_column_type("income", "categorical")
syn <- gaussian_copula_synthesizer(meta) |> fit(adult_income)
synth <- sample(syn, n = 100)
head(synth)
Generates rows in which one or more categorical or boolean columns are held to specified values, via rejection sampling against the fitted copula. This preserves the modeled dependence between the conditioned columns and the rest of the table (unlike overwriting values after the fact).
sample_conditions(x, conditions, max_tries = 100L)
x |
A fitted |
conditions |
A data frame whose columns are the variables to fix. Each
row is one condition; an optional integer column |
max_tries |
Maximum rejection-sampling rounds per condition. |
A data frame of synthetic rows satisfying the conditions.
meta <- metadata(adult_income)
syn <- gaussian_copula_synthesizer(meta) |> fit(adult_income)
sample_conditions(syn, data.frame(income = ">50K", .n = 20))
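Rejection sampling for a fixed condition can be sketched with a toy generator standing in for the fitted copula. Everything here (`draw_rows()`, `sample_where()`, the `batch` argument, the toy income and hours model) is hypothetical; the point is only that rows are drawn jointly and filtered, so the kept rows retain the dependence between the conditioned column and the others.

```r
set.seed(42)

# Toy stand-in for a fitted model: hours depends on income.
draw_rows <- function(n) {
  income <- sample(c("<=50K", ">50K"), n, replace = TRUE, prob = c(0.75, 0.25))
  hours <- round(rnorm(n, mean = ifelse(income == ">50K", 45, 38), sd = 5))
  data.frame(income = income, hours = hours, stringsAsFactors = FALSE)
}

# Draw batches, keep rows matching the condition, stop after max_tries rounds.
sample_where <- function(n, column, value, max_tries = 100L, batch = 50L) {
  kept <- NULL
  for (i in seq_len(max_tries)) {
    cand <- draw_rows(batch)
    kept <- rbind(kept, cand[cand[[column]] == value, , drop = FALSE])
    if (nrow(kept) >= n) {
      return(kept[seq_len(n), , drop = FALSE])
    }
  }
  stop("could not satisfy the condition within max_tries rounds")
}

rich <- sample_where(20, "income", ">50K")
```

Rare conditions need more rounds, which is presumably why a cap on the number of tries is exposed as `max_tries`.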
Save metadata to a JSON file
save_metadata(meta, path)
meta |
An |
path |
File path to write to. |
Invisibly returns meta.
meta <- metadata() |> set_column_type("age", "numerical")
tmp <- tempfile(fileext = ".json")
save_metadata(meta, tmp)
meta2 <- load_metadata(tmp)
Set the type of a column in metadata
set_column_type(meta, column, type)
meta |
An |
column |
Column name (character). |
type |
One of For categorical columns the level order used by the synthesizer follows
the input: a |
The updated rsdv_metadata object (for piping).
metadata() |> set_column_type("age", "numerical")
Set the primary key column of the metadata
set_primary_key(meta, column)
meta |
An |
column |
Name of the primary key column. Must already be registered
via |
The updated rsdv_metadata object (for piping).
meta <- metadata() |>
  set_column_type("id", "id") |>
  set_primary_key("id")
Total variation distance similarity score per categorical column
tvd_similarity(real, synthetic, meta)
real |
A data frame of real data. |
synthetic |
A data frame of synthetic data. |
meta |
An |
A tibble with columns column (chr) and score (dbl, 0–1, higher = better).
syn <- gaussian_copula_synthesizer(metadata(adult_income)) |> fit(adult_income)
synth <- sample(syn, n = 500)
tvd_similarity(adult_income, synth, metadata(adult_income))
Checks that all columns registered in meta are present in data.
validate_data(data, meta)
data |
A data frame. |
meta |
An |
Invisibly TRUE; throws an error if validation fails.
meta <- metadata() |> set_column_type("age", "numerical")
validate_data(data.frame(age = 1:5), meta)