
Generate data

Generates a dataset from a schema, a few example datapoints and a list of requirements. The model uses the examples as a guide and produces a dataset that matches the specified schema.

Arguments


  • schema_definition
    dict

    A dictionary which specifies the output dataset's schema. It must have the following format:
    {
    "<col_1_name>": {"type": "<col_1_type>"},
    "<col_2_name>": {"type": "<col_2_type>"},
    ...
    "<col_n_name>": {"type": "<col_n_type>"}
    }
    where the possible values of <col_n_type> are integer, float and string.
  • examples
    list[dict]

    A list of dictionaries, each one a sample datapoint that helps the data generation model understand what the output data should look like (3 to 5 are enough). Each dictionary must follow the schema specified in the schema_definition parameter; otherwise an exception is raised.
  • requirements
    list[str]

    A list of strings, where each string specifies a requirement or constraint for the job. Pass an empty list if there are no specific requirements.
  • output_path
    str

    A string which specifies the path where the output dataset will be written. A file name is optional: if none is provided, one is added automatically. If a file name is provided and its extension is consistent with the output_type parameter, the path is used exactly as given. Otherwise, the extension is replaced with one that is consistent with output_type.
  • number_of_samples
    int

    An integer which specifies the number of datapoints that the model should generate. The maximum number of datapoints you can generate with a single job depends on whether you are on a free or paid plan.
  • output_type
    str

    A string which specifies the format of the output dataset. Only "csv" (meaning a .csv file will be generated) is supported at this time, but we will soon add more options.
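
The output_path rules described above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the library's own code: the helper name resolve_output_path and the default file name are hypothetical, and the sketch assumes that a path with no extension is meant as a directory.

```python
from pathlib import Path

# Hypothetical default: the file name Synthex actually adds is not stated here.
DEFAULT_FILE_NAME = "synthex_output"


def resolve_output_path(output_path: str, output_type: str = "csv") -> Path:
    """Mimic the documented output_path rules (illustrative sketch only)."""
    path = Path(output_path)
    expected_suffix = f".{output_type}"

    if path.suffix == "":
        # Assumption: no extension means no file name was given,
        # so a default file name is appended to the directory.
        return path / f"{DEFAULT_FILE_NAME}{expected_suffix}"

    if path.suffix.lower() == expected_suffix:
        # The extension is consistent with output_type: use the path as is.
        return path

    # The extension is inconsistent with output_type: replace it.
    return path.with_suffix(expected_suffix)


print(resolve_output_path("output_data/output.csv"))
print(resolve_output_path("output_data/output.json"))
print(resolve_output_path("output_data"))
```

Under these assumptions, the first call keeps the path unchanged, the second swaps the extension for .csv, and the third appends the placeholder file name to the directory.
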
from synthex import Synthex

client = Synthex()

client.jobs.generate_data(
    schema_definition = {
        "surface": {"type": "float"},
        "number_of_rooms": {"type": "integer"},
        "construction_year": {"type": "integer"},
        "city": {"type": "string"},
        "market_price": {"type": "float"}
    },
    examples = [
        {
            "surface": 104.00,
            "number_of_rooms": 3,
            "construction_year": 1985,
            "city": "Nashville",
            "market_price": 218000.00
        },
        {
            "surface": 98.00,
            "number_of_rooms": 2,
            "construction_year": 1999,
            "city": "Springfield",
            "market_price": 177000.00
        },
        {
            "surface": 52.00,
            "number_of_rooms": 1,
            "construction_year": 2014,
            "city": "Denver",
            "market_price": 230000.00
        }
    ],
    requirements = [
        "The 'market price' field should be realistic and should depend on the characteristics of the property.",
        "The 'city' field should specify cities in the USA, and the USA only"
    ],
    output_path = "output_data/output.csv",
    number_of_samples = 100,
    output_type = "csv"
)
{
    "success": true,
    "message": "Job started successfully. Output will be saved to 'output_data/output.csv' upon completion."
}
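
Since an exception is raised when an example does not follow schema_definition, it can be convenient to check the examples locally before starting a job. The helper below is a minimal sketch and is not part of Synthex: the name find_schema_violations is hypothetical, and the type rules are assumptions (in particular, it accepts whole numbers in float columns, which the library's own validation may or may not do).

```python
from typing import Any, Callable

# Assumed mapping from the documented type names to Python checks.
# bool is excluded explicitly because it is a subclass of int in Python.
_TYPE_CHECKS: dict[str, Callable[[Any], bool]] = {
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "float": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "string": lambda v: isinstance(v, str),
}


def find_schema_violations(
    schema_definition: dict[str, dict[str, str]],
    examples: list[dict[str, Any]],
) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems: list[str] = []

    # Every column must declare one of the three documented types.
    for column, spec in schema_definition.items():
        if spec.get("type") not in _TYPE_CHECKS:
            problems.append(
                f"column {column!r}: unsupported type {spec.get('type')!r}"
            )

    for index, example in enumerate(examples):
        # Each example must have exactly the columns named in the schema.
        missing = sorted(schema_definition.keys() - example.keys())
        extra = sorted(example.keys() - schema_definition.keys())
        if missing:
            problems.append(f"example {index}: missing columns {missing}")
        if extra:
            problems.append(f"example {index}: unexpected columns {extra}")

        # Values of known columns must match the declared type.
        for column, value in example.items():
            spec = schema_definition.get(column)
            check = _TYPE_CHECKS.get(spec.get("type")) if spec else None
            if check is not None and not check(value):
                problems.append(
                    f"example {index}: column {column!r} expects "
                    f"{spec['type']}, got {type(value).__name__}"
                )

    return problems


schema = {"city": {"type": "string"}, "number_of_rooms": {"type": "integer"}}
for problem in find_schema_violations(
    schema, [{"city": "Denver", "number_of_rooms": "two"}]
):
    print(problem)
```

Running the check before client.jobs.generate_data lets you fix a mistyped or missing field without waiting for the job to reject it.
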