Description
In `fine_tune_deepspeed.py`, the first part of the `load_training_dataset` function looks like this:
def load_training_dataset(
    tokenizer,
    path_or_dataset: str = DEFAULT_TRAINING_DATASET,
    max_seq_len: int = 256,
    seed: int = DEFAULT_SEED,
) -> Dataset:
    logger.info(f"Loading dataset from {path_or_dataset}")
    dataset = load_dataset(path_or_dataset)
    logger.info(f"Training: found {dataset['train'].num_rows} rows")
    logger.info(f"Eval: found {dataset['test'].num_rows} rows")
The way this function is written, it seems like I have to pass in a path to a Hugging Face dataset. Because this is running in Databricks, I would like to pass in a Spark DataFrame, but `load_dataset` doesn't accept PySpark DataFrames, so I edited the line to read `dataset = Dataset.from_spark(path_or_dataset)`. That gave me the error `pyspark.errors.exceptions.base.PySparkRuntimeError: [MASTER_URL_NOT_SET] A master URL must be set in your configuration.` You also cannot pass an already created Dataset object to `load_dataset()`. Should I just change the code to `dataset = path_or_dataset`? Or should I keep the code as-is and pass in a DBFS path to a saved dataset object?
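For reference, here is roughly what my edited version of the function looks like (a minimal sketch: the logger, DEFAULT_SEED, and the rest of the function body are as in fine_tune_deepspeed.py, and passing a pyspark.sql.DataFrame in place of the string path is just my assumption about how the function would then be called):

```python
from datasets import Dataset


def load_training_dataset(
    tokenizer,
    path_or_dataset=None,  # intended to be a pyspark.sql.DataFrame rather than a str path
    max_seq_len: int = 256,
    seed: int = DEFAULT_SEED,
) -> Dataset:
    logger.info("Loading dataset from a Spark DataFrame")
    # My attempted change: build a Hugging Face Dataset directly from the Spark DataFrame.
    # This is the line that raises PySparkRuntimeError: [MASTER_URL_NOT_SET] when the
    # script is launched via deepspeed, presumably because no SparkSession is available
    # in the worker process that executes it.
    dataset = Dataset.from_spark(path_or_dataset)
    ...
```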