Finetuning Mistral with deepspeed

In fine_tune_deepspeed.py, the first part of the load_training_dataset function looks like this:
```
def load_training_dataset(
    tokenizer,
    path_or_dataset: str = DEFAULT_TRAINING_DATASET,
    max_seq_len: int = 256,
    seed: int = DEFAULT_SEED,
) -> Dataset:
    logger.info(f"Loading dataset from {path_or_dataset}")
    dataset = load_dataset(path_or_dataset)
    logger.info(f"Training: found {dataset['train'].num_rows} rows")
    logger.info(f"Eval: found {dataset['test'].num_rows} rows")
```
The way this function is written, it seems I have to pass in a path to a Hugging Face dataset. Because this is in Databricks, I would like to pass in a Spark DataFrame, but `load_dataset` doesn't accept PySpark DataFrames, so I edited the line to read `dataset = Dataset.from_spark(path_or_dataset)`. That gave me the error `pyspark.errors.exceptions.base.PySparkRuntimeError: [MASTER_URL_NOT_SET] A master URL must be set in your configuration.` You also cannot pass an already created dataset object to `load_dataset()`. Should I just change the code to `dataset = path_or_dataset`? Or should I keep the code as-is and pass in a DBFS path to a dataset object?
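
To make the `dataset = path_or_dataset` option concrete, here is a rough sketch of what I have in mind. It is not the repo's code: `resolve_dataset` and `test_size` are hypothetical names, the `DEFAULT_SEED` value is a placeholder, and it assumes the rest of `load_training_dataset` only needs `train` and `test` splits, as the logging lines suggest.

```python
from typing import Any

DEFAULT_SEED = 42  # placeholder value; the repo defines its own DEFAULT_SEED


def resolve_dataset(path_or_dataset: Any, seed: int = DEFAULT_SEED, test_size: float = 0.1):
    """Hypothetical helper: return an object exposing 'train' and 'test' splits."""
    if isinstance(path_or_dataset, str):
        # Original behaviour: treat the string as a Hugging Face dataset path.
        # Imported lazily so the other branches do not need the library loaded.
        from datasets import load_dataset

        return load_dataset(path_or_dataset)

    if hasattr(path_or_dataset, "train_test_split"):
        # A single datasets.Dataset: create the 'train'/'test' splits that the
        # logging lines in load_training_dataset expect.
        return path_or_dataset.train_test_split(test_size=test_size, seed=seed)

    # Otherwise assume a DatasetDict-like mapping and check the split names.
    missing = {"train", "test"} - set(path_or_dataset.keys())
    if missing:
        raise ValueError(f"dataset is missing splits: {sorted(missing)}")
    return path_or_dataset
```

With this, the line in question would become `dataset = resolve_dataset(path_or_dataset, seed)`. The Spark DataFrame could then be converted in the notebook, where a Spark session exists, for example with `Dataset.from_pandas(spark_df.toPandas())`, assuming the data fits in driver memory. If the training script runs in a separate process, an in-memory object cannot be handed over, so writing the dataset with `save_to_disk` and reading it back with `load_from_disk` may be the only way to use the DBFS path option.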

