Skip to content

Conversation

@sal-uva
Copy link
Collaborator

@sal-uva sal-uva commented Jan 21, 2026

Adds a dataset.finish_with_warning() function.

  • This needs the no. of processed items and a warning message. If the processed items is zero, the dataset items shows as a regular (red) failed dataset.
  • Migrate script adds a new warning bool column to the datasets table to indicate if this dataset's last status update is a warning.
  • Implemented in LLM prompter.

Can be tested by batching prompts with Gemini 2.5 Flash (it often fails).

Any processors that may use it?

@sal-uva sal-uva requested a review from dale-wahl January 21, 2026 12:35
@sal-uva sal-uva added enhancement New feature or request processors Involves self-contained analyticalprocessors. labels Jan 21, 2026
@dale-wahl
Copy link
Member

What do you think about instead of a warning column, we have a status column and status_message column? (Or, if simplicity is desired, add a status_type column.) Right now we just use heuristics to determine a dataset's status such as num_items > 0 for finished and if 'queued' in item.status|lower for queued. If we are adding a column for this, it seems like a good time for something more robust and definitive. I actually managed to fail the tokenizer processor the other day but because it counted the metadata file and showed as Finished. I was even able to run processors on it (which of course failed).

I think I could rearrange this and add that pretty easily. queued|in_process|finished|warning|error?

Otherwise this looks fine. Added the feature to the download_videos processor and tested it out. Looks good. All the downloaders could make use of it, but that one in particular is great since new downloaders will just copy already downloaded videos.

@sal-uva
Copy link
Collaborator Author

sal-uva commented Jan 23, 2026

I think that makes a lot of sense, also with an eye on the future. status_type seems the most logical to me. The migration script will be more complicated though, but I guess it's a matter of settings status_type to failed if rows == 0 and else finished.

@sal-uva
Copy link
Collaborator Author

sal-uva commented Jan 23, 2026

(we should through the processors and add this where relevant before merging)

@dale-wahl
Copy link
Member

I replaced the warning column with status_type as discussed. It is updated in two places in DataSet: update_status (optionally) and finish. I also created a StatusType class just to ensure we are consistent and no typos end up in the db. The flow: DataSet.init sets to queued, BasicProcessor.process updates to processing and back to queued on interrupt or crash. Then Dataset.finish updates final status_type with our two warning and error helpers passing their status_type. @sal-uva I left the check in finish_with_warning for num_rows=0 to switch to error, but I actually think we should no do that. If finish_with_warning is called, we probably ought to keep the warning status (if it is really an error, call finish_with_error). But that's just an opinion.

I also added a status_type called empty. error felt wrong for any dataset with zero items as things like filters and others could be empty but are not really errors. For the UI I just said either empty or error, but this would allows us more nuance if later desired (and we can still check num_row > 0 if that's actually what we want). This mimics check_dataset_finished functionality.

One oddity I found via the migrate script. I thought this would be sufficient for the remaining statues:
UPDATE datasets SET status_type = 'queued' WHERE is_finished = FALSE. What I found though was that we have a number of orphan datasets. They are unfinished but appear to not have any jobs associated to ever finish them. Not positive how they occurred, but looks common via imports (just by mine own and what I found on cat4smr).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request processors Involves self-contained analyticalprocessors.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants