Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #10880 +/- ##
==========================================
+ Coverage 90.68% 90.97% +0.28%
==========================================
Files 504 509 +5
Lines 39795 41418 +1623
Branches 3141 3282 +141
==========================================
+ Hits 36087 37678 +1591
- Misses 3042 3089 +47
+ Partials 666 651 -15
|
|
Specific question for reviewers: are there any parts of the code for which existing helpers are already available in the codebase that I don't know about? (I haven't done much dev in DVC.) |
|
Note: reviewers can get a feel for the tool by running:
# 1. Initialize a Git repo
git init dvc-repo
cd dvc-repo
# 2. Initialize DVC
dvc init
# 3. Create a few 1MB junk files
for i in (seq 1 5)
head -c 1M </dev/urandom > file_$i.bin
end
# 4. Add the files to DVC
dvc add file_*.bin
# 5. Commit changes to Git
git add .
git commit -m "Initialize DVC repo with 1MB junk files"
(be sure to have the dvc version installed).
1. Preview what files would be deleted
$ dvc purge --dry-run
WARNING: This will show what local DVC-tracked outputs would be removed for the entire workspace.
(dry-run: showing what would be removed, no changes).
ERROR: No default remote configured. Cannot safely purge outputs without verifying remote backup.
Use `--force` to purge anyway.
2. Preview what files would be deleted (with
|
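Step 3 of the setup script above uses fish-shell loop syntax (`for i in (seq 1 5)` ... `end`), which bash and POSIX sh do not accept. For reviewers on those shells, a minimal equivalent sketch follows. The function name `make_junk_files` is hypothetical, and the byte count is written out in full on the assumption that some `head` implementations may not accept the `1M` suffix:

```shell
# Hypothetical helper: create five 1 MiB files of random data in the
# current directory, using loop syntax that works in bash and POSIX sh.
make_junk_files() {
    for i in 1 2 3 4 5; do
        # 1048576 bytes = 1 MiB
        head -c 1048576 </dev/urandom > "file_$i.bin"
    done
}
```

Calling `make_junk_files` in place of step 3 and then continuing with `dvc add file_*.bin` should give the same starting point.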
|
Hi, thank you for creating the pull request. I am OOO, so please give me a few days to review this (and the problem statement/issue itself). |
|
@rgoya does this fit your needs? Are there any features that you need that aren't represented? |
Thanks for setting this up @Wheest ! Description of functionality looks pretty good. Maybe one more flag for your consideration: purging cache to match workspace. In cases in which
|
I think that might already be covered by I'd say that |
Unfortunately
# Setup repo
dvc init --subdir .
dvc remote add -d --project test s3://$BUCKET/purge_test/
git add .dvc/config
# Add two files
echo foo > bar.txt
echo baz > qux.txt
dvc add bar.txt qux.txt
dvc push
git add bar.txt.dvc qux.txt.dvc
git commit -m "dvc test files" -n
# Check remote status
dvc status --cloud
# Cache and remote 'test' are in sync.
cat bar.txt.dvc | grep "md5:"
#- md5: d3b07384d113edec49eaa6238ad5ff00
cat qux.txt.dvc | grep "md5:"
#- md5: 258622b1688250cb619f3c9ccaefb7eb
find . | egrep "b07384d|bar.txt|8622b|qux.txt"
#./.dvc/cache/files/md5/d3/b07384d113edec49eaa6238ad5ff00
#./.dvc/cache/files/md5/25/8622b1688250cb619f3c9ccaefb7eb
#./qux.txt
#./bar.txt.dvc
#./bar.txt
#./qux.txt.dvc
find . | egrep "b07384d|bar.txt|8622b|qux.txt" | wc -l
# 6
# This should do nothing
dvc gc --workspace
find . | egrep "b07384d|bar.txt|8622b|qux.txt" | wc -l
# 6
# Goal: remove bar.txt in shell, purge cache of unreferenced data
# Desired behaviour: command should clear bar.txt's entry in cache, but leave qux.txt's entry
# Let's try removing bar.txt
rm bar.txt
dvc gc --workspace
find . | egrep "b07384d|bar.txt|8622b|qux.txt"
#./.dvc/cache/files/md5/d3/b07384d113edec49eaa6238ad5ff00
#./.dvc/cache/files/md5/25/8622b1688250cb619f3c9ccaefb7eb
#./qux.txt
#./bar.txt.dvc
#./qux.txt.dvc
# This doesn't remove the cache entry of bar.txt
# Let's try removing bar.txt.dvc
rm bar.txt.dvc
dvc gc --workspace
find . | egrep "b07384d|bar.txt|8622b|qux.txt"
#./.dvc/cache/files/md5/25/8622b1688250cb619f3c9ccaefb7eb
#./qux.txt
#./qux.txt.dvc
# This removes the cache entry of bar.txt
#
# Conclusion: lack of bar.txt.dvc is what is interpreted as "not in the workspace"
# What about --not-in-remote?
#
# Restore workspace
git restore bar.txt.dvc
dvc pull
# This removes cache entries of things in remote, but leaves workspace instances
dvc pull bar.txt
dvc gc --workspace --not-in-remote
find . | egrep "b07384d|bar.txt|8622b|qux.txt"
#./qux.txt
#./bar.txt.dvc
#./bar.txt
#./qux.txt.dvc |
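The `find` output above suggests that a cache entry lives at `.dvc/cache/files/md5/<first two hex characters>/<remaining characters>`. A minimal sketch for checking a single entry after each `dvc gc` call, assuming that layout holds in general (it is inferred only from the two paths shown here). `cache_path` and `cache_entry_status` are hypothetical helpers, not DVC commands:

```shell
# Hypothetical helper: map an md5 taken from a .dvc file to the cache path
# it is expected to occupy. The layout is an assumption based on the
# find output in this comment.
cache_path() {
    rest="${1#??}"          # everything after the first two characters
    prefix="${1%"$rest"}"   # the first two characters
    printf '.dvc/cache/files/md5/%s/%s\n' "$prefix" "$rest"
}

# Report whether the expected cache entry exists, relative to the repo root.
cache_entry_status() {
    path="$(cache_path "$1")"
    if [ -e "$path" ]; then
        echo "present: $path"
    else
        echo "missing: $path"
    fi
}
```

For example, running `cache_entry_status d3b07384d113edec49eaa6238ad5ff00` from the repo root after each step would show directly whether the cache entry for bar.txt survived.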
|
Thanks @rgoya, I have added this feature, an exclusive flag I made it exclusive, otherwise the table of interactions with the other arguments in |
This PR introduces a new dvc purge command to remove DVC-tracked outputs and their cache, while leaving stage metadata (.dvc files, dvc.yaml) intact. It's intended as a safer/faster alternative to manually deleting files and cache when cleaning up a workspace.
CLI
dvc purge [targets...] [--recursive] [--dry-run] [-f|--force] [-y]
targets...: optional list of specific files/directories to purge. If omitted, the entire repo is considered.
--recursive, -r: recurse into directories.
--dry-run: show what would be removed, without deleting anything.
--force, -f: bypass safety checks (dirty outputs, remote backup).
--yes, -y: skip confirmation prompt.
Behaviour
Safety Checks
Before purging, DVC performs two safety checks:
Dirty outputs – if an output has been modified in the workspace and differs from cache:
--force is used.
Remote backup – if a default remote is configured, verify that all outputs are present remotely:
--force.
--force.
--force, purge proceeds but logs a warning that data may be permanently lost.
Example
Tests
--force
--force
Fixes #10874
Docs will be added in treeverse/dvc.org#5464
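A rough sketch of the decision flow described under Safety Checks, written as a shell function. Only two behaviours are confirmed by the description and the dry-run output earlier in this thread: the purge is refused when remote backup cannot be verified, and `--force` proceeds with a warning about permanent data loss. Treating dirty outputs the same way, and the message wording, are assumptions. `purge_precheck` and its yes/no arguments are hypothetical names, not part of DVC:

```shell
# Hypothetical pre-check, not DVC's implementation.
#   $1 = yes/no: some output is dirty (modified and differs from cache)
#   $2 = yes/no: outputs are verified as present on the default remote
#   $3 = yes/no: --force was given
# Prints a decision; returns 0 if the purge may proceed, 1 if it should abort.
purge_precheck() {
    dirty="$1"; backed_up="$2"; force="$3"
    if [ "$dirty" = yes ] || [ "$backed_up" = no ]; then
        if [ "$force" = yes ]; then
            echo "WARNING: proceeding with --force; data may be permanently lost"
            return 0
        fi
        echo "ERROR: unsafe to purge; use --force to purge anyway"
        return 1
    fi
    echo "OK: safe to purge"
    return 0
}
```

In this sketch `--force` is the only override, matching the flag description (bypass safety checks); `--dry-run` and `--yes` are left out because they affect reporting and prompting rather than the safety decision.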