zz / scripts

Commit History

Add working directory note to AMD download script docstring
941c05c

lzwjava committed on

Add FineWeb download script for AMD/US environment (direct HuggingFace, no mirror)
eb61d41

lzwjava committed on

Add deepseek run_lite and fineweb extract/tokenize scripts
108bc6b

lzwjava committed on

Update the download plan script to handle .tar.gz archives (used by Llama 3 models) in addition to the previously supported .tar files. The code checks whether the target filename ends with .tar.gz and, if so, adjusts the extraction command so that tar decompresses the gzip archive (the z option). This keeps the script compatible with Llama 3 models, which ship in the .tar.gz format.
ce535a0

lzwjava committed on
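
The extension check described in commit ce535a0 can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name `build_extract_command` is hypothetical, and the exact tar flags used in the repository may differ.

```python
def build_extract_command(filename: str) -> list[str]:
    """Return a tar extraction command suited to the archive's extension.

    Hypothetical helper: the name and flag spelling are assumptions,
    not taken from the repository.
    """
    # Check the longer suffix first, since ".tar.gz" does not end with ".tar".
    if filename.endswith(".tar.gz"):
        # z tells tar to decompress gzip before extracting.
        return ["tar", "-xzf", filename]
    if filename.endswith(".tar"):
        # Plain, uncompressed tar archive.
        return ["tar", "-xf", filename]
    raise ValueError(f"unsupported archive type: {filename}")


if __name__ == "__main__":
    print(build_extract_command("model.tar.gz"))
    print(build_extract_command("model.tar"))
```

Returning the command as a list rather than a single string lets it be passed to `subprocess.run` without shell quoting concerns.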

feat(download): track shard progress in progress.json for resumability
9b45962

lzwjava and Claude Opus 4.7 (1M context) committed on

refactor(download): hardcode hf-mirror endpoint for China access
ca6fdcb

lzwjava and Claude Opus 4.7 (1M context) committed on

chore: add ruff pre-commit hooks and apply formatting
468f6c2

lzwjava and Claude Opus 4.7 (1M context) committed on

feat(download): add hardcoded 100B-token GPT-3 ablation downloader
693fe79

lzwjava and Claude Opus 4.7 (1M context) committed on

feat(download): add token-budgeted FineWeb shard planner/downloader
8af1a22

lzwjava and Claude Opus 4.7 (1M context) committed on

refactor: reorganize project structure
4f685ca

lzwjava and Claude Opus 4.6 (1M context) committed on