Collecting Clean Images for Data Augmentation in ML

danito

Augmentation amplifies whatever you feed it, flaws included. Flip and crop a duplicate-ridden, inconsistently sized set and you have simply multiplied its problems. The real groundwork for data augmentation therefore happens before any rotation or color jitter: assembling a clean, deduplicated, uniformly sized base set. A browser pipeline like Bulk Image Downloader From URL List covers the collecting and the preprocessing without a line of glue code.

Sourcing images for data augmentation across many pages

Training material rarely sits on one page. Put your candidate source URLs in a file, load the list, and set a max-URL cap plus a request delay so you stay a polite client. Deep Scan pulls lazy-loaded and infinite-scroll images that a simple grabber skips, and Stack Mode merges results across paginated galleries into one set. Export the full results to CSV before you download anything so you have a provenance manifest tying every sample back to its origin.
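
If you want to pre-clean the URL list or keep your own copy of the manifest outside the browser, the idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the comment syntax in the list file, and the manifest columns (image_url, source_page, label) are made up for the example and are not the tool's export format.

```python
import csv
import io

def prepare_url_list(lines, max_urls):
    """Drop blanks, '#' comments, and exact repeats, then apply a cap."""
    seen = set()
    urls = []
    for line in lines:
        url = line.strip()
        if not url or url.startswith("#") or url in seen:
            continue
        seen.add(url)
        urls.append(url)
        if len(urls) >= max_urls:
            break
    return urls

def write_manifest(rows, out):
    """Write one provenance row per image (hypothetical column names)."""
    writer = csv.DictWriter(out, fieldnames=["image_url", "source_page", "label"])
    writer.writeheader()
    writer.writerows(rows)

# Example: a small list with a blank line, a comment, and a repeat.
raw_lines = [
    "https://example.org/gallery?page=1\n",
    "\n",
    "# second source\n",
    "https://example.org/gallery?page=1\n",
    "https://example.org/gallery?page=2\n",
    "https://example.org/gallery?page=3\n",
]
pages = prepare_url_list(raw_lines, max_urls=2)

manifest = io.StringIO()
write_manifest(
    [{"image_url": "https://example.org/img/001.jpg",
      "source_page": pages[0],
      "label": "cat"}],
    manifest,
)
```

In a real run you would read the lines from your list file and write the manifest to disk next to your experiment notes. If you ever script the fetching yourself, a pause between requests plays the same role as the request delay setting.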

Filter for label quality up front

Garbage samples poison a class. Use the scraper filters to enforce a baseline before anything downloads:

  • Set a minimum dimension so you drop thumbnails and icons that carry no signal.
  • Filter by file type to keep raster photos and exclude vector or placeholder art.
  • Use domain include/exclude and text-in-URL search to scope a class to trusted sources.
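
The three filters above amount to a simple predicate over each candidate image. Here is a rough sketch of that logic, assuming each candidate already has a URL and known pixel dimensions; the Candidate class, the extension set, and the 224-pixel default are illustrative choices, not values taken from the tool.

```python
import posixpath
from dataclasses import dataclass
from urllib.parse import urlparse

@dataclass
class Candidate:
    url: str
    width: int
    height: int

# Raster formats to keep; vector art such as .svg is deliberately absent.
RASTER_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}

def keep(c, min_side=224, allowed_domains=None, url_must_contain=None):
    """Return True if the candidate passes every configured filter."""
    parsed = urlparse(c.url)
    ext = posixpath.splitext(parsed.path)[1].lower()
    if min(c.width, c.height) < min_side:
        return False  # thumbnail or icon
    if ext not in RASTER_EXTENSIONS:
        return False  # vector or unknown file type
    if allowed_domains is not None and parsed.hostname not in allowed_domains:
        return False  # outside the trusted sources
    if url_must_contain and url_must_contain not in c.url:
        return False  # URL text does not match the class scope
    return True

candidates = [
    Candidate("https://example.org/cats/tabby.jpg", 800, 600),
    Candidate("https://example.org/cats/icon.png", 64, 64),
    Candidate("https://example.org/cats/logo.svg", 800, 600),
    Candidate("https://other.example.com/cats/tabby.jpg", 800, 600),
]
kept = [c for c in candidates
        if keep(c, allowed_domains={"example.org"}, url_must_contain="/cats/")]
```

Only the first candidate survives here: the second is too small, the third is vector art, and the fourth comes from a domain outside the allow list.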

Dedupe so augmentation does not double-count

Near-duplicates that land on both sides of a train/validation split leak information and inflate your validation accuracy. URL deduplication strips exact repeats; the perceptual duplicate finder goes further, catching visual twins across different URLs using hashes, histograms, texture, and shape signals. Tune sensitivity toward Aggressive when you want a strict, diverse set, review the groups, and remove redundant samples from the task. For a trustworthy benchmark, this is often the highest-leverage step in the whole collection pass. If you maintain separate splits, run the finder across all task lists at once so a sample that lands in training cannot reappear, slightly altered, in your validation set. Save the per-signal weights you settle on so future collection rounds dedupe identically.
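
To make the idea concrete, here is a minimal sketch of one perceptual signal, a plain average hash compared by Hamming distance. It is not the tool's finder, which combines several signals; it assumes each image has already been reduced to a small grayscale grid (8x8 here), and the distance threshold and the split helper are hypothetical. Assigning whole duplicate groups to one split is one simple way to keep visual twins from straddling train and validation.

```python
import hashlib

def average_hash(gray):
    """gray: rows of 0-255 values from an already downscaled image."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def group_near_duplicates(hashes, max_distance=5):
    """Union-find grouping of names whose hashes are within max_distance."""
    parent = {name: name for name in hashes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    names = sorted(hashes)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if hamming(hashes[a], hashes[b]) <= max_distance:
                parent[find(b)] = find(a)
    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())

def split_for_group(group, val_percent=20):
    """Send a whole group to one split, chosen deterministically."""
    key = min(group).encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "val" if bucket < val_percent else "train"

# Toy 8x8 "images": a gradient, a slightly brighter copy, and its inverse.
base = [[(r * 8 + c) * 4 for c in range(8)] for r in range(8)]
brighter = [[min(255, p + 3) for p in row] for row in base]
inverted = [[252 - p for p in row] for row in base]

hashes = {
    "a.jpg": average_hash(base),
    "a_copy.jpg": average_hash(brighter),
    "b.jpg": average_hash(inverted),
}
groups = group_near_duplicates(hashes)
splits = {name: split_for_group(g) for g in groups for name in g}
```

The brighter copy hashes identically to the original because an average hash only records which pixels sit above the image's own mean, so the two files fall into one group and share a split. A lower max_distance behaves like a more conservative setting, a higher one like a more aggressive one.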

Normalize size and format on download

Most models expect a fixed input size. Turn on Canvas-based resizing with exact dimensions or a fit mode to standardize the base set, set quality deliberately, and convert everything to JPEG or PNG for a uniform pipeline. Note that AVIF is not part of on-download conversion. Strip EXIF so that orientation tags and camera metadata cannot make different loaders decode the same file differently, and spot-check that portrait photos still appear upright once the tag is gone. All of it runs locally in Web Workers, so nothing about your dataset is uploaded to a third party.
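
The difference between exact dimensions and a fit mode is just arithmetic on the scale factor. The sketch below shows that arithmetic; the mode names "exact", "contain", and "cover" are borrowed from common image-tool vocabulary and are assumptions, not the tool's own labels.

```python
def fit_size(src_w, src_h, box_w, box_h, mode="contain"):
    """Return the output (width, height) for a resize into a target box.

    exact:   ignore aspect ratio and fill the box.
    contain: keep aspect ratio, fit entirely inside the box.
    cover:   keep aspect ratio, fill the box (overflow is cropped later).
    """
    if mode == "exact":
        return box_w, box_h
    if mode not in ("contain", "cover"):
        raise ValueError(f"unknown mode: {mode}")
    scale_x = box_w / src_w
    scale_y = box_h / src_h
    scale = min(scale_x, scale_y) if mode == "contain" else max(scale_x, scale_y)
    return max(1, round(src_w * scale)), max(1, round(src_h * scale))

# A 4000x3000 photo going into a 224x224 model input.
contain = fit_size(4000, 3000, 224, 224, "contain")  # (224, 168)
cover = fit_size(4000, 3000, 224, 224, "cover")      # (299, 224)
exact = fit_size(4000, 3000, 224, 224, "exact")      # (224, 224)
```

With contain you still need padding to reach a square input, with cover you need a crop, and exact distorts the aspect ratio. Whichever you pick, apply the same choice to every image so augmentation starts from a uniform base.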

Make the collection reproducible

Reproducibility is non-negotiable in machine learning. Use the filename constructor to encode a class label and sequence number into each file, so the file name itself carries the label. Save the scan settings, filters, and naming you used to collect the images as a reusable rule, and keep the CSV manifest with your experiment notes. When a reviewer asks how the set was built, you can hand over the rule, the manifest, and the exact filters. Clean inputs, documented and repeatable, are what make your augmented data actually worth training on.
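
On the training side, a label-plus-sequence naming scheme is easy to build and to parse back. The pattern below (label slug, underscore, zero-padded index) is a hypothetical scheme chosen for illustration; adapt it to whatever template you set up in the filename constructor.

```python
import re

def build_name(label, index, ext="jpg", width=5):
    """Build names like 'tabby-cat_00042.jpg' from a label and a counter."""
    slug = re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")
    return f"{slug}_{index:0{width}d}.{ext}"

NAME_PATTERN = re.compile(r"^(?P<label>[a-z0-9-]+)_(?P<index>\d+)\.(?P<ext>[a-z0-9]+)$")

def parse_name(filename):
    """Recover (label, index) from a name produced by build_name."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    return match.group("label"), int(match.group("index"))

name = build_name("Tabby Cat", 42)
label, index = parse_name(name)
```

Because the label is recoverable from the name alone, a data loader can rebuild its class mapping without a separate annotation file, and the CSV manifest remains the record of where each file came from.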