Collect an Image Dataset for Machine Learning
The dataset is only as good as the collection step
Anyone who has trained a vision model knows the quiet truth: most of the effort goes into the data, not the model. A folder full of mislabeled, duplicate, and tiny images poisons results before training even starts. Collecting images cleanly the first time saves hours of cleanup later. Bulk Image Downloader From URL List is a Chrome extension that handles the gathering and first-pass cleaning entirely in the browser, which keeps your collection step lightweight while you focus on labeling and training.
Gather candidates from your sources
Start in the side panel. If your class images live across many pages (dozens of category or search-result pages, for example), paste that list of page URLs into the bulk URL box or load it from a .txt or .csv file, then run Scrape from list. Set a Max URLs cap to sample a source before pulling everything, and add a Delay between page loads to stay polite to the host. For a single rich page, Deep Scan scrolls and waits so that lazy-loaded images end up in your results instead of being missed.
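If you assemble the page list programmatically before pasting it in, a small script can clean it first. The following is a minimal sketch, not part of the extension: the `build_url_list` helper, the `example.com` URL pattern, and the `max_urls` parameter (mirroring the idea of a Max URLs cap) are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


def build_url_list(candidates, max_urls=None):
    """Return unique http(s) URLs in first-seen order, optionally capped."""
    seen = set()
    cleaned = []
    for raw in candidates:
        url = raw.strip()
        if not url:
            continue
        parts = urlsplit(url)
        # Skip anything that is not an absolute http(s) URL.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if url in seen:
            continue
        seen.add(url)
        cleaned.append(url)
    return cleaned if max_urls is None else cleaned[:max_urls]


# Hypothetical paginated category pages; replace with your real sources.
pages = [f"https://example.com/category/cats?page={n}" for n in range(1, 4)]
pages.append("https://example.com/category/cats?page=1")  # exact duplicate
pages.append("not a url")                                 # malformed entry

url_list = build_url_list(pages)

# One URL per line, ready to save as a .txt file or paste into the URL box.
url_list_text = "\n".join(url_list) + "\n"
```

Saving `url_list_text` to a file such as `pages.txt` gives you a list you can load directly, and calling `build_url_list(pages, max_urls=2)` produces a small sample for a trial run.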
Filter for quality and consistency
Resolution matters for training. Tiny thumbnails and icons add noise and can hurt a model, so the first filter to set is minimum dimensions. In the Filters tab, set a minimum width and height large enough to exclude sprites and favicons while keeping usable samples. From there:
- Restrict file types to the formats your pipeline expects, so you are not feeding SVGs into an image loader that wants raster files.
- Use aspect ratio filters when a class benefits from consistent framing.
- Use domain include and exclude to keep only images from sources you trust and drop hotlinked third-party junk.
Filters apply to the current collection without re-scraping, so you can tighten the rules and re-apply until the grid holds only the samples you want.
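The same rules are worth re-checking in your own pipeline before training. The sketch below expresses the four filters as one predicate over image metadata. It is an illustration only: the `Candidate` record, the `keep` function, and every threshold and default are assumptions, not the extension's internals or settings.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class Candidate:
    url: str
    width: int
    height: int


def keep(c, min_w=256, min_h=256,
         exts=(".jpg", ".jpeg", ".png", ".webp"),
         ratio_range=(0.5, 2.0), allowed_domains=None):
    """Return True if the candidate passes every filter."""
    parts = urlsplit(c.url)
    # Minimum dimensions: drops thumbnails, sprites, and favicons.
    if c.height <= 0 or c.width < min_w or c.height < min_h:
        return False
    # File type: keep only the raster formats the loader expects.
    if not parts.path.lower().endswith(exts):
        return False
    # Aspect ratio (width / height): drops extreme banners and strips.
    ratio = c.width / c.height
    if not (ratio_range[0] <= ratio <= ratio_range[1]):
        return False
    # Domain include list: keep only trusted hosts when one is given.
    if allowed_domains is not None and parts.hostname not in allowed_domains:
        return False
    return True


# Hypothetical candidates for illustration.
candidates = [
    Candidate("https://example.com/img/cat1.jpg", 800, 600),
    Candidate("https://example.com/favicon.png", 32, 32),
    Candidate("https://example.com/logo.svg", 1000, 1000),
    Candidate("https://example.com/banner.jpg", 1200, 300),
    Candidate("https://ads.example.net/promo.jpg", 800, 600),
]
kept = [c for c in candidates if keep(c, allowed_domains={"example.com"})]
```

Because the predicate works on metadata alone, tightening a threshold and re-running it is cheap, which matches the re-apply workflow described above.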
Remove duplicates so your splits stay honest
Duplicates are a real problem for datasets. When the same image appears in both your training and validation sets, validation accuracy is inflated and overfitting is hidden. The extension attacks duplicates on two fronts. URL deduplication catches files that are referenced more than once by the same address. The perceptual duplicate finder goes further by comparing images visually, so near-identical shots with different filenames or slight recompression get flagged. You set the sensitivity, choose which copy to keep, and clear the rest, and you can export the duplicate list if you want a record.
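To make the idea of visual comparison concrete, here is a minimal sketch of one common perceptual technique, the average hash, compared by Hamming distance. The extension's actual algorithm is not documented here, so treat this as a stand-in. It assumes each image has already been downscaled to a small grayscale grid (a plain list of rows), and `max_distance` plays the role of a sensitivity setting.

```python
def average_hash(pixels):
    """Hash a small grayscale grid: one bit per pixel, set if above the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a, b):
    """Count the bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


def find_near_duplicates(hashes, max_distance=2):
    """Return name pairs whose hashes differ by at most max_distance bits."""
    names = sorted(hashes)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if hamming(hashes[first], hashes[second]) <= max_distance:
                pairs.append((first, second))
    return pairs


# Toy 4x4 grids standing in for downscaled images (hypothetical data).
original = [[10, 200, 10, 200], [200, 10, 200, 10],
            [10, 200, 10, 200], [200, 10, 200, 10]]
recompressed = [[12, 198, 9, 203], [197, 13, 201, 8],
                [11, 202, 12, 199], [203, 9, 198, 12]]
different = [[200, 200, 200, 200], [200, 200, 200, 200],
             [10, 10, 10, 10], [10, 10, 10, 10]]

hashes = {
    "a.jpg": average_hash(original),
    "a_copy.jpg": average_hash(recompressed),
    "b.jpg": average_hash(different),
}
duplicate_pairs = find_near_duplicates(hashes)
```

The slightly altered copy hashes to the same bits as the original, while the unrelated image differs in half of them, so only the first pair is flagged. Raising `max_distance` makes the check more aggressive at the cost of more false positives.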
Save into organized, predictable folders
Clean data needs clean structure. On the options page, route each download into a named folder so classes stay separated on disk; one task and one folder per class is a simple, scalable pattern. The filename constructor builds consistent names with a sequence number and pad width, which makes files sort correctly and keeps labels traceable. If your training code wants a specific resolution or format, resize or convert during download: set a max width and height with a fit mode, or convert everything to a single format like JPG or WebP in one pass. You can also strip EXIF metadata to drop camera and GPS data you do not need. The result is a tidy, deduplicated, correctly sized set of folders ready to hand to your loader, assembled without a single scraping script to maintain.
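Two of these steps are easy to illustrate: zero-padded sequence names and fitting an image inside a maximum size. The sketch below is an assumption-laden illustration, not the extension's implementation. The `build_name` and `fit_within` helpers, the `class/class_0001.jpg` naming pattern, and the choice of a fit that preserves aspect ratio and never upscales are all hypothetical.

```python
def build_name(class_name, index, pad=4, ext="jpg"):
    """Build a per-class path with a zero-padded sequence number."""
    return f"{class_name}/{class_name}_{index:0{pad}d}.{ext}"


def fit_within(width, height, max_w, max_h):
    """Scale to fit inside max_w x max_h, keeping aspect ratio, never upscaling."""
    scale = min(max_w / width, max_h / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


# Padded names sort in numeric order; unpadded names do not.
padded = sorted(build_name("cats", i) for i in (10, 2, 1))
unpadded = sorted(f"cats_{i}.jpg" for i in (10, 2, 1))

# A 4000x3000 photo fitted into a 1024x1024 box keeps its 4:3 shape.
large = fit_within(4000, 3000, 1024, 1024)
# An image already inside the box is left at its original size.
small = fit_within(640, 480, 1024, 1024)
```

With padding, `cats_0002` sorts before `cats_0010`, whereas the unpadded `cats_10.jpg` sorts ahead of `cats_2.jpg`. That ordering difference is the reason a pad width matters once a class grows past nine images.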
