Scraping Images From Tumblr
For my Reaction.Pics project, I had to scrape a bunch of Tumblr accounts for data to assemble its database. Because I didn't want to hotlink thousands of images, I made local copies of them (about 22 GB raw). However, given the uncurated nature of Tumblr posts, I found that there were tons of broken images. Going through them, I noticed a few common themes, including:
- empty files (I assume from 404s)
- malformed files (just binary crap)
- HTML (also mostly 404s from sites that don’t obey Accept HTTP headers)
- non-standard images like .raw and .tiff
After processing the database several times with multiple scripts that checked various heuristics, like the file extension and a guess at the MIME type, I found that the single most useful way of checking images is to have Pillow (the Python imaging library) parse the image binary:
from PIL import Image

# Given an image path
path = "abcd.gif"
# Have Pillow verify the image; verify() raises an exception if the file is broken
with Image.open(path) as img:
    img.verify()
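To run that check over a whole directory of downloads, something like the following could work. This is only a sketch, not the script I actually used: the helper names `is_valid_image` and `find_broken_images` are made up for illustration, and catching every exception is an assumption made to keep it short, since Pillow raises different exception types depending on the format and the kind of damage.

```python
import os

from PIL import Image


def is_valid_image(path):
    """Return True if Pillow can open and verify the file at `path`."""
    try:
        with Image.open(path) as img:
            # verify() checks the file for breakage without decoding the pixels
            img.verify()
        return True
    except Exception:
        # Empty files, HTML error pages and truncated binaries all end up here
        return False


def find_broken_images(root):
    """Walk `root` recursively and return a sorted list of files that fail the check."""
    broken = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not is_valid_image(path):
                broken.append(path)
    return sorted(broken)
```

Note that `verify()` does not decode the pixel data, so a file that passes can in principle still fail when fully loaded. A stricter check would reopen the file and call `load()`.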
After filtering images and removing duplicates, I was able to bring the image database down to 8 GB.
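One simple way to find duplicates is to hash each file's contents and flag any file whose hash has already been seen. The sketch below assumes exact byte-for-byte duplicates and uses SHA-256. That is an illustrative choice rather than a description of the actual deduplication step, and the names `file_digest` and `find_duplicates` are hypothetical. It will not catch the same image re-encoded or resized.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return (duplicate_path, first_seen_path) pairs for byte-identical files under `root`."""
    seen = {}
    duplicates = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            digest = file_digest(path)
            if digest in seen:
                duplicates.append((path, seen[digest]))
            else:
                seen[digest] = path
    return duplicates
```

Each returned pair names a file that could be deleted and the earlier copy that would be kept.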
Thanks to these sites for providing data: