Notes

Scraping Images From Tumblr

February 24, 2019

For my Reaction.Pics project, I had to scrape a bunch of Tumblr accounts for data to build its database. To avoid hotlinking thousands of images, I made local copies (about 22 GB raw). However, given the uncurated nature of Tumblr posts, I found tons of broken images. Going through them, I noticed a few common failure types:

empty files (I assume from 404s)
malformed files (just binary crap)
HTML (also mostly 404s from sites that don’t obey the HTTP Accept header)
non-standard image formats like .raw and .tiff

After several passes over the database with scripts that checked heuristics such as file extension and guessed MIME type, I found that the single most useful check is having Pillow (the Python imaging library) parse the image binary:

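from PIL import Image

# Given an image path
path = "abcd.gif"

# Have Pillow verify the image; open() or verify() raises an exception
# if the file cannot be parsed as an image
with Image.open(path) as img:
    img.verify()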
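To run that check over a whole directory of downloads, something like the following could work. This is a minimal sketch, not the script I actually used: the helper names `is_valid_image` and `find_broken_images` are made up for illustration, and it assumes that any exception from Pillow means the file is unusable.

```python
from pathlib import Path

from PIL import Image


def is_valid_image(path):
    """Return True if Pillow can open and verify the file at `path`."""
    try:
        with Image.open(path) as img:
            img.verify()
        return True
    except Exception:
        # Pillow raises different exception types depending on the failure
        # (unrecognized format, corrupt header, ...), so catch broadly.
        # Assumption: any failure here means the file should be discarded.
        return False


def find_broken_images(directory):
    """Return a sorted list of files under `directory` that fail the check."""
    return sorted(
        p for p in Path(directory).rglob("*")
        if p.is_file() and not is_valid_image(p)
    )
```

Empty files and HTML error pages fail at `Image.open()`, since Pillow cannot identify a format for them. Note that `verify()` does not decode the full image data, so some truncated files may still pass, and formats Pillow can read (such as TIFF) will count as valid unless you also check `img.format`.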

After filtering images and removing duplicates, I was able to bring the image database down to 8 GB.
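For the duplicate removal step, one simple approach is to group files by a hash of their contents. The sketch below is an assumption about how this could be done, not a record of the exact method used here: `file_digest` and `find_duplicates` are hypothetical helper names, and it only catches byte-identical copies, not the same image re-encoded or resized.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(directory):
    """Map each first-seen file to the later files with identical bytes."""
    seen = {}        # digest -> first path with that content
    duplicates = {}  # first path -> list of byte-identical copies
    for p in sorted(Path(directory).rglob("*")):
        if not p.is_file():
            continue
        digest = file_digest(p)
        if digest in seen:
            duplicates.setdefault(seen[digest], []).append(p)
        else:
            seen[digest] = p
    return duplicates
```

The values of the returned dictionary are the files that could be deleted, keeping one copy of each image.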

Thanks to these sites for providing data: