To build increasingly general-purpose artificial intelligence systems that can deal with unknown variables across unknown domains, we need benchmarks that measure precisely how well these systems perform on tasks they have never seen before. A prerequisite for this is a measure of a task's generalization difficulty, or how dissimilar it... Show more