Large & Interesting Datasets

This is a list of impressive / interesting data-sets I have come across. They are typically free (or cost-price of media), machine readable, downloadable and extensive.

Text

  • ‘Project Gutenberg’ Massive collection of books older than 80 years (expired copyright). Available via a custom ISO image here, an ISO of popular books here, or via rsync here.
  • ‘Enron corpus’ Roughly 500,000 real emails. here
  • ‘ClueWeb09 Dataset’ 2009 archive of 1 billion web pages (various languages) [5 TB compressed, 25 TB uncompressed]. here
  • ‘arXiv archive’ Giant collection of scientific papers in PDF format. here
  • ‘SMS Spam Collection’ 5000+ SMS messages tagged as spam or not. here
  • Google n-grams (from web pages)  here
  • Google n-grams (from books) here

Linguistics

  • ‘WordNet’ Database of English words. Words are grouped, linked and organised. here
  • Yahoo labs. A collection of data from Yahoo searches/questions/ratings/images here

Knowledge

  • ‘Freebase’ Circa 1 billion facts. [will be retired July 2015] here
  • ‘Mizar Project’ Tens of thousands of mathematical definitions, formulas and proofs, in machine readable format. here
  • ‘Stack Exchange’ Questions and answers from many popular forums. Computer readable dumps and a query API. here
  • ‘DBpedia’ Parsed and structured information extracted from Wikipedia.  here
  • ‘Cross-Lingual Dictionary for English Wikipedia Concepts’  Maps concepts to relevant Wikipedia articles. here

Images

  • ‘Visual Dictionary’ Images for 50,000+ nouns in the English language, by MIT.  here
  • ‘ImageNet’ Pictures matching the hierarchy of WordNet noun nodes. here
  • ‘MNIST database’  70,000 handwritten digits. here
  • An index to all sorts of computer vision data-sets. here
  • ‘LabelMe’ Labeled things in images, from MIT. here
  • ‘KTH-TIPS’ Textures under varying illumination, pose and scale. here
  • ‘MS COCO’ 91 object-types, 2.5 million labels in 328,000 images. here
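
As a rough illustration of what "machine readable" means for one of these, here is a minimal sketch of reading MNIST-style image files. It assumes the commonly documented IDX layout (a big-endian header holding the magic number 2051, the image count, rows and columns, followed by one unsigned byte per pixel) and a file that has already been decompressed. The function names are made up for this example, and the sample data is synthetic rather than taken from the real dataset.

```python
import struct

IDX_IMAGE_MAGIC = 2051  # assumed magic number for IDX image files


def parse_idx_images(data: bytes):
    """Parse an IDX image buffer into a list of images (each a list of pixel rows)."""
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    if magic != IDX_IMAGE_MAGIC:
        raise ValueError(f"unexpected magic number {magic}")
    pixels = data[16:]
    if len(pixels) != count * rows * cols:
        raise ValueError("pixel payload does not match header")
    size = rows * cols
    images = []
    for i in range(count):
        flat = pixels[i * size:(i + 1) * size]
        images.append([list(flat[r * cols:(r + 1) * cols]) for r in range(rows)])
    return images


def build_idx_images(images):
    """Build a synthetic IDX image buffer (hypothetical helper, for the demo only)."""
    rows, cols = len(images[0]), len(images[0][0])
    header = struct.pack(">IIII", IDX_IMAGE_MAGIC, len(images), rows, cols)
    body = bytes(p for img in images for row in img for p in row)
    return header + body


# Round-trip a tiny 2x2 "image" instead of touching the real files.
sample = [[[0, 255], [128, 64]]]
decoded = parse_idx_images(build_idx_images(sample))
```

For a real file, the bytes would come from something like `open(path, "rb").read()` after unpacking the download; the path is whatever you saved the file as.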

Music

  • ‘Million Song Data Set’ Audio features and other meta data for a million modern popular music tracks. here
  • ‘Last.fm API’ Access to all last.fm user/playlist/music/artist/geo/genre/tag data. A history of who listened to what/when/where. here

Society

  • WHO ‘Global Health Observatory’ Lots of usable data about all health-related issues. here
  • ‘ICPSR 4572’ Extensive stats on (USA) prisoners. here
  • Drug usage (USA). Large survey with many variables: who/what/why. here

Social Networks

  • ‘Social Network Analysis Interactive Dataset Library’. Contains computer readable images of 300+ online social networks. here
  • ‘Stanford Large Network Dataset Collection’ Graphs of assorted webpages (eg. Facebook / Twitter) here

Geology

  • ‘Global Historical Earthquakes’  All known earthquakes (prior to 1903). here and also this.

Misc

  • ‘Open Product Data’. International set of product bar-codes. here
  • ‘Amazon reviews’ 34 million product reviews covering 2 million products. here.
  • ‘USDA National Nutrient Database for Standard Reference’ The nutrients (typically) found in different foods. here
  • ‘Open Library Data Dumps’ Very large dump of library records (author, work, revision, etc). here
  • ‘List of lists of lists’  The pages listing lists got so prolific… we needed a list of them.  here