This is a list of impressive / interesting data-sets I have come across. They are typically free (or priced at the cost of the media), machine-readable, downloadable and extensive.
Text
- ‘Project Gutenberg’ Massive collection of books older than 80 years (expired copyright). Available via custom ISO image here, ISO of popular books here, or via rsync here.
- ‘Enron corpus’ Roughly 500,000 real emails. here
- ‘ClueWeb09 Dataset’ 2009 archive of 1 billion web pages (various languages) [5 TB compressed, 25 TB uncompressed]. here
- ‘arXiv archive’ Giant collection of scientific papers in PDF format. here
- ‘SMS Spam Collection’ 5000+ SMS messages tagged as spam or not. here
- Google n-grams (from web pages) here
- Google n-grams (from books) here
Linguistics
- ‘WordNet’ Database of English words. Words are grouped, linked and organised. here.
- Yahoo labs. A collection of data from Yahoo searches/questions/ratings/images here
Knowledge
- ‘Freebase’ Circa 1 billion facts. [will be retired July 2015] here
- ‘Mizar Project’ Tens of thousands of mathematical definitions, formulas and proofs, in machine-readable format. here
- ‘Stack Exchange’ Questions and answers from many popular forums. Computer readable dumps and a query API. here
- ‘DBpedia’ Parsed and structured information extracted from Wikipedia. here
- ‘Cross-Lingual Dictionary for English Wikipedia Concepts’ Maps concepts to relevant Wikipedia articles. here
Images
- ‘Visual Dictionary’ Images for 50,000+ nouns in the English language, by MIT. here
- ‘ImageNet’ Pictures matching the hierarchy of WordNet noun nodes. here
- ‘MNIST database’ 70,000 handwritten digits. here
- An index to all sorts of computer vision data-sets. here
- ‘LabelMe’ Labeled objects in images, from MIT. here
- ‘KTH-TIPS’ Textures under varying illumination, pose and scale. here
- ‘MS COCO’ 91 object-types, 2.5 million labels in 328,000 images. here
Music
- ‘Million Song Dataset’ Audio features and other metadata for a million modern popular music tracks. here
- ‘Last.fm API’ Access to all last.fm user/playlist/music/artist/geo/genre/tag data. A history of who listened to what/when/where here
Society
- WHO ‘Global Health Observatory’ Lots of usable data about all health-related issues. here
- ‘ICPSR 4572’ Extensive stats on (USA) prisoners. here
- Drug usage (USA): large survey with many variables (who/what/why). here
Social Networks
- ‘Social Network Analysis Interactive Dataset Library’. Contains computer readable images of 300+ online social networks. here
- ‘Stanford Large Network Dataset Collection’ Graphs of assorted webpages (e.g. Facebook / Twitter). here
Geology
Misc
- ‘Open Product Data’. International set of product bar-codes. here
- ‘Amazon reviews’ 34 million product reviews covering 2 million products. here.
- ‘USDA National Nutrient Database for Standard Reference’ The nutrients (typically) found in different foods. here
- ‘Open Library Data Dumps’ Very large dump of library records (author, work, revision, etc). here
- ‘List of lists of lists’ The pages listing lists got so prolific… we needed a list of them. here