Large & Interesting Datasets

duckman — Sun, 08 Mar 2015 16:00:03 +0000

This is a list of impressive / interesting data-sets I have come across. They are typically free (or priced at the cost of the media), machine readable, downloadable and extensive.

Text

‘Project Gutenberg’ Massive collection of books older than 80 years (expired copyright). Available as a custom ISO image here, as an ISO of popular books here, or via rsync here.
‘Enron corpus’ Roughly 500,000 real emails. here
‘ClueWeb09 Dataset’ 2009 archive of 1 billion web pages (various languages) [5 TB compressed, 25 TB uncompressed]. here
‘arXiv archive’ Giant collection of scientific papers in PDF format. here
‘SMS Spam Collection’ 5000+ SMS messages tagged as spam or not. here
Google n-grams (from web pages) here
Google n-grams (from books) here
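Most of the text collections above are "machine readable" in the plainest sense: one record per line. As a minimal sketch, assuming the SMS Spam Collection is laid out as a label, a tab, then the message on each line (an assumption about the download, so check the file first), a reader could look like the following. The function name and the sample lines are made up for illustration and are not taken from the real corpus.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def parse_labelled_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Split each 'label<TAB>message' line into a (label, message) pair."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        label, sep, message = line.partition("\t")
        if not sep:
            continue  # skip lines that have no tab separator
        pairs.append((label, message))
    return pairs


# Hypothetical sample lines, assuming a 'label<TAB>message' layout.
sample = [
    "ham\tSee you at lunch?\n",
    "spam\tYou have won a prize, call now\n",
    "ham\tRunning late, sorry\n",
]

pairs = parse_labelled_lines(sample)
label_counts = Counter(label for label, _ in pairs)
print(label_counts)
```

To use it on a real download, pass an open file object instead of the sample list; the function only needs something that yields lines.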

Linguistics

‘WordNet’ Database of English words. Words are grouped, linked and organised. here.
‘Yahoo Labs’ A collection of data from Yahoo searches/questions/ratings/images. here

Knowledge

‘Freebase’ Circa 1 billion facts. [will be retired July 2015] here
‘Mizar Project’ Tens of thousands of mathematical definitions, formulas and proofs, in machine readable format. here
‘Stack Exchange’ Questions and answers from many popular forums. Computer readable dumps and a query API. here
‘DBpedia’ Parsed and structured information extracted from Wikipedia. here
‘Cross-Lingual Dictionary for English Wikipedia Concepts’ Maps concepts to relevant Wikipedia articles. here

Images

‘Visual Dictionary’ Images for 50,000+ nouns in the English language, by MIT. here
‘ImageNet’ Pictures matching the hierarchy of WordNet noun nodes. here
‘MNIST database’ 70,000 handwritten digits. here
An index to all sorts of computer vision data-sets. here
‘LabelMe’ Labeled objects in images, from MIT. here
‘KTH-TIPS’ Textures under varying illumination, pose and scale. here
‘MS COCO’ 91 object-types, 2.5 million labels in 328,000 images. here

Music

‘Million Song Dataset’ Audio features and other metadata for a million modern popular music tracks. here
‘Last.fm API’ Access to all last.fm user/playlist/music/artist/geo/genre/tag data. A history of who listened to what/when/where here

Society

WHO ‘Global Health Observatory’ Lots of usable data about all health-related issues. here
‘ICPSR 4572’ Extensive stats on (USA) prisoners. here
Drug usage (USA). Large survey with many variables: who/what/why. here

Social Networks

‘Social Network Analysis Interactive Dataset Library’. Contains computer readable images of 300+ online social networks. here
‘Stanford Large Network Dataset Collection’ Graphs of assorted webpages (e.g. Facebook / Twitter). here

Geology

‘Global Historical Earthquakes’ All known earthquakes (prior to 1903). here and also this.

Misc

‘Open Product Data’. International set of product bar-codes. here
‘Amazon reviews’ 34 million product reviews covering 2 million products. here.
‘USDA National Nutrient Database for Standard Reference’ The nutrients (typically) found in different foods. here
‘Open Library Data Dumps’ Very large dump of library records (author, work, revision, etc). here
‘List of lists of lists’ The pages listing lists got so prolific… we needed a list of them. here

