These datasets are used in machine learning (ML) research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less intuitively, the availability of high-quality training datasets. High-quality labeled training datasets for supervised and semi-supervised machine-learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality unlabeled datasets for unsupervised learning can also be difficult and costly to produce.

Many organizations, including governments, publish and share their datasets, often using common metadata formats (such as Croissant). The datasets are classified, based on the licenses, into two groups: open data and non-open data.

Datasets from various governmental bodies are presented in List of open government data sites. The datasets are hosted on open data portals, where they are made available for searching, depositing and accessing through interfaces such as Open API. They are organized into various types and subtypes.

List of sorting used for datasets

| Type | Subtypes |
| --- | --- |
| Specific category | Finance, Economics, Commerce, Societal, Health, Academy, Sports, Food, Agriculture, Travel, Geospatial, Political, Consumer, Transport, Logistics, Environmental, Real-Estate, Legal, Entertainment, Energy, Hospitality |
| Scope | Supranational Union, National, Subnational, Municipality, Urban, Rural |
| Language | Mandarin Chinese, Spanish, English, Arabic, Hindi, Bengali |
| Type | Tabular, Graph, Text, Image, Sound, Video |
| Usage | Training, validating, and testing |
| File-Formats | CSV, JSON, XML, KML, GeoJSON, Shapefile, GML |
| Licenses | Creative-Commons, GPL, Other Non-Open data licenses |
| Last-Updated | Last-Hour, Last-Day, Last-Week, Last-Month, Last-Year |
| File-Size | Minimum, Maximum, Range |
| | Verified, In-Preparation, Deactivated (or Deprecated) |
| Number of records | 100s, 1000s, 10000s, 100000s, Millions |
| Number of variables | Less than 10, 10s, 100s, 1000s, 10000s |
| Services | Individual, Aggregation |
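
The "Usage" row above refers to the common practice of dividing one dataset into disjoint training, validation, and test subsets. A minimal sketch of such a split follows; the function name `split_dataset`, the 80/10/10 proportions, and the fixed seed are illustrative assumptions, not a convention prescribed by any of the portals listed here.

```python
import random


def split_dataset(records, train=0.8, val=0.1, seed=0):
    """Shuffle a copy of `records` and cut it into train/validation/test parts.

    The proportions and the seed are illustrative defaults (assumptions).
    Whatever is left after the train and validation cuts becomes the test set.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))  # 80 10 10
```

Shuffling before cutting avoids subsets that merely reflect the order in which records were stored; datasets that ship with predefined splits should normally be used with those splits instead.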

Data portals are classified by their type of license. Portals based on open-source licenses are known as open data portals and are used by many government organizations and academic institutions.

List of open data portals

| Portal name | License | List of installations of the portal | Typical usages |
| --- | --- | --- | --- |
| Comprehensive Knowledge Archive Network (CKAN) | AGPL | | Data repository for government or non-profit organisations, Data Management Solution for Research Institutes |
| | GPL | | Data repository for government or non-profit organisations, Data Management Solution for Research Institutes |
| Dataverse | Apache | | Data Management Solution for Research Institutes |
| DSpace | BSD | | Data Management Solution for Research Institutes |
| | BSD | | Data Management Solution to share datasets, algorithms, and experiment results through APIs. |

List of portals suitable for multiple types of applications

Some data portals list a wide variety of dataset subtypes pertaining to many machine learning applications.

Academic Torrents
Amazon Datasets
Awesome Public Datasets Collection
data.world
Datahub – Core Datasets
DataONE
DataPortals
Datasetlist.com
Global Open Data Index – Open Knowledge Foundation (archived 25 May 2020 at the Wayback Machine)
Google Dataset Search
Hugging Face
IBM's Data Asset Exchange
Jupyter – Tutorial Data
Kaggle
Machine learning datasets
Major Smart Cities with Open Data
Microsoft Datasets
Open Data Inception
Opendatasoft
OpenDOAR
OpenML
Papers with Code
Penn Machine Learning Benchmarks
Public APIs
Registry of Open Access Repositories
REgistry of REsearch Data REpositories
UCI Machine Learning Repository
Speech Dataset
Visual Data Discovery

List of portals suitable for a specific subtype of applications

Data portals suited to a specific subtype of machine learning application are listed in the following sections.

Image data

Text data

These datasets consist primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis.

Reviews

| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Netflix Prize | Movie ratings on Netflix. | | 100,480,507 ratings that 480,189 users gave to 17,770 movies | Text, rating | Rating prediction | 2006 | | Netflix |
| Amazon reviews | US product reviews from Amazon.com. | None. | 233.1 million | Text | Classification, sentiment analysis | 2015 (2018) | | McAuley et al. |
| OpinRank Review Dataset | Reviews of cars and hotels from Edmunds.com and TripAdvisor respectively. | None. | 42,230 / ~259,000 respectively | Text | Sentiment analysis, clustering | 2011 | | K. Ganesan et al. |
| MovieLens | 22,000,000 ratings and 580,000 tags applied to 33,000 movies by 240,000 users. | None. | ~22M | Text | Regression, clustering, classification | 2016 | | GroupLens Research |
| Yahoo! Music User Ratings of Musical Artists | Over 10M ratings of artists by Yahoo users. | None described. | ~10M | Text | Clustering, regression | 2004 | | Yahoo! |
| Car Evaluation Data Set | Car properties and their overall acceptability. | Six categorical features given. | 1728 | Text | Classification | 1997 | | M. Bohanec |
| YouTube Comedy Slam Preference Dataset | User vote data for pairs of videos shown on YouTube. Users voted on funnier videos. | Video metadata given. | 1,138,562 | Text | Classification | 2012 | | Google |
| Skytrax User Reviews Dataset | User reviews of airlines, airports, seats, and lounges from Skytrax. | Ratings are fine-grain and include many aspects of airport experience. | 41396 | Text | Classification, regression | 2015 | | Q. Nguyen |
| Teaching Assistant Evaluation Dataset | Teaching assistant reviews. | Features of each instance such as class, class size, and instructor are given. | 151 | Text | Classification | 1997 | | W. Loh et al. |
| Vietnamese Students' Feedback Corpus (UIT-VSFC) | Students' Feedback. | Comments | 16,000 | Text | Classification | 1997 | | Nguyen et al. |
| Vietnamese Social Media Emotion Corpus (UIT-VSMEC) | Users' Facebook Comments. | Comments | 6,927 | Text | Classification | 1997 | | Nguyen et al. |
| Vietnamese Open-domain Complaint Detection dataset (ViOCD) | Customer product reviews | Comments | 5,485 | Text | Classification | 2021 | | Nguyen et al. |
| ViHOS: Hate Speech Spans Detection for Vietnamese | Social Media Texts | Comments | Containing 26k spans on 11k comments | Text | Span Detection | 2021 | | Hoang et al. |

News articles

| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NYSK Dataset | English news articles about the case relating to allegations of sexual assault against the former IMF director Dominique Strauss-Kahn. | Filtered and presented in XML format. | 10,421 | XML, text | Sentiment analysis, topic extraction | 2013 | | Dermouche, M. et al. |
| The Reuters Corpus Volume 1 | Large corpus of Reuters news stories in English. | Fine-grain categorization and topic codes. | 810,000 | Text | Classification, clustering, summarization | 2002 | | Reuters |
| The Reuters Corpus Volume 2 | Large corpus of Reuters news stories in multiple languages. | Fine-grain categorization and topic codes. | 487,000 | Text | Classification, clustering, summarization | 2005 | | Reuters |
| Thomson Reuters Text Research Collection | Large corpus of news stories. | Details not described. | 1,800,370 | Text | Classification, clustering, summarization | 2009 | | T. Rose et al. |
| Saudi Newspapers Corpus | 31,030 Arabic newspaper articles. | Metadata extracted. | 31,030 | JSON | Summarization, clustering | 2015 | | M. Alhagri |
| RE3D (Relationship and Entity Extraction Evaluation Dataset) | Entity and Relation marked data from various news and government sources. Sponsored by Dstl | Filtered, categorisation using Baleen types | not known | JSON | Classification, Entity and Relation recognition | 2017 | | Dstl |
| Examiner Spam Clickbait Catalogue | Clickbait, spam, crowd-sourced headlines from 2010 to 2015 | Publish date and headlines | 3,089,781 | CSV | Clustering, Events, Sentiment | 2016 | | R. Kulkarni |
| ABC Australia News Corpus | Entire news corpus of ABC Australia from 2003 to 2019 | Publish date and headlines | 1,186,018 | CSV | Clustering, Events, Sentiment | 2020 | | R. Kulkarni |
| Worldwide News – Aggregate of 20K Feeds | One week snapshot of all online headlines in 20+ languages | Publish time, URL and headlines | 1,398,431 | CSV | Clustering, Events, Language Detection | 2018 | | R. Kulkarni |
| Reuters News Wire Headline | 11 Years of timestamped events published on the news-wire | Publish time, Headline Text | 16,121,310 | CSV | NLP, Computational Linguistics, Events | 2018 | | R. Kulkarni |
| The Irish Times Ireland News Corpus | 24 Years of Ireland News from 1996 to 2019 | Publish time, Headline Category and Text | 1,484,340 | CSV | NLP, Computational Linguistics, Events | 2020 | | R. Kulkarni |
| News Headlines Dataset for Sarcasm Detection | High quality dataset with Sarcastic and Non-sarcastic news headlines. | Clean, normalized text | 26,709 | JSON | NLP, Classification, Linguistics | 2018 | | Rishabh Misra |

Messages

| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Enron Corpus | Emails from employees at Enron organized into folders. | Attachments removed, invalid email addresses converted to user@enron.com or no_address@enron.com. | ~500,000 | Text | Network analysis, sentiment analysis | 2004 (2015) | | Klimt, B. and Y. Yang |
| Ling-Spam Dataset | Corpus containing both legitimate and spam emails. | Four versions of the corpus, depending on whether or not a lemmatiser or stop-list was enabled. | 2,412 ham, 481 spam | Text | Classification | 2000 | | Androutsopoulos, J. et al. |
| SMS Spam Collection Dataset | Collected SMS spam messages. | None. | 5,574 | Text | Classification | 2011 | | T. Almeida et al. |
| Twenty Newsgroups Dataset | Messages from 20 different newsgroups. | None. | 20,000 | Text | Natural language processing | 1999 | | T. Mitchell et al. |
| Spambase Dataset | Spam emails. | Many text features extracted. | 4,601 | Text | Spam detection, classification | 1999 | | M. Hopkins et al. |

Twitter and tweets

| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MovieTweetings | Movie rating dataset based on public and well-structured tweets | | ~710,000 | Text | Classification, regression | 2018 | | S. Dooms |
| Twitter100k | Pairs of images and tweets | | 100,000 | Text and Images | Cross-media retrieval | 2017 | | Y. Hu, et al. |
| Sentiment140 | Tweet data from 2009 including original text, time stamp, user and sentiment. | Classified using distant supervision from presence of emoticon in tweet. | 1,578,627 | Tweets, comma-separated values | Sentiment analysis | 2009 | | A. Go et al. |
| ASU Twitter Dataset | Twitter network data, not actual tweets. Shows connections between a large number of users. | None. | 11,316,811 users, 85,331,846 connections | Text | Clustering, graph analysis | 2009 | | R. Zafarani et al. |
| SNAP Social Circles: Twitter Database | Large Twitter network data. | Node features, circles, and ego networks. | 1,768,149 | Text | Clustering, graph analysis | 2012 | | J. McAuley et al. |
| Twitter Dataset for Arabic Sentiment Analysis | Arabic tweets. | Samples hand-labeled as positive or negative. | 2000 | Text | Classification | 2014 | | N. Abdulla |
| Buzz in Social Media Dataset | Data from Twitter and Tom's Hardware. This dataset focuses on specific buzz topics being discussed on those sites. | Data is windowed so that the user can attempt to predict the events leading up to social media buzz. | 140,000 | Text | Regression, Classification | 2013 | | F. Kawala et al. |
| Paraphrase and Semantic Similarity in Twitter (PIT) | This dataset focuses on whether tweets have (almost) the same meaning/information or not. Manually labeled. | Tokenization, part-of-speech and named entity tagging | 18,762 | Text | Regression, Classification | 2015 | | Xu et al. |
| Geoparse Twitter benchmark dataset | This dataset contains tweets during different news events in different countries. Manually labeled location mentions. | Location annotations added to JSON metadata | 6,386 | Tweets, JSON | Classification, Information Extraction | 2014 | | S. E. Middleton et al. |
| Sarcasm, Perceived and Intended, by Reactive Supervision (SPIRS) | Intended and perceived sarcastic tweets along with their context collected using reactive supervision; an equal number of negative (non-sarcastic) samples | | 30,000 | Tweet IDs, CSV | Classification | 2020 | | B. Shmueli et al. |
| Dutch Social media collection | This dataset contains COVID-19 tweets made by Dutch speakers or users from the Netherlands. The data has been machine labeled | Classified for sentiment, tweet text & user description translated to English. Industry mentions are extracted | 271,342 | JSONL | Sentiment, multi-label classification, machine translation | 2020 | | Aaaksh Gupta, CoronaWhy |
| ReactionGIF dataset | A dataset of 30K tweets and their GIF reactions | Classified for sentiment, reaction, and emotion | 30,000 | Tweet IDs, JSONL | Classification of sentiment, reaction, and emotion | 2021 | | B. Shmueli et al. |
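
The Sentiment140 entry above describes labels obtained by distant supervision: the presence of an emoticon in a tweet is treated as a noisy sentiment label. A minimal sketch of that idea follows. The emoticon lists and the function name `distant_label` are illustrative assumptions rather than the exact lists or code used by the dataset's creators.

```python
# Illustrative emoticon lists (assumptions, not the dataset's exact lists).
POSITIVE = (":)", ":-)", ":D", "=)")
NEGATIVE = (":(", ":-(")


def distant_label(tweet):
    """Return (text_without_emoticons, label), or None if no clear signal.

    A tweet with only positive emoticons is labeled "positive", one with only
    negative emoticons "negative". Tweets with no emoticon, or with both
    kinds, are skipped because the signal is absent or contradictory.
    """
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:
        return None
    label = "positive" if has_pos else "negative"
    text = tweet
    for emoticon in POSITIVE + NEGATIVE:
        # Strip the emoticons so a model cannot simply memorize them.
        text = text.replace(emoticon, "")
    return " ".join(text.split()), label


print(distant_label("Great movie :)"))  # ('Great movie', 'positive')
```

Labels produced this way are noisy, which is why datasets built with distant supervision are usually evaluated against a smaller hand-labeled test set.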

Dialogues

| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NPS Chat Corpus | Posts from age-specific online chat rooms. | Hand privacy masked, tagged for part of speech and dialogue-act. | ~500,000 | XML | NLP, programming, linguistics | 2007 | | Forsyth, E., Lin, J., & Martell, C. |
| Twitter Triple Corpus | A-B-A triples extracted from Twitter. | | 4,232 | Text | NLP | 2016 | | Sordoni, A. et al. |
| UseNet Corpus | UseNet forum postings. | Anonymized e-mails and URLs. Omitted documents with lengths <500 words or >500,000 words, or that were <90% English. | 7 billion | Text | | 2011 | | Shaoul, C., & Westbury C. |
| NUS SMS Corpus | SMS messages collected between two users, with timing analysis. | | ~10,000 | XML | NLP | 2011 | | Kan, M. |
| Reddit All Comments Corpus | All Reddit comments (as of 2015). | | ~1.7 billion | JSON | NLP, research | 2015 | | Stuck_In_the_Matrix |
| Ubuntu Dialogue Corpus | Dialogues extracted from Ubuntu chat stream on IRC. | | 930 thousand dialogues, 7.1 million utterances | CSV | Dialogue Systems Research | 2015 | | Lowe, R. et al. |
| Dialog State Tracking Challenge | The Dialog State Tracking Challenges 2 & 3 (DSTC2&3) were research challenges focused on improving the state of the art in tracking the state of spoken dialog systems. | Transcription of spoken dialogs with labelling | DSTC2 contains ~3.2k calls; DSTC3 contains ~2.3k calls | JSON | Dialogue state tracking | 2014 | | Henderson, Matthew and Thomson, Blaise and Williams, Jason D |
| Clinc-150 | Single-turn utterances collected from Amazon Mechanical Turk. 150 classification "intent" categories, and additional data for "out-of-scope" utterances. | | 23,700 | JSON | Intent Classification | 2019 | | Larson, S. et al. |

Legal

| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FreeLaw | Filtered data from Court Listener, part of the FreeLaw project. | Cleaned and normalized text | 4,940,710 | JSON | NLP, linguistics | 2020 | | T. Hoppe |
| Pile of Law | Corpus of legal and administrative data | Cleaned, normalized, and privatized | ~50,000,000 | JSON | NLP, linguistics, sentiment | 2022 | | L. Zheng; N. Guha; B. Anderson; P. Henderson; D. Ho |
| Caselaw Access Project | All official, book-published state and federal United States case law: every volume or case designated as an official report of decisions by a court within the United States. | Cleaned and normalized text | ~10,000 | JSON | NLP, linguistics | 2022 | | A. Aizman; S. Chapman; J. Cushman; K. Dulin; H. Eidolon; et al. |

Other text

| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Hansard French-English | The Canadian Hansard records. | | 2,869,040 French-English sentence pairs in 46.3 million words of French and 38.6 million words of English (IBM portion), and 60 million words (Bell portion) | French-English sentence pairs | Translation | 1995 | | IBM, Bell labs |
| Web of Science Dataset | Hierarchical Datasets for Text Classification | None. | 46,985 | Text | Classification, Categorization | 2017 | | K. Kowsari et al. |
| Legal Case Reports | Federal Court of Australia cases from 2006 to 2009. | None. | 4,000 | Text | Summarization, citation analysis | 2012 | | F. Galgani et al. |
| Blogger Authorship Corpus | Blog entries of 19,320 people from blogger.com. | Blogger self-provided gender, age, industry, and astrological sign. | 681,288 | Text | Sentiment analysis, summarization, classification | 2006 | | J. Schler et al. |
| Social Structure of Facebook Networks | Large dataset of the social structure of Facebook. | None. | 100 colleges covered | Text | Network analysis, clustering | 2012 | | A. Traud et al. |
| Dataset for the Machine Comprehension of Text | Stories and associated questions for testing comprehension of text. | None. | 660 | Text | Natural language processing, machine comprehension | 2013 | | M. Richardson et al. |
| The Penn Treebank Project | Naturally occurring text annotated for linguistic structure. | Text is parsed into semantic trees. | ~1M words | Text | Natural language processing, summarization | 1995 | | M. Marcus et al. |
| Web 1T 5-gram | Text from webpages. | One slice divides the data into sentences. Another slice divides the data into n-grams for n = 1–5. | ~1T words | Text and n-gram tables | Unsupervised learning | 2006 | | Google |
| DEXTER Dataset | Task given is to determine, from features given, which articles are about corporate acquisitions. | Features extracted include word stems. Distractor features included. | 2600 | Text | Classification | 2008 | | Reuters |
| Google Books N-grams | N-grams from a very large corpus of books | None. | 2.2 TB of text | Text | Classification, clustering, regression | 2011 | | Google |
| Personae Corpus | Collected for experiments in Authorship Attribution and Personality Prediction. Consists of 145 Dutch-language essays. | In addition to normal texts, syntactically annotated texts are given. | 145 | Text | Classification, regression | 2008 | | K. Luyckx et al. |
| PushShift | Archives of social media websites, including Reddit, Twitter, and Hackernews. | Text extracted and normalized from WARCs | ~100,000,000 posts | JSON | NLP, sentiment, linguistics | 2022 | | J. Baumgartner |
| | EDGAR: Company Filings | Text extracted. | | CSV | NLP | | | |
| CNAE-9 Dataset | Categorization task for free text descriptions of Brazilian companies. | Word frequency has been extracted. | 1080 | Text | Classification | 2012 | | P. Ciarelli et al. |
| Sentiment Labeled Sentences Dataset | 3000 sentiment labeled sentences. | Sentiment of each sentence has been hand labeled as positive or negative. | 3000 | Text | Classification, sentiment analysis | 2015 | | D. Kotzias |
| BlogFeedback Dataset | Dataset to predict the number of comments a post will receive based on features of that post. | Many features of each post extracted. | 60,021 | Text | Regression | 2014 | | K. Buza |
| | PubMed® comprises more than 35 million citations for biomedical literature from MEDLINE, life science journals, and online books. | None | 35 million | Text | NLP | | | |
| | The United States Patent and Trademark Office | | | Text | NLP | | | |
| | Open access collection of philosophy publications | | | Text | NLP | | | |
| | A popular large-scale text corpus. | None | | Text | NLP | 2015 | | Zhu, Yukun, et al. |
| Stanford Natural Language Inference (SNLI) Corpus | Image captions matched with newly constructed sentences to form entailment, contradiction, or neutral pairs. | Entailment class labels, syntactic parsing by the Stanford PCFG parser | 570,000 | Text | Natural language inference/recognizing textual entailment | 2015 | | S. Bowman et al. |
| DSL Corpus Collection (DSLCC) | A multilingual collection of short excerpts of journalistic texts in similar languages and dialects. | None | 294,000 phrases | Text | Discriminating between similar languages | 2017 | | Tan, Liling et al. |
| Urban Dictionary Dataset | Corpus of words, votes and definitions | User names anonymised | 2,580,925 | CSV | NLP, Machine comprehension | May 2016 | | Anonymous |
| T-REx | Wikipedia abstracts aligned with Wikidata entities | Alignment of Wikidata triples with Wikipedia abstracts | 11M aligned triples | JSON and NIF | NLP, Relation Extraction | 2018 | | H. Elsahar et al. |
| General Language Understanding Evaluation (GLUE) | Benchmark of nine tasks | Various | ~1M sentences and sentence pairs | | NLU | 2018 | | Wang et al. |
| Contract Understanding Atticus Dataset (CUAD) (formerly known as Atticus Open Contract Dataset (AOK)) | Dataset of legal contracts with rich expert annotations | | ~13,000 labels | CSV and PDF | Natural language processing, QnA | 2021 | | |
| Vietnamese Image Captioning Dataset (UIT-ViIC) | Vietnamese Image Captioning Dataset | | 19,250 captions for 3,850 images | CSV and PDF | Natural language processing, Computer vision | 2020 | | Lam et al. |
| Vietnamese Names annotated with Genders (UIT-ViNames) | Vietnamese Names annotated with Genders | | 26,850 Vietnamese full names annotated with genders | CSV | Natural language processing | 2020 | | To et al. |
| Vietnamese Constructive and Toxic Speech Detection Dataset (UIT-ViCTSD) | Vietnamese Constructive and Toxic Speech Detection Dataset | | 10,000 Vietnamese users' comments on online newspapers on 10 domains | CSV | Natural Language Processing | 2021 | | Nguyen et al. |
| | A set of books extracted from the Project Gutenberg books library | | | Text | Natural Language Processing | 2019 | | Jack W et al. |
| | Mathematical question and answer pairs. | | | Text | Natural Language Processing | 2018 | | D Saxton et al. |
| | A comprehensive archive of published books and papers | None | 100,356,641 | Text, epub, PDF | Natural Language Processing | 2024 | | |
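
Several corpora above (Web 1T 5-gram, Google Books N-grams) are distributed as tables of n-gram counts rather than raw text. A minimal sketch of how such a table can be derived from a token sequence for n = 1 to 5 follows. The function name `ngram_counts` and the whitespace tokenization are illustrative assumptions, not the pipeline used to build either corpus.

```python
from collections import Counter


def ngram_counts(tokens, max_n=5):
    """Count every contiguous n-gram of length 1..max_n in `tokens`.

    Keys are tuples of tokens, so unigrams appear as 1-tuples.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        # A sequence of length L contains L - n + 1 n-grams of length n.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Whitespace splitting stands in for a real tokenizer (assumption).
counts = ngram_counts("the cat sat on the mat".split())
print(counts[("the",)])        # 2
print(counts[("the", "cat")])  # 1
```

Published n-gram corpora typically drop n-grams below a frequency threshold to keep the tables manageable, a step omitted here.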

Sound data

These datasets consist of sounds and sound features used for tasks such as speech recognition and speech synthesis.

Speech

| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Switchboard-1 | Conversational speech over telephone. | | 260 hours of speech, from 543 speakers (302 male, 241 female) from across the United States, for around 2,400 two-sided telephone conversations, collected by Texas Instruments in 1990–1991. | Audio, text transcript, word-level timestamps, phonetic transcriptions | Speech recognition, phonetic transcription. | 1992 (2000) | | NIST |
| Hub5'00 | Conversational speech over telephone. | | 260 hours of speech, from 543 speakers (302 male, 241 female) from across the United States, for around 2,400 two-sided telephone conversations, at ~3 million words. Collected by Texas Instruments in 1990–1991. | Audio, text transcript, word-level timestamps, phonetic transcriptions | Speech recognition, phonetic transcription. The most commonly used test set for this dataset is called "Hub5'00". | 1992 (2000) | | NIST |
| Zero Resource Speech Challenge 2015 | Spontaneous speech (English), Read speech (Xitsonga). | None, raw WAV files. | English: 5h, 12 speakers; Xitsonga: 2h30, 24 speakers | WAV (audio only) | Unsupervised discovery of speech features/subword units/word units | 2015 | | Versteegh et al. |
| Parkinson Speech Dataset | Multiple recordings of people with and without Parkinson's Disease. | Voice features extracted, disease scored by physician using Unified Parkinson's Disease Rating Scale. | 1,040 | Text | Classification, regression | 2013 | | B. E. Sakar et al. |
| Spoken Arabic Digits | Spoken Arabic digits from 44 male and 44 female speakers. | Time-series of mel-frequency cepstrum coefficients. | 8,800 | Text | Classification | 2010 | | M. Bedda et al. |
| ISOLET Dataset | Spoken letter names. | Features extracted from sounds. | 7797 | Text | Classification | 1994 | | R. Cole et al. |
| Japanese Vowels Dataset | Nine male speakers uttered two Japanese vowels successively. | Applied 12-degree linear prediction analysis to it to obtain a discrete-time series with 12 cepstrum coefficients. | 640 | Text | Classification | 1999 | | M. Kudo et al. |
| Parkinson's Telemonitoring Dataset | Multiple recordings of people with and without Parkinson's Disease. | Sound features extracted. | 5875 | Text | Classification | 2009 | | A. Tsanas et al. |
| TIMIT | Recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. | Speech is lexically and phonemically transcribed. | 6300 | Text | Speech recognition, classification. | 1986 | | J. Garofolo et al. |
| Arabic Speech Corpus | A single-speaker, Modern Standard Arabic (MSA) speech corpus with phonetic and orthographic transcripts aligned to phoneme level. | Speech is orthographically and phonetically transcribed with stress marks. | ~1900 | Text, WAV | Speech Synthesis, Speech Recognition, Corpus Alignment, Speech Therapy, Education. | 2016 | | N. Halabi |
| Common Voice | A public domain database of crowdsourced data across a wide range of dialects. | Validation by other users. | English: 1,118 hours | MP3 with corresponding text files | Speech recognition | June 2017 (December 2019) | | Mozilla |
| LJSpeech | A single-speaker corpus of English public-domain audiobook recordings, split into short clips at punctuation marks. | Quality check, normalized transcription alongside the original. | 13,100 | CSV, WAV | Speech synthesis | 2017 | | Keith Ito, Linda Johnson |
| Arabic Speech Commands Dataset | Collected from 30 contributors and grouped into 40 keywords. | Raw WAV files | 12,000 | WAV, CSV | Speech recognition, keyword spotting | 2021 | | Abdulkader Ghandoura |

Music

| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Geographic Origin of Music Data Set | Audio features of music samples from different locations. | Audio features extracted using MARSYAS software. | 1,059 | Text | Geographic classification, clustering | 2014 | | F. Zhou et al. |
| Million Song Dataset | Audio features from one million different songs. | Audio features extracted. | 1M | Text | Classification, clustering | 2011 | | T. Bertin-Mahieux et al. |
| MUSDB18 | Multi-track popular music recordings | Raw audio | 150 | MP4, WAV | Source Separation | 2017 | | Z. Rafii et al. |
| Free Music Archive | Audio under Creative Commons from 100k songs (343 days, 1TiB) with a hierarchy of 161 genres, metadata, user data, free-form text. | Raw audio and audio features. | 106,574 | Text, MP3 | Classification, recommendation | 2017 | | M. Defferrard et al. |
| Bach Choral Harmony Dataset | Bach chorale chords. | Audio features extracted. | 5665 | Text | Classification | 2014 | | D. Radicioni et al. |

Other sounds

| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| UrbanSound | Labeled sound recordings of sounds like air conditioners, car horns and children playing. | Sorted into folders by class of events as well as metadata in a JSON file and annotations in a CSV file. | 1,059 | Sound (WAV) | Classification | 2014 | | J. Salamon et al. |
| AudioSet | 10-second sound snippets from YouTube videos, and an ontology of over 500 labels. | 128-d PCA'd VGG-ish features every 1 second. | 2,084,320 | Text (CSV) and TensorFlow Record files | Classification | 2017 | | J. Gemmeke et al., Google |
| Bird Audio Detection challenge | Audio from environmental monitoring stations, plus crowdsourced recordings | | 17,000+ | | Classification | 2016 (2018) | | Queen Mary University and IEEE Signal Processing Society |
| WSJ0 Hipster Ambient Mixtures | Audio from WSJ0 mixed with noise recorded in the San Francisco Bay Area | Noise clips matched to WSJ0 clips | 28,000 | Sound (WAV) | Audio source separation | 2019 | | Wichern, G., et al., Whisper and MERL |
| Clotho | 4,981 audio samples of 15 to 30 seconds long, each audio sample having five different captions of eight to 20 words long. | | 24,905 | Sound (WAV) and text (CSV) | Automated audio captioning | 2020 | | K. Drossos, S. Lipping, and T. Virtanen |

Signal data

Datasets containing electric signal information requiring some sort of signal processing for further analysis.

Electrical

| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Witty Worm Dataset | Dataset detailing the spread of the Witty worm and the infected computers. | Split into a publicly available set and a restricted set containing more sensitive information like IP and UDP headers. | 55,909 IP addresses | Text | Classification | 2004 | | Center for Applied Internet Data Analysis |
| Cuff-Less Blood Pressure Estimation Dataset | Cleaned vital signals from human patients which can be used to estimate blood pressure. | 125 Hz vital signs have been cleaned. | 12,000 | Text | Classification, regression | 2015 | | M. Kachuee et al. |
| Gas Sensor Array Drift Dataset | Measurements from 16 chemical sensors utilized in simulations for drift compensation. | Extensive number of features given. | 13,910 | Text | Classification | 2012 | | A. Vergara |
| Servo Dataset | Data covering the nonlinear relationships observed in a servo-amplifier circuit. | Levels of various components as a function of other components are given. | 167 | Text | Regression | 1993 | | K. Ullrich |
| UJIIndoorLoc-Mag Dataset | Indoor localization database to test indoor positioning systems. Data is magnetic field based. | Train and test splits given. | 40,000 | Text | Classification, regression, clustering | 2015 | | D. Rambla et al. |
| Sensorless Drive Diagnosis Dataset | Electrical signals from motors with defective components. | Statistical features extracted. | 58,508 | Text | Classification | 2015 | | M. Bator |

Motion-tracking

| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Wearable Computing: Classification of Body Postures and Movements (PUC-Rio) | People performing five standard actions while wearing motion trackers. | None. | 165,632 | Text | Classification | 2013 | | Pontifical Catholic University of Rio de Janeiro |
| Gesture Phase Segmentation Dataset | Features extracted from video of people doing various gestures. | Features extracted aim at studying gesture phase segmentation. | 9900 | Text | Classification, clustering | 2014 | | R. Madeo et al. |
| Vicon Physical Action Data Set | 10 normal and 10 aggressive physical actions that measure the human activity tracked by a 3D tracker. | Many parameters recorded by 3D tracker. | 3000 | Text | Classification | 2011 | | T. Theodoridis |
| Daily and Sports Activities Dataset | Motor sensor data for 19 daily and sports activities. | Many sensors given, no preprocessing done on signals. | 9120 | Text | Classification | 2013 | | B. Barshan et al. |
| Human Activity Recognition Using Smartphones Dataset | Gyroscope and accelerometer data from people wearing smartphones and performing normal actions. | Actions performed are labeled, all signals preprocessed for noise. | 10,299 | Text | Classification | 2012 | | J. Reyes-Ortiz et al. |
| Australian Sign Language Signs | Australian sign language signs captured by motion-tracking gloves. | None. | 2565 | Text | Classification | 2002 | | M. Kadous |
| Weight Lifting Exercises monitored with Inertial Measurement Units | Five variations of the biceps curl exercise monitored with IMUs. | Some statistics calculated from raw data. | 39,242 | Text | Classification | 2013 | | W. Ugulino et al. |
| sEMG for Basic Hand movements Dataset | Two databases of surface electromyographic signals of 6 hand movements. | None. | 3000 | Text | Classification | 2014 | | C. Sapsanis et al. |
| REALDISP Activity Recognition Dataset | Evaluate techniques dealing with the effects of sensor displacement in wearable activity recognition. | None. | 1419 | Text | Classification | 2014 | | O. Banos et al. |
| Heterogeneity Activity Recognition Dataset | Data from multiple different smart devices for humans performing various activities. | None. | 43,930,257 | Text | Classification, clustering | 2015 | | A. Stisen et al. |
| Indoor User Movement Prediction from RSS Data | Temporal wireless network data that can be used to track the movement of people in an office. | None. | 13,197 | Text | Classification | 2016 | | D. Bacciu |
| PAMAP2 Physical Activity Monitoring Dataset | 18 different types of physical activities performed by 9 subjects wearing 3 IMUs. | None. | 3,850,505 | Text | Classification | 2012 | | A. Reiss |
| OPPORTUNITY Activity Recognition Dataset | Human Activity Recognition from wearable, object, and ambient sensors is a dataset devised to benchmark human activity recognition algorithms. | None. | 2551 | Text | Classification | 2012 | | D. Roggen et al. |
| Real World Activity Recognition Dataset | Human Activity Recognition from wearable devices. Distinguishes between seven on-body device positions and comprises six different kinds of sensors. | None. | 3,150,000 (per sensor) | Text | Classification | 2016 | | T. Sztyler et al. |
| Toronto Rehab Stroke Pose Dataset | 3D human pose estimates (Kinect) of stroke patients and healthy participants performing a set of tasks using a stroke rehabilitation robot. | None. | 10 healthy persons and 9 stroke survivors (3500–6000 frames per person) | CSV | Classification | 2017 | | E. Dolatabadi et al. |
| Corpus of Social Touch (CoST) | 7805 gesture captures of 14 different social touch gestures performed by 31 subjects. The gestures were performed in three variations: gentle, normal and rough, on a pressure sensor grid wrapped around a mannequin arm. | Touch gestures performed are segmented and labeled. | 7805 gesture captures | CSV | Classification | 2016 | | M. Jung et al. |

Other signals

| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Wine Dataset | Chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. | 13 properties of each wine are given | 178 | Text | Classification, regression | 1991 | | M. Forina et al. |
| Combined Cycle Power Plant Data Set | Data from various sensors within a power plant running for 6 years. | None | 9568 | Text | Regression | 2014 | | P. Tufekci et al. |

Chemical data

Datasets from chemical systems.

Chemical Reactions with transition states (TS)

OpenReACT-CHON-EFH

OpenReACT-CHON-EFH (Open Reaction Dataset of Atomic ConfiguraTions comprising C, H, O and N with Energies, Forces and Hessians) is a 2025 open-access benchmark for machine-learning interatomic potentials.

  • **RTP set** – 35,087 stationary-point geometries (reactant, transition state and product) drawn from 11,961 elementary reactions, each labeled with density-functional energies, atomic forces and full Hessian matrices at the ωB97X-D/6-31G(d) level.
  • **IRC set** – 34,248 structures along 600 minimum-energy reaction paths, used to test extrapolation beyond trained stationary points.
  • **NMS set** – 62,527 off-equilibrium geometries generated by normal-mode sampling to probe model robustness under thermal perturbations.

The collection underpins the study "Does Hessian Data Improve the Performance of Machine Learning Potentials?" and was used to train and benchmark the machine-learning interatomic potentials reported in that study.

The dataset itself is distributed under a CC licence via Figshare.
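
The three kinds of labels attached to each geometry are related by differentiation: forces are the negative first derivatives of the energy with respect to atomic coordinates, and the Hessian holds the second derivatives. The following is a minimal sketch of that relationship only. It uses a toy quadratic potential in place of a density-functional calculation, and the names `energy`, `numerical_forces`, `numerical_hessian`, `K` and `x0` are illustrative assumptions that are not part of the dataset or its tooling.

```python
import numpy as np


def energy(x, k):
    """Toy quadratic potential E(x) = 1/2 x^T K x (a stand-in for a DFT energy)."""
    return 0.5 * x @ k @ x


def numerical_forces(x, k, h=1e-5):
    """Forces are the negative gradient of the energy (central differences)."""
    f = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        f[i] = -(energy(x + step, k) - energy(x - step, k)) / (2 * h)
    return f


def numerical_hessian(x, k, h=1e-4):
    """Hessian is the matrix of second derivatives of the energy."""
    n = x.size
    hess = np.zeros((n, n))
    for i in range(n):
        step = np.zeros(n)
        step[i] = h
        # Differentiate the gradient (the negative of the forces) along coordinate i.
        hess[:, i] = -(numerical_forces(x + step, k) - numerical_forces(x - step, k)) / (2 * h)
    return hess


# Hypothetical two-coordinate system; K plays the role of the true Hessian.
K = np.array([[2.0, 0.5], [0.5, 1.0]])
x0 = np.array([0.3, -0.2])

forces = numerical_forces(x0, K)
hessian = numerical_hessian(x0, K)
```

For this toy potential the analytic answers are `-K @ x0` for the forces and `K` for the Hessian, so the finite-difference results can be checked directly. A machine-learning potential trained on such data would typically be fitted to reproduce all three quantities at once.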

Physical data

Datasets from physical systems.

High-energy physics

| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HIGGS Dataset | Monte Carlo simulations of particle accelerator collisions. | 28 features of each collision are given. | 11,000,000 | Text | Classification | 2014 | | D. Whiteson |
| HEPMASS Dataset | Monte Carlo simulations of particle accelerator collisions. Goal is to separate the signal from noise. | 28 features of each collision are given. | 10,500,000 | Text | Classification | 2016 | | D. Whiteson |

Systems

Dataset nameBrief descriptionPreprocessingInstancesFormatDefault taskCreated (updated)ReferenceCreator
Yacht Hydrodynamics DatasetYacht performance based on dimensions.Six features are given for each yacht.308TextRegression2013R. Lopez
Robot Execution Failures Dataset5 data sets that center around robotic failure to execute common tasks.Integer valued features such as torque and other sensor measurements.463TextClassification1999L. Seabra et al.
Pittsburgh Bridges DatasetDesign description is given in terms of several properties of various bridges.Various bridge features are given.108TextClassification1990Y. Reich et al.
Automobile DatasetData about automobiles, their insurance risk, and their normalized losses.Car features extracted.205TextRegression1987J. Schlimmer et al.
Auto MPG DatasetMPG data for cars.Eight features of each car given.398TextRegression1993Carnegie Mellon University
Energy Efficiency DatasetHeating and cooling requirements given as a function of building parameters.Building parameters given.768TextClassification, regression2012A. Xifara et al.
Airfoil Self-Noise DatasetA series of aerodynamic and acoustic tests of two and three-dimensional airfoil blade sections.Data about frequency, angle of attack, etc., are given.1503TextRegression2014R. Lopez
Challenger USA Space Shuttle O-Ring DatasetAttempt to predict O-ring problems given past Challenger data.Several features of each flight, such as launch temperature, are given.23TextRegression1993D. Draper et al.
Statlog (Shuttle) DatasetNASA space shuttle datasets.Nine features given.58,000TextClassification2002NASA

Astronomy

| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Volcanoes on Venus – JARtool experiment Dataset | Venus images returned by the Magellan spacecraft. | Images are labeled by humans. | not given | Images | Classification | 1991 | | M. Burl |
| MAGIC Gamma Telescope Dataset | Monte Carlo generated high-energy gamma particle events. | Numerous features extracted from the simulations. | 19,020 | Text | Classification | 2007 | | R. Bock |
| Solar Flare Dataset | Measurements of the number of certain types of solar flare events occurring in a 24-hour period. | Many solar flare-specific features are given. | 1389 | Text | Regression, classification | 1989 | | G. Bradshaw |
| CAMELS Multifield Dataset | 2D maps and 3D grids from thousands of N-body and state-of-the-art hydrodynamic simulations spanning a broad range of values of the cosmological and astrophysical parameters. | Each map and grid has 6 cosmological and astrophysical parameters associated with it. | 405,000 2D maps and 405,000 3D grids | 2D maps and 3D grids | Regression | 2021 | | Francisco Villaescusa-Navarro et al. |

Earth science

Dataset nameBrief descriptionPreprocessingInstancesFormatDefault taskCreated (updated)ReferenceCreator
Volcanoes of the WorldVolcanic eruption data for all known volcanic events on earth.Details such as region, subregion, tectonic setting, dominant rock type are given.1535TextRegression, classification2013E. Venzke et al.
Seismic-bumps DatasetSeismic activities from a coal mine.Seismic activity was classified as hazardous or not.2584TextClassification2013M. Sikora et al.
CAMELS-USCatchment hydrology dataset with hydrometeorological timeseries and various attributessee Reference671CSV, Text, ShapefileRegression2017N. Addor et al. / A. Newman et al.
CAMELS-ChileCatchment hydrology dataset with hydrometeorological timeseries and various attributessee Reference516CSV, Text, ShapefileRegression2018C. Alvarez-Garreton et al.
CAMELS-BrazilCatchment hydrology dataset with hydrometeorological timeseries and various attributessee Reference897CSV, Text, ShapefileRegression2020V. Chagas et al.
CAMELS-GBCatchment hydrology dataset with hydrometeorological timeseries and various attributessee Reference671CSV, Text, ShapefileRegression2020G. Coxon et al.
CAMELS-AustraliaCatchment hydrology dataset with hydrometeorological timeseries and various attributessee Reference222CSV, Text, ShapefileRegression2021K. Fowler et al.
LamaH-CECatchment hydrology dataset with hydrometeorological timeseries and various attributessee Reference859CSV, Text, ShapefileRegression2021C. Klingler et al.

Other physical

Dataset nameBrief descriptionPreprocessingInstancesFormatDefault taskCreated (updated)ReferenceCreator
Concrete Compressive Strength DatasetDataset of concrete properties and compressive strength.Nine features are given for each sample.1030TextRegression2007I. Yeh
Concrete Slump Test DatasetConcrete slump flow given in terms of properties.Features of concrete given such as fly ash, water, etc.103TextRegression2009I. Yeh
Musk DatasetPredict if a molecule, given the features, will be a musk or a non-musk.168 features given for each molecule.6598TextClassification1994Arris Pharmaceutical Corp.
Steel Plates Faults DatasetSteel plates of 7 different types.27 features given for each sample.1941TextClassification2010Semeion Research Center
Noble Metal Monometallic Nanoparticles DatasetsProcessing and structural features of monometallic nanoparticles, labels being formation energy.85-182 features given for each sample.425 to 4000CSVRegression2017 to 2023A. Barnard and G. Opletal
Noble Metal Bimetallic Nanoparticles DatasetsProcessing and structural features of bimetallic nanoparticles, labels being formation energy.922 features given for each sample.138147 to 162770CSVRegression2023J. Ting et al.
AuPdPt Trimetallic Nanoparticles DatasetProcessing and structural features of AuPdPt nanoparticles, labels being formation energy.1958 features given for each sample.48136CSVRegression2023K. Lu et al.

Biological data

Datasets from biological systems.

Human

Dataset nameBrief descriptionPreprocessingInstancesFormatDefault taskCreated (updated)ReferenceCreator
Age DatasetA structured general-purpose dataset on life, work, and death of 1.22 million distinguished people. Public domain.A five-step method to infer birth and death years, gender, and occupation from community-submitted data to all language versions of the Wikipedia project.1,223,009TextRegression, Classification2022Paper DatasetAmoradnejad et al.
Synthetic Fundus DatasetPhotorealistic retinal images and vessel segmentations. Public domain.2500 images with 1500*1152 pixels useful for segmentation and classification of veins and arteries on a single background.2500ImagesClassification, Segmentation2020C. Valenti et al.
EEG DatabaseStudy to examine EEG correlates of genetic predisposition to alcoholism.Measurements from 64 electrodes placed on the scalp sampled at 256 Hz (3.9 ms epoch) for 1 second.122TextClassification1999H. Begleiter
P300 Interface DatasetData from nine subjects collected using P300-based brain-computer interface for disabled subjects.Split into four sessions for each subject. MATLAB code given.1,224TextClassification2008U. Hoffman et al.
Heart Disease Data SetAttributes of patients with and without heart disease.75 attributes given for each patient with some missing values.303TextClassification1988A. Janosi et al.
Breast Cancer Wisconsin (Diagnostic) DatasetDataset of features of breast masses. Diagnosis by a physician is given.10 features for each sample are given.569TextClassification1995W. Wolberg et al.
National Survey on Drug Use and HealthLarge scale survey on health and drug use in the United States.None.55,268TextClassification, regression2012United States Department of Health and Human Services
Lung Cancer DatasetLung cancer dataset without attribute definitions56 features are given for each case32TextClassification1992Z. Hong et al.
Arrhythmia DatasetData for a group of patients, of which some have cardiac arrhythmia.276 features for each instance.452TextClassification1998H. Altay et al.
Diabetes 130-US hospitals for years 1999–2008 Dataset9 years of readmission data across 130 US hospitals for patients with diabetes.Many features of each readmission are given.100,000TextClassification, clustering2014J. Clore et al.
Diabetic Retinopathy Debrecen DatasetFeatures extracted from images of eyes with and without diabetic retinopathy.Features extracted and conditions diagnosed.1151TextClassification2014B. Antal et al.
Diabetic Retinopathy Messidor DatasetMethods to evaluate segmentation and indexing techniques in the field of retinal ophthalmology (MESSIDOR)Features retinopathy grade and risk of macular edema1200Images, TextClassification, Segmentation2008Messidor Project
Liver Disorders DatasetData for people with liver disorders.Seven biological features given for each patient.345TextClassification1990Bupa Medical Research Ltd.
Thyroid Disease Dataset10 databases of thyroid disease patient data.None.7200TextClassification1987R. Quinlan
Mesothelioma DatasetMesothelioma patient data.Large number of features, including asbestos exposure, are given.324TextClassification2016A. Tanrikulu et al.
Parkinson's Vision-Based Pose Estimation Dataset2D human pose estimates of Parkinson's patients performing a variety of tasks.Camera shake has been removed from trajectories.134TextClassification, regression2017M. Li et al.
KEGG Metabolic Reaction Network (Undirected) DatasetNetwork of metabolic pathways. A reaction network and a relation network are given.Detailed features for each network node and pathway are given.65,554TextClassification, clustering, regression2011M. Naeem et al.
AlphaDent - Dataset of teeth pathologiesIntraoral DSLR photography with high resolution (>5000x3000 px)Nine class types include abrasion, fillings, crowns, and 6 caries classes1320Images, MasksInstance segmentation2025E.I. Sosnin, R.A. Solovyev et al.
Modified Human Sperm Morphology Analysis Dataset (MHSMA)Human sperm images from 235 patients with male factor infertility, labeled for normal or abnormal sperm acrosome, head, vacuole, and tail.Cropped around single sperm head. Magnification normalized. Training, validation, and test set splits created.1,540.npy filesClassification2019S. Javadi and S. A. Mirroshandel

Animal

Dataset nameBrief descriptionPreprocessingInstancesFormatDefault taskCreated (updated)ReferenceCreator
Abalone DatasetPhysical measurements of Abalone. Weather patterns and location are also given.None.4177TextRegression1995Marine Research Laboratories – Taroona
Zoo DatasetArtificial dataset covering 7 classes of animals.Animals are classed into 7 categories and features are given for each.101TextClassification1990R. Forsyth
Demospongiae DatasetData about marine sponges.503 sponges in the Demosponge class are described by various features.503TextClassification2010E. Armengol et al.
Farm animals dataPLF data inventory (cows, pigs; location, acceleration, etc.).Labeled datasets.List is constantly updatedTextClassification2020V. Bloch
Splice-junction Gene Sequences DatasetPrimate splice-junction gene sequences (DNA) with associated imperfect domain theory.None.3190TextClassification1992G. Towell et al.
Mice Protein Expression DatasetExpression levels of 77 proteins measured in the cerebral cortex of mice.None.1080TextClassification, Clustering2015C. Higuera et al.

Fungi

| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| UCI Mushroom Dataset | Mushroom attributes and classification. | Many properties of each mushroom are given. | 8124 | Text | Classification | 1987 | | J. Schlimmer |
| Secondary Mushroom Dataset | Mushroom attributes and classification. | Simulated data from larger and more realistic primary mushroom entries. Fully reproducible. | 61069 | Text | Classification | 2020 | | D. Wagner et al. |

Plant

Dataset nameBrief descriptionPreprocessingInstancesFormatDefault taskCreated (updated)ReferenceCreator
Forest Fires DatasetForest fires and their properties.13 features of each fire are extracted.517TextRegression2008P. Cortez et al.
Iris DatasetThree types of iris plants are described by 4 different attributes.None.150TextClassification1936R. Fisher
Plant Species Leaves DatasetSixteen samples of leaf each of one-hundred plant species.Shape descriptor, fine-scale margin, and texture histograms are given.1600TextClassification2012J. Cope et al.
Soybean DatasetDatabase of diseased soybean plants.35 features for each plant are given. Plants are classified into 19 categories.307TextClassification1988R. Michalski et al.
Seeds DatasetMeasurements of geometrical properties of kernels belonging to three different varieties of wheat.None.210TextClassification, clustering2012Charytanowicz et al.
Covertype DatasetData for predicting forest cover type strictly from cartographic variables.Many geographical features given.581,012TextClassification1998J. Blackard et al.
Abscisic Acid Signaling Network DatasetData for a plant signaling network. Goal is to determine set of rules that governs the network.None.300TextCausal-discovery2008J. Jenkens et al.
Folio Dataset20 photos of leaves for each of 32 species.None.637Images, textClassification, clustering2015T. Munisami et al.
Oxford Flower Dataset17 category dataset of flowers.Train/test splits, labeled images.1360Images, textClassification2006M.-E. Nilsback et al.
Plant Seedlings Dataset12 category dataset of plant seedlings.Labeled images, segmented images.5544ImagesClassification, detection2017Giselsson et al.
Fruits-360Database with images of 251 fruits, vegetables, nuts and seeds.100x100 pixels, white background.174700Images (jpg)Classification2017–2026Mihai Oltean
Weed-ID.AppDatabase with 1,025 species, 13,500+ images, and 120,000+ characteristicsVarying size and background. Labeled by PhD botanist.13,500Images, textClassification1999-2024Richard Old
CottonWeedDet3 DatasetA 3-class weed detection dataset for cotton cropping systems3 species of weeds.848ImagesClassification2022Rahman et al.

Microbe

| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ecoli Dataset | Protein localization sites. | Various features of the protein localization sites are given. | 336 | Text | Classification | 1996 | | K. Nakai et al. |
| MicroMass Dataset | Identification of microorganisms from mass-spectrometry data. | Various mass spectrometer features. | 931 | Text | Classification | 2013 | | P. Mahe et al. |
| Yeast Dataset | Prediction of cellular localization sites of proteins. | Eight features given per instance. | 1484 | Text | Classification | 1996 | | K. Nakai et al. |

Drug discovery

| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tox21 Dataset | Prediction of outcome of biological assays. | Chemical descriptors of molecules are given. | 12707 | Text | Classification | 2016 | | A. Mayr et al. |

Anomaly data

Dataset nameBrief descriptionPreprocessingInstancesFormatDefault taskCreated (updated)ReferenceCreator
Numenta Anomaly Benchmark (NAB)Data are ordered, timestamped, single-valued metrics. All data files contain anomalies, unless otherwise noted.None50+ filesCSVAnomaly detection2016 (continually updated)Numenta
Skoltech Anomaly Benchmark (SKAB)Each file represents a single experiment and contains a single anomaly. The dataset represents a multivariate time series collected from the sensors installed on the testbed.There are two markups for Outlier detection (point anomalies) and Changepoint detection (collective anomalies) problems30+ files (v0.9)CSVAnomaly detection2020 (continually updated)Iurii D. Katser and Vyacheslav O. Kozitsin
On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical StudyMost data files are adapted from UCI Machine Learning Repository data, some are collected from the literature.treated for missing values, numerical attributes only, different percentages of anomalies, labels1000+ filesARFFAnomaly detection2016 (possibly updated with new datasets and/or results)Campos et al.
Secure Water Treatment (SWaT)Data collected from a six-stage SWaT testbed. Contains data with normal conditions and anomaly (attack) conditionsWindow and smooth/average accordingly3 filesCSVAnomaly detection2016 (Last Updated - 2020)Jonathan Goh et al.
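
Several of the benchmarks above distribute ordered, timestamped, single-valued series as CSV files. As a rough illustration of how such a file might be scanned for point anomalies, the sketch below applies a modified z-score based on the median absolute deviation. This is a simple generic baseline chosen for illustration and is not the scoring method of any benchmark listed here. The column names `timestamp` and `value`, the sample data, the function name `flag_anomalies` and the threshold of 3.5 are all assumptions.

```python
import csv
import io
import statistics

# Hypothetical file contents; a "timestamp,value" layout is assumed.
SAMPLE_CSV = """timestamp,value
2016-01-01 00:00:00,10.0
2016-01-01 00:05:00,10.2
2016-01-01 00:10:00,9.9
2016-01-01 00:15:00,10.1
2016-01-01 00:20:00,25.0
2016-01-01 00:25:00,10.0
2016-01-01 00:30:00,9.8
"""


def flag_anomalies(rows, threshold=3.5):
    """Return timestamps whose modified z-score exceeds the threshold."""
    values = [float(row["value"]) for row in rows]
    median = statistics.median(values)
    # Median absolute deviation: a spread estimate that resists outliers.
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []  # No spread at all, so nothing can be flagged this way.
    return [
        row["timestamp"]
        for row, v in zip(rows, values)
        if 0.6745 * abs(v - median) / mad > threshold
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
anomalies = flag_anomalies(rows)
```

On the sample data only the reading of 25.0 is flagged. Real benchmark files are far longer and often non-stationary, so a windowed or model-based detector would normally replace this global statistic.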

Question answering data

This section includes datasets that deal with structured data.

Dataset nameBrief descriptionPreprocessingInstancesFormatDefault taskCreated (updated)ReferenceCreator
DBpedia Neural Question Answering (DBNQA) DatasetA large collection of question-to-SPARQL pairs specially designed for open-domain neural question answering over the DBpedia knowledge base.This dataset contains a large collection of Open Neural SPARQL Templates and instances for training Neural SPARQL Machines; it was pre-processed by semi-automatic annotation tools as well as by three SPARQL experts.894,499Question-query pairsQuestion Answering2018Hartmann, Soru, and Marx et al.
Vietnamese Question Answering Dataset (UIT-ViQuAD)A large collection of Vietnamese questions for evaluating MRC models.This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia.23,074Question-answer pairsQuestion Answering2020Nguyen et al.
Vietnamese Multiple-Choice Machine Reading Comprehension Corpus (ViMMRC)A collection of Vietnamese multiple-choice questions for evaluating MRC models.This corpus includes 2,783 Vietnamese multiple-choice questions.2,783Question-answer pairsQuestion Answering/Machine Reading Comprehension2020Nguyen et al.
Open-Domain Question Answering Goes Conversational via Question RewritingAn end-to-end open-domain question answering dataset.This dataset includes 14,000 conversations with 81,000 question-answer pairs.Context, Question, Rewrite, Answer, Answer_URL, Conversation_no, Turn_no, Conversation_sourceQuestion Answering2021Anantha and Vakulenko et al.
UnifiedQAQuestion-answer dataProcessed datasetQuestion Answering2020Khashabi et al.

Dialog or instruction prompted data

This section includes datasets that contain multi-turn text with at least two actors, a "user" and an "agent". The user makes requests of the agent, which carries them out.

Dataset nameBrief descriptionPreprocessingInstancesFormatDefault taskCreated (updated)ReferenceCreator
Taskmaster3 datasets with >55,000 spoken and written task-oriented dialogs in several domains.13,215 + 17,289 + 23,757 dialogs, in 6 + 7 + 1 task domains.1 and 2: conversation id, utterances, Instruction id 3: conversation id, utterances, vertical, scenario, instructions.Do the task.2019Byrne and Krishnamoorthi et al.
DrRepairLabeled dataset for program repair.Check format details in the .Do the task.2020Michihiro et al.
Super-NaturalInstructionsTasks specified in natural language.1,616 NLP tasks in 76 task types.Task definition in natural language instructions; example input/output.Do the task.2022Wang et al.
LAMBADANarrative passages where the last word is omitted.Guess the last word.2016Paperno et al.
FLANInstruction tuning data, with a mix of zero-shot, few-shot and chain-of-thought templates.Instruction tuning; do the task.2021Wei et al.

Cybersecurity

Dataset nameBrief descriptionPreprocessingInstancesFormatDefault taskCreated (updated)ReferenceCreator
MITRE ATTACKThe ATT&CK is a globally-accessible knowledge base of adversary tactics and techniques.Data can be downloaded from two GitHub repositories.MITRE ATTACK
CAPECCommon Attack Pattern Enumeration and ClassificationData can be downloaded from :CAPEC
CVECVE is a list of publicly disclosed cybersecurity vulnerabilities that is free to search, use, and incorporate into products and services.Data can be downloaded from:CVE
CWECommon Weakness Enumeration data.Data can be downloaded from: [permanent dead link]CWE
MalwareTextDBAnnotated database of malware texts.The contains the data to download.Kiat et al.
USENIX Security Symposium proceedingsCollection of security proceedings from USENIX Security Symposium – technical sessions from 1995 to 2022.This data is not pre-processed.USENIX Security Symposium
APTNotesCollection of public documents, whitepapers and articles about APT campaigns. All the documents are publicly available data.This data is not pre-processed.The of the project contains a file with links to the data stored in box. Data files can also be downloaded .APT Notes
arXiv Cryptography and Security papersCollection of articles about cybersecurityThis data is not pre-processed.All articles available .arXiv
Security eBooks for freeSmall collection of security eBooks, and security presentations publicly available.This data is not pre-processed.
National Cyber Security strategy repositoryRepository of worldwide strategy documents about cybersecurity.This data is not pre-processed.
Cyber Security Natural Language ProcessingData about cybersecurity strategies from more than 75 countries.Tokenization, removal of frequent uninformative words.Yanlin Chen, Yunjian Wei, Yifan Yu, Wen Xue, Xianya Qin
APT Reports collectionSample of APT reports, malware, technology, and intelligence collectionRaw and tokenized data available.All data is available in this repository.[citation needed]blackorbird
Offensive Language Identification Dataset (OLID)Data available in the . Data is also available .Zampieri et al.
Cyber reports from the National Cyber Security CentreThis data is not pre-processed.
APT reports by KasperskyThis data is not pre-processed.
The cyberwireThis data is not pre-processed.
Databreaches newsThis data is not pre-processed.
CybernewsThis data is not pre-processed.
BleepingcomputerThis data is not pre-processed.
TherecordThis data is not pre-processed.
HackreadThis data is not pre-processed.
SecurelistThis data is not pre-processed.
Stucco projectThe Stucco project collects data not typically integrated into security systems.This data is not pre-processed
FarsightsecurityWebsite with technical information, reports, and more about security topics.This data is not pre-processed
SchneierWebsite with academic papers about security topics.This data is not pre-processed
TrendmicroWebsite with research, news, and perspectives about security topics.This data is not pre-processed.
The Hacker NewsNews about cybersecurity topics.This data is not pre-processed
KrebsonsecuritySecurity news and investigationThis data is not pre-processed
Mitre DefendMatrix of Defend artifactsjson files
Mitre AtlasMitre Atlas is a knowledge base of adversary tactics, techniques, and case studies for machine learning (ML) systems based on real-world observations.This data is not pre-processed
Mitre EngageMITRE Engage is a framework for planning and discussing adversary engagement operations that empowers you to engage your adversaries and achieve your cybersecurity goals.This data is not pre-processed
Hacking TutorialsThis data is not pre-processed

Climate and sustainability

Dataset nameBrief descriptionPreprocessingInstancesFormatDefault taskCreated (updated)ReferenceCreator
TCFD reportsDatabase of company reports that include TCFD-related disclosures.This data is not pre-processedTCFD Knowledge Hub
Corporate Social Responsibility ReportsA listing of responsibility reports on the internet.This data is not pre-processedResponsibilityReports
The Intergovernmental Panel on Climate Change (IPCC)A collection of comprehensive assessment reports about knowledge on climate change, its causes, potential impacts and response optionsThis data is not pre-processedIPCC
Alliance for Research on Corporate SustainabilityThis data is not pre-processedARCS
ESG corpus: Knowledge Hub of the Accounting for SustainabilityThis data is not pre-processedMehra et al.
CLIMATE-FEVERA dataset adopting the FEVER methodology that consists of 1,535 real-world claims regarding climate change collected on the internet.Each claim is accompanied by five manually annotated evidence sentences retrieved from the English Wikipedia that support, refute or do not give enough information to validate the claim, totalling 7,675 claim-evidence pairs.Diggelmann et al.
Climate News datasetA dataset for NLP and climate change media researchersThe dataset is made up of a number of data artifacts (JSON, JSONL and CSV text files and an SQLite database).ADGEfficiency
ClimatextClimatext is a dataset for sentence-based climate change topic detection.University of Zurich
GreenBizCollection of articles and news about climate and sustainabilityThis data is not pre-processed
Top research pre-prints in climate and sustainabilityList of pre-prints from researchers in the Reuters Hot ListThis data is not pre-processedMaurice Tamman
ARCSThis data is not pre-processed
GreenBizWebsite with articles about climate and sustainabilityThis data is not pre-processedGreenBiz
CSRWIREThis data is not pre-processedCSRWIRE
CDPArticles about , , andThis data is not pre-processedCDP

Code data

Dataset nameBrief descriptionPreprocessingInstancesFormatDefault taskCreated (updated)ReferenceCreator
The StackA 3.1 TB dataset consisting of permissively licensed source code in 30 programming languages.Filtered through license detection and deduplication.6 TB, 51.76B files (prior to deduplication); 3 TB, 5.28B files (after). 358 programming languages.ParquetLanguage modeling, autocompletion, program synthesis.2022D. Kocetkov, R. Li, L. Ben Allal, L. von Werra, H. de Vries
LEMUR Neural Network DatasetThe structured repository of standardized neural network models designed to facilitate AutoML tasks and model analysis with LLMsFiltered through license detection and deduplication.PyTorch models.Python scripts.Image classification, object detection, image segmentation, and natural language processing.2024A. Goodarzi, R. Kochnev, W. Khalid, F. Qin, T. Uzun, Y. Dhameliya, Y. Kathiriya, Z. Bentyn, D. Ignatov, R. Timofte
GitHub repositoriesThis data is not pre-processedCurated list of repositories from GitHub
IBM Public GitHub repositoriesThis data is not pre-processedfrom GitHub
RedHat Public GitHub repositoriesThis data is not pre-processedfrom GitHub
StackExchange Public Archive.org filesThis data is not pre-processedfrom
Gitlab Public repositoriesThis data is not pre-processedCurated list of repositories from Gitlab:
Ansible Collections public repositoriesThis data is not pre-processedfrom GitHub.
CodeParrot GitHub Code DatasetThis data is not pre-processedCurated list of repositories from Hugging Face:
OKDThe Community Distribution of Kubernetes that powers Red Hat OpenShiftThis data is not pre-processed
OpenShiftThe developer and operations friendly Kubernetes distro
KubernetesThis data is not pre-processed
Red Hat DeveloperGitHub home of the Red Hat Developer programThis data is not pre-processed
Red Hat WorkshopsThis data is not pre-processed
Kubernetes SIGsThis data is not pre-processed
KonveyorThis data is not pre-processed
RedHat MarketplaceThis data is not pre-processed
Redhat blogThis data is not pre-processed
Kubernetes ioThis data is not pre-processed
Docs OpenshiftThis data is not pre-processed
cncf ioThis data is not pre-processed
Kubernetes presentationsList of publicly available Kubernetes presentationsThis data is not pre-processed
Red Hat Open Innovation LabsThis data is not pre-processed
Red Hat DemosThis data is not pre-processed
Red Hat OpenShift OnlineThis data is not pre-processed
Software CollectionsThis data is not pre-processed
Red Hat InsightsThis data is not pre-processed
Red Hat GovernmentThis data is not pre-processed
Red Hat ConsultingThis data is not pre-processed
Red Hat Communities of PracticeThis data is not pre-processed
Red Hat Partner TechThis data is not pre-processed
Red Hat DocumentationThis data is not pre-processed
IBMThis data is not pre-processed
IBM CloudThis data is not pre-processed
Build Lab TeamThis data is not pre-processed
Terraform IBM ModulesThis data is not pre-processed
Cloud SchematicsThis data is not pre-processed
OCP Power DemosThis data is not pre-processed
IBM App ModernizationThis data is not pre-processed
Kubernetes OperatorHubThis data is not pre-processed
Cloud Native Computing Foundation (CNCF)This data is not pre-processed
Operator FrameworkThis data is not pre-processed
GitHub repositories referenced in artifacthub.ioThis data is not pre-processed
Red Hat partnerThis data is not pre-processed
IBM RepositoriesThis data is not pre-processed
GitHub repositoriesThis data is not pre-processed
Red HatThis data is not pre-processed
Kubernetes PatternsThis data is not pre-processed
Kubernetes Deployment & Security PatternsThis data is not pre-processed
Kubernetes for Full-Stack DevelopersThis data is not pre-processed
Load Balancer Cloudwatch MetricsThis data is not pre-processed
DynatraceThis data is not pre-processed
AIOps Challenge 2020 DataThis data is not pre-processed
LoghubThis data is not pre-processed
HTML PagesThis data is not pre-processed
OpenShift ebooksThis data is not pre-processed
Kubernetes ebooksThis data is not pre-processed
List of public and licensed GitHub repositoriesThis data is not pre-processed

Multivariate data

Financial

Dataset nameBrief descriptionPreprocessingInstancesFormatDefault taskCreated (updated)ReferenceCreator
Dow Jones IndexWeekly data of stocks from the first and second quarters of 2011.Calculated values such as percentage change and lags are included.750Comma separated valuesClassification, regression, time series2014M. Brown et al.
Statlog (Australian Credit Approval)Credit card applications either accepted or rejected and attributes about the application.Attribute names are removed as well as identifying information. Factors have been relabeled.690Comma separated valuesClassification1987R. Quinlan
eBay auction dataAuction data from various eBay.com objects over various length auctionsContains all bids, bidderID, bid times, and opening prices.~ 550TextRegression, classification2012G. Shmueli et al.
Statlog (German Credit Data)Binary credit classification into "good" or "bad" with many features.Various financial features of each person are given.1000TextClassification1994H. Hofmann
Bank Marketing DatasetData from a large marketing campaign carried out by a large bank.Many attributes of the clients contacted are given. Whether the client subscribed is also given.45,211TextClassification2012S. Moro et al.
Istanbul Stock Exchange DatasetSeveral stock indexes tracked for almost two years.None.536TextClassification, regression2013O. Akbilgic
Default of Credit Card ClientsCredit default data for Taiwanese creditors.Various features about each account are given.30,000TextClassification2016I. Yeh
Stock movement prediction from tweets and historical stock pricesNoneTextNLP2018Yumo Xu and Shay B. Cohen

Weather

| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cloud DataSet | Data about 1024 different clouds. | Image features extracted. | 1024 | Text | Classification, clustering | 1989 | | P. Collard |
| El Nino Dataset | Oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific. | 12 weather attributes are measured at each buoy. | 178080 | Text | Regression | 1999 | | Pacific Marine Environmental Laboratory |
| Greenhouse Gas Observing Network Dataset | Time-series of greenhouse gas concentrations at 2921 grid cells in California created using simulations of the weather. | None. | 2921 | Text | Regression | 2015 | | D. Lucas |
| Atmospheric CO2 from Continuous Air Samples at Mauna Loa Observatory | Continuous air samples in Hawaii, USA. 44 years of records. | None. | 44 years | Text | Regression | 2001 | | Mauna Loa Observatory |
| Ionosphere Dataset | Radar data from the ionosphere. Task is to classify into good and bad radar returns. | Many radar features given. | 351 | Text | Classification | 1989 | | Johns Hopkins University |
| Ozone Level Detection Dataset | Two ground ozone level datasets. | Many features given, including weather conditions at time of measurement. | 2536 | Text | Classification | 2008 | | K. Zhang et al. |

Census

Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator
Adult Dataset | Census data from 1994 containing demographic features of adults and their income. | Cleaned and anonymized. | 48,842 | Comma separated values | Classification | 1996 | | United States Census Bureau
Census-Income (KDD) | Weighted census data from the 1994 and 1995 Current Population Surveys. | Split into training and test sets. | 299,285 | Comma separated values | Classification | 2000 | | United States Census Bureau
IPUMS Census Database | Census data from the Los Angeles and Long Beach areas. | None | 256,932 | Text | Classification, regression | 1999 | | IPUMS
US Census Data 1990 | Partial data from 1990 US census. | Results randomized and useful attributes selected. | 2,458,285 | Text | Classification, regression | 1990 | | United States Census Bureau

Transit

Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator
Bike Sharing Dataset | Hourly and daily count of rental bikes in a large city. | Many features, including weather, length of trip, etc., are given. | 17,389 | Text | Regression | 2013 | | H. Fanaee-T
New York City Taxi Trip Data | Trip data for yellow and green taxis in New York City. | Gives pick up and drop off locations, fares, and other details of trips. | 6 years | Text | Classification, clustering | 2015 | | New York City Taxi and Limousine Commission
Taxi Service Trajectory ECML PKDD | Trajectories of all taxis in a large city. | Many features given, including start and stop points. | 1,710,671 | Text | Clustering, causal-discovery | 2015 | | M. Ferreira et al.
METR-LA | Speed from loop detectors on the highways of Los Angeles County. | Average speed in 5-minute timesteps. | 7,094,304 from 207 sensors and 34,272 timesteps | Comma separated values | Regression, forecasting | 2014 | | Jagadish et al.
PeMS | Speed, flow, occupancy and other metrics from loop detectors and other sensors on the freeways of the State of California, USA. | Metrics usually aggregated by averaging into 5-minute timesteps. | 39,000 individual detectors, each containing years of time series | Comma separated values | Regression, forecasting, nowcasting, interpolation | (updated in real time) | | California Department of Transportation
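
The METR-LA and PeMS entries both describe detector readings averaged into 5-minute timesteps. The sketch below shows one simple way such an aggregation could be done, assuming readings arrive as (timestamp in seconds, value) pairs. The sample readings, the units, and the function name are hypothetical and do not reflect how either dataset was actually produced. The last line only checks the arithmetic behind the METR-LA instance count given in the table.

```python
from collections import defaultdict

BUCKET_SECONDS = 5 * 60  # width of one aggregation window


def average_into_buckets(readings, bucket_seconds=BUCKET_SECONDS):
    """Average (timestamp_seconds, value) pairs into fixed-width windows.

    Returns a dict mapping each window's start time to the mean of the
    values whose timestamps fall inside that window.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, value in readings:
        start = (timestamp // bucket_seconds) * bucket_seconds
        sums[start] += value
        counts[start] += 1
    return {start: sums[start] / counts[start] for start in sorted(sums)}


# Hypothetical raw speed readings from one detector: (seconds, mph).
raw = [(0, 60.0), (30, 62.0), (299, 58.0), (300, 40.0), (450, 50.0)]
five_minute_speeds = average_into_buckets(raw)

# The METR-LA instance count in the table is sensors times timesteps.
metr_la_instances = 207 * 34_272
```

Windows with no readings are simply absent from the result; a real pipeline would have to decide whether to leave such gaps or fill them in.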

Internet

Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator
Webpages from Common Crawl 2012 | Large collection of webpages and how they are connected via hyperlinks. | None. | 3.5B | Text | Clustering, classification | 2013 | | V. Granville
Internet Advertisements Dataset | Dataset for predicting if a given image is an advertisement or not. | Features encode geometry of ads and phrases occurring in the URL. | 3279 | Text | Classification | 1998 | | N. Kushmerick
Internet Usage Dataset | General demographics of internet users. | None. | 10,104 | Text | Classification, clustering | 1999 | | D. Cook
URL Dataset | 120 days of URL data from a large conference. | Many features of each URL are given. | 2,396,130 | Text | Classification | 2009 | | J. Ma
Phishing Websites Dataset | Dataset of phishing websites. | Many features of each site are given. | 2456 | Text | Classification | 2015 | | R. Mustafa et al.
Online Retail Dataset | Online transactions for a UK online retailer. | Details of each transaction given. | 541,909 | Text | Classification, clustering | 2015 | | D. Chen
Freebase Simple Topic Dump | Freebase is an online effort to structure all human knowledge. | Topics from Freebase have been extracted. | large | Text | Classification, clustering | 2011 | | Freebase
Farm Ads Dataset | The text of farm ads from websites. Binary approval or disapproval by content owners is given. | SVMlight sparse vectors of text words in ads calculated. | 4143 | Text | Classification | 2011 | | C. Masterharm et al.
The Pile | A collection assembled from several large datasets of diverse and unstructured texts. | Various (removing HTML and JavaScript from websites, removing duplicated sentences) | 825 GiB English text | JSON Lines | Natural Language Processing, Text Prediction | 2021 | | Gao et al.
OSCAR | Large collection of monolingual corpora extracted from web data (Common Crawl dumps) covering 150+ languages. | Various (filtering, language classification, adult-content detection and other labelling) | 3.4 TB English text, 1.4 TB Chinese text, 1.1 TB Russian text, 595 GB German text, 431 GB French text, and data for 150+ languages (figures for version 23.01) | JSON Lines | Natural Language Processing, Text Prediction | 2021 | | Ortiz Suarez, Abadji, Sagot et al.
OpenWebText | An open-source recreation of the WebText corpus. The text is web content extracted from URLs shared on Reddit with at least three upvotes. | Extracted non-HTML content, deduplicated, and tokenized. | 8,013,769 documents, 38 GB | Text | Natural Language Processing, Text Prediction | 2019 | | A. Gokaslan, V. Cohen
ROOTS | A well-documented and representative multilingual dataset with the explicit goal of doing good for and by the people whose data was collected. | Extracted non-HTML content, cleaned out UI and ads, deduplicated, removed PII, and tokenized. | 1.6 TB, 59 languages. | Parquet | Natural Language Processing, Text Prediction | 2022 | | H. Laurençon, L. Saulnier, T. Wang, C. Akiki, A. Villanova del Moral, T. Le Scao
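
Several of the text corpora in this table are listed as JSON Lines, a format in which every line of a file is a separate JSON value. The following sketch reads such a stream one record at a time, which matters for corpora too large to load into memory at once. The two sample records and their "text" and "meta" keys are assumptions chosen for illustration; each real corpus defines its own schema.

```python
import io
import json


def iter_json_lines(stream):
    """Yield one parsed object per non-empty line of a JSON Lines stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"invalid JSON on line {line_number}") from exc


# Hypothetical two-record sample; the "text" and "meta" keys are assumed.
sample = io.StringIO(
    '{"text": "first document", "meta": {"source": "example"}}\n'
    '\n'
    '{"text": "second document", "meta": {"source": "example"}}\n'
)

documents = list(iter_json_lines(sample))
total_characters = sum(len(doc["text"]) for doc in documents)
```

With a real file, the `io.StringIO` object would be replaced by an open file handle, and the records would usually be consumed inside the loop instead of being collected into a list.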

Games

Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator
Poker Hand Dataset | 5 card hands from a standard 52 card deck. | Attributes of each hand are given, including the Poker hands formed by the cards it contains. | 1,025,010 | Text | Regression, classification | 2007 | | R. Cattral
Connect-4 Dataset | Contains all legal 8-ply positions in the game of connect-4 in which neither player has won yet, and in which the next move is not forced. | None. | 67,557 | Text | Classification | 1995 | | J. Tromp
Chess (King-Rook vs. King) Dataset | Endgame Database for White King and Rook against Black King. | None. | 28,056 | Text | Classification | 1994 | | M. Bain et al.
Chess (King-Rook vs. King-Pawn) Dataset | King+Rook versus King+Pawn on a7. | None. | 3196 | Text | Classification | 1989 | | R. Holte
Tic-Tac-Toe Endgame Dataset | Binary classification for win conditions in tic-tac-toe. | None. | 958 | Text | Classification | 1991 | | D. Aha
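
The Tic-Tac-Toe Endgame entry can be explored without downloading anything, because the space of finished games is small enough to enumerate. The sketch below plays out every game with x moving first and collects the distinct boards at which play stops. It assumes standard rules and writes blank squares as "b"; the function names are illustrative and this is not the procedure used to build the dataset.

```python
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def winner(board):
    """Return 'x' or 'o' if that player has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != "b" and board[a] == board[b] == board[c]:
            return board[a]
    return None


def terminal_boards():
    """Enumerate every distinct board at which a game can end, x moving first."""
    found = set()
    seen = set()

    def play(board, player):
        if board in seen:
            return  # this position was already explored via another move order
        seen.add(board)
        if winner(board) is not None or "b" not in board:
            found.add(board)  # someone has won or the board is full
            return
        for i, cell in enumerate(board):
            if cell == "b":
                nxt = board[:i] + (player,) + board[i + 1:]
                play(nxt, "o" if player == "x" else "x")

    play(("b",) * 9, "x")
    return found


boards = terminal_boards()
x_wins = sum(1 for board in boards if winner(board) == "x")
```

Under these assumptions `len(boards)` comes to 958, matching the instance count in the table, and `x_wins` gives the number of boards that a "win for x" label would mark as positive.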

Other multivariate

Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator
Housing Data Set | Median home values of Boston with associated home and neighborhood attributes. | None. | 506 | Text | Regression | 1993 | | D. Harrison et al.
The Getty Vocabularies | Structured terminology for art and other material culture, archival materials, visual surrogates, and bibliographic materials. | None. | large | Text | Classification | 2015 | | Getty Center
Yahoo! Front Page Today Module User Click Log | User click log for news articles displayed in the Featured Tab of the Today Module on Yahoo! Front Page. | Conjoint analysis with a bilinear model. | 45,811,883 user visits | Text | Regression, clustering | 2009 | | Chu et al.
British Oceanographic Data Centre | Biological, chemical, physical and geophysical data for oceans. 22K variables tracked. | Various. | 22K variables, many instances | Text | Regression, clustering | 2015 | | British Oceanographic Data Centre
Congressional Voting Records Dataset | Voting data for all USA representatives on 16 issues. | Beyond the raw voting data, various other features are provided. | 435 | Text | Classification | 1987 | | J. Schlimmer
Entree Chicago Recommendation Dataset | Record of user interactions with Entree Chicago recommendation system. | Each user's usage of the app is recorded in detail. | 50,672 | Text | Regression, recommendation | 2000 | | R. Burke
Insurance Company Benchmark (COIL 2000) | Information on customers of an insurance company. | Many features of each customer and the services they use. | 9,000 | Text | Regression, classification | 2000 | | P. van der Putten
Nursery Dataset | Data from applicants to nursery schools. | Data about applicant's family and various other factors included. | 12,960 | Text | Classification | 1997 | | V. Rajkovic et al.
University Dataset | Data describing attributes of a large number of universities. | None. | 285 | Text | Clustering, classification | 1988 | | S. Sounders et al.
Blood Transfusion Service Center Dataset | Data from blood transfusion service center. Gives data on donors' return rate, frequency, etc. | None. | 748 | Text | Classification | 2008 | | I. Yeh
Record Linkage Comparison Patterns Dataset | Large dataset of records. Task is to link relevant records together. | Blocking procedure applied to select only certain record pairs. | 5,749,132 | Text | Classification | 2011 | | University of Mainz
Nomao Dataset | Nomao collects data about places from many different sources. Task is to detect items that describe the same place. | Duplicates labeled. | 34,465 | Text | Classification | 2012 | | Nomao Labs
Movie Dataset | Data for 10,000 movies. | Several features for each movie are given. | 10,000 | Text | Clustering, classification | 1999 | | G. Wiederhold
Open University Learning Analytics Dataset | Information about students and their interactions with a virtual learning environment. | None. | ~ 30,000 | Text | Classification, clustering, regression | 2015 | | J. Kuzilek et al.
Mobile phone records | Telecommunications activity and interactions. | Aggregation per geographical grid cells and every 15 minutes. | large | Text | Classification, clustering, regression | 2015 | | G. Barlacchi et al.

Curated repositories of datasets

Because datasets come in many formats and can be difficult to use, considerable work has gone into curating them and standardizing their formats to make them easier to use for machine learning research.

  • OpenML: Web platform with Python, R, Java, and other APIs for downloading hundreds of machine learning datasets, evaluating algorithms on datasets, and benchmarking algorithm performance against dozens of other algorithms.
  • PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms. Provides classification and regression datasets in a standardized format that are accessible through a Python API.
  • Metatext NLP: Community-maintained web repository containing nearly 1,000 benchmark datasets and counting. Covers many tasks, from classification to question answering, and many languages, including English, Portuguese, and Arabic.
  • Appen: Off-the-shelf and open-source datasets hosted and maintained by the company. These biological, image, physical, question answering, signal, sound, text, and video resources number over 250 and can be applied to over 25 different use cases.

See also