DataComp: In search of the next generation of multimodal datasets

Samir Yitzhak Gadre^*, Gabriel Ilharco^*, Alex Fang^*, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, Alexander Ratner, Shuran Song, Hannaneh Hajishirzi, Ali Farhadi, Romain Beaumont, Sewoong Oh, Alexandros G. Dimakis, Jenia Jitsev, Yair Carmon, Vaishaal Shankar, Ludwig Schmidt

^*Corresponding author for this work

Institut für Technische Informatik (4480)

Publication: Working paper › Preprint

Abstract

Large multimodal datasets have been instrumental in recent breakthroughs such as CLIP, Stable Diffusion, and GPT-4. At the same time, datasets rarely receive the same research attention as model architectures or training algorithms. To address this shortcoming in the machine learning ecosystem, we introduce DataComp, a participatory benchmark where the training code is fixed and researchers innovate by proposing new training sets. Concretely, we provide a testbed for dataset experiments centered around a new candidate pool of 12.8B image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists of multiple scales, with four candidate pool sizes and associated compute budgets ranging from 12.8M to 12.8B samples seen during training. This multi-scale design facilitates the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow is a promising direction for improving multimodal datasets. We introduce DataComp-1B, a dataset created using a simple filtering algorithm applied to the 12.8B candidate pool. The resulting 1.4B subset enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet. Our new ViT-L/14 model outperforms a larger ViT-g/14 trained on LAION-2B by 0.7 percentage points while requiring 9× less compute during training. We also outperform OpenAI’s CLIP ViT-L/14 by 3.7 percentage points, which is trained with the same compute budget as our model. These gains highlight the potential for improving model performance by carefully curating training sets. We view DataComp-1B as only the first step and hope that DataComp paves the way toward the next generation of multimodal datasets. 
We publicly release our datasets, associated tooling, filtering baselines, and our code for training and evaluating models at www.datacomp.ai.
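
The workflow described in the abstract (start from a candidate pool, apply a filter, then train and evaluate with fixed code) can be illustrated with a minimal sketch of the filtering step. Everything here is an assumption made for illustration: the `Pair` record, its `score` field, and the `filter_top_fraction` helper are hypothetical names, and keeping the top-scoring fraction of a pool is one generic filtering strategy, not necessarily the algorithm used to build DataComp-1B.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """Hypothetical image-text record; `score` stands in for any quality signal."""
    url: str
    caption: str
    score: float


def filter_top_fraction(pool: list[Pair], fraction: float) -> list[Pair]:
    """Return the highest-scoring `fraction` of the pool.

    For a non-empty pool at least one sample is kept.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    keep = max(1, math.floor(len(pool) * fraction))
    # Sort by descending score; ties keep their original relative order.
    ranked = sorted(pool, key=lambda p: p.score, reverse=True)
    return ranked[:keep]


# Toy candidate pool (made-up values, not real benchmark data).
pool = [
    Pair("u1", "a dog on a beach", 0.31),
    Pair("u2", "img_0042.jpg", 0.05),
    Pair("u3", "a red bicycle", 0.28),
    Pair("u4", "click here", 0.11),
]
subset = filter_top_fraction(pool, 0.5)
print([p.url for p in subset])  # prints ['u1', 'u3']
```

In the benchmark setting, the resulting subset would then be passed to the fixed training and evaluation code, so that differences in downstream accuracy can be attributed to the dataset rather than to the model or training recipe.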

Original language	English
Publication status	Published - 2023

Access to document

https://arxiv.org/pdf/2304.14108.pdf
License: CC BY 4.0

Cite this

Yitzhak Gadre, S, Ilharco, G, Fang, A, Hayase, J, Smyrnis, G, Nguyen, T, Marten, R, Wortsman, M, Ghosh, D, Zhang, J, Orgad, E, Entezari, R, Daras, G, Pratt, S, Ramanujan, V, Bitton, Y, Marathe, K, Mussmann, S, Vencu, R, Cherti, M, Krishna, R, Wei Koh, P, Saukh, O, Ratner, A, Song, S, Hajishirzi, H, Farhadi, A, Beaumont, R, Oh, S, G. Dimakis, A, Jitsev, J, Carmon, Y, Shankar, V & Schmidt, L 2023 'DataComp: In search of the next generation of multimodal datasets'. <https://arxiv.org/pdf/2304.14108.pdf>

@techreport{7190900ee7044afa9b58aa439e07ea51,

title = "DataComp: In search of the next generation of multimodal datasets",

abstract = "Large multimodal datasets have been instrumental in recent breakthroughs such as CLIP, Stable Diffusion, and GPT-4. At the same time, datasets rarely receive the same research attention as model architectures or training algorithms. To address this shortcoming in the machine learning ecosystem, we introduce DataComp, a participatory benchmark where the training code is fixed and researchers innovate by proposing new training sets. Concretely, we provide a testbed for dataset experiments centered around a new candidate pool of 12.8B image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists of multiple scales, with four candidate pool sizes and associated compute budgets ranging from 12.8M to 12.8B samples seen during training. This multi-scale design facilitates the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow is a promising direction for improving multimodal datasets. We introduce DataComp-1B, a dataset created using a simple filtering algorithm applied to the 12.8B candidate pool. The resulting 1.4B subset enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet. Our new ViT-L/14 model outperforms a larger ViT-g/14 trained on LAION-2B by 0.7 percentage points while requiring 9× less compute during training. We also outperform OpenAI{\textquoteright}s CLIP ViT-L/14 by 3.7 percentage points, which is trained with the same compute budget as our model. These gains highlight the potential for improving model performance by carefully curating training sets. We view DataComp-1B as only the first step and hope that DataComp paves the way toward the next generation of multimodal datasets. 
We publicly release our datasets, associated tooling, filtering baselines, and our code for training and evaluating models at www.datacomp.ai.",

author = "{Yitzhak Gadre}, Samir and Gabriel Ilharco and Alex Fang and Jonathan Hayase and Georgios Smyrnis and Thao Nguyen and Ryan Marten and Mitchell Wortsman and Dhruba Ghosh and Jieyu Zhang and Eyal Orgad and Rahim Entezari and Giannis Daras and Sarah Pratt and Vivek Ramanujan and Yonatan Bitton and Kalyani Marathe and Stephen Mussmann and Richard Vencu and Mehdi Cherti and Ranjay Krishna and {Wei Koh}, Pang and Olga Saukh and Alexander Ratner and Shuran Song and Hannaneh Hajishirzi and Ali Farhadi and Romain Beaumont and Sewoong Oh and {G. Dimakis}, Alexandros and Jenia Jitsev and Yair Carmon and Vaishaal Shankar and Ludwig Schmidt",

year = "2023",

language = "English",

type = "WorkingPaper",

}

TY - UNPB

T1 - DataComp: In search of the next generation of multimodal datasets

AU - Yitzhak Gadre, Samir

AU - Ilharco, Gabriel

AU - Fang, Alex

AU - Hayase, Jonathan

AU - Smyrnis, Georgios

AU - Nguyen, Thao

AU - Marten, Ryan

AU - Wortsman, Mitchell

AU - Ghosh, Dhruba

AU - Zhang, Jieyu

AU - Orgad, Eyal

AU - Entezari, Rahim

AU - Daras, Giannis

AU - Pratt, Sarah

AU - Ramanujan, Vivek

AU - Bitton, Yonatan

AU - Marathe, Kalyani

AU - Mussmann, Stephen

AU - Vencu, Richard

AU - Cherti, Mehdi

AU - Krishna, Ranjay

AU - Wei Koh, Pang

AU - Saukh, Olga

AU - Ratner, Alexander

AU - Song, Shuran

AU - Hajishirzi, Hannaneh

AU - Farhadi, Ali

AU - Beaumont, Romain

AU - Oh, Sewoong

AU - G. Dimakis, Alexandros

AU - Jitsev, Jenia

AU - Carmon, Yair

AU - Shankar, Vaishaal

AU - Schmidt, Ludwig

PY - 2023

Y1 - 2023

N2 - Large multimodal datasets have been instrumental in recent breakthroughs such as CLIP, Stable Diffusion, and GPT-4. At the same time, datasets rarely receive the same research attention as model architectures or training algorithms. To address this shortcoming in the machine learning ecosystem, we introduce DataComp, a participatory benchmark where the training code is fixed and researchers innovate by proposing new training sets. Concretely, we provide a testbed for dataset experiments centered around a new candidate pool of 12.8B image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists of multiple scales, with four candidate pool sizes and associated compute budgets ranging from 12.8M to 12.8B samples seen during training. This multi-scale design facilitates the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow is a promising direction for improving multimodal datasets. We introduce DataComp-1B, a dataset created using a simple filtering algorithm applied to the 12.8B candidate pool. The resulting 1.4B subset enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet. Our new ViT-L/14 model outperforms a larger ViT-g/14 trained on LAION-2B by 0.7 percentage points while requiring 9× less compute during training. We also outperform OpenAI’s CLIP ViT-L/14 by 3.7 percentage points, which is trained with the same compute budget as our model. These gains highlight the potential for improving model performance by carefully curating training sets. We view DataComp-1B as only the first step and hope that DataComp paves the way toward the next generation of multimodal datasets. 
We publicly release our datasets, associated tooling, filtering baselines, and our code for training and evaluating models at www.datacomp.ai.

M3 - Preprint

BT - DataComp: In search of the next generation of multimodal datasets

ER -
