2 datasets found

ATLAS Top Tagging Open Data Set
opendata.cern.ch
Updated 2022
Cite
ATLAS collaboration (2022). ATLAS Top Tagging Open Data Set [Dataset]. http://doi.org/10.7483/OPENDATA.ATLAS.FG5F.96GA
Unique identifier
https://doi.org/10.7483/OPENDATA.ATLAS.FG5F.96GA
Dataset updated
2022
Dataset provided by
CERN Open Data Portal
Authors
ATLAS collaboration
Description
Boosted top tagging is an essential binary classification task for experiments at the Large Hadron Collider (LHC) to measure the properties of the top quark. The ATLAS Top Tagging Open Data Set is a publicly available data set for the development of Machine Learning (ML) based boosted top tagging algorithms. The data are split into two orthogonal sets, named train and test and stored in the HDF5 file format, containing 42 million and 2.5 million jets respectively. Both sets are composed of equal parts signal (jets initiated by a boosted top quark) and background (jets initiated by light quarks or gluons). For each jet, the data set contains:
The four vectors of constituent particles
15 high level summary quantities evaluated on the jet
The four vector of the whole jet
A training weight
A signal (1) vs background (0) label.
There is one rule in using this data set: the contribution to a loss function from any jet should always be weighted by the training weight. Apart from this a model should separate the signal jets from background by whatever means necessary.
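The weighting rule can be illustrated with a minimal sketch of a weighted binary cross-entropy. The function name `weighted_bce`, the normalisation by the sum of weights, and the toy values are assumptions made for illustration; they are not taken from the data set or from any ATLAS code.

```python
import math


def weighted_bce(scores, labels, weights, eps=1e-12):
    """Binary cross-entropy where each jet's term is scaled by its training weight.

    scores  : predicted signal probabilities in [0, 1]
    labels  : 1 for signal, 0 for background
    weights : per-jet training weights
    Normalising by the sum of weights is an assumption of this sketch.
    """
    total = 0.0
    for p, y, w in zip(scores, labels, weights):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / sum(weights)


# Toy values for illustration only; not taken from the data set.
scores = [0.9, 0.2, 0.6]
labels = [1, 0, 1]
weights = [1.0, 0.5, 2.0]
loss = weighted_bce(scores, labels, weights)
```

A jet with weight zero contributes nothing to the loss, and a jet with weight two counts twice as much as a jet with weight one.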
Updated on July 26th 2024. This dataset has been superseded by a new dataset which also includes systematic uncertainties. Please use the new dataset instead of this one.
ATLAS top tagging open data set with systematic uncertainties
opendata.cern.ch
Updated 2024
Cite
ATLAS collaboration (2024). ATLAS top tagging open data set with systematic uncertainties [Dataset]. http://doi.org/10.7483/OPENDATA.ATLAS.SOAY.LABE
Unique identifier
https://doi.org/10.7483/OPENDATA.ATLAS.SOAY.LABE
Dataset updated
2024
Dataset provided by
CERN Open Data Portal
Authors
ATLAS collaboration
Description
Boosted top tagging is an essential binary classification task for experiments at the Large Hadron Collider (LHC) to measure the properties of the top quark. The ATLAS Top Tagging Open Data Set is a publicly available dataset for the development of Machine Learning (ML) based boosted top tagging algorithms. The dataset consists of a nominal piece used for the training and evaluation of algorithms, and a systematic piece used for estimating the size of systematic uncertainties produced by an algorithm. The nominal data are split into two orthogonal sets, named train and test. The systematically varied data are split into many more pieces that should only be used for evaluation in most cases. Both nominal sets are composed of equal parts signal (jets initiated by a boosted top quark) and background (jets initiated by light quarks or gluons).

A brief overview of these datasets is as follows. For more detailed information see arXiv:2407.20127.

train_nominal - 92,820,427 jets, equal parts signal and background

test_nominal - 10,306,813 jets, equal parts signal and background

esup - 10,032,472 jets with the cluster energy scale up systematic variation active, equal parts signal and background

esdown - 10,032,472 jets with the cluster energy scale down systematic variation active, equal parts signal and background

cer - 10,040,653 jets with the cluster energy resolution systematic variation active, equal parts signal and background

cpos - 10,032,472 jets with the cluster energy position systematic variation active, equal parts signal and background

teg - 7,421,204 jets with the track efficiency global systematic variation active, 30% signal jets

tej - 7,017,046 jets with the track efficiency in jets systematic variation active, 32% signal jets

tfl - 5,907,310 jets with the track fake rate loose systematic variation active, 18% signal jets

tfj - 6,977,371 jets with the track fake rate in jets systematic variation active, 32% signal jets

bias - 10,011,330 jets with the track bias systematic variation active, 52% signal jets

ttbar_pythia - 193,792 jets from Pythia simulated events containing Standard Model top-antitop quark pair production, all signal jets

ttbar_herwig - 180,811 jets from Herwig simulated events containing Standard Model top-antitop quark pair production, all signal jets

cluster - 5,000,004 jets simulated using the Sherpa cluster based hadronization model, all background jets

string - 5,000,001 jets simulated using the Lund string based hadronization model, all background jets

angular - 4,900,000 jets simulated using the Herwig angular ordered parton shower model, all background jets

dipole - 4,900,000 jets simulated using the Herwig dipole parton shower model, all background jets

For each jet, the datasets contain:

The four vectors of constituent particles

15 high level summary quantities evaluated on the jet

The four vector of the whole jet

A training weight (nominal only)

PYTHIA shower weights (nominal only)

A signal (1) vs background (0) label

There are two rules for using this data set: the contribution to a loss function from any jet should always be weighted by the training weight, and any performance claim is incomplete without an estimate of the systematic uncertainties via the method illustrated in this repository. The ideal model shows high performance but also small systematic uncertainties.
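The general idea of comparing a tagger between the nominal test set and a systematically varied set can be sketched as follows. This is not the method from the repository mentioned above, which is not reproduced here. The metric (background rejection at a fixed signal efficiency), the choice to recompute the threshold on each sample, the function names, and the toy scores are all assumptions made for illustration. The sketch is unweighted because the training weight is listed as nominal only.

```python
def background_rejection(scores, labels, signal_efficiency=0.5):
    """Return 1 / (background efficiency) at the score threshold that keeps
    roughly the requested fraction of signal jets in this sample."""
    signal = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    background = [s for s, y in zip(scores, labels) if y == 0]
    n_keep = max(1, round(signal_efficiency * len(signal)))
    threshold = signal[n_keep - 1]  # lowest score among the kept signal jets
    n_pass = sum(1 for s in background if s >= threshold)
    return float("inf") if n_pass == 0 else len(background) / n_pass


def relative_shift(nominal, varied):
    """Fractional change of a metric when a systematic variation is applied."""
    return (varied - nominal) / nominal


# Toy tagger outputs for illustration only; not taken from the dataset.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
nominal_scores = [0.9, 0.8, 0.7, 0.6, 0.85, 0.3, 0.2, 0.1]
varied_scores = [0.9, 0.7, 0.6, 0.5, 0.85, 0.75, 0.2, 0.1]

nominal_rej = background_rejection(nominal_scores, labels)
varied_rej = background_rejection(varied_scores, labels)
shift = relative_shift(nominal_rej, varied_rej)
```

Background rejection at fixed signal efficiency does not depend on the signal fraction of a sample, which matters here because several of the varied sets are not balanced between signal and background.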
