Feed ROOT data directly into models for machine learning training.
RDataLoader streams ROOT data into machine learning frameworks as batches ready for training. It takes any RDataFrame as input, giving you access to the full ROOT ecosystem for filtering, defining new variables and applying selections; it delivers batches of your dataset for NumPy, TensorFlow and PyTorch through a simple iteration interface.
RDataLoader is part of ROOT.Experimental.ML and is currently experimental. The API may change between ROOT releases.
A one-page quick reference covering the API.
RDataLoader takes an RDataFrame as input. This means your data preparation (selecting events, computing new variables, applying cuts, etc.) all happens before the loader is created, using the full power of RDataFrame:
Then pass your RDataFrame to RDataLoader:
The sections below explain how to configure the loader and get the most out of it.
columns selects which branches to load. target names the label column; it is returned separately as y when you iterate, so you don't need to split it out manually:
You can also pass multiple targets:
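The separation of features and labels can be illustrated without ROOT. The following is a minimal pure-Python sketch of the idea, not RDataLoader's implementation: the helper `split_features_targets` and the sample column names are hypothetical, and it assumes that X holds the columns that are not targets.

```python
def split_features_targets(rows, columns, target):
    """Split row dicts into feature rows X and label rows y.

    `target` may be one column name or a list of names; every target
    must also be listed in `columns`.
    """
    targets = [target] if isinstance(target, str) else list(target)
    missing = [t for t in targets if t not in columns]
    if missing:
        raise ValueError(f"targets not in columns: {missing}")
    # Assumption: features are the selected columns minus the targets.
    features = [c for c in columns if c not in targets]
    X = [[row[c] for c in features] for row in rows]
    y = [[row[t] for t in targets] for row in rows]
    return X, y


rows = [
    {"pt": 25.0, "eta": 0.3, "is_signal": 1, "mass": 91.2},
    {"pt": 40.5, "eta": -1.1, "is_signal": 0, "mass": 88.7},
]

# Single target: y has one value per event.
X, y = split_features_targets(rows, ["pt", "eta", "is_signal"], "is_signal")

# Multiple targets: y has one value per target per event.
X2, y2 = split_features_targets(
    rows, ["pt", "eta", "is_signal", "mass"], ["is_signal", "mass"]
)
```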
target must appear in the columns list.
batch_size controls how many events are in each batch. batches_in_memory controls how many batches are held in the shuffle buffer at any time:
batches_in_memory ↑ - larger shuffle buffer, better randomisation, higher memory use
batches_in_memory ↓ - lower memory use, limited shuffle
Shuffling is enabled by default. To make runs reproducible, fix the seed:
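To see why the buffer size limits how well events are mixed, and why a fixed seed makes the order repeatable, consider this pure-Python sketch of a bounded shuffle buffer. It is an illustration under assumptions, not the loader's actual algorithm: the function `shuffled_batches` is hypothetical, and it assumes the buffer is filled completely, shuffled, and then drained.

```python
import random


def shuffled_batches(events, batch_size, batches_in_memory, seed=None):
    """Yield batches drawn from a bounded, shuffled buffer of events."""
    rng = random.Random(seed)  # a fixed seed makes the order reproducible
    capacity = batch_size * batches_in_memory
    buffer = []
    for event in events:
        buffer.append(event)
        if len(buffer) == capacity:
            # Events can only be mixed with others in the same buffer fill.
            rng.shuffle(buffer)
            for start in range(0, capacity, batch_size):
                yield buffer[start:start + batch_size]
            buffer = []
    # Flush whatever is left; the last batch may be shorter.
    rng.shuffle(buffer)
    for start in range(0, len(buffer), batch_size):
        yield buffer[start:start + batch_size]


run_a = list(shuffled_batches(range(20), batch_size=4, batches_in_memory=2, seed=42))
run_b = list(shuffled_batches(range(20), batch_size=4, batches_in_memory=2, seed=42))
```

With the same seed the two runs produce identical batches. Because the buffer holds only eight events here, the first two batches can contain only events 0 to 7; a larger buffer would let events from further apart in the input end up in the same batch.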
ROOT branches that store variable-length arrays must be declared with a maximum size. Shorter entries are zero-padded and the branch is expanded into numbered columns:
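The padding and expansion step can be sketched in plain Python. This is an illustration only: the helper `expand_vector_column` is hypothetical, the `name_index` naming of the numbered columns is an assumption rather than the loader's documented scheme, and truncating entries longer than the maximum size is likewise assumed.

```python
def expand_vector_column(rows, name, max_size):
    """Pad or truncate a variable-length column into fixed numbered columns."""
    expanded = []
    for row in rows:
        # Assumption: entries longer than max_size are truncated.
        values = list(row[name])[:max_size]
        # Shorter entries are zero-padded up to max_size.
        values += [0.0] * (max_size - len(values))
        flat = {k: v for k, v in row.items() if k != name}
        for i, value in enumerate(values):
            flat[f"{name}_{i}"] = value  # hypothetical naming scheme
        expanded.append(flat)
    return expanded


rows = [
    {"n_jets": 2, "jet_pt": [50.0, 30.0]},
    {"n_jets": 1, "jet_pt": [20.0]},
    {"n_jets": 4, "jet_pt": [90.0, 70.0, 40.0, 25.0]},
]
flat_rows = expand_vector_column(rows, "jet_pt", max_size=3)
```

Every output row now has the same three columns `jet_pt_0`, `jet_pt_1` and `jet_pt_2`, which is what allows the events to be stacked into a rectangular batch.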
columns must appear in max_vec_sizes.
Yields torch.Tensor batches:
Move tensors to GPU by passing a device:
Returns a tf.data.Dataset of tf.Tensor batches:
Yields np.ndarray batches:
Pass test_size to split the dataset into two loaders, each covering a fraction of the original dataset (no data is duplicated):
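The property that matters here is that the two parts are disjoint and together cover every event. A pure-Python sketch of such a partition follows; the function `split_indices` is hypothetical, and shuffling the indices before splitting is an assumption about one reasonable approach, not a description of how RDataLoader splits.

```python
import random


def split_indices(n_events, test_size, seed=None):
    """Partition range(n_events) into disjoint train and test index lists."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be a fraction between 0 and 1")
    indices = list(range(n_events))
    random.Random(seed).shuffle(indices)
    n_test = round(n_events * test_size)
    # Each index lands in exactly one of the two lists.
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(100, test_size=0.2, seed=1)
```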
train_test_split twice:
By default the loader reads data lazily, one chunk of data at a time. For small datasets that fit in memory and will be iterated many times, eager loading pays a one-time cost at construction and then serves batches every epoch from memory:
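The trade-off between the two modes can be made concrete with a toy model that counts how often the underlying source is read. This is a conceptual sketch only: `ChunkSource` and `iterate_epochs` are hypothetical stand-ins and do not reflect how RDataLoader reads ROOT files.

```python
class ChunkSource:
    """Stand-in for a file-backed dataset that counts how often it is read."""

    def __init__(self, chunks):
        self._chunks = chunks
        self.reads = 0

    def read_chunks(self):
        for chunk in self._chunks:
            self.reads += 1
            yield chunk


def iterate_epochs(source, n_epochs, eager=False):
    """Return all events seen over n_epochs, lazily or from an in-memory copy."""
    # Eager mode reads everything once up front (the one-time cost).
    cached = list(source.read_chunks()) if eager else None
    seen = []
    for _ in range(n_epochs):
        # Lazy mode goes back to the source on every epoch.
        chunks = cached if eager else source.read_chunks()
        for chunk in chunks:
            seen.extend(chunk)
    return seen


lazy_source = ChunkSource([[1, 2], [3, 4], [5]])
eager_source = ChunkSource([[1, 2], [3, 4], [5]])
lazy_seen = iterate_epochs(lazy_source, n_epochs=3)
eager_seen = iterate_epochs(eager_source, n_epochs=3, eager=True)
```

Both modes deliver the same events, but the lazy source is read on every epoch while the eager one is read only once, at the price of holding the full dataset in memory.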
Correct class imbalance by oversampling the minority or undersampling the majority. You can do this by passing two RDataFrames:
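The two resampling strategies can be sketched in plain Python to show what each does to the class counts. The function `rebalance` and its `strategy` argument are hypothetical and are not part of the RDataLoader API; the sketch assumes the majority class has at least as many events as the minority class and that the goal is an exact 1:1 ratio.

```python
import random


def rebalance(majority, minority, strategy="oversample", seed=None):
    """Return a combined list with equal counts from both classes."""
    rng = random.Random(seed)
    if strategy == "oversample":
        # Draw minority events with replacement until the sizes match.
        extra = rng.choices(minority, k=len(majority) - len(minority))
        minority = minority + extra
    elif strategy == "undersample":
        # Keep a random subset of the majority, without replacement.
        majority = rng.sample(majority, k=len(minority))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return majority + minority


background = [("bkg", i) for i in range(10)]
signal = [("sig", i) for i in range(3)]

oversampled = rebalance(background, signal, "oversample", seed=0)
undersampled = rebalance(background, signal, "undersample", seed=0)
```

Oversampling keeps every majority event and repeats minority events, so the dataset grows; undersampling discards majority events, so the dataset shrinks.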
load_eager=True).
If your dataset has a weight column, pass its name to weights. It is returned as a third value w alongside X and y:
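A common use for the per-event weights is to scale each event's contribution to the loss. The sketch below shows a weighted mean squared error in plain Python; the function name is hypothetical, the numbers are made up for illustration, and in practice you would use the weighted loss provided by your training framework.

```python
def weighted_mean_squared_error(predictions, targets, weights):
    """Average squared error where each event counts in proportion to its weight."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    weighted = sum(
        w * (p - t) ** 2 for p, t, w in zip(predictions, targets, weights)
    )
    return weighted / total_weight


# Stand-ins for one batch: labels y, weights w, and model outputs y_pred.
y = [1.0, 0.0, 1.0]
w = [2.0, 1.0, 1.0]
y_pred = [0.5, 0.0, 1.0]

loss = weighted_mean_squared_error(y_pred, y, w)
unweighted = weighted_mean_squared_error(y_pred, y, [1.0, 1.0, 1.0])
```

The only mispredicted event carries twice the weight of the others, so the weighted loss is larger than the unweighted one.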