template<typename... Args>
class ROOT::Experimental::Internal::ML::RClusterLoader< Args >
Loads TTree/RNTuple clusters from one or more RDataFrames into RFlat2DMatrix buffers for ML training and validation.
Overview
At construction the loader scans the cluster boundaries of every provided RDataFrame and stores them as a flat list of RClusterRange objects. SplitDataset() then partitions those ranges into training and validation sets according to validationSplit.
The split strategy depends on whether shuffling is enabled:
- Unshuffled: one cut is made so that the first (1 - validationSplit) fraction of entries goes to training. At most one cluster is split at the boundary.
- Shuffled: each cluster is split proportionally (according to validationSplit) so that both sets draw entries from every part of the dataset. ShuffleTrainingClusters() and ShuffleValidationClusters() re-order the cluster lists at the start of each epoch. A second shuffling step, at the entry level, happens inside LoadTrainingClusterInto() and LoadValidationClusterInto() when the data is loaded into the tensors.
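The two strategies above can be illustrated with a minimal, self-contained sketch. It is not the RClusterLoader implementation: `ClusterRange`, `SplitUnshuffled` and `SplitShuffled` are hypothetical names, the ranges are assumed to be half-open row intervals, and rounding the training count down is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for RClusterRange: a half-open row interval [start, end).
struct ClusterRange {
   std::size_t rdfIdx;
   std::uint64_t start;
   std::uint64_t end;
   std::uint64_t Size() const { return end - start; }
};

// first = training ranges, second = validation ranges
using Split = std::pair<std::vector<ClusterRange>, std::vector<ClusterRange>>;

// Unshuffled: one cut over the concatenated entries; at most one cluster is split.
Split SplitUnshuffled(const std::vector<ClusterRange> &ranges, float validationSplit)
{
   std::uint64_t total = 0;
   for (const auto &r : ranges)
      total += r.Size();
   // Number of leading entries assigned to training (truncation is an assumption).
   const auto nTrain = static_cast<std::uint64_t>(total * (1.0 - validationSplit));

   Split out;
   std::uint64_t seen = 0; // entries covered by the clusters visited so far
   for (const auto &r : ranges) {
      if (seen + r.Size() <= nTrain) {
         out.first.push_back(r); // entirely before the cut
      } else if (seen >= nTrain) {
         out.second.push_back(r); // entirely after the cut
      } else {
         // The cut falls inside this cluster: split it in two.
         const std::uint64_t cut = r.start + (nTrain - seen);
         out.first.push_back({r.rdfIdx, r.start, cut});
         out.second.push_back({r.rdfIdx, cut, r.end});
      }
      seen += r.Size();
   }
   return out;
}

// Shuffled: every cluster is split proportionally, so both sets sample the whole dataset.
Split SplitShuffled(const std::vector<ClusterRange> &ranges, float validationSplit)
{
   Split out;
   for (const auto &r : ranges) {
      const auto nTrain = static_cast<std::uint64_t>(r.Size() * (1.0 - validationSplit));
      const std::uint64_t cut = r.start + nTrain;
      if (cut > r.start)
         out.first.push_back({r.rdfIdx, r.start, cut});
      if (cut < r.end)
         out.second.push_back({r.rdfIdx, cut, r.end});
   }
   return out;
}
```

With three clusters of 40, 40 and 20 rows and a validation fraction of 0.25, the unshuffled variant splits only the second cluster (at row 75), whereas the shuffled variant cuts every cluster at 75 % of its length.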
Filtered RDataFrames
When any RDataFrame carries a filter, the true entry count is not known until the computation graph is executed. In this case SplitDataset() is a no-op and the split is discovered lazily inside LoadTrainingClusterInto() during the first epoch. After the first epoch FinaliseSplitDiscovery() marks the split as stable and all subsequent epochs use the same pre-computed ranges.
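The lazy discovery described above can be pictured with a toy model. This is a sketch under stated assumptions, not the actual class: `LazySplitSketch`, `RowRange`, `LoadTrainingCluster` and `FilterCalls` are hypothetical, the filter is modelled as a plain predicate on the row index, and the per-cluster cut is assumed to be placed after the first (1 - validationSplit) fraction of surviving rows.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Hypothetical half-open row interval [start, end).
struct RowRange {
   std::uint64_t start;
   std::uint64_t end;
};

// Toy model of a loader that learns the train/validation split during the first epoch.
class LazySplitSketch {
public:
   LazySplitSketch(float validationSplit, std::function<bool(std::uint64_t)> filter)
      : fValidationSplit(validationSplit), fFilter(std::move(filter))
   {
   }

   // Returns the number of training rows found in [startRow, endRow).
   std::size_t LoadTrainingCluster(std::uint64_t startRow, std::uint64_t endRow)
   {
      if (!fDiscovered) {
         // First epoch: the surviving rows are only known after running the filter.
         std::vector<std::uint64_t> passed;
         for (std::uint64_t row = startRow; row < endRow; ++row) {
            ++fFilterCalls;
            if (fFilter(row))
               passed.push_back(row);
         }
         const auto nTrain = static_cast<std::size_t>(passed.size() * (1.0 - fValidationSplit));
         // Cut at the first surviving row that belongs to validation.
         const std::uint64_t cut = nTrain < passed.size() ? passed[nTrain] : endRow;
         fTraining[startRow] = Entry{RowRange{startRow, cut}, nTrain};
         fValidation[startRow] = Entry{RowRange{cut, endRow}, passed.size() - nTrain};
      }
      // Later epochs reuse the ranges and counts recorded during the first epoch.
      return fTraining.at(startRow).count;
   }

   void FinaliseSplitDiscovery() { fDiscovered = true; }
   bool IsSplitDiscovered() const { return fDiscovered; }

   std::size_t GetNumValidationEntries() const
   {
      std::size_t n = 0;
      for (const auto &kv : fValidation)
         n += kv.second.count;
      return n;
   }

   // How many times the filter predicate has been evaluated so far.
   std::size_t FilterCalls() const { return fFilterCalls; }

private:
   struct Entry {
      RowRange range;
      std::size_t count;
   };

   float fValidationSplit;
   std::function<bool(std::uint64_t)> fFilter;
   bool fDiscovered = false;
   std::size_t fFilterCalls = 0;
   std::map<std::uint64_t, Entry> fTraining;   // keyed by the cluster's first row
   std::map<std::uint64_t, Entry> fValidation; // keyed by the cluster's first row
};
```

In this model the second call for the same cluster, made after FinaliseSplitDiscovery(), returns the recorded count without evaluating the filter again, which mirrors the "stable split" behaviour described above.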
Definition at line 149 of file RClusterLoader.hxx.
|
| | RClusterLoader (std::vector< ROOT::RDF::RNode > &rdfs, const std::vector< std::string > &cols, const std::vector< std::size_t > &vecSizes, float vecPadding, float validationSplit, bool shuffle, std::size_t setSeed) |
| void | FinaliseSplitDiscovery () |
| | Mark the train/val split as finalised after the first epoch.
|
| std::size_t | GetNmTotalClusters () const |
| std::size_t | GetNumChunkCols () const |
| std::size_t | GetNumTrainingClusters () const |
| std::size_t | GetNumTrainingEntries () const |
| std::size_t | GetNumValidationClusters () const |
| std::size_t | GetNumValidationEntries () const |
| const std::vector< RClusterRange > & | GetTrainingClusters () const |
| const std::vector< RClusterRange > & | GetValidationClusters () const |
| bool | IsSplitDiscovered () const |
| void | LoadClusterInto (RFlat2DMatrix &dest, std::size_t rdfIdx, std::uint64_t startRow, std::uint64_t endRow, std::size_t rowOffset=0) |
| std::size_t | LoadTrainingClusterInto (RFlat2DMatrix &dest, std::size_t rdfIdx, std::uint64_t startRow, std::uint64_t endRow, std::size_t rowOffset=0) |
| | Load one training cluster and return the number of rows written.
|
| void | LoadValidationClusterInto (RFlat2DMatrix &dest, std::size_t rdfIdx, std::uint64_t startRow, std::uint64_t endRow, std::size_t rowOffset=0) |
| | Load one validation cluster into dest starting at rowOffset.
|
| void | ShuffleTrainingClusters (std::size_t epochIdx) |
| | Re-order training clusters for the upcoming epoch.
|
| void | ShuffleValidationClusters (std::size_t epochIdx) |
| | Re-order validation clusters for the upcoming epoch.
|
| void | SplitDataset () |
| | Distribute the clusters into training and validation datasets. No-op for filtered RDataFrames, where the split is discovered lazily during the first epoch.
|
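ShuffleTrainingClusters() and ShuffleValidationClusters() take an epoch index, and the constructor takes a seed. One way such a reproducible per-epoch re-ordering could be obtained is sketched below. How the class actually combines the seed with epochIdx is not stated on this page; seeding the generator with `seed + epochIdx` is purely an assumption, and `ShuffledClusterOrder` is a hypothetical helper.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Returns a permutation of the cluster indices 0..nClusters-1 for the given epoch.
// Assumption: the generator is seeded with seed + epochIdx, so the same (seed, epoch)
// pair reproduces the same order within one standard library implementation.
std::vector<std::size_t> ShuffledClusterOrder(std::size_t nClusters, std::size_t seed, std::size_t epochIdx)
{
   std::vector<std::size_t> order(nClusters);
   std::iota(order.begin(), order.end(), std::size_t{0}); // 0, 1, 2, ...
   std::mt19937_64 rng(seed + epochIdx);
   std::shuffle(order.begin(), order.end(), rng);
   return order;
}
```

Calling the helper twice with the same seed and epoch index yields the same permutation, so a training run can be repeated with an identical cluster order.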