RDataSource.hxx
// Author: Enrico Guiraud, Danilo Piparo CERN 09/2017

/*************************************************************************
 * Copyright (C) 1995-2018, Rene Brun and Fons Rademakers.               *
 * All rights reserved.                                                  *
 *                                                                       *
 * For the licensing terms see $ROOTSYS/LICENSE.                         *
 * For the list of contributors see $ROOTSYS/README/CREDITS.             *
 *************************************************************************/

#ifndef ROOT_RDATASOURCE
#define ROOT_RDATASOURCE

#include <string_view>
#include "RtypesCore.h" // ULong64_t
#include "TString.h"

#include <algorithm> // std::transform
#include <cassert>
#include <memory> // std::unique_ptr
#include <string>
#include <typeinfo>
#include <vector>

namespace ROOT {
namespace RDF {
class RDataSource;
}
}

/// Print a RDataSource at the prompt
namespace cling {
std::string printValue(ROOT::RDF::RDataSource *ds);
} // namespace cling

namespace ROOT {

namespace Internal {
namespace TDS {

/// Base class of TTypedPointerHolder. Instances
/// of this class can be put in a container. Upon destruction,
/// the correct deletion of the pointer is performed in the
/// derived class.
class TPointerHolder {
protected:
   void *fPointer{nullptr};

public:
   TPointerHolder(void *ptr) : fPointer(ptr) {}
   void *GetPointer() { return fPointer; }
   void *GetPointerAddr() { return &fPointer; }
   virtual TPointerHolder *GetDeepCopy() = 0;
   virtual ~TPointerHolder() {}
};

/// Class to wrap a pointer and delete the memory associated to it
/// correctly
template <typename T>
class TTypedPointerHolder final : public TPointerHolder {
public:
   TTypedPointerHolder(T *ptr) : TPointerHolder((void *)ptr) {}

   TPointerHolder *GetDeepCopy() final
   {
      const auto typedPtr = static_cast<T *>(fPointer);
      return new TTypedPointerHolder(new T(*typedPtr));
   }

   ~TTypedPointerHolder() { delete static_cast<T *>(fPointer); }
};

} // ns TDS
} // ns Internal

namespace RDF {

// clang-format off
/**
\class ROOT::RDF::RDataSource
\ingroup dataframe
\brief RDataSource defines an API that RDataFrame can use to read arbitrary data formats.

A concrete RDataSource implementation (i.e. a class that inherits from RDataSource and implements all of its pure
virtual methods) provides an adaptor that RDataFrame can leverage to read any kind of tabular data format.
RDataFrame calls into RDataSource to retrieve information about the data, retrieve (thread-local) readers or "cursors"
for selected columns and to advance the readers to the desired data entry.

The sequence of calls that RDataFrame (or any other client of a RDataSource) performs is the following:

 - SetNSlots() : inform RDataSource of the desired level of parallelism
 - GetColumnReaders() : retrieve from RDataSource per-thread readers for the desired columns
 - Initialize() : inform RDataSource that an event-loop is about to start
 - GetEntryRanges() : retrieve from RDataSource a set of ranges of entries that can be processed concurrently
 - InitSlot() : inform RDataSource that a certain thread is about to start working on a certain range of entries
 - SetEntry() : inform RDataSource that a certain thread is about to start working on a certain entry
 - FinalizeSlot() : inform RDataSource that a certain thread finished working on a certain range of entries
 - Finalize() : inform RDataSource that an event-loop finished

RDataSource implementations must support running multiple event-loops on the same dataset, one after the other (never concurrently).
 - \b SetNSlots() is called once per RDataSource object, typically when it is associated to a RDataFrame.
 - \b GetColumnReaders() can be called several times, potentially with the same arguments, also in-between event-loops, but not during an event-loop.
 - \b GetEntryRanges() will be called several times, including during an event-loop, as additional ranges are needed. It will not be called concurrently.
 - \b Initialize() and \b Finalize() are called once per event-loop, right before starting and right after finishing.
 - \b InitSlot(), \b SetEntry(), and \b FinalizeSlot() can be called concurrently from multiple threads, multiple times per event-loop.

 Advanced users who plan to implement a custom RDataSource can check out existing implementations, e.g. RCsvDS or RNTupleDS.
 See the inheritance diagram below for the full list of existing concrete implementations.
*/
class RDataSource {
   // clang-format on
protected:
   using Record_t = std::vector<void *>;
   friend std::string cling::printValue(::ROOT::RDF::RDataSource *);

   virtual std::string AsString() { return "generic data source"; }

   unsigned int fNSlots{};

public:
   RDataSource() = default;
   // Rule of five
   RDataSource(const RDataSource &) = delete;
   RDataSource &operator=(const RDataSource &) = delete;
   RDataSource(RDataSource &&) = delete;
   RDataSource &operator=(RDataSource &&) = delete;
   virtual ~RDataSource() = default;

   // clang-format off
   /// \brief Inform RDataSource of the number of processing slots (i.e. worker threads) used by the associated RDataFrame.
   /// Slot numbers are used to simplify parallel execution: RDataFrame guarantees that different threads will always
   /// pass different slot values when calling methods concurrently.
   // clang-format on
   virtual void SetNSlots(unsigned int nSlots)
   {
      assert(fNSlots == 0);
      assert(nSlots > 0);
      fNSlots = nSlots;
   }

   /// \brief Returns the number of files from which the dataset is constructed
   virtual std::size_t GetNFiles() const { return 0; }

   // clang-format off
   /// \brief Returns a reference to the collection of the dataset's column names
   // clang-format on
   virtual const std::vector<std::string> &GetColumnNames() const = 0;

   /// \brief Checks if the dataset has a certain column
   /// \param[in] colName The name of the column
   virtual bool HasColumn(std::string_view colName) const = 0;

   // clang-format off
   /// \brief Type of a column as a string, e.g. `GetTypeName("x") == "double"`. Required for jitting e.g. `df.Filter("x>0")`.
   /// \param[in] colName The name of the column
   // clang-format on
   virtual std::string GetTypeName(std::string_view colName) const = 0;

   // clang-format off
   /// Called at most once per column by RDF. Return vector of pointers to pointers to column values - one per slot.
   /// \tparam T The type of the data stored in the column
   /// \param[in] columnName The name of the column
   ///
   /// These pointers are veritable cursors: it is the responsibility of the RDataSource implementation to make them
   /// point to the "right" memory region.
   // clang-format on
   template <typename T>
   std::vector<T **> GetColumnReaders(std::string_view columnName)
   {
      auto typeErasedVec = GetColumnReadersImpl(columnName, typeid(T));
      std::vector<T **> typedVec(typeErasedVec.size());
      std::transform(typeErasedVec.begin(), typeErasedVec.end(), typedVec.begin(),
                     [](void *p) { return static_cast<T **>(p); });
      return typedVec;
   }

   /// If the other GetColumnReaders overload returns an empty vector, this overload will be called instead.
   /// \param[in] slot The data processing slot that needs to be considered
   /// \param[in] name The name of the column for which a column reader needs to be returned
   /// \param[in] tid A type_info
   /// At least one of the two must return a non-empty/non-null value.
   virtual std::unique_ptr<ROOT::Detail::RDF::RColumnReaderBase>
   GetColumnReaders(unsigned int /*slot*/, std::string_view /*name*/, const std::type_info &)
   {
      return {};
   }

   // clang-format off
   /// \brief Return ranges of entries to distribute to tasks.
   /// They are required to be contiguous intervals with no entries skipped. Supposing a dataset with nEntries, the
   /// intervals must start at 0 and end at nEntries, e.g. [0-5],[5-10] for 10 entries.
   /// This function will be invoked repeatedly by RDataFrame as it needs additional entries to process.
   /// The same entry range should not be returned more than once.
   /// Returning an empty collection of ranges signals to RDataFrame that the processing can stop.
   // clang-format on
   virtual std::vector<std::pair<ULong64_t, ULong64_t>> GetEntryRanges() = 0;

   // clang-format off
   /// \brief Advance the "cursors" returned by GetColumnReaders to the selected entry for a particular slot.
   /// \param[in] slot The data processing slot that needs to be considered
   /// \param[in] entry The entry which needs to be pointed to by the reader pointers
   /// Slots are adopted to accommodate parallel data processing.
   /// Different workers will loop over different ranges and
   /// will be labelled by different "slot" values.
   /// Returns *true* if the entry has to be processed, *false* otherwise.
   // clang-format on
   virtual bool SetEntry(unsigned int slot, ULong64_t entry) = 0;

   // clang-format off
   /// \brief Convenience method called before starting an event-loop.
   /// This method might be called multiple times over the lifetime of a RDataSource, since
   /// users can run multiple event-loops with the same RDataFrame.
   /// Ideally, `Initialize` should set the state of the RDataSource so that multiple identical event-loops
   /// will produce identical results.
   // clang-format on
   virtual void Initialize() {}

   // clang-format off
   /// \brief Convenience method called at the start of the data processing associated to a slot.
   /// \param[in] slot The data processing slot which needs to be initialized
   /// \param[in] firstEntry The first entry of the range that the task will process.
   /// This method might be called multiple times per thread per event-loop.
   // clang-format on
   virtual void InitSlot(unsigned int /*slot*/, ULong64_t /*firstEntry*/) {}

   // clang-format off
   /// \brief Convenience method called at the end of the data processing associated to a slot.
   /// \param[in] slot The data processing slot which needs to be finalized
   /// This method might be called multiple times per thread per event-loop.
   // clang-format on
   virtual void FinalizeSlot(unsigned int /*slot*/) {}

   // clang-format off
   /// \brief Convenience method called after concluding an event-loop.
   /// See Initialize for more details.
   // clang-format on
   virtual void Finalize() {}

   /// \brief Return a string representation of the datasource type.
   /// The returned string will be used by ROOT::RDF::SaveGraph() to represent
   /// the datasource in the visualization of the computation graph.
   /// Concrete datasources can override the default implementation.
   virtual std::string GetLabel() { return "Custom Datasource"; }

protected:
   /// type-erased vector of pointers to pointers to column values - one per slot
   virtual Record_t GetColumnReadersImpl(std::string_view name, const std::type_info &) = 0;
};

251
252} // ns ROOT
253
254/// Print a RDataSource at the prompt
255namespace cling {
256inline std::string printValue(ROOT::RDF::RDataSource *ds)
257{
258 return ds->AsString();
259}
260} // namespace cling
261
262#endif // ROOT_TDATASOURCE
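
The call sequence documented for RDataSource can be illustrated without ROOT. What follows is a minimal sketch, assuming a hypothetical `ToySource` class that only mirrors the RDataSource method names over a single in-memory column of doubles. It does not inherit from the real class, and `RunEventLoop` is an illustrative sequential driver, not RDataFrame's actual scheduler. Entry ranges are treated as half-open intervals, which is how this sketch reads the `[0-5],[5-10]` example in the GetEntryRanges() documentation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in: mirrors RDataSource method names, but is not derived from it.
class ToySource {
   std::vector<double> fData;            // a single in-memory column
   std::vector<const double *> fCursors; // one cursor per slot
   std::uint64_t fNext = 0;              // first entry not yet handed out
   unsigned int fNSlots = 0;

public:
   explicit ToySource(std::vector<double> data) : fData(std::move(data)) {}

   void SetNSlots(unsigned int nSlots)
   {
      assert(fNSlots == 0 && nSlots > 0); // expected to be called once per object
      fNSlots = nSlots;
      fCursors.assign(nSlots, nullptr);
   }

   // One pointer-to-cursor per slot; SetEntry() later redirects the cursors.
   std::vector<const double **> GetColumnReaders()
   {
      std::vector<const double **> readers;
      for (auto &cursor : fCursors)
         readers.push_back(&cursor);
      return readers;
   }

   // Reset the state so that repeated event-loops give identical results.
   void Initialize() { fNext = 0; }

   // At most one range of up to two entries per slot; an empty vector means "stop".
   std::vector<std::pair<std::uint64_t, std::uint64_t>> GetEntryRanges()
   {
      std::vector<std::pair<std::uint64_t, std::uint64_t>> ranges;
      const std::uint64_t nEntries = fData.size();
      for (unsigned int s = 0; s < fNSlots && fNext < nEntries; ++s) {
         const std::uint64_t end = std::min<std::uint64_t>(fNext + 2, nEntries);
         ranges.emplace_back(fNext, end);
         fNext = end;
      }
      return ranges;
   }

   void InitSlot(unsigned int /*slot*/, std::uint64_t /*firstEntry*/) {}
   bool SetEntry(unsigned int slot, std::uint64_t entry)
   {
      fCursors[slot] = &fData[entry];
      return true;
   }
   void FinalizeSlot(unsigned int /*slot*/) {}
   void Finalize() {}
};

// Illustrative driver: follows the documented call order, one range after the other.
double RunEventLoop(ToySource &src, const std::vector<const double **> &readers)
{
   double sum = 0.;
   src.Initialize();
   for (auto ranges = src.GetEntryRanges(); !ranges.empty(); ranges = src.GetEntryRanges()) {
      unsigned int slot = 0; // a real client could process these ranges concurrently
      for (const auto &range : ranges) {
         src.InitSlot(slot, range.first);
         for (auto entry = range.first; entry < range.second; ++entry) {
            if (src.SetEntry(slot, entry))
               sum += **readers[slot];
         }
         src.FinalizeSlot(slot);
         ++slot;
      }
   }
   src.Finalize();
   return sum;
}
```

In this sketch SetNSlots() and GetColumnReaders() are called once by the caller, after which `RunEventLoop` can be invoked repeatedly on the same object, mirroring the requirement that several event-loops can run one after the other.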
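
The templated GetColumnReaders() overload converts the type-erased `Record_t` returned by GetColumnReadersImpl() into typed pointers to pointers. Below is a minimal standalone sketch of that conversion, assuming hypothetical helper names `EraseCursors` and `TypedReaders` that are not part of ROOT; only the `std::transform` plus `static_cast` step is taken from the header.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using Record_t = std::vector<void *>;

// Hypothetical stand-in for GetColumnReadersImpl(): returns the address of each
// per-slot cursor, erased to void *.
inline Record_t EraseCursors(std::vector<int *> &cursors)
{
   Record_t erased;
   for (auto &cursor : cursors)
      erased.push_back(static_cast<void *>(&cursor));
   return erased;
}

// Same conversion as in the templated GetColumnReaders(): void * back to T **.
template <typename T>
std::vector<T **> TypedReaders(const Record_t &typeErasedVec)
{
   std::vector<T **> typedVec(typeErasedVec.size());
   std::transform(typeErasedVec.begin(), typeErasedVec.end(), typedVec.begin(),
                  [](void *p) { return static_cast<T **>(p); });
   return typedVec;
}
```

Because each reader stores the address of a cursor rather than the address of a value, the data source can redirect a cursor to another entry and every holder of the reader sees the new value on the next dereference.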