Pipeline overview#
This page explains the end-to-end pipeline that uproot-custom uses to turn
ROOT binary data into awkward arrays. Understanding this pipeline will help
you design your own factories and readers.
At a high level the pipeline has five stages:
Build factories — uproot-custom reads streamer information and recursively creates a tree of
Factoryinstances that mirror the class hierarchy.Build readers — each factory creates a corresponding
Reader(Python or C++).Read binary data — the composed reader graph walks the byte buffer.
Return raw data — leaf readers return
numpyarrays; parent readers assemble them into nested tuples / lists.Build awkward arrays — factories convert the raw arrays into
awkwardcontents and form the finalawkwardarray.
Stage 1: Build factory instances#
When a branch is read, AsCustom calls uproot_custom.factories.build_factory.
This function loops over all registered factory classes (highest priority first)
and invokes each one’s build_factory class method. The first class that
returns a non-None instance wins.
uproot_custom.factories.build_factory#def build_factory(
cur_streamer_info: dict,
all_streamer_info: dict,
item_path: str = "",
**kwargs,
) -> "Factory":
fName = cur_streamer_info["fName"]
top_type_name = (
get_top_type_name(cur_streamer_info["fTypeName"])
if "fTypeName" in cur_streamer_info
else None
)
if not kwargs.get("called_from_top", False):
item_path = f"{item_path}.{fName}"
for factory_class in sorted(registered_factories, key=lambda x: x.priority(), reverse=True):
factory_instance = factory_class.build_factory(
top_type_name,
cur_streamer_info,
all_streamer_info,
item_path,
**kwargs,
)
if factory_instance is not None:
return factory_instance
raise ValueError(f"Unknown type: {cur_streamer_info['fTypeName']} for {item_path}")
Recursive factory construction#
For composite classes such as TSimpleObject, AnyClassFactory matches and
then recursively calls build_factory for every data member listed in the
streamer information:
AnyClassFactory.build_factory#class AnyClassFactory(GroupFactory):
...
@classmethod
def build_factory(
cls,
top_type_name,
cur_streamer_info,
all_streamer_info,
item_path,
**kwargs,
):
sub_streamers: list = all_streamer_info[top_type_name]
sub_factories = [build_factory(s, all_streamer_info, item_path) for s in sub_streamers]
return cls(name=top_type_name, sub_factories=sub_factories)
The resulting factory tree for TSimpleObject
(using the streamer information as input) looks
like this:
{
"factory": AnyClassFactory,
"name": "TSimpleObject",
"sub_factories": [
{"factory": TObjectFactory, "name": "TObject"},
{"factory": PrimitiveFactory, "name": "m_int", "dtype": "int32"},
{"factory": STLStringFactory, "name": "m_str"},
{"factory": CStyleArrayFactory,"name": "m_arr_int", "flat_size": 5},
{"factory": STLSeqFactory, "name": "m_vec_double", "element": "PrimitiveFactory(float64)"},
{"factory": STLMapFactory, "name": "m_map_int_double", "key/val": "Primitive/Primitive"},
{"factory": STLMapFactory, "name": "m_map_str_str", "key/val": "STLString/TString"},
{"factory": TStringFactory, "name": "m_tstr"},
{"factory": TArrayFactory, "name": "m_tarr_int", "dtype": "int32"},
],
}
Stage 2: Build readers#
Once the factory tree is ready, AsCustom calls Factory.build_python_reader
(or Factory.build_cpp_reader when using the default C++ backend) on the root
factory. Each factory delegates to its sub-factories to build sub-readers, then
combines them into a parent reader.
Important
The default reader backend is C++ (uproot_custom.factories.reader_backend = "cpp").
During development, you must explicitly switch to the Python backend:
import uproot_custom.factories as fac
fac.reader_backend = "python"
AnyClassFactory.build_python_reader#class AnyClassFactory(GroupFactory):
...
def build_python_reader(self):
sub_readers = [s.build_python_reader() for s in self.sub_factories]
return uproot_custom.readers.python.AnyClassReader(self.name, sub_readers)
Stage 3 & 4: Read binary data and return results#
The top-level reader drives sub-readers recursively. For instance,
AnyClassReader reads its fNBytes + fVersion header, then asks each
sub-reader to consume its portion of the buffer:
AnyClassReader.read method (Python)#def read(self, buffer):
fNBytes = buffer.read_fNBytes()
start_pos = buffer.cursor
end_pos = start_pos + fNBytes
buffer.skip_fVersion()
for reader in self.element_readers:
reader.read(buffer)
assert buffer.cursor == end_pos, (
f"AnyClassReader({self.name}): Invalid read length!"
)
After all entries are read, the top-level reader’s data() method collects
results recursively — leaf readers return numpy arrays, and composite
readers assemble them into nested tuples or lists:
PrimitiveReader.data — leaf#def data(self):
return np.frombuffer(self._data.tobytes(), dtype=self.dtype)
AnyClassReader.data — composite#def data(self):
return [reader.data() for reader in self.element_readers]
Stage 5: Build awkward arrays#
With the nested arrays available in Python, AsCustom calls
Factory.make_awkward_content to reconstruct awkward contents. Each factory
extracts its slice of the raw data, delegates to sub-factories, then combines
the results:
GroupFactory.make_awkward_content#class GroupFactory(Factory):
...
def make_awkward_content(self, raw_data):
sub_configs = self.sub_factories
sub_fields = []
sub_contents = []
for s_fac, s_data in zip(sub_configs, raw_data):
if isinstance(s_fac, TObjectFactory) and not s_fac.keep_data:
continue
sub_fields.append(s_fac.name)
sub_contents.append(s_fac.make_awkward_content(s_data))
return awkward.contents.RecordArray(sub_contents, sub_fields)
Generating awkward forms#
awkward forms describe the data structure without holding data, enabling
lazy evaluation with dask. Form generation mirrors make_awkward_content but
needs no input data:
GroupFactory.make_awkward_form#class GroupFactory(Factory):
...
def make_awkward_form(self):
sub_configs = self.sub_factories
sub_fields = []
sub_contents = []
for s_fac in sub_configs:
if isinstance(s_fac, TObjectFactory) and not s_fac.keep_data:
continue
sub_fields.append(s_fac.name)
sub_contents.append(s_fac.make_awkward_form())
return ak.forms.RecordForm(sub_contents, sub_fields)
Complete example: putting it all together#
The five stages above may feel abstract. Below is a concrete, end-to-end
example that walks through every stage. We use a demo class
TOverrideStreamer that overrides the default ROOT Streamer method by
inserting an extra mask between its two data members.
TOverrideStreamer (C++)#class TOverrideStreamer : public TObject {
private:
int m_int{ 0 };
double m_double{ 0.0 };
ClassDef( TOverrideStreamer, 1 );
};
Its Streamer method is overridden to insert an extra mask (0x12345678)
between m_int and m_double:
Streamer method#void TOverrideStreamer::Streamer( TBuffer& b ) {
if ( b.IsReading() ) {
TObject::Streamer( b );
b >> m_int;
unsigned int mask;
b >> mask; // additionally read a mask
if ( mask != 0x12345678 ) { /* error */ }
b >> m_double;
} else {
TObject::Streamer( b );
b << m_int;
unsigned int mask = 0x12345678;
b << mask;
b << m_double;
}
}
Because of this override, the binary layout is:
Content |
Type |
Size |
|---|---|---|
|
— |
10 bytes |
|
|
4 bytes |
mask |
|
4 bytes |
|
|
8 bytes |
The built-in reader cannot handle the extra mask, so we must implement a custom reader and factory.
Stage 1 — Factory#
The factory must implement four methods. build_factory matches on the
class name and returns an instance; the remaining methods build readers and
convert raw data to awkward arrays.
OverrideStreamerFactory — complete factory implementation# 1import awkward
2import awkward.contents
3import awkward.forms
4from uproot_custom import Factory
5
6
7class OverrideStreamerFactory(Factory):
8 @classmethod
9 def build_factory(cls, top_type_name, cur_streamer_info,
10 all_streamer_info, item_path, **kwargs):
11 if cur_streamer_info["fName"] != "TOverrideStreamer":
12 return None
13 return cls(cur_streamer_info["fName"])
14
15 def build_python_reader(self):
16 # Stage 2 — create the reader
17 return OverrideStreamerReader(self.name)
18
19 def make_awkward_content(self, raw_data):
20 # Stage 5a — convert raw numpy arrays to awkward content
21 int_array, double_array = raw_data
22 return awkward.contents.RecordArray(
23 [awkward.contents.NumpyArray(int_array),
24 awkward.contents.NumpyArray(double_array)],
25 ["m_int", "m_double"],
26 )
27
28 def make_awkward_form(self):
29 # Stage 5b — describe the data layout for dask
30 return awkward.forms.RecordForm(
31 [awkward.forms.NumpyForm("int32"),
32 awkward.forms.NumpyForm("float64")],
33 ["m_int", "m_double"],
34 )
Stage 2 — Reader#
The reader implements the binary decoding logic. It reads every entry from
the byte buffer and accumulates values in Python array objects, then
returns them as numpy arrays.
OverrideStreamerReader — Python reader# 1from array import array
2import numpy as np
3from uproot_custom.readers.python import IReader
4
5
6class OverrideStreamerReader(IReader):
7 def __init__(self, name):
8 super().__init__(name)
9 self.m_ints = array("i") # int32
10 self.m_doubles = array("d") # float64
11
12 def read(self, buffer):
13 # Stage 3 — read binary data
14 buffer.skip_TObject() # skip base class
15 self.m_ints.append(buffer.read_int32()) # m_int
16
17 mask = buffer.read_uint32() # custom mask
18 if mask != 0x12345678:
19 raise RuntimeError(f"Unexpected mask: {mask:#x}")
20
21 self.m_doubles.append(buffer.read_double()) # m_double
22
23 def data(self):
24 # Stage 4 — return raw numpy arrays
25 return np.asarray(self.m_ints), np.asarray(self.m_doubles)
Register and read#
With the factory and reader defined, register them and read data with Uproot:
1import uproot
2import uproot_custom
3import uproot_custom.factories as fac
4
5# During development, use the Python backend
6fac.reader_backend = "python"
7
8# Register the target branch and the custom factory
9uproot_custom.AsCustom.target_branches |= {
10 "/my_tree:override_streamer",
11}
12uproot_custom.registered_factories.add(OverrideStreamerFactory)
13
14# Read the data — Uproot will automatically invoke our factory/reader
15arr = uproot.open("demo_data.root")["my_tree:override_streamer"].array()
16arr.m_int # <Array [0, 1, 2, ...] type='100 * int32'>
17arr.m_double # <Array [0.0, 3.14, 6.28, ...] type='100 * float64'>
Tip
In a real project, put the factory, reader, and registration code inside a
Python package (see Project setup). The registration is typically
done in the package’s __init__.py so that it happens automatically on
import.
See also
For a full walkthrough — including binary-data inspection and porting to C++ — see Example 1: Streamer method is overridden.
Next step
Now that you understand the pipeline, move on to Investigate your data to learn how to inspect streamer information and raw binary bytes for the class you need to read.