Reader and factory interface#
uproot-custom uses a reader/factory mechanism to balance performance and flexibility. readers are implemented in C++, do the actual reading from binary stream. factorys are implemented in Python, manage readers, and reconstruct the final awkward array.
Reader interface#
Base class: IElementReader#
reader in C++ should derive from the IElementReader interface, which has two pure virtual methods:
read: called by parentreaderto read data from the binary stream.data: called after reading is done, to return the read-out data innumpyarrays or any Python nested containers filled withnumpyarrays.
There are extra 3 virtual methods that can be overridden to handle special cases:
read_many: called when reading multiple elements in one go, such as reading c-style arrays. Some classes may have only one header with multiple elements following it. In this case,read_manycan be overridden to read handle such case.read_until: called when reading elements until a certain position in the binary stream.read_many_memberwise: called when reading multiple elements in a member-wise fashion. “Member-wise” means that the first member of all elements are read first, then the second member of all elements are read, and so on. This is used when reading STL containers in some specific cases.
IElementReader# 1class IElementReader {
2 protected:
3 const std::string m_name;
4
5 public:
6 IElementReader( std::string name ) : m_name( name ) {}
7 virtual ~IElementReader() = default;
8
9 virtual const std::string name() const { return m_name; }
10
11 virtual void read( BinaryBuffer& buffer ) = 0;
12 virtual py::object data() const = 0;
13
14 virtual uint32_t read_many( BinaryBuffer& buffer, const int64_t count ) {
15 for ( int32_t i = 0; i < count; i++ ) { read( buffer ); }
16 return count;
17 }
18
19 virtual uint32_t read_until( BinaryBuffer& buffer, const uint8_t* end_pos ) {
20 uint32_t cur_count = 0;
21 while ( buffer.get_cursor() < end_pos )
22 {
23 read( buffer );
24 cur_count++;
25 }
26 return cur_count;
27 }
28
29 virtual uint32_t read_many_memberwise( BinaryBuffer& buffer, const int64_t count ) {
30 if ( count < 0 )
31 {
32 std::stringstream msg;
33 msg << name() << "::read_many_memberwise with negative count: " << count;
34 throw std::runtime_error( msg.str() );
35 }
36 return read_many( buffer, count );
37 }
38};
Note
A
std::string nameis required for eachreader, which is used to print debug information.Several extended virtual methods are also provided. These methods will be called when reading c-style arrays or STL containers.
BinaryBuffer class#
BinaryBuffer is a helper class to read binary data from a uint8_t* buffer. It provides methods to read basic types, skip bytes, and handle fNBytes+fVersion header.
Reading methods include:
const T read<T>(): Read a value of typeTfrom the buffer, and advance the cursor.const int16_t read_fVersion(): Equivalent toread<int16_t>().const uint32_t read_fNBytes(): ReadfNBytesfrom the buffer, check the mask, and return the actual number of bytes.const std::string read_null_terminated_string(): Read a null-terminated string from the buffer.const std::string read_obj_header(): Read the object header from the buffer, return the object’s name if present. Can be used whenObjectHeaderReaderis not used.const std::string read_TString(): Read aTStringfrom the buffer
Skipping methods include:
void skip(const size_t nbytes): Skipnbytesbytes.void skip_fVersion(): Skip thefVersion(2 bytes).void skip_fNBytes(): Equivalent toread_fNBytes(), will check the mask.void skip_null_terminated_string(): Skip a null-terminated string.void skip_obj_header(): Skip the object header.void skip_TObject(): Skip aTObject.
Other methods include:
const uint8_t* get_data() const: Get the start of the data buffer.const uint8_t* get_cursor() const: Get the current cursor position.const uint32_t* get_offsets() const: Get the entry offsets of the data buffer.const uint64_t* entries() const: Get the number of entries of the data buffer.void debug_print( const size_t n = 100 ) const: Print the nextnbytes from the current cursor for debugging.
Accept sub-readers#
When reading nested classes, a reader may require sub-readers to read the nested data members. For example, STLSeqReader needs a sub-reader to read the element type of the STL sequence, such as int, float, or a custom class. In this case, the sub-reader can be passed to the constructor of the parent reader.
In the constructor, the sub-reader should be passed as a std::shared_ptr<IElementReader>. This is because other readers are constructed in Python, so the ownership of the reader should be shared between C++ and Python.
uproot-custom.hh already define a type alias using SharedReader = shared_ptr<IElementReader>; for convenience.
Transform std::vector to numpy array without copying#
When returning data in data method, reader can use make_array helper function to transform std::shared_ptr<std::vector<T>> to numpy array without copying:
std::shared_ptr<std::vector<int>> data = std::make_shared<std::vector<int>>();
data->push_back(1);
data->push_back(2);
data->push_back(3);
py::array_t<int> np_array = make_array(data);
Declaring reader to Python#
uproot-custom uses pybind11 to declare C++ readers to Python. A helper function declare_reader is provided to simplify the declaration. When implementing your own reader, you should declare it to Python like this:
PYBIND11_MODULE( my_cpp_reader, m) {
declare_reader<MyReaderClass, constructor_arg1_type, constructor_arg2_type, ...>(m, "MyReaderClass");
}
The
constructor_argN_typeare the types of the constructor arguments of yourreader. If yourreaderhas no constructor arguments, you can omit them.The second argument of
declare_readeris the name of yourreaderin Python. In this example, it is"MyReaderClass".
Then you can import MyReaderClass in Python:
from my_cpp_reader import MyReaderClass
Debugging message#
uproot-custom provides a debug_print method to print debugging message. The print will only be performed when UPROOT_DEBUG macro is defined, or UPROOT_DEBUG environment variable is set:
// Will print "The reader name is Bob"
debug_print("The reader name is %s", "Bob");
// Call buffer.debug_print(50), print next 50 bytes from current cursor
debug_print( buffer, 50 )
Factory interface#
factory in Python should derive from the BaseFactory interface, which has four methods to be implemented:
build_factory: Match the data member and instantiate thefactoryif matched, otherwise returnNone.build_cpp_reader: Called to create the C++reader.make_awkward_content: Called to reconstruct the finalawkwardcontent with the raw data read by the C++reader.make_awkward_form: Called to generate theawkwardform.
To select the appropriate factory for a data member, uproot-custom loops over all registered factory classes, and calls their build_factory method. The first non-None return value will be used.
Constructor#
The constructor of factory should receive all necessary parameters for the factory to build C++ reader, make awkward content and form. The constructor should at least receive a name parameter, which is usually the fName in the streamer info.
Class method build_factory#
This method is called when instatiating factories. It should be a class method.
It receives following parameters:
top_type_name: str: The top-level type name of current data member.Any
std::prefixes will be stripped. For example, forstd::vector<std::map<int, float>>, thetop_type_nameisvector.cur_streamer_info: dict: The streamer infodictof the current data member.factorycan use this information to decide whether it can handle this node, and to generate the configurationdict.An example of
cur_streamer_infois:{'@fUniqueID': 0, '@fBits': 16777216, 'fName': 'm_int', 'fTitle': '', 'fType': 3, 'fSize': 4, 'fArrayLength': 0, 'fArrayDim': 0, 'fMaxIndex': array([0, 0, 0, 0, 0], dtype='>i4'), 'fTypeName': 'int'}
For other type of data members, such as STL containers or nested classes, some other attributes may be present.
all_streamer_info: dict: Adictmapping all available streamer names to their members’ streamer infodict.factorycan use this information to look up the streamer info of any nested classes.For example, you can retrieve the streamer information of
TSimpleObjectlike:>>> all_streamer_info["TSimpleObject"] [{'@fUniqueID': 0, '@fBits': 16777216, 'fName': 'TObject', 'fTitle': 'Basic ROOT object', 'fType': 66, 'fSize': 0, 'fArrayLength': 0, 'fArrayDim': 0, 'fMaxIndex': array([ 0, -1877229523, 0, 0, 0], dtype='>i4'), 'fTypeName': 'BASE', 'fBaseVersion': 1}, {'@fUniqueID': 0, '@fBits': 16777216, 'fName': 'm_int', 'fTitle': '', 'fType': 3, 'fSize': 4, 'fArrayLength': 0, 'fArrayDim': 0, 'fMaxIndex': array([0, 0, 0, 0, 0], dtype='>i4'), 'fTypeName': 'int'}, ... ]
And use it to build sub-factories for nested data members:
sub_factories = [] for member in all_streamer_info["TSimpleObject"]: sub_fac = build_factory(member) sub_factories.append(sub_fac)
item_path: str: The absolute path from the root node to the current data member.It is useful when some special handling is needed for certain nodes.
**kwargs: Any extra keyword arguments that might be needed.
When current data member is not suitable for the factory, it should return None, so that uproot-custom will try next factory, until one return an instance of itself.
When current data member is suitable for the factory, it should return an instance of itself, with all necessary parameters passed to the constructor.
Method build_cpp_reader#
This method is called to instatiate the C++ reader. For non-bottom-level factories, it should also instatiate sub-readers for nested data members and combine them together to the parent reader.
Method make_awkward_content#
This method is called to construct awkward content with given raw data read by the C++ reader.
It receives following parameters:
raw_data: Any: The raw data read by the C++reader, returned by itsdatamethod.
The factory should return awkward.contents.Content object.
See also
Refer to awkward direct constructors for more details about awkward contents.
Method make_awkward_form#
This method is called when building the awkward forms.
The factory should return an awkward.forms.Form object.
See also
Refer to awkward forms for more details about awkward forms.
Example of reader and factory#
Take the TArrayReader and TArrayFactory as an example. This pair of reader and factory handles TArray nodes in the data tree.
The TArrayReader:
Reads
fSize, then readsfSizenumber ofTelements from the binary stream in itsreadmethod.The read-out data is stored in two vectors:
m_offsetsandm_dataThese two vectors are converted to
numpyarrays and returned in itsdatamethod, which is called after reading is done.
TArrayReader#template <typename T>
class TArrayReader : public IElementReader {
private:
SharedVector<int64_t> m_offsets;
SharedVector<T> m_data;
public:
TArrayReader( std::string name )
: IElementReader( name )
, m_offsets( std::make_shared<std::vector<int64_t>>( 1, 0 ) )
, m_data( std::make_shared<std::vector<T>>() ) {}
void read( BinaryBuffer& buffer ) override {
auto fSize = buffer.read<uint32_t>();
m_offsets->push_back( m_offsets->back() + fSize );
for ( auto i = 0; i < fSize; i++ ) { m_data->push_back( buffer.read<T>() ); }
}
py::object data() const override {
auto offsets_array = make_array( m_offsets );
auto data_array = make_array( m_data );
return py::make_tuple( offsets_array, data_array );
}
};
In TArrayFactory,
build_factorymethod matches the data member and instantiates theTArrayFactorybuild_cpp_readermethod creates the C++TArrayReader.make_awkward_contentmethod constructs theawkwardcontent with the raw data returned byTArrayReader.make_awkward_formmethod generates the correspondingawkwardform.
TArrayFactory#class TArrayFactory(Factory):
"""
This class reads TArray from a binary paerser.
TArray includes TArrayC, TArrayS, TArrayI, TArrayL, TArrayL64, TArrayF, and TArrayD.
Corresponding ctype is u1, u2, i4, i8, i8, f, and d.
"""
typenames = {
"TArrayC": "i1",
"TArrayS": "i2",
"TArrayI": "i4",
"TArrayL": "i8",
"TArrayL64": "i8",
"TArrayF": "f",
"TArrayD": "d",
}
@classmethod
def build_factory(
cls,
top_type_name,
cur_streamer_info,
all_streamer_info,
item_path,
**kwargs,
):
"""
Return when `top_type_name` is in `cls.typenames`.
"""
if top_type_name not in cls.typenames:
return None
ctype = cls.typenames[top_type_name]
return cls(name=cur_streamer_info["fName"], ctype=ctype)
def __init__(self, name: str, ctype: str):
super().__init__(name)
self.ctype = ctype
def build_cpp_reader(self):
return {
"i1": uproot_custom.cpp.TArrayCReader,
"i2": uproot_custom.cpp.TArraySReader,
"i4": uproot_custom.cpp.TArrayIReader,
"i8": uproot_custom.cpp.TArrayLReader,
"f": uproot_custom.cpp.TArrayFReader,
"d": uproot_custom.cpp.TArrayDReader,
}[self.ctype](self.name)
def make_awkward_content(self, raw_data):
offsets, data = raw_data
return awkward.contents.ListOffsetArray(
awkward.index.Index64(offsets),
awkward.contents.NumpyArray(data),
)
def make_awkward_form(self):
return ak.forms.ListOffsetForm(
"i64",
ak.forms.NumpyForm(PrimitiveFactory.ctype_primitive_map[self.ctype]),
)