High-performance queries using the Cumulocity APIs
The Cumulocity IoT Platform is built for Device and Application Management in the Internet of Things. It connects and manages IoT devices and assets efficiently. Device information can be stored as inventory objects, alarms, events, operations, and measurements and can be retrieved from Cumulocity. The schema for such data is intentionally kept very generic, giving applications a high degree of freedom when defining the data structures they want to use. To allow for such high flexibility Cumulocity stores such data in a document-oriented database.
There is always a trade-off between costs for data insertion, data storage and efficiency in data retrieval. Cumulocity provides a set of APIs which allow for efficient data retrieval supported by appropriate indices. To keep control over storage overhead and computational costs when inserting and deleting data Cumulocity does not allow users to define their own indices. Consequently, such applications need to rely on database structures as prescribed by the Cumulocity platform. This article explains what type of queries can be executed efficiently. Query execution is considered to be efficient when an appropriate index can be used to directly access relevant data in a large data set. In contrast, when there is no such index a query needs to scan through all data items in a data set and analyze all the filter criteria on each and every data item. We do not consider corner cases when queries cannot benefit from existing indexes (e.g. on small data sets or when many of the objects meet given filter criteria).
The focus is on the very core IoT APIs (inventory, alarms, events, measurements). Management APIs like users, tenants and options are out of scope for now.
The document has relationships to several previous activities linked in the references section. In particular, performance will never be good if the data model does not fit to the use case, so modeling guidelines are linked here.
We first discuss some general considerations for the Cumulocity APIs, then check per API how the most frequent use cases are covered by the current implementation.
Please note that the document ignores basic indexes for sharding, IDs and bulk exports (creation time, last updated). This document reflects implementation as of version 10.4.6 and 10.5.0.
General query performance aspects
The database is fairly similar to traditional database systems in terms of query performance, minus the joins. You create indexes on a collection (~table) properties and, the query optimizer decides which index to use for an efficient query execution.The general trade-offs for creating indexes are:
● When you create an index on a collection, you slow down writes on the collection because the index has to be maintained.
● When you add too many properties into an index, its size grows and its query performance is reduced because it cannot be cached well anymore.
● Unless you can compute queries using the index alone (happens rarely in Cumulocity), it does not make sense to index properties that have a bad selectivity. For example, if you know that a property can only take five values and the values are equally distributed, the selectivity is ⅕ = 20%. It may not make sense to index such values because the overhead of maintaining and reading the index may not justify the benefit over just scanning through all values.
General design assumptions
The following general design assumptions have been made when designing the Cumulocity database schema:
● The inventory is designed to store hundred thousands to million entries in realistic practical cases. See also the best practices document.
● For time-series data (alarms, measurements, events), most use cases query a limited time range and a device.
● If the volume stored by a single device exceeds tens of GB or more, Edge preprocessing or hierarchical modeling should be considered (i.e., the device should be split into multiple sub-devices). Cumulocity shards data by device when scaling the database.
Dates are always specified according to ISO8601 up to milliseconds. You can query by any valid ISO8601 string. If you just query by date it will work and set time to midnight
Use case. The inventory stores master data for devices and provides a structure to group devices according to communication hierarchies and logical structures.
Queries. You can query the inventory efficiently (i.e. supported by related indexing) for
● The existence of specific features of devices through the fragmentType parameter. This way you can receive a list of inventory objects sharing specific capabilities (e.g., all objects having a location "c8y_Location"). This query allows for verifying if a certain fragment is contained in the object and is only available for root level fragments in the JSON document.
● The default columns of the device list in device management. This includes the device name (name), device type (type), serial number, (c8y_Hardware.serialNumber), device model (c8y_Hardware.model), IMEI (c8y_Mobile.imei) and the registration date (creationTime)
Note: Only filters for the default columns are optimised
● External IDs. You can link your own IDs to inventory objects making them accessible through that ID.
● Hierarchical relationships (assets, devices, additional hierarchies). Reference values stored in an array can be used for searching. You can search for all childAssets/childDevices of a specific object and you can also search for all objects having a certain object as child (=all parents of an object).
You can efficiently execute full text queries of properties using a Google-like search style as well.
More generic queries are supported through OData-style queries, but these will generally result in a full scan of the inventory if not combined with the parameters listed above.
Use case. Alarms are used to inform operations or service personnel that a condition requires their intervention or review. Generally, the amount of active and acknowledged alarms is kept low to permit service personnel to react. For that purpose, techniques like alarm deduplication and correlation are applied. Mainly active and acknowledged alarms are relevant for daily work.
While there can be many historical (resolved/closed) alarms, there are only few unresolved alarms. In particular, Cumulocity's automated alarm deduplication makes sure that there are very few active alarms for a particular device.
Queries. Alarms can be efficiently queried by source, by severity, by time period (including date and time) and by status.
The time period (time) refers to the time when the Alarm occurred and is a value set by the client creating the alarm.
Be aware that some of these selection criteria can result in very different filter selectivity. While there might be only few active alarms at a time there can be many closed ones in the database. Depending on the selected time window it might result in higher/lower selectivity than the device type. In general we assume a search by device has a high selectivity because even though not all devices create alarms, there will be a high number of devices creating alarms.
Also, there is special support for the specific use case with filter criteria on status and severity combined.
Use case. Events are typically used similar to logs, recording non-numeric events and data from devices. For example, whenever a machine operator changes a setting on a machine, an event could be recorded to cover when the change was done and what was changed. Events are typically checked for a particular time range plus a type (all position events) and/or a particular device (all position events for a device).
Queries. Events can be efficiently queried by time (including date and time), by source, and by both, source and creation time. Queries by fragmentType are currently not efficient.
Please note, the creation time is set by the system, imported data might not even have such a timestamp. Also note, filters on source and time might first analyze the time, so as of today it is important to have high selectivity on the time when you specify both as filter predicates.
Use cases. The standard use case for querying measurements is to retrieve data points for a device for a certain time span. There may be multiple data points and multiple devices involved. In addition, you would like to determine the most recent value of a set of data points for a device.
Queries. These use cases are currently well supported by the existing setup. Please make sure to use “valueFragmentType” and “valueFragmentSeries" to select data points (see the following example. For further documentation please check the measurement API reference documentation).
A typical query pattern on measurements combines searching for one or multiple specific measurement time series (combining valueFragmentType, valueFragmentSeries, dateFrom and dateTo). This query pattern is either being done for a specific device (source) or without adding this query parameter. Both options are well supported.
Use cases. Operations are used to control devices. Operators want to know which operations have been sent to a device and what the current state or outcome of the execution is.
Queries. Operations can be efficiently filtered by a given device (source), as well as by creationTime (including date and time). There is special support for queries which filter by creationTime and deviceId, or by creationTime, deviceId, and status.
This document summarizes the types of queries which can be executed in an efficient and high-performant manner based on the pre-defined indexes in the database.