How to Handle Out-of-Order Data in Your IoT Pipeline

2023-08-22 16:00:48

Illustration: © IoT For All

Say you are a vertical manager at a logistics company. Knowing the value of proactive anomaly detection, you implement a real-time IoT system that generates streaming data, not just occasional batch reports. Now you’ll be able to get aggregated analytics data in real time.   

But can you really trust the data? 

If some of your data looks odd, it’s possible that something went wrong in your IoT data pipeline. Often, these errors are the result of out-of-order data, one of the most vexing IoT data issues in today’s streaming systems. 

Business insight can only tell an accurate story when it relies on quality data that you can trust. The meaning depends not just on a series of events, but on the order in which they occur. Get the order wrong, and the story changes—and false reports won’t help you optimize asset utilization or discover the source of anomalies. That’s what makes out-of-order data such a problem as IoT data feeds your real-time systems. 

So why does streaming IoT data tend to show up out of order? More importantly, how do you build a system that offers better IoT data quality? Keep reading to find out. 

The Causes of Out-of-Order Data in IoT Platforms

In an IoT system, data originates with devices. It travels over some form of connectivity. Finally, it arrives at a centralized destination, like a data warehouse that feeds into applications or IoT data analytics platforms.

The most common cause of out-of-order data relates to the first two links of this IoT chain. The IoT device may send data out of order because it's operating in battery-save mode or because of poor-quality design. The device may also lose connectivity for a period of time: it might travel outside a cellular network's coverage area (think "high seas" or "military areas jamming all signals"), or it might simply crash and then reboot. Either way, the device is typically programmed to transmit its buffered data once it re-establishes a connection and receives the command to send, which may be nowhere near the time it recorded a measurement or GPS position. You end up with an event timestamped hours or more after it actually occurred.
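To make the failure mode concrete, here is a minimal Python sketch (hypothetical device ID and timestamps; no particular streaming framework assumed) that flags an event whose device timestamp falls behind the newest event already processed:

```python
from datetime import datetime

# Hypothetical stream: the third reading was buffered during a
# connectivity gap and arrives long after newer readings.
events = [
    {"device_id": "tracker-7", "recorded_at": datetime(2023, 8, 22, 9, 0)},
    {"device_id": "tracker-7", "recorded_at": datetime(2023, 8, 22, 9, 5)},
    {"device_id": "tracker-7", "recorded_at": datetime(2023, 8, 22, 6, 45)},
]

latest_seen = datetime.min
for event in events:
    if event["recorded_at"] < latest_seen:
        lag = latest_seen - event["recorded_at"]
        print(f"out-of-order event from {event['device_id']}: "
              f"recorded {lag} before the newest event already seen")
    else:
        latest_seen = event["recorded_at"]
```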

But connectivity lapses aren’t the only cause of out-of-order (and otherwise noisy) data. Many devices are programmed to extrapolate when they fail to capture real-world readings. When you’re looking at a database, there’s no indication of which entries reflect actual measurements and which are just the device’s best guess. This is an unfortunately common problem. To comply with service level agreements, device manufacturers may program their products to send data according to a set schedule—whether there’s an accurate sensor reading or not.      
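The data itself carries no reliable marker of extrapolation, but one rough heuristic, sketched below with an illustrative looks_scheduled helper and under the assumption that genuine sensor traffic shows some timing jitter, is to flag devices whose readings arrive at suspiciously uniform intervals:

```python
from datetime import datetime, timedelta
from statistics import pstdev

def looks_scheduled(timestamps, jitter_threshold_s=1.0):
    """Heuristic, not a guarantee: near-zero variance in the gaps
    between readings suggests schedule-driven fill data rather than
    event-driven measurements."""
    gaps = [(b - a).total_seconds() for a, b in zip(timestamps, timestamps[1:])]
    return len(gaps) >= 2 and pstdev(gaps) < jitter_threshold_s

# Ten readings exactly 60 seconds apart: plausibly extrapolated fill data.
readings = [datetime(2023, 8, 22, 9, 0) + timedelta(minutes=i) for i in range(10)]
print(looks_scheduled(readings))  # True
```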

The bad news is that you can’t prevent these data-flow interruptions, at least not in today’s IoT landscape. But there’s good news, too. There are methods of processing streaming data that limit the impact of out-of-order data. That brings us to the solution for this persistent data-handling challenge.   

Fixing Data Errors Caused by Out-of-Order Logging

You can’t build a real-time IoT system without a real-time data processing engine—and not all of these engines offer the same suite of services. As you compare data processing frameworks for your streaming IoT pipeline, look for three features that keep out-of-order data from polluting your logs:  

  1. Bitemporal modeling. This is a fancy term for the ability to track an IoT device's event readings along two timelines at once. The system applies one timestamp at the moment of the measurement and a second the instant the data gets recorded in your database. That gives you (or your analytics applications) the ability to spot lapses between a device recording a measurement and that data reaching your database (see the sketch after this list).
  2. Support for data backfilling. Your data processing engine should support later corrections to data entries in a mutable database (i.e., one that allows rewriting over data fields); the sketch below shows this alongside bitemporal timestamps. To support the most accurate readings, your data processing framework should also accept multiple sources, including streams and static data.
  3. Smart data processing logic. The most advanced data processing engines don't just create a pipeline; they also layer machine learning capabilities onto streaming data. That allows the streaming system to simultaneously debug and process data as it moves from the device to your warehouse.
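To ground the first two items, here is a minimal sketch of bitemporal records and backfilling, using a plain Python dict as a stand-in for a mutable store (the store layout and function names are illustrative, not any specific engine's API):

```python
from datetime import datetime, timezone

store = {}  # keyed by (device_id, event_time); stand-in for a mutable database

def ingest(device_id, event_time, value):
    """Bitemporal write: keep the device's measurement time (event_time)
    and the moment the record lands in the store (ingest_time)."""
    store[(device_id, event_time)] = {
        "value": value,
        "ingest_time": datetime.now(timezone.utc),
    }

def backfill(device_id, event_time, corrected_value):
    """Rewrite an earlier entry in place; the fresh ingest_time records
    when the correction arrived, preserving both timelines."""
    ingest(device_id, event_time, corrected_value)

event_time = datetime(2023, 8, 22, 6, 45, tzinfo=timezone.utc)
ingest("tracker-7", event_time, value=18.2)
backfill("tracker-7", event_time, corrected_value=18.9)  # arrives hours later

# The gap between the two timelines exposes the lapse described in item 1.
record = store[("tracker-7", event_time)]
print("measurement-to-database lag:", record["ingest_time"] - event_time)
```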

With these three capabilities operating in tandem, you can build an IoT system that flags—or even corrects—out-of-order data before it can cause problems. All you have to do is choose the right tool for the job. 
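As a concrete illustration of flagging and correcting, stream processors commonly hold events in a short buffer and release them in event-time order once a watermark passes, flagging anything later than the watermark for backfill. The sketch below shows that generic pattern; the allowed_lateness window and the sample data are illustrative, not any specific engine's API:

```python
import heapq
from datetime import datetime, timedelta

def reorder(events, allowed_lateness=timedelta(minutes=30)):
    """Emit (timestamp, payload) pairs in event-time order, tolerating
    events up to `allowed_lateness` out of order; flag anything later."""
    buffer, max_ts = [], None
    for ts, payload in events:
        max_ts = ts if max_ts is None else max(max_ts, ts)
        watermark = max_ts - allowed_lateness
        if ts < watermark:
            # Too late to reorder in-stream: flag for backfill instead.
            print(f"flagged for backfill: {ts} is behind watermark {watermark}")
            continue
        heapq.heappush(buffer, (ts, payload))
        while buffer and buffer[0][0] <= watermark:
            yield heapq.heappop(buffer)
    while buffer:  # end of stream: flush the remainder, still in order
        yield heapq.heappop(buffer)

stream = [
    (datetime(2023, 8, 22, 9, 0), "reading-a"),
    (datetime(2023, 8, 22, 9, 10), "reading-b"),
    (datetime(2023, 8, 22, 8, 55), "late-but-correctable"),
    (datetime(2023, 8, 22, 6, 0), "too-late"),
]
for ts, payload in reorder(stream):
    print(ts, payload)
```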

What kind of tool, you ask? Look for a unified real-time data processing engine with a rich ML library covering the unique needs of the type of data you are processing. That may sound like a big ask, but the real-time IoT framework you’re looking for is available now, at this very moment—the one time that’s never out of order. 


  • Data Analytics
  • Big Data
  • Connectivity
  • Quality Management
