Integrated Internet of Things Platform Helps Close the Gap Between Data Science and Operations

Many upstream companies have embraced machine learning and digital transformation in the past few years. Factors such as lower-cost industrial Internet of things (IIoT) devices and cloud computing resources enable exploration and production (E&P) companies to access and store large volumes of data that were traditionally unavailable. These factors also enable the application of machine-learning techniques that can augment traditional physics- and engineering-based models of complex systems, turning data into actionable insights. A search of the OnePetro library shows that the number of publications on data-driven approaches in oil and gas has increased exponentially over the past decade, and many major oil and gas conferences have designated sessions for data analytics in recent years.

However, the majority of data-science projects started in upstream companies in recent years did not go beyond research papers, and few appear to have been put into continuous operation in actual business processes. A major challenge in putting data science into operation is transforming the models created by data scientists into on-site products that perform robustly in real-world scenarios and that operators can use to make decisions.

Unlike traditional techniques used in E&P work flows, data science is still at an exploratory stage in the oil and gas sector. Whether data-driven elements are merged into existing work flows or new work flows are built around data-driven algorithms, both options face uncertainty in objective reliability and subjective trustworthiness. When adopting cutting-edge models from technology companies such as Google and Microsoft, data scientists in oil companies typically also must address the challenge of model transferability. Moreover, they face the risk associated with the low interpretability of some complex models, which is less acceptable in industrial environments. Apart from the models themselves, data scientists’ lack of operational experience may also lead to a poor user experience when their products are used in real E&P operations.

For product development and deployment in a highly uncertain environment, Silicon Valley has built a successful work flow. The “agile” methodology emphasizes closed-loop cycles of design/develop/deploy/feedback, in which a product that is not yet robust is exposed to real users for feedback and criticism and then improved accordingly through fast iterations. Some data-science teams in oil companies attempt to follow agile methodology. However, the core of the methodology is the ability to productize quickly, a requirement that can hardly be met in the current setting of most oil companies. A typical data scientist does not have the full set of skills needed to transform a machine-learning model into a software-as-a-service (SAAS) product, which include data engineering, web-service development, user interface development, development operations, support, and other facets common to web-based software tools. This impedes the desired agile productizing because data scientists have to wait for support from data engineers, information technology (IT) administrators, front-end developers, and subject-matter experts, whose efficient collaboration is not an established norm in oil companies.

Integrated IIoT Platform

Arundo Analytics has built an integrated IIoT platform that allows data scientists to productize data-science solutions with minimal IT support and accelerate feedback/improvement iterations between end-users and data scientists effectively.

Fig. 1—Structure of the integrated IIoT platform.

Besides having a machine-learning model as its core, a data-science SAAS product must also have a data interface through which streaming or batched data is connected to the model as input and a user interface through which an end-user may visualize or export the output from the model or otherwise integrate the output into an existing business process. The SAAS must be deployed in a cloud environment where computational resources are available. A model-management mechanism is also necessary to guarantee scalability and reliability of services, especially when multiple models are deployed or multiple data scientists or end-users work on the same set of models. To satisfy these requirements, the platform includes four core components (Fig. 1): edge agent, model hub, model deployment software, and end-user application. The cloud-based model hub is the core of the computing system, where all models are deployed and managed and where live data streams through and is connected to deployed models. Model deployment software helps data scientists who create data-driven models to publish their work securely and efficiently. The edge component is connected to IIoT devices in the field and securely streams live, operational data into the model hub. End-user applications deliver the solutions created by models to users in real time.

The model hub is the computational environment for all machine-learning models. For scalability and security purposes, operating-system-level virtualization, also known as containerization, is implemented. Containerization allows multiple isolated user-space instances (i.e., containers) to exist on the same operating system simultaneously, while each container looks like a real computer from the point of view of the programs running in it. Programs running inside a container cannot see the contents assigned to other containers. In this platform, every deployed model is assigned to an independent container, which protects information and intellectual property. Another benefit of containerization is that wasted resources are reduced substantially because all containers share the same server and host operating system and each container holds only the model and related binaries. Spinning up a container on a server usually takes significantly less time than provisioning a virtual machine, so the platform is more scalable than one that runs every model on a separate virtual machine. When a model is published to the hub for deployment, the model hub creates a container and hosts the model there as a representational state transfer (REST) application programming interface (API). A container orchestration tool is a key element of the model hub; it is responsible for automating the deployment, scaling, and networking of containers. This is especially critical if usage of the platform scales up rapidly and challenges such as load balancing, auto-start of containers, and health checks become stability bottlenecks.
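As a rough illustration of what hosting a model as a REST API inside a container can look like, the following sketch uses only the Python standard library. The `predict` function, the `/predict` route, the port, and the JSON field names are hypothetical placeholders chosen for this example; they are not the platform's actual interface.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple


def predict(pressure: float, temperature: float) -> float:
    """Stand-in for a trained model (hypothetical linear score)."""
    return 0.5 * pressure + 0.1 * temperature


def handle_request(body: bytes) -> Tuple[int, dict]:
    """Turn a JSON request body into an HTTP status code and a JSON-able reply."""
    try:
        payload = json.loads(body)
        result = predict(float(payload["pressure"]), float(payload["temperature"]))
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed JSON, missing fields, or non-numeric values.
        return 400, {"error": f"bad request: {exc!r}"}
    return 200, {"prediction": result}


class ModelHandler(BaseHTTPRequestHandler):
    """Minimal HTTP wrapper that exposes the model at POST /predict."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, reply = handle_request(self.rfile.read(length))
        data = json.dumps(reply).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8080) -> None:
    """Blocking entry point that a container could run as its main process."""
    HTTPServer(("0.0.0.0", port), ModelHandler).serve_forever()
```

Keeping the request logic in `handle_request`, separate from the HTTP handler, lets the model wrapper be tested without starting a server. In a containerized setup, `serve()` would typically be the only process in the container.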

On the field side, edge devices (i.e., devices that provide entry points into enterprise or service provider core networks) are equipped with edge agent software that gathers sensor data and sends it to the cloud. Edge agents support most standard industrial protocols. A single edge agent can support thousands of tags, and multiple edge agents can be installed on a single device. For assets with a large number of live data tags or a high data-gathering frequency, some preprocessing of the data can be performed at the edge instead of transferring all raw data to the cloud, which reduces latency, bandwidth pressure, and data-transfer cost. The edge agent is also the first line of defense for data security. It is built with end-to-end encryption as well as firewall support.
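One common form of edge preprocessing is to aggregate high-frequency samples into fixed time windows and transmit only the summaries. The sketch below shows that idea under stated assumptions: the `(timestamp, tag, value)` sample layout, the 60-second window, and the tag names are illustrative and do not describe the edge agent's actual data format.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# (unix timestamp in seconds, tag name, value); an assumed layout for this sketch
Sample = Tuple[float, str, float]


def downsample(samples: Iterable[Sample], window_s: float = 60.0) -> List[dict]:
    """Aggregate raw samples into one summary per tag per time window."""
    buckets: Dict[Tuple[str, int], List[float]] = defaultdict(list)
    for ts, tag, value in samples:
        # Samples whose timestamps fall in the same window share a bucket.
        buckets[(tag, int(ts // window_s))].append(value)

    summaries = []
    for (tag, window), values in sorted(buckets.items()):
        summaries.append({
            "tag": tag,
            "window_start": window * window_s,
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
            "count": len(values),
        })
    return summaries


if __name__ == "__main__":
    raw = [
        (0.0, "pump1.pressure", 10.0),
        (30.0, "pump1.pressure", 14.0),
        (61.0, "pump1.pressure", 20.0),
    ]
    for row in downsample(raw):
        print(row)
```

Sending one summary per tag per window instead of every raw sample is what reduces bandwidth and transfer cost; the trade-off is that detail within each window is lost, so the window length has to suit the models downstream.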

For data scientists who create machine-learning models, the platform offers a desktop tool with which they can publish their models to the model hub in the cloud. Because the platform is independent of the local model-building environment, data scientists can still follow their own model-building process with their own selected tools (e.g., Python or Jupyter notebook). The goal is to reduce barriers between data scientists and model deployment with minimal change to their ordinary work flow. The platform provides a Python library and a command-line interface with which data scientists may wrap their models and deploy them to the cloud. The desktop tool communicates with the model hub to trigger the containerized deployment process in the cloud and transfers models from the local environment to the cloud.

Using the same desktop tool, data scientists also may connect their models with real-time data streams. Edge agent software sends live data to the cloud in a readable format; therefore, data scientists do not need to know the data-transfer protocol to use those live streams. To build model-data pipelines, data scientists only need to map prepared data tags to the input arguments of their deployed models. The deployed model generates an output data stream by running on the input data streamed through the pipelines, and those outputs are available to downstream applications immediately. This makes the connection between in-house modeling teams and operational teams seamless.
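The tag-to-argument mapping described above can be pictured with a small sketch. Everything named here is an assumption made for illustration: the `anomaly_score` model, the `run_pipeline` helper, the tag names, and the choice to skip incomplete records are not taken from the platform.

```python
from typing import Callable, Iterable, Iterator, Mapping


def anomaly_score(pressure: float, flow_rate: float) -> float:
    """Stand-in for a deployed model (hypothetical)."""
    return abs(pressure / flow_rate - 2.0)


def run_pipeline(
    model: Callable[..., float],
    tag_to_arg: Mapping[str, str],
    records: Iterable[Mapping[str, float]],
) -> Iterator[float]:
    """Yield one model output per record that contains every mapped tag."""
    for record in records:
        if not all(tag in record for tag in tag_to_arg):
            continue  # skip incomplete records instead of guessing values
        # Rename each data tag to the model argument it is mapped to.
        kwargs = {arg: record[tag] for tag, arg in tag_to_arg.items()}
        yield model(**kwargs)


if __name__ == "__main__":
    mapping = {"WELL-07/PT-101": "pressure", "WELL-07/FT-205": "flow_rate"}
    live = [
        {"WELL-07/PT-101": 100.0, "WELL-07/FT-205": 40.0},
        {"WELL-07/PT-101": 90.0},  # flow-rate tag missing, so this record is skipped
    ]
    print(list(run_pipeline(anomaly_score, mapping, live)))
```

Because the mapping is plain data, the same model could be attached to a different asset by swapping in a different set of tag names, with no change to the model code.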

Output streams from machine-learning models also can be used as input streams to other models, which enables data scientists to chain machine-learning models. This makes a model reusable across multiple projects, which is important for collaboration between data scientists in the same organization. All output streams are stored in warm storage, from which end-users may query recent results. Users also have the option to send results to external data storage (e.g., an SQL database).
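Model chaining can be sketched with Python generators, where each stage consumes the previous stage's output stream. The helper names (`as_stream`, `chain`) and the two toy models are hypothetical and serve only to show the composition pattern.

```python
from typing import Callable, Iterable, Iterator

StreamFn = Callable[[Iterable[float]], Iterator[float]]


def as_stream(model: Callable[[float], float]) -> StreamFn:
    """Lift a single-value model into a function over streams."""
    def run(stream: Iterable[float]) -> Iterator[float]:
        for value in stream:
            yield model(value)
    return run


def chain(*stages: StreamFn) -> StreamFn:
    """Compose stages so each one consumes the previous stage's output."""
    def run(stream):
        for stage in stages:
            stream = stage(stream)
        return stream
    return run


def fahrenheit_to_celsius(temp_f: float) -> float:
    """Toy upstream model: unit conversion."""
    return (temp_f - 32.0) * 5.0 / 9.0


def overheat_flag(temp_c: float) -> float:
    """Toy downstream model: 1.0 when the temperature exceeds 90 degrees C."""
    return 1.0 if temp_c > 90.0 else 0.0


if __name__ == "__main__":
    pipeline = chain(as_stream(fahrenheit_to_celsius), as_stream(overheat_flag))
    print(list(pipeline([212.0, 68.0])))
```

The conversion stage here could be reused unchanged in a second pipeline with a different downstream model, which is the kind of reuse across projects that the paragraph above describes.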

Downstream in the system, various types of applications can be created on, or connected to, the output streams generated by models. For third-party visualization and business-intelligence tools, end-users may configure the warm storage so that it connects with the tool directly. For a customized application user interface (UI), data scientists may publish their model along with a web template; the model hub renders it automatically during the deployment process and hosts the web-based UI in the same container as the computational model. This makes the UI accessible to permitted end-users immediately. For data scientists with limited expertise in web hosting, this helps accelerate the development of UIs for machine-learning models.
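The article does not describe the template format, so the following is only a sketch of the general idea, assuming simple placeholder substitution with Python's standard `string.Template`. The page layout and the `render_page` helper are hypothetical.

```python
from html import escape
from string import Template

# Hypothetical web template published alongside a model.
PAGE = Template(
    "<html><body>"
    "<h1>$title</h1>"
    "<p>Latest prediction: <strong>$prediction</strong></p>"
    "</body></html>"
)


def render_page(title: str, prediction: float) -> str:
    """Fill the template, escaping the title so it cannot inject markup."""
    return PAGE.substitute(title=escape(title), prediction=f"{prediction:.2f}")


if __name__ == "__main__":
    print(render_page("Pump A health", 0.456))
```

A page rendered this way could be served from the same process that hosts the model, which is one way to read "the same container" in the paragraph above.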

Case Study

The platform was tested in real-world scenarios in partnership with a few upstream oil and gas companies. Most of those partners were in the early stage of their artificial-intelligence journey and were trying to identify the potential value of various use cases. Because of the exploratory nature of those use cases, most solutions were not robust enough for fully automated operation at first, and an agile-style work flow was desired. For example, an offshore operator tried to create a data-driven predictive-maintenance solution for critical equipment on its platforms. The initial models did not cover all potential failures and had a high false-alarm rate because of the limited number of historical failures. Data scientists had to modify and retrain their models multiple times over a period of several months before the final model could perform robustly. The iterative improvement was based on repeatedly testing models against new data. Our platform offered data scientists easy access to new live data and a convenient system for deploying new versions of the model continuously. Moreover, a dashboard created and deployed on the platform helped operators examine results visually and provide feedback faster.


This integrated IIoT platform helps minimize the time required to productize an in-house data-science solution in real E&P operations. The proposed IIoT architecture is the first to focus on agile development and deployment of data science in the oil and gas E&P context. This work addresses the challenge of operationalizing the digital oil field, in turn catalyzing smart operations through industrial analytics.

The complete paper can be found on OnePetro.

Tailai Wen is a staff data scientist at Arundo Analytics based in Houston, where he works on physics-based and data-driven technologies and innovations in the industrial sectors, including oil and gas, renewable energy, and transportation. Before joining Arundo, Wen worked at research and development branches of the QRI Group, Chevron, and IBM. He holds a PhD degree in computational mathematics, an MS degree in statistics from Stanford University, and a BS degree in mathematics from Tsinghua University.

