Simulation Algorithm Benefits by Connecting Geostatistics With Unsupervised Learning

This paper presents a new geostatistics modeling methodology that connects geostatistics and machine-learning methodologies, uses nonlinear topological mapping to reduce the original high-dimensional data space, and uses unsupervised-learning algorithms to bypass problems with supervised-learning algorithms. The algorithm presented is a neural topology-preserving pattern-based geostatistical simulation algorithm that integrates the self-organizing map (SOM) concept and its updated version—growing self-organizing map (GSOM)—with an unsupervised competitive learning structure.

Introduction
In oil and gas reservoir modeling, any model construction faces challenges of limited data to some extent. The heuristic behind all geostatistical techniques is the implicit existence of statistical relationships among available data. “Data” here is a broad term; it could be discrete points, such as porosity or permeability at certain locations, but it also could be training images (TIs), which are used in this work. Using TIs as input data originated with multiple-point geostatistics. The aim was to overcome the limitations of using traditional two-point statistical variograms to describe geological continuity, especially in the case of curvilinear structures, which are quite common in nature, such as in fracture networks and geological fluvial structures.

The authors write that geostatistics could benefit from the fields of machine learning and statistical learning. Machine-learning tasks can be divided into two protocols, supervised learning and unsupervised learning, depending on whether the input data carry correct labels. For the investigation considered in this paper, after image patches were retrieved from TIs, machine-learning algorithms were used to group those patches into clusters. If the cluster to which each image patch belongs is known beforehand, the data can be said to have correct labels, and a large set of such labeled data could be used to train a model. Supervised learning occurs when the model is tuned under the guidance of these correct labels through an error-correction process. Unsupervised learning, on the other hand, applies when neither the number of clusters nor the cluster to which each image patch belongs is known. Image patches scanned from a TI carry no such labels, so clustering them is an unsupervised-learning problem. This lack of labels is the first major difficulty in applying machine-learning algorithms to geostatistical simulations.

Two other important issues are encountered when performing pattern-based geostatistical simulations. The first is the large number of image patches. This large image-patch database contains high pattern redundancy, which could make pattern-similarity comparison inefficient. The second issue is the high dimensionality of each image patch, although the patches could typically be described by low-dimensional manifolds with nonlinear structure. To model and visualize these high-dimensional data, a nonlinear dimensionality-reduction technique should be sought.

This led the authors to propose a new pattern-based geostatistical simulation algorithm that integrates an unsupervised artificial-neural-network (ANN) scheme while simultaneously working directly on nonlinear lower-dimensional input data for pattern clustering, comparison, and visualization. The proposed algorithm was designed to manage the difficulties mentioned previously while bridging machine learning with geostatistical simulation.

Methodology
An unsupervised-learning algorithm was proposed to bypass these difficulties. The simulation methodology is built around pattern-based multiple-point geostatistics; therefore, TIs are scanned by use of a template to generate an image-patch database. The simulation step relies on image-patch similarity. Instead of comparing against the entire image-patch database, the unsupervised-learning algorithm first clusters the database to generate a representative data set. This set contains one element per cluster found in the image-patch database, and each element is the center of its cluster. Next, the image-patch similarity comparison is conducted against this representative data set, and an image patch from the best-matching cluster is selected and pasted on the realization grid.
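The template scan that builds the image-patch database can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `extract_patches`, the rectangular 3×3 template, and the toy binary TI are all assumptions made for the example.

```python
import numpy as np

def extract_patches(ti, template=(3, 3)):
    """Slide a rectangular template over a 2D training image (TI) and
    return every fully contained patch as a flattened row vector."""
    th, tw = template
    # sliding_window_view returns shape (H - th + 1, W - tw + 1, th, tw)
    windows = np.lib.stride_tricks.sliding_window_view(ti, (th, tw))
    return windows.reshape(-1, th * tw)

# Hypothetical toy TI: one horizontal "channel" in a 6x6 binary grid
ti = np.zeros((6, 6), dtype=int)
ti[2:4, :] = 1

patches = extract_patches(ti, template=(3, 3))
unique_patches = np.unique(patches, axis=0)
print(patches.shape, unique_patches.shape)
```

Even in this tiny example, the 16 scanned patches contain only four distinct patterns, which illustrates the kind of redundancy that clustering into a representative set is meant to remove.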

This representative set also serves as a nonlinear lower-dimensional manifold. By projecting the original high-dimensional image-patch database onto this manifold, visualization of the internal data structure relationship can be performed easily. The SOM algorithm was selected as the dimensionality-reduction technique for this work because of its advantages with regard to topology preservation.
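One way to picture this projection is to assign every patch to its nearest prototype and count the hits per map node. The sketch below assumes the prototypes sit on a rectangular 2D grid stored in row-major order; the names `project_to_map`, `weights`, and `grid` are hypothetical and not taken from the paper.

```python
import numpy as np

def project_to_map(patches, weights, grid):
    """Map each high-dimensional patch to the 2D grid position of its
    nearest prototype and count how many patches land on each node."""
    rows, cols = grid
    # Pairwise Euclidean distances, shape (n_patches, n_neurons)
    dists = np.linalg.norm(patches[:, None, :] - weights[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    # Row-major neuron index -> (row, col) position on the map
    coords = np.column_stack(np.unravel_index(nearest, grid))
    hits = np.bincount(nearest, minlength=rows * cols).reshape(rows, cols)
    return coords, hits

# Hypothetical 2x2 map with prototypes at the corners of the unit square
weights = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
patches = np.array([[0.1, 0.0], [0.9, 1.0], [1.0, 0.9], [0.0, 0.2]])
coords, hits = project_to_map(patches, weights, grid=(2, 2))
```

The `hits` array can be plotted as a heat map to show which regions of the map are densely populated by patterns from the TI.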

SOM
The classic SOM algorithm is inspired by neuroanatomy studies that suggest sensory input is projected in an orderly manner onto the cortex, functionally connected neurons grouped together in a human cortex have a minimum metabolic cost, and neighboring connected elements in the sensory-input domain activate neighboring connected neurons in the cortex. This feature is known as topographical organization of the sensory cortex and includes a feature called a neighborhood-preserving map, which means that closely related information is kept together after mapping. This feature accomplishes the most fundamental task of the SOM algorithm: learning a topographic transformation that maps the high-dimensional input (source) space to a low-dimensional output (target) space with the original topographic order maintained.

Fig. 1—SOM mapping from input space to neural grids. (a) The high-dimensional input-data space. (b) The neural network in a one-layer 2D lattice grid; each neuron represents a feature drawn from input-data space. The red square represents a neighborhood region. (c) Activated BMU neighborhood. The red square shows the activated neighborhood region surrounding the BMU, shown as a red circle. The blue circles represent included neighborhood neurons other than the BMU at that iteration. Black circles represent deactivated neurons.

SOM is an ANN learning procedure, but, unlike most ANNs, its learning protocol is unsupervised. It has only one layer of neurons (Fig. 1) and only a feed-forward step for the input data. For each input-data vector, the neurons compete according to the distance between their weight vectors and the input vector, with the closest neuron winning. The winning neuron is called the best-matching unit (BMU). The BMU activates a neighborhood region comprising the BMU and its surrounding neurons, according to a predefined neighborhood radius function and threshold, while inhibiting the neurons outside the active region. Then, the weight vector of each neuron in this activated region is updated adaptively to approach the input-data vector. One input vector from the input space is presented to the network at a time. From this perspective, SOM uses an online learning protocol, as opposed to a batch protocol in which all input-data vectors are formed into a matrix and flow into the network together.
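The online competitive-learning loop described here can be sketched in a few lines. This is an illustrative version under stated assumptions rather than the paper's code: the hard-cutoff neighborhood, the linear decay of learning rate and radius, the initialization from random data samples, and all names and default values (`train_som`, `lr0`, `radius0`) are choices made for the example.

```python
import numpy as np

def train_som(data, grid=(4, 4), n_iter=500, lr0=0.5, radius0=2.0, seed=0):
    """Online SOM training: one randomly drawn input vector per iteration."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    # Grid position of each neuron (row-major), shape (rows * cols, 2)
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)],
                      dtype=float)
    # Assumption: initialize weight vectors from random data samples
    weights = data[rng.integers(0, len(data), size=rows * cols)].astype(float)
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                      # assumed linear decay
        radius = max(radius0 * (1.0 - frac), 0.5)    # assumed linear decay
        x = data[rng.integers(len(data))]
        # Competition: the neuron closest to x in input space is the BMU
        bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
        # Activation: neurons within the radius of the BMU on the grid
        active = np.linalg.norm(coords - coords[bmu], axis=1) <= radius
        # Adaptation: active neurons move toward x; the rest are inhibited
        weights[active] += lr * (x - weights[active])
    return weights, coords

# Hypothetical input: 200 random 9-dimensional vectors (e.g., 3x3 patches)
data = np.random.default_rng(1).random((200, 9))
weights, coords = train_som(data, grid=(4, 4))
```

Because each update is a convex combination of the old weight vector and the presented input, the prototypes always stay inside the range spanned by the data.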

Drawback of Classical SOM
In the SOM algorithm, the number of neurons, or prototypes, must be predefined. This potentially injects the learner’s bias toward the source space. Generally, for example, one does not completely know the prior structure of the TI used for the pattern-based geostatistical simulation; therefore, the number of prototypes should be set dynamically during the learning process. This constraint is relaxed by use of GSOM.

GSOM
GSOM was developed to overcome the limitation of SOM with a self-adaptive way to grow the number of neurons. Most of its rules are the same as in SOM.

In GSOM, the input-data vectors are presented to the training process one at a time, with each one iterated until a maximum number of iterations has been reached. The BMU, neighborhood region, and neuron weight-vector updates are calculated in the same way as in SOM. Because GSOM grows the neuron grid dynamically, the algorithm must decide by itself when and where to add neurons.

To ease the process, two factors are used, accumulated error and growing threshold (GT). After one neuron is selected as the BMU, GSOM calculates an error value, which is the Euclidean distance between the BMU’s weight vector and the input-data vector. This error value is tracked and accumulated for each BMU. On the basis of the accumulated error and GT, three criteria of neuron growth are considered.

  • If the accumulated error of a BMU exceeds the GT and the BMU is a boundary neuron, new neurons are grown at all free neighboring positions of the BMU. A weight-distribution process then assigns a weight vector to each new neuron.
  • If the accumulated error of a BMU exceeds the GT and the BMU is not a boundary neuron, an error-distribution process is conducted.
  • If the accumulated error of a BMU is below the GT, the neurons’ weight vectors are updated in the same way as in SOM.
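The three growth criteria can be sketched as a single decision step. The synopsis does not spell out the weight-distribution or error-distribution rules, so the versions below are placeholders and should be read as assumptions: new neurons simply copy the BMU's weight vector, and error distribution halves the BMU's accumulated error and shares the other half equally among its four grid neighbors. The function name `gsom_step`, the dictionary layout, and the learning rate are also hypothetical.

```python
import numpy as np

OFFSETS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # 4-connected grid neighbors

def gsom_step(weights, errors, x, gt, lr=0.3):
    """One GSOM decision for input x.
    weights: {(i, j): weight vector}; errors: {(i, j): accumulated error}."""
    x = np.asarray(x, dtype=float)
    bmu = min(weights, key=lambda p: np.linalg.norm(weights[p] - x))
    # Accumulate the Euclidean error between the BMU and the input
    errors[bmu] += float(np.linalg.norm(weights[bmu] - x))
    neighbors = [(bmu[0] + di, bmu[1] + dj) for di, dj in OFFSETS]
    free = [p for p in neighbors if p not in weights]

    if errors[bmu] > gt and free:
        # Criterion 1: boundary BMU, grow at every free neighboring position
        for p in free:
            weights[p] = weights[bmu].copy()  # placeholder weight distribution
            errors[p] = 0.0
        errors[bmu] = 0.0
        return "grow"
    if errors[bmu] > gt:
        # Criterion 2: interior BMU, spread error (placeholder rule)
        share = 0.5 * errors[bmu] / len(neighbors)
        errors[bmu] *= 0.5
        for p in neighbors:
            errors[p] += share
        return "distribute"
    # Criterion 3: SOM-style update of the BMU and its existing neighbors
    for p in [bmu] + [q for q in neighbors if q in weights]:
        weights[p] = weights[p] + lr * (x - weights[p])
    return "update"

# Hypothetical start: a 2x2 grid of zero weight vectors
weights = {(i, j): np.zeros(2) for i in range(2) for j in range(2)}
errors = {p: 0.0 for p in weights}
action = gsom_step(weights, errors, [3.0, 4.0], gt=4.0)
```

In this toy run, the first input lies at distance 5 from every neuron, which exceeds the GT of 4, so the boundary BMU grows two new neighbors and the grid expands from four to six neurons.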

Geostatistical Simulation
Simulation Algorithm. A neural topological-map-based geostatistical simulation algorithm is proposed that incorporates SOM/GSOM. Like previously developed pattern-based algorithms, it relies on a TI as input. Image patches are extracted from the TI and stored in a pattern database. As usual, a random path visiting each node in the realization grid is used.

Realizations are performed by pasting the image patches onto a realization grid along this random path according to a similarity pattern search between realized patterns on the grid and the pattern database. SOM/GSOM has a direct effect on this similarity pattern process.
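A two-stage similarity search of this kind might look like the sketch below: the partially informed pattern on the realization grid is compared only with the cluster prototypes, and a patch is then drawn from the winning cluster. How the authors pick a patch within the cluster is not stated here, so the uniform random draw is an assumption, as are the masked Euclidean distance and the names `masked_distance` and `pick_patch`.

```python
import numpy as np

def masked_distance(candidates, event, known):
    """Euclidean distance computed over the informed cells only."""
    diff = (candidates - event)[:, known]
    return np.linalg.norm(diff, axis=1)

def pick_patch(event, known, prototypes, labels, patches, rng):
    """Stage 1: find the nearest prototype. Stage 2: draw a patch from
    that prototype's cluster (uniform draw is an assumption)."""
    if known.any():
        cluster = int(np.argmin(masked_distance(prototypes, event, known)))
    else:
        # Nothing simulated yet at this location: choose a cluster at random
        cluster = int(rng.integers(len(prototypes)))
    members = np.flatnonzero(labels == cluster)
    if members.size == 0:
        # Empty cluster: fall back to the full database
        members = np.arange(len(patches))
    return patches[rng.choice(members)]

# Hypothetical database of four 2x2 patches grouped into two clusters
patches = np.array([[0, 0, 0, 0], [0, 0, 0, 1], [1, 1, 1, 1], [1, 1, 1, 0]])
labels = np.array([0, 0, 1, 1])
prototypes = np.array([[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]])

# Realized pattern on the grid: cells 0 and 2 are informed, 1 and 3 are not
event = np.array([1.0, 0.0, 1.0, 0.0])
known = np.array([True, False, True, False])
chosen = pick_patch(event, known, prototypes, labels, patches,
                    np.random.default_rng(0))
```

The search cost in stage 1 scales with the number of prototypes rather than with the size of the full pattern database, which is where SOM/GSOM affects the similarity search.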

With SOM. Compared with previously published simulation results from an algorithm using the same template size, the algorithm with SOM reproduced the patterns of the original TI much better.

With GSOM. Compared with the algorithm using SOM, the algorithm using GSOM reproduced the patterns much better.


This article, written by Special Publications Editor Adam Wilson, contains highlights of paper SPE 190087, “Unsupervised Statistical Learning With Integrated Pattern-Based Geostatistical Simulation,” by Q. Li and R. Aguilera, SPE, University of Calgary, prepared for the 2018 SPE Western Regional Meeting, Garden Grove, California, USA, 22–27 April. The paper has not been peer reviewed.