Transforming Petroleum Engineers Into Data Science Wizards
Often, I receive questions from colleagues asking for tips on data science and machine learning as applied to petroleum engineering. These answers address some of those questions I have collected. Here is my advice on becoming a petroleum engineer and a data science wizard.
First, complete any of the Python or R online courses on data science. My favorites are the ones from Johns Hopkins and the University of Michigan on Coursera (data-science specializations in R or Python). Do not be mistaken: The data-science specialization in R is a high-quality course and can make you feel like you are going through a PhD program. You will need a firm commitment and to set aside some time for lectures, quizzes, and project assignments. You could complement it with DataCamp short workshops. These are just two names that come to mind quickly, but there are many others: edX, Udacity, Udemy, and others from reputable universities such as Stanford, MIT, or Harvard. If you do not have previous programming experience, start with Python. If you feel confident about your programming skills and would like to break the barrier between engineering and science, go full-throttle with R. You will not regret it. Note: For those who have asked me if I recommend a formal data-science degree from a university, I tell them to try online courses first and see whether data science is for them.
Start using Git as much as possible in all your projects. It is useful for sharing and maintaining code, working in teams, synchronizing projects, working on different computers, and reproducing your data-science projects. To access Git in the cloud, you can use GitHub, Bitbucket, or GitLab. Do not be frustrated if you do not understand it at first; everybody struggles with Git. Even PhDs, who, by the way, have written the best tutorials on Git. So, you are not alone.
Learn the basics of the Unix terminal. It does many things that Windows does not do and might never do, and the skill carries over directly to Linux and macOS, which are Unix-like systems. You can create automatic scripting using the Unix terminal that can serve you in many data-oriented activities, such as operations on huge data sets, deployments, backups, file transfers, remote-computer management, secure transfers, low-level settings, and version control with Git. If you are a Windows user, get familiar with Unix from hybrid Windows applications such as Git-Bash, MSYS2, or Cygwin. There is no question that you have to know the Unix terminal. It makes your data science much more powerful and reproducible, also giving you avenues for deployment. I am seeing articles more frequently in which the author has managed to read and transform terabyte-size data sets, on laptops, using a combination of Unix utilities such as grep, awk, and sed along with data.frame and data.table structures. In those cases, this removes the need for big-data computer clusters with Hadoop or Spark, which are much more difficult to handle.
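The idea behind those grep and awk pipelines is streaming: process one row at a time so the full data set never has to fit in memory. A minimal sketch of the same idea in Python follows, using only the standard library. The function name, column names, and sample rows are made up for illustration; an open file handle could be passed in place of the in-memory sample.

```python
import csv
import io
from collections import defaultdict


def total_by_key(lines, key_col, value_col, pattern=None):
    """Stream rows one at a time and sum value_col per key_col.

    `lines` is any iterable of text lines (an open file works), so the
    whole data set is never loaded into memory at once.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(lines):
        # grep-like step: skip rows whose key does not contain the pattern
        if pattern is not None and pattern not in row[key_col]:
            continue
        # awk-like step: keep a running total per key
        totals[row[key_col]] += float(row[value_col])
    return dict(totals)


# Tiny in-memory stand-in for a very large file (hypothetical data).
sample = io.StringIO(
    "well,date,oil_bbl\n"
    "A-01,2020-01-01,100.0\n"
    "B-07,2020-01-01,250.0\n"
    "A-01,2020-01-02,110.0\n"
)
print(total_by_key(sample, "well", "oil_bbl", pattern="A-"))
```

Only the running totals are held in memory, which is why this style scales to files far larger than RAM as long as the number of distinct keys stays modest.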
As soon as you have installed R, Rtools, and RStudio on your computer, start using Markdown. In R, it is called Rmarkdown, which is widely used in science for generating documentation, papers, citations, booklets, manuals, tutorials, schematics, diagrams, web pages, blogs, and slides. Make a habit of using Markdown. If possible, during engineering work, avoid Word, which generates mostly binary files. Because Markdown files are plain text, they make revision control easier and your work more reproducible, both key to reliable, testable, traceable, repeatable data science. With Markdown, you can also embed LaTeX equations with text, code, and calculations. Besides, you gain an additional ecosystem to run code and tools from the LaTeX universe, which is enormous.
Strive to publish your engineering results using Markdown. It will complement your efforts of batch automation, data science, and machine learning. Combine calculations with code and text using the Rmarkdown notebooks in R. Essentially, any document can be written mixing text, graphics, and calculations with R or Python. Even though I am originally a Python guy (10+ years), I am not strongly recommending the Python notebooks or Jupyter because notebook files are not 100% human-readable text (they are stored as JSON). Also, you may find it difficult to apply version control and reproducible practices to them or to use them with Git. I have built possibly more than a thousand Jupyter notebooks, but learning Rmarkdown was like stepping into another dimension.
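To make the JSON point concrete, here is a small sketch that treats a notebook file as what it is on disk, a JSON document, and pulls out only the code cells as plain text. The notebook content below is hand-written and hypothetical, and the helper name is my own; the `cells`, `cell_type`, and `source` fields follow the version 4 notebook format.

```python
import json


def notebook_code_cells(notebook_text):
    """Return the source of each code cell in a notebook as plain text."""
    nb = json.loads(notebook_text)
    sources = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            src = cell.get("source", "")
            # The format allows source as one string or as a list of lines.
            sources.append(src if isinstance(src, str) else "".join(src))
    return sources


# A minimal hand-written notebook (hypothetical content).
example = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Well test\n"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["rate = 1200\n", "print(rate)"]},
    ],
})

for code in notebook_code_cells(example):
    print(code)
```

Everything else in the file (metadata, execution counts, embedded outputs) is what makes line-by-line comparison in Git noisy, whereas an Rmarkdown source file is already the plain text.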
Start bringing your data into data sets with the assistance of R or Python. Build your favorite collections of data sets. Share them with colleagues in the office and discuss the challenges in making raw data tidy. Generate tables, plots, and statistical reports to come up with discoveries. Use Markdown to document the variables or features (columns). If you want to share the data while preserving confidentiality, learn how to anonymize your data with R or Python cryptographic or scrambling packages.
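One way to scramble identifiers before sharing, sketched here with Python's standard `hmac` and `hashlib` modules, is to replace each well name with a keyed hash. The function name, key, and sample rows are assumptions for illustration. Note that this is pseudonymization rather than full anonymization: other columns such as coordinates or dates may still identify an asset and need their own treatment.

```python
import hashlib
import hmac


def pseudonymize(name, secret_key, prefix="WELL"):
    """Map a name to a stable code that cannot be reversed without the key."""
    digest = hmac.new(secret_key, name.encode("utf-8"), hashlib.sha256).hexdigest()
    # Keep only the first 10 hex characters to get a short, readable label.
    return f"{prefix}-{digest[:10]}"


# Hypothetical key and data; keep the real key private and out of Git.
secret = b"replace-with-a-private-key"
rows = [
    {"well": "Alpha-1", "oil_bbl": 100.0},
    {"well": "Bravo-2", "oil_bbl": 250.0},
    {"well": "Alpha-1", "oil_bbl": 110.0},
]

# Replace the well name in every row; all other columns are left untouched.
anonymized = [{**row, "well": pseudonymize(row["well"], secret)} for row in rows]
for row in anonymized:
    print(row)
```

Because the same name always maps to the same code, grouping and joining by well still work on the shared data set.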
Start solving daily engineering problems with R or Python, incorporating them into your work flow. If you can, avoid Excel or Excel-VBA. VBA’s purpose was not version control or reproducibility or data science, much less machine learning. Sticking to Office tools may keep you stuck with outdated practices or leave you unable to perform much richer and more productive data science. There is one more thing you may have noticed, and that is that Excel plots are very simplistic. They go back 30 years, and you run the risk of dumbing down your analysis, missing discoveries in your data, or failing to tell a compelling story, which is the purpose of data science anyway.
Learn and apply statistics everywhere, every time you can, on all petroleum-engineering activities you perform. Find what no other person can by using math, physics, and statistics. Data science is about making discoveries and answering questions on the data. Data science was invented by statisticians who, at that time, called it “data analysis.” An article I never get tired of reading and rereading is “50 years of Data Science” by David Donoho. Please, read it. It will explain statistics and its tempestuous, albeit tight, relationship with data science. Remember: Any additional oil and gas that you can find with data science and machine learning will be the cheapest hydrocarbons to produce.
Read what other disciplines outside petroleum engineering are doing in data science and machine learning. You will see how insignificant we are in the scheme of things. It is up to us to change that. Look at bioscience, biostatistics, genetics, robotics, medicine, cancer research, psychology, biology, ecology, automotive, finance. They are light years ahead of the oil and gas industry.
Read articles on the internet about data science. It does not matter if they use Python or R; you just have to learn what data science is about and how it could bring value to your everyday work flow. They may give you ideas of applications involving data in your petroleum-engineering area of expertise. They may not be data science per se now, but they most likely could be the next stepping stone. Additionally, most of the articles are free, as are hundreds of books, booklets, tutorials, and papers. We never had the chance to learn so much for so little. Somebody called this the era of democratization of knowledge and information. All you have to invest is time.
Start inquiring about machine learning. Do the same with artificial intelligence (AI). Nothing is better than knowing, at least, the fundamentals of what others are trying to sell you. There is so much noise and snake-oil marketing nowadays surrounding the words “machine learning” and “AI.” Two books I would recommend off the top of my head on AI are Computational Intelligence: A Logical Approach by David Poole, Alan Mackworth, and Randy Goebel and Artificial Intelligence: A Modern Approach by Stuart J. Russell and Peter Norvig. You will find that AI is not what you read in newspapers or articles. What’s more, I have a tip for you: If you see an article with the figure of a humanoid, or human-faced robot, or mechanical arms with some brain on it, skip it. That is not what AI is.
Review C++ and Fortran scientific code. I don’t mean you need to learn another programming language, but knowing what they can do will add power to your toolbox. Sooner or later, you will need Fortran, C, or C++ for reasons of efficiency and speed. Not for nothing, the best-in-class simulators and optimizers of today have plenty of Fortran routines under the hood.
Learn how to read from different file formats. The enormous variety of file formats in which you may find raw data is amazing. A lot of value could be brought to your daily activities by automating your data-analysis work flow using R or Python. Also, ask what different data formats your company uses for storing data. Get familiar with them. Try reading some chunks of that data. Try with logs, seismic, well tests, buildups, drilling reports, deviation surveys, geological data, process data, simulation output. Create tidy data sets out of them. Explore the data.
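As a small illustration of creating a tidy data set, the sketch below reshapes a "wide" export, with one column per month, into one observation per row. The layout, column names, and numbers are invented for the example; real company formats will differ and usually need their own reader.

```python
import csv
import io


def wide_to_tidy(lines, id_col, var_name, value_name):
    """Turn one-column-per-period rows into one-observation-per-row records."""
    tidy = []
    for row in csv.DictReader(lines):
        for col, val in row.items():
            # Skip the identifier column and any overflow fields (col is None).
            if col is None or col == id_col:
                continue
            # Skip missing measurements instead of inventing a value.
            if val is None or val.strip() == "":
                continue
            tidy.append({id_col: row[id_col], var_name: col, value_name: float(val)})
    return tidy


# Hypothetical monthly production export, one column per month.
wide = io.StringIO(
    "well,2020-01,2020-02\n"
    "A-01,3100,2950\n"
    "B-07,,4100\n"
)
for record in wide_to_tidy(wide, "well", "month", "oil_bbl"):
    print(record)
```

Once every row is a single well-month observation, filtering, grouping, and plotting become straightforward in either R or Python.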
Something that is more challenging is learning how to read and transform unstructured data, meaning data that is not in row/column (rectangular) format. The typical cases and those closest to us are the text output from simulators, optimizers, stimulation, or well design. This is some of the more difficult data to operate with, and it is where knowing regular expressions (regex) really pays off. Also consider how much data is coming as video and images. Today, plenty of algorithms deal with that kind of data through Matlab, Python, or R.
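To show where regex pays off, here is a minimal sketch that pulls numbers out of free-form report text and turns them into rows. The report layout is entirely hypothetical and does not correspond to any particular simulator; the pattern would have to be adapted to the actual output you work with.

```python
import re

# A signed decimal number such as 30.0 or -12 (no exponent handling here).
NUM = r"[-+]?\d+(?:\.\d+)?"

# One report line of the hypothetical format shown in `report` below.
LINE_RE = re.compile(
    r"TIME\s*=\s*(" + NUM + r")\s+DAYS\s+"
    r"OIL RATE\s*=\s*(" + NUM + r")\s+STB/DAY\s+"
    r"PRESSURE\s*=\s*(" + NUM + r")\s+PSIA"
)


def parse_report(text):
    """Extract (time, oil rate, pressure) records from report text."""
    rows = []
    for match in LINE_RE.finditer(text):
        days, rate, pressure = (float(g) for g in match.groups())
        rows.append({"days": days, "oil_rate_stb_d": rate, "pressure_psia": pressure})
    return rows


# Hypothetical simulator output mixing data lines with banners and warnings.
report = """
 *** HYPOTHETICAL SIMULATOR REPORT ***
 TIME =    30.0 DAYS   OIL RATE =  1250.5 STB/DAY   PRESSURE =  3410.2 PSIA
 WARNING: TIMESTEP CHOPPED
 TIME =    60.0 DAYS   OIL RATE =  1198.0 STB/DAY   PRESSURE =  3377.9 PSIA
"""
for row in parse_report(report):
    print(row)
```

Lines that do not match the pattern, such as the banner and the warning, are simply ignored, so the result is a rectangular data set ready for analysis.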
Learn something about virtual machines (VMs) with VirtualBox or VMware. Having several operating systems working at the same time on your PC (Windows, Linux, macOS) can be very useful. A lot of good data-science and machine-learning stuff in Linux is packaged as VMs that could be run under Windows very easily. These applications are ready to run without the need to install anything on the physical machine. A few months ago, I was able to download a couple of Linux VMs with a whole bunch of machine-learning and AI applications and test them with minimum effort. I have other VMs from Cloudera and Hortonworks that run big-data applications such as Hadoop and Spark. Another virtualization tool that you may want to learn is Docker containers. The concept is similar to that of virtual machines but lighter and less resource intensive. These tools will make your data science even more reproducible and help it stand the test of time.
Alfonso R. Reyes is a petroleum engineering data scientist and chief data scientist at his company, Oil Gains Analytics, in The Woodlands, Texas. He has worked as a production technologist with Petronas in Kuala Lumpur and was responsible for managing network modeling applications for the Production Enhancement line. Reyes has worked in the upstream areas of petroleum engineering in technology deployment, project management, and business development positions. Before joining Petronas, he worked in the areas of production optimization, nodal analysis, and electrical-submersible-pump software applications for oil and gas wells with IHS based in Denver. Reyes’ previous assignments have been in areas of intelligent wells, smart-well completions, unconventional oil, production engineering, and well and reservoir surveillance for Shell, Halliburton, WellDynamics, and Optimization Petroleum Technologies, all based in Houston. Reyes has traveled and worked extensively in Peru, Argentina, Brazil, Mexico, Malaysia, the United States, and other countries around the world. He holds a BS degree in petroleum engineering from the National University of Engineering in Lima, Peru. Reyes is passionate about computational physics, digital electronics, and history. He has published technical papers at Society of Petroleum Engineers, International Association of Drilling Contractors, and other international conferences.