Ten Research Challenge Areas in Data Science
To drive progress in the field of data science, the authors propose 10 challenge areas for the research community to pursue. Because data science is broad, with methods drawing from computer science, statistics, and other disciplines and with applications appearing in all sectors, these challenge areas speak to the breadth of issues spanning science, technology, and society. The authors preface their enumeration with metaquestions about whether data science is a discipline. They then describe each of the 10 challenge areas. The goal of this article is to start a discussion on what could constitute a basis for a research agenda in data science, while recognizing that the field of data science is still evolving.
Although data science builds on knowledge from computer science, engineering, mathematics, statistics, and other disciplines, data science is a unique field with many mysteries to unlock: fundamental scientific questions and pressing problems of societal importance.
This article enumerates 10 areas of research in which to make progress to advance the field of data science. The goal is to start a discussion on what could constitute a basis for a research agenda in data science, while recognizing that the field of data science is still evolving.
Ten Research Areas
What are the research challenge areas that drive the study of data science? Here is a list of 10. They are not in any priority order, and some of them are related to each other. They are phrased as challenge areas, not challenge questions; each area suggests many questions. They are not necessarily the top 10, but they are a good 10 to start the community discussing what a broad research agenda for data science might look like.
1. Scientific Understanding of Learning, Especially Deep Learning Algorithms
As much as we admire the astonishing successes of deep learning, we still lack a scientific understanding of why deep learning works so well, though we are making headway.
2. Causal Reasoning
Machine learning is a powerful tool to find patterns and to examine associations and correlations, particularly in large data sets. While the adoption of machine learning has opened many fruitful areas of research in economics, social science, public health, and medicine, these fields require methods that move beyond correlational analyses and can tackle causal questions. A rich and growing area of current study is revisiting causal inference in the presence of large amounts of data.
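To illustrate the gap between correlation and causation described above, here is a minimal sketch on simulated data. It is not a method from the article. It assumes a single hypothetical binary confounder `z` that drives both a treatment `x` and an outcome `y`, while `x` has no causal effect on `y`. The function names and parameters are illustrative. A naive comparison of treated and untreated groups shows a large association; stratifying on the confounder, a standard adjustment technique, removes most of it.

```python
import random
from statistics import mean


def simulate(n=20000, seed=0):
    """Simulate rows (z, x, y) where z confounds x and y; x does not affect y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.random() < 0.5                      # hypothetical confounder
        x = rng.random() < (0.8 if z else 0.2)      # treatment depends on z
        y = (2.0 if z else 0.0) + rng.gauss(0, 1)   # outcome depends only on z
        rows.append((z, x, y))
    return rows


def naive_effect(rows):
    """Difference in mean outcome between treated and untreated rows."""
    treated = [y for _, x, y in rows if x]
    control = [y for _, x, y in rows if not x]
    return mean(treated) - mean(control)


def adjusted_effect(rows):
    """Stratify on z and average the within-stratum differences by stratum size."""
    total = len(rows)
    effect = 0.0
    for level in (False, True):
        stratum = [(x, y) for z, x, y in rows if z == level]
        treated = [y for x, y in stratum if x]
        control = [y for x, y in stratum if not x]
        effect += (len(stratum) / total) * (mean(treated) - mean(control))
    return effect


rows = simulate()
print("naive difference:   ", round(naive_effect(rows), 2))
print("adjusted difference:", round(adjusted_effect(rows), 2))
```

The adjustment works here only because the confounder is observed and known. In practice, identifying which variables to adjust for is itself part of the causal question.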
3. Precious Data
Data can be precious for one of three reasons: the data set is expensive to collect; the data set contains a rare event (low signal-to-noise ratio); or the data set is artisanal, that is, small, task-specific, or targeted at a limited audience.
4. Multiple, Heterogeneous Data Sources
For some problems, we can collect lots of data from different data sources to improve our models and to increase knowledge.
5. Inferring From Noisy or Incomplete Data
The real world is messy, and we often do not have complete information about every data point. Yet, data scientists want to build models from such data to do prediction and inference. This long-standing problem in statistics comes to the fore as (1) the volume of data, especially about people, that we can generate and collect grows unboundedly; (2) the means of generating and collecting data is not under our control, for example, data from mobile phone and web apps vary—by design—across different users and across different populations; and (3) many sectors, from finance to retail to transportation, embrace the desire to do real-time personalization.
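As a small illustration of why incomplete data calls for care, the sketch below compares two common baseline treatments of missing values: dropping incomplete entries and filling gaps with the observed mean. The data values and function names are made up for illustration. Mean imputation preserves the mean but understates the spread, which can make downstream inference look more certain than it should.

```python
from statistics import mean, pvariance


def complete_case(values):
    """Keep only the observed entries (listwise deletion)."""
    return [v for v in values if v is not None]


def mean_impute(values):
    """Replace each missing entry with the mean of the observed entries."""
    fill = mean(complete_case(values))
    return [fill if v is None else v for v in values]


# Hypothetical measurements; None marks a missing value.
measurements = [2.0, None, 4.0, 6.0, None, 8.0]

observed = complete_case(measurements)
imputed = mean_impute(measurements)

print("mean (observed):    ", mean(observed))
print("mean (imputed):     ", mean(imputed))
print("variance (observed):", pvariance(observed))
print("variance (imputed): ", pvariance(imputed))
```

Neither baseline is appropriate when values are missing for reasons related to the values themselves; that case needs an explicit model of the missingness.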
6. Trustworthy AI
We have seen rapid deployment of systems using artificial intelligence (AI) and machine learning in critical domains such as autonomous vehicles, criminal justice, health care, hiring, housing, human resource management, law enforcement, and public safety, where decisions taken by AI agents directly impact human lives. Consequently, there is increasing concern about whether these decisions can be trusted to be correct, fair, ethical, interpretable, private, reliable, robust, safe, and secure, especially under adversarial attacks.
7. Computing Systems for Data-Intensive Applications
Traditional designs of computing systems have focused on computational speed and power: the more cycles, the faster the application can run. Today, the primary focus of applications, especially in the sciences, is data. Novel special-purpose processors are now commonly found in large data centers.
8. Automating Front-End Stages of the Data Life Cycle
Although the excitement in data science is due largely to the successes of machine learning, and more specifically deep learning, we need to prepare the data for analysis before we can use machine learning algorithms. The early stages in the data life cycle are still labor intensive and tedious. Data scientists, drawing on both computational and statistical tools, need to devise automated methods that address data collection, data cleaning, and data wrangling, without losing other desired properties.
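The kind of cleaning step that is tedious by hand can be sketched as a small function. This is only an illustration under assumed inputs: the record layout (a name and an age) and the function name `clean_records` are hypothetical, and real pipelines face far messier cases. The sketch normalizes whitespace and capitalization, coerces ages to integers, marks unparseable ages as missing, and drops duplicates that appear only after normalization.

```python
def clean_records(raw_rows):
    """Normalize (name, age) pairs and drop duplicates, preserving order."""
    seen = set()
    cleaned = []
    for name, age in raw_rows:
        # Collapse runs of whitespace and standardize capitalization.
        name = " ".join(name.split()).title()
        # Coerce age to an integer; treat anything unparseable as missing.
        try:
            age = int(str(age).strip())
        except ValueError:
            age = None
        key = (name, age)
        if key not in seen:
            seen.add(key)
            cleaned.append({"name": name, "age": age})
    return cleaned


raw = [
    ("  ada   lovelace ", "36"),
    ("Ada Lovelace", 36),        # duplicate once normalized
    ("alan turing", "n/a"),      # age cannot be parsed
]
for record in clean_records(raw):
    print(record)
```

Each rule here encodes a judgment call (for example, that differing capitalization means the same person), which is part of why automating these stages remains a research challenge.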
9. Privacy
For many applications, the more data we have, the better the model we can build. One way to get more data is to share data; for example, multiple parties can pool their individual data sets to collectively build a better model than any one party could build alone. However, in many cases, due to regulation or privacy concerns, we need to preserve the confidentiality of each party's data set.
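One way to picture pooling without disclosure is additive secret sharing, a standard building block of secure multiparty computation. The sketch below is an illustration, not a technique the article prescribes. It assumes honest parties, non-negative integer inputs smaller than the modulus, and a simulated exchange inside one process; the names `PRIME`, `make_shares`, and `secure_sum` are hypothetical. Each party splits its private value into random shares, and only sums of shares are ever revealed.

```python
import secrets

# Modulus for share arithmetic; inputs and their total must stay below it.
PRIME = 2**61 - 1


def make_shares(value, n_parties):
    """Split value into n_parties random shares that sum to value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]


def secure_sum(private_values):
    """Compute the total of all private values from shares alone."""
    n = len(private_values)
    # share_matrix[i][j] is the share that party i sends to party j.
    share_matrix = [make_shares(v, n) for v in private_values]
    # Each party j publishes only the sum of the shares it received.
    partial_sums = [
        sum(share_matrix[i][j] for i in range(n)) % PRIME for j in range(n)
    ]
    return sum(partial_sums) % PRIME


# Hypothetical private counts held by three parties.
print(secure_sum([120, 45, 300]))
```

Any single share is uniformly random, so it reveals nothing about the value it came from. The total itself is still disclosed, and building a full model this way requires considerably more machinery.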
10. Ethics
Data science raises new ethical issues. They can be framed along three axes: (1) the ethics of data: how data are generated, recorded, and shared; (2) the ethics of algorithms: how artificial intelligence, machine learning, and robots interpret data; and (3) the ethics of practices: devising responsible innovation and professional codes to guide this emerging science and to define institutional review board criteria and processes specific to data.
22 October 2020