What is data science?
Data science is a multidisciplinary approach to finding, extracting, and surfacing patterns in data through a fusion of analytical methods, domain expertise, and technology. This approach generally includes the fields of data mining, forecasting, machine learning, predictive analytics, statistics, and text analytics.
As data is growing at an alarming rate, the race is on for companies to harness the insights in their data. However, most organizations are faced with a shortage of experts to analyze their big data to find insights and explore issues the company didn’t even know it had. To realize and monetize the value of data science, organizations must infuse predictive insights, forecasting, and optimization strategies into business and operational systems. Many businesses are now empowering their knowledge workers with platforms that can help them conduct their own machine learning projects and tasks. Being able to extract trends and opportunities in the massive amounts of data being infused into a business will give an organization a competitive advantage.
Data science includes descriptive, diagnostic, predictive, and prescriptive capabilities. This means that with data science, organizations can use data to figure out what happened, why it happened, what will happen, and what they should do about the anticipated result.
Understanding how data science works
Conceptually, the data science process is very simple to understand and involves the following steps:
- Understand the business problem
- Gather and integrate the raw data
- Explore, transform, clean, and prepare the data
- Create and select models based on the data
- Test, tune, and deploy the models
- Monitor, test, refresh, and govern the models
Understand the business problem
The process of data science starts with understanding the problem that the business user is trying to solve. For instance, a business user might want to ask and understand “How do I increase sales?” or “What techniques work best to sell to my customers?” These are very broad, ambiguous questions that don’t lead to an immediately researchable hypothesis. It is the data scientist’s job to break these business problems down into researchable and testable hypotheses. For instance, “How do I increase sales?” could be broken down into several smaller questions such as “What conditions lead to the increased sales? Was it a promotion, weather, or seasonality?” “How can we optimize our sales based on constraints?” and “What are the sales likely to be tomorrow/next week/next month for each store?” The important thing to remember is that one needs to understand the business decision that needs to be made, and work backwards from there. How will your business process change if you could predict something an hour/day/week/month into the future?
Gathering and integrating the raw data
Once the business problem is understood, the next step involves gathering and integrating the raw data. First, the analyst has to see what data is available. Often, data will be in many different formats and many different systems so data wrangling and data prepping techniques are often used to convert the raw data into a useable format suitable for the specific analytic techniques that will be used. If the data is not available, data scientists, data engineers, and IT generally collaborate to bring new data into a sandbox environment for testing.
Explore and prepare the data
Now, the data can be explored. Most data science practitioners will employ a data visualization tool that will organize the data into graphs and visualizations to help them see general patterns in the data, high-level correlations, and any potential outliers. This is also the time when the analyst starts to understand what factors may help solve the problem. Now that the analyst has a basic understanding of how the data behaves and potential factors that may be important to consider, the analyst will transform, create new features (aka variables), and prepare the data for modeling.
Test, tune, and deploy models
This is the point when most analysts will use algorithms to create models from the input data using techniques such as machine learning, deep learning, forecasting, or natural language processing (aka text analytics) to test different models. Statistical models and algorithms are applied to the dataset to try and generalize the behavior of the target variable (for example, what you’re trying to predict) based on the input predictors (for example, factors that influence the target).
Outputs are usually predictions, forecasts, anomalies, and optimizations that can be displayed in dashboards or embedded reports, or infused directly into business systems to make decisions close to the point of impact. Then, after the models are deployed into the visualization or business systems, they are used to score new input data that it has never been seen before.
Monitor, test, refresh, and govern the models
After the models are deployed, they must be monitored so they can be refreshed and retrained as data shifts due to changing behavior of real world events. Thus, it is imperative that organizations have a model operations strategy in place to govern and manage changes to production models.
In addition to deploying models to dashboards and production systems, data scientists may also create sophisticated data science pipelines that can be invoked from a visualization or dashboard tool. Oftentimes, these have a reduced and simplified set of parameters and factors that can be adjusted by a citizen data scientist. This helps address the skills shortage mentioned above. Thus, a citizen data scientist, often a business or domain expert, can select the parameters of interest and run a very complex data science workflow without having to understand the complexity behind it. This allows them to test different scenarios without having to involve a data scientist.
In summary, data scientists tell a story using data and then provide predictive insights that the business can use for real world applications. The process used, as shown in the graphic below, is:
- Input data
- Prep data
- Apply machine learning
- Deploy, score, and manage models
- Output data
Key steps in the data science process
Business understanding
- Understand the business decision to be made
- Determine what data is needed to make the decision
- Realize how your business will change as a result of the decision
- Determine the architecture needed to support the decision
- Assemble a cross-functional technical and project management team
Understand the machine learning process
- Data acquisition and integration
- Data exploration, preparation, and cleansing
- Data preprocessing, transformation, and feature generation
- Model development and selection
- Model testing and tuning
- Model deployment
Understand the model operations and governance process
- Model repository, documentation, and version control
- Model scoring, API framework, and container strategy
- Model execution environment
- Model deployment, integration, and results
- Model monitoring, testing, and refresh
What skills are needed for data science?
Business Skills: Collaboration, teamwork, communication, domain expertise/business knowledge
Analytics Skills: Data prep, machine learning, statistics, geospatial analytics, data visualization
Computer Science/IT Skills: Data pipelines, model deployment, monitoring, management, programming/coding
Who uses data science?
“The Hidden Talent” aka Citizen Data Scientists: Use data and analytics on a daily basis to solve specific business problems with a point-and-click interface.
“The Business-driven”: Focus on business unit-led initiatives and improving business operations.
“The Specialists”: Work across all functions and business units to solve problems and collaborate with IT to operationalize machine learning models. Attain buy-in and funding from executives.
“The Hotshots”: Leverage a multitude of data sources to solve new problems, prototype solutions using machine learning, and run data science workflows at scale. Favor tools like R, Python, Scala, Hadoop, and Spark.
"The Untapped Potential": Want to jump in, but don’t feel they have the support or training or don’t work for an organization with technology offering reusable templates.
Top data science tasks
- Problem understanding and analysis
- Data collection, data prep/cleaning, and basic exploratory data analysis
- Model development and testing
- Model deployment, monitoring, and governance
- Communication of findings to business decision-makers
What challenges does data science address?
Below are some examples of the challenges that data science is addressing across different industries:
Energy
Data science is mostly being used in the energy sector to optimize exploration, production, and operations while anticipating demands such as:
- Predict equipment failure
- Forecast future oil volumes and prices
- Optimize distribution
- Reduce emissions
- Analyze ground composition
- Characterize reservoirs
Finance and insurance
In the finance and insurance industry, data science is mostly focused on reducing risks, detecting fraud, and optimizing the customer experience. Some examples of where data science is used are:
- Predict credit risk
- Detect fraud
- Analyze customers
- Manage portfolio risk
- Determine likelihood to churn
- Comply with regulations such as SOX, Basel II
Healthcare
Data science in healthcare is mostly used to improve quality of care, improve operations, and reduce costs.
- Predict disease risk
- Detect fraudulent claims
- Prescribe personalized medicine doses
- Analyze images to detect cancers
- Manage claims
- Improve patient safety
- Determine who is most at risk
Pharmaceutical
Data science in the pharmaceutical sector is mainly used to ensure safety, product quality, and drug efficacy such as:
- Determine golden batch
- Analyze clinical trial
- Trace products
- Analyze stability & shelf life
- Validate reporting and analytics for regulatory compliance
- Analyze manufacturing processes, data
Manufacturing
In manufacturing, data science helps optimize processes, improve quality, and monitor suppliers. Some examples are:
- Improve yields
- Reduce scrap, rework, & recalls
- Detect warranty fraud
- Comply with regulations
- Predict & prevent equipment failures
Challenges that data scientists face
Inaccessible data
Addressed by:
- Easily combining data from multiple, disparate sources into a virtual data layer
- Visually manipulating, cleaning, and transforming data to make it ready for analysis
- Using introspection and relationship discovery to understand and validate data relationships for model building
Dirty data
Addressed by:
- AI fueled visual wrangling to automatically suggest transformations, remove outliers, and clean data
- Automated data health check to fill in missing values, remove unimportant variables, and prepare data for analytics
- Formatting and preparing data across disparate sources at scale
Limited talent & expertise
Addressed by:
- Using automated recommendations and visual insights to make sense of complexity
- Harnessing the creativity of the entire team, not just a few data scientists, and collaborating across the end-to-end analytic lifecycle
- Creating reusable parameterized templates that can be run by citizen data scientists to scale machine learning
Results not being used
Addressed by:
- Simplifying deployment to operational systems to embed machine learning into business processes at the point of impact
- Operationalizing data science with model monitoring, retraining, and governance
- Ensuring successful handoffs across the end-to-end analytic lifecycle: data pipeline, model building, scoring, and app development
Solving data science challenges
Data Science for Everyone: Democratize and collaborate on data science with automation, reusable templates, and a common collaborative framework for cross functional teams
Accelerate Innovation: Rapidly prototype new, flexible solutions with native algorithms, open source, and partner ecosystems while ensuring governance
AnalyticOps: Monetize the value of data science by systematically focusing on its operations through pipeline monitoring, management, updating, and governance
Training: Provide education and training to citizen data scientists and others who want to learn data science practices.
Center of Excellence: Establish a CoE to promote best practices and foster innovation and reusability so that data science can be scaled across the enterprise