Why understanding domain matters for a data scientist


Data science is a field that cuts across many different domains. “Wherever you find data, you can apply science” – Data Science.

With the advent of Hadoop and big data, storing and processing data has become much easier. The question now is: what can be done with the data?

  • Finding loyal customers
  • Identifying bank defaulters
  • Identifying likely bank loan takers
  • Predicting demand for a product
  • Customer segmentation prior to marketing
  • Fuel savings in logistics
  • Warehouse planning
  • Server outage prediction in the telecom industry
  • Star labelling in astronomy
  • Supporting better health care decisions
  • Predicting who will leave a company
  • Measuring emotion in a resume
  • HR analytics to measure the performance of a team
  • On-time delivery in SCM
  • Targeting the right customers
  • Churn prediction in telecom
  • Epilepsy prediction in health care
  • Estimating the growth rate of cancer cells
  • Better traffic management
  • Understanding the flow of rivers to predict the movement of soils
  • Better energy management
  • Optimization in every sector, especially manufacturing
  • Cost cutting in every sector
  • Personalized offers in e-commerce
  • Recommendations
  • Digital marketing
  • Fraud management
  • Stock forecasting
  • Better air traffic control
  • And many more...

   
Yes, there are many more applications, and data science will shape better living in the coming years. But the first and foremost step for a data scientist is to understand enough of the domain, for the following reasons:
1. To come across as approachable in client meetings.
2. To understand the client’s business well enough to build better models.
3. To extract domain-related features during feature engineering.
4. To build a data product, we have to understand what business problem we are addressing.
5. To help the customer achieve their targets.
6. To do better feature selection.
7. Your solutions will be implemented in the business, so build quick, easy and better solutions.
8. To think outside the box.
9. Creative solutions come up when we understand the domain we are dealing with.
10. Accuracy measures vary from domain to domain, so before validating a model, decide which accuracy measures make sense in the domain you are dealing with.
11. To know which visualizations make sense.
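
The point about accuracy measures can be illustrated with a small sketch. It assumes a hypothetical fraud-detection setting with made-up labels, where only 2 of 20 transactions are fraudulent. A model that never flags fraud still looks good on plain accuracy, while recall shows that it catches nothing. The data and function names are illustrative assumptions, not taken from a real project.

```python
# Hypothetical labels: 1 = fraudulent transaction, 0 = legitimate.
# Only 2 of the 20 transactions are fraud, so the classes are imbalanced.
actual = [0] * 18 + [1] * 2
predicted = [0] * 20  # a "model" that never flags fraud


def accuracy(actual, predicted):
    """Share of all predictions that are correct."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def recall(actual, predicted, positive=1):
    """Share of the truly positive cases that the model caught."""
    caught = sum(a == positive and p == positive
                 for a, p in zip(actual, predicted))
    total = sum(a == positive for a in actual)
    return caught / total if total else 0.0


print(accuracy(actual, predicted))  # 0.9
print(recall(actual, predicted))    # 0.0
```

In a domain where missing a positive case is costly, recall (or a similar measure) may be the more meaningful number to report.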

Finally, by understanding the domain, a data scientist can give customers exceptional help and solutions they can accept.

Data Science in Supply Chain Management - Predictive Analytics

Data science, as you all know, is the science we apply to data.

From raw materials to the consumer, vast amounts of data are generated. This data is valuable and can yield useful insights.

The major problem is that companies often do not know how to capture and leverage the data.

Supply chain management is a vast domain that particularly needs the help of data scientists. Every day, thousands of trucks hit the road to meet their delivery targets. If I, as a manager, can track each truck and each driver's driving patterns, it helps me provide better service to customers.

In this digital age, data can be captured by sensors. If the following data can be captured, it could generate savings in the millions:

  • Longitude and latitude (geolocation)
  • Speed
  • Truck halts
  • Whether the driver is drunk (yes or no)
  • Tyre air pressure
All of the above can be captured using sensors, collected using Apache Flume or Apache Chukwa, and stored in HBase on Hadoop.

Using the collected data, the following batch analyses can be done:
  • Average speed of each truck on a given day
  • Maximum speed of each truck on a given day
  • Whether the driver was drunk
  • Number of halts
  • Average speed on highways
  • Average speed in crowded areas
  • Distance from the delivery point
  • Average speed in accident-prone areas
  • Driving patterns of each driver on a given day
  • Average fuel usage on a given day
  • Number of on-time deliveries
  • Number of delayed deliveries
  • Average speed on curved roads
  • Speed during accidents
  • Number of red signals crossed
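
A few of the batch metrics above can be sketched in plain Python. This is a minimal illustration, not a production pipeline. The record layout `(truck_id, day, speed_kmph)`, the sample values, and the rule that a speed of 0 means the truck is halted are all assumptions made for the example. Readings are assumed to arrive in time order for each truck.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sensor readings: (truck_id, day, speed_kmph), in time order.
readings = [
    ("T1", "2015-01-01", 60),
    ("T1", "2015-01-01", 0),
    ("T1", "2015-01-01", 80),
    ("T2", "2015-01-01", 40),
    ("T2", "2015-01-01", 50),
    ("T2", "2015-01-01", 0),
    ("T2", "2015-01-01", 0),
]


def daily_summary(readings):
    """Group readings by (truck, day) and compute simple batch metrics."""
    speeds = defaultdict(list)
    for truck_id, day, speed in readings:
        speeds[(truck_id, day)].append(speed)

    summary = {}
    for key, values in speeds.items():
        moving = [v for v in values if v > 0]
        # Assumption: a halt starts when speed drops to 0 after moving
        # (or when the first reading of the day is already 0).
        halts = sum(
            1 for i, v in enumerate(values)
            if v == 0 and (i == 0 or values[i - 1] > 0)
        )
        summary[key] = {
            "avg_speed": mean(moving) if moving else 0,
            "max_speed": max(values),
            "halts": halts,
        }
    return summary


for key, metrics in daily_summary(readings).items():
    print(key, metrics)
```

On real volumes the same grouping would more likely run as a batch job on the Hadoop stack, but the logic per truck and day stays the same.
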
Using the data for real-time analysis, the following value-added services can be provided:
  • If a driver is 100 km from the delivery point with 2 hours left for on-time delivery, a data science model built on historical data could suggest the route with the least traffic and so help the delivery arrive on time.
  • If a driver is travelling at high speed on public roads, an immediate message can be sent asking the driver to slow down.
  • If it is a festive season and historical data shows that some roads are completely blocked, the model can avoid those roads and suggest the other possible routes.
  • If a sensor detects that the driver is drunk, the driver can be asked to stop immediately.
  • If a driver's driving pattern is very different from usual and very poor, it is a sign that something is wrong.
  • Fuel savings alone can save millions.
  • With the right models and machine learning algorithms, all of this can be brought under control.
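
The first two real-time checks above can be sketched as simple rules. This is only a rule-based illustration under stated assumptions: the speed limit of 80 km/h, the function name `check_truck` and the alert wording are hypothetical, and a real system would use a trained routing or traffic model rather than a fixed threshold.

```python
SPEED_LIMIT_KMPH = 80  # hypothetical limit for public roads


def check_truck(distance_km, hours_left, current_speed_kmph):
    """Return a list of alert messages for one live sensor reading."""
    alerts = []
    if current_speed_kmph > SPEED_LIMIT_KMPH:
        alerts.append("Slow down: speed limit exceeded")
    if hours_left <= 0:
        alerts.append("Delivery deadline has already passed")
    else:
        # Average speed needed to cover the remaining distance in time.
        required_speed = distance_km / hours_left
        if required_speed > SPEED_LIMIT_KMPH:
            alerts.append("On-time delivery at risk: consider a faster route")
    return alerts


# 100 km to go with 2 hours left needs 50 km/h, but the driver is speeding.
print(check_truck(100, 2, 95))
# 100 km to go with 1 hour left needs 100 km/h, above the assumed limit.
print(check_truck(100, 1, 60))
```
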
Finally, I would like to conclude that with data science, a journey plan for each driver can be generated automatically, and every point from journey start to end can be analysed in depth and summarized. With predictive analytics, the performance of each driver can be predicted and the right driver can be assigned to each delivery.

Feature Engineering in Machine Learning - Data Science

Feature engineering is the process by which features, or predictor variables, are extracted from the available datasets.

This is probably the most important and most difficult part of building data science models. The following are the key points to remember for feature engineering:

1. Learn enough about the domain before getting into feature extraction.
2. Try to extract features that help you predict the outcome of the class variable.
3. Extract as many features as you can.
4. An important point to remember: before feeding data to a machine learning algorithm, make sure that each row represents the features of a unique entity and each column represents a unique feature.
5. Descriptive statistics play a major role in feature engineering.
6. The features extracted may vary from scientist to scientist and depend largely on the creativity of the individual.
7. Many researchers worry about the importance of the features they extract, but in practice, once feature extraction is done, there are many statistical techniques and machine learning algorithms that help identify the important ones.

Example:

Loyal Customer Analysis:

Problem Statement:

Identify the loyal customers from historical demographic, transaction, and offer data.

Suppose we have the following data from a retail store:

1. Customer ID, Transaction ID, Product, Brand, Category, Company, Date, Quantity
2. Customer ID, Demographic details
3. Offer ID, Offer details

Now think for a while about which features need to be extracted to identify a loyal customer. In the retail domain, the following features need to be extracted:
1. Recency (most recent visit)
2. Frequency of visits in the past 7, 14, 30, 60, 90 and 180 days
3. Monetary value (money spent) in the past 7, 14, 30, 60, 90 and 180 days
4. Quantity bought
5. Favorite category
6. Favorite brand
7. Favorite company

Finally, the input to the machine learning algorithm looks something like this:

Customer ID, Recency, Frequency of visits, Monetary invested, Quantity, Favorite Category, Brand, Company, LoyalCustomer(Yes or No)
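
As a rough sketch of how transaction rows might be collapsed into one feature row per customer, the example below computes recency, frequency, monetary value, quantity and favorite category. The sample transactions, the `amount` field (the source data lists only quantity), the single 30-day window and the function name `extract_features` are all assumptions for illustration. Frequency is counted as the number of transactions in the window, which assumes one transaction per visit.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical transactions: (customer_id, date, category, quantity, amount).
transactions = [
    ("C1", date(2015, 3, 5), "Dairy", 2, 10.0),
    ("C1", date(2015, 3, 20), "Dairy", 1, 5.0),
    ("C1", date(2015, 3, 28), "Bakery", 3, 12.0),
    ("C2", date(2015, 1, 15), "Snacks", 1, 4.0),
]


def extract_features(transactions, today, window_days=30):
    """Collapse transaction rows into one feature row per customer."""
    by_customer = defaultdict(list)
    for row in transactions:
        by_customer[row[0]].append(row)

    features = {}
    for customer_id, rows in by_customer.items():
        # Transactions that fall inside the look-back window.
        recent = [r for r in rows if (today - r[1]).days <= window_days]
        favorite = Counter(r[2] for r in rows).most_common(1)[0][0]
        features[customer_id] = {
            "recency_days": min((today - r[1]).days for r in rows),
            "frequency": len(recent),
            "monetary": sum(r[4] for r in recent),
            "quantity": sum(r[3] for r in rows),
            "favorite_category": favorite,
        }
    return features


features = extract_features(transactions, today=date(2015, 3, 31))
for customer_id, row in features.items():
    print(customer_id, row)
```

Each resulting row describes one unique customer, which is the shape a machine learning algorithm expects. The other windows (7, 14, 60, 90, 180 days) could be added by calling the same logic with different `window_days` values.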

One of my projects started with 22 GB of data, which came down to 54 MB once feature extraction was done.

This much smaller dataset resulted in a prediction model with 95% accuracy.


               



Data Scientist vs Data Analyst

"We live in a data-driven world. Increasingly, the efficient operation of organizations across sectors relies on the effective use of vast amounts of data. Making sense of big data is a combination of organizations having the tools, skills and more importantly, the mindset to see data as the new "oil" fueling a company. Unfortunately, the technology has evolved faster than the workforce skills to make sense of it and organizations across sectors must adapt to this new reality or perish."

     --Andreas Weigend, Ph.D Stanford, Head of the Social Data Lab at Stanford, former Chief Scientist Amazon.com



Data Analysts:

Data analysts translate numbers into plain English. Every business collects data, whether it's sales figures, market research, logistics, or transportation costs.
A data analyst's job is to take that data and use it to help companies make better business decisions.

Tasks:
  • Analyze the data provided
  • Build dashboards and generate reports.
  • Use descriptive statistics to summarize the data.
  • Use inferential statistics to infer from sample to population, if it is survey data.
  • Use traditional statistical techniques such as logistic regression for classification and ARIMA for time series forecasting.
  • They predominantly work with commercial enterprise software (Tableau, QlikView, SAS).
  • In most cases, they deal with clean data.
  • The role is restricted to a particular task or a particular domain.

Data Scientist:

A data scientist represents an evolution from the business or data analyst role.

Data science is, in general terms, the extraction of knowledge from data. The key word in this job title is "science," with the main goals being to extract meaning from data and to produce data products.

Tasks:
  • Since they work across various domains, the first step in a project is understanding the domain.
  • Visualize the data using open source tools such as R and Python.
  • Understand the data by plotting against outcome variables and using descriptive statistics.
  • Pre-process the data, cleaning and transforming it in a meaningful way.
  • Feature engineering to extract features from the data.
  • Feature selection to pick out the important features using machine learning and statistics.
  • Create train, test and validation data sets.
  • Use machine learning algorithms to build a prediction model.
  • Design data products that help customers make the right decisions.
  • Work mostly with open source software such as R for machine learning and Python for machine learning and text analytics.
  • Data scientists also work with big data.
  • They are familiar with Hadoop and the related stack.
  • The data scientist role has been described as “part analyst, part artist.”
  • A data scientist will most likely explore and examine data from multiple disparate sources.
  • A data scientist does not simply collect and report on data, but also looks at it from many angles, determines what it means, then recommends ways to apply the data.
  • Armed with data and analytical results, a top-tier data scientist will then communicate informed conclusions and recommendations across an organization’s leadership structure.
  • Creativity combined with common sense is the strength of a data scientist.
  • Out-of-the-box thinking.
  • In short, data scientists are unicorns in this data world.
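
The step of creating train, test and validation data sets from the task list above can be sketched with the standard library alone. The 60/20/20 proportions, the fixed seed and the function name `split_dataset` are assumptions for illustration. In practice a library routine, or a split that respects time order or class balance, may be more appropriate.

```python
import random


def split_dataset(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle rows and split them into train, validation and test lists."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


rows = list(range(100))  # stand-in for 100 feature rows
train_set, validation_set, test_set = split_dataset(rows)
print(len(train_set), len(validation_set), len(test_set))  # 60 20 20
```

The model is fitted on the training set, tuned on the validation set, and evaluated once on the test set, so the reported accuracy reflects data the model has not seen.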