What is data science?
Let us try to understand what we did before the advent of data science and machine learning. Is it a completely new field that appeared from nowhere? An earlier term for much the same work was business analytics. So what were we doing before the advent of data science? We were doing descriptive analysis, reporting and dashboarding. We were building models, but not for prediction: we used them to find out why a particular event happened, not to predict what would happen in the future. That was the state of affairs.
So what is data science? Data science is a "concept to unify statistics, data analysis, machine learning and their related methods" in order to "understand and analyze actual business phenomena" with data. First, this implies one needs a good knowledge of mathematics and statistics. Second, one needs domain knowledge, that is, an understanding of business problems from the functional point of view (marketing, risk, finance, human resources, operations, etc.), along with sectoral or industry knowledge such as retail, telecom, financial services, manufacturing and insurance. Third, one needs an analytical tool or software, which involves some computer programming, to build the models. Data science is therefore an interdisciplinary field.
Data science is multidisciplinary in nature
Data science is all about solving a business problem with the help of data
Have you heard of linear regression, or seen an equation like y = mx + c? Fitting this equation to data is, in its simplest form, a machine learning algorithm. A problem that has both a dependent variable (y) and independent variables (x) is handled by a supervised machine learning algorithm. A problem that has no dependent variable, only independent variables, is handled by an unsupervised machine learning algorithm. This also implies that anyone who aspires to become a data scientist needs to keep building their knowledge of statistics along the way.
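To make y = mx + c concrete, here is a minimal sketch that fits the slope m and the intercept c to a handful of points using ordinary least squares. The data (car age against resale price) and the function name fit_line are made up for illustration; they are not taken from any real portal.

```python
# Hypothetical toy data: x = age of a car in years, y = resale price (in thousands).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [9.0, 8.0, 7.0, 6.0, 5.0]


def fit_line(xs, ys):
    """Fit y = m*x + c by ordinary least squares and return (m, c)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    c = mean_y - m * mean_x
    return m, c


m, c = fit_line(xs, ys)
print(f"y = {m:.2f} * x + {c:.2f}")
```

Here y (price) is the dependent variable and x (age) is the independent variable, so this is a supervised problem.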
Data Science work flow for data products
In this blog, we are going to learn about the overall data science work flow with a real-world example.
A simple example from everyday life is reselling a two-wheeler or four-wheeler. In this case you are the seller. Or think of it the other way: you wish to buy a pre-owned two-wheeler or four-wheeler. Either way, the problem we face as a buyer or seller is the same: what is the right price of a resale car?
In the last 5 years, many intermediaries (brokers, in our local language) have flourished who help buyers and sellers get the right price, such as https://www.cars24.com/, https://droom.in, www.cardekho.com/, and www.carwale.com/.
Figure: Data Science work flow
Stage 1: Business problem and opportunity identification
To simplify the process of determining the price or value of a resale car, these portals have developed machine learning models that predict it. This is our business problem: we wish to predict the price of a resale car. It is also an opportunity to build a data product. Let us understand the process of finding the price of a resale car.
Visit the https://www.cars24.com/ website. Scroll down the page and click on “Start car valuation”. You will be asked to enter various details of your car, such as brand, model, year of manufacturing, variant, car registration state and kilometers driven, along with your email id and mobile number. Go ahead and enter these details on the website. Once you do, you will get the value of your car.
The question to ask is how cars24 is able to determine the value of your car. What goes on behind the scenes in predicting the price of a used car? The answer is that a machine learning model has been built to predict the price. In this case, possibly a linear regression algorithm is being used. To build such a model, one needs to identify the parameters that help in predicting the price. The parameters cars24 is using include brand, model, year of manufacturing, variant, car registration state and kilometers driven.
The next question to ask is how these models are built. This will help us understand the overall data science work flow. The steps followed in a data science work flow are: identifying the business problem, acquisition of raw data, data processing, descriptive analysis, model building, model validation, communication / visualization of results, and implementation / decisions related to the model, which results in a data product.
Any data science work flow, after understanding the business problem, starts with the acquisition of the raw data. What would be the source of the data? Do you need to purchase it offline, collect it online, or look inside your own database to see whether you already have the right kind of data?
Stage 2: Data acquisition
Along with acquiring the data, the next question is: what are the factors contributing to the price of a resale car? Now you have started thinking like a data scientist. The prerequisite for becoming a data scientist is asking the right questions. Once you have asked about the factors contributing to the price of a resale car, the importance of domain knowledge comes into the picture. You could go out and ask potential buyers and sellers how they arrive at the price of a resale car. The answers will start flowing: year of registration, vehicle type, year of make, kms driven, make of the car, original value of the car, fuel type, mileage, and so on. This is one of the ways your parameters might emerge. They might also emerge from your own experience and domain expertise.
Stage 3: Exploratory Data analysis
The next step after data acquisition is data cleaning and pre-processing. The application of your statistical knowledge starts here: removing outliers, treating missing data, and dealing with malicious, erroneous, irrelevant or inconsistent data and with formatting issues. The next step is performing exploratory data analysis. The measures of central tendency in the data are the mean, median and mode. The measures of dispersion are the variance, standard deviation and range. Other aspects of exploratory data analysis include the frequency distribution and whether or not the data is normally distributed.
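As a rough illustration of the cleaning and descriptive steps above, the sketch below uses Python's standard statistics module on a made-up list of kilometers-driven readings. The values and the MAX_PLAUSIBLE_KMS cut-off are assumptions chosen only for this example; a real project would pick such a threshold from domain knowledge.

```python
import statistics

# Hypothetical kilometers-driven readings: None marks a missing entry,
# and 9_999_999 is an obviously erroneous value.
kms_raw = [42000, 55000, None, 61000, 9_999_999, 38000, 55000]

# Cleaning: drop missing values, then drop readings above an assumed cap.
MAX_PLAUSIBLE_KMS = 500_000
kms = [k for k in kms_raw if k is not None and k <= MAX_PLAUSIBLE_KMS]

# Descriptive statistics: central tendency and dispersion.
summary = {
    "mean": statistics.mean(kms),
    "median": statistics.median(kms),
    "mode": statistics.mode(kms),
    "variance": statistics.variance(kms),  # sample variance
    "stdev": statistics.stdev(kms),        # sample standard deviation
    "range": max(kms) - min(kms),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```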
Stage 4: Model building steps
After you are done with the exploratory data analysis, the model building process starts. What is the right algorithm to predict price? As a data scientist, you will face such questions continuously, depending on the business problems you are dealing with. In the case of predicting the price of a resale car, the algorithm being used is linear regression.
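A minimal sketch of the model building step, assuming NumPy is available and using only two of the parameters (kilometers driven and age). The numbers are invented so that the true relationship is exactly price = 800 - 2 * kms - 30 * age; this is not the model any portal actually uses.

```python
import numpy as np

# Hypothetical training data (prices and kilometers in thousands).
kms_thousands = np.array([10, 20, 30, 40, 50, 60], dtype=float)
age_years = np.array([1, 3, 2, 5, 4, 6], dtype=float)
price = np.array([750, 670, 680, 570, 580, 500], dtype=float)

# Design matrix: a column of ones for the intercept, then one column per parameter.
X = np.column_stack([np.ones_like(kms_thousands), kms_thousands, age_years])

# Solve the least-squares problem X @ coef ~= price.
coef, residuals, rank, _ = np.linalg.lstsq(X, price, rcond=None)
intercept, b_kms, b_age = coef

print(f"price = {intercept:.1f} + ({b_kms:.1f} * kms) + ({b_age:.1f} * age)")
```

Categorical parameters such as brand or fuel type would first need to be turned into numbers (for example, one column per category) before they can enter such an equation.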
Stage 5: Model validation
The model is then tested on a separate “test” dataset, one that was not used to build it, to evaluate how accurately it predicts.
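One way to sketch this validation step: hold out a few records, fit the line on the rest, and measure the prediction error on the held-out records. The data, the split, and the choice of RMSE and MAE as error measures are assumptions for illustration only.

```python
import math

# Hypothetical (kms_driven_thousands, price_thousands) records.
data = [(10, 780), (20, 765), (30, 735), (40, 722),
        (50, 700), (60, 678), (70, 665), (80, 640)]

# Hold out every third record as the test set; train on the rest.
test = data[2::3]
train = [row for row in data if row not in test]


def fit_line(rows):
    """Ordinary least squares for price = m * kms + c."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    m = sxy / sxx
    return m, mean_y - m * mean_x


m, c = fit_line(train)

# Errors on records the model has never seen.
errors = [y - (m * x + c) for x, y in test]
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))
mae = sum(abs(e) for e in errors) / len(errors)

print(f"RMSE on test set: {rmse:.2f}")
print(f"MAE on test set:  {mae:.2f}")
```

If the error on the test set is much larger than on the training set, the model has likely memorized the training data and will not generalize well.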
Stage 6: Data visualization
The relationship found by the model has to be visualized in order to communicate it to top management. A linear relationship is represented by a straight line or a linear equation. The number of kms driven is inversely related to the value of the car, while the newness of the car is directly related to its value: the newer the car, the higher its value. The parameters you identified in the model building process are represented through an equation, and this is how it looks visually.
Figure
Stage 7: Model deployment
After checking the robustness of the results, you decide to deploy the model. The cars24 model is deployed in real time: whenever a user enters the details of a car, they get its value. At the backend, the linear regression machine learning algorithm takes the entered values as input and returns the value of the resale car as output.
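A deployed model can be pictured as a small function that takes the user's inputs and returns a price. The sketch below is only an assumption about how such a backend might look: the PriceModel class and its coefficients are hypothetical, not those of cars24.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceModel:
    """Hypothetical fitted linear model (price in thousands)."""
    intercept: float
    per_thousand_km: float
    per_year_of_age: float

    def predict(self, kms_driven: float, age_years: float) -> float:
        # Validate user input before scoring it.
        if kms_driven < 0 or age_years < 0:
            raise ValueError("kms_driven and age_years must be non-negative")
        price = (self.intercept
                 + self.per_thousand_km * (kms_driven / 1000)
                 + self.per_year_of_age * age_years)
        # A linear equation can go negative for extreme inputs; clamp at zero.
        return max(price, 0.0)


# Coefficients would come from the model building step; these are made up.
model = PriceModel(intercept=800.0, per_thousand_km=-2.0, per_year_of_age=-30.0)

quote = model.predict(kms_driven=40_000, age_years=5)
print(f"Estimated resale value: {quote:.1f} thousand")
```

In a real-time deployment, a function like predict would typically sit behind a web form or API endpoint and be called each time a user submits the car's details.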