Prerequisites for a Data Scientist

1. Know the problem you want to solve: Take time to ask what exactly you are trying to solve. Is it a classification problem or a regression problem?

2. Get data: Before you collect data, try not to reinvent the wheel. Make sure you are not creating data that already exists. We are flooded with information these days, so there is a high chance the data you are looking for is already available, and there is no need to build a data warehouse or do further research and wrangling. Kaggle is one of the best sites for finding data.

3. Clean the data (data preprocessing): Data needs to be cleaned before it is passed to the ML model. The 4 C's of cleaning data are: Correcting, Completing, Converting and Creating.

unclean.jpeg

a) Correcting - this is more or less feature selection: removing unwanted features and anomalies.

b) Completing - filling in null and missing values. Missing or null qualitative values are commonly replaced with the mode, and missing or null quantitative values with the mean or median.

c) Converting - not all machine learning algorithms work well with text data. Decision trees can in principle handle such data (although some implementations, scikit-learn's for example, still expect numeric input), but neural networks cannot, so the data needs to be converted to numbers. Functions such as pandas' get_dummies and scikit-learn's LabelEncoder help automate the conversion.

d) Creating - this is the feature engineering part of ML. You study the data and derive new features that may be good predictors.
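The 4 C's above can be sketched with pandas roughly as follows. This is only a minimal illustration: the toy dataset, its column names (passenger_id, age, embarked, siblings, parents) and the derived family_size feature are all made up for the example and are not from any particular dataset.

```python
import pandas as pd

# Hypothetical toy dataset; every column name and value is invented for illustration.
df = pd.DataFrame({
    "passenger_id": [1, 2, 3, 4, 5],
    "age": [22.0, None, 35.0, 58.0, None],
    "embarked": ["S", "C", None, "S", "S"],
    "siblings": [1, 0, 0, 2, 1],
    "parents": [0, 0, 1, 0, 2],
})

# Correcting: drop a feature that is just an identifier and carries no signal.
df = df.drop(columns=["passenger_id"])

# Completing: median for the quantitative column, mode for the qualitative one.
df["age"] = df["age"].fillna(df["age"].median())
df["embarked"] = df["embarked"].fillna(df["embarked"].mode()[0])

# Converting: one-hot encode the text column so that every feature is numeric.
df = pd.get_dummies(df, columns=["embarked"])

# Creating: derive a new feature from existing ones (assumed to be useful here).
df["family_size"] = df["siblings"] + df["parents"] + 1

print(df)
```

Whether to fill with the mean or the median, and which features to drop or create, depends on the data; the choices here are just one reasonable default.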

4. Exploratory Data Analysis (EDA): This part is what others refer to as storytelling. We use the data to create insights that help us analyse it. Some of the tools we use here are seaborn, matplotlib, etc.
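Before plotting anything, a few summary tables already tell part of the story. The sketch below uses pandas only; the toy data and its column names (age, fare, survived) are hypothetical and invented for the example.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration.
df = pd.DataFrame({
    "age": [22, 38, 26, 35, 54, 2],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08],
    "survived": [0, 1, 1, 1, 0, 0],
})

# Count, mean, spread and quartiles for every numeric column.
summary = df.describe()

# Average fare for each outcome: a first hint at which features matter.
mean_fare_by_outcome = df.groupby("survived")["fare"].mean()

# Pairwise correlations between the numeric columns.
correlations = df.corr()

print(summary)
print(mean_fare_by_outcome)
print(correlations)
```

The same DataFrame can then be handed to seaborn or matplotlib to turn these numbers into charts.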

eda.jpeg

5. Modelling: You then pass the data to a machine learning algorithm to train a model for future predictions. A model is the output of a machine learning algorithm. The type of problem we are trying to solve determines the type of model we use. One example of a model is the Sequential model in Keras.

6. Make some predictions: We predict on the testing dataset, i.e. data the model has not seen or been trained on before. The score here indicates how well our model is likely to work in the real world.
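Steps 5 and 6 can be sketched with scikit-learn as below. To keep the example small it uses a decision tree on the iris dataset bundled with scikit-learn rather than a Keras Sequential model; the dataset, the 25% test split and the accuracy metric are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A small classification dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold back 25% of the rows; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Modelling: fit the algorithm on the training data to produce a model.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Prediction: score the model on the unseen test data.
predictions = model.predict(X_test)
test_accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy on unseen data: {test_accuracy:.2f}")
```

For a regression problem the same pattern applies with a regressor and a regression metric in place of the classifier and accuracy.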

7. Finalize your work: Here you go back and try different models and different data, keep the combination that gives a good score, and use it in the real world.
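One way to compare candidate models is sketched below. It assumes scikit-learn, its bundled iris dataset and 5-fold cross-validation; the two candidates were picked only for illustration, and cross-validation is just one of several ways to compare scores.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate models to try; the names are just labels for the printout.
candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Mean accuracy over 5 cross-validation folds for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}

# Keep the candidate with the highest mean score.
best_name = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print(f"Best candidate: {best_name}")
```

The winning model would then be retrained on all available training data before being used for real.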

hm2.jpeg