How I realised Data Science is for me
A friend of my father's once met me and asked what I was studying in college, and I replied that I was a Machine Learning Engineer. At the time I was really just curious, exploring the many fields of computer science (data science, web/app development, etc.) so that I could choose one. He went on to ask what I would do as a Machine Learning Engineer. I said I would use it to make predictions from data and draw insights for companies. So he told me he had some data he wanted me to analyse, and I told him to just send it over. Meanwhile, the only thing I had learnt to build was a diabetes project. So I went home and played with some data provided on Madhu Charan's blog to learn the most used functions in the data science libraries (pandas, numpy, etc.). I gathered those functions in my notebook so that I could refer to them whenever I want to analyse data. Let me show you some:
Pandas
Pandas relies on matplotlib to display plots/graphs
df = pd.read_csv() - reads a CSV file into a dataframe
df.describe() - summary statistics
df.info() - data types of the columns
df.columns - column headers
df["Index"] - the values (rows) under the column label "Index"
df.plot.box(), df['petal_species'].plot.hist(), df.plot.scatter() - the pandas plotting API for visualizing a df
df.sort_values(by=), df.sort_index() - sort by the values of a column or by the index
pd.concat([], join="inner") - stacks dataframes; inner means intersection, outer is union
df.merge() - like pd.concat, but combines dataframes that are linked by a shared column
df.dropna() - drops missing values
df.sepal_length = df.sepal_length.fillna(df.sepal_length.mean()) - fills missing values with the column mean
df.apply() - allows you to apply a function to your df
Win: df.apply is approximately df['index'] anywhere
df.reset_index(drop=True) - resets the index column after dropping rows
df.iloc[3] - shows the values of the row at position 3
df.loc[row, column] - shows the value where that row and column intersect
np.random.choice - accepts an array (usually 1-D) to select randomly from
df.hist() plots a histogram for every numeric column, and df['index'].plot.hist() plots one for an individual column
df.ndim - number of dimensions of the dataframe
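To see how a few of these fit together, here is a minimal sketch. The dataframe is a tiny made-up iris-style table (the values and the column name sepal_rounded are my own assumptions for illustration, not from the data I actually used), so treat it as a rough guide and not as the one right way to do it.

```python
import numpy as np
import pandas as pd

# Hypothetical iris-style data; the values are made up for illustration.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, np.nan, 6.3, 5.8],
    "petal_length": [1.4, 1.4, 1.3, 6.0, np.nan],
    "species": ["setosa", "setosa", "setosa", "virginica", "virginica"],
})

# Inspect the data.
print(df.describe())                 # summary statistics for numeric columns
df.info()                            # column dtypes and non-null counts
print(df.columns.tolist(), df.ndim)  # column headers and number of dimensions

# Fill missing sepal_length values with the column mean.
df["sepal_length"] = df["sepal_length"].fillna(df["sepal_length"].mean())

# Drop rows that still have missing values, then rebuild the index.
clean = df.dropna().reset_index(drop=True)

# Position-based versus label-based selection.
fourth_row = clean.iloc[3]               # row at position 3
first_species = clean.loc[0, "species"]  # row label 0, column "species"

# apply runs a function over every value of a column.
clean["sepal_rounded"] = clean["sepal_length"].apply(round)

# Sort the rows by a column.
by_length = clean.sort_values(by="sepal_length")

# concat: join="inner" keeps only the columns both frames share (intersection),
# join="outer" keeps every column (union) and fills the gaps with NaN.
a = pd.DataFrame({"id": [1, 2], "x": [10, 20]})
b = pd.DataFrame({"id": [3, 4], "y": [30, 40]})
inner = pd.concat([a, b], join="inner")
outer = pd.concat([a, b], join="outer")

# merge: combine frames that are linked by a shared key column.
left = pd.DataFrame({"id": [1, 2, 3], "x": [10, 20, 30]})
right = pd.DataFrame({"id": [2, 3, 4], "y": [200, 300, 400]})
merged = left.merge(right, on="id")  # default how="inner" keeps matching ids only
```

With real data you would replace the hand-written dataframe with pd.read_csv and a path to your own file.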
Seaborn
sns.countplot - bar chart of label counts
sns.boxplot - visualizes data in box form and also shows outliers (data points that are anomalies)
Density plot - helps us see the relationship between each variable and the target variable
plt.subplots(4, 4, figsize=(20, 25)) is different from plt.subplot - plt.subplots creates a whole grid of plots at once, while plt.subplot adds a single plot
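Here is a small sketch of the count plot and box plot drawn side by side on a plt.subplots grid. The data is made up for illustration, and the "Agg" backend line is only there so the script also runs outside a notebook; in a notebook you can leave it out.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data; 9.9 sits far from the other virginica values,
# so the box plot should draw it as an outlier point.
df = pd.DataFrame({
    "species": ["setosa"] * 3 + ["virginica"] * 6,
    "sepal_length": [5.1, 4.9, 5.0, 5.8, 6.3, 6.1, 6.4, 6.0, 9.9],
})

# plt.subplots creates the figure and a grid of axes in one call.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))  # width and height in inches

sns.countplot(data=df, x="species", ax=axes[0])                  # how many rows per label
sns.boxplot(data=df, x="species", y="sepal_length", ax=axes[1])  # spread plus outliers
fig.tight_layout()
```

Passing ax= tells seaborn which cell of the grid to draw into, which is what makes plt.subplots handy when you want many plots in one figure.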
Matplotlib.pyplot
plt.figure(figsize=(3, 3)) - figure size in inches (width and height)
%matplotlib inline - displays the plots in the notebook itself
Numpy
df["Price"] = df["Price"].clip(lower=df["Price"].quantile(0.05), upper=df["Price"].quantile(0.95)) - to clip (cut short) outliers
np.random.choice(replace=False) - select randomly without picking the same item twice
np.random.seed - fixes the random numbers so they do not keep changing
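A quick sketch of these three in action. The prices are made-up numbers, and note that clip and quantile here are actually pandas methods even though I filed them under Numpy in my notebook.

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # fix the seed so the "random" draws repeat on every run

values = np.arange(10)
sample = np.random.choice(values, size=5, replace=False)  # no value is picked twice

# Re-seeding with the same number reproduces the same draw.
np.random.seed(42)
sample_again = np.random.choice(values, size=5, replace=False)

# Clip extreme prices to the 5th and 95th percentiles.
df = pd.DataFrame({"Price": [1.0, 50.0, 52.0, 55.0, 58.0, 60.0, 61.0, 63.0, 65.0, 1000.0]})
low = df["Price"].quantile(0.05)
high = df["Price"].quantile(0.95)
df["Price"] = df["Price"].clip(lower=low, upper=high)
```

After clipping, the 1.0 and 1000.0 entries are pulled in to the percentile bounds while the values in between stay as they were.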
Keras
A Keras model takes an input and passes it to the next layer. Each layer performs mathematical operations on it before passing it to the next. The core layers in Keras are dense, activation and dropout. There are other layers that are more complex, including convolutional layers and pooling layers.
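To make the idea of layers passing data along more concrete, here is a rough sketch in plain NumPy. This is not Keras code: the functions dense, relu and dropout are hypothetical helpers I wrote only to imitate what those three core layer types do, and the weights are random numbers instead of trained ones.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, weights, bias):
    """Dense layer: weighted sum of the inputs plus a bias."""
    return x @ weights + bias

def relu(x):
    """Activation layer: replace negative values with zero."""
    return np.maximum(x, 0.0)

def dropout(x, rate, rng):
    """Dropout layer (training mode): zero a random share of values, rescale the rest."""
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

x = rng.normal(size=(4, 3))  # 4 samples with 3 features each
w1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
w2, b2 = rng.normal(size=(5, 1)), np.zeros(1)

# Each layer transforms its input and hands the result to the next one.
hidden = dropout(relu(dense(x, w1, b1)), rate=0.2, rng=rng)
output = dense(hidden, w2, b2)
print(output.shape)  # (4, 1)
```

In Keras itself you would stack ready-made layer objects instead of writing these functions by hand, but the flow of data from one layer to the next is the same idea.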
So these were some of the most used functions in most dataset analysis. I was doing better😂
In conclusion
After learning some basics like Python and SQL, do not worry about reading every library's documentation. Documentation is for reference. Build projects, both with help and on your own. Then do something different from what you already know. There are people who are ahead of you. Ask for help. Get a good mentor.