Data Science Project Flows

Published on 2024-02-03

Data Science

When I started learning about data mining (in 2021), it confused me, maybe because I jumped straight into classification. There are so many different approaches in data science that I could not quite picture what I was doing.

What helps me is structuring the flow, that is, listing what comes first and what comes next in a whole data science project. I plan to keep updating this post as my knowledge of data science accumulates. So, here we go:

Our goal: given an input, we want a predicted output!

Raw data --> Preprocess Data --> Clustering (if necessary) --> Choose an Approach for Modelling (can be more than one; combining several models is called an ensemble method) --> Evaluate Model --> Model good? Deploy it to Production

1. Data Preprocessing

First, we need to understand our data; that means cleaning, transforming and, sometimes, removing unnecessary data.

(1) Data Cleansing & Organizing: but perfection is not good for data modelling. 😆
(2) Data Transformation: normalization and standardization, since much real-world data is right-skewed.
(3) Dimensionality Reduction: linear Principal Component Analysis (PCA), non-linear t-distributed Stochastic Neighbor Embedding (t-SNE), Uniform Manifold Approximation and Projection (UMAP); steps (2) and (3) are sketched below.
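As a minimal sketch of steps (2) and (3), assuming scikit-learn and a made-up correlated feature matrix standing in for real raw data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder data: 100 samples, 10 correlated features (stand-in for real data)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(100, 10))

# (2) Standardization: rescale each feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# (3) Dimensionality reduction: keep enough principal components
# to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                # fewer columns than the original 10
print(pca.explained_variance_ratio_)  # variance captured per component
```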

This whole process of data preprocessing is part of feature engineering (often the job of a data engineer), that is, polishing our data and choosing appropriate features, since this data will be used to train the model that makes the predictions. From now on, we are going to call this pre-processed data the 'training data'.

2. Preparing to Model the Data

With the knowledge of the data we have, the next step is to plan how we are going to model it.

(1) Supervised vs. unsupervised: do we have an output, target, or label in our 'training data'? If not, it's unsupervised.
(2) Cross-validation: how are we going to validate our model?
(3) Overfitting vs. generalization (the bias-variance trade-off).
(4) Establishing the baseline performance (resampling, ground-truth labelling); steps (2) and (4) are sketched below.
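Here is a minimal sketch of (2) and (4) with scikit-learn, using its built-in iris toy dataset as a stand-in for real training data; beating the dummy baseline is the first sanity check:

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # labelled, so this is a supervised problem

# (4) Baseline: always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
print(cross_val_score(baseline, X, y, cv=5).mean())  # about 0.33 on iris

# (2) 5-fold cross-validation of a real model; scoring across five
# splits guards against (3) overfitting a single train/test split
model = DecisionTreeClassifier(max_depth=3)
print(cross_val_score(model, X, y, cv=5).mean())
```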

Now that we have a rough plan, we just need to fill it in with more details. Then, execution!

3. Clustering (unsupervised)

Clustering is often a preliminary step in a data mining process, i.e. its output serves as input to a different technique downstream, such as a neural network (we will get to that later). So we use a data mining algorithm to search for patterns and structure among all the variables or features.

(1) Partition-based Clustering: k-means
(2) Hierarchical Clustering: agglomerative, divisive
(3) Density-based Clustering: DBSCAN (sketched below with k-means)
(4) Mean-shift
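As a rough sketch of (1) and (3), assuming scikit-learn and synthetic blob data in place of real features:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 obvious groups (placeholder for real features)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# (1) Partition-based: k must be chosen up front
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# (3) Density-based: no k needed, but eps/min_samples need tuning;
# points labelled -1 are treated as noise
dbscan = DBSCAN(eps=0.8, min_samples=5).fit(X)

print(kmeans.labels_[:10])
print(dbscan.labels_[:10])
```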

How do we evaluate cluster validity?
- Cluster Cohesion (SSE)
- Cluster Separation (SSB); both are sketched below.
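A minimal sketch of both measures on the same synthetic blobs: scikit-learn stores SSE as `inertia_`, and because the total sum of squares decomposes into SSE + SSB for k-means, separation falls out as the difference.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Cohesion (SSE): within-cluster sum of squared distances to the
# centroids; KMeans exposes it directly as inertia_
sse = kmeans.inertia_

# Separation (SSB): for k-means, total sum of squares = SSE + SSB,
# so SSB is the remainder after subtracting SSE
tss = ((X - X.mean(axis=0)) ** 2).sum()
ssb = tss - sse

print(f"SSE (cohesion): {sse:.1f}, SSB (separation): {ssb:.1f}")
```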

4. Association Rules (data may be supervised or unsupervised)

Association rules are generally used for unsupervised learning but may also apply to supervised learning for a classification task.
(1) Affinity Analysis
(2) Apriori Algorithm (sketched below)
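A minimal sketch of the Apriori algorithm, assuming the third-party mlxtend library and a made-up table of one-hot encoded baskets:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encoded transactions: each row is one basket (made-up data)
baskets = pd.DataFrame(
    [[1, 1, 0], [1, 1, 1], [0, 1, 1], [1, 0, 1]],
    columns=["bread", "butter", "milk"],
).astype(bool)

# Frequent itemsets appearing in at least 50% of baskets
itemsets = apriori(baskets, min_support=0.5, use_colnames=True)

# Rules such as {bread} -> {butter}, filtered by confidence
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```

Affinity analysis is then a matter of reading the resulting rules, e.g. customers who buy bread tend to also buy butter.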

5. Classification (supervised: training the prediction model)

Explainable (given inputs, we know how the model makes its decision to produce the output(s)):
(1) K-nearest neighbors (KNN)
(2) Decision Tree
(3) Rule-based models
(4) Support Vector Machine (SVM)
(5) Naive Bayes (probabilistic model)
(6) Linear Regression
(7) Logistic Regression

Un-explainable (black box: given inputs, we don't know how the model makes its decision to produce the output(s)):
(8) Deep Learning (Neural Networks); a sketch contrasting a decision tree with a small neural network follows.
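To make the contrast concrete, here is a hedged sketch with scikit-learn on the iris toy dataset: the decision tree's learned rules print as readable if/else splits, while the neural network's weights do not translate into rules.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Explainable: the learned decision rules can be printed and inspected
tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print(export_text(tree))

# Black box: a small neural network may predict just as well,
# but its weights don't translate into human-readable rules
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
mlp.fit(X_train, y_train)

print(tree.score(X_test, y_test), mlp.score(X_test, y_test))
```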

6. Model Evaluation Techniques

(1) Cross-validation
(2) Tuning the hyper-parameters
(3) Metrics and scoring: error rate or accuracy score, false positives, false negatives, etc. (sketched below together with (1) and (2))
(4) Validation Curve
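A minimal sketch of (1) to (3) with scikit-learn; the dataset, parameter grid, and model choice are all illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# (1) + (2): cross-validated grid search over the regularization strength C
grid = GridSearchCV(
    LogisticRegression(max_iter=5000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
)
grid.fit(X_train, y_train)

# (3): accuracy plus the confusion matrix, which shows the false
# positives and false negatives hidden behind the single score
print(grid.best_params_, grid.score(X_test, y_test))
print(confusion_matrix(y_test, grid.predict(X_test)))
```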

7. Model Deployment
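As one minimal, hedged sketch of a first deployment step, assuming joblib and an illustrative file name: persist the fitted model so a separate production service can load it once and predict per request.

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Persist the fitted model to disk (the file name is illustrative)
joblib.dump(model, "model.joblib")

# Later, inside the production service: load once at startup,
# then call predict() for each incoming request
loaded = joblib.load("model.joblib")
print(loaded.predict(X[:1]))
```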