Published on 2024-02-20
Data Science
The values of the features or variables in a dataset are rarely on the same scale. Even when they are all numerical, their ranges can differ dramatically. Ensuring that all features are treated equally in terms of scale and range is essential for the performance and stability of many machine learning algorithms.
In addition to handling missing values (known as data imputation) and encoding categorical variables (one-hot encoding, ordinal encoding, etc.), we also have to scale the data before model training:
Normalization is a scaling technique in which data points are shifted and rescaled so that the transformed values fall in the range 0 to 1. It is also known as min-max scaling. The purpose of normalization is to bring all features to a similar scale; otherwise, features with larger magnitudes can dominate the learning process and lead to biased models. We typically use normalization when we don't have strong assumptions about the distribution of the data. Do note, however, that normalization is sensitive to outliers because it relies directly on the minimum and maximum values. X' = (X - Xmin) / (Xmax - Xmin)
With NumPy:
import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])  # Sample data, one feature per column
# Scale each feature (column) to the [0, 1] range
normalized_data = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))
With scikit-learn:
import numpy as np
from sklearn.preprocessing import MinMaxScaler
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])  # Sample data, one feature per column
scaler = MinMaxScaler()  # create a MinMaxScaler object
normalized_data = scaler.fit_transform(data)  # fit to the data and scale each column to [0, 1]
Standardization, also known as z-score normalization, is a technique used to transform the features of a dataset to have a mean of 0 and a standard deviation of 1. It centers the data around 0 and scales it by the standard deviation, and it works best when the original data are approximately normally distributed. Standardization is less sensitive to outliers than min-max scaling because it does not depend directly on the minimum and maximum values; outliers still affect the mean and standard deviation, but to a lesser extent. However, standardized data loses the original scale and interpretability: the resulting values are expressed in standard deviations from the mean and may not have a clear meaning in the original units. X' = (X - mean) / std
With NumPy:
import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])  # Sample data, one feature per column
# Subtract each feature's mean and divide by its standard deviation
standardized_data = (data - data.mean(axis=0)) / data.std(axis=0)
With scikit-learn:
import numpy as np
from sklearn.preprocessing import StandardScaler
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])  # Sample data, one feature per column
scaler = StandardScaler()  # create a StandardScaler object
standardized_data = scaler.fit_transform(data)  # fit to the data and standardize each column
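One practical detail not shown in the snippets above: both MinMaxScaler and StandardScaler are usually fit on the training data only, and the learned statistics are then reused to transform the test data, so that no information from the test set leaks into training. A minimal sketch, assuming a small NumPy feature matrix X made up just for illustration:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])  # hypothetical feature matrix
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)  # split before scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from the training split only
X_test_scaled = scaler.transform(X_test)  # apply the training statistics to the test split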
Log Scaling is another data transformation technique that alters the scale of numeric data by taking the logarithm of each data point. Taking the logarithm can help spread out the values and make the distribution more symmetrical. The most common logarithmic transformations are the natural logarithm (base e) and the base-10 logarithm. This approach is suitable when the original data are highly skewed, for example when they follow an exponential or power-law-like distribution. As with standardization, the interpretation of log-scaled data differs from the interpretation of the original data.
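With NumPy (a minimal sketch; the sample values below are just an assumption chosen to span several orders of magnitude):
import numpy as np
data = np.array([1, 10, 100, 1000, 10000])  # Sample data spanning several orders of magnitude
log_scaled_natural = np.log1p(data)  # natural logarithm of (1 + x), safe when zeros are present
log_scaled_base10 = np.log10(data)  # base-10 logarithm, requires strictly positive values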
Further Reading
The sklearn.preprocessing module includes scaling, centering, normalization, and binarization methods.