Machine learning pipeline study notes: feature engineering, feature selection, and hyperparameter optimization

2021-11-22 07:52:52 By : Mr. Eden Li

Machine learning is the scientific study of algorithms and statistical models that perform specific tasks efficiently without explicit instructions. Machine learning algorithms fall into two broad classes: supervised and unsupervised.

In supervised learning, the target is known and used for model prediction.

In unsupervised learning, the target is unknown and should be determined by the model.

Feature engineering is the process of using domain knowledge of data to create features or variables for machine learning. This section covers the following topics:

If the distribution of a variable is skewed, apply a transformation (such as a log transform) to bring the distribution closer to normal.
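A minimal sketch of such a transformation, using a hypothetical right-skewed feature and NumPy's log transform (the values here are illustrative, not from the article):

```python
import numpy as np

# Hypothetical right-skewed feature (e.g. income values)
x = np.array([1.0, 2.0, 3.0, 10.0, 100.0, 1000.0])

# A log transform compresses the long right tail,
# pulling the distribution closer to normal
x_log = np.log1p(x)

# The spread between largest and smallest value shrinks dramatically
print(x.max() / x.min())          # 1000.0
print(x_log.max() / x_log.min())  # roughly 10
```

Other common choices for skewed data are square-root and Box-Cox transforms; the right one depends on the direction and severity of the skew.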

Most machine learning algorithms accept only numerical variables. Categorical values are therefore replaced with numerical representations (encoded) so that models can use these variables.
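One common encoding is one-hot encoding, sketched here with pandas on a made-up column (the data and column name are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary column per category
encoded = pd.get_dummies(df, columns=["color"])
print(encoded.columns.tolist())
# ['color_blue', 'color_green', 'color_red']
```

Label encoding (mapping each category to an integer) is an alternative, but it imposes an artificial order on the categories, so one-hot encoding is usually safer for linear models.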

Feature scaling is a method used to standardize the range of values, so that all variables are kept on the same scale.
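A short sketch of standardization with scikit-learn's StandardScaler, on a toy matrix whose two columns differ in scale by a factor of 100 (the data is illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# StandardScaler gives each column zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # ~[1, 1]
```

MinMaxScaler (rescaling to [0, 1]) is the other common choice; standardization is generally preferred when outliers are present or the algorithm assumes centered data.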

Discretization is the process of transforming a continuous variable into a discrete one by assigning its values to a set of contiguous intervals. It is also called binning.
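Equal-width binning can be sketched with scikit-learn's KBinsDiscretizer; the age values below are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([[5], [18], [25], [40], [65], [90]])

# Equal-width binning: split the range [5, 90] into 3 intervals
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
bins = disc.fit_transform(ages)
print(bins.ravel())  # [0. 0. 0. 1. 2. 2.]
```

With `strategy="quantile"` each bin would instead receive roughly the same number of observations, which is often preferable for skewed variables.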

Imputation is the act of replacing missing data with statistical estimates of the missing values. The goal is to produce a complete data set that can be used to train machine learning models.
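Median imputation, one common statistical estimate, can be sketched with scikit-learn's SimpleImputer (the column of values is illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [2.0], [np.nan], [4.0]])

# Replace each missing value with the column median
imputer = SimpleImputer(strategy="median")
X_full = imputer.fit_transform(X)
print(X_full.ravel())  # [1. 2. 2. 4.]
```

Mean, most-frequent, and constant-value strategies are also available; the median is robust to the outliers discussed below.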

Outliers are data points that differ significantly from the rest of the data. Outliers can degrade the performance of linear models but have little effect on tree-based algorithms. They can be identified using the Gaussian rule (values beyond the mean ± 3 standard deviations), the interquartile range (IQR = Q3 − Q1, flagging values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR), or extreme quantiles (e.g. below the 1st or above the 99th percentile).
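The IQR rule can be sketched in a few lines of NumPy; the data below is a made-up sample with one obvious outlier:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95])  # 95 is an outlier

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
outliers = data[(data < lower) | (data > upper)]
print(outliers)  # [95]
```

The 1.5 multiplier is Tukey's conventional choice; a larger multiplier (e.g. 3) flags only the most extreme points.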

Feature selection is the process of selecting a subset of relevant features for machine learning model construction. This section covers the following topics:

Filter methods rank features using characteristics of the data alone and are independent of any model. Their computational cost is low, which makes them suitable for rapid screening.
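A minimal filter-method sketch using scikit-learn's SelectKBest on the Iris data set (the choice of ANOVA F-score and k=2 is an illustrative assumption):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Filter method: rank features by ANOVA F-score and keep the top 2;
# no predictive model is trained, only a statistic of the data
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)
print(X_new.shape)  # (150, 2)
```

Other common filter statistics include mutual information and the chi-squared test (for non-negative features).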

Wrapper methods use a machine learning model to score subsets of features: a new model is trained on each candidate subset, and the best-performing subset is returned.
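Recursive feature elimination (RFE) is a standard wrapper method; here is a sketch with scikit-learn, again on Iris (the estimator and target of 2 features are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Wrapper method: repeatedly fit the model and drop the weakest feature
# until only the requested number of features remains
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=2)
rfe.fit(X, y)
print(rfe.support_)  # boolean mask of the selected features
```

Because a model is refit for every candidate subset, wrapper methods are much more expensive than filter methods, which is the trade-off the text describes.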

Embedded methods perform feature selection during model training and take the interaction between the model and the features into account. They are faster than wrapper methods and more accurate than filter methods.
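Lasso regression is a classic embedded method: L1 regularization drives the coefficients of irrelevant features to exactly zero while the model is being fitted. A sketch on Iris (treating the class label as a numeric target purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import Lasso

X, y = load_iris(return_X_y=True)

# Embedded method: the L1 penalty zeroes out weak coefficients
# as part of the fitting process itself
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

selected = lasso.coef_ != 0  # features with nonzero coefficients survive
print(selected)
```

Tree ensembles offer the same idea through `feature_importances_`: importance is a by-product of training, not a separate search.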

The goal of a learning algorithm is to find a function that minimizes error on the data set. Hyperparameters are not learned directly by the model; they are specified outside the training process, control the flexibility of the model, and are important for preventing overfitting.

The process of finding the best hyperparameters for a given data set is called hyperparameter optimization. The goal is to minimize the generalization error, where generalization is the ability of an algorithm to remain effective across varied inputs. Searching for the best hyperparameters involves a hyperparameter space, a sampling method, a cross-validation scheme, and a performance metric.
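The four ingredients above map directly onto scikit-learn's GridSearchCV; the grid and estimator below are illustrative choices, not prescriptions from the article:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hyperparameter space: a small grid over C and the kernel
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Sampling method: exhaustive grid search;
# cross-validation scheme: 5-fold; performance metric: accuracy
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

RandomizedSearchCV swaps the exhaustive grid for random sampling of the space, which scales better when there are many hyperparameters.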

This section covers the following topics:

No formula yields the best hyperparameters directly, so different combinations must be evaluated empirically. A few hyperparameters have a large impact on performance, while most have little effect; it is therefore important to identify and optimize the ones that matter for the algorithm at hand.

The following is a list of hyperparameters that have been found to have a huge impact on the performance of their respective machine learning algorithms:

The training set is divided into k folds. The model is trained on k−1 folds and tested on the remaining fold; this is repeated k times, and the final performance is averaged. Cross-validation can be used to select the best hyperparameters, choose the best-performing model, and estimate the generalization error of a given model. Stratified k-fold cross-validation is useful when the data set is imbalanced: each fold contains a similar proportion of observations from each class, and the test sets do not overlap.
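The stratification property can be checked directly with scikit-learn's StratifiedKFold; the imbalanced toy labels below are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced labels: 8 of class 0, 2 of class 1
y = np.array([0] * 8 + [1] * 2)
X = np.arange(10).reshape(-1, 1)

skf = StratifiedKFold(n_splits=2)
for train_idx, test_idx in skf.split(X, y):
    # Each test fold preserves the 4:1 class ratio (4 zeros, 1 one)
    print(np.bincount(y[test_idx]))  # [4 1]
```

Plain KFold gives no such guarantee: with few minority examples, some folds could contain none of the rare class at all.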

The performance of a machine learning model should be consistent across different data sets. When a model performs well on the training set but poorly on live data, it has overfit the training data.

Rohit Garg has nearly 7 years of work experience in the field of data analysis and machine learning. He has extensive work in predictive modeling, time series analysis, and segmentation technology. Rohit owns BE from BITS Pilani and PGDM from IIM Raipur.

Copyright Analytics India Magazine Pvt Ltd