Possible Param Distributions for ML Model Hyperparameter Tuning

Ihsanul Haque Asif
4 min read · Apr 24, 2024

Hyperparameters in machine learning models are settings or configurations that are external to the model and cannot be directly estimated from the data. Unlike parameters, which are learned from the training data during the model-fitting process, hyperparameters are predetermined choices that govern the behavior of the learning algorithm.
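
For example (a minimal illustration with toy data): in logistic regression, the regularization strength C is a hyperparameter chosen before training, while the coefficients are parameters learned from the data.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=1.0)        # C is a hyperparameter, set beforehand
model.fit([[0.0], [1.0], [2.0], [3.0]],  # toy features
          [0, 0, 1, 1])                  # toy labels
print(model.coef_)                       # coef_ is a parameter, learned by fit()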

Tuning hyperparameters is an essential part of building machine learning models: the goal is to find the settings that yield the best performance on unseen data. This usually involves systematic experimentation, such as grid search or randomized search, to explore different hyperparameter combinations and select the ones that produce the most effective model. Below are suggested parameter distributions for several common models.
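
All of the snippets that follow assume these shared imports (a minimal setup: randint and uniform come from scipy.stats and supply the distributions that RandomizedSearchCV samples from):

from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV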

1. K-Nearest Neighbors (KNN)

# Parameter distributions for RandomizedSearchCV
param_dist_knn = {
    'n_neighbors': randint(1, 50),                           # Number of neighbors to use
    'weights': ['uniform', 'distance'],                      # Weight function used in prediction
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],  # Algorithm used to compute the nearest neighbors
    'leaf_size': randint(10, 50),                            # Leaf size passed to BallTree or KDTree
    'p': [1, 2]                                              # Power parameter for the Minkowski metric (1 = Manhattan, 2 = Euclidean)
}

# Parameter grid for GridSearchCV
param_grid_knn = {
    'n_neighbors': [3, 5, 7, 10],                            # Number of neighbors to use
    'weights': ['uniform', 'distance'],                      # Weight function used in prediction
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],  # Algorithm used to compute the nearest neighbors
    'leaf_size': [20, 30, 40],                               # Leaf size passed to BallTree or KDTree
    'p': [1, 2]                                              # Power parameter for the Minkowski metric
}
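
As a quick illustration of how these dictionaries are plugged in (a minimal sketch; X_train and y_train stand in for your training data and are not defined in this article):

from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

knn_random_search = RandomizedSearchCV(
    estimator=KNeighborsClassifier(),
    param_distributions=param_dist_knn,
    n_iter=50,        # Number of parameter settings sampled
    cv=5,
    n_jobs=-1,
    random_state=42
)
knn_random_search.fit(X_train, y_train)
print(knn_random_search.best_params_)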

2. Decision Tree

# Parameter distributions for RandomizedSearchCV
param_dist_dt = {
    'max_depth': randint(1, 20),          # Maximum depth of the tree
    'min_samples_split': randint(2, 20),  # Minimum number of samples required to split an internal node
    'min_samples_leaf': randint(1, 20),   # Minimum number of samples required to be at a leaf node
    'criterion': ['gini', 'entropy']      # Function to measure the quality of a split
}

# Parameter grid for GridSearchCV
param_grid_dt = {
    'max_depth': [3, 5, 7, 10],           # Maximum depth of the tree
    'min_samples_split': [2, 5, 10],      # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 4],        # Minimum number of samples required to be at a leaf node
    'criterion': ['gini', 'entropy']      # Function to measure the quality of a split
}

3. Random Forest

# Parameter distributions for RandomizedSearchCV
param_dist_rf = {
    'n_estimators': randint(10, 200),     # Number of trees in the forest
    'max_depth': randint(1, 20),          # Maximum depth of each tree
    'min_samples_split': randint(2, 20),  # Minimum number of samples required to split an internal node
    'min_samples_leaf': randint(1, 20),   # Minimum number of samples required to be at a leaf node
    'criterion': ['gini', 'entropy']      # Function to measure the quality of a split
}

# Parameter grid for GridSearchCV
param_grid_rf = {
    'n_estimators': [50, 100, 150],       # Number of trees in the forest
    'max_depth': [5, 10, 15],             # Maximum depth of each tree
    'min_samples_split': [2, 5, 10],      # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 4],        # Minimum number of samples required to be at a leaf node
    'criterion': ['gini', 'entropy']      # Function to measure the quality of a split
}

4. Support Vector Machine (SVM)

# Parameter distributions for RandomizedSearchCV
param_dist_svm = {
    'C': [0.1, 1, 10, 100],                          # Regularization parameter (a list is sampled uniformly; scipy's loguniform is a common continuous alternative)
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],  # Kernel type
    'gamma': ['scale', 'auto'],                      # Kernel coefficient (used by 'rbf', 'poly', and 'sigmoid')
    'degree': randint(1, 10)                         # Degree of the polynomial kernel (ignored by the other kernels)
}

# Parameter grid for GridSearchCV
param_grid_svm = {
    'C': [0.1, 1, 10, 100],                          # Regularization parameter
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],  # Kernel type
    'gamma': ['scale', 'auto'],                      # Kernel coefficient
    'degree': [2, 3, 4]                              # Degree of the polynomial kernel function
}

5. Logistic Regression

# Parameter distributions for RandomizedSearchCV
param_dist_lr = {
    'C': uniform(0.1, 10),             # Inverse of regularization strength (samples uniformly from [0.1, 10.1])
    'penalty': ['l1', 'l2'],           # Norm used in the penalization
    'solver': ['liblinear', 'saga']    # Optimization algorithm (both support the 'l1' penalty)
}

# Parameter grid for GridSearchCV
param_grid_lr = {
    'C': [0.1, 1, 10],                 # Inverse of regularization strength
    'penalty': ['l1', 'l2'],           # Norm used in the penalization
    'solver': ['liblinear', 'saga']    # Algorithm to use in the optimization problem
}

A Different Technique!

There is another useful trick: design the parameter grid for GridSearchCV around the best parameters estimated by RandomizedSearchCV. A coarse randomized search first narrows down the space, and a fine grid search around its winner then saves computation cost and time.

This example is demonstrated on a random forest model:
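
Here rf_randomcv is assumed to be an already fitted RandomizedSearchCV over a broad random forest space, along these lines (the ranges below are illustrative assumptions, chosen so they contain the sample output that follows):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rf_randomcv = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions={
        'n_estimators': list(range(200, 2001, 200)),
        'max_features': ['sqrt', 'log2', None],
        'max_depth': [5, 10, 15, 20],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'criterion': ['gini', 'entropy']
    },
    n_iter=100,
    cv=3,
    n_jobs=-1,
    random_state=42
)
rf_randomcv.fit(x_train, y_train)  # x_train/y_train: your training data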

# See the best params estimated by RandomizedSearchCV
rf_randomcv.best_params_

# Sample output:
{'n_estimators': 1400,
 'min_samples_split': 5,
 'min_samples_leaf': 1,
 'max_features': None,
 'max_depth': 15,
 'criterion': 'entropy'}

Now design a parameter grid for GridSearchCV using the best parameters estimated by RandomizedSearchCV:

# Best parameters found by the randomized search
randomSearch_params = {
    'n_estimators': 1400,
    'min_samples_split': 5,
    'min_samples_leaf': 1,
    'max_features': None,
    'max_depth': 15,
    'criterion': 'entropy'
}

# Grid centered on the randomized-search winner
# (note: the -2/-1 offsets assume min_samples_split >= 4; clip at 2 otherwise)
param_grid = {
    'criterion': [randomSearch_params['criterion']],
    'max_depth': [randomSearch_params['max_depth']],
    'max_features': [randomSearch_params['max_features']],
    'min_samples_leaf': [randomSearch_params['min_samples_leaf'],
                         randomSearch_params['min_samples_leaf'] + 2,
                         randomSearch_params['min_samples_leaf'] + 4],
    'min_samples_split': [randomSearch_params['min_samples_split'] - 2,
                          randomSearch_params['min_samples_split'] - 1,
                          randomSearch_params['min_samples_split'],
                          randomSearch_params['min_samples_split'] + 1,
                          randomSearch_params['min_samples_split'] + 2],
    'n_estimators': [randomSearch_params['n_estimators'] - 200,
                     randomSearch_params['n_estimators'] - 100,
                     randomSearch_params['n_estimators'],
                     randomSearch_params['n_estimators'] + 100,
                     randomSearch_params['n_estimators'] + 200]
}

print(param_grid)
# Output:
{'criterion': ['entropy'],
 'max_depth': [15],
 'max_features': [None],
 'min_samples_leaf': [1, 3, 5],
 'min_samples_split': [3, 4, 5, 6, 7],
 'n_estimators': [1200, 1300, 1400, 1500, 1600]}
# GridSearchCV on RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier(random_state=42)

grid_search = GridSearchCV(estimator=rf,
                           param_grid=param_grid,
                           cv=5,
                           n_jobs=-1,
                           verbose=3)

# Fit the grid search to the training data
grid_search.fit(x_train, y_train)

# See the best params estimated by GridSearchCV
print(grid_search.best_params_)
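
Once the grid search finishes, the refitted best model can be inspected and evaluated directly (a short sketch; x_test and y_test are an assumed hold-out set):

# Best cross-validation score and the refitted best estimator
print(grid_search.best_score_)
best_rf = grid_search.best_estimator_
print(best_rf.score(x_test, y_test))  # x_test/y_test: assumed hold-out data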

Note: These parameter distributions are only suggestions; suitable ranges will vary with the dataset and the requirements of the problem.
