PyDimRed package
Subpackages
Submodules
PyDimRed.evaluation module
This module provides a class to evaluate the performance of DR methods. The two main methods are cross validation and a variation of it where separate models are trained on a train and a validation set, because some DR models only have a ‘fit_transform’ method.
- class PyDimRed.evaluation.ModelEvaluator(X: array, y: array, parameters: dict[str, list] | list[dict[str, list]], estimator=Pipeline(steps=[('Scaler', StandardScaler()), ('OneNN', KNeighborsClassifier(n_neighbors=1))]), scorer=None, K: int = 5, max_or_min: str = 'MAX', n_repeats: int = 1, n_jobs: int = 1)
Bases: object
ModelEvaluator is a class that evaluates the performance of DR methods on a given X, y data set.
- __init__(X: array, y: array, parameters: dict[str, list] | list[dict[str, list]], estimator=Pipeline(steps=[('Scaler', StandardScaler()), ('OneNN', KNeighborsClassifier(n_neighbors=1))]), scorer=None, K: int = 5, max_or_min: str = 'MAX', n_repeats: int = 1, n_jobs: int = 1) None
ModelEvaluator constructor
Args:
X (np.array): N x D dimensional dataset
y (np.array): N dimensional array of labels / targets
parameters (dict[str, list]): dictionary that maps from parameter name (str) to values parameter will be set to (list)
estimator : estimator that determines fitness of model. The estimator must implement score() and fit() methods.
scorer : Optional function to override the estimator.score() function. For example, the default R squared score function of a regression estimator can be replaced by sklearn’s mean squared error scoring function. Must have signature: score_func(y, y_pred, **kwargs). Default is None
K (int): number of folds in K-fold cross validation like split. Default = 5
max_or_min (str) : ‘MAX’ if best score is maximum, else ‘MIN’
n_repeats (int): number of repeats for K_fold. Default = 1 (no repeats)
n_jobs (int): Number of jobs for the joblib backend. Default = 1. Setting n_jobs = -1 is equivalent to setting to the maximum number of jobs system can handle
Returns:
None
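The ‘parameters’ argument can be read as a grid: each key maps to the values to try, and a list of such dictionaries describes several grids. The minimal sketch below is not PyDimRed's own code; it only shows one way such a grid could be expanded into individual parameter settings. The helper name expand_grid is hypothetical.

```python
from itertools import product


def expand_grid(parameters):
    """Expand a parameter grid into a list of single parameter settings.

    `parameters` is either one dict[str, list] or a list of such dicts,
    mirroring the type accepted by ModelEvaluator. This helper is
    hypothetical and only illustrates the idea.
    """
    grids = [parameters] if isinstance(parameters, dict) else list(parameters)
    settings = []
    for grid in grids:
        names = list(grid)
        # Cartesian product over the value lists of this grid.
        for values in product(*(grid[name] for name in names)):
            settings.append(dict(zip(names, values)))
    return settings


grid = {"method": ["TSNE", "TRIMAP"], "n_nbrs": [5, 10]}
for setting in expand_grid(grid):
    print(setting)
```

Each printed dictionary corresponds to one candidate configuration that a grid search would score.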
- cross_validation()
Grid search cross validation for dimensionality reduction. ‘parameters’ defines a map from parameter name to all values that parameter takes. Fitness / scoring of DR model is determined by estimator or scorer (must have a score function). Note that all DR models passed must implement a ‘fit’ and ‘transform’ method to be coherent with the sklearn API
Returns:
best_score (float): best score (accuracy in this case)
best_params (dict): best parameters
results (pd.DataFrame): dataframe of results
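The roles of ‘K’, ‘n_repeats’ and ‘max_or_min’ can be illustrated with a small standard-library sketch. It assumes a shuffled K-fold split per repeat and a plain max / min over mean scores; the helper names k_fold_indices and pick_best are hypothetical, and the library's actual splitting strategy may differ.

```python
import random


def k_fold_indices(n_samples, k=5, n_repeats=1, seed=0):
    """Yield (train_idx, val_idx) pairs for repeated K-fold splitting."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
        folds = [indices[i::k] for i in range(k)]
        for i, val_idx in enumerate(folds):
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train_idx, val_idx


def pick_best(mean_scores, max_or_min="MAX"):
    """Return (best_key, best_score) from a dict mapping settings to mean scores."""
    chooser = max if max_or_min == "MAX" else min
    best_key = chooser(mean_scores, key=mean_scores.get)
    return best_key, mean_scores[best_key]


splits = list(k_fold_indices(10, k=5, n_repeats=2))
print(len(splits))  # one entry per fold and repeat
print(pick_best({"n_nbrs=5": 0.94, "n_nbrs=10": 0.91}, "MAX"))
```

With an error-style metric such as mean squared error, ‘MIN’ would be the appropriate choice.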
- grid_search_dr_performance()
Evaluate the performance of dimensionality reduction (DR) models.
This method performs the following steps to assess the quality of DR models like TSNE and TRIMAP:
Fits a DR model on the training data and transforms the training data.
Fits a new DR model on the validation data and transforms the validation data.
Uses an estimator to obtain performance value. For classification an sklearn pipeline with a Standard Scaler and 1-Nearest Neighbour classifier is used by default. For regression a 1-NN regressor can be used instead of a classifier
Process is repeated according to ‘K’ fold cross validation with ‘n_repeats’ repetitions per fold
Returns:
best_score (float): best score (accuracy in this case)
best_params (dict): best parameters
results (pd.DataFrame): dataframe of results. There is a column for each parameter and two extra columns: one for the empirical mean score obtained with the parameter values on that line, and one for the empirical variance
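The steps above can be sketched with the standard library alone. Everything in this example is an assumption made for illustration: ToyDR is a hypothetical stand-in for a model that only offers fit_transform, one_nn_score is a hand-written 1-NN accuracy, the data are made up, and the population variance is used for the variance column.

```python
from statistics import mean, pvariance


class ToyDR:
    """Hypothetical stand-in for a DR model that only offers fit_transform."""

    def __init__(self, d=1):
        self.d = d

    def fit_transform(self, X):
        # Keep the first d coordinates and centre them; a real model
        # (TSNE, TRIMAP) would compute an embedding here.
        centred = []
        for col in list(zip(*X))[: self.d]:
            m = mean(col)
            centred.append([v - m for v in col])
        return [list(row) for row in zip(*centred)]


def one_nn_score(Z_train, y_train, Z_val, y_val):
    """Accuracy of a 1-nearest-neighbour classifier fitted on the train embedding."""
    hits = 0
    for z, label in zip(Z_val, y_val):
        dists = [sum((a - b) ** 2 for a, b in zip(z, t)) for t in Z_train]
        hits += y_train[dists.index(min(dists))] == label
    return hits / len(y_val)


def evaluate_setting(splits, d):
    """Score one parameter setting over all (train, validation) splits."""
    scores = []
    for (X_tr, y_tr), (X_va, y_va) in splits:
        Z_tr = ToyDR(d).fit_transform(X_tr)  # model fitted on train data only
        Z_va = ToyDR(d).fit_transform(X_va)  # new model fitted on validation data only
        scores.append(one_nn_score(Z_tr, y_tr, Z_va, y_va))
    return {"d": d, "mean": mean(scores), "variance": pvariance(scores)}


fold_a = ([[0.0, 5.0], [1.0, 5.0], [10.0, 5.0], [11.0, 5.0]], [0, 0, 1, 1])
fold_b = ([[0.5, 1.0], [10.5, 1.0]], [0, 1])
splits = [(fold_a, fold_b), (fold_b, fold_a)]
result = evaluate_setting(splits, d=1)
print(result)
```

Because the two embeddings come from independently fitted models, the score reflects how consistently the DR method lays out the classes rather than how well one fitted model generalises.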
- PyDimRed.evaluation.one_NN_accuracy(X_train: array, y_train: array, X_val: array, y_val: array, task_type: str = 'CLASSIFICATION') float
Trains a 1-NN algorithm on X_train and y_train and outputs the accuracy of its predictions on the X_val dataset
Args:
X_train (np.array): N x D dimensional array of training data. N data points, D features
y_train (np.array): N dimensional array of training labels
X_val (np.array): N x D dimensional array of test data
y_val (np.array): N dimensional array of test labels
task_type (str): Type of task - ‘classification’ or ‘regression’.
Returns:
accuracy (float): accuracy of test data with 1-NN
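One way to obtain such a 1-NN accuracy with scikit-learn is sketched below. It reuses the pipeline structure shown as the default estimator of ModelEvaluator. Whether one_NN_accuracy scales the data in the same way is an assumption, and the toy arrays are made up.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy 2-D embeddings with two well separated classes (made-up data).
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
y_train = np.array([0, 0, 1, 1])
X_val = np.array([[0.1, 0.2], [4.8, 5.2]])
y_val = np.array([0, 1])

# Same structure as the default estimator in the ModelEvaluator signature.
one_nn = Pipeline(
    steps=[
        ("Scaler", StandardScaler()),
        ("OneNN", KNeighborsClassifier(n_neighbors=1)),
    ]
)
one_nn.fit(X_train, y_train)
accuracy = one_nn.score(X_val, y_val)  # mean accuracy on the validation set
print(accuracy)
```

For a regression task, sklearn's KNeighborsRegressor(n_neighbors=1) could take the place of the classifier, in which case score() reports R squared instead of accuracy.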
PyDimRed.exceptions module
This module defines new Error(s) used in the library. They are:
DimensionError: should be raised when dimension constraints are not respected in a method
This module also contains utility functions to raise these errors if a boolean condition is not satisfied.
- exception PyDimRed.exceptions.DimensionError(message: str, X1=None, X2=None)
Bases: Exception
Exception raised for errors in data dimensions.
Attributes:
message (str): explanation of the error
X1 : Shape of first array
X2 : Shape of second array
- __init__(message: str, X1=None, X2=None) None
- PyDimRed.exceptions.checkCondition(cond: bool, message: str = '')
Check if a boolean condition is true. If not, raise a ValueError
Args:
cond (bool): condition to check
message (str): error message to print
Returns:
None
- PyDimRed.exceptions.checkDimensionCondition(cond: bool, message: str = '', X1=None, X2=None)
Check if a boolean condition related to array dimensions is true. If not, raise a DimensionError
Args:
cond (bool): condition to check
message (str): error message to print
X1: shape of first array
X2: shape of second array
Returns:
None
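The pattern described in this module can be illustrated with a self-contained sketch. The class and function below are stand-ins written for this example from the documented signatures, not the library's source; check_dimension_condition is a hypothetical name that mirrors checkDimensionCondition.

```python
class DimensionError(Exception):
    """Stand-in mirroring the documented PyDimRed.exceptions.DimensionError."""

    def __init__(self, message, X1=None, X2=None):
        super().__init__(message)
        self.message = message
        self.X1 = X1  # shape of first array
        self.X2 = X2  # shape of second array


def check_dimension_condition(cond, message="", X1=None, X2=None):
    """Raise DimensionError when `cond` is False (hypothetical re-implementation)."""
    if not cond:
        raise DimensionError(message, X1, X2)


X_shape, y_shape = (100, 3), (90,)
try:
    check_dimension_condition(
        X_shape[0] == y_shape[0],
        "X and y must have the same number of rows",
        X_shape,
        y_shape,
    )
except DimensionError as err:
    print(err.message, err.X1, err.X2)
```

Carrying both shapes on the exception lets the caller report exactly which arrays disagreed.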
PyDimRed.plot module
This module provides various functions to visualize and evaluate the performance of dimensionality reduction (DR) models using different plots. The functions included can display relational plots, scatter plots, heatmaps, line plots, and bar plots to help in analyzing the performance of models such as TSNE, TRIMAP, and others.
- PyDimRed.plot.display(X: array, y: array, marker_size=5, title: str = None, x_label: str = 'x', y_label: str = 'y', hue_label: str = 'z', figsize: tuple = None) None
Display 3 dimensional data on a seaborn.relplot: two input feature dimensions in X and one output dimension in y
Args:
X (np.array): N x 2 dimensional array of feature data
y (np.array): N dimensional array of label / output data
marker_size (int) : size of markers on seaborn.relplot, default=5
title (str) : plot title. default=None
x_label (str): x axis label name
y_label (str): y axis label name
hue_label (str): name of color hue variable
figsize (tuple): x and y figure size in inches. Default = None
Returns:
None
- PyDimRed.plot.display_accuracies(names: list[str], accuracies: list[float], title: str = None, hue_label: str = 'accuracy', figsize: tuple = None) None
Simple function to plot accuracies and method name on a bar graph with color scale proportional to accuracy
Args:
names (list[str]) : name of each method
accuracies (list[float]): accuracy for each method
title (str): plot title. Default = None
hue_label (str): name of color hue variable
figsize (tuple): x and y figure size in inches. Default = None
Returns:
None
- PyDimRed.plot.display_group(names: list[str], X_train_list: list[array], y_train: array, X_test_list: list[array] = None, y_test: array = None, nbr_cols: int = 3, nbr_rows: int = 4, marker_size=5, legend: str = 'full', title: str = None, x_label: str = 'x', y_label: str = 'y', hue_label: str = 'z', grid_x_label: list[str] = None, grid_y_label: list[str] = None, figsize: tuple = None) None
Given a list of train data sets (and optionally test data sets) with the same labels (must all be in the same order), create a multi scatter plot of all data on a grid. If both X_test_list and y_test are not None, train data and test data are differentiated via different markers. A ‘global’ grid can also be specified: if grid_x_label or grid_y_label is a list of names, the subplots are labelled along the grid accordingly
Args:
names (list[str]): Title of each subplot. Subplots don’t have titles when default = None
X_train_list (list[np.array]): List of N x 2 dimensional data sets to be plotted
y_train (np.array): N dimensional array of label / output data
X_test_list (list[np.array]): Optional list of N x 2 dimensional test data sets to be plotted
y_test (np.array): Optional N dimensional array of label / output data
nbr_cols (int): Number of columns in the grouped plot, i.e. at most nbr_cols graphs are placed side by side in a row
nbr_rows (int): Number of rows in the grouped plot, i.e. at most nbr_rows graphs are stacked in a column
marker_size (int) : size of markers on seaborn.relplot, default=5
legend (str): seaborn legend argument. “full” show each different value on legend, “auto” will make seaborn decide
x_label (str): x axis label name
y_label (str): y axis label name
hue_label (str): name of color hue variable
grid_x_label (list[str]) : x-axis labels for global grid of subplots. default = None,
grid_y_label (list[str]) : y-axis labels for global grid of subplots. default = None
figsize (tuple): x and y figure size in inches. Default = None
Returns:
None
- PyDimRed.plot.display_heatmap(X: array, x_range: list, y_range: list, x_label: str = 'x', y_label: str = 'y', title: str = None, figsize: tuple = None) None
Plot values of a 2D array on a heatmap with range of x and y values and corresponding feature names
Args:
X (np.array) : N1 x N2 two dimensional array of values.
x_range (list) : list of values for feature x of length N1. Each value corresponds to a row
y_range (list) : list of values for feature y of length N2. Each value corresponds to a column
x_label (str): x axis label name
y_label (str): y axis label name
figsize (tuple): x and y figure size in inches. Default = None
Returns:
None
- PyDimRed.plot.display_heatmap_df(df: DataFrame, feature1: str, feature2: str, values: str, title: str = None, figsize: tuple = None) None
Plot values of a data frame on a heatmap given feature column names and output column name
Args:
df (pd.DataFrame) : data frame containing the two feature columns and the values column
feature1 (str) : name of first feature (x)
feature2 (str): name of second feature (y)
values (str): name of values in data frame
figsize (tuple): x and y figure size in inches. Default = None
Returns:
None
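A heatmap needs a 2D grid, whereas grid-search results are usually stored as one record per parameter setting. The sketch below shows that reshaping with the standard library only. The helper pivot_to_grid and the records are made up for illustration. With pandas, DataFrame.pivot performs the same reshaping; whether display_heatmap_df does it this way is an assumption.

```python
def pivot_to_grid(rows, feature1, feature2, values):
    """Arrange long-form records into a 2D grid indexed by two features.

    Returns (x_range, y_range, grid) where grid[i][j] is the value recorded
    for x_range[i] and y_range[j], or None when that pair is missing.
    """
    x_range = sorted({row[feature1] for row in rows})
    y_range = sorted({row[feature2] for row in rows})
    lookup = {(row[feature1], row[feature2]): row[values] for row in rows}
    grid = [[lookup.get((x, y)) for y in y_range] for x in x_range]
    return x_range, y_range, grid


# Made-up grid-search results: one record per parameter setting.
records = [
    {"n_nbrs": 5, "d": 2, "score": 0.91},
    {"n_nbrs": 5, "d": 3, "score": 0.93},
    {"n_nbrs": 10, "d": 2, "score": 0.89},
    {"n_nbrs": 10, "d": 3, "score": 0.90},
]
x_range, y_range, grid = pivot_to_grid(records, "n_nbrs", "d", "score")
print(x_range, y_range)
for row in grid:
    print(row)
```

The returned triple follows the layout described for display_heatmap: each value of the first feature corresponds to a row and each value of the second feature to a column.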
- PyDimRed.plot.display_training_validation(x: list, y: DataFrame, x_name: str = 'NumberNeighbours', title: str = None, figsize: tuple = None) None
Line plot of accuracy vs. a common variable parameter for multiple methods
Args:
x (list): list of values x feature takes
y (pd.DataFrame): Each column is a method, and rows are corresponding accuracies for that method at given parameter value
x_name (str): Name of x feature being varied
title (str): plot title. Default is None, no title
figsize (tuple): x and y figure size in inches. Default = None
Returns:
None
Example:
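>>> import pandas as pd
>>> from PyDimRed.plot import display_training_validation
>>> parameters = [5, 10, 15, 20]  # list of n_nbrs
>>> y = {
...     "TSNE": [94.0990, 93.4519, 91.9385, 92.4469],
...     "TRIMAP": [82.5421, 82.1633, 82.5700, 82.3850],
... }
>>> y = pd.DataFrame(y)  # one column per method, one row per n_nbrs value
>>> display_training_validation(parameters, y)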
PyDimRed.transform module
This module provides a wrapper class for any dimensionality reduction (DR) technique such as PCA, UMAP, TSNE, TRIMAP, and PACMAP. The TransformWrapper class standardizes the interface for these techniques, making it easier to integrate and use them in machine learning workflows.
- class PyDimRed.transform.TransformWrapper(base_model=None, method: str = None, n_nbrs: int = 10, d: int = 2, default_model: bool = False, n_outliers: int = 4)
Bases: TransformerMixin, BaseEstimator
A wrapper class that provides a consistent interface for multiple dimensionality reduction methods.
- __init__(base_model=None, method: str = None, n_nbrs: int = 10, d: int = 2, default_model: bool = False, n_outliers: int = 4)
Initialize the TransformWrapper with the specified DR method and hyperparameters. Note that either base_model or method must be None, as they both define the wrapped transformer. If a ‘TransformWrapper’ is initialized via the method parameter, the ‘set_params’ and ‘get_params’ methods use this class's attributes. If a ‘TransformWrapper’ is initialized via the base_model parameter, the ‘set_params’ and ‘get_params’ methods call the wrapped object's methods. Using the method parameter is more restrictive, as this class does not have access to every model's underlying parameters, but it allows different models to share common parameter names like n_nbrs. This makes it easier to vary parameters in evaluation methods like cross validation.
Args:
base_model : default is None
method (str): a supported dimensionality reduction method. Available method values are ‘PCA’ / ‘UMAP’ / ‘TSNE’ / ‘TRIMAP’ / ‘PACMAP’. Default is None
n_nbrs (int): Common parameter value in DR methods. Can represent n_neighbours / n_inliers / perplexity. default is 10
d (int): Dimension data is reduced to. Default is 2
default_model (bool): If True wrapped model with default values is set. Default is False
n_outliers (int): Number of outlier data points. Is used in trimap.TRIMAP. Default is 4
Returns:
TransformWrapper
Note: setting default_model to True for PACMAP means that PACMAP determines n_nbrs on its own.
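The idea of sharing common parameter names across methods can be sketched as a lookup table. The mapping below is an illustration only: the target names are believed to match the public constructors of scikit-learn, umap-learn, trimap and pacmap, but they should be treated as assumptions, and the table is not PyDimRed's internal one. The helper translate_params is hypothetical.

```python
# Assumed mapping from the shared names (n_nbrs, d, n_outliers) to each
# library's own keyword arguments. Illustration only.
PARAM_NAMES = {
    "PCA": {"d": "n_components"},
    "TSNE": {"d": "n_components", "n_nbrs": "perplexity"},
    "UMAP": {"d": "n_components", "n_nbrs": "n_neighbors"},
    "TRIMAP": {"d": "n_dims", "n_nbrs": "n_inliers", "n_outliers": "n_outliers"},
    "PACMAP": {"d": "n_components", "n_nbrs": "n_neighbors"},
}


def translate_params(method, **common):
    """Translate shared parameter names into method-specific keyword arguments.

    Shared parameters that a method does not use are silently dropped.
    """
    names = PARAM_NAMES[method]
    return {names[key]: value for key, value in common.items() if key in names}


for method in PARAM_NAMES:
    print(method, translate_params(method, n_nbrs=10, d=2, n_outliers=4))
```

A table like this is what lets a single grid over n_nbrs drive several DR methods during cross validation.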
- fit(X: array, y: array = None)
Call the fit function on the wrapped model. Not all models implement this function; if the wrapped model does not, a ValueError is raised
Args:
X (np.array): N x D array used to fit the model
y (np.array): default = None, usually unused but label array for corresponding X data
Returns:
TransformWrapper: fitted transform wrapper
- fit_transform(X: array, y: array) array
Calls the wrapped model’s fit transform method
Args:
X (np.array): N x D dimensional array to fit model with
y (np.array): unused
Returns:
np.array: transformed data
- classmethod from_params(**params) TransformWrapper
Takes the same arguments as set_params and returns a new instance of type TransformWrapper. Factory / class method.
Args:
**params : name value pairs that correspond to TransformWrapper attribute
Returns:
TransformWrapper
- get_params(deep=True)
Get parameters for this estimator.
Parameters
deep : bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns
params : dict
Parameter names mapped to their values.
- reset_model() None
Reset the model with previously passed parameters. For example models like trimap.TRIMAP can only have fit transform called once!
Returns:
None
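The reset pattern can be sketched as follows: the wrapper remembers the constructor arguments and rebuilds the wrapped model on demand. OneShotModel and ResettableWrapper are hypothetical classes written for this example and do not reproduce TransformWrapper's implementation.

```python
class OneShotModel:
    """Hypothetical model whose fit_transform may only be called once."""

    def __init__(self, scale=1.0):
        self.scale = scale
        self._used = False

    def fit_transform(self, X):
        if self._used:
            raise RuntimeError("fit_transform was already called on this instance")
        self._used = True
        return [[self.scale * v for v in row] for row in X]


class ResettableWrapper:
    """Keeps the constructor arguments so a fresh model can be rebuilt."""

    def __init__(self, model_cls, **params):
        self._model_cls = model_cls
        self._params = dict(params)
        self.reset_model()

    def reset_model(self):
        # Rebuild the wrapped model from the previously passed parameters.
        self.model = self._model_cls(**self._params)

    def fit_transform(self, X):
        return self.model.fit_transform(X)


wrapper = ResettableWrapper(OneShotModel, scale=2.0)
first = wrapper.fit_transform([[1.0, 2.0]])
wrapper.reset_model()  # without this, the second call would raise
second = wrapper.fit_transform([[3.0, 4.0]])
print(first, second)
```

In an evaluation loop, a reset before every fold keeps each fit independent of the previous one.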
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
Parameters
**params : dict
Estimator parameters.
Returns
self : estimator instance
Estimator instance.
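The <component>__<parameter> naming convention can be illustrated with a short sketch that separates an estimator's own parameters from those addressed to nested components. The helper split_nested_params is hypothetical and only mimics the naming rule; it does not set anything on a real estimator.

```python
def split_nested_params(params):
    """Separate plain parameters from <component>__<parameter> entries."""
    own, nested = {}, {}
    for key, value in params.items():
        # Split on the first double underscore only, so deeper nesting
        # (a__b__c) stays attached to the sub-key.
        name, sep, sub_key = key.partition("__")
        if sep:
            nested.setdefault(name, {})[sub_key] = value
        else:
            own[name] = value
    return own, nested


own, nested = split_nested_params(
    {"d": 2, "OneNN__n_neighbors": 1, "Scaler__with_mean": False}
)
print(own)
print(nested)
```

The component names used here, Scaler and OneNN, are the step names of the default estimator pipeline shown for ModelEvaluator.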