PyDimRed package

Subpackages

PyDimRed.utils package

Submodules

PyDimRed.evaluation module

This module provides a class to evaluate the performance of DR methods. The two main methods are cross validation and a variation of it in which separate models are trained on the train and validation sets, because some DR models only have a ‘fit_transform’ method.

class PyDimRed.evaluation.ModelEvaluator(X: array, y: array, parameters: dict[str, list] | list[dict[str, list]], estimator=Pipeline(steps=[('Scaler', StandardScaler()), ('OneNN', KNeighborsClassifier(n_neighbors=1))]), scorer=None, K: int = 5, max_or_min: str = 'MAX', n_repeats: int = 1, n_jobs: int = 1)

Bases: object

ModelEvaluator is a class that evaluates the performance of DR methods on a given X, y data set.

__init__(X: array, y: array, parameters: dict[str, list] | list[dict[str, list]], estimator=Pipeline(steps=[('Scaler', StandardScaler()), ('OneNN', KNeighborsClassifier(n_neighbors=1))]), scorer=None, K: int = 5, max_or_min: str = 'MAX', n_repeats: int = 1, n_jobs: int = 1) → None

ModelEvaluator constructor

Args:

X (np.array): N x D dimensional dataset

y (np.array): N dimensional array of labels

parameters (dict[str, list]): dictionary that maps from parameter name (str) to values parameter will be set to (list)

estimator : estimator that determines fitness of model. The estimator must implement score() and fit() methods.

scorer : Optional function to override the estimator.score() function. For example, an estimator whose default score is R squared can instead be scored with sklearn’s mean squared error function. Must have the signature score_func(y, y_pred, **kwargs). Default is None

K (int): number of folds in K-fold cross validation like split. Default = 5

max_or_min (str) : ‘MAX’ if best score is maximum, else ‘MIN’

n_repeats (int): number of repeats for K_fold. Default = 1 (no repeats)

n_jobs (int): Number of jobs for the joblib backend. Default = 1. Setting n_jobs = -1 is equivalent to setting to the maximum number of jobs system can handle

Returns:

None

cross_validation(): Grid search cross validation for dimensionality reduction. ‘parameters’ defines a map from parameter name to all values that parameter takes. Fitness / scoring of DR model is determined by estimator or scorer (must have a score function). Note that all DR models passed must implement a ‘fit’ and ‘transform’ method to be coherent with the sklearn API

Returns:

best_score (float): best score (accuracy in this case)

best_params (dict): best parameters

results (pd.DataFrame): dataframe of results
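
The ‘parameters’ mapping can be read as a grid: every combination of the listed values is one candidate configuration. The following is a minimal sketch of that expansion using only the standard library. The helper name expand_grid is hypothetical and is not part of PyDimRed; the library's own expansion may differ.

```python
from itertools import product


def expand_grid(parameters):
    """Expand a dict (or list of dicts) of name -> values into single-value dicts."""
    # Accept both a single grid and a list of grids, as in the constructor signature.
    grids = [parameters] if isinstance(parameters, dict) else list(parameters)
    combos = []
    for grid in grids:
        names = list(grid)
        # Cartesian product over the value lists, in the order the names appear.
        for values in product(*(grid[name] for name in names)):
            combos.append(dict(zip(names, values)))
    return combos


grid = {"method": ["TSNE", "UMAP"], "n_nbrs": [5, 10, 15]}
combinations = expand_grid(grid)
print(len(combinations))  # 2 methods x 3 neighbour counts
print(combinations[0])
```

Passing a list of dictionaries keeps grids separate, which is useful when two methods do not share the same parameter names.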

grid_search_dr_performance()

Evaluate the performance of dimensionality reduction (DR) models.

This method performs the following steps to assess the quality of DR models like TSNE and TRIMAP:

Fits a DR model on the training data and transforms the training data.
Fits a new DR model on the validation data and transforms the validation data.
Uses an estimator to obtain a performance value. For classification, an sklearn pipeline with a StandardScaler and a 1-Nearest Neighbour classifier is used by default. For regression, a 1-NN regressor can be used instead of a classifier.

The process is repeated according to ‘K’ fold cross validation, repeated ‘n_repeats’ times.

Returns:

best_score (float): best score (accuracy in this case)

best_params (dict): best parameters

results (pd.DataFrame): dataframe of results. There is a column for each parameter and two extra columns: one for the empirical mean score obtained with the parameter values on that line, and one for the empirical variance
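
The procedure described above can be sketched in plain Python. This is an illustration under stated assumptions, not PyDimRed's implementation: ToyReducer is a hypothetical stand-in for a DR model that only offers fit_transform, the 1-NN scorer is written by hand, only the output dimension d is searched, and the variance is taken as the population variance of the fold scores.

```python
import random
from statistics import mean, pvariance


class ToyReducer:
    """Hypothetical stand-in for a DR model that only offers fit_transform."""

    def __init__(self, d=2):
        self.d = d

    def fit_transform(self, X):
        # Centre each feature, then keep the first d coordinates.
        n = len(X)
        means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
        return [[row[j] - means[j] for j in range(self.d)] for row in X]


def one_nn_accuracy(X_train, y_train, X_val, y_val):
    """Fraction of validation points whose nearest training point has the same label."""
    correct = 0
    for x, label in zip(X_val, y_val):
        dists = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in X_train]
        nearest = dists.index(min(dists))
        correct += y_train[nearest] == label
    return correct / len(y_val)


def k_fold_indices(n, k, rng):
    """Shuffle the indices 0..n-1 and split them into k validation folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def grid_search(X, y, d_values, K=5, n_repeats=1, seed=0):
    rng = random.Random(seed)
    results = []
    for d in d_values:
        scores = []
        for _ in range(n_repeats):
            for val_idx in k_fold_indices(len(X), K, rng):
                val = set(val_idx)
                train_idx = [i for i in range(len(X)) if i not in val]
                # Separate models: one fitted on train, a new one on validation.
                Z_train = ToyReducer(d).fit_transform([X[i] for i in train_idx])
                Z_val = ToyReducer(d).fit_transform([X[i] for i in val_idx])
                scores.append(one_nn_accuracy(
                    Z_train, [y[i] for i in train_idx],
                    Z_val, [y[i] for i in val_idx]))
        results.append({"d": d, "mean": mean(scores), "variance": pvariance(scores)})
    best = max(results, key=lambda r: r["mean"])
    return best["mean"], {"d": best["d"]}, results


# Small synthetic data set: two classes separated along the first feature.
X = [[(i % 2) * 10 + (i % 5) * 0.1, float(i), 1.0] for i in range(20)]
y = [i % 2 for i in range(20)]
best_score, best_params, results = grid_search(X, y, d_values=[1, 2])
print(best_score, best_params)
```

Because the train and validation embeddings come from two independently fitted models, the score measures how well the two embeddings agree, which is why this variant suits models without a separate transform method.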

PyDimRed.evaluation.one_NN_accuracy(X_train: array, y_train: array, X_val: array, y_val: array, task_type: str = 'CLASSIFICATION') → float: Trains a 1-NN algorithm on X_train and y_train and returns the accuracy of its predictions on the X_val dataset

Args:

X_train (np.array): N x D dimensional array of training data. N data points, D features

y_train (np.array): N dimensional array of training labels

X_val (np.array): M x D dimensional array of validation data. M data points, D features

y_val (np.array): M dimensional array of validation labels

task_type (str): Type of task: ‘CLASSIFICATION’ (default) or ‘REGRESSION’.

Returns:

accuracy (float): accuracy of the 1-NN predictions on the validation data

PyDimRed.exceptions module

This module defines new Error(s) used in the library. They are:

DimensionError: should be raised when dimension constraints are not respected in a method

This module also contains utility functions to raise these errors if a boolean condition is not satisfied.

exception PyDimRed.exceptions.DimensionError(message: str, X1=None, X2=None)

Bases: Exception

Exception raised for errors in data dimensions.

Attributes:

message (str): explanation of the error

X1 : Shape of first array

X2 : Shape of second array

__init__(message: str, X1=None, X2=None) → None

PyDimRed.exceptions.checkCondition(cond: bool, message: str = ''): Check if a boolean condition is true. If not, raise a ValueError

Args:

cond (bool): condition to check

message (str): error message to print

Returns:

None

PyDimRed.exceptions.checkDimensionCondition(cond: bool, message: str = '', X1=None, X2=None): Check if a boolean condition related to array dimensions is true. If not, raise a DimensionError

Args:

cond (bool): condition to check

message (str): error message to print

X1: shape of first array

X2: shape of second array

Returns:

None
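
The pattern behind these helpers can be sketched as follows. The class and function below are local stand-ins that mirror the documented signatures; they are not imported from PyDimRed, and the library's own error message formatting is not shown in this documentation.

```python
class DimensionError(Exception):
    """Local stand-in mirroring the documented signature; not PyDimRed's own class."""

    def __init__(self, message, X1=None, X2=None):
        super().__init__(message)
        self.message = message
        self.X1 = X1  # shape of first array
        self.X2 = X2  # shape of second array


def check_dimension_condition(cond, message="", X1=None, X2=None):
    """Raise DimensionError when cond is False; do nothing otherwise."""
    if not cond:
        raise DimensionError(message, X1, X2)


X_shape, y_shape = (100, 3), (90,)
try:
    check_dimension_condition(
        X_shape[0] == y_shape[0],
        "X and y must have the same number of rows",
        X_shape,
        y_shape,
    )
except DimensionError as err:
    caught = err
    print(f"{err.message}: {err.X1} vs {err.X2}")
```

Keeping the shapes on the exception lets the caller report exactly which arrays disagreed without re-inspecting the data.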

PyDimRed.plot module

This module provides various functions to visualize and evaluate the performance of dimensionality reduction (DR) models using different plots. The functions included can display relational plots, scatter plots, heatmaps, line plots, and bar plots to help in analyzing the performance of models such as TSNE, TRIMAP, and others.

PyDimRed.plot.display(X: array, y: array, marker_size=5, title: str = None, x_label: str = 'x', y_label: str = 'y', hue_label: str = 'z', figsize: tuple = None) → None: Display 3 dimensional data on a seaborn.relplot: two input feature dimensions in X and one output dimension in y

Args:

X (np.array): N x 2 dimensional array of feature data

y (np.array): N dimensional array of label / output data

marker_size (int) : size of markers on seaborn.relplot, default=5

title (str) : plot title. default=None

x_label (str): x axis label name

y_label (str): y axis label name

hue_label (str): name of color hue variable

figsize (tuple): x and y figure size in inches. Default = None

Returns:

None

PyDimRed.plot.display_accuracies(names: list[str], accuracies: list[float], title: str = None, hue_label: str = 'accuracy', figsize: tuple = None) → None: Simple function to plot accuracies and method name on a bar graph with color scale proportional to accuracy

Args:

names (list[str]) : name of each method

accuracies (list[float]): accuracy for each method

hue_label (str): name of color hue variable

figsize (tuple): x and y figure size in inches. Default = None

Returns:

None

PyDimRed.plot.display_group(names: list[str], X_train_list: list[array], y_train: array, X_test_list: list[array] = None, y_test: array = None, nbr_cols: int = 3, nbr_rows: int = 4, marker_size=5, legend: str = 'full', title: str = None, x_label: str = 'x', y_label: str = 'y', hue_label: str = 'z', grid_x_label: list[str] = None, grid_y_label: list[str] = None, figsize: tuple = None) → None: Given a list of train data sets (and optionally test data sets) that share the same labels (all in the same order), create a multi scatter plot of all data on a grid. If both X_test_list and y_test are not None, train and test data are differentiated by different markers. A ‘global’ grid can also be specified: if grid_x_label or grid_y_label is a list of names, the subplots are labelled grid-style along that axis.

Args:

names (list[str]): Title of each subplot. Subplots have no titles when names is None

X_train_list (list[np.array]): List of N x 2 dimensional data sets to be plotted

y_train (np.array): N dimensional array of label / output data

X_test_list (list[np.array]): Optional list of N x 2 dimensional test data sets to be plotted

y_test (np.array): Optional N dimensional array of label / output data

nbr_cols (int): Number of columns in the grouped plot, i.e. at most nbr_cols graphs side by side in each row

nbr_rows (int): Number of rows in the grouped plot, i.e. at most nbr_rows graphs stacked in each column

marker_size (int) : size of markers on seaborn.relplot, default=5

legend (str): seaborn legend argument. “full” show each different value on legend, “auto” will make seaborn decide

x_label (str): x axis label name

y_label (str): y axis label name

hue_label (str): name of color hue variable

grid_x_label (list[str]) : x-axis labels for global grid of subplots. default = None

grid_y_label (list[str]) : y-axis labels for global grid of subplots. default = None

figsize (tuple): x and y figure size in inches. Default = None

Returns:

None

PyDimRed.plot.display_heatmap(X: array, x_range: list, y_range: list, x_label: str = 'x', y_label: str = 'y', title: str = None, figsize: tuple = None) → None: Plot values of a 2D array on a heatmap with range of x and y values and corresponding feature names

Args:

X (np.array) : N1 x N2 two dimensional array of values.

x_range (list) : list of values for feature x of length N1. Each value corresponds to a row

y_range (list) : list of values for feature y of length N2. Each value corresponds to a column

x_label (str): x axis label name

y_label (str): y axis label name

figsize (tuple): x and y figure size in inches. Default = None

Returns:

None

PyDimRed.plot.display_heatmap_df(df: DataFrame, feature1: str, feature2: str, values: str, title: str = None, figsize: tuple = None) → None: Plot values of a data frame on a heatmap given feature column names and output column name

Args:

df (pd.DataFrame) : data frame containing the two feature columns and the values column

feature1 (str) : name of first feature (x)

feature2 (str): name of second feature (y)

values (str): name of values in data frame

figsize (tuple): x and y figure size in inches. Default = None

Returns:

None
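
A heatmap needs a two dimensional grid, while a results table is usually in long form with one row per parameter combination. The sketch below shows that pivot in plain Python. It assumes feature1 indexes the rows and feature2 the columns, which this documentation does not confirm; pivot_records is a hypothetical helper and the score values are made-up placeholders.

```python
def pivot_records(records, feature1, feature2, values):
    """Pivot long-form records into (row labels, column labels, 2D grid)."""
    rows = sorted({r[feature1] for r in records})
    cols = sorted({r[feature2] for r in records})
    # Cells with no matching record stay None.
    grid = [[None] * len(cols) for _ in rows]
    for r in records:
        grid[rows.index(r[feature1])][cols.index(r[feature2])] = r[values]
    return rows, cols, grid


# Placeholder scores, for illustration only.
records = [
    {"method": "TSNE", "n_nbrs": 5, "score": 0.1},
    {"method": "TSNE", "n_nbrs": 10, "score": 0.2},
    {"method": "UMAP", "n_nbrs": 5, "score": 0.3},
    {"method": "UMAP", "n_nbrs": 10, "score": 0.4},
]
rows, cols, grid = pivot_records(records, "method", "n_nbrs", "score")
print(rows, cols)
print(grid)
```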

PyDimRed.plot.display_training_validation(x: list, y: DataFrame, x_name: str = 'NumberNeighbours', title: str = None, figsize: tuple = None) → None

Line plot of accuracy vs. a common variable parameter for multiple methods

Args:

x (list): list of values x feature takes

y (pd.DataFrame): Each column is a method, and rows are corresponding accuracies for that method at given parameter value

x_name (str): Name of x feature being varied

title (str): plot title. Default is None, no title

figsize (tuple): x and y figure size in inches. Default = None

Returns:

None

Example:

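>>> import pandas as pd
>>> from PyDimRed.plot import display_training_validation
>>> parameters = [5, 10, 15, 20]  # list of n_nbrs values
>>> y = {
...    "TSNE" : [94.0990, 93.4519, 91.9385, 92.4469],
...    "TRIMAP" : [82.5421, 82.1633, 82.5700, 82.3850]
... }
>>> y = pd.DataFrame(y)  # one column per method, one row per n_nbrs value
>>> display_training_validation(parameters, y)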

PyDimRed.transform module

This module provides a wrapper class for any dimensionality reduction (DR) technique such as PCA, UMAP, TSNE, TRIMAP, and PACMAP. The TransformWrapper class standardizes the interface for these techniques, making it easier to integrate and use them in machine learning workflows.

class PyDimRed.transform.TransformWrapper(base_model=None, method: str = None, n_nbrs: int = 10, d: int = 2, default_model: bool = False, n_outliers: int = 4)

Bases: TransformerMixin, BaseEstimator

A wrapper class that provides a consistent interface for multiple dimensionality reduction methods.

__init__(base_model=None, method: str = None, n_nbrs: int = 10, d: int = 2, default_model: bool = False, n_outliers: int = 4)

Initialize the TransformWrapper with the specified DR method and hyperparameters. Note that either base_model or method must be None, as they both define the wrapped transformer.

If a ‘TransformWrapper’ is initialized via the method parameter, the ‘set_params’ and ‘get_params’ methods use this class’s attributes. If a ‘TransformWrapper’ is initialized via the base_model parameter, the ‘set_params’ and ‘get_params’ methods call the wrapped object’s methods.

Using the method parameter is more restrictive, as this class does not have access to every model’s underlying parameters, but it allows different models to share common parameter names like n_nbrs. This makes it easier to vary parameters in evaluation methods like cross validation.

Args:

base_model : default is None

method (str): a supported dimensionality reduction method. Available method values are ‘PCA’ / ‘UMAP’ / ‘TSNE’ / ‘TRIMAP’ / ‘PACMAP’. Default is None

n_nbrs (int): Common parameter value in DR methods. Can represent n_neighbours / n_inliers / perplexity. default is 10

d (int): Dimension data is reduced to. Default is 2

default_model (bool): If True wrapped model with default values is set. Default is False

n_outliers (int): Number of outlier data points. Is used in trimap.TRIMAP. Default is 4

Returns:

TransformWrapper

Note: when default_model is True and the method is PACMAP, pacmap determines n_nbrs on its own.
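
The benefit of a shared name such as n_nbrs is that one grid can drive several methods. The sketch below illustrates the idea with a lookup table. The keyword names are my understanding of each underlying library's constructor and are assumptions here; this documentation does not show how TransformWrapper performs the mapping internally, and native_neighbour_kwargs is a hypothetical helper.

```python
# Assumed mapping from the shared name to each library's own keyword.
NEIGHBOUR_KEYWORD = {
    "UMAP": "n_neighbors",
    "TSNE": "perplexity",
    "TRIMAP": "n_inliers",
    "PACMAP": "n_neighbors",
}


def native_neighbour_kwargs(method, n_nbrs=10):
    """Translate the shared n_nbrs value into a method-specific keyword dict."""
    keyword = NEIGHBOUR_KEYWORD.get(method)
    # Methods without an entry, such as PCA, have no neighbourhood parameter.
    return {} if keyword is None else {keyword: n_nbrs}


for method in ["PCA", "UMAP", "TSNE", "TRIMAP", "PACMAP"]:
    print(method, native_neighbour_kwargs(method, n_nbrs=15))
```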

fit(X: array, y: array = None): Call the fit function on the wrapped model. Not all models implement this function; if the wrapped model does not, a ValueError is raised

Args:

X (np.array): N x D array used to fit the model

y (np.array): label array for the corresponding X data; usually unused. Default = None

Returns:

TransformWrapper: fitted transform wrapper
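
The delegation described for fit can be sketched as below. All three classes are hypothetical stand-ins written for this example; the exact check and error message used by TransformWrapper are not shown in this documentation.

```python
class FitTransformOnly:
    """Hypothetical DR model exposing only fit_transform."""

    def fit_transform(self, X, y=None):
        return [row[:2] for row in X]


class FitAndTransform:
    """Hypothetical DR model exposing fit and transform separately."""

    def fit(self, X, y=None):
        self.n_features_ = len(X[0])
        return self

    def transform(self, X):
        return [row[:2] for row in X]


class DelegatingWrapper:
    """Forward fit() to the wrapped model, or fail loudly if it has none."""

    def __init__(self, base_model):
        self.base_model = base_model

    def fit(self, X, y=None):
        # Delegate only when the wrapped model really offers fit().
        if not hasattr(self.base_model, "fit"):
            name = type(self.base_model).__name__
            raise ValueError(f"{name} does not implement fit")
        self.base_model.fit(X, y)
        return self


X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
fitted = DelegatingWrapper(FitAndTransform()).fit(X)

try:
    DelegatingWrapper(FitTransformOnly()).fit(X)
except ValueError as err:
    message = str(err)
    print(message)
```

Returning self from fit keeps the wrapper chainable, in line with the sklearn convention.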

fit_transform(X: array, y: array) → array: Call the wrapped model’s fit_transform method

Args:

X (np.array): N x D dimensional array to fit model with

y (np.array): unused

Returns:

np.array: transformed data

classmethod from_params(**params) → TransformWrapper: Factory class method. Takes the same arguments as set_params and returns a new TransformWrapper instance.

Args:

**params : name value pairs that correspond to TransformWrapper attribute

Returns:

TransformWrapper

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool), default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params (dict): Parameter names mapped to their values.

reset_model() → None: Reset the model with the previously passed parameters. For example, models like trimap.TRIMAP can only have fit_transform called once.

Returns:

None
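
A reset of this kind amounts to rebuilding the wrapped model from the stored constructor parameters. The sketch below shows the idea with a hypothetical one-shot model; neither class comes from PyDimRed or trimap.

```python
class OneShotModel:
    """Hypothetical model whose fit_transform may only be called once."""

    def __init__(self, d=2):
        self.d = d
        self._used = False

    def fit_transform(self, X):
        if self._used:
            raise RuntimeError("fit_transform was already called on this instance")
        self._used = True
        return [row[: self.d] for row in X]


class ResettableWrapper:
    """Keep the constructor parameters so a fresh model can be built on demand."""

    def __init__(self, model_cls, **params):
        self.model_cls = model_cls
        self.params = params
        self.reset_model()

    def reset_model(self):
        # Build a fresh model from the parameters passed at construction.
        self.model = self.model_cls(**self.params)

    def fit_transform(self, X):
        return self.model.fit_transform(X)


X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
wrapper = ResettableWrapper(OneShotModel, d=2)
first = wrapper.fit_transform(X)
wrapper.reset_model()  # without this call the next line would raise RuntimeError
second = wrapper.fit_transform(X)
print(first == second)
```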

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict): Estimator parameters.

Returns

self (estimator instance): Estimator instance.
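
The <component>__<parameter> naming convention can be illustrated by splitting the keys of a parameter dictionary. This is a simplified sketch of the convention only, not sklearn's implementation; split_nested_params is a hypothetical helper, and the component names follow the default pipeline shown in the ModelEvaluator signature.

```python
def split_nested_params(params):
    """Separate top-level parameters from <component>__<parameter> ones."""
    own, nested = {}, {}
    for key, value in params.items():
        # partition splits at the first double underscore only.
        component, sep, sub_key = key.partition("__")
        if sep:
            nested.setdefault(component, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested


own, nested = split_nested_params(
    {"n_nbrs": 15, "OneNN__n_neighbors": 1, "Scaler__with_mean": False}
)
print(own)
print(nested)
```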

transform(X: array): Call the transform function on the wrapped model. Not all models implement this function; if the wrapped model does not, a ValueError is raised

Args:

X (np.array): N x D array of data to be transformed

Returns:

X_reduced (np.array): transformed data
