Aligning¶
Sign flips¶
- class graspologic.align.SignFlips[source]¶
Flips the signs of all entries in one dataset,
X
along some of the dimensions. In particular, it does so in a way that brings this dataset to the same orthant as the second dataset,Y
, according to some criterion, computed along each dimension. The two critera currently available are the median and the maximum (in magnitude) value along each dimension.This module can also be used to bring the dataset to the first orthant (i.e. with all criteras being positive) by providing the identity matrix as the second dataset.
- Parameters:
- criterionstring, {'median' (default), 'max'}, optional
String describing the criterion used to choose whether to flip signs. Two options are currently supported:
- 'median'
Uses the median along each dimension
- 'max'
Uses the maximum (in magintude) along each dimension
- Attributes:
- Q_array, size (d, d)
Final orthogonal matrix, used to modify
X
.
- fit(X, Y)[source]¶
Uses the two datasets to learn the matrix
Q_
that aligns the first dataset with the second.In sign flips,
Q_
is an diagonal orthogonal matrices (i.e. a matrix with 1 or -1 in each entry on diagonal and 0 everywhere else) picked such that all dimensions ofX
@Q_
andY
are in the same orthant using some critera (median or max magnitude).- Parameters:
- Xnp.ndarray, shape (n, d)
Dataset to be mapped to
Y
, must have same number of dimensions (axis 1) asY
.- Ynp.ndarray, shape (m, d)
Target dataset, must have same number of dimensions (axis 1) as
X
.
- Returns:
- selfreturns an instance of self
- Parameters:
X (ndarray)
Y (ndarray)
- Return type:
- fit_transform(X, Y)¶
Uses the two datasets to learn the matrix
Q_
that aligns the first dataset with the second. Then, transforms the first datasetX
using the learned matrixQ_
.- Parameters:
- Xnp.ndarray, shape (n, d)
Dataset to be mapped to
Y
, must have same number of dimensions (axis 1) asY
.- Ynp.ndarray, shape (m, d)
Target dataset, must have same number of dimensions (axis 1) as
X
.
- Returns:
- X_primenp.ndarray, shape (n, d)
First dataset of vectors, aligned to second. Equal to
X
@Q_
.
- Parameters:
X (ndarray)
Y (ndarray)
- Return type:
ndarray
- get_metadata_routing()¶
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
- routingMetadataRequest
A
MetadataRequest
encapsulating routing information.
- get_params(deep=True)¶
Get parameters for this estimator.
- Parameters:
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- paramsdict
Parameter names mapped to their values.
- set_params(**params)¶
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as
Pipeline
). The latter have parameters of the form<component>__<parameter>
so that it's possible to update each component of a nested object.- Parameters:
- **paramsdict
Estimator parameters.
- Returns:
- selfestimator instance
Estimator instance.
- transform(X)¶
Transforms the dataset
X
using the learned matrixQ_
. This may be the same as the first dataset as infit()
, or a new dataset. For example, additional samples from the same dataset.- Parameters:
- Xnp.ndarray, shape(m, d)
Dataset to be transformed, must have same number of dimensions (axis 1) as
X
andY
that were passed to fit.
- Returns:
- X_primenp.ndarray, shape (n, d)
First dataset of vectors, aligned to second. Equal to
X
@Q_
.
- Parameters:
X (ndarray)
- Return type:
ndarray
Orthogonal Procrustes¶
- class graspologic.align.OrthogonalProcrustes[source]¶
Computes the matrix solution of the classical orthogonal Procrustes [1] problem, which is that given two matrices
X
andY
of equal shape (n, d), find an orthogonal matrix that most closely mapsX
toY
. Subsequently, uses that matrix to transform either the originalX
, or a different dataset in the same space.Note that when used to match two datasets, this method unlike
SeedlessProcrustes
, not only requires that the datasets have the same number of entries, but also that there is some correspondence between the entries. In graph embeddings, this usually corresponds to the assumption that the vertex \(i\) in graphX
has the same latent position as the vertex \(i\) in graphY
.- Attributes:
- Q_array, size (d, d)
Final orthogonal matrix, used to modify
X
.- score_float
Final value of the objective function: \(|| X Q - Y ||_F\) Lower means the datasets have been matched together better.
Notes
Formally, minimizes \(|| X Q - Y ||_F\), which has a closed form solution, whenever \(Q\) is constrained to be an orthogonal matrix, that is a matrix that satisfies \(Q^T Q = Q Q^T = I\). For the more details, including the proof of the closed-form solution see [1].
Implementation-wise, this class is a wrapper of the
scipy.linalg.orthogonal_procrustes()
, which itself uses an algorithm described in find the optimal solution algorithm [2].References
[2]Peter H. Schonemann, "A generalized solution of the orthogonal Procrustes problem", Psychometrica -- Vol. 31, No. 1, March, 1996.
- fit(X, Y)[source]¶
Uses the two datasets to learn the matrix
Q_
that aligns the first dataset with the second.- Parameters:
- Xnp.ndarray, shape (n, d)
Dataset to be mapped to
Y
, must have the same shape asY
.- Ynp.ndarray, shape (m, d)
Target dataset, must have the same shape as
X
.
- Returns:
- selfreturns an instance of self
- Parameters:
X (ndarray)
Y (ndarray)
- Return type:
- fit_transform(X, Y)[source]¶
Uses the two datasets to learn the matrix
Q_
that aligns the first dataset with the second. Then, transforms the first datasetX
using the learned matrixQ_
.- Parameters:
- Xnp.ndarray, shape (n, d)
Dataset to be mapped to
Y
, must have the same shape asY
.- Ynp.ndarray, shape (m, d)
Target dataset, must have the same shape as
X
.
- Returns:
- X_primenp.ndarray, shape (n, d)
First dataset of vectors, aligned to second. Equal to
X
@Q_
.
- Parameters:
X (ndarray)
Y (ndarray)
- Return type:
ndarray
- get_metadata_routing()¶
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
- routingMetadataRequest
A
MetadataRequest
encapsulating routing information.
- get_params(deep=True)¶
Get parameters for this estimator.
- Parameters:
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- paramsdict
Parameter names mapped to their values.
- set_params(**params)¶
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as
Pipeline
). The latter have parameters of the form<component>__<parameter>
so that it's possible to update each component of a nested object.- Parameters:
- **paramsdict
Estimator parameters.
- Returns:
- selfestimator instance
Estimator instance.
- transform(X)¶
Transforms the dataset
X
using the learned matrixQ_
. This may be the same as the first dataset as infit()
, or a new dataset. For example, additional samples from the same dataset.- Parameters:
- Xnp.ndarray, shape(m, d)
Dataset to be transformed, must have same number of dimensions (axis 1) as
X
andY
that were passed to fit.
- Returns:
- X_primenp.ndarray, shape (n, d)
First dataset of vectors, aligned to second. Equal to
X
@Q_
.
- Parameters:
X (ndarray)
- Return type:
ndarray
Seedless Procrustes¶
- class graspologic.align.SeedlessProcrustes[source]¶
Matches two datasets using an orthogonal matrix. Unlike
OrthogonalProcrustes
, this does not require a matching between entries. It can even be used in the settings where the two datasets do not have the same number of entries.In graph setting, it is used to align the embeddings of two different graphs, when it requires some simultaneous inference task and no 1-1 matching between the vertices of the two graphs can be established, for example, inside of the test for the equivalence of the latent distributions (see:
LatentDistributionTest
).- Parameters:
- optimal_transport_lambdafloat (default=0.1), optional
Regularization term of the Sinkhorn optimal transport algorithm.
- optimal_transport_epsfloat (default=0.01), optional
Tolerance parameter for the each Sinkhorn optimal transport algorithm. I.e. tolerance for each "E-step".
- optimal_transport_num_repsint (default=1000), optional
Number of repetitions in each iteration of the iterative optimal transport problem. I.e. maximum number of repetitions in each "E-step".
- iterative_num_repsint (default=100), optional
Number of reps in each iteration of the iterative optimal transport problem. I.e. maxumum number of total iterations the whole "EM" algorithm.
- initstring, {'2d' (default), 'sign_flips', 'custom'}, optional
- '2d'
Uses \(2^d\) different restarts, where \(d\) is the dimension of the datasets. In particular, tries all matrices that are simultaneously diagonal and orthogonal. In other words, these are diagonal matrices with all entries on the diagonal being either +1 or -1. This is motivated by the fact that spectral graph embeddings have two types of orthogonal non-identifiability, one of which is captured by the orthogonal diagonal matrices. The final result is picked based on the final values of the objective function. For more on this, see [2].
- 'sign_flips'
Initial alignment done by making the median value in each dimension have the same sign. The motivation is similar to that in '2d', except this is a heuristic that can save time, but can sometimes yield suboptimal results.
- 'custom'
Expects either an initial guess for
Q_
or an initial guess forP_
, but not both. Seeinitial_Q
andinitial_P
, respectively. If neither is provided, initializesinitial_Q
to an identity with an appropriate number of dimensions.
- initial_Qnp.ndarray, shape (d, d) or None, optional (default=None)
An initial guess for the alignment matrix,
Q_
, if such exists. Only one ofinitial_Q
,initial_P
can be provided at the same time, and only ifinit
argument is set to 'custom'. If None, andinitial_P
is also None - initializesinitial_Q
to identity matrix. Must be an orthogonal matrix, if provided.- initial_Pnp.ndarray, shape (n, m) or None, optional (default=None)
Initial guess for the optimal transport matrix,
P_
, if such exists. Only one ofinitial_Q
,initial_P
can be provided at the same time, and only ifinit
argument is set to 'custom'. If None, andinitial_Q
is also None - initializesinitial_Q
to identity matrix. Must be a soft assignment matrix if provided (rows sum up to 1/n, cols sum up to 1/m.)
- Attributes:
- Q_array, size (d, d)
Final orthogonal matrix, used to modify
X
.- P_array, size (n, m) where n and m are the sizes of two datasets
Final matrix of optimal transports, represent soft matching weights from points in one dataset to the other, normalized such that all rows sum to 1/n and all columns sum to 1/m.
- score_float
Final value of the objective function: \(|| X Q - P Y ||_F\) Lower means the datasets have been matched together better.
- selected_initial_Q_array, size (d, d)
Initial orthogonal matrix which was used as the initialization. If
init
was set to '2d' or 'sign_flips', then it is the adaptively selected matrix. Ifinit
was set to 'custom', andinitial_Q
was provided, then equal to that. If it was not provided, butinitial_P
was, then it is the matrix after the first procrustes performed. If neither was provided, then it is the identity matrix.
Notes
In essence, the goal of this procedure is to simultaneously obtain a, not necessarily 1-to-1, correspondence between the vertices of the two data sets, and an orthogonal alignment between two datasets. If the two datasets are represented with matrices \(X \in M_{n, d}\) and \(Y \in M_{m, d}\), then the correspondence is a matrix \(P \in M_{n, m}\) that is soft assignment matrix (that is, its rows sum to \(1/n\), and columns sum to \(1/m\)) and the orthogonal alignment is an orthogonal matrix \(Q \in M_{d, d}\) (an orthogonal matrix is any matrix that satisfies \(Q^T Q = Q Q^T = I\)). The global objective function is \(|| X Q - P Y ||_F\).
Note that both \(X\) and \(PY\) are matrices in \(M_{n, d}\). Thus, if one knew \(P\), it would be simple to obtain an estimate for \(Q\), using the regular orthogonal procrustes. On the other hand, if \(Q\) was known, then \(XQ\) and \(Y\) could be thought of distributions over a finite number of masses, each with weight \(1/n\) or \(1/m\), respectively. These distributions could be "matched" via solving an optimal transport problem.
However, both \(Q\) and \(P\) are simultaneously unknown here. So the algorithm performs a sequence of alternating steps, obtaining iteratively improving estimates of \(Q\) and \(P\), similarly to an expectation-maximization (EM) procedure. It is not known whether this procedure is formally an EM, but the analogy can be drawn as follows: after obtaining an initial guess of of \(\hat{Q}_{0}\), obtaining an assignment matrix \(\hat{P}_{i+1} | \hat{Q}_{i}\) ("E-step") is done by solving an optimal transport problem via Sinkhorn algorithm, whereas obtaining an orthogonal alignment matrix \(Q_{i+1} | P_{i}\) ("M-step") is done via regular orthogonal procurstes. These alternating steps are performed until
iterative_num_reps
is reached.For more on how the initial guess can be performed, see
init
.References
[1]Agterberg, J., Tang, M., Priebe., C. E. (2020). "Nonparametric Two-Sample Hypothesis Testing for Random Graphs with Negative and Repeated Eigenvalues" arXiv:2012.09828
[2]Agterberg, J., Tang, M., Priebe., C. E. (2020). "On Two Distinct Sources of Nonidentifiability in Latent Position Random Graph Models" arXiv:2003.14250
- __init__(optimal_transport_lambda=0.1, optimal_transport_eps=0.01, optimal_transport_num_reps=1000, iterative_num_reps=100, init='2d', initial_Q=None, initial_P=None)[source]¶
- fit(X, Y)[source]¶
Uses the two datasets to learn the matrix self.Q_ that aligns the first dataset with the second.
- Parameters:
- Xnp.ndarray, shape (n, d)
Dataset to be mapped to
Y
, must have same number of dimensions (axis 1) asY
.- Ynp.ndarray, shape (m, d)
Target dataset, must have same number of dimensions (axis 1) as
X
.
- Returns:
- selfreturns an instance of self
- Parameters:
X (ndarray)
Y (ndarray)
- Return type:
- fit_transform(X, Y)¶
Uses the two datasets to learn the matrix
Q_
that aligns the first dataset with the second. Then, transforms the first datasetX
using the learned matrixQ_
.- Parameters:
- Xnp.ndarray, shape (n, d)
Dataset to be mapped to
Y
, must have same number of dimensions (axis 1) asY
.- Ynp.ndarray, shape (m, d)
Target dataset, must have same number of dimensions (axis 1) as
X
.
- Returns:
- X_primenp.ndarray, shape (n, d)
First dataset of vectors, aligned to second. Equal to
X
@Q_
.
- Parameters:
X (ndarray)
Y (ndarray)
- Return type:
ndarray
- get_metadata_routing()¶
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
- routingMetadataRequest
A
MetadataRequest
encapsulating routing information.
- get_params(deep=True)¶
Get parameters for this estimator.
- Parameters:
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- paramsdict
Parameter names mapped to their values.
- set_params(**params)¶
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as
Pipeline
). The latter have parameters of the form<component>__<parameter>
so that it's possible to update each component of a nested object.- Parameters:
- **paramsdict
Estimator parameters.
- Returns:
- selfestimator instance
Estimator instance.
- transform(X)¶
Transforms the dataset
X
using the learned matrixQ_
. This may be the same as the first dataset as infit()
, or a new dataset. For example, additional samples from the same dataset.- Parameters:
- Xnp.ndarray, shape(m, d)
Dataset to be transformed, must have same number of dimensions (axis 1) as
X
andY
that were passed to fit.
- Returns:
- X_primenp.ndarray, shape (n, d)
First dataset of vectors, aligned to second. Equal to
X
@Q_
.
- Parameters:
X (ndarray)
- Return type:
ndarray