SimilarityScorer

class emmet.core.similarity.SimilarityScorer

Bases: object

Mixin for ranking the similarity between structures.

Parameters

fingerprinter: BaseFeaturizer or None (default): A structural featurizer. If None, defaults to the featurizer used in the above reference.

featurize_structures(structures, num_procs=1)

Featurize structures using the user-defined _featurize_structure.

This method may be preferred when working with very large sets of structures, where computing the distances between all pairs of structures may cause memory errors.

Parameters

structures: list of Structure objects
num_procs: int = 1: Number of parallel processes to run in featurizing structures.

Returns

np.ndarray : the feature vectors of the input structures.

Parameters:

structures (list[Structure])
num_procs (int)
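The memory concern noted for this method can be made concrete with a rough estimate. The sketch below is arithmetic only and does not call emmet; it assumes dense float64 arrays, and the feature-vector length of 128 is a hypothetical value chosen for illustration.

```python
def dense_array_bytes(num_rows: int, num_cols: int, itemsize: int = 8) -> int:
    """Bytes needed for a dense num_rows x num_cols array (float64 by default)."""
    return num_rows * num_cols * itemsize


num_structures = 100_000
feature_length = 128  # hypothetical feature-vector length, for illustration only

# Feature vectors grow linearly with the number of structures...
feature_bytes = dense_array_bytes(num_structures, feature_length)
# ...while an all-pairs distance matrix grows quadratically.
distance_bytes = dense_array_bytes(num_structures, num_structures)

print(f"feature vectors:    {feature_bytes / 1e9:.2f} GB")
print(f"pairwise distances: {distance_bytes / 1e9:.2f} GB")
```

Under these assumptions the feature vectors need about 0.1 GB, while the full pairwise matrix needs 80 GB, which is why featurizing alone can be the safer first step.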

get_all_similarity_scores(structures, num_procs=1, **kwargs)

Rank the similarity between structures using CrystalNN.

Return type:

tuple[ndarray, ndarray]

Parameters:

structures (list[Structure])
num_procs (int)

Parameters

structures: list of Structure objects
num_procs: int = 1: Number of parallel processes to run in featurizing structures.
**kwargs: Keyword arguments to pass to vector_difference_matrix.

Returns

tuple of np.ndarray, np.ndarray: The first array contains the structure feature vectors, and the second their similarity scores.
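For orientation, the second returned array can be pictured as an all-pairs comparison of the feature vectors. The sketch below is not emmet's implementation: pairwise_distances is a hypothetical helper, and plain Euclidean distance is an assumption standing in for whatever metric vector_difference_matrix applies.

```python
import numpy as np


def pairwise_distances(feature_vectors: np.ndarray) -> np.ndarray:
    """Return the (n, n) Euclidean distance matrix for n row vectors.

    Hypothetical helper; the choice of Euclidean distance is an assumption.
    """
    # Broadcast to shape (n, n, d): the difference between every pair of rows.
    diffs = feature_vectors[:, None, :] - feature_vectors[None, :, :]
    return np.linalg.norm(diffs, axis=-1)


# Three toy feature vectors, one per row.
vectors = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])
distances = pairwise_distances(vectors)
print(distances)
```

The intermediate array has shape (n, n, d), so this direct approach is only suitable for small inputs.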

get_most_similar(feature_vectors, num_procs=1, num_top=100, labels=None)

Rank the similarity between structures using CrystalNN.

Return type:

dict[str, dict[str, list[str] | ndarray]]

Parameters:

feature_vectors (ndarray)
num_procs (int)
num_top (int)
labels (list[str] | None)

Parameters

feature_vectors: list of feature vectors
num_procs: int = 1: Number of parallel processes to run in featurizing structures.
num_top: int or None: If an int, returns that number of most similar structures indicated by their indices in the original list. If None, returns all distances.
labels: list of str or None: If a list of str, the labels corresponding to the feature vectors, e.g., MPIDs. If None, defaults to the list indices.

Returns

dict[int, dict[str, np.ndarray]]: Maps the index of each structure in structures to a dict containing the indices of the top num_top most similar structures and their corresponding distances.
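The ranking step can be sketched with NumPy alone. This is an illustration under stated assumptions, not emmet's code: rank_neighbors, the output keys "neighbors" and "distances", the placeholder labels, and the choice to exclude each entry from its own ranking are all hypothetical.

```python
import numpy as np


def rank_neighbors(distances, num_top=None, labels=None):
    """Rank each row's neighbors by increasing distance (hypothetical helper)."""
    n = distances.shape[0]
    if labels is None:
        labels = list(range(n))  # fall back to list indices
    ranked = {}
    for i in range(n):
        order = np.argsort(distances[i], kind="stable")
        order = order[order != i]  # assumption: an entry is not its own match
        if num_top is not None:
            order = order[:num_top]
        ranked[labels[i]] = {
            "neighbors": [labels[j] for j in order],
            "distances": distances[i, order],
        }
    return ranked


# Toy symmetric distance matrix for three entries with placeholder labels.
toy_distances = np.array([[0.0, 5.0, 1.0], [5.0, 0.0, 4.0], [1.0, 4.0, 0.0]])
ranked = rank_neighbors(toy_distances, num_top=1, labels=["id-a", "id-b", "id-c"])
print(ranked["id-a"])
```

Passing num_top=None in this sketch keeps every neighbor, mirroring the documented behavior of returning all distances.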

build_similarity_collection_from_structures(structures, num_procs=1, num_top=100)

Build a collection of similarity documents.

This defines the build pipeline for the MP similarity collection.

Return type:

list[SimilarityDoc]

Parameters:

structures (dict[str, Structure])
num_procs (int)
num_top (int)

Parameters

structures: dict of str (e.g., MPID) to a corresponding structure.
num_procs: int = 1: Number of parallel processes to run in featurizing structures.
num_top: int or None: If an int, returns that number of most similar structures indicated by their indices in the original list. If None, returns all distances.

Returns

A list of SimilarityDoc.

static get_vendi_score(feature_vectors)

Get the Vendi score of a set of feature vectors.

Uses the conventions described in arXiv:2210.02410. The score describes the diversity of a set of structures.

Return type:

float

Parameters:

feature_vectors (ndarray)

Parameters

feature_vectors: np.ndarray: The feature vectors, such as those from SimilarityScorer._featurize_structure. Each row should be a distinct feature vector.

Returns

float: The Vendi score. A Vendi score close to feature_vectors.shape[0] indicates high sample diversity, and a Vendi score close to 1 indicates low sample diversity.
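The two limiting cases can be checked with a small standalone sketch. It follows the general definition in arXiv:2210.02410 (the exponential of the Shannon entropy of the eigenvalues of K / n, for a similarity kernel K with ones on its diagonal). The cosine-similarity kernel and the vendi_score helper here are assumptions for illustration and may differ from what get_vendi_score uses internally.

```python
import numpy as np


def vendi_score(feature_vectors: np.ndarray) -> float:
    """Vendi score with a cosine-similarity kernel (the kernel is an assumption).

    Rows must be non-zero vectors, since each row is normalised to unit length.
    """
    norms = np.linalg.norm(feature_vectors, axis=1, keepdims=True)
    unit = feature_vectors / norms
    kernel = unit @ unit.T  # K has ones on its diagonal
    # The eigenvalues of K / n are non-negative and sum to 1.
    eigvals = np.linalg.eigvalsh(kernel / kernel.shape[0])
    eigvals = eigvals[eigvals > 1e-12]  # treat 0 * log(0) as 0
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


identical = np.array([[1.0, 0.0, 0.0]] * 3)  # three copies of one vector
orthogonal = np.eye(3)                        # three mutually orthogonal vectors

print(vendi_score(identical))
print(vendi_score(orthogonal))
```

Three identical rows give a score of approximately 1, and three mutually orthogonal rows give approximately 3, matching the interpretation given under Returns.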