SimilarityScorer

class emmet.core.similarity.SimilarityScorer

Bases: object

Mixin for ranking the similarity between structures.

Parameters

fingerprinter: BaseFeaturizer or None (default)

A structural featurizer. If None, defaults to the featurizer used in the above reference.

featurize_structures(structures, num_procs=1)

Featurize structures using the user-defined _featurize_structure.

Prefer this method for very large sets of structures, where computing all pairwise distances between structures may cause memory errors.

Parameters

structures : list of Structure objects

num_procs : int = 1

Number of parallel processes to run in featurizing structures.

Returns

np.ndarray : the feature vectors of the input structures.

Parameters:
  • structures (list[Structure])

  • num_procs (int)
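To illustrate why pairwise distances can exhaust memory for large inputs, the sketch below estimates the size of a dense N x N distance matrix. It assumes the distances are stored densely as 64-bit floats, which is not confirmed by this page; dense_distance_matrix_gib is a hypothetical helper and not part of emmet.

```python
import numpy as np


def dense_distance_matrix_gib(num_structures: int, dtype=np.float64) -> float:
    """Size in GiB of a dense (num_structures x num_structures) matrix.

    Hypothetical helper for illustration only. Assumes dense storage with
    one entry of the given dtype per pair of structures.
    """
    bytes_per_entry = np.dtype(dtype).itemsize
    return num_structures**2 * bytes_per_entry / 2**30


# The memory footprint grows quadratically with the number of structures.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} structures -> {dense_distance_matrix_gib(n):10.3f} GiB")
```

The feature vectors themselves grow only linearly with the number of structures, which is why featurizing without computing distances may remain feasible for large sets.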

get_all_similarity_scores(structures, num_procs=1, **kwargs)

Rank the similarity between structures using CrystalNN.

Return type:

tuple[ndarray, ndarray]

Parameters:
  • structures (list[Structure])

  • num_procs (int)

Parameters

structures : list of Structure objects

num_procs : int = 1

Number of parallel processes to run in featurizing structures.

**kwargs

Kwargs to pass to vector_difference_matrix

Returns

tuple of np.ndarray, np.ndarray

The first array contains the structure feature vectors, and the second their similarity scores.

get_most_similar(feature_vectors, num_procs=1, num_top=100, labels=None)

Rank the similarity between structures using CrystalNN.

Return type:

dict[str, dict[str, list[str] | ndarray]]

Parameters:
  • feature_vectors (ndarray)

  • num_procs (int)

  • num_top (int)

  • labels (list[str] | None)

Parameters

feature_vectors : list of feature vectors

num_procs : int = 1

Number of parallel processes to run in featurizing structures.

num_top : int or None

If an int, returns that number of most similar structures indicated by their indices in the original list. If None, returns all distances.

labels : list of str or None

If a list of str, the labels corresponding to the feature vectors, e.g., MPIDs. If None, defaults to the list indices.

Returns

dict[int, dict[str, np.ndarray]], containing the index of the structure in structures, with a dict containing the indices of the top num_top most similar structures and their corresponding distances.
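The ranking described above can be sketched in plain NumPy. This is a minimal illustration, not the emmet implementation: it assumes Euclidean distance between feature vectors, and the function name top_k_most_similar and the result keys "labels" and "distances" are hypothetical.

```python
import numpy as np


def top_k_most_similar(feature_vectors, num_top=None, labels=None):
    """Rank each vector's nearest neighbours by Euclidean distance.

    Hypothetical helper for illustration only. The distance metric and the
    keys of the returned dicts are assumptions, not the emmet API.
    """
    vectors = np.asarray(feature_vectors, dtype=float)
    n = vectors.shape[0]
    if labels is None:
        # Fall back to the list indices when no labels are given.
        labels = list(range(n))

    # Pairwise Euclidean distances via broadcasting, shape (n, n).
    diffs = vectors[:, None, :] - vectors[None, :, :]
    distances = np.linalg.norm(diffs, axis=-1)

    results = {}
    for i, label in enumerate(labels):
        # Sort by increasing distance and drop the vector itself.
        order = np.argsort(distances[i], kind="stable")
        order = order[order != i]
        if num_top is not None:
            order = order[:num_top]
        results[label] = {
            "labels": [labels[j] for j in order],
            "distances": distances[i, order],
        }
    return results


ranked = top_k_most_similar(
    [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]],
    num_top=1,
    labels=["a", "b", "c"],
)
print(ranked["c"]["labels"], ranked["c"]["distances"])
```

Passing num_top=None in this sketch returns every other vector in order of increasing distance, mirroring the "returns all distances" behavior described for num_top.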

build_similarity_collection_from_structures(structures, num_procs=1, num_top=100)

Build a collection of similarity documents.

This defines the build pipeline for the MP similarity collection.

Return type:

list[SimilarityDoc]

Parameters:
  • structures (dict[str, Structure])

  • num_procs (int)

  • num_top (int)

Parameters

structures : dict of str (e.g., MPID) to a corresponding structure.

num_procs : int = 1

Number of parallel processes to run in featurizing structures.

num_top : int or None

If an int, returns that number of most similar structures indicated by their indices in the original list. If None, returns all distances.

Returns

A list of SimilarityDoc.

static get_vendi_score(feature_vectors)

Get the Vendi score of a set of feature vectors.

Uses the conventions described in arXiv:2210.02410. Describes the diversity of a set of structures.

Return type:

float

Parameters:

feature_vectors (ndarray)

Parameters

feature_vectors : np.ndarray

The feature vectors, such as those from SimilarityScorer._featurize_structure. Each row should be a distinct feature vector.

Returns

float, the Vendi score.

A Vendi score close to feature_vectors.shape[0] indicates high sample diversity, and a Vendi score close to 1 indicates low sample diversity.
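The two limits above can be checked with a small standalone sketch of the Vendi score: the exponential of the Shannon entropy of the eigenvalues of K / n, where K is an n x n similarity matrix with ones on the diagonal. The cosine-similarity kernel used here is an assumption; the kernel used by get_vendi_score is not stated on this page, and vendi_score_sketch is a hypothetical name.

```python
import numpy as np


def vendi_score_sketch(feature_vectors) -> float:
    """Vendi score with a cosine-similarity kernel.

    Hypothetical helper for illustration only. Assumes every row is a
    nonzero vector; the kernel choice is an assumption, not the emmet one.
    """
    x = np.asarray(feature_vectors, dtype=float)
    n = x.shape[0]

    # Normalize rows so the kernel has ones on its diagonal.
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    kernel = unit @ unit.T

    # Eigenvalues of K / n are non-negative and sum to one.
    eigenvalues = np.linalg.eigvalsh(kernel / n)
    # Drop numerically zero eigenvalues, using the convention 0 * log(0) = 0.
    eigenvalues = eigenvalues[eigenvalues > 1e-12]

    entropy = -np.sum(eigenvalues * np.log(eigenvalues))
    return float(np.exp(entropy))


diverse = vendi_score_sketch(np.eye(4))           # mutually orthogonal rows
redundant = vendi_score_sketch(np.ones((3, 2)))   # identical rows
print(diverse, redundant)
```

In this sketch, four mutually orthogonal vectors give a score near 4 (the number of rows), and three identical vectors give a score near 1.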