SimilarityScorer
- class emmet.core.similarity.SimilarityScorer
Bases: object
Mixin for ranking the similarity between structures.
Parameters
- fingerprinter: BaseFeaturizer or None (default)
A structural featurizer. If None, defaults to the featurizer used in the above reference.
- featurize_structures(structures, num_procs=1)
Featurize structures using the user-defined _featurize_structure.
This method may be preferred when dealing with huge sets of structures. In those cases, getting distances between structures may lead to memory errors.
Parameters
structures : list of Structure objects
num_procs : int = 1
Number of parallel processes to run in featurizing structures.
Returns
np.ndarray : the feature vectors of the input structures.
- Parameters:
structures (list[Structure])
num_procs (int)
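The mixin pattern described above, where a subclass supplies _featurize_structure and the shared featurize_structures method loops over the inputs, can be sketched as follows. This is a minimal illustration, not emmet's code: ToyScorerMixin and CompositionScorer are hypothetical names, a "structure" is simplified to a list of element symbols, and the loop is serial rather than split across num_procs processes.

```python
from abc import ABC, abstractmethod


class ToyScorerMixin(ABC):
    """Hypothetical stand-in for a similarity mixin; not emmet's implementation."""

    @abstractmethod
    def _featurize_structure(self, structure):
        """Map one structure to a fixed-length feature vector."""

    def featurize_structures(self, structures):
        # Serial loop for clarity; the documented method can also
        # distribute this work over several processes.
        return [self._featurize_structure(s) for s in structures]


class CompositionScorer(ToyScorerMixin):
    """Toy featurizer: element fractions in a fixed element order."""

    def __init__(self, elements):
        self.elements = list(elements)

    def _featurize_structure(self, structure):
        # Assumption: a "structure" here is just a list of element symbols.
        total = len(structure)
        return [structure.count(el) / total for el in self.elements]


scorer = CompositionScorer(["Na", "Cl", "O"])
features = scorer.featurize_structures([["Na", "Cl"], ["Na", "Na", "O"]])
```

Storing only the feature vectors, as here, keeps memory use linear in the number of structures, which is why this step can be run on its own for very large sets.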
- get_all_similarity_scores(structures, num_procs=1, **kwargs)
Rank the similarity between structures using CrystalNN.
- Return type:
tuple[ndarray, ndarray]
- Parameters:
structures (list[Structure])
num_procs (int)
Parameters
structures : list of Structure objects
num_procs : int = 1
Number of parallel processes to run in featurizing structures.
- **kwargs
Keyword arguments passed to vector_difference_matrix.
Returns
- tuple of np.ndarray, np.ndarray
The first array contains the structure feature vectors, and the second their similarity scores.
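To make the second return value concrete, the sketch below builds a full pairwise distance matrix from a list of feature vectors. It assumes a plain Euclidean distance, which may not match what vector_difference_matrix actually computes, and pairwise_distances is a hypothetical helper name rather than part of emmet.

```python
import math


def pairwise_distances(feature_vectors):
    """Return an n x n matrix of Euclidean distances (assumed metric)."""
    n = len(feature_vectors)
    distances = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(feature_vectors[i], feature_vectors[j])
            # The matrix is symmetric, so fill both halves at once.
            distances[i][j] = distances[j][i] = d
    return distances


vectors = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
matrix = pairwise_distances(vectors)
```

The matrix holds n * n entries, so its size grows quadratically with the number of structures. That growth is the source of the memory concern mentioned for very large sets.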
- get_most_similar(feature_vectors, num_procs=1, num_top=100, labels=None)
Rank the similarity between structures using CrystalNN.
- Return type:
dict[str, dict[str, list[str] | ndarray]]
- Parameters:
feature_vectors (ndarray)
num_procs (int)
num_top (int)
labels (list[str] | None)
Parameters
feature_vectors : list of feature vectors
num_procs : int = 1
Number of parallel processes to run in featurizing structures.
- num_top : int or None
If an int, returns that number of most similar structures indicated by their indices in the original list. If None, returns all distances.
- labels : list of str or None
If a list of str, the labels corresponding to the feature vectors, e.g., MPIDs. If None, defaults to the list indices.
Returns
- dict[int, dict[str, np.ndarray]]
Keyed by the index of each structure in structures; each value is a dict containing the indices of the top num_top most similar structures and their corresponding distances.
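The ranking step can be sketched in a few lines. This is an illustration under stated assumptions, not the library's implementation: most_similar is a hypothetical function, the metric is assumed to be Euclidean, each structure is excluded from its own ranking, and the "labels" and "distances" keys are invented for the example.

```python
import heapq
import math


def most_similar(feature_vectors, num_top=100, labels=None):
    """Rank, for each vector, the closest other vectors (hypothetical helper)."""
    if labels is None:
        # Fall back to list indices, as the documented default does.
        labels = list(range(len(feature_vectors)))
    ranked = {}
    for i, vec in enumerate(feature_vectors):
        # Distances from entry i to every other entry, one row at a time,
        # so the full n x n matrix is never held in memory.
        row = [
            (math.dist(vec, other), labels[j])
            for j, other in enumerate(feature_vectors)
            if j != i
        ]
        # num_top=None keeps every neighbor, sorted by distance.
        best = sorted(row) if num_top is None else heapq.nsmallest(num_top, row)
        ranked[labels[i]] = {
            "labels": [label for _, label in best],
            "distances": [dist for dist, _ in best],
        }
    return ranked


vectors = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]]
top_one = most_similar(vectors, num_top=1, labels=["a", "b", "c"])
everything = most_similar(vectors, num_top=None)
```

Here top_one["c"] lists "b" as the nearest neighbor at distance 4.0, and everything[0] ranks indices 1 and 2 in that order.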
- build_similarity_collection_from_structures(structures, num_procs=1, num_top=100)
Build a collection of similarity documents.
This defines the build pipeline for the MP similarity collection.
- Return type:
list[SimilarityDoc]
- Parameters:
structures (dict[str, Structure])
num_procs (int)
num_top (int)
Parameters
structures : dict of str (e.g., MPID) to a corresponding structure.
num_procs : int = 1
Number of parallel processes to run in featurizing structures.
- num_top : int or None
If an int, returns that number of most similar structures indicated by their indices in the original list. If None, returns all distances.
Returns
A list of SimilarityDoc.
- static get_vendi_score(feature_vectors)
Get the Vendi score of a set of feature vectors.
Uses the conventions described in arXiv:2210.02410. Describes the diversity of a set of structures.
- Return type:
float
- Parameters:
feature_vectors (ndarray)
Parameters
- feature_vectors : np.ndarray
The feature vectors, such as those from SimilarityScorer._featurize_structure. Each row should be a distinct feature vector.
Returns
- float, the Vendi score.
A Vendi score close to feature_vectors.shape[0] indicates high sample diversity, and a Vendi score close to 1 indicates low sample diversity.
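The two limiting cases above can be checked with a small, dependency-free sketch. It follows the general definition of the Vendi score, the exponential of the Shannon entropy of the eigenvalues of K / n for a similarity matrix K with unit diagonal. The cosine kernel is an assumption made for this example and may differ from the kernel emmet uses; the pure-Python Jacobi eigenvalue routine is there only to avoid external dependencies (numpy.linalg.eigvalsh would be the natural choice in practice). All function names are hypothetical.

```python
import math


def cosine_kernel(feature_vectors):
    """Similarity matrix K with K[i][i] = 1 (assumed kernel choice)."""
    unit = []
    for vec in feature_vectors:
        norm = math.sqrt(sum(x * x for x in vec))
        unit.append([x / norm for x in vec])
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


def symmetric_eigenvalues(matrix, max_sweeps=50, tol=1e-20):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    a = [list(row) for row in matrix]
    n = len(a)
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes the (p, q) entry.
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # update columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # update rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return [a[i][i] for i in range(n)]


def vendi_score(feature_vectors):
    """Exponential of the Shannon entropy of the eigenvalues of K / n."""
    n = len(feature_vectors)
    kernel = cosine_kernel(feature_vectors)
    scaled = [[value / n for value in row] for row in kernel]
    entropy = 0.0
    for lam in symmetric_eigenvalues(scaled):
        if lam > 1e-12:  # treat 0 * log(0) as 0
            entropy -= lam * math.log(lam)
    return math.exp(entropy)


identical = vendi_score([[1.0, 2.0, 3.0]] * 3)
orthogonal = vendi_score([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
mixed = vendi_score([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

Three identical vectors should give a score close to 1, three mutually orthogonal vectors a score close to 3 (the number of rows), and the mixed set a value in between.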