pymatgen.db package

Submodules

pymatgen.db.config module

Database configuration functions. Main class is DBConfig, which encapsulates a database configuration passed in as a file or object. For example:

cfg1 = DBConfig()  # use defaults
cfg2 = DBConfig("/path/to/myfile.json")  # read from file
f = open("/other/file.json")
cfg3 = DBConfig(f)  # read from file object
# access dict of parsed conf. settings
settings = cfg1.settings
exception ConfigurationFileError(filename, err)[source]

Bases: Exception

Error for Config File

Init for ConfigurationFileError.

class DBConfig(config_file=None, config_dict=None)[source]

Bases: object

Database configuration.

Constructor. Settings are created from config_dict, if given, or from the parsed config_file, if given. Otherwise DEFAULT_FILE is tried, and if that is not present, DEFAULT_SETTINGS are used without modification.

Parameters:
  • config_file (file or str path) – Read configuration from this file.

  • config_dict – Set configuration from this dictionary.

Raises:

ConfigurationFileError – If config_file cannot be read or parsed.
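The precedence described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: resolve_settings is a hypothetical helper, and looking for DEFAULT_FILE in the current working directory is an assumption.

```python
import json
import os

# Values copied from the DBConfig class attributes documented in this section.
DEFAULT_FILE = "db.json"
DEFAULT_SETTINGS = [
    ("host", "localhost"),
    ("port", 27017),
    ("database", "vasp"),
    ("aliases", {}),
]


def resolve_settings(config_file=None, config_dict=None):
    """Hypothetical helper mirroring the documented precedence order."""
    if config_dict is not None:
        # 1. An explicit dictionary takes priority.
        return dict(config_dict)
    if config_file is not None:
        # 2. Otherwise parse the given open file object or path.
        if hasattr(config_file, "read"):
            return json.load(config_file)
        with open(config_file) as handle:
            return json.load(handle)
    if os.path.exists(DEFAULT_FILE):
        # 3. Otherwise try the default file (assumed relative to the cwd).
        with open(DEFAULT_FILE) as handle:
            return json.load(handle)
    # 4. Fall back to the unmodified defaults.
    return dict(DEFAULT_SETTINGS)
```

Calling resolve_settings(config_dict={"host": "db.example"}) returns that dictionary without touching the file system.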

ALL_SETTINGS = ['host', 'port', 'database', 'collection', 'aliases']
DEFAULT_FILE = 'db.json'
DEFAULT_PORT = 27017
DEFAULT_SETTINGS = [('host', 'localhost'), ('port', 27017), ('database', 'vasp'), ('aliases', {})]
property collection

Return collection.

copy()[source]

Return a copy of self (internal settings are copied).

property dbname

Name of the database.

property host

Return host

property password

Return password.

property port

Return port.

property settings

Return settings

property user

Return user.

auth_aliases(d)[source]

Interpret user/password aliases.

get_settings(infile)[source]

Read settings from input file.

Parameters:

infile (file or str path) – Input file for JSON settings.

Returns:

Settings parsed from file.

Return type:

dict

normalize_auth(settings, admin=True, readonly=True, readonly_first=False)[source]

Transform the readonly/admin user and password to simple user/password, as expected by QueryEngine. If the return value is true, then the admin or readonly password will be in keys “user” and “password”.

Parameters:
  • settings (dict) – Connection settings.

  • admin – Check for admin password.

  • readonly – Check for readonly password.

  • readonly_first – Check for readonly password before admin.

Returns:

Whether user/password were found.

Return type:

bool
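A rough sketch of this transformation follows. The function name normalize_auth_sketch is hypothetical, and the key names admin_user, admin_password, readonly_user and readonly_password are assumptions about the settings layout; the real function may differ in detail.

```python
def normalize_auth_sketch(settings, admin=True, readonly=True, readonly_first=False):
    """Copy admin/readonly credentials into plain 'user'/'password' keys.

    The '<prefix>_user' and '<prefix>_password' key names are assumptions.
    """
    prefixes = []
    if admin:
        prefixes.append("admin")
    if readonly:
        prefixes.append("readonly")
    if readonly_first:
        # Try the readonly credentials before the admin ones.
        prefixes.reverse()
    for prefix in prefixes:
        user_key = f"{prefix}_user"
        password_key = f"{prefix}_password"
        if user_key in settings and password_key in settings:
            # Modify the settings dict in place, as the description implies.
            settings["user"] = settings[user_key]
            settings["password"] = settings[password_key]
            return True
    return False
```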

pymatgen.db.creator module

This module defines a Drone to assimilate vasp data and insert it into a Mongo database.

class VaspToDbTaskDrone(host='127.0.0.1', port=27017, database='vasp', user=None, password=None, collection='tasks', parse_dos=False, compress_dos=False, parse_projected_eigen=False, simulate_mode=False, additional_fields=None, update_duplicates=True, mapi_key=None, use_full_uri=True, runs=None)[source]

Bases: AbstractDrone

VaspToDbTaskDrone assimilates directories containing vasp input into inserted db tasks. This drone is meant to be used with pymatgen’s BorgQueen to assimilate entire directory structures and insert them into a database using Python’s multiprocessing. The current format assumes standard VASP relaxation runs. If you have other kinds of runs, you may design your own Drone class based on this one.

There are some restrictions on the valid directory structures:

  1. There can be only one vasp run in each directory. Nested directories are fine.

  2. Directories designated “relax1”, “relax2” are considered to be 2 parts of an aflow style run.

  3. Directories containing vasp output with “.relax1” and “.relax2” are also considered as 2 parts of an aflow style run.

Constructor.

Parameters:
  • host – Hostname of database machine. Defaults to 127.0.0.1 or localhost.

  • port – Port for db access. Defaults to mongo’s default of 27017.

  • database – Actual database to access. Defaults to “vasp”.

  • user – User for db access. Requires write access. Defaults to None, which means no authentication.

  • password – Password for db access. Requires write access. Defaults to None, which means no authentication.

  • collection – Collection to query. Defaults to “tasks”.

  • parse_dos – Whether to parse the DOS data. Options are True, False, and ‘final’. Defaults to False. If True, all dos will be inserted into a gridfs collection called dos_fs. If ‘final’, only the last calculation will be parsed.

  • parse_projected_eigen – Whether to parse the element and orbital projections. Options are True, False, and ‘final’; Defaults to False. If True, projections will be parsed for each calculation. If ‘final’, projections for only the last calculation will be parsed.

  • compress_dos – Whether to compress the DOS data. Valid options are integers 1-9, corresponding to zlib compression level. 1 is usually adequate.

  • simulate_mode – Allows one to simulate db insertion without actually performing the insertion.

  • additional_fields – Dict specifying additional fields to append to each doc inserted into the collection. For example, this allows one to add an author or tags to a whole set of runs.

  • update_duplicates – If True, if a duplicate path exists in the collection, the entire doc is updated. Else, duplicates are skipped.

  • mapi_key

    A Materials API key. If this key is supplied, the insertion code will attempt to use the Materials REST API to calculate stability data for inserted calculations. Stability assessment requires a large quantity of materials data. E.g., to compute the stability of a new LixFeyOz calculation, you need the energies of all known phases in the Li-Fe-O chemical system. Using the Materials API, we can obtain the pre-calculated data from the Materials Project.

    Go to www.materialsproject.org/profile to generate or obtain your API key.

  • use_full_uri – Whether to use full uri path (including hostname) for the directory name. Defaults to True. If False, only the abs path will be used.

  • runs – Ordered list of runs to look for e.g. [“relax1”, “relax2”]. Automatically detects whether the runs are stored in the subfolder or file extension schema.
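To illustrate what the compress_dos levels mean, here is a small zlib round trip. It is only a sketch: serializing the DOS data as JSON before compression is an assumption, and both helper names are hypothetical.

```python
import json
import zlib


def compress_dos_sketch(dos_dict, level=1):
    """Serialize a dict to JSON and compress it at the given zlib level (1-9)."""
    payload = json.dumps(dos_dict).encode("utf-8")
    return zlib.compress(payload, level)


def decompress_dos_sketch(blob):
    """Reverse compress_dos_sketch."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Toy data standing in for a real density of states.
toy_dos = {"energies": [0.0, 0.1, 0.2], "densities": [1.0, 2.0, 1.5]}
blob = compress_dos_sketch(toy_dos, level=1)
restored = decompress_dos_sketch(blob)
```

Higher levels trade CPU time for smaller output, which is why a low level such as 1 is often enough.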

as_dict()[source]

Dict representation.

assimilate(path)[source]

Parses vasp runs, inserts the result into the db, and returns the task_id or doc of the insertion.

Returns:

If in simulate_mode, the entire doc is returned for debugging purposes. Else, only the task_id of the inserted doc is returned.

calculate_stability(d)[source]

Calculate the stability (e_above_hull and decomposes_to) for an entry dict.

convert(d)[source]

Just return the dict.

classmethod from_dict(d)[source]

From dict

generate_doc(dir_name, vasprun_files)[source]

Process aflow style runs, where each run is actually a combination of two vasp runs.

get_task_doc(path)[source]

Get the entire task doc for a path, including any post-processing.

get_valid_paths(path)[source]

There are some restrictions on the valid directory structures:

  1. There can be only one vasp run in each directory. Nested directories are fine.

  2. Directories designated “relax1”, “relax2” are considered to be 2 parts of an aflow style run.

  3. Directories containing vasp output with “.relax1” and “.relax2” are also considered as 2 parts of an aflow style run.

post_process(dir_name, d)[source]

Simple post-processing for various files other than the vasprun.xml. Called by generate_task_doc. Modify this if your runs have other kinds of processing requirements.

Parameters:
  • dir_name – The dir_name.

  • d – Current doc generated.

process_killed_run(dir_name)[source]

Process a killed vasp run.

process_vasprun(dir_name, taskname, filename)[source]

Process a vasprun.xml file.

contains_vasp_input(dir_name)[source]

Checks if a directory contains valid VASP input.

Parameters:

dir_name – Directory name to check.

Returns:

True if directory contains all four VASP input files (INCAR, POSCAR, KPOINTS and POTCAR).
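The check can be approximated as follows. This sketch only tests for the four exact file names listed above; whether the real function also accepts variants of these names is not stated here, and has_vasp_input is a hypothetical name.

```python
import os
import tempfile

REQUIRED_INPUTS = ("INCAR", "POSCAR", "KPOINTS", "POTCAR")


def has_vasp_input(dir_name):
    """Return True if all four VASP input files exist in dir_name."""
    return all(
        os.path.exists(os.path.join(dir_name, name)) for name in REQUIRED_INPUTS
    )


# Demonstrate on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    before = has_vasp_input(tmp)
    for name in REQUIRED_INPUTS:
        open(os.path.join(tmp, name), "w").close()
    after = has_vasp_input(tmp)
```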

get_basic_analysis_and_error_checks(d, max_force_threshold=0.5, volume_change_threshold=0.2)[source]

Generate basic analysis and error checks data for a run.

get_coordination_numbers(d)[source]

Helper method to get the coordination number of all sites in the final structure from a run.

Parameters:

d – Run dict generated by VaspToDbTaskDrone.

Returns:

Coordination numbers as a list of dict of [{“site”: site_dict, “coordination”: number}, …].

get_uri(dir_name)[source]

Returns the URI path for a directory. This allows files hosted on different file servers to have distinct locations.

Parameters:

dir_name – A directory name.

Returns:

Full URI path, e.g., fileserver.host.com:/full/path/of/dir_name.
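A URI of this hostname:/full/path form can be built with the standard library alone. The sketch below assumes the hostname comes from socket.gethostname(); the real function may resolve the host differently, and uri_for_directory is a hypothetical name.

```python
import os
import socket


def uri_for_directory(dir_name):
    """Build 'hostname:/absolute/path' for a directory."""
    full_path = os.path.abspath(dir_name)
    hostname = socket.gethostname()  # assumption: local host name is sufficient
    return f"{hostname}:{full_path}"


uri = uri_for_directory(".")
```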

pymatgen.db.matproj module

A module for replicating the MP database creator.

See https://medium.com/@shyuep/a-local-materials-project-database-1ea909430c95

class MPDB(*args, **kwargs)[source]

Bases: object

This module allows you to create a local MP database based on ComputedStructureEntries.

Parameters:
  • args – Pass through to MongoClient. E.g., you can create a connection using uri strings, etc.

  • kwargs – Pass through to MongoClient. E.g., you can create a connection using uri strings, etc.

create(criteria=None, property_data: list | None = None)[source]

Creates the database. Typically only used once.

Parameters:
  • criteria – Criteria passed to MPRester.get_entries to obtain the entries. None means you get the entire MP database.

  • property_data – List of additional property data to obtain. These are stored in the data.* keys.

get_entries_in_chemsys(elements, additional_criteria=None, **kwargs)[source]

Helper method to get a list of ComputedEntries in a chemical system. For example, elements = [“Li”, “Fe”, “O”] will return a list of all entries in the Li-Fe-O chemical system, i.e., all LixOy, FexOy, LixFey, LixFeyOz, Li, Fe and O phases. Extremely useful for creating phase diagrams of entire chemical systems.

Parameters:
  • elements (str or [str]) – Chemical system string comprising element symbols separated by dashes, e.g., “Li-Fe-O” or List of element symbols, e.g., [“Li”, “Fe”, “O”].

  • additional_criteria (dict) – Any additional criteria to pass. For instance, if you are only interested in stable entries, you can pass {“e_above_hull”: {“$lte”: 0.001}}.

Returns:

List of ComputedStructureEntries.
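The set of sub-systems covered by a chemical system can be enumerated with itertools. This is an illustrative sketch only: sub_chemical_systems is a hypothetical helper, and joining alphabetically sorted symbols with dashes is an assumed naming convention.

```python
from itertools import combinations


def sub_chemical_systems(elements):
    """List every non-empty sub-system of a chemical system.

    Accepts either "Li-Fe-O" or ["Li", "Fe", "O"].
    """
    if isinstance(elements, str):
        elements = elements.split("-")
    systems = []
    for size in range(1, len(elements) + 1):
        for combo in combinations(elements, size):
            # Sort symbols so each system has one canonical label (assumption).
            systems.append("-".join(sorted(combo)))
    return systems


systems = sub_chemical_systems("Li-Fe-O")
```

For three elements this yields seven systems: the three elements, the three binaries, and the ternary.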

pymatgen.db.query_engine module

This module provides a QueryEngine that simplifies queries for Mongo databases generated using hive.

class QueryEngine(host='127.0.0.1', port=27017, database='vasp', user=None, password=None, collection='tasks', aliases_config=None, default_properties=None, query_post=None, result_post=None, connection=None, replicaset=None, **ignore)[source]

Bases: object

This class defines a QueryEngine interface to a Mongo Collection based on a set of aliases. This query engine also provides convenient translation between various pymatgen objects and database objects.

The major difference between the QueryEngine’s query() method and pymongo’s find() method is the treatment of nested fields. QueryEngine’s query will map the final result to a root level string, while pymongo will return the doc as is. For example, let’s say you have a document that is of the following form:

{"a": {"b" : 1}}

Using pymongo.find({}, fields=[“a.b”]), you will get a doc where you need to do doc[“a”][“b”] to access the final result (1). Using QueryEngine.query(properties=[“a.b”]), you will obtain a result that can be accessed simply as doc[“a.b”].
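The mapping from nested fields to root-level keys can be pictured with a few lines of plain Python. Both helpers below are hypothetical and work on ordinary dicts; they are not the QueryEngine implementation.

```python
def get_nested(doc, dotted_key):
    """Walk a nested dict following a dotted key such as 'a.b'."""
    value = doc
    for part in dotted_key.split("."):
        value = value[part]
    return value


def flatten_properties(doc, properties):
    """Return a flat dict keyed by the dotted property names."""
    return {prop: get_nested(doc, prop) for prop in properties}


doc = {"a": {"b": 1}}
flat = flatten_properties(doc, ["a.b"])
```

Here flat["a.b"] holds the same value that the nested form exposes as doc["a"]["b"].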

Constructor.

Parameters:
  • host (str) – Hostname of database machine.

  • port (int) – Port for db access.

  • database (str) – Name of database to access.

  • user (str) – User for db access. None means no authentication.

  • password (str) – Password for db access. None means no auth.

  • collection (str) – Collection to query. Defaults to “tasks”.

  • connection (pymongo.Connection) – If given, ignore ‘host’ and ‘port’ and use existing connection.

  • aliases_config (dict) –

    An alias dict to use. Defaults to None, which means the default aliases defined in “aliases.json” are used. The aliases config should be of the following format:

    {
        "aliases": {
            "e_above_hull": "analysis.e_above_hull",
            "energy": "output.final_energy",
            ....
        },
        "defaults": {
            "state": "successful"
        }
    }
    
    aliases (dict): Keys are the incoming property, values are the property it will be translated to. This makes it easier to organize the doc format in a way that is different from the query format.

    defaults (dict): Criteria that should be applied by default to all queries. For example, a collection may contain data from both successful and unsuccessful runs but for most querying purposes, you may want just successful runs only. Note that defaults do not affect explicitly specified criteria, i.e., if you supply a query for {“state”: “killed”}, this will override the default for {“state”: “successful”}.

  • default_properties (list) – Property names (strings) to use by default, if no properties are given to query().

  • query_post (list) – Functions to post-process the criteria passed to query(), after aliases are resolved. Function takes two args, the criteria dict and list of result properties. Both may be modified in-place.

  • result_post (list) – Functions to post-process the cursor records. Function takes one arg, the document for the current record, that is modified in-place.
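The interplay of aliases and defaults described for aliases_config can be sketched as below. The translate_criteria helper is hypothetical and only handles top-level keys; it illustrates the documented behavior rather than reproducing the library code.

```python
ALIASES_CONFIG = {
    "aliases": {
        "e_above_hull": "analysis.e_above_hull",
        "energy": "output.final_energy",
    },
    "defaults": {"state": "successful"},
}


def translate_criteria(criteria, config):
    """Rename aliased keys, then add defaults that were not given explicitly."""
    aliases = config.get("aliases", {})
    translated = {aliases.get(key, key): value for key, value in criteria.items()}
    for key, value in config.get("defaults", {}).items():
        # Explicit criteria override the defaults.
        translated.setdefault(key, value)
    return translated


with_default = translate_criteria({"energy": {"$lt": 0}}, ALIASES_CONFIG)
overridden = translate_criteria({"state": "killed"}, ALIASES_CONFIG)
```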

ALIASES_CONFIG_KEY = 'aliases_config'
COLLECTION_KEY = 'collection'
DB_KEY = 'database'
HOST_KEY = 'host'
PASSWORD_KEY = 'password'
PORT_KEY = 'port'
USER_KEY = 'user'
aliases = None

See aliases arg to constructor

close()[source]

Disconnects the connection.

property collection_name

Returns collection name.

default_criteria = None

See default_criteria arg to constructor

default_properties = None

See default_properties arg to constructor

ensure_index(key, unique=False)[source]

Wrapper for pymongo.Collection.ensure_index

static from_config(config_file, use_admin=False)[source]

Initialize a QueryEngine from a JSON config file generated using mgdb init.

Parameters:
  • config_file – Filename of config file.

  • use_admin – If True, the admin user and password in the config file is used. Otherwise, the readonly_user and password is used. Defaults to False.

Returns:

QueryEngine

get_dos_from_id(task_id)[source]

Overrides the get_dos_from_id for the MIT gridfs format.

get_entries(criteria, inc_structure=False, optional_data=None)[source]

Get ComputedEntries satisfying a particular criteria.

Note

The get_entries_in_system and get_entries methods should be used with care. In essence, all entries, GGA, GGA+U or otherwise, are returned. The dataset is very heterogeneous and not directly comparable. It is highly recommended that you perform post-processing using pymatgen.entries.compatibility.

Parameters:
  • criteria – Criteria obeying the same syntax as query.

  • inc_structure – Optional parameter as to whether to include a structure with the ComputedEntry. Defaults to False. Use with care - including structures with a large number of entries can potentially slow down your code to a crawl.

  • optional_data – Optional data to include with the entry. This allows the data to be accessed via entry.data[key].

Returns:

List of pymatgen.entries.ComputedEntries satisfying criteria.

get_entries_in_system(elements, inc_structure=False, optional_data=None, additional_criteria=None)[source]

Gets all entries in a chemical system, e.g. Li-Fe-O will return all Li-O, Fe-O, Li-Fe, Li-Fe-O compounds.

Note

The get_entries_in_system and get_entries methods should be used with care. In essence, all entries, GGA, GGA+U or otherwise, are returned. The dataset is very heterogeneous and not directly comparable. It is highly recommended that you perform post-processing using pymatgen.entries.compatibility.

Parameters:
  • elements – Sequence of element symbols, e.g. [‘Li’,’Fe’,’O’]

  • inc_structure – Optional parameter as to whether to include a structure with the ComputedEntry. Defaults to False. Use with care - including structures with a large number of entries can potentially slow down your code to a crawl.

  • optional_data – Optional data to include with the entry. This allows the data to be accessed via entry.data[key].

  • additional_criteria – Added ability to provide additional criteria other than just the chemical system.

Returns:

List of ComputedEntries in the chemical system.

get_structure_from_id(task_id, final_structure=True)[source]

Returns a structure from the database given the task id.

Parameters:
  • task_id – The task_id to query for.

  • final_structure – Whether to obtain the final or initial structure. Defaults to True.

query(properties=None, criteria=None, distinct_key=None, **kwargs)[source]

Convenience method for database access. All properties and criteria can be specified using simplified names defined in Aliases. You can use the supported_properties property to get the list of supported properties.

Results are returned as an iterator of dicts to ensure memory and cpu efficiency.

Note that the dict returned have keys also in the simplified names form, not in the mongo format. For example, if you query for “analysis.e_above_hull”, the returned result must be accessed as r[‘analysis.e_above_hull’] instead of mongo’s r[‘analysis’][‘e_above_hull’]. This is a feature of the query engine to allow simple access to deeply nested docs without having to resort to some recursion to go deep into the result.

However, if you query for ‘analysis’, the entire ‘analysis’ key is returned as r[‘analysis’] and then the subkeys can be accessed in the usual form, i.e., r[‘analysis’][‘e_above_hull’]

Parameters:
  • properties – Properties to query for. Defaults to None which means all supported properties.

  • criteria – Criteria to query for as a dict.

  • distinct_key – If not None, the key for which to get distinct results

  • **kwargs – Other kwargs supported by pymongo.collection.find. Useful examples are limit, skip, sort, etc.

Returns:

A QueryResults Iterable, which is somewhat like pymongo’s cursor except that it performs mapping. In general, the developer does not need to be concerned with the form. It is sufficient to know that the results are in the form of an iterable of dicts.

query_one(*args, **kwargs)[source]

Return first document from query(), with same parameters.

query_post = None

See query_post arg to constructor

result_post = None

See result_post arg to constructor

set_aliases_and_defaults(aliases_config=None, default_properties=None)[source]

Set the alias config and defaults to use. Typically used when switching to a collection with a different schema.

Parameters:
  • aliases_config – An alias dict to use. Defaults to None, which means the default aliases defined in “aliases.json” is used. See constructor for format.

  • default_properties – List of property names (strings) to use by default, if no properties are given to the ‘properties’ argument of query().

exception QueryError[source]

Bases: Exception

Exception class for errors occurring during queries.

class QueryListResults(prop_dict, result_cursor, postprocess=None)[source]

Bases: QueryResults

Set of QueryResults on a list instead of a MongoDB cursor.

Constructor.

Parameters:
  • prop_dict – Properties

  • result_cursor – Iterable returning records

  • postprocess – List of functions, each taking a record and modifying it in-place, or None, or an empty list

clone()[source]

Return a clone of the QueryListResults.

class QueryResults(prop_dict, result_cursor, postprocess=None)[source]

Bases: Iterable

Iterable wrapper for results from QueryEngine. Like pymongo’s cursor, this object should generally not be instantiated, but should be obtained from a queryengine. It delegates many attributes to the underlying pymongo cursor, and should support nearly all cursor like attributes such as count(), explain(), hint(), etc. Please see pymongo cursor documentation for details.

Constructor.

Parameters:
  • prop_dict – Properties

  • result_cursor – Iterable returning records

  • postprocess – List of functions, each taking a record and modifying it in-place, or None, or an empty list

clone()[source]

Provide a clone of the QueryResults.

from_cursor(cursor)[source]

Create a QueryResults object from a cursor object

pymatgen.db.util module

Utility functions used across scripts.

class MongoJSONEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]

Bases: JSONEncoder

JSON encoder to support ObjectIDs and datetime used in Mongo.

Constructor for JSONEncoder, with sensible defaults.

If skipkeys is false, then it is a TypeError to attempt encoding of keys that are not str, int, float or None. If skipkeys is True, such items are simply skipped.

If ensure_ascii is true, the output is guaranteed to be str objects with all incoming non-ASCII characters escaped. If ensure_ascii is false, the output can contain non-ASCII characters.

If check_circular is true, then lists, dicts, and custom encoded objects will be checked for circular references during encoding to prevent an infinite recursion (which would cause a RecursionError). Otherwise, no such check takes place.

If allow_nan is true, then NaN, Infinity, and -Infinity will be encoded as such. This behavior is not JSON specification compliant, but is consistent with most JavaScript based encoders and decoders. Otherwise, it will be a ValueError to encode such floats.

If sort_keys is true, then the output of dictionaries will be sorted by key; this is useful for regression tests to ensure that JSON serializations can be compared on a day-to-day basis.

If indent is a non-negative integer, then JSON array elements and object members will be pretty-printed with that indent level. An indent level of 0 will only insert newlines. None is the most compact representation.

If specified, separators should be an (item_separator, key_separator) tuple. The default is (', ', ': ') if indent is None and (',', ': ') otherwise. To get the most compact JSON representation, you should specify (',', ':') to eliminate whitespace.

If specified, default is a function that gets called for objects that can’t otherwise be serialized. It should return a JSON encodable version of the object or raise a TypeError.

default(o)[source]

Override default to support ObjectID and datetime.
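Overriding default() is the standard way to teach json about extra types. The sketch below handles datetime only and uses a hypothetical class name; ObjectId support would need the bson package, and the ISO string format is an assumption rather than the format MongoJSONEncoder is documented to emit.

```python
import datetime
import json


class DatetimeAwareEncoder(json.JSONEncoder):
    """Hypothetical encoder showing the default() override pattern."""

    def default(self, o):
        if isinstance(o, datetime.datetime):
            # Assumption: represent datetimes as ISO 8601 strings.
            return o.isoformat()
        # Defer to the base class, which raises TypeError for unknown types.
        return super().default(o)


encoded = json.dumps(
    {"when": datetime.datetime(2020, 1, 2, 3, 4, 5)}, cls=DatetimeAwareEncoder
)
```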

collection_keys(coll, sep='.')[source]

Get a list of all (including nested) keys in a collection. Examines the first document in the collection.

Parameters:

sep – Separator for nested keys.

Returns:

List of str.
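Listing nested keys with a separator can be sketched on a plain dict standing in for the first document of a collection. The nested_keys helper is hypothetical, and reporting parent keys alongside their children is an assumption about the output.

```python
def nested_keys(doc, sep=".", prefix=""):
    """Return all keys of a nested dict, joining levels with sep."""
    keys = []
    for key, value in doc.items():
        full_key = f"{prefix}{sep}{key}" if prefix else key
        keys.append(full_key)
        if isinstance(value, dict):
            # Recurse into sub-documents, carrying the path so far.
            keys.extend(nested_keys(value, sep, full_key))
    return keys


first_doc = {"a": {"b": 1}, "c": 2}
keys = nested_keys(first_doc)
```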

get_collection(config_file, admin=False, settings=None)[source]

Get a collection from a config file.

Parameters:
  • config_file – Path to filename.

  • admin – Whether to use admin credentials. Defaults to False.

  • settings – Whether to override settings or obtain from config file (None).

get_database(config_file=None, settings=None, admin=False, **kwargs)[source]

Get a database object from a config file.

get_settings(config_file)[source]

Get settings from file.

Module contents

Pymatgen-db is a database add-on for the Python Materials Genomics (pymatgen) materials analysis library. It enables the creation of Materials Project-style MongoDB databases for management of materials data and provides a clean and intuitive web UI for exploring that data. A query engine is also provided to enable the easy translation of MongoDB docs to useful pymatgen objects for analysis purposes.