Materials Project Database Validation: mgvv¶
The command-line program, mgvv, is used to validate MongoDB databases. This command has two sub-commands:
mgvv validate - Filter records on a set of criteria (conjunction) and apply a set of constraints.
mgvv diff - Look at the “diff” of any two collections, for identifiers that are missing from one or the other, or for mis-matching values for identifiers that are the same.
mgvv validate subcommand¶
Database connections are configured from a file, exactly as for mgdb/ The patterns to validate against can come from a file or the command-line.
Arguments¶
Constraints can be provided on the command line. The syntax of each constraint is given in Constraint syntax. It is usually wise to quote the constraints so that the operators are not interpreted by the Unix shell. The constraints can be listed as separate quoted arguments or a single quoted argument using commas. For example, the following two command-lines are equivalent:
mgvv [options] 'nelements > 1' 'e_above_hull >= 0'
mgvv [options] 'nelements > 1, e_above_hull >= 0'
The validation constraints could also be configured from a file using the –file option, as described below.
Options¶
- --help, -h¶
show help message and exit
- --alias, -a¶
Value is an alias for a field used in a constraint or condition, in form alias=name_in_db. This option is repeatable to allow multiple aliases.
- --collection, -C <name>¶
Name of collection on which to operate.
- --email, -e <file or spec>¶
Email a report. The sender, server, and recipients can be specified either in a configuration file, or on the command line in the format “{from}:{to}[:server]”. The “from” and “to” are required; default server is localhost. If a file path is given (i.e. the option does not have a ‘:’ in it), then the format is JSON/YAML and the specification should be given under the “_email” key, which should be a mapping with keys “from”, “to”, and “server”. For example the following commandline uses localhost:9101 to send mail:
mgvv [options] --email dkgunter@lbl.gov:somebody@host.org:localhost:9101
See email-config for an example of the configuration file syntax, which unlike the command-line syntax allows for multiple recipients.
- -f <file>, --file <file>¶
Main configuration file. Has constraints, and optionally email configuration (see –email option).
- --limit <num>, -m <num>¶
In output, limit number of displayed validation errors, per collection, to num. The default is 50. To show as many errors as you can find, use 0.
- --format <type>, -F <type>¶
Use the specified report type to format and send the validation output. Recognized types are:
- html
A simple HTML report, with some minimal CSS styling. This is arguably the most visually pleasing format. Default
- json
A JSON document (indented).
- md
Markdown with an embedded fixed-width table. This is the easiest format to read from the console.
- -c <file>, --config <file>¶
Configuration file for database connection. Generate one using mgdb init –config filename.json, if necessary. Otherwise, the code searches for a db.json. If none is found, a no-authentication localhost:27017/vasp database is assumed.
- -v, --verbose¶
Increase log message verbosity. Repeatable. Messages are logged to standard error.
Configuration files¶
You can use up to two configuration files: one for constraints (and aliases), one for the database, and one for the constraints and email.
Database configuration¶
The database connection uses the same format as the mgdb command for its configuration file. The readonly_user is preferred over the administrative user, if both are present.
Email configuration¶
Reports can be sent by email. This can be configured on the command-line, or within the main configuration file.
Here is an example configuration:
_email:
from: you@host.org
to:
- you@host.org
- othersucker@host.otherorg
The section for email must always be named _email. The purpose of the _email key is to make it easy to embed this information into the configuration file used for the constraints (the –file option). The following keywords are recognized:
- from
Sender email, as a string. Required.
- to
Recipients of the email. If a single one, a string; if multiple, a list of strings. Required.
- server
Email server address. Use ‘localhost’ if none is given. Optional.
- port
Email server port. Use default SMTP port if none is given. Optional.
Constraint configuration¶
The constraints are configured from a YAML file.
At the top level are keys, which are the names of the collection on which to apply the constraints. The specification of the constraints in each collection takes two possible forms, simple and complex. In both cases the syntax of the constraints is the same, see Constraint syntax.
Simple: A list of constraints, which are simply combined. Any document in the collection that violates any of the constraints will generate a validation error.
collection_name:
- field1 <= value
- field2 > value
- # ..etc..
Complex: An initial filter, given as a map with an filter key, and a set of constraints under the constraints key. The filter key selects records for applying the constraints. The constraints key provides the list of constraints associated with that condition. As in the simple format, any document in the collection that violates any of the constraints will generate a validation error.
mycollection:
-
filter:
- field1 = 'negatory'
constraints:
- field2 <= value
- field3 > value
- # ..etc..
-
filter:
- field1 = 'excellent'
- field4 > 0
then:
- field5 < value
- # ..etc..
As shown in the second constraint block above, there may also be a list of conditions for the filter. All of these conditions must be true for the record to pass the filter and be evaluated according to the constraints.
Aliases can be defined (these operate across all collections, for better or worse, at the moment). Constraints that use these aliases will automatically be converted to the aliased name before the query is submitted to the database. The aliases are simply a list in the format “name = value” in a section called _aliases, as shown below.
_aliases:
- snl_id = mps_id
- energy = analysis.e_above_hull
Partial arrays can be fetched, which is very useful for not spending a ton of bandwidth, by adding /<path> after the name of the field. For example:
collection_name:
- calculations/density size 2
If, for exampe, the calculations array was full of large sub-arrays this would save a lot of bandwidth by only retrieving that density values for each array item. By default, the arrays are sliced to only retrieve enough elements to test against the condition, but this may not be sufficiently efficient for cases where each sub-element is very large. Note that this only applies to constraints that use the ‘size’ family of array operators.
Constraint syntax¶
The constraint syntax is taken from the smoqe package.
mgvv diff subcommand¶
The diff sub-command takes options and two DB configurations:
mgvv diff [options..] old.json new.json
The command provides a convenient command-line interface to take the difference of two MongoDB database collections. Database connections are configured from JSON files, exactly as for mgdb. See the examples at the end of this section for full usage examples.
Arguments¶
Two positional arguments are required, to set the two collections. These are called the old and new collections, respectively. Both are configured using a pymatgen-db JSON config file.
For an unauthenticated database, we only need 3 keys:
{
"host" : "myhost",
"database": "mydatabase",
"collection": "mycollection"
}
For an authenticated database, user and password (or readonly_user and readonly_password, or admin_user and admin_password) are required:
{
"host" : "myhost",
"database": "mydatabase",
"collection": "mycollection",
"user": "me",
"password": "let-me-in"
}
Options¶
usage: mgvv diff [-h] [–verbose] [-D CONFIG] [-E ADDR] [-f FILE] [-F FORMAT] [-s HOST] [-i INFO] [-k KEY] [-m] [-n EXPR] [-p PROPERTIES] [-P] [-q EXPR] [-u URL] [-V] old new
- --help, -h¶
show help message and exit
- -v, --verbose¶
Increase log message verbosity. Repeatable. Messages are logged to standard error.
- -D CONFIG, --db CONFIG¶
Insert a JSON record of the report in the MongoDB collection pointed to by CONFIG, which is a standard pymatgen-db JSON configuration file. Note that the target database and collection must be writable.
- -E ADDR, --email ADDR¶
Email report to one or more email addresses. ADDR is a list of the form:
‘sender/receiver,[receiver2...][/subject]
’.
- -s HOST, --email-server HOST¶
Server HOST for an email report, in form hostname[:port]. Default is localhost
- -f FILE, --file FILE¶
Read options from FILE instead of command line. File format is YAML (or JSON, a subset), with the long option names as keys. Any time the command-line option takes a comma-separated list, the config file uses a real list; command-line key/value pair lists become config file mappings.
For example:
Command-line Config file
============ ==============
--numeric='x=+-1.5,y=+-0.5' numeric:
x: '+-1.5'
y: '+-0.5'
--info=foo,bar info:
- foo
- bar
- -F FORMAT, --format FORMAT¶
Default report format: ‘text’, ‘html’, or ‘json’. If not given, the format will be determined by the output: text for console, html for email.
- -i INFO, --info INFO¶
Extra fields for records, as comma-separated list, e.g ‘extra,fields,to_include
’.
- -k KEY, --key KEY¶
Key for matching records.
- -m, --missing¶
Only report keys that are in the ‘old’ collection, but not in the ‘new’ collection.
- -n, --numeric¶
Fields with numeric values that must match, with a tolerance, as a comma-separated list, e.g.,
<name1>=<expr1>, <name2>=<expr2>, ..
. <name> is a field name, <expr> syntax is:
Expression |
Meaning |
---|---|
+- |
Change in sign. |
+-= |
Change in sign, including change from positive or negative to zero. |
+-X |
Plus or minus more than X. abs(new - old) > X |
+X-Y |
Plus more than X or minus more than Y. (new - old) > X or (old - new) > Y |
+-X= |
Plus or minus X or more. abs(new - old) >= X |
+X-Y= |
Plus X or more, or minus Y or more. (new - old) >= X or (old - new) >= Y |
+X[=] |
Positive changes only |
-Y[=] |
Negative changes only |
…% |
Percent change. Instead of “(new - old)”, use “100 * (new - old) / old” |
Some examples follow.
Report records where the value
analysis.e_above_hull
changes by more than 20% in either direction:mgvv diff -k task_id --numeric "analysis.e_above_hull=+-20%" prod.json dev.json
Report records where the value
energy
changes sign:mgvv diff -k name --numeric "energy=+-" conf/test1.json conf/test2.json
Report records where either the value
frequency
orduration
change is some quantity or more:mgvv diff -k somekey --numeric "frequency=+1.0-0.5=, duration=+10-15=" foo.json bar.json
This option may be combined with any of the other options.
- -p PROPS, --properties PROPS¶
Fields with properties that must match, as comma-separated list , e.g ‘these_must,match
’.
- -P, --print¶
Print report to the console.
- -q EXPR, --query EXPR¶
Query to filter records before key and value tests. Uses simplified constraint syntax, from the smoqe package. For example:
--query='name = "oscar" and grouchiness > 3'
- -u URL, --url URL¶
In HTML reports, make the key into a hyperlink by prefixing with URL. This can be used to take advantage of clean RESTful URL schemes such as those found in the Materials Project webpages:
mgvv diff -k task_id -u 'https://materialsproject.org/tasks/'
.. option:: -V, --values
Only report changes in values, not missing or added keys.
Examples¶
Let’s say you want to compare the ‘materials’ collection in a development and production database. You could have two JSON configuration files, prod.json and dev.json that specified the servers, user and password, and database and collection names:
# prod database
{
"host": "server1.my.domain",
"database": "core_prod",
"readonly_user": "xxxx",
"readonly_password": "yyyy",
"collection": "materials"
}
# dev database
{
"host": "server2.my.domain",
"database": "core_dev",
"readonly_user": "xxxx",
"readonly_password": "yyyy",
"collection": "materials"
}
You could issue this command-line:
mgvv diff -k task_id -p icsd_id -v -i pretty_formula mdev.json mprod.json
This compares with the key task_id and matches items with the same key on the property icsd_id, adding to the output the value of the field pretty_formula. Because output is to the console, the format will default to text.
To produce and view an HTML output report instead, just use the –format option:
mgvv diff --format html -k task_id -p icsd_id -v -i pretty_formula mdev.json mprod.json > page.html
open page.html # on OSX, view in a browser
To add an email report set the recipient and, optionally, the relay server (default will be localhost):
mgvv diff -e "me@my.mail.domain/you@your.mail.domain/DB diff" \
-k task_id -p icsd_id -v -i pretty_formula mdev.json mprod.json
Note that the third part of the -e/–email command, the subject, is optional – but if you leave it out the email will arrive with no subject line.
As you may have noticed, the command-lines begin to get rather complicated. To replace that last example with a configuration file, use instead this command:
mgvv diff --file diff.yaml mdev.json mprod.json
and then put the options into diff.yaml like this:
email:
- "me@my.mail.domain/you@your.mail.domain/DB diff"
key: task_id
properties:
- icsd_id
info:
- pretty_formula