Configuring the checks

An introduction to customising the checks run against your datapackage.json with the Config class, for example by excluding certain checks or adding your own.

You can pass a Config object to check() to customise the checks done on your Data Package’s properties. This page covers three configuration options: excluding checks, adding extensions, and strict mode.

Important

The Data Package standard uses language from RFC 2119 to define its specifications. It uses “MUST” for required properties and “SHOULD” for properties that should be included but are not strictly required. We try to match this language in check-datapackage by using the terms “MUST” and “SHOULD”, though we also use “required” for “MUST” in our documentation.

Excluding checks

You can exclude checks based on their type and the fields they apply to.

The Data Package standard defines a range of check types (e.g., required or pattern) and it is also possible to create your own. For example, to exclude checks flagging missing fields, you would exclude the required check by defining an Exclusion object with this type:

from textwrap import dedent
import check_datapackage as cdp

exclusion_required = cdp.Exclusion(type="required")
exclusion_required
Exclusion(jsonpath=None, type='required')

To exclude checks on a specific field or fields, you can use a JSON path in the jsonpath attribute of an Exclusion object. For example, you can exclude all checks on the name field of the Data Package properties by writing:

exclusion_name = cdp.Exclusion(jsonpath="$.name")
exclusion_name
Exclusion(jsonpath='$.name', type=None)

Or you can use the wildcard JSON path selector to exclude checks on the path field of all Data Resource properties:

exclusion_path = cdp.Exclusion(jsonpath="$.resources[*].path")
exclusion_path
Exclusion(jsonpath='$.resources[*].path', type=None)

The type and jsonpath arguments can also be combined, so we can ignore an Issue of a specific type on a specific field. For example, to exclude checks of whether the created field is in a specific format (type="format"), we can use:

exclusion_created_format = cdp.Exclusion(type="format", jsonpath="$.created")
exclusion_created_format
Exclusion(jsonpath='$.created', type='format')
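
To make the combination rule concrete, here is a minimal sketch of how an exclusion might be matched against an issue. It is not the library's implementation: SketchIssue, SketchExclusion, path_matches and is_excluded are hypothetical names, and the wildcard handling only covers the [*] selector used above. The assumption is that an attribute left as None matches anything, while every attribute that is set must match.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchIssue:
    """Hypothetical stand-in for an Issue (only the fields used here)."""

    jsonpath: str
    type: str


@dataclass
class SketchExclusion:
    """Hypothetical stand-in for an Exclusion."""

    jsonpath: Optional[str] = None
    type: Optional[str] = None


def path_matches(pattern: str, concrete: str) -> bool:
    """Match a JSON path that may contain [*] against a concrete path."""
    # Escape the whole pattern, then let each escaped "[*]" match any index.
    regex = re.escape(pattern).replace(re.escape("[*]"), r"\[\d+\]")
    return re.fullmatch(regex, concrete) is not None


def is_excluded(issue: SketchIssue, exclusion: SketchExclusion) -> bool:
    """Assumed rule: unset attributes match anything, set ones must all match."""
    type_ok = exclusion.type is None or exclusion.type == issue.type
    path_ok = exclusion.jsonpath is None or path_matches(
        exclusion.jsonpath, issue.jsonpath
    )
    return type_ok and path_ok
```

Under these assumptions, an exclusion with type="format" and jsonpath="$.created" would skip a format issue on created, but not a type issue on created or a format issue on another field.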

To apply your exclusions when running check(), add them to the Config object passed to the function. First, let’s build an example that has three Issue items: the package name is a number, the created field is not a date, and the resource path is not a valid path. To make these issues appear, we’ll modify the example package_properties returned by example_package_properties():

package_properties = cdp.example_package_properties()
package_properties["name"] = 123
package_properties["created"] = "not-a-date"
package_properties["resources"][0]["path"] = "\\not/a/path"
package_properties
{
    'name': 123,
    'title': 'Hibernation Physiology of the Woolly Dormouse: A Scoping Review.',
    'description': '\nThis scoping review explores the hibernation physiology of the\nwoolly dormouse, drawing on data collected over a 10-year period\nalong the Taurus Mountain range in Turkey.\n',
    'id': '123-abc-123',
    'created': 'not-a-date',
    'version': '1.0.0',
    'licenses': [{'name': 'odc-pddl'}],
    'resources': [
        {
            'name': 'woolly-dormice-2015',
            'title': 'Body fat percentage in the hibernating woolly dormouse',
            'path': '\\not/a/path',
            'schema': {
                'fields': [
                    {
                        'name': 'eye-colour',
                        'type': 'string',
                        'title': 'Woolly dormouse eye colour'
                    }
                ]
            }
        }
    ]
}

When we run check() on these properties, we get the three expected issues:

cdp.check(properties=package_properties)
[
    Issue(
        jsonpath='$.created',
        type='format',
        message="'not-a-date' is not a 'date-time'",
        instance='not-a-date'
    ),
    Issue(
        jsonpath='$.name',
        type='type',
        message="123 is not of type 'string'",
        instance=123
    ),
    Issue(
        jsonpath='$.resources[0].path',
        type='pattern',
        message="'\\\\not/a/path' does not match '^((?=[^./~])(?!file:)((?!\\\\/\\\\.\\\\.\\\\/)(?!\\\\\\\\)(?!:\\\\/\\\\/).)*|(http|ftp)s?:\\\\/\\\\/.*)$'",
        instance='\\not/a/path'
    )
]

Now let’s exclude these issues by adding our exclusions to a Config object and passing it to check(), so that no issues are reported:

config = cdp.Config(exclusions=[exclusion_name, exclusion_path, exclusion_created_format])
cdp.check(properties=package_properties, config=config)
[]

Adding extensions

It is possible to add checks in addition to the ones defined in the Data Package standard. We call these additional checks extensions. There are currently two types of extensions supported: CustomCheck and RequiredCheck. You can add as many CustomChecks and RequiredChecks to your Config as you want to fit your needs.

Custom checks

Let’s say your organisation only accepts Data Packages licensed under MIT. You can express this requirement as a CustomCheck:

license_check = cdp.CustomCheck(
    type="only-mit",
    jsonpath="$.licenses[*].name",
    message=dedent("""
        Data Packages may only be licensed under MIT. Please review
        the licenses listed in the Data Package.
        """),
    check=lambda license_name: license_name == "mit",
)

For more details on what each parameter means, see the CustomCheck documentation. In this example, type sets the identifier of the check to only-mit, and jsonpath restricts the check to the name property of each license in the licenses property of the Data Package.
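
The lambda in the example could also be written as a named function, which leaves room for a docstring and more tolerant matching. The sketch below is only an illustration: is_mit_license is a hypothetical name, and ignoring case and surrounding whitespace is an assumption that goes beyond the exact comparison used above. It also assumes, as the lambda suggests, that the function receives the value found at the JSON path and returns True when that value passes.

```python
def is_mit_license(license_name: object) -> bool:
    """Return True when the license name is MIT.

    Assumption for this sketch: case and surrounding whitespace are ignored.
    """
    # A non-string value (for example a number) cannot be a valid license name.
    if not isinstance(license_name, str):
        return False
    return license_name.strip().lower() == "mit"
```

A function like this could be passed as check=is_mit_license in place of the lambda.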

To register your custom checks with the check() function, you add them to the Config object passed to the function:

config = cdp.Config(extensions=cdp.Extensions(custom_checks=[license_check]))
cdp.check(properties=package_properties, config=config)
[
    Issue(
        jsonpath='$.created',
        type='format',
        message="'not-a-date' is not a 'date-time'",
        instance='not-a-date'
    ),
    Issue(
        jsonpath='$.licenses[0].name',
        type='only-mit',
        message='\nData Packages may only be licensed under MIT. Please review\nthe licenses listed in the Data Package.\n',
        instance=None
    ),
    Issue(
        jsonpath='$.name',
        type='type',
        message="123 is not of type 'string'",
        instance=123
    ),
    Issue(
        jsonpath='$.resources[0].path',
        type='pattern',
        message="'\\\\not/a/path' does not match '^((?=[^./~])(?!file:)((?!\\\\/\\\\.\\\\.\\\\/)(?!\\\\\\\\)(?!:\\\\/\\\\/).)*|(http|ftp)s?:\\\\/\\\\/.*)$'",
        instance='\\not/a/path'
    )
]

We can see that the custom check was applied: alongside the three earlier issues, check() returned an only-mit issue flagging the first license attached to the Data Package.

Required checks

With a RequiredCheck, you can also make specific properties in the datapackage.json file required, even when the Data Package standard does not require them. For example, to make the description field of the Data Package required, you can define a RequiredCheck like this:

description_required = cdp.RequiredCheck(
    jsonpath="$.description",
    message="The 'description' field is required in the Data Package properties.",
)

See the RequiredCheck documentation for more details on its parameters.

To apply this RequiredCheck, add it to the Config object passed to check(). We’ll create a package_properties without a description field to see the effect of this check:

package_properties = cdp.example_package_properties()
del package_properties["description"]
config = cdp.Config(extensions=cdp.Extensions(required_checks=[description_required]))
cdp.check(properties=package_properties, config=config)
[
    Issue(
        jsonpath='$.description',
        type='required',
        message="The 'description' field is required in the Data Package properties.",
        instance=None
    )
]

Strict mode

The Data Package standard includes properties that “MUST” and “SHOULD” be included and/or have a specific format in a compliant Data Package. By default, check() only includes “MUST” checks. To include “SHOULD” checks, set the strict argument to True in the Config object.

For example, the name field of a Data Package “SHOULD” not contain special characters. So running check() in strict mode (strict=True) on the following properties would output an Issue:

package_properties = cdp.example_package_properties()
package_properties["name"] = "data-package!@#"
cdp.check(properties=package_properties, config=cdp.Config(strict=True))
[
    Issue(
        jsonpath='$.name',
        type='pattern',
        message="'data-package!@#' does not match '^[a-z0-9._-]+$'",
        instance='data-package!@#'
    )
]
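
To see which names would pass this particular “SHOULD” check, the pattern from the Issue message can be tried on its own with Python's re module. This is a small sketch that only reproduces the pattern test, not the rest of strict mode; matches_name_pattern is a hypothetical helper and the candidate names are made up for illustration.

```python
import re

# Pattern copied from the Issue message above.
NAME_PATTERN = re.compile(r"^[a-z0-9._-]+$")


def matches_name_pattern(name: str) -> bool:
    """Return True when the whole name consists of a-z, 0-9, '.', '_' or '-'."""
    return NAME_PATTERN.fullmatch(name) is not None


# Made-up candidate names for illustration.
for candidate in ["woolly-dormice-2015", "data-package!@#", "Woolly Dormice"]:
    print(candidate, matches_name_pattern(candidate))
```

Only the first candidate matches: the second contains special characters, and the third contains uppercase letters and a space.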