Decouple MW API client code and use Pydantic for validation

Muniza requested to merge mnz/revision-schema into main

This MR decouples MediaWiki API client code from core and improves error handling for MW API requests.

This MR also replaces the core dataclasses with Pydantic models. Previously, we defined the types used across knowledge integrity and then duplicated the constraints on those types in separate schemas for parsing and validating external data. With Pydantic models, the same classes both represent types internally and validate external data. This also means we no longer use jsonschema for data validation, which gives us:

More concise error messages

Pydantic's error messages for validation failures tend to be more human-friendly and to the point. Compare the error messages from both libraries for the following two cases:

Missing field

Pydantic
1 validation error for RevisionSchema
rev_id
  Field required [type=missing, input_value={'rev_bytes': 2800, 'rev_...0:02:02Z', 'lang': 'en'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.1/v/missing
Jsonschema
jsonschema.exceptions.ValidationError: 'rev_id' is a required property

Failed validating 'required' in schema:
    {'properties': {'lang': {'type': 'string'},
                    'page_first_edit_timestamp': {'format': 'date-time',
                                                  'type': 'string'},
     [...]               
     'required': ['rev_id',
                  'rev_bytes',
                  'rev_comment',
     [...]

On instance:
    {'lang': 'en',
     'page_first_edit_timestamp': '2018-01-01T10:02:02Z',
     
     [...]

Incorrect field type

Pydantic
1 validation error for RevisionSchema
rev_comment
  Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.1/v/string_type
Jsonschema
1 is not of type 'string'

Failed validating 'type' in schema['properties']['rev_comment']:
    {'type': 'string'}

On instance['rev_comment']:
    1
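The two failure cases above can be reproduced with a small sketch. It assumes Pydantic v2 (the error URLs above point to 2.1) and a cut-down `RevisionSchema` with only two fields whose types are guessed for illustration; the `error_types` helper is hypothetical and not part of this MR.

```python
from pydantic import BaseModel, ValidationError


class RevisionSchema(BaseModel):
    # Hypothetical subset of the real model; field types are assumed.
    rev_id: int
    rev_comment: str


def error_types(payload: dict) -> list[tuple[str, str]]:
    """Return (field, error type) pairs for a payload, or [] if it is valid."""
    try:
        RevisionSchema.model_validate(payload)
    except ValidationError as exc:
        # exc.errors() exposes each failure as a dict with "loc", "type", "msg", ...
        return [(str(err["loc"][0]), err["type"]) for err in exc.errors()]
    return []


# Missing field: rev_id is absent from the payload.
print(error_types({"rev_comment": "fix typo"}))

# Incorrect field type: rev_comment is None instead of a string.
print(error_types({"rev_id": 1, "rev_comment": None}))
```

Because each failure is also available as structured data through `ValidationError.errors()`, callers can log or branch on the error type instead of parsing the message text.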

Less tedious schema definition

Pydantic lets us define models as Python classes, which is less verbose and more natural than writing a JSON Schema by hand. As the diff shows, field types are specified with type hints, and marking a field as required or optional takes less work.
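As a rough illustration of one class serving as both the internal type and the validator for external data, here is a minimal sketch assuming Pydantic v2. The class name `Revision`, the field types, and the optional `page_title` field are assumptions made for this example, not the definitions in the diff.

```python
from datetime import datetime
from typing import Optional

from pydantic import BaseModel


class Revision(BaseModel):
    # Required fields: a type hint with no default value.
    rev_id: int
    rev_bytes: int
    rev_comment: str
    lang: str
    page_first_edit_timestamp: datetime
    # Optional field (hypothetical): giving it a default makes it optional.
    page_title: Optional[str] = None


# Internal use: construct the type directly, like a dataclass.
internal = Revision(
    rev_id=1,
    rev_bytes=2800,
    rev_comment="fix typo",
    lang="en",
    page_first_edit_timestamp=datetime(2018, 1, 1, 10, 2, 2),
)

# External data: validate and parse an untrusted dict with the same class.
payload = {
    "rev_id": 42,
    "rev_bytes": 2800,
    "rev_comment": "fix typo",
    "lang": "en",
    "page_first_edit_timestamp": "2018-01-01T10:02:02Z",
}
parsed = Revision.model_validate(payload)

# A JSON Schema can still be generated from the model if a consumer needs one.
schema = Revision.model_json_schema()
```

The timestamp string is parsed into a `datetime` during validation, so downstream code works with typed values rather than raw JSON.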

cc @fab and @nickifeajika as we have been using this in some research-datasets pipelines.

Edited by Muniza