Decouple MW API client code and use Pydantic for validation
This MR decouples MediaWiki API client code from core and improves error handling for MW API requests.
This MR also replaces core dataclasses with pydantic models. Previously, we defined the types used across knowledge integrity as dataclasses and then duplicated the constraints on those types in separate schemas for parsing and validating external data. With pydantic models, the same classes both represent types internally and validate external data. This also means we no longer use jsonschema for data validation, which gives us:
More concise error messages
Pydantic's error messages for validation failures tend to be more human-friendly and to the point. Compare error messages for the following:
Missing field
Pydantic
1 validation error for RevisionSchema
rev_id
Field required [type=missing, input_value={'rev_bytes': 2800, 'rev_...0:02:02Z', 'lang': 'en'}, input_type=dict]
For further information visit https://errors.pydantic.dev/2.1/v/missing
Jsonschema
jsonschema.exceptions.ValidationError: 'rev_id' is a required property
Failed validating 'required' in schema:
{'properties': {'lang': {'type': 'string'},
'page_first_edit_timestamp': {'format': 'date-time',
'type': 'string'},
[...]
'required': ['rev_id',
'rev_bytes',
'rev_comment',
[...]
On instance:
{'lang': 'en',
'page_first_edit_timestamp': '2018-01-01T10:02:02Z',
[...]
Incorrect field type
Pydantic
1 validation error for RevisionSchema
rev_comment
Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
For further information visit https://errors.pydantic.dev/2.1/v/string_type
Jsonschema
1 is not of type 'string'
Failed validating 'type' in schema['properties']['rev_comment']:
{'type': 'string'}
On instance['rev_comment']:
1
Less tedious schema definition
Pydantic allows defining models as Python classes, which is less verbose and much more natural than writing a JSON schema. As the diff shows, field types are specified using type hints, and marking a field as required or optional also takes less work.
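For reviewers who have not used pydantic before, here is a minimal sketch of the idea, not the actual model from this MR. The class name `RevisionSketch`, the helper `validation_errors`, the `note` field, and the `int` types for `rev_id` and `rev_bytes` are assumptions made for illustration; the remaining field names are taken from the error output above.

```python
from datetime import datetime
from typing import Optional

from pydantic import BaseModel, ValidationError


class RevisionSketch(BaseModel):
    """Hypothetical model; the real schema in this MR may differ."""

    # Fields without a default are required.
    rev_id: int      # assumed type
    rev_bytes: int   # assumed type
    rev_comment: str
    lang: str
    # ISO 8601 strings such as '2018-01-01T10:02:02Z' are parsed to datetime.
    page_first_edit_timestamp: datetime
    # Hypothetical optional field: a default value makes it optional.
    note: Optional[str] = None


def validation_errors(data: dict) -> list:
    """Return (location, message) pairs, or an empty list if data is valid."""
    try:
        RevisionSketch(**data)
    except ValidationError as exc:
        return [(err["loc"], err["msg"]) for err in exc.errors()]
    return []
```

Calling `validation_errors` on a payload without `rev_id` should report a single error located at `("rev_id",)`, which corresponds to the "Missing field" case above. The same class can also be instantiated directly in internal code, so there is only one definition to maintain.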
cc @fab and @nickifeajika, as we have been using this in some research-datasets pipelines.