Read from MariaDB charset using UTF-8 (!6) · Merge requests · repos / data-engineering / wmfdata-python

Neil Shah-Quinn (WMF) requested to merge mariadb-charset into master Jan 12, 2020

Created by: nshahquinn

We currently tell the mysql-connector-python package to read from our MariaDB databases using the binary character set. However, for some reason this stopped working in version 8.0.18 of the connector (https://phabricator.wikimedia.org/T242448).

It turns out that although our MariaDB databases are set to use the binary character set, binary isn't really a character set at all, since at the root all character sets use bytes. It's just a way of telling the database system to treat the bytes as bytes without attempting to interpret them using a specific character set; for some reason, MediaWiki wants its databases to do that even though the bytes actually represent text encoded with the UTF-8 character set.
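To illustrate the distinction, here is a minimal sketch in plain Python, with no database involved. The sample string is made up for illustration; it only shows that the same stored bytes can be left uninterpreted or decoded as UTF-8 text.

```python
# The bytes a "binary" column would hand back: no character set is applied.
raw = "Zürich".encode("utf-8")

# The same bytes interpreted as UTF-8 text.
text = raw.decode("utf-8")

# The byte count and the character count differ because "ü" takes two bytes.
byte_count = len(raw)
char_count = len(text)
```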

So, we can more usefully tell the connector to read from the databases using UTF-8. This fixes the problem that arose with 8.0.18 and additionally causes the connector to handle the decoding of bytes to characters, allowing us to remove the code where we handled that decoding ourselves.
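As a rough sketch of the change, assuming the connector is given its options as keyword arguments: `connection_options` and `decode_row` are hypothetical names used only for illustration, not the functions in this repository. The first builds options that request UTF-8; the second is the kind of manual decoding step that becomes unnecessary once the connector decodes for us.

```python
def connection_options(host, database, charset="utf8"):
    """Build keyword arguments for the connector (hypothetical helper).

    Passing charset="utf8" instead of "binary" asks the connector to
    decode text columns itself.
    """
    return {"host": host, "database": database, "charset": charset}


def decode_row(row, encoding="utf-8"):
    """Manually decode bytes values in a row (hypothetical helper).

    This is the sort of step needed when reading with the binary
    character set; non-bytes values pass through unchanged.
    """
    return tuple(
        value.decode(encoding) if isinstance(value, (bytes, bytearray)) else value
        for value in row
    )
```

With the package installed, the options could be passed along as `mysql.connector.connect(**connection_options(host, database))`; the host and database values are whatever the caller already uses.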
