Article quality model as dependency in data pipelines
How should users of the article quality model make use of it?
Note that this issue is not about online "model inference" (which in this case is a table lookup); that is an open and recurring question that also needs to be solved for other similar models (topic model, geography model).
For offline use cases, e.g. a Spark pipeline (like the content gap pipeline) that requires article quality scores for a set of articles, there are two options:
- Create a code dependency for offline "consumers": somebody can pip install an article-quality module and then import methods to read a "current" article quality dataframe or a time series of article qualities, along with helper methods, etc. (see the first sketch below this list).
- Another approach is to not have a code dependency. The article quality pipeline stores all article quality predictions at a specific HDFS path with a documented schema (or as self-describing parquet, though that makes it harder to use for non-Spark consumers). A consumer can then read from that HDFS path, parse the data using the documented schema, and attach the quality scores with a join operation (see the second sketch below this list). The disadvantages of this approach:
  - the HDFS path will be used by other projects, so the data itself becomes "production" and has to be maintained (akin to the data sources provided by data engineering)
  - there is no way to provide code that represents the domain or offers helper methods. For the quality model this might be ok (since the quality is a single double value), but for e.g. the article topic model the data will be an array of 1000 values, each corresponding to a topic
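For the package option, here is a minimal sketch of what the read helpers in such a module could look like. The package layout, the path `hdfs:///wmf/data/article_quality/predictions`, and the column names (`wiki_db`, `page_id`, `snapshot`, `quality_score`) are hypothetical assumptions for illustration, not an existing API:

```python
# Hypothetical sketch of an "article_quality" package for offline consumers.
# Path and column names are assumptions, not the real pipeline output.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

QUALITY_SNAPSHOTS_PATH = "hdfs:///wmf/data/article_quality/predictions"  # hypothetical


def read_quality_timeseries(spark: SparkSession) -> DataFrame:
    """All snapshots: one quality score per (wiki_db, page_id, snapshot)."""
    return spark.read.parquet(QUALITY_SNAPSHOTS_PATH)


def read_current_quality(spark: SparkSession) -> DataFrame:
    """Only the most recent snapshot per article."""
    df = read_quality_timeseries(spark)
    latest = df.groupBy("wiki_db", "page_id").agg(F.max("snapshot").alias("snapshot"))
    return df.join(latest, on=["wiki_db", "page_id", "snapshot"], how="inner")


def add_quality_scores(articles: DataFrame, spark: SparkSession) -> DataFrame:
    """Helper: left-join a consumer's article dataframe with current quality scores."""
    quality = read_current_quality(spark).select("wiki_db", "page_id", "quality_score")
    return articles.join(quality, on=["wiki_db", "page_id"], how="left")
```

A consumer pipeline would then `pip install article-quality` and call something like `add_quality_scores(my_articles_df, spark)` instead of hard-coding the path and schema itself.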
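For the HDFS-path option, the consumer does the read and join itself. A minimal sketch, again assuming a hypothetical path, a hypothetical consumer table, and the same documented columns as above:

```python
# Sketch of a consumer reading the documented HDFS path directly (no code dependency).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Documented location maintained by the article quality pipeline (hypothetical path).
quality = spark.read.parquet("hdfs:///wmf/data/article_quality/predictions")

# The consumer's own set of articles, e.g. from the content gap pipeline (hypothetical table).
articles = spark.read.table("my_project.articles")

# Attach the quality score with a join on the documented key columns.
articles_with_quality = articles.join(
    quality.select("wiki_db", "page_id", "quality_score"),
    on=["wiki_db", "page_id"],
    how="left",
)
```

The difference between the two sketches is who owns the path and schema knowledge: the package encapsulates it behind helper functions, while the HDFS-path approach repeats it in every consumer.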
The "create a python package" option seems preferable in my opinion, but they are not mutually exclusive: we could start with the "hdfs path" option and create a package when the need arises.