Improve structured content parsing

A job that:

  • keeps table cells, with a new option to keep only "long" ones
  • collapses list items
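
As a rough illustration of the two behaviours listed above (this is not the code in this merge request), here is a minimal sketch using Python's standard `html.parser`. The names `StructuredTextExtractor`, `extract_blocks` and `min_cell_chars` are hypothetical, and both the reading of "long" as a character-count threshold and the "; " separator for collapsed list items are assumptions.

```python
from html.parser import HTMLParser


class StructuredTextExtractor(HTMLParser):
    """Hypothetical sketch: collects paragraph, table-cell and list text."""

    def __init__(self, min_cell_chars=0):
        super().__init__()
        self.min_cell_chars = min_cell_chars  # assumed option; 0 keeps every cell
        self.blocks = []   # output text blocks, in document order
        self._buf = []     # text fragments of the element being read
        self._items = []   # items of the list being read
        self._tag = None   # currently open p/td/th/li, if any

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "td", "th", "li"):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in ("ul", "ol"):
            # Collapse all items of the list into a single block.
            if self._items:
                self.blocks.append("; ".join(self._items))
                self._items = []
            return
        if tag != self._tag:
            return
        # Normalise whitespace inside the element.
        text = " ".join("".join(self._buf).split())
        self._tag = None
        if not text:
            return
        if tag == "li":
            self._items.append(text)
        elif tag in ("td", "th"):
            # Keep the cell only if it reaches the length threshold.
            if len(text) >= self.min_cell_chars:
                self.blocks.append(text)
        else:
            self.blocks.append(text)


def extract_blocks(html, min_cell_chars=0):
    """Return the text blocks of an HTML fragment (flat lists only)."""
    parser = StructuredTextExtractor(min_cell_chars)
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Calling `extract_blocks(html, min_cell_chars=20)` would drop short cells such as bare numbers while keeping descriptive ones. Nested lists and nested tables are not handled in this sketch.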

Refactor extract_embeddings to use this intermediate dataset.

Bug: T414070

Edited by DCausse
