Filter main pages out of lead images by QID
Pass a list of Wikidata QIDs to be filtered out of the lead images dataframe.
The default value is FREQUENTLY_UPDATED_PAGE_QIDS = ['Q5296'], which filters
out main pages.
Also pull out the hardcoded QID threshold value of the existing filter.
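
As a rough illustration of the intended behaviour only, here is a minimal
pure-Python sketch of the exclusion logic. The function name filter_qids, the
list-of-dicts row shape, and the second sample row are hypothetical; the real
job operates on a Spark dataframe, where something like
~col('item_id').isin(qids) would presumably play the same role.

```python
# Hypothetical sketch of the QID exclusion; not the actual pipeline code.
FREQUENTLY_UPDATED_PAGE_QIDS = ['Q5296']  # default from this change: main pages


def filter_qids(rows, excluded_qids=FREQUENTLY_UPDATED_PAGE_QIDS):
    """Return the rows whose item_id is not in excluded_qids."""
    excluded = set(excluded_qids)
    return [row for row in rows if row['item_id'] not in excluded]


# Illustrative rows: the first mirrors the prod output in this message,
# the second is made up purely to show a row that survives the filter.
rows = [
    {'page_id': 59905123, 'item_id': 'Q5296', 'found_on': ['dinwiki']},
    {'page_id': 1, 'item_id': 'Q42', 'found_on': ['enwiki']},
]

kept = filter_qids(rows)
print([row['item_id'] for row in kept])
```

Passing an empty list as excluded_qids leaves the rows untouched, which is one
way a caller could opt out of the default filter in this sketch.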
Tested with:
python image_suggestions/commonswiki_file.py T325629 2024-08-12 64
prod = spark.read.table('analytics_platform_eng.image_suggestions_lead_image_data').where('snapshot="2024-08-12"')
dev = spark.read.table('T325629.image_suggestions_lead_image_data').where('snapshot="2024-08-12"')
prod.where('item_id="Q5296"').toPandas()
       page_id  item_id                                     tag  score       found_on    snapshot
0     59905123    Q5296  image.linked.from.wikipedia.lead_image      1      [dinwiki]  2024-08-12
1      8397621    Q5296  image.linked.from.wikipedia.lead_image      2      [mrjwiki]  2024-08-12
2     60126046    Q5296  image.linked.from.wikipedia.lead_image      1  [brwikiquote]  2024-08-12
3      1145429    Q5296  image.linked.from.wikipedia.lead_image      5      [wuuwiki]  2024-08-12
4       997212    Q5296  image.linked.from.wikipedia.lead_image      1      [chywiki]  2024-08-12
..         ...      ...                                     ...    ...            ...         ...
291    1721280    Q5296  image.linked.from.wikipedia.lead_image      2      [novwiki]  2024-08-12
292     714512    Q5296  image.linked.from.wikipedia.lead_image     40       [euwiki]  2024-08-12
293  121293459    Q5296  image.linked.from.wikipedia.lead_image     10       [ltwiki]  2024-08-12
294      81454    Q5296  image.linked.from.wikipedia.lead_image      5       [kswiki]  2024-08-12
295    1382935    Q5296  image.linked.from.wikipedia.lead_image     11  [tawikiquote]  2024-08-12

[296 rows x 6 columns]
dev.where('item_id="Q5296"').toPandas()
Empty DataFrame
Columns: [page_id, item_id, tag, score, found_on, snapshot]
Index: []
Bug: T325629