Would it be safe to merely enumerate(headings), for instance?
@mlitn, got it, thanks! References are currently removed in wikitext_headings_to_anchors(headings: List[str]): how would you get their number?
See section-image-recs!12 (comment 73760) - same feedback applies
They should not simply be removed, but replaced by [x] (where x is the number of the reference).
Note how on this page, there's this title:
== Norges sommerby<ref>[[Nordkapp]] kommune vedtok [[Liste over norske byer|bystatus]] for Honningsvåg fra [[1. oktober]] [[1996]], og i ettertid har det vært mange diskusjoner rundt stedets status som by. En endring i [[kommuneloven]] krever at kommunen må ha minst 5 000 innbyggere for at et tettsted skal kunne kalle seg by, men [[Nordkapp]] kommune vedtok bystatus for Honningsvåg før innføringen av denne loven. Siden norske lover ikke har tilbakevirkende kraft, gjelder dette ikke for Honningsvåg, som dermed kan kalle seg en by.</ref>==
which ends up getting rendered like this:
<h2><span id="Norges_sommerby.5B5.5D"></span><span class="mw-headline" id="Norges_sommerby[5]">Norges sommerby<sup id="cite_ref-5" class="reference"><a href="#cite_note-5">[5]</a></sup></span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Honningsv%C3%A5g&veaction=edit&section=4" title="Rediger avsnitt: Norges sommerby[5]" class="mw-editsection-visualeditor"><span>rediger</span></a><span class="mw-editsection-divider"> | </span><a href="/w/index.php?title=Honningsv%C3%A5g&action=edit&section=4" title="Rediger kildekoden til seksjonen Norges sommerby[5]"><span>rediger kilde</span></a><span class="mw-editsection-bracket">]</span></span></h2>
The relevant part there is id="Norges_sommerby[5]", which is how we can directly link to this section, like so: https://no.wikipedia.org/wiki/Honningsv%C3%A5g#Norges_sommerby[5]
The current code would not produce Norges_sommerby[5], just Norges_sommerby, so it would not be possible to link to the correct section directly.
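To make the expected behaviour concrete, here is a minimal sketch of replacing references by [x] when building the anchor. The names REF_TAG and heading_to_anchor are hypothetical and not part of this repo, and the sketch assumes the caller already knows the number of the first reference in the heading. On a real page that number depends on every reference that comes before the heading (and on named or grouped references), which this sketch does not try to work out. It also ignores any other markup in the heading.

```python
import re
from itertools import count

# Hypothetical pattern: a self-closing <ref ... /> or a paired <ref ...>...</ref>.
# re.DOTALL lets the reference body contain line breaks.
REF_TAG = re.compile(
    r"<ref\b[^>]*?/>|<ref\b[^>]*>.*?</ref>",
    re.DOTALL | re.IGNORECASE,
)


def heading_to_anchor(heading: str, first_ref_number: int) -> str:
    """Build an anchor from a wikitext heading, keeping references as [n].

    first_ref_number is an assumption of this sketch: the caller must supply
    the page-wide number of the first reference found in the heading.
    """
    numbers = count(first_ref_number)
    # Drop the surrounding "=" markers and whitespace.
    text = heading.strip().strip("=").strip()
    # Replace each reference by its bracketed number instead of removing it.
    text = REF_TAG.sub(lambda _match: f"[{next(numbers)}]", text)
    # Anchors use underscores instead of whitespace.
    return re.sub(r"\s+", "_", text)


print(heading_to_anchor("== Norges sommerby<ref>[[Nordkapp]] kommune</ref>==", 5))
# Norges_sommerby[5]
```

Simply enumerating the headings would not be enough for this, since the number is tied to the position of the reference on the page rather than to the heading.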
Script run:
prod = spark.read.parquet('/user/analytics-platform-eng/structured-data/section_topics/2024-02-19').where("section_title like '%<ref%'")
dev = spark.read.parquet('section_topics/2024-02-19').where("section_title like '%<ref%'")
prod.count(), dev.count()
(255572, 51)
prod.select('wiki_db', 'page_id', 'section_title').distinct().count()
17054
devref = dev.select('wiki_db', 'page_id', 'section_title').distinct()
devref.count()
2
devref.show()
+-------+-------+--------------------------+
|wiki_db|page_id|section_title |
+-------+-------+--------------------------+
|srwiki |41595 |=_Бивши_корисници<ref_name|
|kowiki |259315 |==_남자부<ref_name |
+-------+-------+--------------------------+
srwiki is broken in the real world!
kowiki is correct; perhaps it slipped in due to </br>?
=== 남자부<ref name="드래프트">[http://www.cbs.co.kr/Nocut/Show.asp?IDX=976377 문성민, 신인드래프트 1순위로 한국전력에 지명] <노컷뉴스> 2008년 11월 3일</br>
[http://www.mydaily.co.kr/news/read.html?newsid=200810201459462275&ext=na 女배구 세터 염혜선, 드래프트 1순위 현대건설 행(종합)] <마이데일리> 2008년 10월 20일 보도</ref> ===
Shall we fix them?
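In case it helps to look for similar cases, below is a rough sketch of a check for heading lines that open a reference without closing it on the same line, which is what the kowiki heading above does. This is only one possible explanation and has not been confirmed as the cause. OPEN_REF, CLOSE_REF and ref_closes_on_other_line are hypothetical names, not existing code.

```python
import re

# Opening <ref> or <ref name="x">, but not the self-closing <ref ... /> form.
OPEN_REF = re.compile(r"<ref\b[^>]*(?<!/)>", re.IGNORECASE)
# Closing </ref>.
CLOSE_REF = re.compile(r"</ref\s*>", re.IGNORECASE)


def ref_closes_on_other_line(line: str) -> bool:
    """Return True when a line opens more references than it closes."""
    opened = len(OPEN_REF.findall(line))
    closed = len(CLOSE_REF.findall(line))
    return opened > closed


# First line of the kowiki heading (shortened): the </ref> is on the next line.
kowiki_first_line = '=== 남자부<ref name="드래프트">[http://www.cbs.co.kr/Nocut/Show.asp?IDX=976377 문성민] 2008년 11월 3일</br>'
print(ref_closes_on_other_line(kowiki_first_line))
```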
Bug: T341113
Airflow test run results:
snapshot = '2024-02-19'
prod = spark.read.table('analytics_platform_eng.image_suggestions_suggestions').where(f'snapshot="{snapshot}" and section_index is not null')
dev = spark.read.table('slis_no_ref.image_suggestions_suggestions').where(f'snapshot="{snapshot}" and section_index is not null')
prod.count(), dev.count()
(3417560, 3419662)
prod.where("section_heading like '%<ref%'").count()
742
dev.where("section_heading like '%<ref%'").count()
0
Bug: T341113
Marco Fossati (0bcfc79c) at 06 Mar 14:54
skip reference tags
Marco Fossati (590b2886) at 05 Mar 17:25
update & fix tests
Marco Fossati (b737eca8) at 05 Mar 16:35
skip reference tags
This changes pipeline.py to consume parquets (which have been generated since MR29) instead of the static input files bundled within this repo.
In addition to that, the commit to update the denylist
ingestion also includes some refactoring to get rid of
some code duplication for normalising section titles:
now that the denylist has also become a parquet, we can
get rid of the plain Python implementation and stick
with only the PySpark version.
This also includes further changes to where denylisted
rows are being filtered out; it now happens near the end
of main instead of being bundled within extract_sections.
I also removed the --work-dir argument, expecting a more
complete path for each argument. This makes things simpler
when the data is not all consolidated in the same dir.
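As an illustration of the argument change, here is a minimal argparse sketch. The flag names are taken from the final command in this description. Everything else (help texts, which flags are required, the build_parser name) is an assumption for the sake of the example and may differ from what pipeline.py actually does.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser: every filter and the output take a full path."""
    parser = argparse.ArgumentParser(description="Section topics pipeline (sketch)")
    parser.add_argument("snapshot", help="snapshot date, e.g. 2023-11-20")
    parser.add_argument("--page-filter", help="full path to the page filter parquet")
    parser.add_argument("--table-filter", help="full path to the table filter parquet")
    parser.add_argument(
        "--section-title-filter",
        help="full path to the section titles denylist parquet",
    )
    # Repeatable flag: each occurrence appends one more path.
    parser.add_argument(
        "--qid-filter",
        action="append",
        default=[],
        help="full path to a QID filter parquet (can be passed several times)",
    )
    # Assumption: the output path is mandatory now that there is no work dir.
    parser.add_argument("--output", required=True, help="full output path")
    return parser


args = build_parser().parse_args([
    "2023-11-20",
    "--qid-filter", "/user/mlitn/section_topics/qids_for_all_points_in_time",
    "--qid-filter", "/user/mlitn/section_topics/qids_for_media_outlets",
    "--output", "/user/mlitn/section_topics/2023-11-20",
])
print(args.qid_filter)
```

With no shared work dir, nothing has to be joined onto a base path: each value is used as given.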
I ran these scripts a couple of times:
Current (with bundled inputs):
$ python section_topics/pipeline.py 2023-11-20 --work-dir=/user/mlitn/section_topics --page-filter=2022-10_ptwiki_bad --table-filter=20231120_target_wikis_tables
spark.read.parquet('/user/mlitn/section_topics/2023-11-20').count() # 250121777
After 6cfec7d7 (qid-filter parquets):
$ python section_topics/pipeline.py 2023-11-20 --work-dir=/user/mlitn/section_topics --page-filter=2022-10_ptwiki_bad --table-filter=20231120_target_wikis_tables --qid-filter=qids_for_all_points_in_time --qid-filter=qids_for_media_outlets
spark.read.parquet('/user/mlitn/section_topics/2023-11-20').count() # 251654789
After d619d486 (qid-filter + section-title-filter parquets):
$ python section_topics/pipeline.py 2023-11-20 --work-dir=/user/mlitn/section_topics --page-filter=2022-10_ptwiki_bad --table-filter=20231120_target_wikis_tables --qid-filter=qids_for_all_points_in_time --qid-filter=qids_for_media_outlets --section-title-filter=section_titles_denylist
spark.read.parquet('/user/mlitn/section_topics/2023-11-20').count() # 143412657
Notice the significant drop in results! It is caused by the much bigger denylist, which in turn results from using the new SEAL alignment instead of the old one. AFAICT, the new alignment data (and as a result, the new denylist) seems fine, and the denylisted sections are indeed the kind we want to exclude.
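To put a number on that drop, the row counts from the runs above can be compared directly. This is plain arithmetic on the figures reported in this description; relative_change is just a helper name for the example. The drop after adding the section title denylist comes out at roughly 43%.

```python
# Row counts reported by the runs in this description.
bundled_inputs = 250_121_777
qid_filter_parquets = 251_654_789
with_section_title_filter = 143_412_657


def relative_change(before: int, after: int) -> float:
    """Return the change from before to after as a fraction of before."""
    return (after - before) / before


# Switching the QID filters to parquets barely moves the count...
print(f"{relative_change(bundled_inputs, qid_filter_parquets):+.2%}")
# ...while the new section title denylist removes a large share of the rows.
print(f"{relative_change(qid_filter_parquets, with_section_title_filter):+.2%}")
```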
Final T339129_2 (with changes to path):
$ python section_topics/pipeline.py 2023-11-20 --page-filter=/user/mlitn/section_topics/2022-10_ptwiki_bad --table-filter=/user/mlitn/section_topics/20231120_target_wikis_tables --section-title-filter=/user/mlitn/section_topics/section_titles_denylist --qid-filter=/user/mlitn/section_topics/qids_for_all_points_in_time --qid-filter=/user/mlitn/section_topics/qids_for_media_outlets --output=/user/mlitn/section_topics/2023-11-20
spark.read.parquet('/user/mlitn/section_topics/2023-11-20').count() # 143412657
Bug: T339129
Closing (everything has been moved into another MR)
Matthias Mullie (c1dd5cc8) at 22 Feb 07:16
Read dumps from HDFS
Matthias Mullie (a0558798) at 21 Feb 16:42
Read dumps from HDFS
Matthias Mullie (caa85caa) at 21 Feb 16:06
Read dumps from HDFS
Matthias Mullie (ae62e1a8) at 21 Feb 15:23
Read dumps from HDFS
Matthias Mullie (b4009c27) at 21 Feb 15:19
Read dumps from HDFS
Matthias Mullie (03ea3407) at 20 Feb 16:46
Read dumps from HDFS
Matthias Mullie (db312a80) at 20 Feb 15:24
Read dumps from HDFS
Matthias Mullie (ba318a8f) at 20 Feb 14:15
Read dumps from HDFS