
Remove reference tags from section headings

Marco Fossati requested to merge T341113 into main

Script run:

prod = spark.read.parquet('/user/analytics-platform-eng/structured-data/section_topics/2024-02-19').where("section_title like '%<ref%'")
dev = spark.read.parquet('section_topics/2024-02-19').where("section_title like '%<ref%'")
prod.count(), dev.count()
(255572, 51)

prod.select('wiki_db', 'page_id', 'section_title').distinct().count()
17054
devref = dev.select('wiki_db', 'page_id', 'section_title').distinct()
devref.count()
2
devref.show()
+-------+-------+--------------------------+
|wiki_db|page_id|section_title             |
+-------+-------+--------------------------+
|srwiki |41595  |=_Бивши_корисници<ref_name|
|kowiki |259315 |==_남자부<ref_name        |
+-------+-------+--------------------------+

🤷 🤷 🤷

The srwiki heading is broken in the real world, i.e. in the live article itself!

[Screenshot: Screen_Shot_2024-03-07_at_19.38.06]

The kowiki heading is correct wikitext; perhaps it slipped through due to the </br>?

=== 남자부<ref name="드래프트">[http://www.cbs.co.kr/Nocut/Show.asp?IDX=976377 문성민, 신인드래프트 1순위로 한국전력에 지명] <노컷뉴스> 2008년 11월 3일</br>
[http://www.mydaily.co.kr/news/read.html?newsid=200810201459462275&ext=na 女배구 세터 염혜선, 드래프트 1순위 현대건설 행(종합)] <마이데일리> 2008년 10월 20일 보도</ref> ===
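
One possible explanation for the kowiki case, offered only as a guess: the reference in the heading above spans two lines, and a pattern that does not match across newlines would leave it in place. The sketch below illustrates that effect with plain Python regular expressions. The pattern, the `strip_refs` helper, and the sample titles are all hypothetical and are not taken from the pipeline's actual code.

```python
import re

# Hypothetical pattern for a paired <ref ...>...</ref> tag.
# This is an illustration only, not the pipeline's real implementation.
REF_PATTERN = r"<ref[^>]*>.*?</ref>"

# Without re.DOTALL, "." does not match a newline character.
REF_SINGLE_LINE = re.compile(REF_PATTERN)
# With re.DOTALL, "." also matches newlines, so multi-line refs are covered.
REF_MULTI_LINE = re.compile(REF_PATTERN, re.DOTALL)


def strip_refs(title: str, pattern: re.Pattern) -> str:
    """Remove every match of `pattern` from `title` and trim whitespace."""
    return pattern.sub("", title).strip()


# Made-up sample titles that mimic the shape of the kowiki heading.
one_line = 'History<ref name="a">Some source</ref>'
two_lines = 'History<ref name="a">First source</br>\nSecond source</ref>'

print(strip_refs(one_line, REF_SINGLE_LINE))
print(repr(strip_refs(two_lines, REF_SINGLE_LINE)))
print(strip_refs(two_lines, REF_MULTI_LINE))
```

If the real cleanup works on a per-line basis or uses a pattern of this kind without `re.DOTALL`, a reference containing a line break would survive it. Whether that is what happened here would need to be checked against the actual code.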

Shall we fix them?

Bug: T341113
