Toggle navigation
Home
New Query
Recent Queries
Discuss
Database tables
Database names
MediaWiki
Wikibase
Replicas browser and optimizer
Login
History
Fork
Fork of
Articles with talk pages but not assigned to WikiProjects
by
The Anome
This query is marked as a draft
This query has been published
by
The Anome
.
Part of a plan to assign articles to WikiProjects using Naive Bayes. Still being worked on. Limited to just a few pages, for testing purposes. Sorting by all three result columns to make output more compressible later. Now using modulo arithmetic to sub-sample the page table more uniformly. 1/1000 of the table ~= 6700 articles page_id % 1000 = 77: Executed in 251.52 seconds as of Tue, 07 Mar 2023 08:21:01 UTC. Resultset (296041 rows) page_id % 1000 = 78: Executed in 265.67 seconds as of Tue, 07 Mar 2023 08:28:22 UTC. Resultset (316077 rows) page_id % 1000 = 392: Executed in 256.58 seconds as of Tue, 07 Mar 2023 09:22:56 UTC. Resultset (310016 rows) page_id % 1000 = 816: Executed in 269.80 seconds as of Tue, 07 Mar 2023 09:31:37 UTC. Resultset (295323 rows) trying chunkier... page_id % 500 = 61: Executed in 548.59 seconds as of Tue, 07 Mar 2023 08:42:24 UTC. Resultset (572651 rows) ...not significantly more efficient trying finer... page_id % 2000 = 72: running... Executed in 149.02 seconds as of Tue, 07 Mar 2023 08:54:04 UTC. Resultset (156369 rows) ... slightly less efficient? With 1/1000 scan, jsonl file is about 30 Mbytes in size, but bz2's to about 1.3 Mbytes. page_id % 100 = 81: Executed in 2025.27 seconds as of Sun, 19 Mar 2023 08:35:21 UTC. Resultset (2381044 rows) result as TSV = 163 Mbytes, bz2's to 10 Mbytes. Next step: Toolforge: Now uses page_random to filter articles, hopefully a better filter if indexed: Executed in 2310.36 seconds as of Fri, 06 Oct 2023 16:02:56 UTC. Resultset (2506435 rows) -- TSV download was 165 Mbytes, zips to 15 Mbytes. Now going for 5%.
Toggle Highlighting
SQL
SELECT pagecats.cl_to, talkcats.cl_to, -- REGEXP_EXTRACT(talkcats.cl_to, '(?WikiProject_|.*_importance_|.*_priority_|Unassessed)(.*)'), -- does not work COUNT(*) AS my_count FROM page AS article INNER JOIN page AS talk ON talk.page_title = article.page_title INNER JOIN categorylinks AS pagecats ON pagecats.cl_from = article.page_id INNER JOIN categorylinks AS talkcats ON talkcats.cl_from = talk.page_id WHERE article.page_random < 0.05 -- page_random is between 0 and 1, so this sets the probe size AND article.page_namespace = 0 AND talk.page_namespace = 1 AND article.page_is_redirect = 0 AND talk.page_is_redirect = 0 AND ( talkcats.cl_to LIKE "%WikiProject_%" OR talkcats.cl_to LIKE "%-Class_%_articles" OR talkcats.cl_to LIKE "%-importance_%_articles" OR talkcats.cl_to LIKE "%-priority_%_articles" OR talkcats.cl_to LIKE "Unassessed_%_articles" ) AND NOT ( talkcats.cl_to LIKE "%vital%" OR talkcats.cl_to LIKE "%Version%" ) AND NOT (pagecats.cl_to LIKE "%Disambig%" OR pagecats.cl_to LIKE "%disambig%" OR pagecats.cl_to LIKE "%set_index%" OR pagecats.cl_to LIKE "Set_index%") AND NOT (talkcats.cl_to LIKE "%Disambig%" OR talkcats.cl_to LIKE "%disambig%") AND NOT (pagecats.cl_to LIKE "Short_description%" OR pagecats.cl_to LIKE "%_errors%" OR pagecats.cl_to LIKE "CS1_%" OR pagecats.cl_to LIKE "%short_description%" OR pagecats.cl_to LIKE "%articles%" OR pagecats.cl_to LIKE "Articles%" OR pagecats.cl_to LIKE "%pages%" OR pagecats.cl_to LIKE "%disputes%" OR pagecats.cl_to LIKE "Pages%" OR pagecats.cl_to LIKE "Use_dmy_date%" OR pagecats.cl_to LIKE "%Wikipedia%" OR pagecats.cl_to LIKE "%articles%" OR pagecats.cl_to LIKE "%Articles%" OR pagecats.cl_to LIKE "%Wikidata%" OR pagecats.cl_to LIKE "Webarchive%") GROUP BY pagecats.cl_to, talkcats.cl_to ORDER BY my_count DESC, pagecats.cl_to, talkcats.cl_to -- LIMIT 10000
By running queries you agree to the
Cloud Services Terms of Use
and you irrevocably agree to release your SQL under
CC0 License
.
Submit Query
Stop Query
All SQL code is licensed under
CC0 License
.
Checking query status...