Toggle navigation
Home
New Query
Recent Queries
Discuss
Database tables
Database names
MediaWiki
Wikibase
Replicas browser and optimizer
Login
History
Fork
This query is marked as a draft
This query has been published
by
MarioGom
.
This query shows domains used in external links from the main namespace of English Wikipedia, along with a count of how many links they appear in, as well as the number of articles containing, at least, one link to the domain. Domains are simplified to the main domain: * news.bbc.com -> bbc.com * news.bbc.co.uk -> bbc.co.uk * www.dover.nj.us -> dover.nj.us Top 1000 are shown, according to number of articles containing links to the domain.
Toggle Highlighting
SQL
USE enwiki_p; SELECT all_agg.main_domain AS main_domain, COUNT(*) AS total_count FROM ( SELECT main_domain, COUNT(*) AS total_count FROM ( SELECT CASE WHEN domain LIKE '%.gouv.fr' THEN SUBSTRING_INDEX(domain, '.', -3) WHEN domain LIKE '%.govt.nz' THEN SUBSTRING_INDEX(domain, '.', -3) WHEN domain REGEXP '^.*[-\\w]{4,}\\.[[:alpha:]]{2,}$' THEN SUBSTRING_INDEX(domain, '.', -2) WHEN domain REGEXP '^.*\\.(ac|edu|org|or|com|co|net|gov|gob|go)\\.[[:alpha:]]{2,}$' THEN SUBSTRING_INDEX(domain, '.', -3) WHEN domain REGEXP '.*\\.[[:alpha:]]{2}\\.(us|ca|za|it)$' THEN SUBSTRING_INDEX(domain, '.', -3) ELSE SUBSTRING_INDEX(domain, '.', -2) END AS main_domain FROM ( SELECT REGEXP_REPLACE(LOWER(CONVERT(SUBSTRING(el_to FROM 1 FOR 255) USING utf8)), '^(?:[^:/]+:)?//((?:[-_0-9a-z]+\\.)+[a-z]+)(?:[/:?].*)?$', '\\1') AS domain FROM externallinks WHERE -- Only links from main namespace articles. el_from_namespace = 0 AND el_from % 1000 = 0 -- Get only fairly well-formed URLs and exclude IPv4. Only http, https, implicit https (//) and ftp. AND SUBSTRING(el_to FROM 1 FOR 40) REGEXP '(?i)^(?:https?:|ftp:)?//(?:[-_0-9a-z]+\\.)+[a-z]+(?:[/:?].*)?$' ) AS domains ) AS main_domains GROUP BY main_domain HAVING total_count >= 10 ) AS sampled_agg JOIN ( SELECT CASE WHEN domain LIKE '%.gouv.fr' THEN SUBSTRING_INDEX(domain, '.', -3) WHEN domain LIKE '%.govt.nz' THEN SUBSTRING_INDEX(domain, '.', -3) WHEN domain REGEXP '^.*[-\\w]{4,}\\.[[:alpha:]]{2,}$' THEN SUBSTRING_INDEX(domain, '.', -2) WHEN domain REGEXP '^.*\\.(ac|edu|org|or|com|co|net|gov|gob|go)\\.[[:alpha:]]{2,}$' THEN SUBSTRING_INDEX(domain, '.', -3) WHEN domain REGEXP '.*\\.[[:alpha:]]{2}\\.(us|ca|za|it)$' THEN SUBSTRING_INDEX(domain, '.', -3) ELSE SUBSTRING_INDEX(domain, '.', -2) END AS main_domain FROM ( SELECT REGEXP_REPLACE(LOWER(CONVERT(SUBSTRING(el_to FROM 1 FOR 255) USING utf8)), '^(?:[^:/]+:)?//((?:[-_0-9a-z]+\\.)+[a-z]+)(?:[/:?].*)?$', '\\1') AS domain FROM externallinks WHERE -- Only links from main namespace articles. el_from_namespace = 0 -- Get only fairly well-formed URLs and exclude IPv4. Only http, https, implicit https (//) and ftp. AND SUBSTRING(el_to FROM 1 FOR 40) REGEXP '(?i)^(?:https?:|ftp:)?//(?:[-_0-9a-z]+\\.)+[a-z]+(?:[/:?].*)?$' ) AS domains ) AS all_agg ON sampled_agg.main_domain = all_agg.main_domain GROUP BY all_agg.main_domain ORDER BY total_count DESC LIMIT 1000;
By running queries you agree to the
Cloud Services Terms of Use
and you irrevocably agree to release your SQL under
CC0 License
.
Submit Query
Stop Query
All SQL code is licensed under
CC0 License
.
Checking query status...