Toggle navigation
Home
New Query
Recent Queries
Discuss
Database tables
Database names
MediaWiki
Wikibase
Replicas browser and optimizer
Login
History
Fork
This query is marked as a draft
This query has been published
by
MarioGom
.
Toggle Highlighting
SQL
USE euwiki_p; USE euwiki_p; -- Extract all (page_id, main_domain) tuples SELECT page_id, -- Extract main_domain from domain. -- Examples: -- news.bbc.com -> bbc.com -- news.bbc.co.uk -> bbc.co.uk IF(domain REGEXP '^.*[-\\w]{4,}\\.[[:alpha:]]{2,}$', SUBSTRING_INDEX(domain, '.', -2), IF(domain REGEXP '^.*\\.(ac|edu|org|or|com|co|net|gov|go)\\.[[:alpha:]]{2,}$', SUBSTRING_INDEX(domain, '.', -3), IF(domain REGEXP '.*\\.[[:alpha:]]{2}\\.(us|ca|za)$', SUBSTRING_INDEX(domain, '.', -3), SUBSTRING_INDEX(domain, '.', -2) ))) AS main_domain FROM ( -- Extract all (page_id, domain) tuples. SELECT el_from AS page_id, REGEXP_REPLACE(el_to, '^(?:[^:/]+:)?//((?:[_0-9A-Za-z]+\\.)+[_0-9A-Za-z]+)(?:[/:?].*)?$', '\\1') AS domain FROM externallinks WHERE -- Only links from main namespace articles. el_from_namespace = 0 -- Get only fairly well-formed URLs for http and https (or implicit https?) and exclude IPv4. AND el_to REGEXP '^(https?:)?//[[:alnum:]]+\\..*[[:alpha:]].*' ) AS q LIMIT 100;
By running queries you agree to the
Cloud Services Terms of Use
and you irrevocably agree to release your SQL under
CC0 License
.
Submit Query
Stop Query
All SQL code is licensed under
CC0 License
.
Checking query status...