Toggle navigation
Home
New Query
Recent Queries
Discuss
Database tables
Database names
MediaWiki
Wikibase
Replicas browser and optimizer
Login
History
Fork
This query is marked as a draft
This query has been published
by
MarioGom
.
This query is based on https://quarry.wmflabs.org/query/26224 It is a refined version focused on sites instead of domains. It differs from the domain listing in the following aspects: * Domains that have a permanent redirection are changed to the new domain (e.g. ehu.es -> ehu.eus, blogspot.com -> blogger.com). This process is only semi-automated and results might not be fully accurate. Top 500 are shown, according to number of articles containing links to the domain. Script used to check most redirects was: cat quarry.csv | cut -d, -f1 | while read d ; do to=$(curl -s -I http://$d | grep '^Location:' | sed -r 's|^Location:\s\s*https?://(www\.)?([^:/?]*)([:/?].*)?|\2|g'); [[ ! -z $to && $d != $to ]] && echo "$d $to" ; done
Toggle Highlighting
SQL
USE euwiki_p; SELECT * FROM ( SELECT site, COUNT(DISTINCT page_id) AS article_count, COUNT(*) AS total_count FROM ( SELECT page_id, CASE main_domain -- FIXME: wmflabs.org subdomains should not be merged WHEN "wmflabs.org" THEN "wikitech.wikimedia.org" WHEN "euskaltzaindia.net" THEN "euskaltzaindia.eus" WHEN "paleodb.org" THEN "fossilworks.org" WHEN "faunaeur.org" THEN "fauna-eu.org" WHEN "berria.info" THEN "berria.eus" WHEN "argia.com" THEN "argia.eus" WHEN "armiarma.com" THEN "armiarma.eus" WHEN "blogspot.com" THEN "blogger.com" WHEN "ehu.es" THEN "ehu.eus" WHEN "zientzia.net" THEN "zientzia.eus" WHEN "gara.net" THEN "naiz.eus" WHEN "hiru.com" THEN "hiru.eus" WHEN "badok.info" THEN "badok.eus" WHEN "bizkaia.net" THEN "bizkaia.eus" WHEN "elhuyar.org" THEN "elhuyar.eus" WHEN "realsociedad.com" THEN "realsociedad.eus" WHEN "unz.org" THEN "unz.com" WHEN "gipuzkoa.net" THEN "gipuzkoa.eus" WHEN "elcorreodigital.com" THEN "elcorreo.com" WHEN "naiz.info" THEN "naiz.eus" WHEN "euskadi.net" THEN "euskadi.eus" WHEN "deia.com" THEN "deia.eus" WHEN "euskotren.es" THEN "euskotren.eus" WHEN "guardian.co.uk" THEN "theguardian.com" WHEN "donostiakultura.com" THEN "donostiakultura.eus" WHEN "donostia.org" THEN "donostia.eus" WHEN "inguma.org" THEN "inguma.eus" WHEN "azkuefundazioa.org" THEN "azkuefundazioa.eus" WHEN "ru.net" THEN "faitid.org" ELSE main_domain END AS site FROM ( -- Extract all (page_id, main_domain) tuples SELECT page_id, -- Extract main_domain from domain. -- Examples: -- news.bbc.com -> bbc.com -- news.bbc.co.uk -> bbc.co.uk IF(domain REGEXP '^.*[-\\w]{4,}\\.[[:alpha:]]{2,}$', SUBSTRING_INDEX(domain, '.', -2), IF(domain REGEXP '^.*\\.(ac|edu|org|or|com|co|net|gov|gob|go)\\.[[:alpha:]]{2,}$', SUBSTRING_INDEX(domain, '.', -3), IF(domain REGEXP '.*\\.[[:alpha:]]{2}\\.(us|ca|za|it)$', SUBSTRING_INDEX(domain, '.', -3), SUBSTRING_INDEX(domain, '.', -2) ))) AS main_domain FROM ( -- Extract all (page_id, domain) tuples. SELECT el_from AS page_id, LOWER(CONVERT(REGEXP_REPLACE(el_to, '^(?:[^:/]+:)?//((?:[_0-9A-Za-z]+\\.)+[A-Za-z]+)(?:[/:?].*)?$', '\\1') USING utf8)) AS domain FROM externallinks WHERE -- Only links from main namespace articles. el_from_namespace = 0 -- Get only fairly well-formed URLs for http and https (or implicit https?) and exclude IPv4. AND el_to REGEXP '^(https?:)?//(?:[_0-9A-Za-z]+\\.)+[A-Za-z]+(?:[/:?].*)?$' ) AS domains ) AS main_domains ) AS sites GROUP BY site ) AS results ORDER BY article_count DESC, total_count DESC LIMIT 500;
By running queries you agree to the
Cloud Services Terms of Use
and you irrevocably agree to release your SQL under
CC0 License
.
Submit Query
Stop Query
All SQL code is licensed under
CC0 License
.
Checking query status...