Toggle navigation
Home
New Query
Recent Queries
Discuss
Database tables
Database names
MediaWiki
Wikibase
Replicas browser and optimizer
Login
History
Fork
This query is marked as a draft
This query has been published
by
MarioGom
.
This query is based on https://quarry.wmflabs.org/query/26224 It is a refined version focused on sites instead of domains. It differs from the domain listing in the following aspects: * Domains that have a permanent redirection are changed to the new domain (e.g. ehu.es -> ehu.eus, blogspot.com -> blogger.com). This process is only semi-automated and results might not be fully accurate. Top 500 are shown, according to number of articles containing links to the domain. Script used to check most redirects was: cat quarry.csv | cut -d, -f1 | while read d ; do to=$(curl -s -I http://$d | grep '^Location:' | sed -r 's|^Location:\s\s*https?://(www\.)?([^:/?]*)([:/?].*)?|\2|g'); [[ ! -z $to && $d != $to ]] && echo "$d $to" ; done
Toggle Highlighting
SQL
USE euwiki_p; SELECT * FROM ( SELECT site, COUNT(DISTINCT page_id) AS article_count, COUNT(*) AS total_count FROM ( SELECT page_id, CASE main_domain -- FIXME: wmflabs.org subdomains should not be merged WHEN "wmflabs.org" THEN "wikitech.wikimedia.org" WHEN "euskaltzaindia.net" THEN "euskaltzaindia.eus" WHEN "paleodb.org" THEN "fossilworks.org" WHEN "faunaeur.org" THEN "fauna-eu.org" WHEN "berria.info" THEN "berria.eus" WHEN "argia.com" THEN "argia.eus" WHEN "armiarma.com" THEN "armiarma.eus" WHEN "blogspot.com" THEN "blogger.com" WHEN "ehu.es" THEN "ehu.eus" WHEN "zientzia.net" THEN "zientzia.eus" WHEN "gara.net" THEN "naiz.eus" WHEN "hiru.com" THEN "hiru.eus" WHEN "badok.info" THEN "badok.eus" WHEN "bizkaia.net" THEN "bizkaia.eus" WHEN "elhuyar.org" THEN "elhuyar.eus" WHEN "realsociedad.com" THEN "realsociedad.eus" WHEN "unz.org" THEN "unz.com" WHEN "gipuzkoa.net" THEN "gipuzkoa.eus" WHEN "elcorreodigital.com" THEN "elcorreo.com" WHEN "naiz.info" THEN "naiz.eus" WHEN "euskadi.net" THEN "euskadi.eus" WHEN "deia.com" THEN "deia.eus" WHEN "euskotren.es" THEN "euskotren.eus" WHEN "guardian.co.uk" THEN "theguardian.com" WHEN "bbc.co.uk" THEN "bbc.com" WHEN "donostiakultura.com" THEN "donostiakultura.eus" WHEN "donostia.org" THEN "donostia.eus" WHEN "inguma.org" THEN "inguma.eus" WHEN "jakingunea.com" THEN "jakin.eus" WHEN "sustatu.com" THEN "sustatu.eus" WHEN "ueu.org" THEN "ueu.eus" WHEN "goiena.net" THEN "goiena.eus" WHEN "bertsozale.com" THEN "bertsozale.eus" WHEN "eke.org" THEN "eke.eus" WHEN "euskaraz.net" THEN "donostiaeuskaraz.eus" WHEN "royalsocietypublishing.org" THEN "royalsociety.org" WHEN "outdoors.fi" THEN "nationalparks.fi" WHEN "alava.net" THEN "araba.eus" WHEN "xunta.es" THEN "xunta.gal" WHEN "ahotsak.com" THEN "ahotsak.eus" -- FIXME: euskarakultur.org xn--t8j4aa8f8dzk9dqey896e4z2b.com WHEN "oscar.com" THEN "oscar.go.com" WHEN "ttipi.net" THEN "erran.eus" WHEN "blogak.com" THEN "blogak.eus" WHEN "ligaact.com" THEN "euskolabelliga.com" WHEN "idazleak.org" THEN "idazleak.eus" WHEN "gencat.net" THEN "gencat.cat" WHEN "azkuefundazioa.org" THEN "azkuefundazioa.eus" WHEN "eustat.es" THEN "eustat.eus" WHEN "bilbokokonpartsak.com" THEN "bilbokokonpartsak.eus" WHEN "diba.es" THEN "diba.cat" WHEN "elkarlanean.com" THEN "elkarargitaletxea.eus" WHEN "mma.es" THEN "mapama.gob.es" WHEN "lavanguardia.es" THEN "lavanguardia.com" WHEN "govoffice2.com" THEN "govoffice.com" WHEN "ru.net" THEN "faitid.org" WHEN "cfnavarra.es" THEN "navarra.es" WHEN "eitb.com" THEN "eitb.eus" WHEN "hitza.info" THEN "hitza.eus" WHEN "unav.es" THEN "unav.edu" WHEN "euskaltzaindia.fr" THEN "euskaltzaindia.eus" WHEN "noticiasdealava.com" THEN "noticiasdealava.eus" ELSE main_domain END AS site FROM ( -- Extract all (page_id, main_domain) tuples SELECT page_id, -- Extract main_domain from domain. -- Examples: -- news.bbc.com -> bbc.com -- news.bbc.co.uk -> bbc.co.uk CASE WHEN domain REGEXP '^.*[-\\w]{4,}\\.[[:alpha:]]{2,}$' THEN SUBSTRING_INDEX(domain, '.', -2) WHEN domain REGEXP '^.*\\.(ac|edu|org|or|com|co|net|gov|gob|go)\\.[[:alpha:]]{2,}$' THEN SUBSTRING_INDEX(domain, '.', -3) WHEN domain REGEXP '.*\\.[[:alpha:]]{2}\\.(us|ca|za|it)$' THEN SUBSTRING_INDEX(domain, '.', -3) WHEN domain LIKE '%.gouv.fr' THEN SUBSTRING_INDEX(domain, '.', -3) WHEN domain LIKE '%.govt.nz' THEN SUBSTRING_INDEX(domain, '.', -3) WHEN domain LIKE '%.blogger.com' THEN SUBSTRING_INDEX(domain, '.', -3) WHEN domain LIKE '%.wordpress.com' THEN SUBSTRING_INDEX(domain, '.', -3) WHEN domain LIKE '%.wikimedia.org' THEN SUBSTRING_INDEX(domain, '.', -3) WHEN domain LIKE '%.wikipedia.org' THEN SUBSTRING_INDEX(domain, '.', -3) ELSE SUBSTRING_INDEX(domain, '.', -2) END AS main_domain FROM ( -- Extract all (page_id, domain) tuples. SELECT el_from AS page_id, REGEXP_REPLACE(LOWER(CONVERT(SUBSTRING(el_to FROM 1 FOR 255) USING utf8)), '^(?:[^:/]+:)?//(?:www[0-9]*)?((?:[-_0-9a-z]+\\.)+[a-z]+)(?:[/:?].*)?$', '\\1') AS domain FROM externallinks WHERE -- Only links from main namespace articles. el_from_namespace = 0 -- Only http, https, implicit https (//) and ftp. el_index is already lowercase. AND (el_index LIKE 'http://%' OR el_index LIKE 'https://%' OR el_index LIKE '//%' OR el_index LIKE 'ftp://%') -- Get only fairly well-formed URLs and exclude IPv4. AND SUBSTRING(el_to FROM 1 FOR 40) REGEXP '(?i)^(?:https?:|ftp:)?//(?:[-_0-9a-z]+\\.)+[a-z]+(?:[/:?].*)?$' ) AS domains ) AS main_domains ) AS sites GROUP BY site ) AS results ORDER BY article_count DESC, total_count DESC LIMIT 500;
By running queries you agree to the
Cloud Services Terms of Use
and you irrevocably agree to release your SQL under
CC0 License
.
Submit Query
Stop Query
All SQL code is licensed under
CC0 License
.
Checking query status...