Wikidata crawling

Sun 26 April 2020
Graph database representation (Photo credit: Wikipedia)

I wanted reliable data about vehicles, so I decided to rely on one large source: Wikipedia. I chose it because its content is open to review, is reviewed most of the time, and is regularly updated and extended.

Wikipedia - Wikidata relationship

A Wikidata item designates one entity and its relationships to other entities. One entity can be directly linked to one or more Wikipedia pages, one per language.

The Wikipedia \(\to\) Wikidata link is provided by a hyperlink named "Wikidata item" in English, and the Wikidata \(\to\) Wikipedia link appears in a box named "Wikipedia".

Wikipedia to Wikidata link
Wikidata to Wikipedia link

Wikipedia data

Characteristics are mainly stored in tables. I decided to retrieve them using classic scraping techniques with BeautifulSoup.

import requests
from bs4 import BeautifulSoup

def table_to_dict(table):
    """Convert an HTML table into a dictionary

    Args:
        table (bs4.element.tag): html table tag

    Returns:
        (dict): property name to value
    """
    res = {}
    excluded_table = ("navbox", "navbox-inner", "navbox-subgroup")
    # skip navigation boxes, which carry no characteristics
    if any(word in table.attrs.get("class", [""]) for word in excluded_table):
        return res
    # rows of the form <th>property</th><td>value</td>
    for row in table.find_all("tr"):
        prop = row.find("th")
        value = row.find("td")
        if prop and value:
            res[prop.text.strip()] = value.text.strip().replace("\xa0", " ")
    # rows of the form <td>property</td><td>value</td>
    for row in table.find_all("tr"):
        try:
            prop, value = row.find_all("td")
        except ValueError:
            # not the right number of <td>
            continue
        prop = prop.text.strip().strip(":")
        value = value.text.strip().replace("\xa0", " ")
        res[prop] = value
    return res

def html_to_props(content):
    """Collect the properties of every table in a page

    Args:
        content (str): html
    """
    res = {}
    soup = BeautifulSoup(content, "html.parser")
    for table in soup.find_all("table"):
        res.update(table_to_dict(table))
    return res

def get_wikipedia_props(url):
    """Retrieve characteristic tables from wikipedia pages

    Args:
        url (str): url
    """
    req = requests.get(url)
    return html_to_props(req.content)
>>> get_wikipedia_props("")
{'Autres noms': '« La frégate de la liberté »',
 'Type': 'Frégate de 12',
 'Classe': 'Concorde',
 'Fonction': 'Navire de guerre',
 'Gréement': 'Trois-mâts carré',
 'Architecte': 'Henri Chevillard',
 'Chantier naval': 'Arsenal de Rochefort',
 'Fabrication': 'Coque en chêne',
 'Lancement': '1779 (coule en 1793)',
 'Équipage': '255 à 316 marins',
 'Longueur': '66 m (mât de pavillon compris)',
 'Longueur de coque': '44,20 m',
 'Maître-bau': '11,5 mètres (11,24 ?)',
 "Tirant d'eau": '4,94 m (5,78 ?)',
 "Tirant d'air": '46,9 m',
 'Déplacement': '1 166 tonnes à vide',
 'Hauteur de mât': '56,5 m (grand mât)35 m (artimon)',
 'Voilure': '2 200 à 3 315 m2',
 'Vitesse': '14,5 nœuds (27 km/h)',
 'Armement': '26 canons de 12 livres 8 canons de 6 livres',
 'Armateur': 'Marine française',
 'Pavillon': 'Marine royale française Marine de la République'}

These characteristics are often available in several languages, so post-processing is needed to make them uniform.
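One way to approach that post-processing is sketched below. The label mapping and the canonical key names (LABEL_MAP, hull_length and so on) are hypothetical and cover only a few of the French labels shown above. The number parser assumes French formatting: a decimal comma and spaces as thousands separators.

```python
import re

# Hypothetical mapping from localized labels to canonical keys.
LABEL_MAP = {
    "Longueur de coque": "hull_length",
    "Tirant d'eau": "draft",
    "Vitesse": "speed",
    "Déplacement": "displacement",
}

# A number with optional space-separated thousands groups and a decimal comma.
NUMBER = re.compile(r"\d{1,3}(?:[ \u00a0]\d{3})+(?:,\d+)?|\d+(?:,\d+)?")


def first_number(text):
    """Return the first French-formatted number found in text, or None."""
    match = NUMBER.search(text)
    if match is None:
        return None
    raw = match.group(0).replace(" ", "").replace("\u00a0", "").replace(",", ".")
    return float(raw)


def normalize(props):
    """Keep only mapped labels and convert their values to floats."""
    result = {}
    for label, value in props.items():
        key = LABEL_MAP.get(label)
        if key is None:
            continue
        result[key] = first_number(value)
    return result


sample = {
    "Longueur de coque": "44,20 m",
    "Tirant d'eau": "4,94 m (5,78 ?)",
    "Déplacement": "1 166 tonnes à vide",
    "Type": "Frégate de 12",
}
print(normalize(sample))
```

Only the first number of each value is kept, so alternative figures such as "(5,78 ?)" are dropped. Units are ignored here and would need their own handling.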

Wikidata structure

Wikidata is a graph database linked to Wikipedia. Each entity is identified by a "Q number", an ID made of the letter "Q" followed by a variable number of digits. The graph is directed, and relationships are themselves objects (e.g. "member of" is the item Q379825).

Wikidata exploration

I created the following function to get a JSON representation of an item.

from collections import defaultdict

import requests

def get_item(ref):
    """Get a Wikidata item

    Args:
        ref (str): wikidata item reference

    Returns:
        (dict): json
    """
    base_url = ""
    # doubled braces are literal braces for str.format
    query = """SELECT DISTINCT ?p ?property_label ?value
    WHERE {{
       wd:{} ?p ?value .
       ?property wikibase:directClaim ?p ;
           rdfs:label ?property_label .
    }}""".format(ref)
    headers = {"Accept": "application/json"}
    req = requests.get(base_url + query + "&format=json", headers=headers)
    return req.json()

def wikidata_to_dict(item):
    """Convert a Wikidata item to a dictionary

    Args:
        item (dict): json output from get_item

    Returns:
        (dict): comprehensive dictionary
    """
    result = defaultdict(list)
    for prop in item["results"]["bindings"]:
        label = prop["property_label"]["value"]
        value = prop["value"]["value"]
        if prop["value"]["type"] == "uri":
            # keep only the identifier at the end of the URI
            value = value.split('/')[-1]
        result[label].append(value)
    return result
>>> wikidata_to_dict(get_item("Q498614"))
defaultdict(list,
            {'image': ['Combatlouisbourg400%20004210900%201924%2014072007.jpg'],
             'instance of': ['Q11446'],
             'country': ['Q142'],
             'named after': ['Q658523'],
             'Freebase ID': ['/m/04jf4sf'],
             'inception': ['1779-01-01T00:00:00Z'],
             'Encyclopædia Universalis ID': ['hermione-fregate'],
             'derivative work': ['Q5502184'],
             'Commons category': ['Hermione (ship, 1779)']})

I used lists because a characteristic can have several values. This query also returns numerical characteristics, such as beam and draft for boats.
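To see why lists are needed, here is a minimal offline sketch that groups a hand-written response laid out like the SPARQL JSON results format. The payload is sample data, the second "instance of" value is invented for illustration, and group_bindings is a hypothetical stand-in that mirrors the grouping logic without calling the query service.

```python
from collections import defaultdict

# Hand-written sample in the SPARQL JSON results layout (not real query output).
sample = {
    "results": {
        "bindings": [
            {
                "property_label": {"type": "literal", "value": "instance of"},
                "value": {"type": "uri", "value": "http://www.wikidata.org/entity/Q11446"},
            },
            {
                # invented second value, to show a multi-valued property
                "property_label": {"type": "literal", "value": "instance of"},
                "value": {"type": "uri", "value": "http://www.wikidata.org/entity/Q99999999"},
            },
            {
                "property_label": {"type": "literal", "value": "inception"},
                "value": {"type": "literal", "value": "1779-01-01T00:00:00Z"},
            },
        ]
    }
}


def group_bindings(item):
    """Group binding values by property label, one list per label."""
    grouped = defaultdict(list)
    for binding in item["results"]["bindings"]:
        value = binding["value"]["value"]
        if binding["value"]["type"] == "uri":
            # keep only the identifier at the end of the URI
            value = value.rsplit("/", 1)[-1]
        grouped[binding["property_label"]["value"]].append(value)
    return dict(grouped)


print(group_bindings(sample))
```

With a plain dict and assignment, the second "instance of" value would silently replace the first one.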

Retrieve item ID belonging to a specific category

The code speaks for itself:

import requests

def get_membership(ref, prop="P31"):
    """Yield the IDs of the items linked to ref by prop"""
    base_url = ""
    query = """SELECT ?item
               WHERE {{
                  ?item wdt:{} wd:{}.
               }}""".format(
        prop, ref
    )
    headers = {"Accept": "application/json"}
    req = requests.get(base_url + query + "&format=json", headers=headers)
    res = req.json()
    for member in res["results"]["bindings"]:
        url = member["item"]["value"]
        # the ID is the last segment of the entity URL
        yield url.split("/")[-1]

I chose to extract the ID from the entity URL.
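Splitting on "/" works as long as every value really is an entity URL. A slightly more defensive variant is sketched here, assuming the URLs look like http://www.wikidata.org/entity/Q498614. The helper name entity_id is hypothetical.

```python
import re

# The ID is expected as the last path segment: "Q" followed by digits.
ENTITY_ID = re.compile(r"/(Q[1-9]\d*)$")


def entity_id(url):
    """Return the Q number at the end of url, or None if there is none."""
    match = ENTITY_ID.search(url)
    return match.group(1) if match else None


print(entity_id("http://www.wikidata.org/entity/Q498614"))
print(entity_id("http://example.org/not-an-item"))
```

Returning None instead of an arbitrary last segment makes unexpected values easy to filter out before they reach the crawler.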

Relationship Wikidata - Wikipedia

Wikidata to Wikipedia

To get Wikipedia page URLs, I coded the following function using BeautifulSoup.

import requests
from bs4 import BeautifulSoup
from bs4.element import Tag

def wikidatapage_to_wikipedialinks(item):
    """Yield the Wikipedia links listed on a Wikidata page

    Args:
        item (str): ref

    Yields:
        (str): url
    """
    req = requests.get("{}".format(item))
    soup = BeautifulSoup(req.content, "html.parser")
    for tag in soup.find_all("div"):
        # only the sitelink boxes hold links to Wikipedia pages
        if "wikibase-sitelinklistview" not in tag.attrs.get("class", ""):
            continue
        for href in tag.find_all("a"):
            url = href.attrs["href"]
            if "" in url:
                yield url

Wikipedia to Wikidata

Here again, I used BeautifulSoup.

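def wikipedia_to_wikidata(url):
    """Retrieve wikidata item number

    Args:
        url (string): wikipedia url

    Returns:
        (string): reference
    """
    req = requests.get(url)
    soup = BeautifulSoup(req.content, "html.parser")
    # the "Wikidata item" link sits in the <li id="t-wikibase"> entry
    wikilink = next(
        (
            li
            for li in soup.find_all("li")
            if "id" in li.attrs and li.attrs["id"] == "t-wikibase"
        ),
        None,
    )
    if wikilink is None:
        return ""
    tag = wikilink.find("a")
    try:
        return tag.attrs["href"].split("/")[-1]
    except (AttributeError, KeyError):
        # no usable link in the entry
        return ""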
>>> wikipedia_to_wikidata("")
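The lookup can also be tried offline on a small snippet. The sketch below swaps BeautifulSoup for the standard library's html.parser so that it runs without extra dependencies. The HTML is a hand-written approximation of the sidebar entry, not a copy of a real page, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class WikidataLinkParser(HTMLParser):
    """Remember the target of the first link inside <li id="t-wikibase">."""

    def __init__(self):
        super().__init__()
        self.inside_entry = False
        self.ref = ""

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "li":
            self.inside_entry = attributes.get("id") == "t-wikibase"
        elif tag == "a" and self.inside_entry and not self.ref:
            href = attributes.get("href") or ""
            # the reference is the last segment of the link target
            self.ref = href.rstrip("/").split("/")[-1]

    def handle_endtag(self, tag):
        if tag == "li":
            self.inside_entry = False


def find_wikidata_ref(html):
    """Return the item reference found in html, or "" if there is none."""
    parser = WikidataLinkParser()
    parser.feed(html)
    return parser.ref


# Hand-written snippet, only an approximation of a real sidebar.
snippet = (
    '<ul><li id="t-info"><a href="/wiki/info">Page information</a></li>'
    '<li id="t-wikibase">'
    '<a href="https://www.wikidata.org/wiki/Special:EntityPage/Q498614">'
    "Wikidata item</a></li></ul>"
)
print(find_wikidata_ref(snippet))
```

Real pages may use a different markup depending on the skin, so the id to look for is worth checking against the page source before a large crawl.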

Put it all together

Starting from a Q item ID, getting all of its properties comes down to:

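def get_all_props(ref):
    """Gather the Wikidata and Wikipedia properties of an item

    Args:
        ref (str): Q number

    Returns:
        (dict): all properties
    """
    res = wikidata_to_dict(get_item(ref))

    wikipedia_urls = wikidatapage_to_wikipedialinks(ref)
    for url in wikipedia_urls:
        res.update(get_wikipedia_props(url))

    return res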

Be careful: some properties are lists (those coming from Wikidata), others are strings (those coming from Wikipedia).
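If a uniform shape is easier to work with downstream, one option is to wrap the string values in lists. This is a minimal sketch, and as_lists is a hypothetical helper name.

```python
def as_lists(props):
    """Return a copy of props where every value is a list."""
    return {
        key: list(value) if isinstance(value, (list, tuple, set)) else [value]
        for key, value in props.items()
    }


mixed = {"country": ["Q142"], "Type": "Frégate de 12"}
print(as_lists(mixed))
```

The input dictionary is left untouched, so the raw crawl results stay available.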

Mass crawling

Let's start with an iterable of Q item IDs. I'll put them into a set to ensure uniqueness, but any iterable is OK. I won't go into the details of retrieving those IDs, but the get_membership() function defined earlier is useful for that.

qitems = {

Wikidata limits the query rate, and if the list of items is long, monitoring is desirable.

from datetime import datetime
from time import sleep

all_results = {}

for i, ref in enumerate(qitems):
    if ref in all_results:
        # speedup restarts if all_results is partially filled
        continue
    try_nb = 0
    while try_nb < 10 and ref not in all_results:
        try:
            all_results[ref] = get_all_props(ref)
        except Exception:  # a more specific exception should be used
            try_nb += 1
            now = datetime.now()
            print("{} WARNING item {} (try {}/10)".format(now, ref, try_nb))
            # wait longer after each failed attempt
            sleep(60 * try_nb)
    if i % 100 == 0:
        now = datetime.now()
        print("{} STATUS: item {} / {}".format(now, i, len(qitems)))

The result is a Python dict that can easily be serialized to JSON.
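A minimal sketch of that serialization, using only the standard json module. The helper names and the file name are hypothetical, and the sample stands in for real crawl results.

```python
import json
import os
import tempfile


def save_results(results, path):
    """Write results to path as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(results, handle, ensure_ascii=False, indent=2)


def load_results(path):
    """Read results back, or return an empty dict if the file is missing."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Round trip on sample data in a temporary directory.
sample = {"Q498614": {"country": ["Q142"], "Type": "Frégate de 12"}}
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "results.json")
    save_results(sample, path)
    restored = load_results(path)
print(restored == sample)
```

Loading such a file into all_results before the loop lets the "if ref in all_results" check skip the items that were already fetched in a previous run.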

Category: how to Tagged: python wikipedia wikidata html
