Wikidata crawling
Sun 26 April 2020
Graph database representation (Photo credit: Wikipedia)
I want reliable data about vehicles. I decided to rely on one large source, namely Wikipedia. I chose it because it is reviewable and most of the time reviewed, and it is regularly updated and completed.
Wikipedia - Wikidata relationship
Wikidata items are made to designate one entity and its relationships to other entities. One entity is directly linked to one or more Wikipedia pages, one per language.
The link Wikipedia \(\to\) Wikidata is provided by a hyperlink named "Wikidata item" in English, and the link Wikidata \(\to\) Wikipedia appears in a box named "Wikipedia".
Wikipedia to Wikidata link
Wikidata to Wikipedia link
Wikipedia data
Characteristics are mainly stored in tables. I decided to retrieve them using classic crawling techniques with BeautifulSoup.
import requests
from bs4 import BeautifulSoup

def table_to_dict(table):
    """Extract property/value pairs from an html table.

    Args:
        table (bs4.element.Tag): html table tag

    Returns:
        (dict)
    """
    res = {}
    # skip navigation boxes, which contain no characteristics
    excluded_table = ("navbox", "navbox-inner", "navbox-subgroup")
    if any(word in table.attrs.get("class", [""]) for word in excluded_table):
        return res
    # layout 1: <th> holds the property name, <td> the value
    for row in table.find_all("tr"):
        prop = row.find("th")
        value = row.find("td")
        if prop and value:
            res[prop.text.strip()] = value.text.strip().replace("\xa0", " ")
    # layout 2: two <td> cells, property then value
    for row in table.find_all("tr"):
        try:
            prop, value = row.find_all("td")
        except ValueError:
            # not the right number of <td>
            continue
        prop = prop.text.strip().strip(":")
        value = value.text.strip().replace("\xa0", " ")
        res[prop] = value
    return res
def html_to_props(content):
    """Extract property/value pairs from all tables of an html page.

    Args:
        content (str): html

    Returns:
        (dict)
    """
    res = {}
    soup = BeautifulSoup(content, "html.parser")
    for table in soup.find_all("table"):
        res.update(table_to_dict(table))
    return res
def get_wikipedia_props(url):
    """Retrieve characteristic tables from wikipedia pages.

    Args:
        url (str): url

    Returns:
        (dict)
    """
    req = requests.get(url)
    return html_to_props(req.content)
>>> get_wikipedia_props("https://fr.wikipedia.org/wiki/Hermione_(1779)")
{'Autres noms': '« La frégate de la liberté »',
'Type': 'Frégate de 12',
'Classe': 'Concorde',
'Fonction': 'Navire de guerre',
'Gréement': 'Trois-mâts carré',
'Architecte': 'Henri Chevillard',
'Chantier naval': 'Arsenal de Rochefort',
'Fabrication': 'Coque en chêne',
'Lancement': '1779 (coule en 1793)',
'Équipage': '255 à 316 marins',
'Longueur': '66 m (mât de pavillon compris)',
'Longueur de coque': '44,20 m',
'Maître-bau': '11,5 mètres (11,24 ?)',
"Tirant d'eau": '4,94 m (5,78 ?)',
"Tirant d'air": '46,9 m',
'Déplacement': '1 166 tonnes à vide',
'Hauteur de mât': '56,5 m (grand mât)35 m (artimon)',
'Voilure': '2 200 à 3 315 m2',
'Vitesse': '14,5 nœuds (27 km/h)',
'Armement': '26 canons de 12 livres 8 canons de 6 livres',
'Armateur': 'Marine française',
'Pavillon': 'Marine royale française Marine de la République'}
Those characteristics are often available in multiple languages, which requires post-processing to improve uniformity.
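As a minimal sketch of such post-processing, the hypothetical mapping below renames a few of the French headers seen above to canonical English keys; a real pipeline would need one such mapping per language.

# Hypothetical mapping from French infobox headers to canonical English keys.
KEY_TRANSLATIONS = {
    "Longueur": "length",
    "Maître-bau": "beam",
    "Tirant d'eau": "draught",
    "Déplacement": "displacement",
}

def normalize_keys(props):
    """Rename known keys, leaving unknown ones untouched."""
    return {KEY_TRANSLATIONS.get(key, key): value for key, value in props.items()}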
Wikidata structure
Wikidata is a graph database linked to Wikipedia. Each entity is identified by a "Q number", an ID beginning with the letter "Q" followed by a variable number of digits. The graph is directed, and relationships are themselves items (e.g. "member of" is the item Q379825).
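A quick way to check what a Q number designates is the Special:EntityData endpoint, which returns an item's labels and claims as JSON. get_label() below is a minimal hypothetical helper, assuming the standard entity JSON layout; it is separate from the crawl functions.

import requests

def get_label(ref, lang="en"):
    """Fetch the label of a wikidata item (hypothetical helper)."""
    url = "https://www.wikidata.org/wiki/Special:EntityData/{}.json".format(ref)
    data = requests.get(url).json()
    return data["entities"][ref]["labels"][lang]["value"]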
Wikidata exploration
I created the following function to get a JSON representation of an item.
from collections import defaultdict
import requests

def get_item(ref):
    """Get a wikidata item and all its direct claims.

    Args:
        ref (str): wikidata item reference

    Returns:
        (dict): json
    """
    base_url = "https://query.wikidata.org/sparql"
    query = """SELECT DISTINCT ?p ?property_label ?value
    WHERE {{
      wd:{} ?p ?value .
      ?property wikibase:directClaim ?p ;
                rdfs:label ?property_label .
      FILTER(LANG(?property_label)="en")
    }}""".format(ref)
    headers = {"Accept": "application/json"}
    # let requests url-encode the query instead of concatenating it by hand
    req = requests.get(base_url, params={"query": query, "format": "json"}, headers=headers)
    return req.json()
def wikidata_to_dict(item):
    """Flatten a SPARQL response into a property/values dictionary.

    Args:
        item (dict): json output from get_item

    Returns:
        (dict): comprehensive dictionary
    """
    result = defaultdict(list)
    for prop in item["results"]["bindings"]:
        label = prop["property_label"]["value"]
        value = prop["value"]["value"]
        if prop["value"]["type"] == "uri":
            # keep only the entity reference, not the full url
            value = value.split("/")[-1]
        result[label].append(value)
    return result
>>> wikidata_to_dict(get_item("Q498614"))
defaultdict(list,
{'image': ['Combatlouisbourg400%20004210900%201924%2014072007.jpg'],
'instance of': ['Q11446'],
'country': ['Q142'],
'named after': ['Q658523'],
'Freebase ID': ['/m/04jf4sf'],
'inception': ['1779-01-01T00:00:00Z'],
'Encyclopædia Universalis ID': ['hermione-fregate'],
'derivative work': ['Q5502184'],
'Commons category': ['Hermione (ship, 1779)']})
I used lists because characteristics can be multivalued. This request returns numerical characteristics such as beam and draught for boats.
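Each value is a list even when there is only one, as the country of the frigate shows:

>>> wikidata_to_dict(get_item("Q498614"))["country"]
['Q142']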
Retrieve item IDs belonging to a specific category
The code speaks for itself:
import requests

def get_membership(ref, prop="P31"):
    """Yield the Q numbers of all items related to `ref` by `prop`.

    By default, prop="P31" ("instance of") lists the members of a class.

    Args:
        ref (str): wikidata item reference
        prop (str): wikidata property reference

    Yields:
        (str): Q number
    """
    base_url = "https://query.wikidata.org/sparql"
    query = """SELECT ?item
    WHERE
    {{
      ?item wdt:{} wd:{} .
    }}
    """.format(prop, ref)
    headers = {"Accept": "application/json"}
    req = requests.get(base_url, params={"query": query, "format": "json"}, headers=headers)
    res = req.json()
    for member in res["results"]["bindings"]:
        url = member["item"]["value"]
        yield url.split("/")[-1]
The SPARQL response contains full entity URLs, so I decided to extract the ID from the URL.
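As a usage example, the frigate above is an instance of Q11446, so all items of that class can be enumerated (the result set may be large, so the query can take a while):

>>> ship_ids = set(get_membership("Q11446"))  # Q11446: the class seen in "instance of" above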
Relationship Wikidata - Wikipedia
Wikidata to Wikipedia
To get Wikipedia page URLs, I coded the following function using BeautifulSoup.
from bs4 import BeautifulSoup

def wikidatapage_to_wikipedialinks(item):
    """Yield the wikipedia urls linked from a wikidata page.

    Args:
        item (str): ref

    Yields:
        (str): url
    """
    req = requests.get("https://www.wikidata.org/wiki/{}".format(item))
    soup = BeautifulSoup(req.content, "html.parser")
    for tag in soup.find_all("div"):
        # the sitelinks box lists one wikipedia page per language
        if "wikibase-sitelinklistview" not in tag.attrs.get("class", ""):
            continue
        for href in tag.find_all("a"):
            url = href.attrs["href"]
            if "wikipedia.org" in url:
                yield url
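For example, on the frigate item, each yielded URL points to one language edition:

>>> urls = list(wikidatapage_to_wikipedialinks("Q498614"))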
Wikipedia to Wikidata
Here again, I used BeautifulSoup.
def wikipedia_to_wikidata(url):
    """Retrieve the wikidata item number of a wikipedia page.

    Args:
        url (str): wikipedia url

    Returns:
        (str): reference
    """
    req = requests.get(url)
    soup = BeautifulSoup(req.content, "html.parser")
    # the "Wikidata item" link lives in the <li id="t-wikibase"> sidebar entry
    wikilink = next(
        li
        for li in soup.find_all("li")
        if "id" in li.attrs and li.attrs["id"] == "t-wikibase"
    )
    tag = wikilink.find("a")
    try:
        return tag.attrs["href"].split("/")[-1]
    except (AttributeError, KeyError):
        # no <a> tag or no href attribute
        return ""
>>> wikipedia_to_wikidata("https://en.wikipedia.org/wiki/French_frigate_Hermione_(2014)")
'Q5502184'
Putting it all together
Starting from a Q item ID, gathering all properties looks like this:
def get_all_props(ref):
    """Merge wikidata claims with wikipedia table characteristics.

    Args:
        ref (str): Q number

    Returns:
        (dict): all properties
    """
    res = wikidata_to_dict(get_item(ref))
    wikipedia_urls = wikidatapage_to_wikipedialinks(ref)
    for url in wikipedia_urls:
        res.update(get_wikipedia_props(url))
    return res
Be careful: some properties are lists (those coming from Wikidata) and others are strings (those coming from Wikipedia).
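If uniform types matter downstream, one option is to wrap every bare string in a one-element list; to_lists() is a hypothetical helper, not part of the functions above.

def to_lists(props):
    """Wrap bare strings in one-element lists so every value is a list."""
    return {
        key: value if isinstance(value, list) else [value]
        for key, value in props.items()
    }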
Mass crawling
Let's start with an iterable of Q item IDs. I'll put them into a set to ensure uniqueness, but any iterable is fine. I won't go into detail about retrieving those IDs, but you may imagine that the function get_membership() defined earlier is useful.
qitems = {
    "Q877343",
    "Q3079847",
    "Q3445980",
    "Q3017039",
    "Q16222749",
    "Q3153932",
}
Wikidata limits the query rate, and if the list of items is long, monitoring is desirable.
from datetime import datetime
from time import sleep

all_results = {}
for i, ref in enumerate(qitems):
    if ref in all_results:
        # speeds up restarts if all_results is partially filled
        continue
    try_nb = 0
    while try_nb < 10 and ref not in all_results:
        try:
            all_results[ref] = get_all_props(ref)
        except Exception:  # the exception should be more specific
            now = datetime.now().isoformat()
            print("{} WARNING item {} (try {}/10)".format(now, ref, try_nb))
            try_nb += 1
            # back off longer after each failed try
            sleep(60 * try_nb)
    if i % 100 == 0:
        now = datetime.now().isoformat()
        print("{} STATUS: item {} / {}".format(now, i, len(qitems)))
The result is a Python dict that can easily be serialized to JSON.
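For instance (the file name is just an illustration):

import json

with open("vehicles.json", "w") as f:
    # defaultdict is a dict subclass, so it serializes like a plain dict
    json.dump(all_results, f, ensure_ascii=False, indent=2)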