Wikipedia crawling (part II)

Thu 11 June 2020

This article is the follow-up to the one about wikidata crawling.

Wikipedia has specific infobox templates. They are the normalized way to enter specifications inside wikipedia articles: each template comes with predefined fields. For example, the planet template has fields such as periapsis or circumference. Other parameters are then computed and displayed from those fields. This normalization avoids duplicates and conversion errors (e.g. fields are automatically displayed in both meters and feet when relevant, while the information is provided in meters only).

My goal for the rest of this article is to retrieve those normalized parameters for one specific article.

\begin{equation*} \fbox{URL} \to \fbox{wiki source} \to \fbox{infobox} \to \fbox{python dict} \end{equation*}

Retrieve wikipedia page source

The wikipedia source is accessible through the "edit" or "view source" tab.

(Screenshot: the wikipedia "edit" tab)

I decided to retrieve the content of the editable text by crawling the webpage. I'm confident better ways to retrieve this content exist, but I didn't take the time to look for them.

The edit URL can be deduced from the wikipedia page URL. The editable source is inside a textarea HTML tag.

import urllib.parse

import requests
from bs4 import BeautifulSoup


def get_source(url):
    """Retrieve the wikimedia source of a wikipedia page.

    Args:
        url (str): wikipedia page url
    Returns:
        (str) wikimedia source
    """
    source_url = _get_source_url(url)
    req = requests.get(source_url)
    soup = BeautifulSoup(req.content, "html.parser")
    # the editable source is inside a textarea on the edit page
    source_tag = soup.find("textarea")
    if source_tag is None:
        # no editable source found
        return ""
    return source_tag.text


def _get_source_url(url):
    """Build the edit page url from the article url."""
    parsed = urllib.parse.urlparse(url)
    base_url = "{}://{}/".format(parsed.scheme, parsed.netloc)
    # the article title is the last component of the path
    title = parsed.path.split("/")[-1]
    return base_url + "w/index.php?title={}&action=edit".format(title)
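
A quick sanity check, assuming network access (the Douglas Adams page is the article used as an example below):

source = get_source("https://en.wikipedia.org/wiki/Douglas_Adams")
print(source[:40])  # first characters of the raw wikimedia source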

Retrieve infobox

Infoboxes are delimited with {{ and }}. The goal is to find those patterns. Such patterns can be nested.

I coded an iterator over infoboxes based on those patterns. Unfortunately, I had to make the function recursive to deal with nested infoboxes.

import re

START = re.compile(r"{{")
END = re.compile(r"}}")


def _find_next_start(source, pos, start):
    """Return the position right after the next {{, or the end of source."""
    try:
        return start.search(source, pos).end()
    except AttributeError:
        # no more matches, search() returned None
        return len(source)


def iter_specs(source, start=START, end=END):
    """Yield every {{ ... }} pattern found in source, including nested ones."""
    for match in start.finditer(source):
        start_pos = match.start()
        next_start = _find_next_start(source, match.end(), start)
        end_matches = end.finditer(source, start_pos)

        for end_match in end_matches:
            if end_match.end() > next_start:
                # another {{ opens before this }} closes: the pattern is
                # nested, recurse on the inner part
                yield from iter_specs(source[next_start : end_match.end()], start, end)
                next_start = _find_next_start(source, end_match.end(), start)
                continue
            else:
                # this }} closes the current {{: yield the whole pattern
                yield source[start_pos : end_match.end()]
                break
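
A quick check on a small nested string shows that both the outer pattern and the nested one are yielded:

sample = "{{outer {{inner}} end}}"
print(list(iter_specs(sample)))
# ['{{outer {{inner}} end}}', '{{inner}}']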

Get specs

An extracted infobox now looks like:

{{Infobox writer
| name          = Douglas Adams
| image         = Douglas adams portrait cropped.jpg
| caption       =
| birth_name    = Douglas Noel Adams
| birth_date    = {{birth date|1952|3|11|df=yes}}
| birth_place   = [[Cambridge]], [[Cambridgeshire]], England
| death_date    = {{Death date and age|2001|5|11|1952|3|11|df=yes}}
| death_place   = [[Montecito, California]], US
| resting_place   = [[Highgate Cemetery]], London, England
| occupation    = Writer
| alma_mater    = [[St John's College, Cambridge]]
| genre         = [[Science fiction]], [[comedy]], [[satire]]
|notablework =''[[The Hitchhiker's Guide to the Galaxy]]''
|signature= Douglas Adams Unterschrift (cropped).jpg
| website       = {{URL|douglasadams.com}}
}}

A simple and easy way to retrieve the information is to split fields on a newline followed by |, and each key/value pair on the first =:

import re


def spec_to_dict(spec):
    """Convert an infobox source into a python dict."""
    props = {}
    # fields start with a | at the beginning of a line, so pipes inside
    # values (e.g. {{birth date|1952|3|11}}) are left untouched
    for prop in re.split(r"\n\|", spec):
        if "=" not in prop:
            continue
        key, value = prop.split("=", 1)
        # drop the wiki markup around links and templates
        for repl in ("[[", "]]", "{{", "}}"):
            value = value.replace(repl, "")
        props[key.strip()] = value.strip(" '\"\n")
    return props

With the previous example, we get:

{'name': 'Douglas Adams',
 'image': 'Douglas adams portrait cropped.jpg',
 'caption': '',
 'birth_name': 'Douglas Noel Adams',
 'birth_date': 'birth date|1952|3|11|df=yes',
 'birth_place': 'Cambridge, Cambridgeshire, England',
 'death_date': 'Death date and age|2001|5|11|1952|3|11|df=yes',
 'death_place': 'Montecito, California, US',
 'resting_place': 'Highgate Cemetery, London, England',
 'occupation': 'Writer',
 'alma_mater': "St John's College, Cambridge",
 'genre': 'Science fiction, comedy, satire',
 'notablework': "The Hitchhiker's Guide to the Galaxy",
 'signature': 'Douglas Adams Unterschrift (cropped).jpg',
 'website': 'URL|douglasadams.com'}
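
To close the loop of the diagram above (URL → wiki source → infobox → python dict), here is a minimal sketch of a driver. get_infoboxes is a hypothetical helper, and filtering on "{{Infobox" is an assumption to skip the other {{ }} templates yielded by iter_specs:

def get_infoboxes(url):
    """Hypothetical driver: url -> wiki source -> infoboxes -> python dicts."""
    source = get_source(url)
    # assumption: infoboxes start with "{{Infobox"; other {{ }} templates
    # (dates, URLs, ...) are skipped
    return [
        spec_to_dict(spec)
        for spec in iter_specs(source)
        if spec.startswith("{{Infobox")
    ]


infoboxes = get_infoboxes("https://en.wikipedia.org/wiki/Douglas_Adams")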

Future work

There is still some preprocessing to perform on the retrieved data, but as is, it is already exploitable for numerical properties. Future work includes:

  • handle internal links (see the sketch after this list)
  • handle other kinds of information enclosed in {{ }}
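
For internal links, a first step could be a regex substitution that keeps the label of [[target|label]] links and the target of plain [[target]] links. A minimal sketch, not handling every corner case of the wiki syntax:

import re

LINK = re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]")


def strip_links(value):
    """Replace [[target|label]] with label and [[target]] with target."""
    return LINK.sub(r"\1", value)


print(strip_links("[[Montecito, California|Montecito]], US"))  # Montecito, US

To be useful, it would have to run before the bracket-stripping loop in spec_to_dict, which currently erases the [[ ]] markers.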

