Wikipedia crawling (part II)

Thu 11 June 2020
Crawling (Photo credit: Wikipedia)

This article is the follow-up to the one about Wikidata crawling.

Wikipedia has specific infobox templates. They are the normalized way to enter specifications inside Wikipedia articles: each template provides predefined fields. For example, the planet template has fields such as periapsis or circumference. Other parameters are computed from those fields and displayed. This normalization avoids duplicates and conversion errors (e.g. fields are automatically displayed in meters and feet when relevant, while the information is provided in meters only).

My goal for the rest of this article is to retrieve those normalized parameters for one specific article.

\begin{equation*} \fbox{URL} \to \fbox{wiki source} \to \fbox{infobox} \to \fbox{python dict} \end{equation*}

Retrieve Wikipedia page source

The Wikipedia source is accessible through the "edit" or "view source" tab.

wikipedia edit tab

I decided to retrieve the editable text by crawling the web page. I am confident that better ways to retrieve this content exist, but I did not take the time to look for them.

The URL of the edit page can be derived from the Wikipedia article URL. The editable source is inside a textarea HTML tag.

import urllib.parse

import requests
from bs4 import BeautifulSoup


def get_source(url):
    """Retrieve the wikitext source of a Wikipedia article.

    Args:
        url (str): Wikipedia page URL
    Returns:
        (str) wikitext source, or "" if no textarea is found
    """
    source_url = _get_source_url(url)
    req = requests.get(source_url)
    # Name the parser explicitly to avoid the BeautifulSoup warning
    soup = BeautifulSoup(req.content, "html.parser")
    source_tag = soup.find("textarea")
    if source_tag is None:
        return ""
    return source_tag.text


def _get_source_url(url):
    """Build the URL of the edit page from the article URL."""
    parsed = urllib.parse.urlparse(url)
    base_url = "{}://{}/".format(parsed.scheme, parsed.netloc)

    # The article title is the last segment of the path
    title = parsed.path.split("/")[-1]
    return base_url + "w/index.php?title={}&action=edit".format(title)
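
As a possible alternative to parsing the edit page, MediaWiki can also serve the wikitext directly when the page is requested with action=raw. The sketch below only builds such a URL; the function name get_raw_url is hypothetical, and it assumes the wiki exposes index.php under /w/ as in the code above.

```python
import urllib.parse


def get_raw_url(url):
    """Build a URL returning the raw wikitext of an article (sketch)."""
    parsed = urllib.parse.urlparse(url)
    # The article title is the last segment of the path
    title = urllib.parse.unquote(parsed.path.split("/")[-1])
    # Assumption: index.php lives under /w/, as on wikipedia.org
    query = urllib.parse.urlencode({"title": title, "action": "raw"})
    return "{}://{}/w/index.php?{}".format(parsed.scheme, parsed.netloc, query)


print(get_raw_url("https://en.wikipedia.org/wiki/Douglas_Adams"))
```

The body of a requests.get on that URL would then be the source itself, with no textarea to look for.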

Retrieve infobox

Infoboxes are delimited by {{ and }}. The goal is to find those patterns, which can be nested.

I coded an iterator over infoboxes based on those patterns. Unfortunately, I made it a recursive function to deal with nested infoboxes.

import re

START = re.compile(r"{{", re.MULTILINE)
END = re.compile(r"}}", re.MULTILINE)


def _find_next_start(source, pos, start):
    """Return the position just after the next opening pattern."""
    match = start.search(source, pos)
    if match is None:
        # no more opening pattern: use the end of the source
        return len(source)
    return match.end()


def iter_specs(source, start=START, end=END):
    for match in start.finditer(source):
        start_pos = match.start()
        next_start = _find_next_start(source, match.end(), start)
        end_matches = end.finditer(source, start_pos)

        for end_match in end_matches:
            if end_match.end() > next_start:
                # nested
                yield from iter_specs(source[next_start : end_match.end()], start, end)
                next_start = _find_next_start(source, end_match.end(), start)
                continue
            else:
                yield source[start_pos : end_match.end()]
                break
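
For comparison, here is a minimal non-recursive sketch based on a depth counter. It is not the iterator above: it yields only the outermost {{ }} blocks and leaves the nested ones inside their parent. The names TOKEN and iter_top_level_templates are hypothetical.

```python
import re

# Matches either an opening or a closing delimiter
TOKEN = re.compile(r"\{\{|\}\}")


def iter_top_level_templates(source):
    """Yield each outermost {{ ... }} block of the source."""
    depth = 0
    start_pos = None
    for match in TOKEN.finditer(source):
        if match.group() == "{{":
            if depth == 0:
                # entering a new top-level block
                start_pos = match.start()
            depth += 1
        elif depth > 0:
            depth -= 1
            if depth == 0:
                # the top-level block is closed
                yield source[start_pos : match.end()]


sample = "text {{Infobox x\n| d = {{birth date|1952}}\n}} more {{cite}}"
for block in iter_top_level_templates(sample):
    print(repr(block))
```

Unbalanced closing delimiters are ignored, and triple braces such as {{{param}}} are not handled specially.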

Get specs

An infobox now looks like:

{{Infobox writer
| name          = Douglas Adams
| image         = Douglas adams portrait cropped.jpg
| caption       =
| birth_name    = Douglas Noel Adams
| birth_date    = {{birth date|1952|3|11|df=yes}}
| birth_place   = [[Cambridge]], [[Cambridgeshire]], England
| death_date    = {{Death date and age|2001|5|11|1952|3|11|df=yes}}
| death_place   = [[Montecito, California]], US
| resting_place   = [[Highgate Cemetery]], London, England
| occupation    = Writer
| alma_mater    = [[St John's College, Cambridge]]
| genre         = [[Science fiction]], [[comedy]], [[satire]]
|notablework =''[[The Hitchhiker's Guide to the Galaxy]]''
|signature= Douglas Adams Unterschrift (cropped).jpg
| website       = {{URL|douglasadams.com}}
}}
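
The iterator yields every {{ }} block, not only infoboxes. A minimal way to keep the infoboxes, assuming their template name starts with "Infobox" as in the example above, could look like this (find_infoboxes is a hypothetical helper):

```python
def find_infoboxes(specs):
    """Keep only the templates whose name starts with 'Infobox'."""
    for spec in specs:
        # Drop the opening braces and any whitespace before the name
        name = spec.lstrip("{").lstrip()
        if name.lower().startswith("infobox"):
            yield spec


specs = [
    "{{short description|English writer}}",
    "{{Infobox writer\n| name = Douglas Adams\n}}",
    "{{birth date|1952|3|11|df=yes}}",
]
print(list(find_infoboxes(specs)))
```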

A simple and easy way to retrieve the information is to split fields on | and key/value pairs on =

import re


def spec_to_dict(spec):
    props = {}
    # Each field starts on a new line with a pipe
    for prop in re.split(r"\n\|", spec):
        if "=" not in prop:
            continue
        # Split on the first "=" only: values may contain "=" too
        key, value = prop.split("=", 1)
        # Drop link and template delimiters
        for repl in ("[[", "]]", "{{", "}}"):
            value = value.replace(repl, "")
        props[key.strip()] = value.strip(" '\"\n")
    return props

With the previous example, we get:

{'name': 'Douglas Adams',
 'image': 'Douglas adams portrait cropped.jpg',
 'caption': '',
 'birth_name': 'Douglas Noel Adams',
 'birth_date': 'birth date|1952|3|11|df=yes',
 'birth_place': 'Cambridge, Cambridgeshire, England',
 'death_date': 'Death date and age|2001|5|11|1952|3|11|df=yes',
 'death_place': 'Montecito, California, US',
 'resting_place': 'Highgate Cemetery, London, England',
 'occupation': 'Writer',
 'alma_mater': "St John's College, Cambridge",
 'genre': 'Science fiction, comedy, satire',
 'notablework': "The Hitchhiker's Guide to the Galaxy",
 'signature': 'Douglas Adams Unterschrift (cropped).jpg',
 'website': 'URL|douglasadams.com'}

Future work

There is still preprocessing to perform on the retrieved data but, as is, it is usable for numerical properties. Future work includes:

  • handle internal links
  • handle other kinds of information enclosed in {{ }}
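
As a starting point for both items, here is a rough sketch that works on the raw field value, before the delimiters are stripped. It assumes simple, non-nested links and templates; LINK, replace_links and parse_template are hypothetical names.

```python
import re

# [[target]] or [[target|label]], without nested brackets
LINK = re.compile(r"\[\[([^\[\]|]*)(?:\|([^\[\]]*))?\]\]")


def replace_links(value):
    """Replace each internal link by its visible text."""
    return LINK.sub(lambda m: m.group(2) or m.group(1), value)


def parse_template(value):
    """Split '{{name|a|b|k=v}}' into (name, positional, named)."""
    name, *params = value.strip("{} \n").split("|")
    positional, named = [], {}
    for param in params:
        if "=" in param:
            key, val = param.split("=", 1)
            named[key.strip()] = val.strip()
        else:
            positional.append(param.strip())
    return name.strip(), positional, named


print(replace_links("[[Montecito, California|Montecito]], US"))
print(parse_template("{{birth date|1952|3|11|df=yes}}"))
```

The positional parameters stay as strings; converting them (to a date, for instance) depends on the template and is left out.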

Category: how to Tagged: python wikipedia html data retrieval


Wikidata crawling

Sun 26 April 2020
Graph database representation (Photo credit: Wikipedia)

I wish to have reliable data about vehicles. I decided to rely on one large source, namely Wikipedia. I chose it because it is reviewable and most of the time reviewed, and regularly updated and completed.

Wikipedia - Wikidata relationship

Wikidata items are made to …

Category: how to Tagged: python wikipedia wikidata html


Differential equation in python

Sat 04 April 2020
Second order differential equation (Photo credit: Wikipedia)

In Python, differential equations can be solved numerically thanks to scipy [1]. Its usage is not as intuitive as I expected.

Simple equation

Let's start small. The first equation will be really simple:

\begin{equation*} \frac{\partial{f}}{\partial{t}} = a \times f …

Category: maths Tagged: python maths equation


Zombie propagation

Sat 21 March 2020
Zombie favorite food warning (Photo credit: wikipedia)

I recently read a paper [1] trying to model the propagation of a disease. I wanted to play with this model.

The model

The model is known as "SIR" as it divides the population into 3 groups:

  • S: susceptible to become a zombie
  • I: infected …

Category: maths Tagged: python maths zombie


Python virtualenv: quick reference

Sun 21 July 2019
Virtual environment (Photo credit: wikipedia)

To isolate Python developments, I use virtualenv. This allows me to forget about the specific Python version used for each project, avoids interference with the default Python installation and between my projects, is relatively light, and may have other advantages I cannot imagine with my …

Category: programming Tagged: python tools code
