Wikipedia crawling (part II)

Thu 11 June 2020
Crawling (Photo credit: Wikipedia)

This article is the follow-up to the one about Wikidata crawling.

Wikipedia has specific infobox templates. They are the normalized way to enter specifications inside Wikipedia articles. Each template comes with predefined fields. For example, the planet template has fields such as periapsis or circumference. Other parameters are computed from those fields and displayed. This normalization avoids duplicates and conversion errors (e.g. values are automatically displayed in meters and feet when relevant, while the information is provided in meters only).

My goal for the rest of this article is to retrieve those normalized parameters for one specific article.

\begin{equation*} \fbox{URL} \to \fbox{wiki source} \to \fbox{infobox} \to \fbox{python dict} \end{equation*}

Retrieve Wikipedia page source

The Wikipedia source is accessible through the "edit" or "view source" tab.

wikipedia edit tab

I decided to retrieve the content of the editable text by crawling the webpage. I'm confident that better ways to retrieve this content exist, but I didn't take the time to look for them.

The URL of the edit page can be derived from the article URL. The editable source is inside a textarea HTML tag.

import urllib.parse

import requests
from bs4 import BeautifulSoup


def get_source(url):
    """Return the wikitext source of a Wikipedia article.

    Args:
        url (str): Wikipedia page URL
    Returns:
        (str) wikitext source, or "" if no textarea is found
    """
    source_url = _get_source_url(url)
    req = requests.get(source_url)
    # Name the parser explicitly; otherwise bs4 guesses one and emits a warning
    soup = BeautifulSoup(req.content, "html.parser")
    source_tag = soup.find("textarea")
    if source_tag is None:
        return ""
    return source_tag.text


def _get_source_url(url):
    """Build the URL of the "edit" page from the article URL."""
    # No request is needed here: the edit URL only depends on the article URL
    parsed = urllib.parse.urlparse(url)
    base_url = "{}://{}/".format(parsed.scheme, parsed.netloc)

    # The article title is the last segment of the path
    title = parsed.path.split("/")[-1]
    return base_url + "w/index.php?title={}&action=edit".format(title)
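One of the "better ways" hinted at above may be MediaWiki's action=raw, which returns the wikitext directly and so avoids parsing HTML. The sketch below assumes the article URL has the usual /wiki/Title shape. The names raw_source_url and get_raw_source are hypothetical helpers, not part of the original code.

```python
import urllib.parse


def raw_source_url(url):
    """Build the ``action=raw`` URL for a Wikipedia article URL."""
    parsed = urllib.parse.urlparse(url)
    # The article title is the last path segment, e.g. "Douglas_Adams".
    title = urllib.parse.unquote(parsed.path.split("/")[-1])
    query = urllib.parse.urlencode({"title": title, "action": "raw"})
    return "{}://{}/w/index.php?{}".format(parsed.scheme, parsed.netloc, query)


def get_raw_source(url):
    """Fetch the wikitext directly (performs a network request)."""
    import requests  # third-party; imported here so the URL helper stays dependency-free

    response = requests.get(raw_source_url(url), timeout=10)
    response.raise_for_status()
    return response.text
```

With this approach the textarea lookup is no longer needed, since the response body is already the wikitext.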

Retrieve infobox

Infoboxes are delimited with {{ and }}. The goal is to find those patterns. Such patterns can be nested.

I coded an iterator over infoboxes based on those patterns. Unfortunately, I resorted to a recursive function to deal with nested infoboxes.

import re

START = re.compile(r"{{", re.MULTILINE)
END = re.compile(r"}}", re.MULTILINE)


def _find_next_start(source, pos, start):
    """Return the index just past the next opening delimiter after pos."""
    match = start.search(source, pos)
    if match is None:
        # no more opening delimiter: point to the end of the source
        return len(source)
    return match.end()


def iter_specs(source, start=START, end=END):
    for match in start.finditer(source):
        start_pos = match.start()
        next_start = _find_next_start(source, match.end(), start)
        end_matches = end.finditer(source, start_pos)

        for end_match in end_matches:
            if end_match.end() > next_start:
                # nested
                yield from iter_specs(source[next_start : end_match.end()], start, end)
                next_start = _find_next_start(source, end_match.end(), start)
                continue
            else:
                yield source[start_pos : end_match.end()]
                break
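As an alternative to recursion, the same extraction can be sketched with a stack of opening positions. This is only a sketch, not the code used in this article: iter_templates is a hypothetical name, and it does not treat triple braces or nowiki sections specially. It yields inner blocks before the block that contains them, whatever the nesting depth.

```python
def iter_templates(source, opener="{{", closer="}}"):
    """Yield every balanced opener...closer block, innermost blocks first."""
    stack = []  # positions of openers that are still waiting for a closer
    pos = 0
    while pos < len(source):
        if source.startswith(opener, pos):
            stack.append(pos)
            pos += len(opener)
        elif source.startswith(closer, pos):
            if stack:  # ignore a closer that has no matching opener
                start = stack.pop()
                yield source[start : pos + len(closer)]
            pos += len(closer)
        else:
            pos += 1
```

For example, list(iter_templates("{{a {{b}} }}")) gives the inner block "{{b}}" followed by the whole string.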

Get specs

An extracted infobox now looks like:

{{Infobox writer
| name          = Douglas Adams
| image         = Douglas adams portrait cropped.jpg
| caption       =
| birth_name    = Douglas Noel Adams
| birth_date    = {{birth date|1952|3|11|df=yes}}
| birth_place   = [[Cambridge]], [[Cambridgeshire]], England
| death_date    = {{Death date and age|2001|5|11|1952|3|11|df=yes}}
| death_place   = [[Montecito, California]], US
| resting_place   = [[Highgate Cemetery]], London, England
| occupation    = Writer
| alma_mater    = [[St John's College, Cambridge]]
| genre         = [[Science fiction]], [[comedy]], [[satire]]
|notablework =''[[The Hitchhiker's Guide to the Galaxy]]''
|signature= Douglas Adams Unterschrift (cropped).jpg
| website       = {{URL|douglasadams.com}}
}}

A simple and easy way to retrieve the information is to split fields on | and key/value pairs on =.

def spec_to_dict(spec):
    props = {}
    for prop in spec.split("|"):
        if "=" not in prop:
            continue
        key, value = prop.split("=", 1)
        value = value.strip("{}[] \n")
        props[key.strip()] = value
    return props

With the previous example, we get:

{'name': 'Douglas Adams',
 'image': 'Douglas adams portrait cropped.jpg',
 'caption': '',
 'birth_name': 'Douglas Noel Adams',
 'birth_date': 'birth date',
 'df': 'yes',
 'birth_place': 'Cambridge]], [[Cambridgeshire]], England',
 'death_date': 'Death date and age',
 'death_place': 'Montecito, California]], US',
 'resting_place': 'Highgate Cemetery]], London, England',
 'occupation': 'Writer',
 'alma_mater': "St John's College, Cambridge",
 'genre': 'Science fiction]], [[comedy]], [[satire',
 'notablework': "''[[The Hitchhiker's Guide to the Galaxy]]''",
 'signature': 'Douglas Adams Unterschrift (cropped).jpg',
 'website': 'URL'}
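In this output, the nested templates leak a stray df key and several values are cut at the first inner |. One possible refinement, sketched here under the assumption that only {{ }} and [[ ]] nest, is to split on | only at the top level. The names split_top_level and spec_to_dict_nested are hypothetical.

```python
def split_top_level(text, sep="|"):
    """Split text on sep, ignoring separators nested inside {{ }} or [[ ]]."""
    parts, current, depth = [], [], 0
    i = 0
    while i < len(text):
        pair = text[i : i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif text[i] == sep and depth == 0:
            # top-level separator: close the current field
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(text[i])
            i += 1
    parts.append("".join(current))
    return parts


def spec_to_dict_nested(spec):
    """Variant of spec_to_dict that keeps nested templates and links intact."""
    inner = spec.strip()[2:-2]  # drop the outer {{ and }}
    props = {}
    for prop in split_top_level(inner):
        if "=" not in prop:
            continue
        key, value = prop.split("=", 1)
        props[key.strip()] = value.strip()
    return props
```

With this variant, birth_date keeps the whole {{birth date|1952|3|11|df=yes}} template as its value, which can then be parsed separately.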

Future work

There is still preprocessing to perform on the retrieved data, but as is, it is usable for numerical properties. Future work includes:

  • handle internal links
  • handle other kinds of information enclosed in {{ }}
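For the internal links, a first step could be to replace each [[target]] or [[target|label]] with its visible text. The regular expression below is a rough sketch that ignores nested or file links, and strip_wikilinks is a hypothetical helper name.

```python
import re

# [[target]] or [[target|label]]; the label part is optional.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def strip_wikilinks(value):
    """Replace each internal link with its visible text."""
    # Use the label when present and non-empty, else the link target.
    return WIKILINK.sub(lambda m: m.group(2) or m.group(1), value)
```

Applied to the raw birth_place field "[[Cambridge]], [[Cambridgeshire]], England", this returns "Cambridge, Cambridgeshire, England".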

Category: how to Tagged: python wikipedia html data retrieval


Travis setup

Tue 12 May 2020
One job in continuous integration pipeline (Photo credit: Wikipedia)

The goal is to set up a CI pipeline based on Travis, with external dependencies, integrated into a GitHub repository.

Travis basics

To enable Travis integration in GitHub, one must edit the ./.travis.yml file.

I won't go into detail. The setup is …

Category: how to Tagged: travis ci how to

Wikidata crawling

Sun 26 April 2020
Graph database representation (Photo credit: Wikipedia)

I wish to have reliable data about vehicles. I decided to rely on one large source, namely Wikipedia. I chose it because it is reviewable and most of the time reviewed, and regularly updated and completed.

Wikipedia - Wikidata relationship

Wikidata items are made to …

Category: how to Tagged: python wikipedia wikidata html

Differential equation in python

Sat 04 April 2020
Second order differential equation (Photo credit: Wikipedia)

In Python, differential equations can be solved numerically thanks to scipy [1]. Its usage is not as intuitive as I expected.

Simple equation

Let's start small. The first equation will be really simple:

\begin{equation*} \frac{\partial{f}}{\partial{t}} = a \times f …

Category: maths Tagged: python maths equation

Zombie propagation

Sat 21 March 2020
Zombie favorite food warning (Photo credit: wikipedia)

I recently read a paper [1] that tries to model the propagation of a disease. I wanted to play with this model.

The model

The model is known as "SIR" as it divides the population into 3 groups:

  • S: susceptible to become a zombie
  • I: infected …

Category: maths Tagged: python maths zombie

Python virtualenv: quick reference

Sun 21 July 2019
Virtual environment (Photo credit: wikipedia)

To isolate Python developments, I use virtualenv. This allows me to forget about the specific Python version used for each project, avoid interference with the default Python installation and between my projects, is relatively light, and may have other advantages I cannot imagine with my …

Category: programming Tagged: python tools code

C*: Yaw

Mon 01 April 2019
aileron Vertical stabilizer (Photo credit: Wikipedia)

This post is about yaw control. This is also the post for which I did not find many references.

Remember that yaw is the axis controlled by the rudder. The rudder acts like any foil, providing a force dependent on its angle of attack. This …

Category: aviation Tagged: C star Flight dynamics yaw

LaTeX makefile updated

Fri 29 March 2019

My default LaTeX makefile evolved. Here is an update:

The makefile looks like:

LATEX=pdflatex
BIBTEX=bibtex
BIB=
RERUN='(There is undefined reference|Rerun to get (cross-references|the bars) right)'

%.pdf:%.tex
    ${LATEX} $<
    @if [ -e $*.bbl ]; then ${BIBTEX} $* && ${LATEX} $< && ${LATEX} $< ; fi
    @if egrep -q $(RERUN) $*.log ; then ${LATEX} $< ; fi

%.aux …

Category: tools Tagged: GNU LaTeX Makefile Writing how to tools
