\( \newcommand{\Real}{\ensuremath{\mathbb{R}}} \newcommand{\Natural}{\ensuremath{\mathbb{N}}} \newcommand{\Complex}{\ensuremath{\mathbb{C}}} \newcommand{\ReP}[1]{\ensuremath{\text{Re}\left(#1\right)}} \newcommand{\ImP}[1]{\ensuremath{\text{Im}\left(#1\right)}} \newcommand{\vnorm}[1]{\ensuremath{\|#1\|}} \newcommand{\emb}[1]{\ensuremath{\mathbf{e}_{#1}}} \newcommand{\remb}[1]{\ensuremath{\mathbf{r}_{#1}}} \newcommand{\rs}{\ensuremath{s}} \newcommand{\rp}{\ensuremath{p}} \newcommand{\rpq}{\ensuremath{q}} \newcommand{\ro}{\ensuremath{o}} \newcommand{\EXT}{\ensuremath{\text{IEXT}}} \newcommand{\conjt}[1]{\ensuremath{\overline{#1}}} \newcommand{\bx}{\ensuremath{\mathbf{x}}} \newcommand{\by}{\ensuremath{\mathbf{y}}} \newcommand{\bz}{\ensuremath{\mathbf{z}}} \newcommand{\vx}{\ensuremath{\mathbf{x}}} \newcommand{\vy}{\ensuremath{\mathbf{y}}} \newcommand{\vz}{\ensuremath{\mathbf{z}}} \newcommand{\spo}{\ensuremath{\langle \rs, \rp, \ro \rangle}} \newcommand{\sspo}{\ensuremath{\rs\rp\ro}} \newcommand{\cE}{\ensuremath{\mathcal{E}}} \newcommand{\cR}{\ensuremath{\mathcal{R}}} \newcommand{\cG}{\ensuremath{\mathcal{G}}} \newcommand{\cS}{\ensuremath{\mathcal{S}}} \newcommand{\cN}{\ensuremath{\mathcal{N}}} \newcommand{\cB}{\ensuremath{\mathcal{B}}} \newcommand{\dset}{\ensuremath{\mathcal{D}}} \newcommand{\loss}{\ensuremath{\mathcal{J}}} \newcommand{\sloss}{\ensuremath{\loss_{\cS}}} \newcommand{\sreg}{\ensuremath{\cR_{\cS}}} \newcommand{\score}{\ensuremath{\phi}} \newcommand{\KG}{\ensuremath{\mathcal{G}}} \newcommand{\Test}{\ensuremath{\mathcal{T}}} \newcommand{\Rank}{\text{rank}} \newcommand{\BigO}{\ensuremath{\mathcal{O}}} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\res}[1]{\textsc{#1}} \newcommand{\card}[1]{\ensuremath{\left\vert{#1}\right\vert}} \newcommand{\kg}[1]{{\textsc{#1}}} \newcommand{\rel}[1]{{\textsc{#1}}} \newcommand{\irel}[1]{{\textsc{#1}\textsuperscript{$-$}}} \newcommand{\orel}[1]{{\texttt{#1}}} \newcommand{\mdl}[1]{{\textsc{#1}}} \newcommand{\smdl}[1]{{\textsc{#1}$^\mathcal{S}$}} \newcommand{\corr}[1]{\ensuremath{\mathcal{C}(#1)}} \newcommand{\dvgc}[2]{\ensuremath{D\left[#1\|#2\right]}} \newcommand{\tdot}[3]{\ensuremath{\langle #1, #2, #3 \rangle}} \newcommand{\ff}{\ensuremath{\phi}} \newcommand{\fg}{\ensuremath{\pi}} \newcommand{\sR}{\textsuperscript{R}} \newcommand{\axioms}{\ensuremath{\mathcal{A}_{E}}} \newcommand{\iaxioms}{\ensuremath{\mathcal{A}_{I}}} \newcommand{\params}{\ensuremath{\Theta}} \newcommand{\paramspace}{\ensuremath{\boldsymbol{\Theta}}} \newcommand{\ie}{\textit{i}.\textit{e}.\ } \newcommand{\eg}{\textit{e}.\textit{g}.\ } \def\sc#1{\dosc#1\csod} \def\dosc#1#2\csod{{\rm #1{\small #2}}} \)

Adversarial Sets for Regularising Neural Link Predictors

Pasquale Minervini$^{\alpha}$   Thomas Demeester$^{\beta}$   Tim Rocktäschel$^{\gamma}$   Sebastian Riedel$^{\alpha}$
${}^{\alpha}$ University College London   ${}^{\beta}$ Ghent University - iMinds   ${}^{\gamma}$ University of Oxford
p.minervini@ucl.ac.uk
http://slides.neuralnoise.com/CPH2017

Knowledge Graphs

    A knowledge graph is a tuple $(\cE, \cR, \cG)$, where:
  • $\cE$ is a set of nodes, each representing an entity in the domain.
  • $\cR$ is a set of edge labels, each representing a predicate, or relation type.
  • $\cG \subseteq \cE \times \cR \times \cE$ is a set of $\langle \sc{subject}, \sc{predicate}, \sc{object} \rangle$ triples, denoting facts:
subject | predicate | object
Barack Obama | married to | Michelle Obama
Barack Obama | politician of | United States
Barack Obama | born in | Honolulu
Honolulu | capital of | Hawaii
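
To make the definition concrete, here is a minimal Python sketch (illustrative only, using the example entities and relations above) of a knowledge graph as a plain set of triples:

```python
# A knowledge graph as plain Python sets of entities, relations, and triples.
E = {"Barack Obama", "Michelle Obama", "United States", "Honolulu", "Hawaii"}
R = {"married to", "politician of", "born in", "capital of"}
G = {
    ("Barack Obama", "married to", "Michelle Obama"),
    ("Barack Obama", "politician of", "United States"),
    ("Barack Obama", "born in", "Honolulu"),
    ("Honolulu", "capital of", "Hawaii"),
}

# G is a subset of E x R x E: every triple links two known entities
# through a known relation type.
assert all(s in E and p in R and o in E for s, p, o in G)
```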

The Linked Open Data Cloud

DBpedia $\rightarrow$ Obama

DBpedia $\rightarrow$ Copenhagen

Google Knowledge Graph

Incompleteness

There are many large knowledge graphs, including DBpedia, Freebase, YAGO, the Google Knowledge Graph, and the Facebook Knowledge Graph.
However, they suffer from incompleteness:

  • Freebase: 71% of persons lack a birthplace, 75% lack a nationality; coverage for less frequent relations is even lower [Dong et al. 2014].
  • DBpedia: 66% of persons lack a place of birth, and 58% of scientists are missing a fact stating what they are known for [Krompaß et al. 2015].


Link Prediction: the problem of finding missing links in Knowledge Graphs.

Link Prediction

Link Prediction can be cast as a learning to rank problem, where the aim is to learn a scoring function $\ff : \cE \times \cR \times \cE \to \Real$ such that:

\[ \ff(\rs, \rp, \ro) \;\; \text{is} \;\; \left\{ \begin{array}{rl} \text{large} & \text{if the fact } \spo \text{ is true},\\ \text{small} & \text{otherwise} \end{array} \right. \]

The scoring function $\ff$ can be used for ranking missing triples: the higher the score, the more likely the triple.

The scoring function $\ff$ can be based either on the observable features or the latent features of entities $\rs, \ro \in \cE$.

Neural Link Prediction

Neural Link Predictors are neural network architectures where:

  • Each entity $o \in \cE$ is associated with an embedding vector $\emb{o} \in \Real^{k}$.
  • The score of a triple $\spo$ is a function of the embeddings of $\rs$ and $\ro$: \[ \ff(\rs, \rp, \ro) \triangleq \ff_{\rp}(\emb{\rs}, \emb{\ro}), \] where $\ff_{\rp}$ is a neural scoring layer.
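
As a concrete illustration, here is a minimal numpy sketch of this architecture: an embedding lookup table plus a relation-specific scoring layer $\ff_{\rp}$. The entity names, the dimension $k$, and the bilinear choice of layer are illustrative assumptions, not the method itself:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 50  # embedding dimension (an illustrative choice)

# Each entity o in E is associated with an embedding vector e_o in R^k.
E = {e: rng.normal(size=k)
     for e in ["Barack Obama", "Michelle Obama", "Honolulu"]}

# Each relation p owns the parameters of its scoring layer phi_p; here,
# a bilinear layer with matrix W_p (one possible choice, see next slide).
W = {p: rng.normal(size=(k, k)) for p in ["married to", "born in"]}

def phi(s, p, o):
    """phi(s, p, o) := phi_p(e_s, e_o): the score depends on the
    entities only through their embeddings."""
    return float(E[s] @ W[p] @ E[o])

print(phi("Barack Obama", "born in", "Honolulu"))
```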

Neural Link Predictors

Several Neural Link Predictors proposed in the literature:

  • Bilinear model (RESCAL) [Nickel et al. 2011]: \[ \ff_{\rp}(\emb{\rs}, \emb{\ro}) \triangleq \emb{\rs}^{T} \mathbf{W}_{\rp} \emb{\ro} \qquad \emb{\rs}, \emb{\ro} \in \Real^{k}, \quad \mathbf{W}_{\rp} \in \Real^{k \times k} \]
  • Translating Embeddings model (TransE) [Bordes et al. 2013]: \[ \ff_{\rp}(\emb{\rs}, \emb{\ro}) \triangleq - \vnorm{\emb{\rs} + \emb{\rp} - \emb{\ro}}, \qquad \emb{\rs}, \emb{\rp}, \emb{\ro} \in \Real^{k} \]
  • Bilinear Diagonal model (DistMult) [Yang et al. 2015]: \[ \ff_{\rp}(\emb{\rs}, \emb{\ro}) \triangleq \tdot{\emb{\rs}}{\emb{\rp}}{\emb{\ro}} = \sum_{i=1}^{k} \emb{\rs}_{i} \emb{\rp}_{i} \emb{\ro}_{i} \qquad \emb{\rs}, \emb{\rp}, \emb{\ro} \in \Real^{k} \]
  • Complex Embeddings model (ComplEx) [Trouillon et al. 2016]: \[ \ff_{\rp}(\emb{\rs}, \emb{\ro}) \triangleq \ReP{\tdot{\emb{\rs}}{\emb{\rp}}{\overline{\emb{\ro}}}} = \ReP{\sum_{i=1}^{k} \emb{\rs}_{i} \emb{\rp}_{i} \overline{\emb{\ro}_{i}}} \qquad \emb{\rs}, \emb{\rp}, \emb{\ro} \in \Complex^{k} \]
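
The last three scoring layers translate directly into numpy; the bilinear (RESCAL) layer already appeared in the sketch above. Embeddings here are random and the dimension is illustrative:

```python
import numpy as np

def transe(e_s, r_p, e_o):
    # TransE: -||e_s + r_p - e_o||
    return -np.linalg.norm(e_s + r_p - e_o)

def distmult(e_s, r_p, e_o):
    # DistMult: trilinear dot product <e_s, r_p, e_o>
    return np.sum(e_s * r_p * e_o)

def complex_(e_s, r_p, e_o):
    # ComplEx: Re(<e_s, r_p, conj(e_o)>) over complex-valued embeddings
    return np.real(np.sum(e_s * r_p * np.conj(e_o)))

rng = np.random.default_rng(0)
k = 10
e_s, r_p, e_o = rng.normal(size=(3, k))
print(transe(e_s, r_p, e_o), distmult(e_s, r_p, e_o))

c_s, c_p, c_o = rng.normal(size=(3, k)) + 1j * rng.normal(size=(3, k))
print(complex_(c_s, c_p, c_o))
```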

Training Neural Link Predictors

Minimizing a hinge loss over observed and corrupted triples:


\[ \begin{aligned} & \underset{\params}{\text{minimize}} & & \loss(\params) \triangleq \sum_{(x, y) \in \Omega} \left[ 1 - y \cdot \ff(x ; \params) \right]_{+} \\ & \text{subject to} & & \forall o \in \cE: \; \vnorm{\emb{o}} = 1 \end{aligned} \]

where $\Omega$ contains both observed triples ($y = +1$) and negative examples ($y = -1$), the latter generated by a corruption process that corrupts either the subject or the object of a triple:


\[ \corr{\spo} = \{ \langle \tilde{\rs}, \rp, \ro \rangle \mid \tilde{\rs} \in \cE \} \cup \{ \langle \rs, \rp, \tilde{\ro} \rangle \mid \tilde{\ro} \in \cE \} \]
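
A minimal sketch of these ingredients: a corruption sampler, the per-example hinge term, and the projection enforcing the unit-norm constraint. The uniform sampler is an assumption; other corruption schemes are possible:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(triple, entities):
    """Sample one element of C(<s, p, o>): replace the subject or the
    object with an entity drawn uniformly at random."""
    s, p, o = triple
    e = entities[rng.integers(len(entities))]
    return (e, p, o) if rng.random() < 0.5 else (s, p, e)

def hinge(score, y):
    # Per-example term of the objective: [1 - y * phi(x)]_+
    return max(0.0, 1.0 - y * score)

def project(entity_emb):
    # Enforce the constraint ||e_o|| = 1 after each gradient step.
    for o in entity_emb:
        entity_emb[o] = entity_emb[o] / np.linalg.norm(entity_emb[o])
```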

Injecting First-Order Logic Rules

Consider the following rule:

\[ \boxed{\sc{fatherOf}(\mathcal{X}, \mathcal{Y}) \land \sc{parentOf}(\mathcal{Y}, \mathcal{Z}) \Rightarrow \sc{grandFatherOf}(\mathcal{X}, \mathcal{Z})} \]

We want to enforce that, for all entities $x, y, z \in \cE$,
if $\langle x, \sc{fatherOf}, y \rangle$ and $\langle y, \sc{parentOf}, z \rangle$ hold,
$\langle x, \sc{grandFatherOf}, z \rangle$ also holds:

\[ \boxed{ \begin{aligned} \forall x, y, z \in \cE : & \min \{ \ff(x, \sc{fatherOf}, y), \ff(y, \sc{parentOf}, z) \} \\ & \leq \ff(x, \sc{grandFatherOf}, z) \end{aligned} } \]

A naïve solution is to add, for each rule, $\boxed{\card{\cE}^{3}}$ constraints.

However, this is unlikely to scale to large knowledge graphs: on Freebase (40M entities), a single rule would require $6.4 \times 10^{22}$ constraints.
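
Under this relaxation, the degree to which a single grounding $(x, y, z)$ violates the rule can be measured as $\left[\min(\text{body scores}) - \text{head score}\right]_{+}$; a sketch, assuming a scoring function `phi` as in the earlier snippets:

```python
def rule_violation(phi, x, y, z):
    """How strongly the grounding (x, y, z) violates
    fatherOf(x, y) AND parentOf(y, z) => grandFatherOf(x, z),
    measured as [min(body scores) - head score]_+ ."""
    body = min(phi(x, "fatherOf", y), phi(y, "parentOf", z))
    head = phi(x, "grandFatherOf", z)
    return max(0.0, body - head)

# The naive approach sums rule_violation over all |E|^3 groundings,
# which is infeasible at Freebase scale; hence the adversarial
# formulation on the next slides.
```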

Regularisation via Adversarial Sets

Idea - an Adversarial Training setting where, iteratively:

  • An adversary searches for the set of entities $\mathcal{S} = \{ x, y, z \}$ that most violates the rules,
  • The model is then trained to also correct such violations.

This can be formalised as the following optimisation problem:

\[ \min_{\params} \mathcal{J}(\params) + \lambda \boxed{ \max_{\mathcal{S}} \mathcal{J}_{I}(\mathcal{S} ; \mathcal{A}, \params) } \]
where $\mathcal{J}_{I}({}\cdot{})$ measures the degree to which the entities in $\mathcal{S}$
violate the rule in $\mathcal{A}$, given the model parameters $\params$.

  1. The entities in $\mathcal{S}$ can live either in entity space or in embedding space.
  2. In many cases, $\max_{\mathcal{S}} \mathcal{J}_{I}({}\cdot{})$ admits a closed-form solution.
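
For intuition, here is a toy numpy sketch of the inner maximisation when $\mathcal{S}$ lives in embedding space, for a simplified implication rule $p(\mathcal{X}, \mathcal{Y}) \Rightarrow q(\mathcal{X}, \mathcal{Y})$ under a DistMult scoring layer. The rule shape, step size, and unit-sphere projection are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
k, lr, steps = 10, 0.1, 50

# DistMult relation embeddings for a simple rule  p(X, Y) => q(X, Y)
r_p, r_q = rng.normal(size=(2, k))

def inner_loss(a, b):
    # J_I: degree to which the adversarial pair (a, b) violates the rule,
    # [phi_p(a, b) - phi_q(a, b)]_+ with DistMult scores.
    return max(0.0, np.sum(a * r_p * b) - np.sum(a * r_q * b))

# The adversary performs gradient ascent on J_I over the embedding-space
# inputs (a, b), projecting back onto the unit sphere after each step.
a, b = rng.normal(size=(2, k))
for _ in range(steps):
    if inner_loss(a, b) > 0.0:
        a = a + lr * (r_p - r_q) * b  # dJ_I / da
        b = b + lr * (r_p - r_q) * a  # dJ_I / db (with the updated a)
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)

# The model would then be trained to reduce this maximal violation.
print(inner_loss(a, b))
```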

Model Architecture

Evaluation

Rank of a triple: given a test triple $\spo$, corrupt its subject $\rs$ in every possible way and sort the resulting triples by score, in descending order.
The position of $\spo$ in this ranking is its rank.
The same procedure is then repeated for the object $\ro$.

We evaluate by:

  • Mean Reciprocal Rank (MRR), Hits@$k$
  • Area under the Precision-Recall Curve (AUC-PR)

Datasets: Freebase (FB122), WordNet (WN18), and others.
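
Given the ranks collected this way, the ranking metrics reduce to a few lines; a minimal sketch with made-up ranks:

```python
import numpy as np

def mrr_hits(ranks, k=10):
    """Mean Reciprocal Rank and Hits@k from a list of test-triple ranks."""
    ranks = np.asarray(ranks, dtype=float)
    return (1.0 / ranks).mean(), (ranks <= k).mean()

# Hypothetical ranks from subject and object corruptions of two triples
print(mrr_hits([1, 3, 2, 120]))  # -> (0.4604..., 0.75)
```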

Rules - Samples

A small sample of the rules used for Freebase (FB122) and WordNet (WN18):

Rules

Experiments - Freebase (FB122)

We compared our method (ASR) and its closed-form solutions (cASR)
with $\sc{KALE}$, a neural link predictor that leverages FOL rules,
and with the unregularised versions of $\sc{DistMult}$ and $\sc{ComplEx}$.

FB122

Experiments - WordNet (WN18)

Adversarial Set Regularisation is especially useful in the small-data regime,
where other methods struggle to generalise correctly.

WN18

Conclusions

We proposed Adversarial Set Regularisation, a method for
injecting prior knowledge into Neural Link Predictors, where:

  • An adversary searches for the variable assignments that most violate the rules,
  • The model is regularised by correcting such violations.

Properties

  • General - Any function-free Datalog rule can be injected
  • Scalable - Independent of the domain size
  • Effective - State-of-the-art results on major link prediction tasks
  • Code: https://github.com/uclmr/inferbeddings

Ongoing Works

Beyond Neural Link Prediction - we are looking at other domains where it can be beneficial to impose constraints over multiple data points:

  1. Recognising Textual Entailment/Natural Language Inference
    • $\sc{contradicts}(\mathcal{X}, \mathcal{Y}) \Rightarrow \sc{contradicts}(\mathcal{Y}, \mathcal{X})$
    • $\sc{entails}(\mathcal{X}, \mathcal{Y}) \land \sc{entails}(\mathcal{Y}, \mathcal{Z}) \Rightarrow \sc{entails}(\mathcal{X}, \mathcal{Z})$
  2. Image Processing
    • $\sc{isRotationOf}(\mathcal{X}, \mathcal{Y}) \land \sc{hasLabel}(\mathcal{Y}, \mathcal{Z}) \Rightarrow \sc{hasLabel}(\mathcal{X}, \mathcal{Z})$
  3. $\ldots$

Interested in applying this idea to other domains? Contact me!
$\sc{interestedIn}(\mathcal{X}, \;$ ASR $) \Rightarrow \sc{mailTo}(\mathcal{X}, \;$ p.minervini@ucl.ac.uk$)$