Pasquale Minervini$^{\alpha}$   Thomas Demeester$^{\beta}$   Tim Rocktäschel$^{\gamma}$   Sebastian Riedel$^{\alpha}$
${}^{\alpha}$ University College London   ${}^{\beta}$ Ghent University - iMinds   ${}^{\gamma}$ University of Oxford
p.minervini@ucl.ac.uk
http://slides.neuralnoise.com/CPH2017

### Knowledge Graphs

A knowledge graph is a tuple $(\cE, \cR, \cG)$, where:
• $\cE$ is a set of nodes, each representing an entity in the domain.
• $\cR$ is a set of edge labels, each representing a predicate, or relation type.
• $\cG \subseteq \cE \times \cR \times \cE$ is a set of $\langle \sc{subject}, \sc{predicate}, \sc{object} \rangle$ triples, denoting facts:
| subject | predicate | object |
|---|---|---|
| Barack Obama | married to | Michelle Obama |
| Barack Obama | politician of | United States |
| Barack Obama | born in | Honolulu |
| Honolulu | capital of | Hawaii |

## Incompleteness

Many large knowledge graphs exist, including DBpedia, Freebase, YAGO, the Google Knowledge Graph, and the Facebook Knowledge Graph.
However, they all suffer from incompleteness:

• Freebase: 71% of persons lack a birthplace, 75% lack a nationality, coverage for less frequent relations is even lower [Dong et al. 2014].
• DBpedia: 66% of persons lack a place of birth, 58% of scientists are missing a fact stating what they are known for [Krompaß et al. 2015].

Link Prediction can be cast as a learning-to-rank problem, where the aim is to learn a scoring function $\ff : \cE \times \cR \times \cE \rightarrow \Real$ such that:

$\ff(\rs, \rp, \ro) \;\; \text{is} \;\; \left\{ \begin{array}{rl} \text{large} & \text{if the fact } \spo \text{ is true},\\ \text{small} & \text{otherwise} \end{array} \right.$

The scoring function $\ff$ can be used for ranking missing triples: the higher the score, the more likely the triple.

The scoring function $\ff$ can be based either on the observable features or the latent features of entities $\rs, \ro \in \cE$.

Neural Link Predictors are neural network architectures where:

• Each entity $o \in \cE$ is associated with an embedding vector $\emb{o} \in \Real^{k}$
• The score of a triple $\spo$ is a function of the embeddings of $\rs$ and $\ro$:
• $\ff(\rs, \rp, \ro) \triangleq \ff_{\rp}(\emb{\rs}, \emb{\ro}),$ where $\ff_{\rp}$ is a neural scoring layer.

Several Neural Link Predictors proposed in the literature:

• Bilinear model (RESCAL) [Nickel et al. 2011]: $\ff_{\rp}(\emb{\rs}, \emb{\ro}) \triangleq \emb{\rs}^{T} \mathbf{W}_{\rp} \emb{\ro} \qquad \emb{\rs}, \emb{\ro} \in \Real^{k}, \quad \mathbf{W}_{\rp} \in \Real^{k \times k}$
• Translating Embeddings model (TransE) [Bordes et al. 2013]: $\ff_{\rp}(\emb{\rs}, \emb{\ro}) \triangleq - \vnorm{\emb{\rs} + \emb{\rp} - \emb{\ro}}, \qquad \emb{\rs}, \emb{\rp}, \emb{\ro} \in \Real^{k}$
• Bilinear Diagonal model (DistMult) [Yang et al. 2015]: $\ff_{\rp}(\emb{\rs}, \emb{\ro}) \triangleq \tdot{\emb{\rs}}{\emb{\rp}}{\emb{\ro}} = \sum_{i=1}^{k} \emb{\rs}_{i} \emb{\rp}_{i} \emb{\ro}_{i} \qquad \emb{\rs}, \emb{\rp}, \emb{\ro} \in \Real^{k}$
• Complex Embeddings model (ComplEx) [Trouillon et al. 2016]: $\ff_{\rp}(\emb{\rs}, \emb{\ro}) \triangleq \ReP{\tdot{\emb{\rs}}{\emb{\rp}}{\overline{\emb{\ro}}}} = \ReP{\sum_{i=1}^{k} \emb{\rs}_{i} \emb{\rp}_{i} \overline{\emb{\ro}_{i}}} \qquad \emb{\rs}, \emb{\rp}, \emb{\ro} \in \Complex^{k}$
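For concreteness, the four scoring layers above can be sketched in NumPy. The toy dimension $k$, the random embeddings, and the random RESCAL relation matrix below are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4  # toy embedding dimension (an assumption)
e_s, e_p, e_o = rng.normal(size=(3, k))   # real-valued embeddings
W_p = rng.normal(size=(k, k))             # RESCAL relation matrix

def score_rescal(e_s, W_p, e_o):
    # RESCAL: bilinear form e_s^T W_p e_o
    return float(e_s @ W_p @ e_o)

def score_transe(e_s, e_p, e_o):
    # TransE: negative norm of the translation residual
    return float(-np.linalg.norm(e_s + e_p - e_o))

def score_distmult(e_s, e_p, e_o):
    # DistMult: tri-linear dot product <e_s, e_p, e_o>
    return float(np.sum(e_s * e_p * e_o))

def score_complex(e_s, e_p, e_o):
    # ComplEx: real part of the tri-linear product, with the object conjugated
    return float(np.real(np.sum(e_s * e_p * np.conj(e_o))))

print(score_rescal(e_s, W_p, e_o))
print(score_transe(e_s, e_p, e_o))
print(score_distmult(e_s, e_p, e_o))
# ComplEx expects complex-valued embeddings
c_s, c_p, c_o = rng.normal(size=(3, k)) + 1j * rng.normal(size=(3, k))
print(score_complex(c_s, c_p, c_o))
```

Note that on real-valued inputs ComplEx reduces to DistMult, since the conjugate of a real number is itself.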

The model is trained by minimizing a hinge loss over observed and corrupted triples:

\begin{align} & \underset{\params}{\text{minimize}} \notag & & \loss(\params) \triangleq \sum_{(x, y) \in \Omega} \left[ 1 - y \cdot \ff(x ; \params) \right]_{+} \\ & \text{subject to} \notag & & \displaystyle{\forall o \in \cE: \; \vnorm{\emb{o}} = 1} \notag \end{align}

where negative examples ($y = -1$) are generated by a corruption process that corrupts either the subject or the object of a triple:

$\corr{\spo} = \{ \langle \tilde{\rs}, \rp, \ro \rangle \mid \tilde{\rs} \in \cE \} \cup \{ \langle \rs, \rp, \tilde{\ro} \rangle \mid \tilde{\ro} \in \cE \}$
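A minimal NumPy sketch of the corruption process and the resulting hinge loss; the choice of DistMult as $\ff_{\rp}$, the toy sizes, and the random embeddings are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
num_entities, k = 10, 4                 # tiny toy sizes (assumptions)
E = rng.normal(size=(num_entities, k))  # entity embeddings
p = rng.normal(size=k)                  # one predicate embedding

def score(s, o):
    # DistMult scoring, one concrete choice of f_p
    return float(np.sum(E[s] * p * E[o]))

def corruptions(s, o):
    # All triples obtained by replacing either the subject or the object;
    # in practice, corruptions that are themselves observed facts are filtered out
    return [(s2, o) for s2 in range(num_entities) if s2 != s] + \
           [(s, o2) for o2 in range(num_entities) if o2 != o]

def hinge_loss(s, o):
    # [1 - y * f(x)]_+ with y = +1 for the observed triple, y = -1 for corruptions
    loss = max(0.0, 1.0 - score(s, o))
    loss += sum(max(0.0, 1.0 + score(s2, o2)) for s2, o2 in corruptions(s, o))
    return loss

print(hinge_loss(0, 1))  # non-negative by construction
```

The unit-norm constraint on entity embeddings from the optimisation problem above would be enforced by renormalising each row of `E` after every gradient step.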

### Injecting First-Order Logic Rules

Consider the following rule:

$\boxed{\sc{fatherOf}(\mathcal{X}, \mathcal{Y}) \land \sc{parentOf}(\mathcal{Y}, \mathcal{Z}) \Rightarrow \sc{grandFatherOf}(\mathcal{X}, \mathcal{Z})}$

We want to enforce that, for all entities $x, y, z \in \cE$,
if $\langle x, \sc{fatherOf}, y \rangle$ and $\langle y, \sc{parentOf}, z \rangle$ hold,
$\langle x, \sc{grandFatherOf}, z \rangle$ also holds:

\boxed{ \begin{aligned} \forall x, y, z \in \cE : & \min \{ \ff(x, \sc{fatherOf}, y), \ff(y, \sc{parentOf}, z) \} \\ & \leq \ff(x, \sc{grandFatherOf}, z) \end{aligned} }
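Under this reading, the degree to which one concrete assignment of $x, y, z$ violates the inequality can be measured as the minimum of the body scores minus the head score, clipped at zero. A tiny sketch with made-up scores:

```python
def violation(f_body1, f_body2, f_head):
    # Degree to which min{f(x, fatherOf, y), f(y, parentOf, z)} <= f(x, grandFatherOf, z)
    # is violated: positive when the body scores exceed the head score
    return max(0.0, min(f_body1, f_body2) - f_head)

print(violation(2.0, 1.5, 0.5))  # -> 1.0: the rule is violated by a margin of 1.0
print(violation(2.0, 1.5, 3.0))  # -> 0.0: the constraint is satisfied
```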

A naïve solution consists of adding, for each rule, $\boxed{\card{\cE}^{3}}$ constraints.

However, this is unlikely to scale to large knowledge graphs: e.g. on Freebase (40M entities), we would need to introduce $6.4 \times 10^{22}$ constraints per rule.

Idea - Adversarial Training setting where, iteratively:

• An adversary searches for the most violating $\mathcal{S} = \{ x, y, z \}$,
• The model is trained by also correcting such violations.


This can be formalised in the following optimisation problem:

$\min_{\params} \mathcal{J}(\params) + \lambda \boxed{ \max_{\mathcal{S}} \mathcal{J}_{I}(\mathcal{S} ; \mathcal{A}, \params) }$
where $\mathcal{J}_{I}({}\cdot{})$ measures the degree to which the entities in $\mathcal{S}$
violate the rule in $\mathcal{A}$, given the model parameters $\params$.

1. Entities in $\mathcal{S}$ can live either in entity space or in embedding space.
2. In many cases, $\max_{\mathcal{S}} \mathcal{J}_{I}({}\cdot{})$ has a closed-form solution.
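When the entities in $\mathcal{S}$ live in embedding space, the inner maximisation can be approximated by projected gradient ascent on adversarial embeddings. A minimal sketch for the simpler implication rule $p(\mathcal{X}, \mathcal{Y}) \Rightarrow q(\mathcal{X}, \mathcal{Y})$ under DistMult scoring; the random relation embeddings, step size, and iteration count are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4
e_p = rng.normal(size=k)  # relation embedding for the rule body p
e_q = rng.normal(size=k)  # relation embedding for the rule head q

def normalise(v):
    # Project back onto the unit sphere
    return v / np.linalg.norm(v)

# Adversarial entity embeddings, initialised at random, kept unit-norm
x, y = normalise(rng.normal(size=k)), normalise(rng.normal(size=k))

for _ in range(100):
    # Maximise the rule violation f_p(x, y) - f_q(x, y) under DistMult;
    # the gradient of the objective w.r.t. x is (e_p - e_q) * y, and symmetrically for y
    x = normalise(x + 0.1 * (e_p - e_q) * y)
    y = normalise(y + 0.1 * (e_p - e_q) * x)

violation = np.sum(x * e_p * y) - np.sum(x * e_q * y)
print(violation)  # the violation achieved by the most adversarial pair found
```

For this bilinear objective, the maximum over unit-norm $x, y$ is attained in closed form at the coordinate where $|e_{p,i} - e_{q,i}|$ is largest, which is the kind of closed-form solution point 2 refers to.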

### Evaluation

Rank of a triple: given a test triple $\spo$, corrupt its subject $\rs$ in every possible way, and sort the resulting triples by their score, in descending order.
The position of $\spo$ in this ranking is its rank.
Then do the same for its object $\ro$.
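The rank computation can be sketched as follows; the candidate scores below are made-up numbers, and ties are broken arbitrarily by the sort:

```python
import numpy as np

def rank_of(true_index, scores):
    # Rank of the true triple among all candidates, sorted by score descending;
    # rank 1 means the true triple scored highest
    order = np.argsort(-np.asarray(scores))
    return int(np.where(order == true_index)[0][0]) + 1

# Hypothetical scores for 5 candidate subjects; the true subject is index 2
scores = [0.1, 0.7, 0.9, 0.3, 0.2]
r = rank_of(2, scores)
print(r)            # -> 1
print(1.0 / r)      # reciprocal rank; MRR averages this over all test triples
print(int(r <= 3))  # Hits@3 indicator for this triple
```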

We evaluate by:

• Mean Reciprocal Rank (MRR), Hits@$k$
• Area under the Precision-Recall Curve (AUC-PR)

Datasets: Freebase (FB122), WordNet (WN18), and others.

### Rules - Samples

A small sample of the rules used for
Freebase (FB122) and WordNet (WN18).

### Experiments - Freebase (FB122)

We compared our method (ASR) and its closed-form solutions (cASR)
with $\sc{KALE}$, a neural link predictor leveraging FOL rules,
and the unregularised versions of $\sc{DistMult}$ and $\sc{ComplEx}$.

### Experiments - WordNet (WN18)

Adversarial Set Regularisation is especially useful in a small-data regime,
where other methods struggle to generalise correctly.

## Conclusions

We proposed Adversarial Set Regularisation, a method for
injecting prior knowledge in Neural Link Predictors where:

• An adversary searches for the most violating variable assignments,
• The model is regularised by correcting such violations.

### Properties

• General - Any function-free Datalog rule
• Scalable - Independent of the domain size!
• Code: https://github.com/uclmr/inferbeddings

## Ongoing Work

Beyond Neural Link Prediction - Look at other domains where it can be beneficial to impose constraints on multiple data points:

1. Recognising Textual Entailment/Natural Language Inference
• $\sc{contradicts}(\mathcal{X}, \mathcal{Y}) \Rightarrow \sc{contradicts}(\mathcal{Y}, \mathcal{X})$
• $\sc{entails}(\mathcal{X}, \mathcal{Y}) \land \sc{entails}(\mathcal{Y}, \mathcal{Z}) \Rightarrow \sc{entails}(\mathcal{X}, \mathcal{Z})$
2. Image Processing
• $\sc{isRotationOf}(\mathcal{X}, \mathcal{Y}) \land \sc{hasLabel}(\mathcal{Y}, \mathcal{Z}) \Rightarrow \sc{hasLabel}(\mathcal{X}, \mathcal{Z})$
3. $\ldots$

Interested in applying this idea to other domains? Contact me!
$\sc{interestedIn}(\mathcal{X}, \;$ ASR $) \Rightarrow \sc{mailTo}(\mathcal{X}, \;$ p.minervini@ucl.ac.uk$)$