Knowledge Graph Digest

Creating a human friendly graph format

by KgBase
June 18, 2020

What is a graph?

This is an example graph:

[Image: example graph]

  • A graph consists of nodes and edges.
  • Nodes have types (e.g. Actor, Movie).
  • Nodes have attributes (e.g. age, title).
  • Nodes have IDs that uniquely identify them.
  • Edges have a label (e.g. 'Actors' in the example above).
  • Edges have a source node and a target node.
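
As a rough illustration of the list above, here is a minimal Python sketch of the two building blocks. The class and field names are our own choice for this example and are not part of KgBase.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: str    # uniquely identifies the node
    type: str  # e.g. "Actor" or "Movie"
    attributes: dict = field(default_factory=dict)  # e.g. {"title": "Titanic"}


@dataclass(frozen=True)
class Edge:
    label: str   # e.g. "Actors"
    source: str  # ID of the source node
    target: str  # ID of the target node


titanic = Node(id="titanic", type="Movie", attributes={"title": "Titanic"})
kate = Node(id="kate", type="Actor", attributes={"name": "Kate Winslet"})
edge = Edge(label="Actors", source=titanic.id, target=kate.id)
```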

In KgBase, nodes are grouped into tables by their type:

[Image: nodes grouped into tables by type]

Encoding a graph in a simple format is an interesting problem:

1- A graph has more structure than a table (graphs are a superset of tables), so something as simple as a .csv file won't do.
2- Yet a graph is not arbitrarily structured data, so the graph format only needs to encode a subset of what JSON can express.

Encoding Nodes

Let's start by encoding a node. A node is just a list of key/value pairs:

key1 value1
key2 value2
key3 value3
...

A table is a collection of nodes, and the table ID represents the node type. Let's index each element of a table by the corresponding node ID:

{
    table1:{
        id1: {
                key1: value1
                key2: value2
                key3: value3
            }
        id2: {
                key1: value1
                key2: value2
                key3: value3
            }
        }
    table2:{
        node1
        node2
        ...
        }
}
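
The nested layout above maps naturally onto dictionaries. The following is only a sketch in Python; the helper names `get_node` and `node_ids` are made up for this example.

```python
# Tables keyed by node type, each table keyed by node ID.
graph = {
    "movies": {
        "titanic": {"title": "Titanic", "year": 1997},
    },
    "actors": {
        "kate": {"name": "Kate Winslet"},
        "leo": {"name": "Leonardo DiCaprio"},
    },
}


def get_node(graph, node_type, node_id):
    """Return the attributes of one node, looked up by type and then by ID."""
    return graph[node_type][node_id]


def node_ids(graph, node_type):
    """Return the IDs of all nodes of a given type, in insertion order."""
    return list(graph[node_type])
```

Looking a node up is then two dictionary accesses: `get_node(graph, "movies", "titanic")`.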

Encoding Edges

Now we have to think about how to encode edges. An edge has three elements: a label, a source node and a target node. So our first guess could be a list of triples:

[
    [edge_label1, source1, target1]
    [edge_label2, source2, target2]
    [edge_label3, source3, target3]
    ...
]
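
In Python, the list of triples could be sketched like this. The `Edge` tuple and the "inception" movie ID are illustrative assumptions, added only to make the repetition visible.

```python
from collections import Counter
from typing import NamedTuple


class Edge(NamedTuple):
    label: str
    source: str
    target: str


edges = [
    Edge("actors", "titanic", "kate"),
    Edge("actors", "titanic", "leo"),
    Edge("actors", "inception", "leo"),  # hypothetical second movie
]

# Count how often each label and each source is typed out.
label_counts = Counter(e.label for e in edges)
source_counts = Counter(e.source for e in edges)
```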

However, many of these labels, sources and targets will repeat, and we want the format to be easy for a human to type. So we could condense this information by grouping all edges by label, then source, then target:

{
    edge_label1: {
        source1: [target1, target2]
        source2: [target3, target4, target5]
        source3: [target6]
    }
    edge_label2: {
        source4: [target7]
        source5: [target8, target9, target10]
        ...
        }
    ...
}
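
A small sketch of how the flat triples might be condensed into this shape. The function name and the sample data (including "director", "james" and "inception") are assumptions made for the example.

```python
from collections import defaultdict


def group_by_label_then_source(triples):
    """Turn (label, source, target) triples into {label: {source: [targets]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for label, source, target in triples:
        grouped[label][source].append(target)
    # Convert back to plain dicts so the result prints and compares cleanly.
    return {label: dict(sources) for label, sources in grouped.items()}


triples = [
    ("actors", "titanic", "kate"),
    ("actors", "titanic", "leo"),
    ("actors", "inception", "leo"),
    ("director", "titanic", "james"),
]
by_label = group_by_label_then_source(triples)
```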

Or we could group first by source, then label, then target:

{
    source1: {
        edge_label1: [target1, target2]
        edge_label2: [target3, target4, target5]
        edge_label3: [target6]
    }
    source2: {
        edge_label4: [target7]
        edge_label5: [target8, target9, target10]
        ...
        }
    ...
}
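
The two groupings differ only in the order of the fields, so the order can be taken as a parameter. This is a sketch under that assumption; `group_edges` and the sample triples are not from KgBase.

```python
FIELDS = ("label", "source", "target")


def group_edges(triples, order):
    """Group (label, source, target) triples by order[0], then order[1],
    collecting the order[2] values in a list."""
    first, second, last = (FIELDS.index(name) for name in order)
    grouped = {}
    for triple in triples:
        inner = grouped.setdefault(triple[first], {})
        inner.setdefault(triple[second], []).append(triple[last])
    return grouped


triples = [
    ("actors", "titanic", "kate"),
    ("actors", "titanic", "leo"),
    ("actors", "inception", "leo"),  # hypothetical second movie
]
by_source = group_edges(triples, ("source", "label", "target"))
by_target = group_edges(triples, ("target", "label", "source"))
```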

In total, there are 3! = 6 ways (permutations) in which we could group the fields of the edges. However, among these possibilities, there is one which is 'special'.
If we group by source, then label, then target, we can merge this edge data with the node attributes data:

{
    table1:{
        id1: {
                key1: value1
                key2: value2
                key3: value3
                edge_label1: [target1, target2]
                edge_label2: [target3, target4, target5]
                edge_label3: [target6]

            }
        id2: {
                key1: value1
                key2: value2
                key3: value3
                edge_label4: [target7]
                edge_label5: [target8, target9, target10]
            }
        }
    table2:{
        node1
        node2
        ...
        }
}
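
The merge itself can be sketched in a few lines of Python. This assumes node IDs are unique across tables, and `merge_edges` is a name invented for the example.

```python
def merge_edges(tables, edges_by_source):
    """Return a copy of the tables in which every source node also
    carries its {label: [targets]} entries next to its attributes."""
    merged = {
        table: {node_id: dict(attrs) for node_id, attrs in nodes.items()}
        for table, nodes in tables.items()
    }
    for nodes in merged.values():
        for node_id, attrs in nodes.items():
            attrs.update(edges_by_source.get(node_id, {}))
    return merged


tables = {
    "movies": {"titanic": {"title": "Titanic"}},
    "actors": {"kate": {"name": "Kate Winslet"}},
}
edges_by_source = {"titanic": {"actors": ["kate"]}}
merged = merge_edges(tables, edges_by_source)
```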

So this way we don't need to have a separate object for edges.

There is one problem though: how can we tell the difference between an edge and an attribute from the node object keys?
We can add the rule that edge names must be preceded by a non-alphanumeric character, for example ':'. So 'some_field_name' would represent a normal attribute, while ':some_field_name' would represent an edge.
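
With this rule, separating the two kinds of keys is a simple prefix check. A minimal sketch, where `split_node` is a hypothetical helper:

```python
EDGE_PREFIX = ":"


def split_node(node):
    """Split a node object into (attributes, edges) based on the key prefix."""
    attributes, edges = {}, {}
    for key, value in node.items():
        if key.startswith(EDGE_PREFIX):
            edges[key[len(EDGE_PREFIX):]] = value  # drop the ':' from the label
        else:
            attributes[key] = value
    return attributes, edges


attributes, edges = split_node(
    {"title": "Titanic", "year": 1997, ":actors": ["kate", "leo"]}
)
```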

Syntactic Sugar

We are done! Let's see how an example would look:

{
    'movies': {
        'titanic': {
            'title': "Titanic",
            'year': 1997,
            ':actors': ['kate', 'leo']
        }
    },
    'actors': {
        'kate': {
            'name': "Kate Winslet",
            'gender': "female",
        },
        'leo': {
            'name': "Leonardo DiCaprio",
            'gender': "male",
        }
    }
}

Mmm, all those braces and indentation, though... they make it hard to read and write. For sure someone will forget a comma at the end of a line or fail to balance the braces.

TOML to the rescue. TOML aims to be a minimal configuration file format that is easy to read. Let's translate the example above into TOML:

[movies.titanic]
title="Titanic"
year=1997
":actors"=["kate", "leo"]

[actors.kate]
name="Kate Winslet"
gender="female"

[actors.leo]
name="Leonardo DiCaprio"
gender="male"

Ah! Much better. Only two things look a little off: you have to quote edge names (":actors") because they contain a non-alphanumeric character, and the list after the edge name carries a lot of needless punctuation (,"[]). Let's shave this yak. A little pre-processing of lines that start with ':' before handing the text over to the TOML parser gets us this:

[movies.titanic]
title="Titanic"
year=1997
:actors kate leo

[actors.kate]
name="Kate Winslet"
gender="female"

[actors.leo]
name="Leonardo DiCaprio"
gender="male"

And this is our brand new KGML format for encoding graphs.
We wish you lots of fun creating KgBase graphs by writing KGML.
