The data model of "property graphs", "labeled property graphs", or "attributed graphs" has emerged since the early 2000s as a common denominator of various models of graph-oriented databases.[1] It can be defined informally as follows:
In computer science terms, a property graph is a data structure representing entities associated by directed relationships, where the nodes and relationships can both carry multiple attributes/properties.
Properties take the form of key-value pairs, as used for example in JSON. Keys are character strings. Values are either numbers or character strings. These properties fall within the usual definition of attributes as understood in entity-attribute-value or object-oriented modeling. This is why the phrase "attributed graph" is relevant. Unlike in RDF graphs, properties are not arcs of the graph proper. This is another reason why it would be preferable to call them attributed graphs, or graphs with properties, rather than "property graphs", which is misleading.
Relationships are represented by arcs of the graph. These are often called edges, even though, strictly speaking, edges belong in undirected graphs. Arcs must have an identifier, a source node and a target node, and may have one or more attributes/properties in the previous sense.
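The informal definition above can be pictured with a small data structure. The following Python sketch is only an illustration: the class names `Node` and `Arc`, the field names, and the sample data are assumptions made here, not part of any particular graph database.

```python
from dataclasses import dataclass, field
from typing import Dict, Union

# Values are either numbers or character strings, as in the informal definition.
Value = Union[int, float, str]


@dataclass
class Node:
    id: str
    properties: Dict[str, Value] = field(default_factory=dict)


@dataclass
class Arc:
    id: str      # every arc has its own identifier
    source: str  # identifier of the source node
    target: str  # identifier of the target node
    properties: Dict[str, Value] = field(default_factory=dict)


# Hypothetical sample data: two nodes joined by one directed relationship.
alice = Node("n1", {"name": "Alice", "age": 34})
acme = Node("n2", {"name": "Acme"})
works_at = Arc("a1", source=alice.id, target=acme.id, properties={"since": 2019})
```

Note that the property `since` is stored on the arc itself rather than being represented by a further arc, which is the point of contrast with RDF made above.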
Formal definition
Building upon widely adopted definitions,[2][3] a property graph/attributed graph can be defined by a 7-tuple (N, A, K, V, α, κ, π), where
N is the set of nodes (vertices) of the graph
A is the set of arcs (directed edges) of the graph
K is a set of keys, taken from a countable set, defining the nature of attributes/properties
V is a set of values, to be associated with these keys in order to define full-fledged attributes
α : A → N × N is a total function, defining the multigraph proper. For a ∈ A, u ∈ N, v ∈ N, α(a) = (u, v) means that a is an arc of the graph having node u for origin and node v for target
κ ⊆ (A∪N) × K is a binary relation over (A∪N) and K (formally defined as a subset of the Cartesian product (A∪N) × K), associating zero, one or several keys with each arc and node of the graph
π : (A∪N) × K → V is a partial function, providing values for the properties of the nodes and arcs which include them. For u ∈ N, a ∈ A and k ∈ K, π(u, k) (respectively π(a, k)) is the value associated with the property key k for the node u (respectively the arc a), if the corresponding property is defined there.
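The tuple can be written down directly with sets and dictionaries. The sketch below is illustrative only: the variable names `alpha`, `kappa` and `pi` are chosen here for the endpoint function, the key relation and the value function, the sample data are invented, and the check assumes that a value may only be given for an (element, key) pair already present in the key relation.

```python
# Components of the tuple, with hypothetical sample data.
N = {"u", "v"}                      # nodes
A = {"a1", "a2"}                    # arcs
K = {"name", "since"}               # keys
V = {"Alice", "Bob", 2019}          # values

# Total function from arcs to (origin, target) pairs.
# Two parallel arcs between the same nodes are allowed (multigraph).
alpha = {"a1": ("u", "v"), "a2": ("u", "v")}

# Binary relation between graph elements and keys.
kappa = {("u", "name"), ("v", "name"), ("a1", "since")}

# Partial function giving a value to some (element, key) pairs.
pi = {("u", "name"): "Alice", ("v", "name"): "Bob", ("a1", "since"): 2019}


def is_well_formed(N, A, K, V, alpha, kappa, pi):
    """Check the constraints stated in the formal definition."""
    # alpha is total on A and maps every arc to a pair of nodes.
    total = set(alpha) == A and all(s in N and t in N for s, t in alpha.values())
    # kappa relates only existing elements to existing keys.
    keyed = all(x in (A | N) and k in K for x, k in kappa)
    # pi gives values from V, only for pairs present in kappa (assumption).
    valued = all(pair in kappa and val in V for pair, val in pi.items())
    return total and keyed and valued
```

Arc `a2` carries no property at all, which is permitted because the key relation may associate zero keys with an element.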
A complementary construct, used in several implementations of property graphs in commercial graph databases, is that of labels, which can be associated with both nodes and arcs of the graph. Labels have a practical rather than theoretical justification: they were originally intended for users of Entity-Relationship models and relational databases, to facilitate the import of their legacy data sets into graph databases. Labels make it possible to associate the same identifier (that of the relational table, or of the ER entity) with all graph nodes which correspond to the different rows of this relational table, or to instances of the same generic entity/class. With the proposed definition, these labels can in fact be viewed as attributes defined only by a key, without an associated value (this is why the association of keys with nodes and arcs is defined separately as a binary relation, and π as a partial function). The basic definition thus becomes clearer and simpler, and satisfies a principle of parsimony. Alternatively, and more consistently, labels can be defined through type graphs, as special types associated with nodes and arcs.
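The reading of labels as keys without values can be made concrete. In this sketch, which uses invented identifiers and the illustrative names `kappa` and `pi` for the key relation and the value function, a label is simply any key attached to an element for which no value is defined.

```python
# Key relation: which keys are attached to which nodes or arcs (sample data).
kappa = {
    ("n1", "Person"), ("n1", "name"),
    ("n2", "Person"), ("n2", "name"),
    ("a1", "KNOWS"),
}

# Value function: defined only for the "name" properties.
pi = {("n1", "name"): "Alice", ("n2", "name"): "Bob"}


def labels(element):
    """Keys attached to the element that have no associated value."""
    return {k for (x, k) in kappa if x == element and (x, k) not in pi}


def with_label(label):
    """All elements carrying the given key as a value-less label."""
    return {x for (x, k) in kappa if k == label and (x, k) not in pi}
```

Under this reading, `with_label("Person")` plays the role of selecting all rows imported from a hypothetical relational table named `Person`.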
Relations with other models
Graph theory and classical graph algorithmics
Attributed graphs, as defined above, are especially useful in that they provide an "umbrella" hypernymic concept (i.e. a common generalization) for several key graph-theoretic models which have long been widely used in classical graph algorithms:
Labeled graphs associate labels with each vertex and/or edge of a graph. Matched with attributed graphs, these labels correspond to attributes comprising only a key, taken from a countable set (typically a character string or an integer).
Colored graphs, as used in classical graph coloring problems, are special cases of labeled graphs, whose labels are defined on a finite set of keys, matched to colors.
Weighted graphs associate a numerical value with arcs/edges and, when relevant, with the vertices of a directed or undirected graph. These weights/valuations correspond to the different values of a set of attributes with the same key. As an example, for a graph modeling a road network, we could have one set of weights corresponding to the capacities (measured in number of vehicles per unit of time), and another representing the distances, these two valuations being associated with each road segment represented by an edge of the graph, and differentiated by two corresponding keys.
Flow networks are weighted graphs whose weights are interpreted as capacities. They are used in many classical models of transport networks, e.g. with maximum flow algorithms.
Shortest path problems, as solved by classical algorithms (like Dijkstra's algorithm), operate on weighted graphs for which the weights correspond to distances, real or virtual.
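The road network example can be sketched as follows: each arc carries two properties, and a classical algorithm simply selects the key it treats as the weight. The arc identifiers, node names and numbers are invented for illustration, arcs are taken as directed, and weights are assumed to be non-negative, as Dijkstra's algorithm requires.

```python
import heapq

# Each road segment is an arc (source, target, properties) with two valuations.
arcs = {
    "r1": ("A", "B", {"distance": 4.0, "capacity": 1200}),
    "r2": ("B", "C", {"distance": 3.0, "capacity": 800}),
    "r3": ("A", "C", {"distance": 9.0, "capacity": 2000}),
}


def shortest_distances(arcs, source, key="distance"):
    """Dijkstra's algorithm, using the property `key` as the arc weight."""
    # Build an adjacency list from the arcs that define the chosen key.
    adjacency = {}
    for u, v, props in arcs.values():
        if key in props:
            adjacency.setdefault(u, []).append((v, props[key]))

    dist = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, u = heapq.heappop(queue)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for v, w in adjacency.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(queue, (nd, v))
    return dist


print(shortest_distances(arcs, "A"))  # {'A': 0.0, 'B': 4.0, 'C': 7.0}
```

The same arcs could be handed to a maximum flow routine by reading the `capacity` key instead, without changing the graph itself.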
Knowledge graphs and RDF graphs
Knowledge graphs, usually represented as RDF graphs, are in fact hybrid labeled graphs, whose node labels correspond to instance identifiers (IRIs) or literals, and whose edge labels identify types (not instances) of predicates. They have now acquired a visibility which tends to obscure the longer-established use of graphs as direct models for systems of all kinds.[4] Attributed graphs are, by their versatility and expressivity, the best adapted to this type of modeling, where graphs which can rightly be called cyber-physical do not merely capture weakly structured information about a physical system, as would be the case with a knowledge graph, but attempt to directly capture the structure of a physical system, as matched by the connectivity structure of the graph. In contrast, an RDF graph would mix structural relationships with attached properties, and category/class information with instances/individuals, drowning out the structure. The expressivity of attributed graphs, on the level of higher-order logic, is also far above that of RDF graphs, which is limited to first-order logic. Properties of relationships, which are at the heart of the attributed graph model, require a very cumbersome reification process to be expressed in RDF.
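To give an idea of what reification involves, the sketch below expands a single arc with one property into plain triples, using the standard RDF reification vocabulary (`rdf:Statement`, `rdf:subject`, `rdf:predicate`, `rdf:object`). The function name `reify`, the `ex:` prefix and the sample data are hypothetical, and triples are shown as Python tuples rather than through any RDF library.

```python
def reify(arc_id, source, predicate, target, properties):
    """Express an attributed arc as RDF-style triples via reification."""
    statement = f"ex:{arc_id}"  # a resource standing for the relationship itself
    triples = [
        (source, predicate, target),                 # the relationship proper
        (statement, "rdf:type", "rdf:Statement"),
        (statement, "rdf:subject", source),
        (statement, "rdf:predicate", predicate),
        (statement, "rdf:object", target),
    ]
    # Each property of the arc becomes a triple about the statement resource.
    for key, value in properties.items():
        triples.append((statement, f"ex:{key}", value))
    return triples


triples = reify("a1", "ex:alice", "ex:worksAt", "ex:acme", {"since": 2019})
```

One arc with one property thus turns into six triples, whereas the attributed graph stores it as a single arc carrying a single key-value pair.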
The NGSI-LD data model specified by ETSI was the first attempt to standardize property graphs under a de jure standards body.
Compared to the basic model defined here, the NGSI-LD meta-model adds a formal definition of basic categories (entity, relation, property) on the basis of Semantic Web standards (OWL, RDFS, RDF), which makes it possible to convert all data represented in NGSI-LD into RDF datasets, through JSON-LD serialization. NGSI-LD entities, relations and properties are thus defined by reference to types which can themselves be defined by reference to ontologies, thesauri, taxonomies or microdata vocabularies, for the purpose of ensuring the semantic interoperability of the corresponding information.
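As a rough illustration of how a node and its outgoing arcs might be serialized in this style, the sketch below builds a JSON document in which properties carry a `value` and relationships carry an `object`, following the general pattern of NGSI-LD entities. The helper `to_entity`, the identifiers and the attribute names are invented, and the `@context` is deliberately omitted; the ETSI specification remains the reference for the normative form.

```python
import json


def to_entity(node_id, node_type, properties, relationships):
    """Build an NGSI-LD-style entity from a node and its outgoing arcs."""
    entity = {"id": node_id, "type": node_type}
    for key, value in properties.items():
        entity[key] = {"type": "Property", "value": value}
    for name, (target_id, rel_properties) in relationships.items():
        rel = {"type": "Relationship", "object": target_id}
        # Properties of the relationship are nested inside it.
        for key, value in rel_properties.items():
            rel[key] = {"type": "Property", "value": value}
        entity[name] = rel
    return entity


# Hypothetical example: a parking spot located in a car park.
entity = to_entity(
    "urn:ngsi-ld:ParkingSpot:001",
    "ParkingSpot",
    {"status": "free"},
    {"isPartOf": ("urn:ngsi-ld:CarPark:A", {"since": 2019})},
)
print(json.dumps(entity, indent=2))
```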
The ISO/IEC JTC1/SC32/WG3 working group, which established the SQL standard, is in the process of specifying a new query language suitable for graph-oriented databases, called GQL (Graph Query Language). This standard will include the specification of a property graph data model, which should be along the lines of the basic model described here, possibly adding notions of labels, types, and schemas.
Type graphs and schemas
Graph-oriented databases are, compared to relational databases, touted for not requiring the prior definition of a schema before the database can be populated. This is desirable and suitable for environments and applications where one operates under an open world assumption, such as the description of complex systems and systems of systems, characterized by bottom-up organization and evolution rather than control by a single stakeholder. However, even in such environments, it may be necessary to constrain the representation of specific subsets of the information entered into the database, in a way that may resemble a traditional database schema, while keeping the overall graph open to the addition of unforeseen data or configurations. For example, the description of a smart city falls under the open world assumption and will be described by the upper level of a graph database, without a schema. However, specific technical sub-systems of this city remain top-down closed-world systems managed by a single operator, who may impose a stronger structuring of information, as customarily represented by a schema.
The notions of "type graphs" and schemas[2] make it possible to meet this need, with types playing a role similar to that of labels in classical graph databases, but with the added possibility of specifying relations between these types and constraining them by keys and properties. The type graph is itself a property graph, linked by a relation of graph homomorphism with the graphs of instances that use the types it defines, playing a role similar to that of a schema in a data definition language.
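The homomorphism condition can be sketched as a simple conformance check: every instance arc, once its endpoints are replaced by their types, must coincide with an arc of the type graph. The type names, identifiers and the function `conforms` are assumptions for illustration, and property constraints are left out for brevity.

```python
# Type graph: allowed (source type, arc type, target type) combinations.
type_arcs = {
    ("Person", "WORKS_AT", "Company"),
    ("Person", "KNOWS", "Person"),
}

# Instance graph: a type for each node, and typed arcs between nodes.
node_type = {"n1": "Person", "n2": "Person", "n3": "Company"}
instance_arcs = {
    "a1": ("n1", "KNOWS", "n2"),
    "a2": ("n1", "WORKS_AT", "n3"),
}


def conforms(instance_arcs, node_type, type_arcs):
    """True if mapping each node to its type sends every arc to a type arc."""
    for source, arc_type, target in instance_arcs.values():
        image = (node_type.get(source), arc_type, node_type.get(target))
        if image not in type_arcs:
            return False
    return True
```

An arc such as `("n3", "KNOWS", "n1")` would be rejected, since the type graph has no `KNOWS` arc leaving `Company`; parts of the graph that are not checked in this way stay open to unforeseen data.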
The ontologies, thesauri or taxonomies used to reference NGSI-LD types are also defined by graphs, but these are RDF graphs rather than property graphs, and they typically have broader scopes than database schemas. The complementary use, possible with NGSI-LD types, of type graphs and references to external ontologies makes it possible to enforce strong data structuring and consistency, while affording semantic grounding and interoperability.