Data Modeling

Personal notes and tips

These are personal notes. You are welcome to read them.

Index page

Top-level home page  
Management hints
HTML, javascript
Other
     
More pages here ("technical" pages) Ham radio pages Contents of current page

 


Contents

The Concepts

The Process

Some Products

Miscelaneous

 


Concepts

 

data model for reference table
data model for reference table
IDEF1X:
data model for reference table

Reference table, mandatory values in foreign key. The primary key is in the reference table, the foreign key points to the primary key. The reference table is the parent table; the child table holds the foreign key to the parent table. See below for more on the one-to-many relationships.

Notations:

  • The foreign key corresponds to one and only one row in the reference table (shown with two bars). The row in the reference table corresponds to none, one or many rows (shown with a zero and a crow's foot).
  • Alternate notation: the arrow points from the foreign key to the primary key.
  • IDEF1X: a black dot indicates the child entity.

Note that if reference table contains a simple code and a meaning, then it could be removed and the foreign key could be assigned to a domain.

super / sub types

Barker notation:
Super type / sub type (Barker notation)

IDEF1X:
Super type / sub type (IDEF1X)

This models a super-type/sub-type.

  • The super-type or generic entity holds the common attributes
  • Move the catagory specific attributes to sub-types.
  • A sub-type discriminator field or a category discriminator field is added to the super-type
  • Both super-type and sub-type contain non-key attributes or foreign keys. The foreign keys are always mandatory (see below). If foreign keys are optional, they should be moved to sub-types.
  • The primary key is the same for all entities. This leads to the consequence that only one super-type is possible for any one entity (not like in OO)
  • Ideally, the sub-types should be nonoverlapping, i.e. occurrences can be in only one or the other sub-type.
  • Ideally, the sub-types should be exhaustive, i.e. no other category exists other than those defined by the sub-types.

Note: a single sub-type is possible.

A subtype cluster is complete when all possible subtype entities have been described. The subtype cluster is incomplete when not all entities have been discovered in the modeling process.

A subtype relationship is exclusive or non-overlapping if the super-type can be related to only one sub-type. It is inclusive or overlapping if a super-type can be related to more than one sub-type.

Related notions are generalization, inheritance and specialization.

Triad resolved
Triad of tables The "short-cut" relationship is redundant because of transitivity: Remove the short-cut.
Recursive structure Alternative for ecursive structure

Recursive relationships can be also modelled with two tables. In one of the tables, the primary key appears twice.
Recursive relationships are sometimes referred to as "fishhooks" or "swine ears".

Hierarchy

In a hierarchy, each instance can have zero to many subordinates, but a maximum of one superior. Note that the top-most instance can also be modelled as self-referencing.

Because hierarchies are difficult to query, consider replacing with separate entities if the hierarchy is only a few levels deep.

Network

In a network, each instance has zero to many superiors and zero to many subordinates.

This is difficult to program and efforts should be made to continue talking to the users to discover another way to model the entities.

Chain In a chain, each instance is linked to at most one other instance. Model with a zero to many (zero to two) relationship. Another option is to make another entity that lists the instances in order.
Or constraint Or constraint: one table on the right OR the other is joined to the table on the left.
XOR constraint (or arc)
XOR constraint (or arc)
An arc or exclusive or constraint means that only one or the other of the tables on the right is joined to the table on the left.
Whenever possible, replace arcs with sub-types.
Arc in the primary key: trick is to put a surrogate key and have the arc on non-key fields.
Part of In data modeling, aggregation and composition are modeled the same way. The notion is not as important as in UML.
symbol for non-transferability Transferability: related to change over time. Suppose a relationship between a person and a radio license (see [2] P. 98 in bibliography). The relationship between a person and the amateur radio license is non-transferable. The relationship between a person and a commercial radio station license is transferable. For transferable relationships, a history should be kept. Note that a transferable one-to-one relationship is really a one-to-many when the time dimension is considered. And a transferable one-to-many relationship can likewise be seen as a many-to-many with changes over time. A non-transferable one-to-one relationship will probably refer to the same real-world concept. A structured key can be used to enforce nontransferability because nontransferability is equivalent to weakness (see below).
  A weak entity is an entity that relies on another for its identification. In particular, the primary key will be composite and contain the foreign key to another entity. For example, the "Invoice Line" entity is a weak entity if the foreign key to the "Invoice" entity is included in the primary key. Note that nontransferability is equivalent to weakness (see below). See also ER Diagrams. Note that weak entities have rounded corners in IDEF1X notation.
  An optional relationship means a partial participation in the relationship. A mandatory relationship means a total participation in the relationship.

Some more rules:

 

Some aspects of the IDEF1X (Integration DEFinition for Information Modeling) notation are described below in the section on Erwin. Basically, the crow's feet are replaced with a black dot, because it is seen as the child end of the relationship. In one-to-many relationships, a diamond indicates that nulls are allowed on the "one" or parent side of the relationship.

(not to be confused with Idéfix:) Idefix

 


Review of relationship types

Hint for determining relationship cardinality: count lines on the other side. The difficulty is with the optionality of the Barker notation. The maxima are "on the other side" and the optionality is on the "same side".

Two types of relationship rules: structural or cadinality rules and referential integrity rules. The structural or cardinality rules define the existence (0 or 1 on the diagram). The referential integrity rules define 1 or many on the diagram.

One to one relationships

In one-to-one relationships, two tables are joined. Generally, they share the same primary key. This leads to an issue in the primary key generation: care should be taken to generate the primary key for one table and copy the value to the other. In addition, a foreign key relationship may have to be built in to force integrity.

Very often, two tables with a one-to-one relationship can be merged into one table. Exceptions are situations with a super type and sub-types. Another exception is a transferable relationship which, after analysis, turns out to be one-to-many relationship over time (at a given moment, the relationship is one-to-one, but changes over time).

Example Notation Barker Notation Comments
data_model_01_01 data_model_01_01 data_model_01_01 Loose relationship between the two tables (is it of any use?).
data_model_11_01 data_model_11_01 data_model_11_01 This needs a constraint to enforce the mandatory part of the relationship.
The instead of one-to-one optional relationship, use one table with nullable columns.
Exception: super-type/sub-type.
data_model_11_11 data_model_11_11 data_model_11_11 This needs a constraint to enforce the mandatory relationship on both sides.
A one-to-one mandatory relationship is very likely replaced by one table.

 

 

One to many relationships

Example Notation Barker Notation Comments
1-to-1 with 0-to-many 1-to-1 with 0-to-many 1-to-1 with 0-to-many This is the most common situation, in particular when one table references another table as a reference table. Another use is when a fact table (on the right) references a dimension table (on the left) in data warehouses. Note that here not all the values in the reference/dimension table are used. They are available for future use.
Another notation uses an arrow:
Reference table notation with arrow
0-to-1 with 0-to-many 0-to-1 with 0-to-many 0-to-1 with 0-to-many When the relationship is optional on the side of the non-reference table / fact table, the foreign key can be null. This is not optimal and it is best to resolve non-mandatory foreign key relationships (see below). Resolving makes the diagram look like the previous row (1-to-1 with 0-to-many).
1-to-1 with 1-to-many 1-to-1 with 1-to-many 1-to-1 with 1-to-many In this case, the relationship is mandatory on the side of the reference/dimension table (compare with the first case above) and all rows in the reference/dimension table are linked to a row in the other table. This needs an extra constraint to be built in as the foreign key mechanism does not enforce this automatically. This also implies the use of transactions so as to insert new values at the same time on both sides.
1-to-1 with 0-to-many 1-to-1 with 0-to-many 1-to-1 with 0-to-many This case combines both the difficulty of enforcing the mandatory relationship on the side of the reference/dimension table (see previous row) and the need to resolve the non-mandatory foreign key relationship (see below).
no referential integrity     The reference table is missing a value. When referential integrity is enforced, the database management system raises an error.
To solve, add the missing value to the reference table and enable the foreign key constraint.

An optional relationship from the perspective of the child (with the "0" showing on the parent side!) means that the child (fact table or non-reference table) is not existence dependent on the parent (or reference table). This is shown in the second and fourth cases above. A mandatory relationship from the child's perspective (no "0" showing on the parent side) means that the child is existence-dependent on the parent. This is shown in the first and third cases above. When the relationship is mandatory, two situations arise: in addition to being existence-dependent, the child may be or may not be identification-dependent on the parent. This is linked to the notion of an identifying relationship, which does not show in the IE and Barker notations. Identifying relationships show in the IDEF1X notation as full lines and non-identifying relationships show as dashed lines. The diagrams in this section do not show this, so refer to the ERwin section below.

 

Resolving the non-mandatory foreign key relationship

It is best to resolve non-mandatory foreign key relationships by adding a default value in the reference/dimension table or by creating a sub-type:

Add a default value such as
"unknown" or "non-existant"
in the reference/fact table and
make the null foreign key point
to the row with the default value.
Adding a default row to resolve non-mandatory foreign key relationship
Or make a sub-type in which all fields
are mandatory so as to
make the relationship mandatory.
Sub-type to resolve non-mandatory foreign key relationship

 

Many to many relationships

Example Notation Barker Notation Comments
1-to-many with 0-to-many
1-to-many with 0-to-many
1-to-many with 0-to-many
 
1-to-many with 0-to-many
Many to many
Many to many Resolve many-to-many relationships
by inserting an associative entity to
describe the relationship.
Make the foreign keys into a primary key
so as to force unicity of the
key combinations.
Needless to say, the foreign keys in the
middle should not be null.

Note that the foreign key mechanism does not enforce mandatory relationships.

 

Types of Joins

A theta-join is a join using a comparison between fields of two tables. Theta-joins include all comparison operators such as > or <. An equi-join refers to the most common type of theta-join which uses only equality in the condition.

 

 


Entity Relationship Diagrams

 

 

 

 

Relation   Table File
Tuple Row Record
Attribute Column name Field
Degree or arity Number of columns Number of fields
Cardinality Number of rows Number of records

R = {a1, a2, ..., an} is set of attribute names for relation schema r(a1, a2, ... , an)
Schema or intension corresponds to instance or extension, which is a set of n-tuples without duplicates.
Note: tuple rhymes with couple.

The smallest unit of data is a scalar, the values of which are part of a domain. Determining the domain of a an attribute (field) is an important step in the analysis. The data type (character or numeric) is often not enough to define the domain. One way of constraining values to a domain is to define a reference table. Is NULL or blank part of the domain? What are the maximum and the minimum values? For dates, are times included? For numbers, are they integers or decimals? Values are constrained by the DBMS by the data type definition, by the use of the foreign key mechanism, by constraints on NON NULL values and other types of constraints depending on the DBMS. These rules are integrity rules.

A relation on a collection of domains is composed of (see C.J.Date):

 

Superkey Set of attributes with distinct values; can identify a tuple. 
Candidate key  Minimal superkey.
A candidate key has uniqueness property and is irreducible (no subset has uniqueness property).
Primary key  No null values
Foreign key References primary key of another table (same domain of values)

Note that, because by definition relations cannot contain duplicate tuples, therefore there always exists a candidate key. The primary key is chosen amongst the candidate keys. (Generally, there is not much of a choice).

Foreign keys:
update option: either cascade (an update on the primary key cascades to matching foreign keys) or restrict (an update is not possible if matching foreign keys exist)
delete option: either cascade (delete rows with matching foreign keys) or restrict (cannot delete if foreign keys exist).
Some DBMS' allow foreign keys to have nulls, others do not.

 

 

 

Entity, relationship, key and characteristics Total participation: each entity participates
Partial participation: some values are null
Partial participation
Strong Entity Strong entity
Weak Entity Weak entity
Relationship Relationship, with attributes
1 to 1 relationship M to n relationship
Same as for m to n relationship

1 to ne relationships
if e2 has total participation in e1
or allow null values for e1Key and rAttr
1 to n relationship basic 1 to n 1 to n relationship
m to n relationship m to n
Same as for 1 to 1 relationship
 

 

Alternative terms

Table Alternative
Row Record
Column Field / zone
Table File
Database Library

 

 


Normalization

 

Some basic issues before the normalization process:

 

Functional Dependancies

Functional dependancies are the starting point for denormalization.

A → B if for each item in A, there exists one and only one possible value in B.
A is generally a key of ID of some sort.
"→" reads "determines" or "is a determinant of".

Armstrong's axioms:
X and Y are sets of attributes (or columns)

Closure F+ of F is a set of all functiona dependancies implied by F. This means that a key functionaly determines all attributes. And all attributes of the key are needed to fully determine all attributes in the relation.

Domain D = {values} (a domain D is a set of values)
Relation is included in D1 X D2 X D3 X ... X Dn (a relation consists of tuples and each domain corresponds to a column).

1st Normal Form:

(Student, course1, course2, course3, course4) (Student, course1, course_type1)
(Student, course2, course_type2)
(Student, course3, course_type3)
(Student, course4, course_type4)

An entity is in 1st normal form if there are no repeating groups of attributes.
Note that a single attribute containing a list also violates the 1st normal form.

2nd Normal Form:

(Student, course, room, course_result)
          course <-- room

room is dependant on only part of the key, i.e. on the course
(course, room)
(Student, course, result)

An entity is in 2nd normal form if it is in 1NF and if all the non-key attributes are fully dependant on the primary key, meaning that the non-key attributes are dependant on all of the primary key and on none of the subsets of the primary key. A partial dependancy is when a subset of the primary key determines one of the non-key attributes.

3rd Normal Form:

(Course, room, building)
         room <-- building

Building is dependant on a non-key field, i.e. on the room
(course, room)
(room, building
)

An entity is in 3rd normal form if it is in 2NF and if every determinant of a nonkey attribute is a candidate key. Or, no non-key attribute determines another non-key attribute.

One special case of violations of the 3rd normal form are derived attributes that can be calculated from another attribute.

A trick to remember is that every attribute is dependant "on key, the whole key, and nothing but the key".

Boyce Codd Normal Form (BCNF)

A and X are sets of attributes, and A is not in X
If X → A then X must be a candidate key.
That is all columns that determine other columns must be candidate keys.
In other words, every determinant of key items (BCNF) and of nonkey items (see 3NF) must be a candidate key.

Situations where a table is in 3NF but not in BCNF occur when there are overlapping candidate keys.

Note that BCNF may not preserve dependancies required by the business rules (that is each dependance can be reconstructed). The tradeoff is that normalization prevents redundant data. Program logic can be used to enforce additional business rules.

4th Normal Form

a b c
a b' c
a b c'

A sort of cartesian product (but not complete) within a key-only table. Make two relations. But watch because this may not preserve dependancies.
See 5NF because it includes 4NF and that it is easier to understand.

5th Normal Form

4NF and 5NF issues occur in tables whose columns are all part of the primary key. This means that there must be at least three columns in the key and no nonkey items. If there are nonkey elements in the table, then the table is in 5th normal form because the related issues do not apply.

According to Chris Date (and Simsion [2]), examples with nonkey items do not exist in practice.

As suggested by [2], handle 4NF with 5NF together because 5NF includes 4NF and it is easier to explain. It is explained by saying that "no further table splits are possible with different primary keys in the resulting tables". Basically, the normalization process involves various steps of identifying the need to split tables and assign new primary keys to the resulting tables. A correct normalization results in tables that can be re-joined so that the original tables are re-constituted. 5NF says that this process should continue until one of the two is true:

So, when in doubt, split the tables, with different primary keys in each, and try to re-constitute the original table.

Note that if the tables are split and each has the same primary key, this could still be useful if the underlying entities are in fact different but just identified (temporarily) with the same key. Simsion [2] gives the example of a bin where one table describes the physical dimensions of the bin and the other describes the contents.

Domain Key Normal Form (DKNF)

All constraints are a consequence of domains or keys. This means that constaints should be enforced by keys (see normal forms above) or by limiting the allowable values (domain). This normal form is not used often. It is noted here just for information.

Method

 

 

 


Data Model

High level data model that captures the categories of data and their business rules. Stick to data that is to be tracked by the IT system.

The starting point is the Business Process Model. This describes what and how things happen in the business. It does not describe the data. This is because the users generally see things from the point of view of the process and not of the data.

The Conceptual Model refers to a high-level model showing entities and relationships but few or no attributes. It is useful in scope definition. See Data Modeling (conceptual model) for more details.

The Business Model describes the data elements from the point of view of the business users. Some argue that there is only one business model. Linked to this type of model are the external models, of which several can exist. The conceptual model and the business model may be the same.

The Logical Data Model fully implements all requirements. It is in third normal form. See Data Modeling (logical) for more details.

The DBA implements the Physical Data Model. It is optimized for performance and therefore takes into account denormalization for performance purposes, merging of tables, implementation of sub-types/super-types, partitioning, ...
See Data Modeling (physical) for more details.

 

The whole diagram is an Entity Relationship Diagram (ERD).

Keep a central place for general business rules.
Also keep re-occuring definitions such as "Name = first name + last name"

 

Steps

Entity Relationship Diagram
This diagram shows three entities.
A "has" 0 to many Bs and B is "belongs" to one and only one A.
B "belongs" to 0 or 1 Cs and C "has" 1 to many Bs.
ERD cardinalities

See basics of data modeling in Data Modeling

See http://www.agiledata.org/essays/dataModeling101.html

 

Some comments

In one presentation (sorry, I can't remember who), suggested three contradictory requirements: design elegance, processing speeed, and information requirements. I understood this to be similar to the "cheap, good, fast - pick one" dilema.

 

 


Conceptual Model

The conceptual model is based on the analysis of the business processes.

The conceptual model is a high-level overview of the data entities. It supports discussions during business and systems interviews. Accompanied by a glossary, it provides the business-related descriptions of the entities, the relationships between these entities, the main attributes, examples of values, and owners. Interviews with business users and eventual existing data models provide the information for building the conceptual model.

Verification of the conceptual model: the stakeholders should verify that:

To verify the model, [2] suggests using the assertions approach, described below.

Though the primary keys are formally defined in the logical data model, a clear idea of what constitutes a single row is necessary at this level. Define what makes an entity's occurrence unique.

Determine the scope[4]. This results in a statement such as "a data model for the accounting department's application in six months" or "a model of current business processes in the bank."

 

Steps for High-Level Data Model

Taken from [4].

 

See also data modeling steps.


Logical Model

At an attribute level, the conceptual model defines the attributes in business terms and defines any related business rules. On the other hand, compared to the conceptual mode, the logical model defines the type of data, the constraints (mandatory, allowed values, ranges), eventual formatting, and the composition of the primary key. The logical model includes:

The main types of attributes are:

See notes on profiling in Data Quality, element analysis

 

See also data modeling steps.


Physical Model

 

The physical model is derived from the logical model in a fairly automatic way. The main decisions concern performance and implementation of the structures that have been designed. In particular, decisions have to be taken about indexes, storage (table space usage, free space, ...), memory, locking strategies, creation of views, ...

Some decisions also impact the data model:

 

Physical Database Planning

Verify that the server has enough cpu, memory, disk space. Do the same for the SAN.

disk storage capacity

Disk throughput capacity

Databases and locations (in the case of remote locations):
Is some data used only in the remote connections.

CPU performance

Measure memory with

Network

Regulatory issues

Data backup and recovery

Data archiving

Server Consolidation

Data Distribution
Main issue is connectivity

Growth

 


Data Warehouse Model

This section describes the steps for going from an enterprise data model, which is generally fully denormalized, to a data model appropriate for a data warehouse (see Silverston [5]).

More notes on designing a data warehouse are in Project Management Notes.

Starting with the normalized enterprise data model, complete with conceptual and logical models, we take the following steps. This constitutes the physical model of the data warehouse.

Note that, according to Silverston [5], we do not need to model the processes in the context of a data warehouse model. The enterprise data model gives the structure. Kimball insists on knowing how the business works. I do not think that means building the Business Process Model.

 

 


Assertions Approach

 

 

Simsion [2] proposes an approach using assertions for verifying the data model with the users. Ask the users to validate the follwing assertions:

 

 

 


Project Related Notes

 

 

Once a version of the data model has published, accompany any new version of the data model with a list of the changes so that it is easier for people to validate the changes. Or best, produce a list of changes before implementing them so as to get feedback before messing up the existing model. This includes [2]:

Process Model

Check that the data model can hold the data necessary to support each process. This can be done with a "CRUD" matrix. Done the left side are the processes, along the top are the entities. In the cells are the indication of which processes Create, Read, Update or Delete which entities.

  Entity A Entity 2 Entity AB Entity CD
Process 1
C
 
 
R
Process B
R
C
U
R
Process XYZ
R
 
D
C

 

 


Erwin

 

Some reminders:

 

Relationship Types

Identifying relationships: the child depends on the parent for identification (and existence as a consequence). The child is identification-dependent.
(Notice how the foreign key is put automatically into the primary key of the child).
Strictly speaking, the notion of dependant and independant entities is not part of the IE notation but ERwin includes it anyway.

Identifying relationship

Non-identifying relationships: the child does not depend on the parent for identification. However, it may depend on the parent for existence and in this case, nulls are not allowed on the parent side (see further below).
(notice how the foreign key is put automatically added to the child, as a non-key column.
Also note that the line is dashed.)

Non-identifying relationship

Cardinalities for one-to-many relationships
The left-most column shows the additional indications ("P", "Z", "3") when the "format > relationship display > cardinality" option is chosen in the IE notation.
(The parent is on the right side, the child on the left).

Cardinalities

For non-identifying relationships, the parent side can be null too. If the child is existence-dependent on the parent, then no nulls are allowed in the foreign key attribute of the child. Otherwise, nulls are allowed.
(The parent is on the right side, the child on the left)

Nulls allowed

Many to many relationships are shown as follows. Note that few options are possible in the relationship properties.

Many to many relationship

The notation for super types and sub types is as follows. The discriminator is shown next to the symbol. Exclusive/complete sub-typing is shown below. An inclusive sub-type relationship is without the X in the IE notation (the term is not used in IDEF1X). An incomplete relationship is shown with a single horizontal line with the IDEF1X notation (complete and incomplete relationships are not noted in the IE notation). The terms "inclusive" and "exclusive" are part of the IE notation. "Complete" and "incomplete" are part of the IDEF1X notation. Because the sub-types are weak entities, the boxes are rounded in IDEF1X notation.

Super type and sub type notation

Below are the (non-intuitive) icons for building the header and footer on the printed page (and no tooltip appears when hovering over the icon):

icons for header and footer when printing

Referential Integrity options:

Reports

Some starting suggestions. Use in the report builder (menu > tools > Report Template Builder > Report Builder) and create a new template.

fundamental entities: is this the equivalent of independant entity???

Methodology

The proposed methodology is:

 

Tutorials:

 


Bibliography