Reference Table

Advanced Data Vault Modeling

Daniel Linstedt, Michael Olschimke, in Building a Scalable Data Warehouse with Data Vault 2.0, 2016

6.3 Reference Tables

These next sections cover another type of entity, which is not part of the core architecture but used often in the Data Vault.

We have introduced the hub in Chapter 4 as a unique list of business keys, identifying objects that are used in business. However, there are more keys and codes in enterprise data that don't necessarily qualify as business keys because they don't reference business objects. For example, ISO country codes, such as USA for the United States or DEU for Germany, are codes that are used in business, but the countries themselves are not used as business objects within the organization. Instead, they are used as descriptive reference data that delineates a specific state of information. In the case of country codes, the ISO code could describe the country where a sale took place. This description usually includes the official name of the country and some other, more descriptive information, such as the continent or the capital. Often, this reference data is not controlled by the organization, but by an external body. On the other hand, the very same country code could be a business key in another organization, such as the United Nations Organization (UNO).

Reference data is not purely descriptive. It lives in the context of other information. The country information without the context of the sales transaction would be of no value to the business. Or, to rephrase this statement: what is the value to the business of an unused list of country codes with their corresponding official names? It's zero. But if the country code is used in other data, such as sales transactions, it provides value by adding descriptive data for the business. Still, such codes don't qualify as business keys because they don't identify business objects; therefore, they usually don't go into hub structures.

This is where reference tables come into play. Reference tables are used to store information that is commonly used to set up context and describe other business keys. In many cases, these are standard codes and descriptions or classifications of information.

The next sections describe some options for reference tables.

6.3.1 No-History Reference Tables

The most basic reference table is just a typical table in third or second normal form. This basic table is used when there is no need to store history for the reference data. That is often the case for reference data that is not going to change or that will change very seldom. Typical examples include:

Medical drug prescription codes and definitions

Stock exchange symbols

Medical diagnosis codes

VIN number codes and definitions (such as manufacturer codes)

Calendar dates

Calendar times

International currency codes

US state code abbreviations

Note that this depends on the actual project: in some countries other than the USA, for example, the medical diagnosis codes might change frequently.

The simple no-history reference table has no begin-date and no end-date because there are no changes in the data. Therefore, the structure is very simple, as Figure 6.7 shows.

Figure 6.7. A nonhistorized reference table for calendar (logical design).

This logical model shows a reference table to store a simple calendar in the Business Vault. The data is identified by the Date key, which is a Date field in the database. Other attributes in this example are the Year, Month, and Day, which store the corresponding whole numbers. Day of Week is the text representation of the weekday, e.g. "Monday." There is no need to keep a history of changes because most businesses will not need to track them. This doesn't mean that the data in this structure never changes. However, most changes are bug fixes or should update all information marts, including historical data. Examples of the latter include translating the Day of Week attribute or abbreviating the text. Figure 6.8 shows the ER model for this reference table.

Figure 6.8. A nonhistorized reference table for calendar (physical design).

The descriptive business key is used as the primary key of the table because this key is referenced by satellites and Business Vault entities. That way, the model becomes more readable and auditability is ensured over time. Using the business key as the primary key of the reference table also has the advantage that it can be used in ER models and for referential integrity (if turned on), for example for debugging purposes.
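To make the physical design concrete, a minimal DDL sketch of such a reference table is shown below; the schema name, table name, and data types are assumptions, while the columns follow Figure 6.8.

-- Minimal sketch of a nonhistorized calendar reference table (cf. Figure 6.8).
-- Schema name, table name, and data types are assumptions.
CREATE TABLE DataVault.RefCalendar (
    [Date]       DATE         NOT NULL, -- descriptive business key, used as primary key
    LoadDate     DATETIME2(3) NOT NULL, -- when the record was loaded (e.g., from MDS)
    RecordSource VARCHAR(50)  NOT NULL, -- origin of the record
    [Year]       SMALLINT     NOT NULL,
    [Month]      TINYINT      NOT NULL,
    [Day]        TINYINT      NOT NULL,
    DayOfWeek    VARCHAR(9)   NOT NULL, -- e.g. 'Monday'
    CONSTRAINT PK_RefCalendar PRIMARY KEY CLUSTERED ([Date])
);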

Table 6.9 shows an excerpt of the reference data in the calendar table.

Table 6.9. Calendar Data in Nonhistory Reference Table

Date | Load Date | Record Source | Year | Month | Day | Day of Week
2000-01-01 | 2014-06-20 04:30:21.333 | MDS | 2000 | 1 | 1 | Saturday
2000-01-02 | 2014-06-20 04:30:21.333 | MDS | 2000 | 1 | 2 | Sunday
2000-01-03 | 2014-06-20 04:30:21.333 | MDS | 2000 | 1 | 3 | Monday
2000-01-04 | 2014-06-20 04:30:21.333 | MDS | 2000 | 1 | 4 | Tuesday
2000-01-05 | 2014-06-20 04:30:21.333 | MDS | 2000 | 1 | 5 | Wednesday
2000-01-06 | 2014-06-20 04:30:21.333 | MDS | 2000 | 1 | 6 | Thursday
2000-01-07 | 2014-06-20 04:30:21.333 | MDS | 2000 | 1 | 7 | Friday

This example uses a RecordSource attribute again because the data is sourced from Master Data Services (MDS). If the data in MDS is changed by the user, it will overwrite the content of the reference table because there is no history tracking. In other cases, the data is not sourced from anywhere. Then, the LoadDate and the RecordSource attributes are not needed. However, it is good practice to source the data from analytical master data because it becomes editable by the business user without the need for IT. This is a prerequisite for managed self-service business intelligence (BI), a concept that is covered in Chapter 9, Master Data Management.

Once the reference table has been created in the model, it can be integrated into the rest of the model by using the primary key of the reference table wherever appropriate: most often in satellites, but also in Business Vault entities. Figure 6.9 shows a typical use case where a satellite on a Passenger hub references the primary key of a reference table.

Figure 6.9. Satellite with reference data (logical design).

The satellite Address references the reference table State via the USPS state abbreviations. That way, the reference indicates that there is more descriptive information for State in the reference table. By doing so, we don't lose readability in the satellite and keep the basic usage of Data Vault entities intact.
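As a sketch, such a satellite could be declared as follows; all names and data types are assumptions based on Figure 6.9, and the commented-out foreign key shows how referential integrity could optionally be turned on, for example for debugging.

-- Sketch: satellite on a Passenger hub whose State column resolves against a
-- state reference table via the USPS abbreviation. All names are assumptions.
CREATE TABLE DataVault.SatPassengerAddress (
    PassengerHashKey CHAR(32)      NOT NULL, -- hash key of the parent hub
    LoadDate         DATETIME2(3)  NOT NULL,
    LoadEndDate      DATETIME2(3)  NULL,
    RecordSource     VARCHAR(50)   NOT NULL,
    Street           NVARCHAR(100) NULL,
    City             NVARCHAR(50)  NULL,
    State            CHAR(2)       NULL, -- USPS abbreviation, described further in RefState
    Zip              VARCHAR(10)   NULL,
    CONSTRAINT PK_SatPassengerAddress PRIMARY KEY (PassengerHashKey, LoadDate)
    -- Optional referential integrity, e.g. for debugging:
    -- ,CONSTRAINT FK_SatPassengerAddress_RefState
    --     FOREIGN KEY (State) REFERENCES DataVault.RefState (Code)
);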

6.3.2 History-Based Reference Tables

The last section introduced simple reference tables that hold no history. However, there are cases when reference data needs to be historized, similar to satellite data. To provide an alternative to Data Vault satellites when dealing with reference data, there is an option for history-based reference tables. If it is important to the business to reprint reports or go back in time and look at the historic reference data, these tables can be used to meet this requirement.

The way that Data Vault deals with this requirement is by adding standard satellites to the reference table presented in the previous section. While the base table holds only nonhistorized attributes, the satellite holds the reference data that requires history.

Figure 6.10 shows an extended version of the reference table from the last section. It is extended by the satellite Fiscal Calendar, which adds two historized attributes to the reference table: Fiscal Year and Fiscal Quarter. By having them historized, the business is capable of changing them in the future or correcting a change in the past. This could be a requirement if two organizations with different fiscal calendars merged in the past and the business wants to be able to work with historic reports.

Figure 6.10. History-based reference table for calendar (logical design).

By adding a satellite to the reference table to enable historization, it is possible to follow the basic concepts of Data Vault 2.0 modeling to extend the simple reference table introduced in the previous section. This is a great example of how these basic entities can be used by combining them into advanced entities.

Figure 6.11 shows the physical model derived from the logical model in Figure 6.10.

Figure 6.11. Physical model of historized calendar by using Data Vault satellite (physical design).

Satellite SatFiscalCalendar is attached to the reference table by using its primary key Date. We don't use a hashed version of the key in favor of readability of the reference table. If we preferred a hashed Date, it would require the use of the hash as the primary key, which in turn would affect the usage of the reference table. Other than that, the satellite is very similar to standard Data Vault 2.0 satellites, especially the use of the LoadDate in the primary key and LoadEndDate for end-dating satellite entries.
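A minimal DDL sketch of this satellite is shown below; the schema name and data types are assumptions, the column names follow Figures 6.10 and 6.11, and the foreign key assumes the RefCalendar sketch shown earlier.

-- Sketch: satellite attached to the calendar reference table by its unhashed
-- Date primary key (cf. Figure 6.11). Schema name and data types are assumptions.
CREATE TABLE DataVault.SatFiscalCalendar (
    [Date]        DATE         NOT NULL, -- primary key of the parent reference table (not hashed)
    LoadDate      DATETIME2(3) NOT NULL, -- part of the primary key, as in standard satellites
    LoadEndDate   DATETIME2(3) NULL,     -- used for end-dating satellite entries
    RecordSource  VARCHAR(50)  NOT NULL,
    FiscalYear    SMALLINT     NULL,
    FiscalQuarter TINYINT      NULL,
    CONSTRAINT PK_SatFiscalCalendar PRIMARY KEY ([Date], LoadDate),
    CONSTRAINT FK_SatFiscalCalendar_RefCalendar
        FOREIGN KEY ([Date]) REFERENCES DataVault.RefCalendar ([Date])
);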

6.3.3 Code and Descriptions

Often, there are standard codes in a business that require a description to be used effectively by end-users. One example is the state codes that were used in the previous sections. However, there are many more cases of abbreviations or other codes that are enriched with descriptions. For example, the FAA uses the operation codes in Table 6.10 to classify aircraft regarding their intended use:

Table 6.10. List of FAA Standard Operations Codes

Standard Operations Code | Description
N | Normal
U | Utility
A | Acrobatic
T | Transport
G | Glider
B | Balloon
C | Commuter
O | Other

The Standard Operations Code is used in all business processes of the FAA, in both internal and external interfaces. Everyone in the airline industry who deals with the FAA knows the meaning of these codes. But there are also glossaries that translate the codes into descriptions, providing more meaning to those users who are not using these codes every day. And because these codes are so widely used in the business processes, there is a 100% chance that the code or the description will show up in a user-interfacing component, such as a report or an OLAP dimension. For example, they are often used to group information on a report or to aggregate measures over these codes. By integrating the description into the user interface (in addition to or instead of the code), the usability of the report or OLAP view is drastically increased for casual users of the presented information.

It should be clear that there are many such lists of codes or abbreviations and their corresponding descriptions. Instead of creating a separate reference table, with or without history, for each of these lists, we introduce a code and description table that groups these lists into one categorized table (Figure 6.12).

Figure 6.12. Code and descriptions reference table (logical design).

Figure 6.12 shows a minimal reference table for code and descriptions. Usually, there will be additional descriptive attributes in such a table, for example:

Short description: for use in charts and other diagrams, because there is only limited room for captions in bar charts, pie charts, etc.

Sort order: most of the reference data is not sorted alphabetically when used in a report. Instead, the business wants to decide how to order entries in dimensions.

External reference: oftentimes, this is a URL where more information about the reference data entry can be found. Useful to integrate your reference data with Wikis on the Intranet.

Owner: indicates the functional unit that is responsible for maintaining the record.

Comment: free text to describe the reference data entry to the business user who maintains the record.

The ER model of the code and descriptions reference table is presented in Figure 6.13.

Figure 6.13. Code and descriptions reference table (physical design).

The layout of the entity follows the layout of the other reference tables by using the descriptive business key combination as the primary key for the table.

Note that this approach is only applicable if the reference data uses the same data types and the same attributes that describe the code. If the structure of the reference data is different, individual reference tables are used.

The FAA data presented in Table 6.10 would be stored in the physical table as shown in Table 6.11.

Table 6.11. Code and Descriptions Table

Group | Code | Description
StdOpCode | N | Normal
StdOpCode | U | Utility
StdOpCode | A | Acrobatic
StdOpCode | T | Transport
StdOpCode | G | Glider
StdOpCode | B | Balloon
StdOpCode | C | Commuter
StdOpCode | O | Other
RstOpCode | 0 | Other
RstOpCode | 1 | Agriculture and Pest Control
RstOpCode | 2 | Aerial Surveying
RstOpCode | 3 | Aerial Advertising
RstOpCode | 4 | Forest
RstOpCode | 5 | Patrolling
RstOpCode | 6 | Weather Control
RstOpCode | 7 | Carriage of Cargo

Table 6.11 includes data from two groups: standard operation codes (StdOpCode) and restricted operation codes (RstOpCode) by the FAA. Both groups are identified by the Group attribute and have one or more records identified by a unique Code attribute. Therefore, the primary key of the table has to be on both columns (Group and Code).
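A minimal DDL sketch of such a table follows; the schema name, table name, and data types are assumptions, and only the attributes from Figure 6.13 are shown.

-- Sketch: code and descriptions reference table with a composite primary key on
-- Group and Code (cf. Figure 6.13). Schema name and data types are assumptions.
CREATE TABLE DataVault.RefCodes (
    [Group]     VARCHAR(20)   NOT NULL, -- identifies the list, e.g. 'StdOpCode'
    Code        VARCHAR(10)   NOT NULL, -- the code within the list, e.g. 'N'
    Description NVARCHAR(100) NOT NULL, -- e.g. 'Normal'
    CONSTRAINT PK_RefCodes PRIMARY KEY CLUSTERED ([Group], Code)
);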

There are two options for using the code and descriptions reference table in satellites. It is possible to use the composite primary key, consisting of the attributes Group and Code, as a foreign key (with or without referential integrity) in the satellite. Or, if foreign keys are not used in the model, it is also possible to use only a Code attribute in the satellite and identify the Group implicitly via the satellite attribute. This means adding hard-coded filters to WHERE clauses when the data is retrieved or joined for resolution, as the query sketch below shows. This is an acceptable practice, as the model is "type-coding" codes and descriptions, allowing all codes to exist in a super-typed table at a subtype-level grain. However, the second approach requires documentation in order to know which Group belongs to which satellite attribute, without the need to analyze the code of virtual facts and dimensions or ETL code.
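The second option then looks roughly like the following sketch, where the satellite, its StdOpCode attribute, and the RefCodes table are assumptions; the hard-coded Group filter documents which list the attribute resolves against.

-- Sketch: resolving a satellite attribute against the grouped code and
-- descriptions table. Table and column names are assumptions.
SELECT sat.AircraftHashKey,
       sat.StdOpCode,
       ref.Description AS StdOpDescription
FROM DataVault.SatAircraft sat
LEFT OUTER JOIN DataVault.RefCodes ref
    ON  ref.Code    = sat.StdOpCode
    AND ref.[Group] = 'StdOpCode'; -- hard-coded filter identifying the group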

6.3.3.1 Code and Descriptions with History

It is also possible to store history with the code and descriptions reference table. This is done similarly to the history-based reference table presented in section 6.3.2. The logical model for such a table is presented in Figure 6.14.

Figure 6.14. Codes and descriptions reference table with history-tracking satellite (logical design).

As the figure shows, this concept is again based on a satellite that holds the attributes that require tracking of history. In this example, this is the case for Short Description and Long Description. If there are attributes where the history should not be tracked, they can be added to the reference table itself. The attribute Sort Order follows this approach.

The ER diagram for this history-based code and description table is shown in Figure 6.15.

Figure 6.15. Codes and descriptions reference table with history-tracking satellite (physical design).

Having a composite primary key in the reference table requires that the satellite reference both primary key attributes. The attributes without history tracking, in this case SortOrder, are added to the parent table, and the attributes with history tracking, here ShortDescription and LongDescription, are added to the satellite.
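A DDL sketch of this satellite could look as follows; all names and data types are assumptions, and the commented-out foreign key assumes a parent table keyed by Group and Code.

-- Sketch: history-tracking satellite referencing both primary key attributes of
-- the code and descriptions table (cf. Figure 6.15). Names and types are assumptions.
CREATE TABLE DataVault.SatCodes (
    [Group]          VARCHAR(20)   NOT NULL,
    Code             VARCHAR(10)   NOT NULL,
    LoadDate         DATETIME2(3)  NOT NULL,
    LoadEndDate      DATETIME2(3)  NULL,
    RecordSource     VARCHAR(50)   NOT NULL,
    ShortDescription NVARCHAR(30)  NULL, -- historized
    LongDescription  NVARCHAR(200) NULL, -- historized
    CONSTRAINT PK_SatCodes PRIMARY KEY ([Group], Code, LoadDate)
    -- Optional referential integrity to the parent reference table:
    -- ,CONSTRAINT FK_SatCodes_RefCodes
    --     FOREIGN KEY ([Group], Code) REFERENCES DataVault.RefCodes ([Group], Code)
);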

Chapter 9 describes in more detail how to create such entities in Master Data Services and how to load them into the Data Vault 2.0 model.


URL:

https://www.sciencedirect.com/science/article/pii/B9780128025109000064

Databases

Jeremy Faircloth, in Enterprise Applications Administration, 2014

Unique Keys

Unique keys (or simply "keys") are an integral part of using relational databases. The DBMS allows you to define specific columns within each table as keys, which are the best way to identify distinct rows of data within the table. Each table can have multiple keys, and each defined key can be composed of multiple columns. The intent is for the DBMS to respond with a single distinct row when the table is queried using the values of a given key. For example, a key could be built from the combination of the columns FirstName, LastName, and AddrId. If this were done, the database would have to enforce the uniqueness of this key by preventing the addition of any data that would make it nonunique. That means that you could never have two rows containing identical first and last names with a reference to the same address.

Out of the keys defined on a database table, one of the keys must be defined as the "primary key." The primary key is the key that other tables should use to reference data within the table. This primary key, like any unique key, cannot be duplicated within the table. In many common databases, the primary key is a single column that uses a uniquely generated value for each row. Our sample database schema includes an ID column in each table that can be used as the primary key.

Foreign Keys

Whenever a table references the data in another table through its primary key, that value is considered the "foreign key." This is simple to remember in that the primary key for any given table will always be called the foreign key when referring to it from the context of any other table. In our sample database schema, you can see many references to other tables through the use of their primary keys. For example, the Orders table has a column called CustId, which is the foreign key to the Customers table. This foreign key relationship is what allows the data between these two tables to be appropriately linked.
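As a sketch, the relationship described above could be declared as follows; the data types and the simplified shape of the Customers and Orders tables are assumptions, while the Id, CustId, FirstName, LastName, and AddrId columns follow the text.

-- Sketch of primary, unique, and foreign keys on simplified Customers and Orders
-- tables. Data types and the reduced column set are assumptions.
CREATE TABLE Customers (
    Id        INT IDENTITY(1,1) NOT NULL,
    FirstName VARCHAR(50)       NOT NULL,
    LastName  VARCHAR(50)       NOT NULL,
    AddrId    INT               NOT NULL,
    CONSTRAINT PK_Customers PRIMARY KEY (Id),                              -- primary key
    CONSTRAINT UQ_Customers_Name_Addr UNIQUE (FirstName, LastName, AddrId) -- multi-column unique key
);

CREATE TABLE Orders (
    Id     INT IDENTITY(1,1) NOT NULL,
    CustId INT               NOT NULL, -- foreign key to Customers
    CONSTRAINT PK_Orders PRIMARY KEY (Id),
    CONSTRAINT FK_Orders_Customers FOREIGN KEY (CustId) REFERENCES Customers (Id)
);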


URL:

https://www.sciencedirect.com/science/article/pii/B9780124077737000041

Deriving Initial Project Backlogs

Ralph Hughes, in Agile Data Warehousing Project Management, 2013

Sales channel

FA: Then there's the reference table for sales channel. The channel tells you which part of our organization is responsible for the relationship with the customer. It determines which sales rep, regional director, and vice president gets dinged if revenue falls below a particular amount. We used to track the channel on customer records, but because the right channel assignment was getting changed with every regional reorganization, now we treat it as a notion independent of the customer identity. Plus, channel assignments for the business marketing group are determined by some really complicated rules that involve how big a customer company is, how they ordered their last product, and the mix of technology they're using, plus the level of spend between products. For example, a U.S. company spending mostly on bulk minute plans will move to a particular channel if they start spending more on data, but a completely different channel if they merge with a multinational.

PA: Are the business rules consistent between the regions?

FA: Hardly. Because channel traces back to the organization, each business unit and region maintain their own rollup scheme, and of course those schemes change all the time, just to make life interesting for those of us in finance.

PA: Then I'll write a constraint card for "Allow each business unit–region to have distinct channel rules" (Item 34). Because the channel sounds like it changes on a different schedule than either segment or customer attributes, we'll include it as another separate item on the business target model (Object 6).


URL:

https://www.sciencedirect.com/science/article/pii/B9780123964632000053

Implementing Data Quality

Daniel Linstedt, Michael Olschimke, in Building a Scalable Data Warehouse with Data Vault 2.0, 2016

13.7.1 T-SQL Example

Because the analytical master data has been provided as reference tables to the enterprise data warehouse, it is easy to perform lookups into the analytical master data, even if the Business Vault is virtualized. The following DDL statement is based on the computed satellite created in the previous section and joins a reference table to resolve a system-wide code to an abbreviation requested by the business user:
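A minimal sketch of such a view is shown below; DistanceGroup, RefDistanceGroup, and DistanceGroupText follow the text, while the view, satellite, schema, and remaining column names are assumptions.

-- Sketch: computed (Business Vault) satellite view that resolves the source code
-- DistanceGroup via the reference table RefDistanceGroup. Except for the names
-- mentioned in the text, all identifiers are assumptions.
CREATE VIEW BusinessVault.TSatFlightDistanceGroup
AS
SELECT sat.FlightHashKey,
       sat.LoadDate,
       'Business Vault (computed)' AS RecordSource, -- new record source for the computed satellite
       sat.DistanceGroup,
       ref.Abbreviation AS DistanceGroupText        -- abbreviation known to the business
FROM DataVault.SatFlight sat
LEFT OUTER JOIN DataVault.RefDistanceGroup ref
    ON ref.Code = sat.DistanceGroup;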

The source attribute DistanceGroup is a code limited to the source system. It is not used organization-wide. Therefore, the business requests that the code should be resolved into an abbreviation known and used by the business. The mapping is provided in the analytical master data, which is loaded into the Data Vault 2.0 model as reference tables. The resolution of the system codes into known codes can be done by simply joining the reference table RefDistanceGroup to the computed satellite and adding the abbreviation as a new attribute, DistanceGroupText.

Because the computed satellite has modified the data, a new record source is provided.

Another use-case is to align the formatting of the raw data to a common format, for example when dealing with currencies, dates and times, or postal addresses.


URL:

https://www.sciencedirect.com/science/article/pii/B9780128025109000131

Data Quality

David Loshin, in Business Intelligence (Second Edition), 2013

Invalid Values

A significant amount of information is encapsulated using code sets and reference tables. The validity of a code value can be directly related to its use within a business process context and the existence of a valid mapping to additional reference information. In fact, invalid codes can significantly impact BI and analytics due to the integral hierarchical relationship between the code sets and domain dimensions, especially when the hierarchy is embedded within the code values (a good example is industry classification codes). Values may be invalid because they are not enumerated within the code set or because they are inconsistent with other dependent data element values.


URL:

https://www.sciencedirect.com/science/article/pii/B9780123858894000120

Store and Share – Entity Identity Structures

John R. Talburt, Yinle Zhou, in Entity Information Life Cycle for Big Data, 2015

External Reference Architecture

In the external reference architecture, the IKB is a large cross-reference table connecting equivalent references located in the various client systems. The EIS in the IKB are entirely virtual, only containing pointers to the references to a particular entity. None of the actual entity identity information for a particular entity is stored in the IKB.

Both the identity attribute values and the application-specific attribute values of the source record reside in the client system as shown in Figure 4.8. The advantage of external reference architecture is that changes to an entity identifier taking place in one system can be more easily propagated to all other client systems where the same entity is referenced.

Figure 4.8. External reference schematic.

The external reference architecture works best when the governance policy allows for distributed authority to make master data changes in several different client systems. It does not work as well in systems where a large number of new source records must be ingested and identified on a regular basis. In systems implementing external reference architecture, the identity information needed for matching must be marshaled on demand from the client systems where it resides.


URL:

https://www.sciencedirect.com/science/article/pii/B9780128005378000041

Facets of the DQAF Measurement Types

Laura Sebastian-Coleman, in Measuring Data Quality for Ongoing Improvement, 2013

Measurement Type #27: Validity Check, Single Field, Detailed Results

Definition

Description: Compare values on incoming data to valid values in a defined domain (reference table, range, or mathematical rule).

Object of Measurement: Content/Row counts

Assessment Category: In-line measurement

Cross Reference: 1, 20, 28, 29, 31, 39

Business Concerns

Assuring that data is valid, that actual data values adhere to a defined domain of values, is fundamental to data quality assurance. Domains of valid values may be established within reference tables, defined as ranges, or they may be defined through a mathematical formula or algorithm. In all cases, they represent a basic expectation about the data: specifically, that data values on incoming records will correspond to valid values in the domain; and that the actual values will thereby be comprehensible to data consumers. Validity checks mitigate the risk of having sets of values in the database that are not defined within the value domain. Because such checks detect new values in core data, they can also serve as a prompt to update reference data that may have changed over time.

The results of validity checks come in two forms: details and roll-ups. Detailed results present the counts and percentages of records associated with each specific value (see Measurement Type #30: Consistent Column Profile), along with indicators as to whether or not values are valid. Roll-ups present the overall numbers of and overall percentages of valid and invalid records.

The purpose of this in-line measurement is to identify the levels of specific invalid values within the fields measured. Once these are identified, support work is required to determine why they are present and to make updates or process improvements that reduce the overall incidence of invalids. The measure also provides the raw data needed for validity roll-ups representing the overall percentages of valid and invalid values.

It is not likely that a change in the level of validity for any particular value or even the overall level of validity for a single column will be critical enough to require a database stoppage. If there is data whose validity is highly critical, it is best to identify it as part of an intake check such as the one described under Measurement Type #17.

Measurement Methodology

This in-line reasonability measure identifies distinct values within a field and calculates a percentage distribution of those values. It compares the values to those in the defined data domain in order to identify which values are valid and which are invalid. The percentage of any individual value can then be compared to past percentages of that value in order to detect changes in the patterns of incremental data. Automation of this capability is especially important in data domains with a high number of distinct values. High cardinality can be prohibitive to analysis of trends.

In addition, because the result set will include multiple rows, a single threshold for investigation cannot be applied; nor is it realistic to manage multiple thresholds through a manual process. Instead, an automated threshold should be applied: for example, three standard deviations from the mean of past percentages of rows associated with a particular value. Setting an indicator on measurement results for which the percentage of records exceeds three standard deviations from the mean of past percentages enables analysts to home in on potential anomalies. This measurement does not generally require notifications (see Figure 16.2).

Figure 16.2. In-line Validity Measurement Pattern

The process to confirm data validity is similar regardless of how the domain of valid values is defined (set of valid values, range of values, or rule). First, the data that will be validated must be identified and its domain defined. Next, record counts for the distinct value set must be collected from the core data. Then the distinct values can be compared to the domain, and validity indicators can be assigned. The results of the comparisons constitute the detailed measurements. These can be compared to past measurements to identify statistically significant changes in the percentage distributions. For reporting purposes, these results can be rolled up to an overall percentage of valid and invalid values. These can also be compared to past results. If the measurements produce unexpected results, notifications will be sent and response protocols initiated. All results will be stored in results tables for reporting purposes.

Programming

Input for this measurement type includes metadata that identifies the fields that will be tested for validity as well as the domains against which they will be tested. It also requires past measurement results. The process must count the overall number of records in the dataset as well as the number of records associated with each distinct value for the field being measured. From these two counts, it must calculate the percentage of records associated with each distinct value. Having obtained the distinct values, the process must compare them with the valid values as defined by the domain rule and assign an indicator, designating each value as either valid (contained in the domain) or invalid (not contained within the domain). With each run of the measurement, for each distinct value, the process should compare the percentage of records to the historical mean percentage from past runs, calculate the difference between the current percentage and the historical mean, calculate the standard deviation of the percentages from past runs (and from it the three-standard-deviation threshold), and assign an indicator to each distinct value showing whether the current percentage is more than three standard deviations from the mean of percentages of past runs.
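A sketch of the core of this logic in SQL follows; the incoming_data table, its code_value column, and the reference_table domain are hypothetical names used only for illustration, and the comparison against historical runs is indicated in the trailing comment.

-- Sketch: detailed validity results for one field. The tables incoming_data
-- (column code_value) and reference_table (column valid_value) are hypothetical.
WITH totals AS (
    SELECT COUNT(*) AS dataset_total_record_count
    FROM incoming_data
),
value_counts AS (
    SELECT code_value,
           COUNT(*) AS measurement_record_count
    FROM incoming_data
    GROUP BY code_value
)
SELECT v.code_value AS data_value,
       CASE WHEN r.valid_value IS NULL THEN 'N' ELSE 'Y' END AS validity_indicator,
       v.measurement_record_count,
       CAST(v.measurement_record_count AS FLOAT)
           / t.dataset_total_record_count * 100 AS pct_of_records
FROM value_counts v
CROSS JOIN totals t
LEFT OUTER JOIN reference_table r
    ON r.valid_value = v.code_value;
-- Each pct_of_records value would then be compared to the historical mean and the
-- three-standard-deviation threshold derived from past runs of the same metric.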

Support Processes and Skills

Since it is not likely that a change in the level of validity will require a database stoppage, this measurement is not likely to require notifications to alert staff to urgent problems. Instead, staff is required to review both the roll-ups and the detailed results to identify invalid values and research their presence in the database. From initial assessment and profiling, knowledge should be captured related to business-expected patterns in the distribution of values. Since the measure detects changes in patterns, support staff should also investigate changes in trend to determine whether these are business-expected. Any findings should be added to the metadata.

Measurement Logical Data Model

The Metric Definition table for Measurement Type #27 contains the following attributes (as defined under #6): Measurement Type Number, Specific Metric Number, Dataset Name, Dataset Source, Dataset Type, Data Quality Threshold Type, Data Quality Threshold (if threshold is set manually), and the attributes contained in Table 16.15.

Table 16.15. Metric Definition for Measurement Type #27

Attribute Name | Attribute Definition
Target Column 1 | This field indicates which field or Target Column is being measured (see #15).
Validity Type | This field defines how validity will be established (through comparison to reference data, through a defined range, or through a rule). Other fields within the validity table will be populated based on which type of validity is measured.
Reference Data Table | For validity tests based on reference data, this field contains the name of the table that contains the values against which validity will be tested.
Reference Data Column 1 | For validity tests based on reference data, this field contains the name of the column on the reference table that contains the values against which the validity of Target Column 1 will be tested.
Range Minimum | For any measurement based on a range of values, this field represents the lowest value in that range (see #6).
Range Maximum | For any measurement based on a range of values, this field represents the highest value in that range (see #6).
Validity Business Rule | For any measurement based on a business rule, this field contains the rule (in this case, a validity rule).

The Results table for Measurement Type #27 contains Measurement Type Number, Specific Metric Number, Measurement Date, and Notification Sent Indicator (as defined in #6), and the attributes listed in Table 16.16.

Table 16.16. Measurement Results for Measurement Type #27

Attribute Name | Attribute Definition
Dataset Total Record Count | This field contains the number of records present in the dataset being measured (see #10).
Data Value 1 (Target Column 1) | This field contains the value present on the data being measured in the column being measured. Values from Target Column 1 are stored in Data Value 1, those from Target Column 2 in Data Value 2, etc.
Validity Indicator | Y = value is valid; N = value is invalid
Measurement Record Count | The number of records that meet the condition being measured (see #14)
Percentage of Records Associated with the Data Value | (Measurement Record Count / Dataset Total Record Count) * 100
Historical Mean Percentage of Records Associated with Each Data Value | Calculated historical mean (see #14)
Difference from the Historical Mean Percentage of Records Associated with the Data Value | Historical mean percentage minus the percentage of the current measurement
Threshold Exceeded Indicator for Difference from the Historical Mean Percentage of Records Associated with the Data Value | Records whether or not a specific measurement is greater than the established data quality threshold (see #6)


URL:

https://www.sciencedirect.com/science/article/pii/B9780123970336000171

Compliance Concepts

In Securing HP NonStop Servers in an Open Systems World, 2006

Analysis of Requirements in Common

Examining the sample compliance standards and regulations reveals they have some basic requirements in common. As the cross-reference table in Figure 1.3 depicts, these requirements can be logically grouped into four security categories:

Figure 1.3. Cross-reference of compliance requirements

Authentication

Authorization

Auditing

Integrity and Confidentiality

These requirements apply to all platforms. Following is a discussion particular to the NonStop environment.

Authentication Requirements

Native HP NonStop Guardian security provides a foundation for userids and passwords. There are, however, limitations within this foundation. As configured in TACL, users can completely remove any password from their account. No minimum length is enforced. Unless disallowed by customizing the TACL object file, users can input the password as part of the logon line, allowing it to be viewed in the clear on the screen or to be embedded in obey files or TACL macros.

Safeguard provides the following enhancements to Guardian security:

Required password on user accounts

Password history

Encryption of passwords in the userid data bases

Minimum length of passwords

Password expiration

Use of aliases

Industry standards and best practices all require that each user be given a unique userid to logon to the system. Assigning multiple aliases to one underlying userid as a substitute for shared userids does not meet compliance standards. While providing identification for logon, this practice grants all the privileges of the underlying userid, even though alias users typically need only a subset of privileges to perform their job. Also, aliases may not appear in all audits, so tracking true user identification and individual accountability are impossible.

Safeguard will be included with the shipment of all new Itanium systems. This is a valuable move forward that emphasizes HP's commitment to help its customers secure the NonStop platform.

Third-party security products provide complementary security enhancements, such as multi-factor authentication and granular quality control for passwords and alias management.

Authorization Requirements

In native Guardian mode and OSS, security strings protect files. Access to files is authorized via one setting for each of four operations: READ, WRITE, EXECUTE, and PURGE. Each operation may be granted to one of three classes of users: owner, Guardian group or anybody. Each of these three classes can be limited to local only or remote access. When someone is authorized for network wide access, it is commonly referred to as "the World." Access to the World is a surprisingly common and very risky practice, especially for READ and EXECUTE.

Used in conjunction with Guardian, Safeguard provides more granularity and selectivity in granting access. Safeguard Protection Records allow security administrators to grant or restrict access to objects such as files, subvolumes, and disks to multiple groups and/or users.

The practice of using Safeguard to secure an executable file so that it will launch a process as the owner of the file rather than the person who executed the file is called PROGID. This practice is an inadequate substitute for true access control and can lead to more problems than it solves, because PROGID'd programs can easily be used for malicious purposes.

Third party products can enable system managers and security administrators to fine-tune access to files and programs. They make it possible to:

Permit system utilities to be executed as a privileged userid (removing the need to PROGID), but restrict commands within the utility to those required for the user's job function.

Permit operators to manipulate reports in the SPOOLER, but prevent them from viewing report contents.

Permit operators to bring up a Pathway as the correct application owner without having to logon as the application owner.

Auditing Requirements

Without audits, it is impossible to determine what actions were performed on your system, when and by whom. Auditing and effective tools for monitoring, reporting, and alerting are essential to properly securing an HP NonStop Server system. A continuous cycle of auditing security events as shown in Figure 1.4 is crucial to achieving and sustaining compliance.

Figure 1.4. The security monitoring process

The Safeguard audit service provides the ability to record and retrieve information about a wide range of audited events recorded in the Safeguard audit files. Some entities are automatically recorded; for others, auditing is configurable.

HP's Event Management Service (EMS) messaging environment provides auditing of system events. Applications can be programmed to output messages to the EMS system. This provides operations groups with the ability to monitor system activity and react to any abnormal condition in a timely manner.

Third-party products are available to supply enhanced auditing capabilities. More comprehensive auditing, greater granularity, and convenient reporting tools are some of the benefits of third-party products.

Although Event Management Service Analyzer (EMSA) can be used to view EMS logs and SAFEART can be used for Safeguard audits, third-party products are essential for comprehensive audit reporting and alerting. In addition, third-party products that combine Safeguard and EMS audits with their own product audits make audit reporting simpler with flexibility to customize reporting based on specific business and security goals.

Audit reports can be classified as three different types:

Scheduled reports

Research or ad hoc reports

Alerts related to critical security events

Scheduled reports usually document only exception events. This approach avoids the time and errors associated with reviewing pages and pages of common and expected activities. Scheduled exception reports are intended to run automatically and be reviewed in a timely manner. Even the best, most informative reports are useless if no one looks at them and takes action based on the content.

Just as important as regularly scheduled reports are ad hoc reports. Promptly finding answers to questions as they arise is a prerequisite to making intelligent security decisions. Another use of ad hoc reporting is investigating suspected security breaches. Reporting tools must be quick, flexible, and easy to use for these important security administration activities.

Automatic real-time or near-real-time alerting is absolutely essential to maintaining proper security and reliable systems. Messages just rolling off a printer or terminal in the operations area are not an effective alerting system. E-mail, pager, phone, and/or text messaging are crucial to protecting the system. If someone is repeatedly attempting to logon as SUPER.SUPER, it is not acceptable to wait until tomorrow's report to address the threat. The same urgency applies to software or hardware failures.

Integrity and Confidentiality Requirements

Encryption is the most effective way to protect the privacy and integrity of sensitive data. It solves many problems, including those resulting from lost or stolen backup tapes, disks or hard copies of files. NonStop Server processing power efficiently supports encryption activities. While native NonStop security and Safeguard provide no provisions for encrypting data other than passwords, HP offers the Atalla line of encryption devices that can be used for hardware-based encryption. Third-party products provide mechanisms to implement both hardware and software encryption in applications, databases, files and network communications.

Figure 1.5 illustrates how sensitive fields of data records can be encrypted by applications as the data is being saved to disk. This protects data at rest, accessed either while on the local disk or residing on backup medium. When encryption is built into application databases, only the sensitive data fields need to be encrypted. The non-sensitive data fields remain readable for processing or reporting purposes.

Figure 1.5. Protecting data at rest

A related area is that of protecting communications between systems. This is important because at every logon via telnet session to the NonStop Server, passwords and data are transmitted in clear text from the user's desktop PC across the network to the NonStop Server. Network ports are almost as numerous in offices as electrical outlets. This makes it easy to attack a system with a rogue sniffer connected to an internal or external network. Sniffing is transparent to applications, which reduces the chances of catching intruders before userids and sensitive data are harvested. Capturing authorized userids enables intruders to gain further access to company systems and data. Intruders masquerading as authorized users can be very difficult to detect.

Other types of communications between systems have the same or similar vulnerabilities as telnet sessions. File transfers, FTP, EXPAND, and products that provide connectivity from one system to another are used at increasing risk unless passwords and other sensitive information are encrypted.

Third-party products, using a variety of crypto mechanisms, are available to encrypt data as it is transported. Compared to the increasing frequency and costs of breaches, encryption products are very cost effective. The following diagram illustrates the potential for exposure related to data in transit across a network.

Figure 1.6. Protecting Data in Transit


URL:

https://www.sciencedirect.com/science/article/pii/B9781555583446500044

Loading the Data Vault

Daniel Linstedt, Michael Olschimke, in Building a Scalable Data Warehouse with Data Vault 2.0, 2016

12.2.1 No-History Reference Tables

If reference data should be loaded without taking care of the history, the loading process can be drastically simplified by using SQL views to create virtual reference tables. A similar approach was described in Chapter 11, Data Extraction, when staging master data from Microsoft Master Data Services (MDS) or any other master data management solution that is under control of the data warehouse team and primarily used for analytical master data. This approach can be used under the following conditions:

1. History not required: again, this solution is applicable for nonhistorized reference tables only.

2. Full load in staging area: the source table in the staging area provides a full load and not a delta load.

3. Same infrastructure: the staging area is located on the same infrastructure as the data warehouse. If a different database server is used to house the staging area, the performance of the virtual reference tables could be impacted.

4. Full control over staging area: the staging area is under full control of the data warehouse team and the team decides about structural changes. The last thing that should happen in production is an uncontrolled update to the staging area that breaks a virtual reference table.

5. Reference data in staging area is virtualized as well: this condition rules out most applications but is important because the staging area should not be used as the central storage location. If reference data in the data warehouse layer virtually depends on data in the staging area, the Data Vault 2.0 architecture has been violated.

6. All required data is available: in some cases, the source system loses old records (e.g., old countries). If this is acceptable because old records are not required in the reference table, then this condition is negligible. However, because the data warehouse provides historic data, all codes referenced in satellites have to be resolved by the reference table in the data warehouse layer.

7. No data transformation required: the data in the staging area is already in a format that requires no processing of soft business rules, in order to prevent the execution of conditional logic when loading the Raw Data Vault.

If all these conditions are met, a virtual SQL view can be created in order to virtually provide the reference data to the users of the Raw Data Vault. This approach is typically used when providing reference data from an analytical MDM solution that is under control and managed by the data warehouse team. Such data is also staged virtually and centrally stored in the MDM application. The following DDL creates an example view that implements a nonhistorized reference table in the Raw Data Vault:
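A minimal sketch of such a view could look like the following; the view name, the staging view name, and the column list are assumptions.

-- Sketch: virtual, nonhistorized reference table on top of a staging view.
-- All identifiers are assumptions; all columns are listed explicitly on purpose.
CREATE VIEW DataVault.RefCountry
AS
SELECT stg.LoadDate,
       stg.RecordSource,
       stg.Code,                               -- e.g. ISO country code
       CAST(stg.Name AS NVARCHAR(100)) AS Name -- hard rule only: data type alignment
FROM Staging.MDS_Country stg;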

The view selects data from a table in the staging area, which is itself a virtually provided staging view (refer to Chapter 11 for details). All columns are provided explicitly, both to avoid taking over columns that are not required and to prevent unforeseen changes to the underlying structure from propagating into the data warehouse. The view doesn't implement any soft business rules, but it might implement hard business rules, such as data type alignment. It does, however, bring the reference data from the staging area into the desired structure of a reference table, as discussed in Chapter 6, Advanced Data Vault Modeling.

This approach is most applicable for loading analytical master data from a master data management application such as Microsoft Master Data Services. Virtual reference tables are especially used in the agile Data Vault 2.0 methodology to provide the reference data as quickly as possible. If the business user agrees with the implemented functionality and materialization is required, the reference data can be materialized in a subsequent sprint, stretching the actual implementation of new functionality over multiple sprints.

12.2.1.1 T-SQL Example

In many other cases, especially if the data is already staged in the staging area, it should be materialized into the data warehouse layer to ensure that data is not spread over multiple layers. This decoupling from the staging area prevents any undesired side-effects if other parties change the underlying structure of the staging area. In such cases, the reference table is created in the data warehouse layer, for example by a statement such as the following:
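A sketch of such a statement is shown below; apart from the Code primary key column described next, the schema, table, and column names as well as the data types are assumptions.

-- Sketch: materialized, nonhistorized reference table with the natural Code key
-- as a clustered primary key. Identifiers other than Code are assumptions.
CREATE TABLE DataVault.RefCountry (
    Code         CHAR(3)       NOT NULL, -- natural key, e.g. ISO country code
    LoadDate     DATETIME2(3)  NOT NULL,
    RecordSource VARCHAR(50)   NOT NULL,
    Name         NVARCHAR(100) NOT NULL,
    CONSTRAINT PK_RefCountry PRIMARY KEY CLUSTERED (Code)
);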

The structure of the reference table follows the definition for nonhistorized reference tables outlined in Chapter 6. The primary key of the reference table consists of the Code column. Because this column holds a natural key instead of a hash key, the primary key uses a clustered index. There are multiple options for loading the reference table during the loading process of the Raw Data Vault. The most commonly used approach adds new and unknown reference codes from the staging area into the target reference table and updates records in the target that have changed in the source table. This way, no codes that could be used in any one of the satellites are lost. While it is not recommended to use the MERGE statement in loading the data warehouse, it is possible to load the reference table this way:
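A sketch of such a MERGE statement follows, based on the RefCountry example above; the staging table and column names are assumptions.

-- Sketch: MERGE that inserts unknown codes, updates changed records, and ignores
-- codes deleted from the source. Identifiers are assumptions.
MERGE INTO DataVault.RefCountry AS target
USING Staging.MDS_Country AS source
    ON target.Code = source.Code
WHEN MATCHED AND target.Name <> source.Name THEN
    UPDATE SET target.Name         = source.Name,
               target.LoadDate     = source.LoadDate,
               target.RecordSource = source.RecordSource
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Code, LoadDate, RecordSource, Name)
    VALUES (source.Code, source.LoadDate, source.RecordSource, source.Name);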

Because the code column identifies the reference table, it becomes the search condition of the MERGE statement. If the code from the staging table is found in the target, the record in the reference table is updated. If it is unknown, it is inserted. If codes are deleted from the source system, they are ignored in order to preserve all codes in the reference table. Deletes are implemented by adding a WHEN NOT MATCHED BY SOURCE clause:
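For completeness, a sketch of the same statement with the additional clause is shown below; it would physically remove codes that disappeared from the staging table, which is usually not desirable for reference data. Identifiers remain assumptions.

-- Sketch: the same MERGE extended by a delete clause, so that codes removed from
-- the source are also removed from the reference table.
MERGE INTO DataVault.RefCountry AS target
USING Staging.MDS_Country AS source
    ON target.Code = source.Code
WHEN MATCHED AND target.Name <> source.Name THEN
    UPDATE SET target.Name = source.Name
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Code, LoadDate, RecordSource, Name)
    VALUES (source.Code, source.LoadDate, source.RecordSource, source.Name)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;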

The MERGE statement is generally not recommended for use in the loading processes of the data warehouse, for performance reasons and because of other issues with the MERGE statement on SQL Server [2]. Instead, the operations should be separated into individual statements to maintain performance. On the other hand, reference tables often have a relatively small size, so performance doesn't become an issue; therefore, using the MERGE statement might be simpler in some cases. If the reference table is large or performance becomes an issue, the statement should be separated.


URL:

https://www.sciencedirect.com/science/article/pii/B978012802510900012X

Metadata Management

Daniel Linstedt, Michael Olschimke, in Building a Scalable Data Warehouse with Data Vault 2.0, 2016

10.2.7.2 Metadata for Loading Link Entities

Loading Data Vault links follows a similar pattern to loading hubs, but with a little more complexity. The additional complexity is due to the fact that a link table references other hubs to store the relationship between the business keys:

1. Data flow name: the name of the data flow that is loading the target link.

2. Priority: sometimes, link data is sourced from multiple sources. In this case, the priority can be used to determine the order of the data sources when loading the target link, which might affect the record source to be set in the target link.

3. Link identifier: the technical name of the target link.

4. Target link table physical name: the physical name of the target table in the Raw Data Vault.

5. Source table identifier: the technical name of the source data table in the staging area.

6. Source column physical name: the physical name of the source column in the source table that holds the business key.

7. Source column data type: the data type of the source column.

8. Source column required: indicates if the source column allows NULL values.

9. Source column default value: indicates the default value of the source column.

10. Source column computation: if the source column is a computed field, provide the expression that computes the column value for documentation purposes.

11. Source data type: the data type of the source business key column.

12. Business key driving flag: indicates if this business key is part of the driving key (if any).

13. Business key column description: the technical description of the business key column.

14. Business key column business description: a detailed textual description of the business key column in business terms.

15. Business key column business name: the common business key column name that is recognized by business users.

16. Business key column business alias: an alternative business key column name that is recognized by business users.

17. Business key column acronym name: a common acronym coding of the business key column name.

18. Hub identifier: the technical name of the referenced hub.

19. Hub table physical name: the physical table name of the referenced hub.

20. Hub reference number: the number of the hub reference within the sort order of the hub references. This is required to calculate the correct hash key.

21. Hub primary key physical name: the physical name of the primary key column in the referenced hub table.

22. Hub business key physical name: the name of the business key column in the hub.

23. Hub business key column number: the number within the column order of the business key in the hub. Required to calculate the correct hash value.

24. Hub business key data type: the data type of the business key column in the referenced hub table. Can be used for automatically applying hard rules.

25. Target column physical name: the physical name of the target hash key column in the link table.

26. Last seen date flag: indicates if a last seen date is used in the hub and should be updated in the loading process.

27. Attribute flag: indicates if the column is an attribute instead of a business key. This is required to define degenerated links (refer to Chapter 4).

28. Hard rules: references to the hard rules that are applied within the loading process for this business key.

The number of entries per link depends on multiple factors: first, the number of referenced hubs. For each hub reference there is at least one metadata record required to completely define the link. In addition, if a composite business key defines a hub, the dependent link entry in the metadata table for links requires one record per business key part. Table 10.7 shows a simplified example for a link metadata table.

Table 10.7. Metadata for Capturing Source Tables to Data Vault Link Entities

Link Identifier | Target Link Table Physical Name | Source Table Physical Name | Source Column Physical Name | Source Data Type | Hub Table Physical Name | Target Column Physical Name | Hub Business Key Column Number | Hard Rules
L001 | LinkFixedBaseOp | FB_OPS | CARRIER | VarChar(2) | HubCarrier | CarrierHashKey | 1 | HR22.1
L001 | LinkFixedBaseOp | FB_OPS | AIRPORT | VarChar(3) | HubAirport | AirportHashKey | 1 | HR1.2.5
L002 | LinkConnection | CONN | CARRIER | VarChar(2) | HubFlightNo | FlightNoHashKey | 1 | HR22.2
L002 | LinkConnection | CONN | FLIGHT | Integer | HubFlightNo | FlightNoHashKey | 2 | HR22.2
L002 | LinkConnection | CONN | S_AIRPORT | VarChar(3) | HubAirport | SrcAirportHashKey | 1 | HR1.2.1
L002 | LinkConnection | CONN | T_AIRPORT | VarChar(3) | HubAirport | TgtAirportHashKey | 1 | HR1.2.1

The table is simplified because some of the metadata attributes are omitted. The first link, LinkFixedBaseOp, references two hubs: HubCarrier and HubAirport. Both hubs are defined by simple business keys and not by composite business keys. The second link, LinkConnection, also references two hubs, but one of them, HubFlightNo, is defined by a composite business key consisting of two parts: first the carrier ID and second the flight number. The second hub, HubAirport, is referenced twice: as the source airport and as the target airport of the connection. For that purpose, the two references are stored in separate hash key columns.
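As an illustration of why the hub reference number and the business key column number matter, the hash key is calculated over the business keys in exactly that order; the following sketch assumes MD5, a ';' delimiter, and simple trimming and upper-casing as the hard rules in use.

-- Sketch: hash key of LinkConnection calculated from the business keys of the
-- referenced hubs in the order defined by the metadata. The hash function,
-- delimiter, and normalization rules are assumptions.
SELECT CONVERT(CHAR(32),
               HASHBYTES('MD5', UPPER(CONCAT(
                   LTRIM(RTRIM(CARRIER)), ';',                     -- HubFlightNo, key column 1
                   LTRIM(RTRIM(CAST(FLIGHT AS VARCHAR(10)))), ';', -- HubFlightNo, key column 2
                   LTRIM(RTRIM(S_AIRPORT)), ';',                   -- HubAirport (source role)
                   LTRIM(RTRIM(T_AIRPORT))))),                     -- HubAirport (target role)
               2) AS ConnectionHashKey
FROM Staging.CONN;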

Similar to the metadata table for hubs, presented in the previous section, this table contains redundant metadata in favor of usability. Again, it might be valuable to use a metadata tool with normalized metadata tables.


URL:

https://www.sciencedirect.com/science/article/pii/B9780128025109000106