Friday, March 30, 2012

Relationships vs. no relationships

Hello!

I have a really small database in terms of number of objects, but very big in terms of data (possibly millions of records). I always try to design my databases to be in the 3rd normal form (at least I hope they are!), but sometimes (for databases consisting of just a few tables) maintaining M:N relationships looks like an over-bloated approach to me... Let's see an actual example. I have a table like this:

Code Snippet

Create table [Airfares]
(
[OriginCity] Char(3) NOT NULL,
[DestCity] Char(3) NOT NULL,
[FareDate] Datetime Default getdate() NOT NULL,
[Price] Decimal(9,2) Default 0 NOT NULL,
[PriceChange] Decimal(9,2) Default 0 NOT NULL,
[Wrong] Bit Default 0 NOT NULL,
Primary Key ([OriginCity],[DestCity],[FareDate])
)
go

Now, at first there also was an Airline nvarchar(50) field, which stored the airline name. But now the client requested to store fares for a few airlines if it's the same price and city pair. This way we can have 3 airlines offering the same price on the same itinerary for example. So I added a dictionary table called Airlines:

Code Snippet

Create table [Airlines]
(
[AirlineCode] Char(2) NOT NULL,
[Name] Nvarchar(50) NOT NULL,
Primary Key ([AirlineCode])
)
go

Now, there are two things I can do. We are obviously dealing with the M:N relationship type here. I can either add another nvarchar column in the first table, like Airlines nvarchar(255) and store something like 'AA,AB,AC' there - those would be airline codes. OR I could do it the proper way and create a cross-reference table which will connect both Airfares and Airlines tables (this will store as many records as there are airlines for that particular itinerary, offerring the same price):

Code Snippet

Create table [AirfareAirlines]
(
[AirlineCode] Char(2) NOT NULL,
[OriginCity] Char(3) NOT NULL,
[DestCity] Char(3) NOT NULL,
[FareDate] Datetime Default getdate() NOT NULL,
Primary Key ([AirlineCode],[OriginCity],[DestCity],[FareDate])
)
go

But this is starting to look overly complicated, isn't it? The pros of this are I'm going to get real relationships, referential integrity, etc., but cons are more difficult code to query all this. What would you do? Do we really need referential integrity everywhere? Are there times when we're better not using relationships? I have seen many programmers skip referential integrity before, which always made me think they are missing something important in a Relational Database Management System.

The second approach is of course much better. Referential integrity should not be that much of an issue. If you design the database optimally with the correct indexes the performance should be fine.

The downside of the first approach is 2 main isses from my point of view:

You will have to parse the text field in order to find the relations between the airfares and the airlines. This is a performance issue and in the same time it you might not be able to retrieve certain information as easily as you would with a database that is structured with correct relationships (searching for airfares that belong to an airline is one that comes in mind. You have to use LIKE of do a full text index) It violates the main concept of referntial integrity where you can delete an Airline (I know this might not really make sense in your case) and not remove it from the airfares (or not remove all the references). With relationships you can easily delete references without having to handle them yourselves|||

I think that it is a mistake to assume that at all times, all airlines serving the same city pair will have the same prices and same fare dates.

In my opinion, you don't need a third table at this point, add the AirlineCode to the Airfares table. Don't make it more complex than required.

|||

Arnie Rowland wrote:

I think that it is a mistake to assume that at all times, all airlines serving the same city pair will have the same prices and same fare dates.

Of course not. I'm sorry, an explanation is needed - FareDate is actually the date on which I inserted a record to the database, not a date of flight. Wink It's called FareDate because both Date and DateTime are reserved keywords and I'd have to use square brackets in SQL everywhere.

|||

Not a problem. (Often tables have columns such as [CreateDate], [CreateUser], [ChangeDate], [ChangeUser], etc. so as to clearly indicate that it is metadata and not part of the dataset.)

Still, based upon your initial post, I think that you would be best served with only two tables.

|||To make it even more clear - I am inserting only the lowest fare, but in case there are more arlines with the same price for the same itinerary - I need to store all of them. And I can do that by placing them in a comma separated list in one field, or using a cross table as required by database normalization - I'm trying to decide on the best approach here. Smile|||

Stay away from comna delimited list of values in a single field. I have had to 'rescue' many clients from that horrendous mistake.

You never err to stay fully normalized.

|||

Arnie Rowland wrote:

Stay away from comna delimited list of values in a single field. I have had to 'rescue' many clients from that horrendous mistake.

As a matter of fact, the comma delimited list of airline codes was client's idea. Stick out tongue I didn't like it, because I would never think of doing things that way (but I know programmers who do this - worked with them), but thought I might as well ask other developers at MSDN Forums.

I will go with the normalized method.

Thanks.

|||

If you have SQL Server 2005 and you want to present a comma separated view of the normalized data then here is a sample of a trick using the xml functionality (text() and elements) in SQL Server (from an earlier post).

Code Snippet

DECLARE @.ItemTable table

(

ItemID int,

ItemNumber int,

TypeDesc varchar(20)

)

INSERT INTO @.ItemTable Values ( 1, 50000, 'CD' )

INSERT INTO @.ItemTable Values ( 2, 50000, 'Cassette' )

INSERT INTO @.ItemTable Values ( 3, 50001, 'CD' )

INSERT INTO @.ItemTable Values ( 4, 50002, 'CD' )

INSERT INTO @.ItemTable Values ( 5, 50002, 'Cassette' )

INSERT INTO @.ItemTable Values ( 6, 50002, 'DVD' )

INSERT INTO @.ItemTable Values ( 7, 50003, 'Cassette' )

-- Create Delimited list from multiple rows

-- From Tony Roberson

-- SQL 2005

SELECT DISTINCT ItemNumber, List = SUBSTRING(

(

SELECT ', ' + TypeDesc as [text()]

FROM @.ItemTable Det

WHERE Det.ItemNumber = Itm.ItemNumber

FOR XML path(''), elements

), 3, 4096

)

FROM @.ItemTable Itm

This will just run in a query window.

|||

I have a follow-up question. You already know the structure of those three tables, but additionally I need to partition data in Airfares, because it will grow too large and it needs to be really fast, so I will be creating a table called AirfaresHistory and moving old records there. Now, if I wanted to force integrity checks / relationships, I would have to create also an additional cross table to connect AirfaresHistory and Airlines, too. If I don't create relationships, I can have just one cross table and don't need to touch data in it when I move records from Airfares to Airfares history.

I can't use the partitioning feature from SQL Server 2005, because neither I nor the client has the Enterprise Edition, so I have to resort to old methods of creating member tables and a partitioned view. What should I do about relationships in such design? I highly doubt that when you have like 30 member tables you connect each of them to some cross table, do you? Wink It looks like in order to stay sane you should only have informational (on the ERD) relationships when dealing with member tables. Unless you should partition cross tables as well?

|||

You didn't ask for any design comments, but here goes...

It does seem to me that the Airfares table is mis-named. Wouldn't it be more descriptive to name that table something like 'Segments', or 'Routes', or Legs?

And that the table that is most likely to experience growth is what is currently called AirfaresAirlines (which may be more acturately called 'Airfares')? Each Airfare should have a StartDate and an ExpirationDate.

Then you would have an Airline, that serves a Segment, and there would be one or more Airfares for each Airline/Segment during a specific date/time period ...

(The choice of object names seemed a bit confusing for the entities in question.)

I wouldn't worry too much about maintaining relationships on historical data. I would consider NOT allowing data to be deleted, but to have an Active/Inactive flag field.

I consider that the primary purpose of PK-FK relationships is to force all relevant pieces of data to be inserted into the database when a new record is entered.

No comments:

Post a Comment