Monday, December 17, 2007

Interview With Joe Celko About The Forthcoming Book Thinking In Sets

I noticed that Joe Celko has a new book coming out soon: Joe Celko's Thinking in Sets: Auxiliary, Temporal, and Virtual Tables in SQL. I decided to send Joe an email to see if he would be willing to answer some questions I had about the book and SQL in general. To my surprise Joe was more than willing to accommodate my request. The question-and-answer session with Joe that follows was conducted via email.

Is the book geared towards a beginner/intermediate level user or do you have to be an advanced user to really utilize the information in this book?
I would say intermediate level. You need to know enough SQL to do some programming in the language and be experienced enough to see that DDL is as important as DML.

What are the most important things a person can do to master SQL?
The most important thing is to make the leap from procedural programming to declarative programming, from sequential files to sets. The only declarative language that most programmers have seen is spreadsheets. They are nothing like SQL!

I assume you wrote this book because people when they first use a database tend to do the same thing they do in Java, C# or other procedural languages; get a bunch of rows and manipulate them one by one. Is this the number one mistake beginners do?
That is right up there in the top five, for sure! But I think that the classic error is in not knowing how to design the schema in the first place. A bad schema -- denormalized, bad data element names, no constraints, no proper keys, no referential integrity -- leads to trying to correct the flaws in DDL in the DML. If you have a good schema, then the queries, inserts, updates, and deletes are usually fairly easy. All of the "hard work" should be hidden in the database and not spread over the application code and DML.

What is so bad about attribute splitting (somehow these three tables come to mind: Squids, Automobiles and Britney Spears)?
NO, no, no! I coined the phrase a "Squids, Automobiles and Britney Spears" table or column to refer to a table or column which has more than one kind of entity or more than one kind of attribute in it. What makes that phrase so fun is that there is a web page which compared photos of a giant squid and Britney Spears after she cut off her hair. If you actually gave such tables or columns a meaningful name, then you would see that it is not a proper table or column. These nightmares would have names like "hat_or_shoe_size_depending_on_column_five" Attribute splitting is when you take an attribute and make it into two or more tables, columns or rows. The table example is the most common one. How often have you seen a table for each location (split on geography) or a table for each month (temporal split)? This mimics a tape file system, so newbies feel comfortable. When you ask them if they would split a Personnel table into MalePersonnel and FemalePersonnel (split on gender), they immediately see the fallacy. Unfortunately, splits need all kinds of code if they want any data integrity. This code usually re-assembles the data elements back to what they should have been in the first place.

In chapter two of your new book you talk about the new advances in hardware. I have noticed that somehow the amount of data I have to store out paces the advance in hardware and the queries don't run any faster. Will Solid-State Disks finally change that?
I have great hopes for solid state devices. Any solid thing that has to physically move is by definition slower than electricity or light. We are looking at nano-tech, better lasers and a ton of new technology almost every month.

I see you have a whole chapter on auxiliary tables, I am a big fan of those tables myself and I use them to create dates on the fly or split comma delimited strings. In your opinion what percentage of developers does not use them and why should you use them?
Not enough. SQL is a language meant to handle data and not to do computations. Auxiliary tables can be shared among sessions and accessed in parallel. Imagine a complicated but deterministic formula. In a procedural computational model, you hang in a loop and re-do the work for each record in a sequential file. Let me give you a real-world example of that. Corky's BBQ does a huge mail-order business at Christmas time. The pick lists need to include the size of the boxes to be used, so part-time help can do their jobs. When I got this problem, they had discovered that using weights was not right way to go. The approach being attempted was to play "3-D Tetris" with the products. Since that kind of thing is a bitch to program, they were getting nowhere. My approach was to look over a few years of shipping history and find out how many different orders they had shipped and what the smallest box use for each of those orders was. There were only about 5000 configurations and the majority were standard gift packages. Do a relational division and a table look-up to handle 99.98% of the cases and print the phrase "Hand pick this order" on the pick list for the exceptions! In the SQL model, you join an auxiliary table which has the parameters and the result value in each row. You can do this simple multi-column equi-join once in parallel. Wait until the multi-core chips make parallelism the only way to design a computer; then auxiliary table are going to really fly.

I noticed you have a big chapter on VIEWs; do you feel that VIEWs are not utilized enough by developers?
They are used either not enough or too much.

Why do you think we should not use bit or Boolean flags in SQL?
In SQL, to be a data type it must allow NULLs. What is the fundamental nature of a BIT? It is one or zero; there is no NULL concept here! This was a problem for SQL Server, when they made their BIT data type into a numeric that could be NULL-able. The change scrambled a lot of data when it was ported from one release to the next. Bits are low-level, hardware dependent concepts. Are you high end or low end hardware? Do you have 4, 8, 16, 32 or 64 bit machine words? You have use proprietary operators. This defeats the idea of machine independence. Finally, bit flags are used to destroy First Normal Form (1NF) and thus destroy data integrity. As an example, recently in as newsgroup someone wanted to use a 4-bit column to store all the possible colors for a product (red, green, blue, yellow) and get them out with a bit mask (his hardware has a nibble!). But how do you add purple? How do you set up a constraint that no item can be mad with both red and green options? In RDBMS, we discover the state of our data with predicates and not by setting flags at the hardware level.

I have your puzzles book and noticed that you have a paragraph on Sudoku and one on Bin Packing in the last chapter of this book. You have procedural solutions and SQL solutions for most of the material in the chapter; is the SQL solution faster?
There is a funny story on the Sudoku problem. Richard Romley is a retired DBA from Smith-Barney and he decided to play with Sudoku for recreation. He used SQL Server on his home machine and coded a solver in one SELECT statement. The procedure takes 81 parameters (the starting grid) and does an 81-way self-join. It produces ALL the valid answers -- bet you did not know that many published puzzles have multiple solutions! The code is straight forward and depends on the optimizer to handle the search condition logic. Even longer problems with tens of answers run in well under one second. The procedural solvers vary, but I have seen some that stop when they get to the first valid grid. If there is only one solution, they are very fast. But is the right answer actually the *set* of valid grids? Since I am an SQL person, I think so. The procedural solvers can get hung up by backtracking to the starting position when there are a few hundred answers and become very slow. I also strongly recommend getting the Japanese or Chinese editions if you read either language. My two translators cleaned up some old code and added new solutions as we went along.

Should having good naming conventions such as 11179-5 be included in database courses?
Drop them in as soon as you start. If you grow up with good conventions, you will start doing it without thinking about it. When I teach RDBMS, I start with scales and measurement theory so that my students know what data is all about -- whether it is in a database or not.

When can we expect your new book to show up in bookstores?
It was supposed to be out in 2008 February, but we lucked up and it will be out in 2008 January. Production was faster than planned. I guess after seven books, and working with the same people, we have it down pretty well!

A bunch of questions not related to the book

Why do you write technical books?
I have no talent for fiction. I cannot get a plot or characters onto paper to save my life and my dialogues are awful chains of "he said-she said" stiff sentences. My grandfather wrote children's poems in Slovak, and I have even less talent for poetry. I have a number of friends who write detective novels, Science Fiction and Fantasy. They don't consider me a real author because I don't do fiction. I think about trying my hand at YA (Young Adult) books -- Danica McKellar (Winnie from the WONDER YEARS television show) just did a math book for girls, so maybe I could do "A Child's Garden of Normal Forms" or a juvenile detective series called "The Hardware Boys", then go on to a television show called "Query Eye for the Database Guy" or something.

Which of your books was the hardest to write and why?
First edition of SQL FOR SMARTIES! It was my first book and I thought that having written a few hundred magazine columns would make it easy. I was dead wrong -- completely different skill set. I was a year late in delivering the manuscript. After that, I had a system in place.

What is coming down the line? Any new books or updates to current ones?
I am trying to do at least one book per year -- more if I am unemployed and need the advances. My current thoughts are a book on the use of Standards in a database, and one on programming tricks with OLAP functions, CTEs, and other new features in the SQL-2003 Standards. My Morgan-Kaufmann books tend to follow a five year cycle, just like the ANSI/ISO Standards. I also get asked by vendors to do product specific books. I might self-publish something completely off-topic. I have a book on domino games based on my postings and I teach Texas-42 on Royal Carri bean Cruises -- it is a domino game only played in Texas. I also have a book on Pai Gow (a gambling game played with Chinese dominoes) which might sell 10 copies. I will also be doing some video classes, but I don't have details yet.

Which of all the SQL books that you wrote is your favorite?
DATA & DATABASES, which never got the sales of the others. It is more philosophical and concerned with the nature of data instead of programming.

What SQL Server books are on your bookshelf?
Anything I can get by Henderson, Machanic, Moreau, Ben Gan and Delaney. The SQL Server experts are pretty well-known and they publish. This is not true for other products, especially the open-source RDBMS products.

Why do you participate in newsgroups and do you think it is a good idea for beginners to ask questions in newsgroups?
To do some shameless self-promotion, of course :) Newsgroups are a good source of SQL problems and some clever answers that I can use in books and when I am consulting. I also have a pedantic streak I did to get out. And if I am available on a newsgroup, people don't fill up my mailbox at home. And, yes, beginners should use newsgroups for help. But not to have someone else do their job or their homework for them. I like to see the mindset of people who are just learning SQL. It is not enough to see that someone is making a mistake; you want to figure out what lead to that particular mistake. Remember Chernobyl? Everyone did just what they were supposed to do, but there were a few critical assumptions that lead to an event cascade.

What are Cowboy Coders and id-iots?
The term "cowboy coder" is an old one. It means someone who starts coding without any design phase, without any overview to the system as a whole, without any research for industry standards or a company data dictionary. They usually love dialect code and tricks that trade immediate performance for maintainability. The heavy dialect code also gives them job security, since they usually only know one product. An "ID-iot" is a newbie who has no RDBMS education and wants to have the comfort of a sequential file system. So he puts an IDENTITY column on every table as the PRIMARY KEY. Never mind that it is proprietary and non-relational; it is the familiar record number from a file system which can use to mimic pointer chains. He does not understand that rows are not records, tables are not files, columns are not fields and references are constraints and not pointers.

I have been working with Sybase IQ for a little bit; what is your opinion on columnar databases?
Sybase IQ is not the only game in town. I consulted with SAND (nee Marcus, nee Nucleus) years ago. It was one of the first such products. Later I ran into WX2 (nee White Cross) and I am looking at Stonebreaker's Verticia now. Their advantages in parallelism and compressing large amounts of data make them the best choice for Data Warehouses. I would also look at Teradata, which uses hashing. That will become more important as the research on minimal perfect hashing functions gets out of the lab and into products.

Where can we expect to see you in 2008? Any conferences, seminars, trade shows or classrooms perhaps?
I will hopefully be doing some more "SQL Saturdays!" on my weekends. I want to do more webcasts, but I am not sure if I am ready for YouTube. My other travel goal for 2008 is to get to Australia or Japan; I have never gone past Hawaii.

Some of Joe Celko's Books:
SQL for Smarties
SQL Programming Style
Trees and Hierarchies in SQL
SQL Puzzles and Answers
Data and Databases

No comments: