How Far We’ve Come – Pt. 6

As I started to get back into this I realized I skipped one final yet critical piece of the system. The backup script. I imagine some of you questioned where I got the csv files to import. As I said, one of the main reasons I went with Postgres all of those years ago is that I was tired of dealing with the oddities of backup and restore with DBF files and I didn’t want a single OS “solution” which would require a license or me rolling tons of (&)(*&) from scratch. This is something I use for tax season. You know, that season few whose income isn’t based directly on it, enjoy. It had to be solid. It had to be robust. I had to be able to generate the same (content wise) report with the data after a restore which got generated before the restore.

Why does the report have to match? Think for a moment. You get called in for an audit and your tax preparer (you do use one don’t you???) shows up with the original report they put in the file for that year. You show up with one that is different. Sit and ponder just how well that is going to work out for you.

I had to make the $TMPDIR mods to this script we made to our other script.  You will notice the if-then statement became an if-then-else.

declare -i tax_year=1992
declare -i current_year=$(date +%Y)

if [ -z $TMPDIR]; then
 export TMPDIR=/tmp
 sudo rm /tmp/tax*.csv
 rm "$TMPDIR"/tax*.csv

echo "Current Year $current_year"

while [ $tax_year -le $current_year ]
 echo "copy expenses to '$TMPDIR/tax_${tax_year}_expenses.csv' csv header;" > extract_it.sql
 echo "copy categories to '$TMPDIR/tax_${tax_year}_categories.csv' csv header;" >> extract_it.sql
 echo "copy payees to '$TMPDIR/tax_${tax_year}_payees.csv' csv header;" >> extract_it.sql
 echo " Extracting ... $tax_year"
 psql -w -f extract_it.sql tax_$tax_year

let "tax_year += 1"

cp -fv "$TMPDIR"/tax_*.* "$HOME"/postgres_tax_backups

Some of you will notice my backup script is pretty short. Each tax year is in its own database consisting of 3 tables, expenses, categories and payees. I just grind through a variable for the year, dynamically generate an SQL script which isn’t really SQL and when I’ve finished with all of the years copy the files off into a backup directory. Yes, the script assumes the directory exists, feel free to mod it for that deficiency.

Why the 2 step boogie with the files? See that psql command line? The one which executes our dynamic SQL script against the tax year database. You aren’t you when you run that. You are the Postgresql user. The one the server is running under. Unless you want to add it to your group or make your home directory world read/write, you need to do the 2 step boogie.

-rw-r--r-- 1 postgres postgres 6033 Jun 15 11:04 tax_2010_categories.csv
-rw-r--r-- 1 postgres postgres 21858 Jun 15 11:04 tax_2010_expenses.csv
-rw-r--r-- 1 postgres postgres 17872 Jun 15 11:04 tax_2010_payees.csv
-rw-r--r-- 1 postgres postgres 6349 Jun 15 11:04 tax_2011_categories.csv
-rw-r--r-- 1 postgres postgres 21546 Jun 15 11:04 tax_2011_expenses.csv
-rw-r--r-- 1 postgres postgres 20626 Jun 15 11:04 tax_2011_payees.csv
-rw-r--r-- 1 postgres postgres 6428 Jun 15 11:04 tax_2012_categories.csv
-rw-r--r-- 1 postgres postgres 26113 Jun 15 11:04 tax_2012_expenses.csv
-rw-r--r-- 1 postgres postgres 27052 Jun 15 11:04 tax_2012_payees.csv
-rw-r--r-- 1 postgres postgres 6902 Jun 15 11:04 tax_2013_categories.csv

Above is a snippet of what I see in the /tmp directory of my desktop after running our new script. You will recall that Linux doesn’t have robust file security like OpenVMS. The groups of letters on the left are in Owner Group World order. Most legitimate-for-business-use operating systems implement SOGW where S = System. After a bit of examination you will notice the files have World read access and are owned by the postgres user, hence the 2 step boogie.

Let us take a look at the last instance of our dynamically created extract_it.sql.

copy expenses to '/tmp/tax_2017_expenses.csv' csv header;
copy categories to '/tmp/tax_2017_categories.csv' csv header;
copy payees to '/tmp/tax_2017_payees.csv' csv header;

psql is powerful. The database server is robust. Just one line is all it takes to export into a csv file which includes column names in the header (the last two words on the line.) I’m going to use a screen shot to show you a snippet of output because it appears WordPress has taken it upon themselves to remove fixed width mono-space font support.

screenshot image

Please do whatever it is your browser allows to expand/zoom that image. I want to point something out here. Notice how the output is almost columnar? If it wasn’t for those things which required quotations because they contained one or more commas we, quite possibly would have perfect alignment.

CSV is not a reliable data transfer method which is why you see it so often used in the PC world. It has a loose set of guidelines and is implemented differently by everybody. I used it here because it is the exact same program both creating and reading the CSV.

We also need to chat about fixed width columns. Yes, I know all of the wowsy woowoo trends in DBA land, especially in documentation from Oracle say you should always use varchar. There are all kinds of conversations both right and wrong about reducing database storage requirements, data warehousing costs, etc. I’m here to tell you varchar is how data breaches happen. Not with this system unless I tried to connect it to the Internet via some XML interface, but with your systems that ultimately will be connected to the Internet or some external data feed.

Bad design choices happen because people stupidly believe future requirements have no place in “this sprint’s stories.” Yes, AGILE is both shit and criminal. While we are at it, let’s change the language used in this discussion. You didn’t “get hacked.” If you declared a varchar column in a table which was either directly via something on one of your own Web pages or indirectly via some XML gateway service like Websphere, you created the hack. It wasn’t a “data breach.” No, hackers didn’t “penetrate security.” You created the hack. Sign your name to it they should cart you out in cuffs.

For decades now, ever since “industry analysts” were paid to market Unix/Linux as “Open” systems and Microsoft somehow managed to bribe enough of those same analysts to declare Windows, the most proprietary PC operating system on the planet as “open” we have been told

Proprietary bad. Open good.

You’ve all heard some flavor of it. You know, the last major industry falsehood sold to us which created even bigger problems that what we had. Oh come on! If you are reading this you work in IT and you see this all of the time. Some highly marketed trend is the next thing to bring world peace, cure cancer and eliminate hunger. At least it is marketed like that until the “industry analysts” are paid even more money to say the same things about something else. Remember how Java was going to do all of that? I do. One only hears about Java in terms of legacy technology now. COBOL seems more current.

For the non-geeks or the too young geeks, I’ll give you a real world non-IT story to ponder. When I was a lad and they first introduced the catalytic converter along with unleaded gasoline Americans were in their previously large “green” movement. It wasn’t that long until we had actual city buses and prototype cars which ran on propane whose only exhaust was CO2 and water. These were “green vehicles” and they were going to save our planet from destruction. Now that I’m an old guy we are told “greenhouse gases,” primarily CO2, are causing climate change and the planet is going to kill us for it.

Now you have a frame of reference for the rest of this post.

The “heritage” communications systems using proprietary message formats having fixed width fields came about out of necessity. Communications was 300 baud or less. Somewhere long the lines would be a paper terminal which only had an 80 character data buffer and flushed the buffer on an ASCII return (decimal 13). Most people of that era designed data records to contain a maximum of 72 characters because there were other constraints on that 80 character data buffer like checksums, start/stop characters etc. You could get more characters to work some of the time in some of the places but 72 seemed to always work so it is the number the learned few shared amongst themselves.

Communications equipment and speeds changed over the decades, but proprietary messages remained. Some of the earliest 72 character proprietary formats may still be in use today. I know some didn’t go away until after 2000. All of these bad things had one good thing in common. SQL injection and buffer overrun attacks are not physically possible. You wish to send in a billion characters in a single field to force a buffer overrun so you can do an SQL injection, fine. Here’s what is going to happen:

72-characters, ka-chunk, fails data validation72-characters, ka-chunk, fails data validation
72-characters, ka-chunk, fails data validation
72-characters, ka-chunk, fails data validation

until all of those characters you fed it are consumed.

So, let’s consider what happens when you put an XML interface with something like Websphere (insert any name you wish there) which communicates directly or via port service or server consuming from a message queue with your relational database which has all fixed width columns. Not dynamically building SQL statements with contents of the XML but compiled solutions like SQLMOD or some such which only parse out the number of characters the column supports. It’s a repeat of our 72-characters example with different numbers of characters for each column.

You can read quite a bit of information about databases on OpenVMS here. The discussion around SQLMOD is quite a ways down so you may which to text search the page. As always, you can get a lot more information in this book.

Now, let’s get to the crux of the thesis. You remember what it is, right?

varchar is how data breaches happen.

Once again, something which was supposed to be great for the universe has turned out to be a bad thing. Initially varchar was a stop-gap measure. Databases which offered it first didn’t have an engine which would store only the actual characters for a fixed width field, stripping blanks on store and padding on read. Then we had all of this fuzzy design talk which gave us things like XML where text tags bound segments of limitless data as long as the limitless data does not contain a copy of the closing tag and the parser doesn’t pop past the end of its buffer or physical memory.

The next bandage put on this problem was the introduction of varchar(N) where variable columns had maximum lengths. Well, fine, but most databases weren’t very good at communicating maximum column width when this first came out. It now also requires your front end parser to retrieve on each and every message what that maximum is. Why each and every? Because someone could have issued an ALTER command. It was quite a while before syntax like this would work.

tax_1998=# select table_name, column_name, character_maximum_length from INFORMATION_SCHEMA.COLUMNS where table_name = 'categories';
 table_name | column_name | character_maximum_length 
 categories | category | 25
 categories | description | 50
 categories | tax_ded | 
(3 rows)

I will be shocked, shocked I say, if that select statement works on any other database. I have said it many times before:

As much as people talk about an SQL standard, there is no SQL standard. Those common yet annoying things haven’t made it into the standard and they’ve had close to 3 decades. Think I’m wrong. Connect to 5 different database engines each for a different type of database software and try the following: SHOW TABLES.


$ mysql -p --database=mega_zillionare
Enter password:
 Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 866 to server version: 4.1.9-log
Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
mysql> show tables
 -> ;
Empty set (0.04 sec)
mysql> exit

It didn’t fail, that was a test against an empty database. The SHOW TABLES command also works on RDB and quite a few other databases. But . . .

tax_1998=# show tables
tax_1998-# ;
ERROR: unrecognized configuration parameter "tables"
tax_1998=# \q

Yes, this has been a pet peeve of mine ever since relational databases were introduced.

The only realistic solution is a mantra I’ve recited to clients thousands of times before.

XML for outside world. Proprietary fixed width internally. Outside world never reach into database.

Always convert from XML to proprietary internal and have the proprietary internal message be handled by something that didn’t do the conversion. Never ever let a Web page direct connect to a database. Maybe, if it is a read-only copy of the database placed on an air gapped computer you don’t care about, but, even then it is not a great idea.

Having an officially defined and documented proprietary message format, for most companies at least, puts in place and official change control process so some DBA doesn’t go out to product their first day on the job and issue an ALTER command.

I need to rant about one last thing. According to the documentation . . .

Values of type character are physically padded with spaces to the specified width n, and are stored and displayed that way. However, the padding spaces are treated as semantically insignificant. Trailing spaces are disregarded when comparing two values of type character, and they will be removed when converting a character value to one of the other string types. Note that trailing spaces are semantically significant in character varying and text values.

I’m certain some bean counter ran numbers and justified padding on the store instead of storing the value trimmed so varchar(N) and char(N) would have the same inbound path. Other databases do not do this because they focus on storage and have figured out how to memset() N characters worth of memory before copying in the significant characters when the value is put into a cursor. Someone somewhere proved it is more efficient to waste the disk space than the processor cycles.

Switching columns to varchar(N) to get around storage inefficiencies causes other application inefficiencies. To start with, “Fred ” stored in the table will not match “Fred” used in the SELECT command. Not all programming languages support varying character fields, they will have to treat the columns as fixed length strings of maximum size which will really trash any hopes of saving storage because if, as the documentation says above, trailing spaces are semantically significant, the database has to store them.