Data Structures & Algorithms

A 'Data Structure' is an arrangement of data. Today the term is usually applied to the physical arrangement of data as it is stored and processed in computer systems' RAM & disks and as it is launched and apprehended via network connections.

An 'Algorithm' is a step-by-step procedure for completing an operation in a finite number of steps. The focus of this discussion is on algorithms used in computer programs. Algorithms also exist to find answers using arithmetic, math, & logic.

Wikipedia provides excellent introductions to data structures, and an algorithms.

The most amazing data structures today support our Googling and several flavors of AI - Artficial Intelligence and they involve highly-evolved data structures deployed on storage systems optimized for the purpose. Wikipedia's article on Search Engine Indexing. As we got on-line to The Internet in the '80s 'searches' could take hours and were confused by little things like plurals, where a search with 'horse' would get one set of results and 'horses' would get another. Today's search engines can handle tenses, plurals, and even misspelled terms with max aplomb, returning hundreds or thousands of relevent results in a fraction of a second...

Historical Perspective of Data Structures

Data structures and data-processing algorithms didn't just pop out of the clear blue sky when Digital Computers first appeared early in the 1900's. They evolved along with civilization and trade. Methods were established for originating and archiving official records of contracts and agreements. Customary, more or less standardized, documents described the facts about orders among trading partners and among citizens and government. Detailed manifests described exactly what was traded.

As clockwork and other machines keeping time and calculating answers emerged hundreds of years ago, inventors started thinking of ways to apply mechanical methods and circuits to the processing of data. By the time that inventors mixed mechanical and electronic circuits to make 'digital computers', automated data processing of data on punched cards was already a mature technology.

Googling 'edge notched cards' will show an assortment of ancient and more modern schemes for handling large numbers of records on cards. Some edge-notched cards had a metal plate that could be used to print address labels, or print directly on a newspaper, magazine, or invoice. The last edge notched system I converted to on-line was in about 1985, a circulation system for about 80,000 subscribers. Long metal spikes could be gently threaded to pull out, for example, the cards for subscriptions about to expire. Updating the cards and replacing them were manual processes.

Data Structures have evolved along with media for record keeping. Recording the earliest facts of commerce or life involved stroking ink on hand-made paper or carefully skived hides (parchment). Babylonians found plenty of clay for bricks in the cradle of civilization, and recorded data as impressions with a stylus on moist clay, which dried to make a permanent record. Some cultures' records were tied as knots in a hank of threads. These media made durable records that took up more or less space in an archive to hold facts for future reference. But, they were 'human readable' only, could not easily be adapted reading by machines.

As the industrial revolution evolved finer and finer machinery, the paper-making industry made paper affordable and plentiful for use in offices and homes. As a hand-made product requiring a skilled papermaker's effort, writing paper had been very expensive for centuries before. Paper-making machines were improved, made plenty of paper, and were adapted to process fibers into heavier, durable 'card stock' well before the 1900s. The cards were tough enough to be thumbed repeatedly over the years by clerks searching thru card files. Googling 'card catalog' shows an assortment of card catalogs from small to huge.

Eventually, punched cards could be read by a machine without being damaged by the metallic fingers probing for the resistance of a card or lack of it in a punch. Electro-mechanical card readers reigned for the last decades of the technology's Huraah! In these, the holes punched in cards complete electrical circuits as the cards' columns and rows move past little, metallic brushes. A punch allows the current to flow at that column/row location. If there is no punch at a column/row location there is an open circuit.

These card sorters, tabulators/printers, & gang punches required constant maintenance and cleaning to avoid problems with registration or tearing of cards. It took constant attention and activity by the data processing operators to move the stacks of cards correctly thru the sorting & merging that prepared batches of data for tabulation and printing operations.

In a medium-sized business a crew of a couple or a few people ran the machinery into the evening, or overnight, to make sure the batches of day's data was all lined up correctly, input to the appropriate process, reports 'balanced' and were printed on 'computer paper'. Then, these 'cocntinuous forms' were de-collated to remove the carbon papers, and burst apart correctly. Then these were wrapped or put in envelopes and ready ready for delivery to the CEO and others in the morning.

Record Keeping for Houses and Enterprises

Owners, secretaries, and other officers thru the 1700s worked at desks optimized for handling hand-written letters and documents. The desks are also called 'secretaries' because they have lots of places to 'secrete'(2nd use of the word) the papers close at hand. Matters were systematically reduced to transactions and recorded in bound Journals and Ledgers that were kept close.

As the 1800s moved along and paper production was industrialized, 'card files' with many drawers were added to many offices' furniture and were put to use for storing facts on cards for quick access. Card files were also used for indexing vast sets of an enterprise's or bureau's exquisitely detailed records.

Library 'card catalogs' are becoming obsolete in the 00's after hundreds of years in service. Using card catalogs, clerks could thumb through an index to find the location of a book or file without having to search thru a huge file room or library. Also, where only one copy of an original document exists, multiple index cards can reference the same book or document, making 'cross references' so the desired item, perhaps a deed, policy, or book, can be located quickly using indices for owner, buyer, seller, title, author, or other subject.

The files and indexes (indices) for organizing paper records illustrate some of the fundamentals of automated data processing: 1)store the record once, and, 2)use indexes to quickly locate individual records as they are needed.

Details of commerce and life were routinely recorded and indexed on paper or parchment for centuries before Chas. Dickens' time but it was his generation that first saw data recorded as punches thru and notches at the edges of precisely formatted cards. These punched and notched records could be selected and processed in a combination of clerical and mechanical steps. Just as authors of the period invented detailed and complex plots for their characters, the analysts applied their talents to making efficient processes for keeping and finding detailed & accurate records of enterprise or government.

Automating Processes with Punched Cards

The earliest images of 'punched cards' we see date from early in the 1800's, where Jacquard rigged a loom to weave richly detailed tapestry economically & automatically, controlling the weave with a continuous belt of cards. Jacquard used wooden battens and today's Jacquard Looms use modern materials, or have been digitized altogether. As each woof thread is shuttled across the loom the appropriate threads of the warp are raised if a pin controlling the thread's height encounters card instead of hole as each card is read in sequence from the belt. This is fine for controlling machines to repeat patterns, but has few applications to business record keeping.

By the late 1800's office equipment & machinery evolved to process data punched on cards in a stack as well as a belt. An office system used for hundreds of years uses holes and notches postioned around the edges of paper cards, often carefully printed with the significance of a hole or notch at that position. Dozens of boolean facts about Customers, Subscribers, Subject, and Accounts could be stored using either a hole or a notch which removed the hold. For example: When a stack of cards is lifted with a straight wire passed thru the position that indicates 'Subscription Expires in November 1958' the records with notches in this position will drop out for instant use for sending expiration notices or cancelling an expired subscription.

Many businesses from the early 1900s thru the 70s invested in machines to stamp an Account Number and name & address information as raised type on metal plates.

At the point of transaction, the data on a customer's charge-a-plate were quickly stamped onto a multi-part form using a machine that made the top impression thru a ribbon onto the pre-printed form, and the back copies thru carbon paper. In one stroke, all the paperwork was generated to make a clear audit trail from the sales counter thru to the bottom line.

In the Accounting Office, these metal plates could be attached to cardstock pre-printed with a form for the purpose at hand and allowed subscribers' or customers' addresses to be stamped quickly & directly on an envelope or statement of account. In the mailing room, machines for addressing envelopes and making mailing labels quickly struck thru a ribbon to move data from stacks of the address plates. The edges of the plates could have data encoded in them with notches so plates could be quickly selected and sorted as required for the process at hand.

This link is to a Short History of Credit Cards, including shots of paper accounts and the ChargeAPlates in common use for several decades before plastic cards with magnetic stripes came on the scene. Ahh! Credit!! It's been the American Way for hundreds of years!!!

Pneumatic Networks

From the mid-1800s enterprises and bureaus of many types installed pneumatic tube systems to move transaction records on paper or cardstock from the point of interaction with the customer to a central office where transactions were applied to the customer's account manually, or with a stamp, or with punches to a card. The account might be for a customer, patient, citizen, supplier, general ledger, department, or other entity. Data flowed noisily thru a combination of pneumatic, electrical, and mechanical circuits to be recorded on cards holding account data.

Here is a link to pages about a metropolitan area pneumatic network from the 1930's.

Punched Card Technology

Technical advances in Punched Card Technology by Herman Hollerith and others in the late 1800s were thorougly commercialized in the 1900s as the efficient and economical way to keep and process large numbers of records.

By the mid-1900s punched card technology reached its peak. One standard punched card from that era, run thru IBM equipment, has 80 columns where punches are made and each column has 12 rows. One or more rows can be punched in a column, making it easy to represent dozens of numbers and characters -- enough to encode values for digits 0 thru 9, letters A thru Z, a few punctuation marks like $, /, and *, and some 'control characters' used to control data processing equipment thru which stacks of cards were fed. The data encoding scheme for these cards, BCD-Binary Coded Data, is the precursor of EBCDIC.

Numerical and Character data were stored on punched cards, where a complete record of several hundred characters might span several cards. The data processing, or other department, had to dedicate lots of cubic feet to the storage of racks and racks of punched cards.

Data processing operators handled the punched cards and many record keeping & reporting tasks were accomplished more or less automatically. Central to a data processing operation were the Card Sorters that were able to quickly & accurately drop cards read from a stack into slots according to the holes punched in a column. Gang punches, duplicating punches, and tabulators handled record keeping and reporting tasks. Printers put reports on 'computer paper', usually with two or three carbons to make three or four copies. In these times with no networks or 'data terminals' lots and lots of paper was involved in enterprise.

Sorting a 'dataset' on punched cards involved running the stack thru the sorter repeatedly, once for each column of the field of data involved in the sort. Over the course of an accounting cycle, a card might be read in dozens of processes from daily thru weekly, monthly, and annual cycles.

Between automated processes, the stacks of cards might be sorted into 'alphabetic sequence by account name' so that individual records could be found, and updated with changed facts or corrections.

A short history of punched cards with good pictures.

Here is a page from the past showing an annotated drawing of data processing equipment of the 1930's as applied to Astronomy and not business.

Sequential Records on Magnetic Tape

In the 1950s became more and more feasible to use the newly emerging 'digital computers' in business processes. Many organizations saw tape drives, CPUs, and other 'peripheral devices' added to the card-processing equipment that was standard in data processing rooms at the time. Magnetic tape emerged as a quicker & more economical data storage medium than Punched Cards.

Thru the '70s lots of data were keyed on punched cards to be copied to magnetic tape, where subsequent processing was much quicker, and quieter, than cards. From the '70s as 'data entry terminals' were becoming feasible, many of these CRT/Keyboard machines were deployed as 'Key To Tape' units where Punched Cards were not involved and operators keyed data written directly to magnetic tape.

Algorithms for 'Sequential Record Processing' developed quickly during this time. After data are sorted into the desired sequence they could be updated or reported efficiently. 'Transaction Processing' might involve sorting the cards holding the 'transaction records' into the same sequence as the 'the master file' records on tape, then running a program that applies the new transaction records to the tape ahead of of the historical ones.

Magnetic Disks and Drums allowed quick Sequential & Direct Access to records.

Magnetic drum and disk technology became affordable for medium and large-sized business in the 1960s. $1000+ per Megabyte was a common price for a drive and one compatible disk pack, other removable disk packs cost thousands of dollars each. It was a management challenge to make an automated process that was economical given the expensive equiment and cost for personnel.

With their magnetic media constantly rotating past the read/write heads magnetic disks and drums could read and write records directly, usually in bursts of 512 bytes that corresponded to the sector size on hard disks of the time, or 1024, 2048, or more stored on the surface of a drum. Thru the '70s drums became obsolete, and 'disk packs' with a capacity of 20 or 40 megabytes were common.

As new techniques quickly emerged for maintaining records directly on magnetic disks. Magnetic tapes became more important for backing up data on disk and holding transaction data. And, the managers, customer-service reps, and others in the building with the enterprise's computer saw the tremendous benefits that direct access records provided in getting instant answering for questions at their desks, and how this expensive equipment was important for enterprises' bottom lines and competitivity.

Primitive Data Structures and Algorithms to Process their Data

If you're going for an interview for database or network admin it's likely the behavioral part will involve questions or exercises about 'data structures'. Amazon or Half dot com has lots of good books about what you're likely to be asked to do for an interview for a programmer, developer, or network admin position.

Storing huge numbers of records on magnetic disks and drums posed the problem: How to _find_ the one record that's needed, more or less instantly, among the zillions of records stored? How to add transactions to master records without having to shuffle increasingly large numbers of records on the disk drive, or (shudder to recollect) a tape drive?

The data structures and techniques for processing them make good 'teaching tools' today. Learning how to script these primitive structures and algorithms gently introduces students to the concepts & contrivances that underly today's DBMS. Here's a link to MySQL's CREATE INDEX syntax, showing that an understanding of BTREE & HASH is necessary to avoid hosing up the database by adding the wrong kind of index, so these topics are introduced here...

Here is a quick look at Sequential & Direct Files Access Techniques using PHP scripts.

Wikipedia provides excellent coverage of the Data Structures that are used to store, find, and retrieve records on disks:

Before and beyond these 'classic' data structures, there are lot of others employed in scripts: Scalar Variables, Arrays, Lists, Heaps, Stacks, Queues, Strings, Objects, Controls, Result Sets, Cursors and other hideyholes and gizmos to access RAM, virtual storage, disks, and networks effectively.

Modern Data Structures & Algorithms

DBMS are involved in most modern applications to save, find, retrieve, update, and delete records of may types. And, 'the file system' is also used to store and serve up some types of data. Most of today's structured and oject-oriented languages have application program interfaces making access to databases, or directories and files, that are physically located anywhere on a computer or network.

Today, it's the programmers who design and code DBMS, OS, & Programming Languages who must be _intimately_ familiar with these arcane concepts and algorithms. The rest of us app developers, network admins, & database admins need to be _aware_ of how they work so that we make the best decisions about how to position databases and tables on the diskspaces available, and how to optimize performance, reliability, and security for future access to the records they hold.

It's not unusual for a field in a table to hold the path and name of a file kept in the OS's directory structure.

Data Base Management Systems

Recently discussed in a prior topic...

Data Transport Protocols

Fixed blocked records as above, comma delimited, & tab-delimited records have been exchanged since tape and network made data exchanges between systems possible to connect 'the computer room' with the computer rooms at their customers', suppliers', and branches who likely to not have the same kind of computer system or application software. Today, with very quick bandwidth and thruput, XML is emerging as a standard, but bandwidth-consuming, way of marking up data for serialization and transport among computer systems.

Several standards prevail in 'EDI-Electronic Data Interchange'. The bulk of documents for supply chain management are transmitted using ANSI X-12 standard or the UN's standard EDIFact. Both organizations are currently involved (since early 2000s) in applying XML to their legacy of trading partner documents. Health, Automotive, Government, and many other Industries have adopted X-12 documents as their standard way of exchanging documents of commerce.

Having a business system adaped to handle EDI allows large, medium-sized, and small outfitters handle their part of the supply chains of huge customers', many of them will not issue a PO any other way than via X-12 EDI and require participation in the standards all the way thru shipping the goods and invoicing for them.

OBI-Open Buying on the Internet is a relatively recent adaptation of X-12 standards applied to modern networks and supply chain management among trading partners 'over the internet', larger and smaller, until competition happens and they might part.