data-definition

Data Definitions

A data definition is a specification of a set of values, using a semi-formal system. Here are a bunch of examples:

1	# a Price is an integer, representing a number of US cents

1	# an ID is a string of 10 alphanumeric characters

# a Location is Location(float, float)
class Location:
   def __init__(lat, long):
   # ...

# A Household is Household(Location, ListOfPersons, Dollars)
# representing a household in the census data
class Household:
  def __init__(self, location, people, income):

# A ClassStatus is one of:
# - "Enrolled",
# - "Waitlisted", or
# - "Not Enrolled"

1
2
3

# A weight is either
# - a float, representing a number of kg, or
# - "unknown"

# A FloatBinaryTree is either
# - "leaf", or
# - Node(float, FloatBinaryTree, FloatBinaryTree)
class Node:
  def __init__(self, value, left, right):
  # ...

That set of examples covers just about the whole class, actually…

Data Definitions are the most important but also the most challenging part of the whole design recipe.

Why are they important? Data definitions specify the values that the program will be handling. By defining values, we immediately start thinking about functions as mappings from pieces of data to other pieces of data, rather than as sequences-of-instructons. Data definitions lead to examples of data, and then to test cases. Data definitions steer students toward specifications, rather than implementations.

Why are they challenging? Well, for one thing, they’re the first step in the design process, so it may be hard to know what definition to choose. Choosing a definition well depends on a good understanding of what functions you’re going to implement, and how the choice of data definition will affect that.

Secondly, they have an interesting relationship to the rest of the design recipe. Specifically: you usually don’t need to produce a new data definition, in order to write a new function! Generally we define a piece of data, then write many many functions on that data definition. This presents mundane pedagogic difficulties; the data definition is step one in the design recipe—but for most functions, we choose from existing data definitions, rather than designing new ones, and don’t need to write anything down as part of step one.

When you work with the design recipe, you will find many students confused by the fact that “step one” generally has no written artifact at all. Some students will laboriously duplicate their data definitions, and others will confuse designing data with designing functions, and try to follow the rest of the steps of the design recipe without a function in mind.

Classification

There are several different classes of data definitions. Generally, we follow a “spiral staircase” approach, revisiting step one repeatedly as we encounter more and more sophisticated forms of data.

Existing

We take certain data definitions as existing, including:

int
float
str
bool

(These denote the expected sets of values.)

One interesting case is the existing Python “list” values. These present a bunch of interesting questions/problems, especially in this class. Are they arrays? Are they linked lists? Generally speaking, Python’s attitude is “don’t worry about it, just do whatever you want”… which is not really the attitude that makes sense in a data structures class. (FWIW, they turn out to be arrays, essentially.) Anyhow, I would write them like this:

[int]
[float]
…

Finally, we use the data definition Any to denote the set of all python values.

Simple Data

The first kind of nontrivial data definition, like the first two above, is just giving a new name to an existing kind of data.

From the list above:

1	# a Price is an integer, representing a number of US cents

Strictly speaking, there’s no need to define a new kind of data here; whenever you use the term Price, you could simply have used the built-in float.

You can also add side conditions to data definitions, though; consider the data definition

1	# an ID is a string of 10 alphanumeric characters

You can imagine all kinds of goofy side conditions. This is essentially the reason why this is (and probably has to be) a semi-formal notation; it allows the specification of side conditions that don’t fit into the notation of a fixed system. The ordering property of a binary search tree is a great example here.

—