Frequently Asked Questions About Level of Measurement

Who is S. S. Stevens?

S. S. (Stanley Smith) Stevens (1906-1973) is best known in the social sciences for his work on levels of measurement. Stevens was also an authority on the physics of sensory perception, especially hearing. He was director of the Psycho-Acoustics Laboratory and the Psychophysics Laboratory at Harvard.

Academic appointments:
1933-1973 Harvard University
Major publications:
Stevens, S. S., & Galanter, E. H. (1957) Ratio scales and category scales for a dozen perceptual continua. Journal of Experimental Psychology, 54, 377-411.
Stevens, S. S. (1957). On the psychophysical law. Psychological Review, 64, 153-181.

Total publications: 101 (89 by 1960 + 12 from '67 to '75)
Other achievements:
1936 Founding member of the Psychological Round Table
1940 Election to the Society of Experimental Psychologists
1943 Warren Medal
1946 Election to the National Academy of Sciences
1948 Presidential Certificate of Merit
1959 Election to the American Philosophical Society
1959 Founding member of the Psychonomic Society
1960 APA award for distinguished scientific contribution
1963-1971 Editorial board for Psychological Review
1966-1971 Editorial board for Perception & Psychophysics

Stevens' work on measurement, so often quoted, is seldom read. His statement, “measurement is the assignment of numerals to events or objects according to rule” (Stevens, 1959, p.25), is used to support a variety of mathematical abuse. Stevens, however, tried to be precise about what kind of arithmetic was valid with what kind of numbers (see Table).

Stevens' Classification of Scales (after Stevens, 1959, p.25, 27)
Scale | Operation | Examples | Location | Dispersion | Association | Test
Nominal | Equality | Numbering of players | Mode | - | - | Chi-square
Ordinal | Greater or less | Hardness of minerals; Street numbers; Raw scores | Median | Percentiles | Rank-order correlation | Sign test; Run test
Interval | Distance | Temperature: Celsius; Position, Time; Standard scores(?) | Arithmetic mean | Standard deviation | Product-moment correlation | t-test; F-test
Ratio | Ratio | Numerosity (Counts); Length, Density; Position, Time; Temperature: Kelvin; Loudness: sones; Brightness: brils | Geometric mean; Harmonic mean | Percent variation | - | -

Nominal Scales

“Whether a process of classification underlying the nominal scale constitutes measurement is one of those semantic issues that depend on taste... I prefer to call it a form of measurement” (p. 25). Unfortunately, this preference has been confused with the everyday restriction of the term "measurement" to numerically linear operations. Nevertheless, Stevens did perceive the difference between mere numerical operations and meaningful statistics: “When we compute the mean of the numerals assigned to a team of football players, are we trying to say something about the players, or only about the numerals? The only "meaningful" statistic here would be N, the number of players assigned a numeral.” (p. 29)

Interval Scales

Interval scales have order and equal intervals. Counts, such as income, years of education, or number of Democratic votes, are interval. For most statistical procedures the distinction between interval and ratio does not matter, and it is common to use the term "interval" to refer to ratio data as well. Occasionally, however, the distinction between interval and ratio becomes important. With interval data, one can perform logical operations, add, and subtract, but one cannot multiply or divide. For instance, if a liquid is 40 degrees and we add 10 degrees, it will be 50 degrees. However, a liquid at 40 degrees does not have twice the temperature of a liquid at 20 degrees because 0 degrees does not represent "no temperature". To multiply or divide in this way we would have to be using the Kelvin temperature scale, with a true zero point (0 degrees Kelvin = -273.15 degrees Celsius). Fortunately, in social science the issue of "true zero" rarely arises, but researchers should be aware of the statistical issues involved.
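The arithmetic in the temperature example can be checked directly. The following is a minimal Python sketch, not part of the original discussion; it assumes the readings are in degrees Celsius, and the function and variable names are illustrative only.

```python
def celsius_to_kelvin(c):
    """Convert a Celsius reading to the Kelvin scale (true zero at -273.15 C)."""
    return c + 273.15

warm_c, cool_c = 40.0, 20.0
warm_k, cool_k = celsius_to_kelvin(warm_c), celsius_to_kelvin(cool_c)

# Differences survive the change of origin, so adding and subtracting is safe.
diff_c = warm_c - cool_c
diff_k = warm_k - cool_k

# Ratios do not survive it, so "twice the temperature" is not meaningful in Celsius.
ratio_c = warm_c / cool_c
ratio_k = warm_k / cool_k

print(f"difference: {diff_c:.2f} on Celsius, {diff_k:.2f} on Kelvin")
print(f"ratio: {ratio_c:.3f} on Celsius, {ratio_k:.3f} on Kelvin")
```

The difference is the same on both scales, while the apparent ratio of 2 shrinks to roughly 1.07 once the readings are expressed on a scale with a true zero.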

Ordinal Scales

“When operations are available to determine only rank order, it is of questionable propriety to compute means and standard deviations... If we want to interpret the result of averaging a set of data as an arithmetic mean in the usual sense, we need to begin with more than an ordinal assignment of numerals.” (p. 29) Stevens' understanding of the ordinal status of test scores is clear. He categorizes raw scores as ordinal, and, since the "(?)" in the Table is his, he is not convinced that even standardized scores are interval.

Ratio Scales

The distinctive feature of a ratio scale is that it has an origin defined by a dominating substantive theory (p. 25). Thus, Time, measured from the "Big Bang", is on a ratio scale, and so is Length when measured from the location of that same event. Length, in yards or meters, and Time, in days or years, are on interval scales. Since Stevens regards his classification as a hierarchy, he lists Length with the super-ordinate ratio scale. His distinction between Length as ratio and Time as interval dissolves.

Ratio Scales II--Counts

“The numerosity of collections of objects [counts] ... belongs to the class I have called ratio scales” (p. 20). Accordingly, in situations where it is important to maintain the notion that a count of 0 means "none at all", rather than "none extra", and a count of 1 means "only one object", rather than "one more to go with those we already have", ratio scale arithmetic applies.

Campbell, Stevens, and Rasch

Stevens formulated his classification and discussion as a rejoinder to a British committee who, in 1932, investigated the possibility of “quantitative estimates of sensory events.” Physicist Norman Campbell's verdict was: “Why do not psychologists accept the natural and obvious conclusion that subjective measurements of loudness in numerical terms (like those of length) are mutually inconsistent and cannot be the basis of measurement?” Stevens' comment on this is: “Why, he might have asked, does the psychologist not give up and go quietly off to limbo?”

Stevens sought to rescue psychological measurement by changing the problem from that of inventing operations (the physical view) to that of classifying scales (a mathematical view). As Warren Torgerson (1958 p.18) noted, Stevens' and other similar approaches are concerned with “different methods for the systematic classification of various limited sets of [concrete] objects, rather than methods of measurement of [an abstract] property.” Stevens' solution does not produce linear measures, but merely classifies the numbers already in use. He concludes that brightness and loudness are on ratio scales (“linearized” by taking logarithms, p. 40). He also discovers that methods such as "just noticeable differences", rating scale categories and paired comparisons produce only ordinal scales (p. 46). “Stevens' enormous contribution was his successful argument that there are different kinds of scales, kinds defined in terms of their degree of resemblance to the real number line. The weakness of his writing was its apparent implication that the nature of a scale is somehow defined by the investigator” (Cliff 1992 p. 186).

Rasch moves beyond both Campbell's insistence on physical operations and Stevens' substitution of classification for measurement construction. Rasch measurement deliberately engages in producing from intangible qualitative observations the most meaningful (and common) form of measurement, namely that on an interval scale readily analyzed by linear statistics. Since it is almost impossible to think quantitatively without linearity, Rasch refuses to be appeased with less useful numerical assignments. For much more on Rasch models visit http://www.rasch.org.

Considerations Beyond Stevens’ Four Scales

Is there a distinction between how a variable is conceptualized and how it is measured and recorded?
Ex: Age

Is the variable fundamentally not a single dimension?
Ex: Marital Status

In an ordinal or nominal scale, how firm or arbitrary are the number of categories and their boundaries?
Ex: Psychological Scales and Indexes

Is there a meaningful upper anchor point for the scale?
Is the upper anchor point a maximum, or can it be exceeded?
Ex: Psychological Scales and Indexes

Are negative values meaningful?
Ex: Income

In an ordinal scale, can, or must, an individual case move through the categories in order?
Ex: Erikson’s Stages(?)

If a variable can change, can it go in either direction?
Ex: Marital Status

Some Frequently Asked Questions about Level of Measurement

Is it okay to use ordinal variables in procedures like regression or path analysis that assume interval data?

In a recent review of the literature on this topic, Jaccard and Wan (1996) conclude that, for many statistical tests, rather severe departures (from “intervalness”) do not seem to affect Type I and Type II errors dramatically. Standard citations to literature showing the robustness of correlation and other parametric coefficients with respect to ordinal distortion are Labovitz (1967, 1970) and Kim (1975). Others are Binder (1984) and Zumbo and Zimmerman (1993). Use of ordinal variables such as 5-point Likert scales with interval techniques is the norm in contemporary social science. Use of scales with fewer values not only violates normality assumptions but also runs a heightened risk of confounding difficulty factors as discussed in the section below on use of dichotomies.

Researchers should be aware that there is an opposing viewpoint. Thomas Wilson (1971), for instance, concludes, "the ordinal level of measurement prohibits all but the weakest inferences concerning the fit between data and a theoretical model formulated in terms of interval variables." The researcher should attempt to discern if the values of the ordinal variable seem to display obvious marked departures from equal intervalness and qualify his or her inferences accordingly.

Is it okay to use dichotomous variables in procedures like regression or path analysis which assume interval data?

If one assumes the dichotomy collapses an underlying interval variable, there will be more distortion with dichotomization than with an ordinal simplification. Dichotomization of continuous variables attenuates the resulting correlations. Moreover, the cutting points used in the dichotomization will affect the degree of attenuation. If the underlying correlations are high (over .7), the cutting points will have a non-trivial effect. Note also that categorical variables with similar splits will necessarily tend to correlate with each other, regardless of their content (see Gorsuch, 1983). This is particularly apt to occur when dichotomies are used. The correlation will reflect similarity of "difficulty" for items in a testing context; hence such correlated variables are called “difficulty factors.” The researcher should examine the factor loadings of categorical variables with care to assess whether common loading reflects a difficulty factor or substantive correlation. Nonetheless, it is common to use dichotomies in interval-level techniques like correlation and regression.
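The attenuation described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: both variables are simulated as standard normal with an underlying correlation of .7, the five-point coding uses equally spaced cut points, and all names (pearson, five_point, dichotomize, and the result variables) are hypothetical.

```python
import math
import random

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def five_point(values):
    """Collapse a continuous variable into five ordered categories (1..5)."""
    cuts = (-1.5, -0.5, 0.5, 1.5)
    return [1 + sum(v > c for c in cuts) for v in values]

def dichotomize(values, cut):
    """Collapse a continuous variable into 0/1 at the given cut point."""
    return [1 if v > cut else 0 for v in values]

rng = random.Random(0)   # fixed seed so the sketch is repeatable
rho, n = 0.7, 20000      # assumed underlying correlation and sample size
xs, ys = [], []
for _ in range(n):
    x = rng.gauss(0.0, 1.0)
    e = rng.gauss(0.0, 1.0)
    xs.append(x)
    ys.append(rho * x + math.sqrt(1.0 - rho ** 2) * e)

r_continuous = pearson(xs, ys)
r_five_point = pearson(five_point(xs), five_point(ys))
r_median_split = pearson(dichotomize(xs, 0.0), dichotomize(ys, 0.0))
r_extreme_split = pearson(dichotomize(xs, 1.5), dichotomize(ys, 1.5))

for label, r in [("continuous", r_continuous), ("five-point", r_five_point),
                 ("median split", r_median_split), ("split at 1.5", r_extreme_split)]:
    print(f"{label:>13}: r = {r:.3f}")
```

Under these assumptions the five-point version loses a little of the correlation and the dichotomies lose considerably more, with the amount depending on where the cut is placed.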

Can Likert scales be considered interval?

Likert scales (e.g., strongly agree, agree, etc.) are very commonly used with interval procedures, provided the scale item has at least 5 and preferably 7 categories. Most researchers would not use a 3-point Likert scale with a technique requiring interval data. The fewer the points, the more likely the departure from the assumption of normal distribution that many tests require. Here is a typical footnote inserted in research using interval techniques with Likert scales:

"In regard to our use of (insert name of procedure), which assumes interval data, with ordinal Likert scale items, in a recent review of the literature on this topic, Jaccard and Wan (1996: 4) summarize, 'for many statistical tests, rather severe departures (from intervalness) do not seem to affect Type I and Type II errors dramatically.'"

Likert scales are ordinal, but their use in statistical procedures assuming interval level data is commonplace for the reason given above. Note, though, that under certain circumstances, Likert and other rank data can be interval. This would happen, for instance, in a survey of children’s allowances if all children in the sample got allowances of exactly $5, $10, or $15, and these were recorded as "low," "medium," and "high." That is, “intervalness” is an attribute of the data, not of the labels. In most cases, of course, Likert and rank variables are ordinal, but the extent to which they approach intervalness depends on the correspondence of the ordinal labels to the empirical data.

What is measurement?

Measurement of some attribute of a set of things is the process of assigning numbers or other symbols to the things in such a way that relationships of the numbers or symbols reflect relationships of the attribute being measured. A particular way of assigning numbers or symbols to measure something is called a scale of measurement. 

Suppose we have a collection of straight sticks of various sizes and we assign a number to each stick by measuring its length using a ruler. If the number assigned to one stick is greater than the number assigned to another stick, we can conclude that the first stick is longer than the second. Thus a relationship among the numbers (greater than) corresponds to a relationship among the sticks (longer than). If we lay two sticks end-to-end in a straight line and measure their combined length, then the number we assign to the concatenated sticks will equal the sum of the numbers assigned to the individual sticks (within measurement error). Thus another relationship among the numbers (addition) corresponds to a relationship among the sticks (concatenation).

Why should I care about measurement theory?

Measurement theory helps us to avoid making meaningless statements. A typical example of such a meaningless statement is the claim by the weatherman on the local TV station that it was twice as warm today as yesterday because it was 40 degrees Fahrenheit today but only 20 degrees yesterday. This statement is meaningless because one measurement (40) is twice the other measurement (20) only in certain arbitrary scales of measurement, such as Fahrenheit. The relationship 'twice-as' applies only to the numbers, not the attribute being measured (temperature).

When we measure something, the resulting numbers are usually, to some degree, arbitrary. We choose to use a 1 to 5 rating scale instead of a -2 to 2 scale. We choose to use Fahrenheit instead of Celsius. We choose to use miles per gallon instead of gallons per mile. The conclusions of a statistical analysis should not depend on these arbitrary decisions because we could have made the decisions differently. We want the statistical analysis to say something about reality, not simply about our whims regarding meters or feet. 

Suppose we have a rating scale where several judges rate the goodness of flavor of several foods on a 1 to 5 scale. If we want to draw conclusions about the measurements, i.e. the 1-to-5 ratings, then we need not be concerned about measurement theory. For example, if we want to test the hypothesis that the foods have equal mean ratings, we might do a two-way ANOVA on the ratings. 

But if we want to draw conclusions about flavor, then we must consider how flavor relates to the ratings, and that is where measurement theory comes in. Ideally, we would want the ratings to be linear functions of the flavors with the same slope for each judge; if so, the ANOVA can be used to make inferences about mean goodness-of-flavors, providing we can justify all the appropriate statistical assumptions. But if the judges have different slopes relating ratings to flavor, or if these functions are not linear, then this ANOVA will not allow us to make inferences about mean goodness-of-flavor. Note that this issue is not about statistical interaction; even if there is no evidence of interaction in the ratings the judges may have different functions relating ratings to flavor. 

We need to consider what information we have about the functions relating ratings to flavor for each judge. Perhaps the only thing we are sure of is that the ratings are monotone increasing functions of flavor. In this case, we would want to use a statistical analysis that is valid no matter what the particular monotone increasing functions are. One way to do this is to choose an analysis that yields invariant results no matter what monotone increasing functions the judges happen to use, such as a Friedman test. The study of such invariance is a major concern of measurement theory. 
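The invariance claimed for the Friedman test can be checked on a toy example. The sketch below is an illustration only: the ratings are hypothetical, there are no ties within a judge (so no tie correction is applied), and the helper names ranks and friedman_statistic are assumptions made for this example rather than any library's API.

```python
import math

def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def friedman_statistic(table):
    """Friedman chi-square statistic; rows are judges, columns are foods."""
    n, k = len(table), len(table[0])
    rank_sums = [0] * k
    for row in table:
        for j, r in enumerate(ranks(row)):
            rank_sums[j] += r
    return 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)

# Hypothetical ratings: 4 judges (rows) by 3 foods (columns).
ratings = [
    [1.0, 3.0, 4.5],
    [2.0, 2.5, 5.0],
    [1.5, 4.0, 3.0],
    [2.0, 3.5, 4.0],
]

# Each judge is given a different strictly increasing function of the rating.
transforms = [lambda r: r, math.exp, math.sqrt, lambda r: r ** 3]
distorted = [[f(v) for v in row] for f, row in zip(transforms, ratings)]

q_original = friedman_statistic(ratings)
q_distorted = friedman_statistic(distorted)
print(q_original, q_distorted)
```

Because the statistic depends only on each judge's rank order of the foods, it is unchanged by the judge-specific monotone distortions, whereas an ANOVA on the distorted ratings would generally not be.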

However, no measurement theorist would claim that measurement theory provides a complete solution to such problems. In particular, measurement theory generally does not take random measurement error into account, and if such errors are an important aspect of the measurement process, then additional methods, such as latent variable models, are called for. There is no clear boundary between measurement theory and statistical theory; for example, a Rasch model is both a measurement model and a statistical model.

What are permissible transformations?

Permissible transformations are transformations of a scale of measurement that preserve the relevant relationships of the measurement process. Permissible is a technical term; use of this term does not imply that other transformations are prohibited for data analysis any more than use of the term normal for probability distributions implies that other distributions are pathological. If Stevens had used the term mandatory rather than permissible, a lot of confusion might have been avoided. 

In the example of measuring sticks, changing the unit of measurement (say, from centimeters to inches) multiplies the measurements by a constant factor. This multiplication does not alter the correspondence of the relationships 'greater than' and 'longer than', nor the correspondence of addition and concatenation. Hence, change of units is a permissible transformation with respect to these relationships.
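As a rough check of the stick example, the following sketch (hypothetical lengths, illustrative names) confirms that a change of unit preserves both correspondences, and contrasts it with squaring, which is not a permissible transformation for length because it keeps the order but breaks the match between addition and concatenation.

```python
import math

CM_PER_INCH = 2.54

def to_inches(cm):
    """Change of unit: multiply by a constant factor."""
    return cm / CM_PER_INCH

def square(x):
    """A monotone (for positive x) but non-permissible transformation."""
    return x * x

stick_a_cm, stick_b_cm = 30.0, 45.0       # hypothetical stick lengths
joined_cm = stick_a_cm + stick_b_cm       # the two sticks laid end to end

# 'greater than' still matches 'longer than' after the change of unit.
order_preserved = (stick_a_cm < stick_b_cm) == (to_inches(stick_a_cm) < to_inches(stick_b_cm))

# Addition still matches concatenation after the change of unit.
additivity_preserved = math.isclose(
    to_inches(joined_cm), to_inches(stick_a_cm) + to_inches(stick_b_cm))

# Squaring keeps the order but no longer matches concatenation.
order_after_square = square(stick_a_cm) < square(stick_b_cm)
additivity_after_square = math.isclose(
    square(joined_cm), square(stick_a_cm) + square(stick_b_cm))

print(order_preserved, additivity_preserved, order_after_square, additivity_after_square)
```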

What are levels of measurement?

There are different levels of measurement that involve different properties (relations and operations) of the numbers or symbols that constitute the measurements. Associated with each level of measurement is a set of permissible transformations. The most commonly discussed levels of measurement, from weakest to strongest, are nominal, ordinal, interval, log-interval, ratio, and absolute.

In real life, a scale of measurement may not correspond precisely to any of these levels of measurement. For example, there can be a mixture of nominal and ordinal information in a single scale, such as in questionnaires that have several non-response categories. It is common to have scales that lie somewhere between the ordinal and interval levels in that the measurements can be assumed to be a smooth monotone function of the attribute. For many subjective rating scales (such as the 'strongly agree,' 'agree,' ... 'strongly disagree' variety) it cannot be shown that the intervals between successive ratings are exactly equal, but with reasonable care and diagnostics it may be safe to say that no interval represents a difference more than two or three times greater than another interval. 

Unfortunately, there are also many situations where the measurement process is too ill-defined for measurement theory to apply. In such cases, it may still be fruitful to consider what arbitrary choices were made in the course of measurement, what effect these choices may have had on the measurements, and whether some plausible class of permissible transformations can be determined.  

Is measurement level a fixed, immutable property of the data?

Measurement level depends on the correspondence between the measurements and the attribute. Given a set of data, one cannot say what the measurement level is without knowing what attribute is being measured. It is possible that a certain data set might be treated as measuring different attributes at different times for different purposes. 

Consider a rat in a Skinner box who pushes a lever to get food pellets. The number of pellets dispensed in the course of an experiment is obviously an absolute-level measurement of the number of pellets dispensed. If number of pellets is considered as a measure of some other attribute, the measurement level may differ. As a measure of amount of food dispensed, the number of pellets is at the ratio level under the assumption that the pellets are of equal size; if the pellets are not of equal size, a more elaborate measurement model is required, perhaps one involving random measurement error if the pellets are dispensed in random order. As a measure of duration during the experiment, the number of pellets is at an ordinal level. As a measure of response effort, the number of pellets might be approximately ratio level, but we would need to consider whether the rat's responses were executed in a consistent way, whether the rat may miss the lever, and so forth. As a measure of amount of reward, the number of pellets could only be justified by some very strong assumptions about the nature of rewards; the measurement level would depend on the precise nature of those assumptions. The main virtue of measurement theory is that it encourages people to consider such issues. 

Once a set of measurements has been made on a particular scale, it is possible to transform the measurements to yield a new set of measurements at a different level. It is always possible to transform from a stronger level to a weaker level. For example, a temperature measurement in degrees Kelvin is at the ratio level. If we convert the measurements to degrees Celsius, the level is interval. If we rank the measurements, the level becomes ordinal. In some cases it is possible to convert from a weaker scale to a stronger scale. For example, correspondence analysis can convert nominal or ordinal measurements to an interval scale under appropriate assumptions.

Isn't an ordinal scale just an interval scale with error?

You can view an ordinal scale as an interval scale with error if you really want to, but the errors are not independent, additive, or identically distributed as required for many statistical methods. The errors would involve complicated dependencies to maintain monotonicity with the interval scale. In the example above with number of pellets as a measure of duration, the errors would be cumulative, not additive, and the error variance would increase over time. Hence for most statistical purposes, it is useless to consider an ordinal scale as an interval scale with measurement error.

What does measurement level have to do with discrete vs. continuous?

Measurement level has nothing to do with discrete vs. continuous variables. The distinction between discrete and continuous random variables is commonly used in statistical theory, but that distinction is rarely of importance in practice. A continuous random variable has a continuous cumulative distribution function. A discrete random variable has a stepwise-constant cumulative distribution function. A discrete random variable can take only a finite number of distinct values in any finite interval. There exist random variables that are neither continuous nor discrete; for example, if Z is a standard normal random variable and Y=max(0,Z), then Y is neither continuous nor discrete, but has characteristics of both. 
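The mixed character of Y = max(0, Z) can be seen in a quick simulation. This is an informal sketch with an arbitrary seed and sample size; the variable names are illustrative.

```python
import random

rng = random.Random(1)   # fixed seed so the sketch is repeatable
n = 10000

# Y = max(0, Z) with Z standard normal.
ys = [max(0.0, rng.gauss(0.0, 1.0)) for _ in range(n)]

# Discrete-like part: a single value (zero) carries about half the probability.
share_at_zero = sum(1 for y in ys if y == 0.0) / n

# Continuous-like part: the positive values essentially never repeat.
positive = [y for y in ys if y > 0.0]
distinct_positive = len(set(positive))

print(f"share of draws exactly at zero: {share_at_zero:.3f}")
print(f"positive draws: {len(positive)}, distinct values among them: {distinct_positive}")
```

Roughly half of the draws pile up on the single value zero, as a discrete variable would, while the positive draws are spread over a continuum with no repeated values.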

While measurements are always discrete due to finite precision, attributes can be conceptually either discrete or continuous regardless of measurement level. Temperature is usually regarded as a continuous attribute, so temperature measurement to the nearest degree Kelvin is a ratio-level measurement of a continuous attribute. However, quantum mechanics holds that the universe is fundamentally discrete, so temperature may actually be a discrete attribute. In ordinal scales for continuous attributes, ties are impossible (or have probability zero). In ordinal scales for discrete attributes, ties are possible. Nominal scales usually apply to discrete attributes. Nominal scales for continuous attributes can be modeled but are rarely used. 

Don't the theorems in a statistics textbook prove the validity of statistical methods without reference to measurement theory?

Mathematical statistics is concerned with the connection between inference and data. Measurement theory is concerned with the connection between data and reality. Both statistical theory and measurement theory are necessary to make inferences about reality. 

Does measurement level determine what statistics are valid?

Measurement theory cannot determine some single statistical method or model as appropriate for data at a specific level of measurement. But measurement theory does in fact show that some statistical methods are inappropriate for certain levels of measurement if we want to make meaningful inferences about the attribute being measured. 

If we want to make statistical inferences regarding an attribute based on a scale of measurement, the statistical method must yield invariant or equivariant results under the permissible transformations for that scale of measurement. If this invariance or equivariance does not hold, then the statistical inferences apply only to the measurements, not to the attribute that was measured. 

If we record the temperature in degrees Fahrenheit in Cary, NC, at various times, we can compute statistics such as the mean, standard deviation, and coefficient of variation. Since Fahrenheit is an interval scale, only statistics that are invariant or equivariant under change of origin or unit of measurement are meaningful. The mean is meaningful because it is equivariant under change of origin or unit. The standard deviation is meaningful because it is invariant under change of origin and equivariant under change of unit. But the coefficient of variation is meaningless because it lacks such invariance or equivariance. The mean and standard deviation can easily be converted back and forth from Fahrenheit to Celsius, but we cannot compute the coefficient of variation in degrees Celsius if we know only the coefficient of variation in degrees Fahrenheit. 
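The behavior of the three statistics under a change of scale can be verified numerically. The readings below are hypothetical, and the names are illustrative; the sketch only assumes the standard Fahrenheit-to-Celsius conversion.

```python
import math
import statistics

def f_to_c(f):
    """Convert degrees Fahrenheit to degrees Celsius (change of origin and unit)."""
    return (f - 32.0) * 5.0 / 9.0

temps_f = [50.0, 59.0, 68.0, 77.0, 86.0]   # hypothetical readings
temps_c = [f_to_c(t) for t in temps_f]

mean_f, sd_f = statistics.mean(temps_f), statistics.stdev(temps_f)
mean_c, sd_c = statistics.mean(temps_c), statistics.stdev(temps_c)

# Mean: equivariant, it converts exactly like a single reading.
mean_converts = math.isclose(mean_c, f_to_c(mean_f))

# Standard deviation: ignores the origin and scales with the unit.
sd_converts = math.isclose(sd_c, sd_f * 5.0 / 9.0)

# Coefficient of variation: depends on the arbitrary zero point.
cv_f, cv_c = sd_f / mean_f, sd_c / mean_c

print(mean_converts, sd_converts)
print(f"CV in Fahrenheit: {cv_f:.3f}, CV in Celsius: {cv_c:.3f}")
```

The mean and standard deviation carry over between the two scales by simple conversion, while the two coefficients of variation disagree and neither can be recovered from the other alone.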

It is clear that if we are estimating a parameter that lacks invariance or equivariance under permissible transformations, we are estimating a chimera. The situation for hypothesis testing is more subtle. It is nonsense to test a null hypothesis the truth of which is not invariant under permissible transformations. For example, it would be meaningless to test the null hypothesis that the mean temperature in Cary in July is twice the mean temperature in December using a Fahrenheit or Celsius scale--we would need a ratio scale for that hypothesis to be meaningful. 

But it is possible for the null hypothesis to be meaningful even if the error rates for a given test are not invariant. Suppose that we had an ordinal scale of temperature, and the null hypothesis was that the distribution of temperatures in July is identical to the distribution in December. The truth of this hypothesis is invariant under strictly increasing monotone transformations and is therefore meaningful under an ordinal scale. But if we do a t-test of this hypothesis, the error rates will not be invariant under monotone transformations. Hard-core measurement theorists would therefore consider a t-test inappropriate. But given a null hypothesis, there are usually many different tests that can be performed with accurate or conservative significance levels but with different levels of power against different alternatives. The fact that different tests have different error rates does not make any of them correct or incorrect. Hence a soft-core measurement theorist might argue that invariance of error rates is not a prerequisite for a meaningful hypothesis test--only invariance of the null hypothesis is required. 
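To make the contrast concrete, the sketch below compares a t statistic with a rank-based statistic before and after a strictly increasing transformation. The temperatures are hypothetical, the exponential is chosen as a deliberately severe monotone distortion, and welch_t and mann_whitney_u are small helper functions written for this illustration, not library calls.

```python
import math
import statistics

def welch_t(a, b):
    """Two-sample t statistic with unpooled variances."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va / len(a) + vb / len(b))

def mann_whitney_u(a, b):
    """Number of pairs (x from a, y from b) with x > y; ties count one half."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)

# Hypothetical temperature readings on an ordinal-only scale.
july = [24.0, 27.5, 30.0, 31.5, 33.0]
december = [2.0, 5.5, 8.0, 9.5, 26.0]

# A strictly increasing transformation leaves the rank order untouched.
july_exp = [math.exp(x) for x in july]
december_exp = [math.exp(x) for x in december]

t_raw, t_exp = welch_t(july, december), welch_t(july_exp, december_exp)
u_raw, u_exp = mann_whitney_u(july, december), mann_whitney_u(july_exp, december_exp)

print(f"t statistic: {t_raw:.2f} before, {t_exp:.2f} after")
print(f"U statistic: {u_raw} before, {u_exp} after")
```

The rank statistic is identical in both versions, so any test based on it has the same error rates under every monotone rescaling; the t statistic changes substantially, which is the hard-core objection in miniature.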

Nevertheless, the hard-core policy rules out certain tests that, while not incorrect in a strict sense, are indisputably poor tests in terms of having absurdly low power. Consider the null hypothesis that two random variables are independent of each other. This hypothesis is invariant under one-to-one transformations of either variable. Suppose we have two nominal variables, say, religion and preferred statistical software product, to which we assign arbitrary numbers. After verifying that at least one of the two variables is approximately normally distributed, we could test the null hypothesis using a Pearson product-moment correlation, and this would be a valid test. However, the power of this test would be so low as to be useless unless we were lucky enough to assign numbers to categories in such a way as to reveal the dependence as a linear relationship. Measurement theory would suggest using a test that is invariant under one-to-one transformations, such as a chi-squared test of independence in a contingency table. Another possibility would be to use a Pearson product-moment correlation after assigning numbers to categories in such a way as to maximize the correlation (although the usual sampling distribution of the correlation coefficient would not apply). In general, we can test for independence by maximizing some measure of dependence over all permissible transformations.
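The dependence of the Pearson correlation on an arbitrary numbering, and the indifference of the chi-squared statistic to it, can be sketched as follows. The data are hypothetical and deliberately simple (two perfectly dependent nominal variables with placeholder category labels), and the helper names are assumptions for this illustration.

```python
import math
from collections import Counter

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def chi_square(xs, ys):
    """Pearson chi-squared statistic for the contingency table of xs by ys."""
    n = len(xs)
    row, col, cell = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((cell[(a, b)] - row[a] * col[b] / n) ** 2 / (row[a] * col[b] / n)
               for a in row for b in col)

def recode(labels, codebook):
    """Assign arbitrary numbers to category labels."""
    return [codebook[label] for label in labels]

# Hypothetical nominal data with placeholder labels.
religion = list("AAAABBBBCCCC")
software = list("XXXXYYYYZZZZ")

religion_codes = recode(religion, {"A": 1, "B": 2, "C": 3})
coding_a = recode(software, {"X": 1, "Y": 2, "Z": 3})
coding_b = recode(software, {"X": 2, "Y": 3, "Z": 1})   # a different one-to-one numbering

r_a = pearson(religion_codes, coding_a)
r_b = pearson(religion_codes, coding_b)
chi_a = chi_square(religion_codes, coding_a)
chi_b = chi_square(religion_codes, coding_b)

print(f"Pearson r: {r_a:.2f} with coding A, {r_b:.2f} with coding B")
print(f"chi-squared: {chi_a:.2f} with coding A, {chi_b:.2f} with coding B")
```

The same perfectly dependent data yield a correlation of 1 under one numbering and a negative correlation under another, while the chi-squared statistic is the same for both.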

However, it must be emphasized that there is no need to restrict the transformations in a statistical analysis to those that are permissible. That is not what permissible transformation means. The point is that statistical methods should be used that give invariant results under the class of permissible transformations, because those transformations do not alter the meaning of the measurements. Permissible was undoubtedly a poor choice of words, but Stevens was quite clear about what he meant. For example, Stevens (1959) states that: 

“In general, the more unrestricted the permissible transformations, the more restricted the statistics. Thus, nearly all statistics are applicable to measurements made on ratio scales, but only a very limited group of statistics may be applied to measurements made on nominal scales.”

People who fail to distinguish between inferences regarding the attribute and inferences regarding the measurements have hotly disputed the connection between measurement level and statistical analysis in the psychometric and statistical literature. If one is interested only in making inferences about the measurements without regard to their meaning, then measurement level is, of course, irrelevant to choice of statistical method. The classic example is Lord (1953) who argued that statistical methods could be applied regardless of level of measurement, and concocted a silly example involving the jersey numbers assigned to football players, which Lord claimed were nominal-level measurements of the football players. Lord contrived a situation in which freshmen claimed they were getting lower numbers than the sophomores, so the purpose of the analysis was to make inferences about the numbers, not about some attribute measured by the numbers. It was therefore quite reasonable to treat the numbers as if they were on an absolute scale. However, this argument completely misses the point by eliminating the measured attribute from the scenario. 

The confusion between measurements and attributes was perpetuated by Velleman and Wilkinson (1993). Velleman and Wilkinson set up a series of straw men and knocked some of them down, while consistently misunderstanding the meaning of measurement and of permissible transformation. For example, they claimed that the number of cylinders in an automobile engine can be treated, depending on the circumstances, as nominal, ordinal, interval, or ratio, and hence the concept of measurement level 'simplifies the matter so far as to be false.' What is false is not measurement theory but Velleman and Wilkinson's backwards interpretation of it. It is important to understand that the level of measurement of a variable does not mandate how that variable must appear in a statistical model. However, the measurement level does suggest reasonable ways to use a variable by default.

Consider the analysis of fuel efficiency in automobiles. If we are interested in the average distance that can be driven with a given amount of gas, we should analyze miles per gallon. If we are interested in the average amount of gas required to drive a given distance, we should analyze gallons per mile. Both miles per gallon and gallons per mile are measurements of fuel efficiency, but they may yield quite different results in a statistical analysis, and there may be no clear reason to use one rather than the other. So how can we make inferences regarding fuel efficiency that do not depend on the choice between these two scales of measurement? We can do that by recognizing that both miles per gallon and gallons per mile are measurements of the same attribute on a log-interval scale, and hence that the logarithm of either can be treated as a measurement on an interval scale. Thus, if we were doing a regression, it would be reasonable to begin the analysis using log(mpg). If evidence of nonlinearity were detected, then other transformations could still be considered. 
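The dependence on the choice of scale can be illustrated with a small sketch. The figures below are hypothetical, chosen only to make the effect visible, and the helper name mean_difference is an assumption of this example. On the raw scales the two groups change places depending on whether we analyze miles per gallon or gallons per mile; on the log scale the two analyses agree up to sign.

```python
import math
from statistics import mean

# Hypothetical fuel-efficiency figures in miles per gallon (not real data).
group_a = [10.0, 40.0]
group_b = [22.0, 22.0]


def mean_difference(a, b, transform):
    """Mean of the transformed values in a minus mean of those in b."""
    return mean([transform(x) for x in a]) - mean([transform(x) for x in b])


# Raw scales: miles per gallon and its reciprocal, gallons per mile.
diff_mpg = mean_difference(group_a, group_b, lambda x: x)
diff_gpm = mean_difference(group_a, group_b, lambda x: 1.0 / x)

# Log scales: log(gpm) is just -log(mpg), so only the sign can change.
diff_log_mpg = mean_difference(group_a, group_b, math.log)
diff_log_gpm = mean_difference(group_a, group_b, lambda x: math.log(1.0 / x))

print("mean mpg, A minus B:     ", diff_mpg)      # positive: A looks more efficient
print("mean gpm, A minus B:     ", diff_gpm)      # positive: A looks less efficient
print("mean log(mpg), A minus B:", diff_log_mpg)
print("mean log(gpm), A minus B:", diff_log_gpm)  # same size, opposite sign
```

With these numbers, group A has the higher mean miles per gallon and also the higher mean gallons per mile, so the two raw analyses contradict each other, while both log analyses say that group B is the more efficient.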

But measurement level has been shown empirically to be irrelevant to statistical results, hasn't it?

What has been shown is that various statistical methods are more or less robust to distortions that could arise from smooth monotone transformations; in other words, there are cases where it makes little difference whether we treat a measurement as ordinal or interval. But there can hardly be any doubt that it often makes a huge difference whether we treat a measurement as nominal or ordinal, and confusion between interval and ratio scales is a common source of nonsense. 

Suppose we are doing a two-sample t-test; we are sure that the assumptions of ordinal measurement are satisfied, but we are not sure whether an equal-interval assumption is justified. A smooth monotone transformation of the entire data set will generally have little effect on the p value of the t-test. A robust variant of a t-test will likely be affected even less (and, of course, a rank version of a t-test will be affected not at all). It should come as no surprise then that a decision between an ordinal or an interval level of measurement is of no great importance in such a situation, but anyone with lingering doubts on the matter may consult the simulations in Baker, Hardyck, and Petrinovich (1966) for a demonstration of the obvious. 
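A minimal sketch of this comparison, using hypothetical 1-to-5 ratings and three arbitrarily chosen smooth monotone transformations, is given below. It computes Welch's t statistic and the Mann-Whitney U statistic by hand rather than relying on any statistics package; the function names are assumptions of this example. The rank statistic is exactly the same under every transformation, while the t statistic moves but keeps its sign.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's two-sample t statistic (variances not assumed equal)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def mann_whitney_u(a, b):
    """Mann-Whitney U for sample a: pairs with a > b, ties counted as 1/2."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


# Hypothetical ratings for two groups (not real data).
group_1 = [1.0, 2.0, 2.0, 3.0, 4.0, 4.0, 5.0]
group_2 = [2.0, 3.0, 4.0, 4.0, 5.0, 5.0, 5.0]

# Strictly increasing transformations of the whole data set.
transforms = {
    "identity": lambda x: x,
    "sqrt": math.sqrt,
    "square": lambda x: x * x,
    "log": math.log,
}

results = {}
for name, f in transforms.items():
    a = [f(x) for x in group_1]
    b = [f(x) for x in group_2]
    results[name] = (welch_t(a, b), mann_whitney_u(a, b))

for name, (t, u) in results.items():
    print(f"{name:>8}: t = {t:.3f}, U = {u}")
```

How much the t statistic moves depends on the data and on how severe the transformation is; the sketch only shows the qualitative pattern described above.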

On the other hand, suppose we were comparing the variability instead of the location of the two samples. The F test for equality of variances is not robust, and smooth monotone transformations of the data could have a large effect on the p value. Even a more robust test could be highly sensitive to smooth monotone transformations if the samples differed in location. 
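The sensitivity of a variance comparison can be seen in a small sketch with hypothetical data: two samples with identical spread but different location. The variance ratio (the F statistic for equality of variances) is 1 on the original scale, yet a smooth monotone transformation pushes it far from 1 in either direction. The choice of squaring and taking logs is an arbitrary assumption of this example.

```python
import math
from statistics import variance

# Hypothetical samples: same spread, different location (not real data).
sample_a = [1.0, 2.0, 3.0, 4.0, 5.0]
sample_b = [11.0, 12.0, 13.0, 14.0, 15.0]


def variance_ratio(a, b, transform):
    """Ratio of sample variances after applying transform to every value."""
    return variance([transform(x) for x in a]) / variance([transform(x) for x in b])


ratio_identity = variance_ratio(sample_a, sample_b, lambda x: x)
ratio_square = variance_ratio(sample_a, sample_b, lambda x: x * x)
ratio_log = variance_ratio(sample_a, sample_b, math.log)

print("variance ratio, original scale:", ratio_identity)
print("variance ratio, squared values:", ratio_square)  # far below 1
print("variance ratio, logged values: ", ratio_log)     # far above 1
```

Squaring stretches the upper part of the scale, where sample B lies, and the logarithm compresses it, so the same two samples can appear to differ in variability in opposite directions.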

Measurement level is of greatest importance in situations where the meaning of the null hypothesis depends on measurement assumptions. Suppose the data are 1-to-5 ratings obtained from two groups of people, say males and females, regarding how often the subjects have sex: frequently, sometimes, rarely, etc. Suppose that these two groups interpret the term 'frequently' differently as applied to sex; perhaps males consider 'frequently' to mean twice a day, while females consider it to mean once a week. Females may report having sex more 'frequently' than men on the 1-to-5 scale, even if men in fact have sex more frequently as measured by sexual acts per unit of time. Hence measurement considerations are crucial to the interpretation of the results.

What are some more examples of how measurement level relates to statistical methodology?

As mentioned earlier, it is meaningless to claim that it was twice as warm today as yesterday because it was 40 degrees Fahrenheit today but only 20 degrees yesterday. Fahrenheit is not a ratio scale, and there is no meaningful sense in which 40 degrees is twice as warm as 20 degrees. It would be just as meaningless to compute the geometric mean or coefficient of variation of a set of temperatures in degrees Fahrenheit, since these statistics are not invariant or equivariant under change of origin. There are many other statistics that can be meaningfully applied only to data at a sufficiently strong level of measurement.  
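The temperature example can be checked numerically. The sketch below uses hypothetical temperatures chosen to be positive in both Fahrenheit and Celsius so that a geometric mean can be computed at all; the helper names are assumptions of this example. The arithmetic mean converts correctly from one scale to the other, but the geometric mean and the coefficient of variation do not. For contrast, a pure change of unit on a ratio scale (inches to centimeters) leaves the coefficient of variation unchanged.

```python
from statistics import geometric_mean, mean, stdev


def f_to_c(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32.0) * 5.0 / 9.0


def coefficient_of_variation(data):
    """Sample standard deviation divided by the mean."""
    return stdev(data) / mean(data)


# Hypothetical temperatures (not real data), positive on both scales.
temps_f = [41.0, 50.0, 59.0, 68.0, 86.0]
temps_c = [f_to_c(t) for t in temps_f]

# The arithmetic mean is equivariant: convert then average, or average then convert.
mean_f, mean_c = mean(temps_f), mean(temps_c)

# The geometric mean and coefficient of variation depend on the arbitrary origin.
gm_f, gm_c = geometric_mean(temps_f), geometric_mean(temps_c)
cv_f, cv_c = coefficient_of_variation(temps_f), coefficient_of_variation(temps_c)

# Ratio-scale contrast: hypothetical lengths in inches and centimeters.
lengths_in = [2.0, 4.0, 5.0, 9.0]
lengths_cm = [2.54 * x for x in lengths_in]
cv_in, cv_cm = coefficient_of_variation(lengths_in), coefficient_of_variation(lengths_cm)

print("mean:           ", f_to_c(mean_f), "vs", mean_c)
print("geometric mean: ", f_to_c(gm_f), "vs", gm_c)
print("CV of temps:    ", cv_f, "vs", cv_c)
print("CV of lengths:  ", cv_in, "vs", cv_cm)
```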

The general principle is that an appropriate statistical analysis must yield invariant or equivariant results for all permissible transformations. Obviously, we cannot actually conduct an infinite number of analyses of a real data set corresponding to an infinite class of transformations. However, it is often straightforward to verify or falsify the invariance mathematically. The application of this idea to summary statistics such as means and coefficients of variation is fairly widely understood. 
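A numerical spot check is no substitute for the mathematical argument, but it can falsify a claimed invariance quickly. The sketch below, with hypothetical ordinal data and an assumed helper named commutes, tests whether a statistic is equivariant under a few strictly increasing transformations, that is, whether computing the statistic and then transforming gives the same answer as transforming and then computing the statistic. The median of an odd-sized sample passes for every such transformation; the mean fails.

```python
import math
from statistics import mean, median


def commutes(stat, f, data, tol=1e-12):
    """True if stat(f(data)) equals f(stat(data)) up to tol."""
    return abs(stat([f(x) for x in data]) - f(stat(data))) <= tol


# Hypothetical ordinal-level scores; odd sample size so the median is a data value.
scores = [1.0, 2.0, 2.0, 3.0, 9.0]

# A few strictly increasing transformations, permissible at the ordinal level.
monotone = {
    "sqrt": math.sqrt,
    "cube": lambda x: x ** 3,
    "log": math.log,
}

median_ok = {name: commutes(median, f, scores) for name, f in monotone.items()}
mean_ok = {name: commutes(mean, f, scores) for name, f in monotone.items()}

print("median equivariant:", median_ok)
print("mean equivariant:  ", mean_ok)
```

Passing such a check for a handful of transformations proves nothing in general, whereas a single failure is enough to show that the statistic is not meaningful at that level of measurement.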

Confusion arises when we come to linear or nonlinear models and consider transformations of variables. Recall that Stevens did not say that transformations that are not 'permissible' are prohibited. What Stevens said was that we should consider all 'permissible' transformations and verify that our conclusions are invariant. 

Consider, for example, the problem of estimating the parameters of a nonlinear model by maximum likelihood (ML), and comparing various models by likelihood ratio (LR) tests. We would want the LR tests to be invariant under the permissible transformations of the variables. One way to do this is to parameterize the model so that any permissible transformation can be inverted by a corresponding change in the parameter estimates. In other words, we can make the ML and LR tests invariant by making the inverse-permissible transformations mandatory (this is the same set of transformations as the permissible transformations except for a degeneracy here and there which I won't worry about).  
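The idea is easiest to see in a deliberately simplified case, which is an assumption of this sketch rather than the nonlinear setting discussed above: a Gaussian linear model with one interval-scale predictor. Because the model contains both an intercept and a slope, any permissible (affine) transformation of the predictor is absorbed by a corresponding change in the parameter estimates, and the LR statistic for testing a zero slope is unchanged. The data and function names are hypothetical.

```python
import math
from statistics import mean


def fit_line(x, y):
    """Least-squares (Gaussian ML) fit of y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return intercept, slope, rss


def lr_statistic(x, y):
    """LR statistic for slope = 0, with the error variance estimated by ML."""
    my = mean(y)
    rss_null = sum((yi - my) ** 2 for yi in y)
    _, _, rss_full = fit_line(x, y)
    return len(y) * math.log(rss_null / rss_full)


# Hypothetical data: a response measured at several temperatures.
celsius = [0.0, 5.0, 10.0, 15.0, 20.0, 25.0]
response = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9]

# A permissible transformation of an interval scale: Celsius to Fahrenheit.
fahrenheit = [32.0 + 1.8 * c for c in celsius]

_, slope_c, _ = fit_line(celsius, response)
_, slope_f, _ = fit_line(fahrenheit, response)
lr_c = lr_statistic(celsius, response)
lr_f = lr_statistic(fahrenheit, response)

print("slope per degree C:", slope_c, " slope per degree F:", slope_f)
print("LR statistic (C):", lr_c, " LR statistic (F):", lr_f)
```

If the intercept were dropped from the model, the change of origin could no longer be absorbed by the parameters, and the fit would depend on which temperature scale was used.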

In Conclusion?

Measurement theory shows that strong assumptions are required for certain statistics to provide meaningful information about reality. Measurement theory encourages people to think about the meaning of their data. It encourages critical assessment of the assumptions behind the analysis. It encourages responsible real-world data analysis. 

In real life, a scale of measurement may not correspond precisely to any of these levels of measurement. For example, there can be a mixture of nominal and ordinal information in a single scale, such as in questionnaires that have several non-response categories. It is common to have scales that lie somewhere between the ordinal and interval levels in that the measurements can be assumed to be a smooth monotone function of the attribute. For many subjective rating scales (such as the 'strongly agree,' 'agree,' ... 'strongly disagree' variety) it cannot be shown that the intervals between successive ratings are exactly equal, but with reasonable care and diagnostics it may be safe to say that no interval represents a difference more than two or three times greater than another interval. At a minimum, even when the measurement process is poorly defined, it still seems useful to consider what (possibly arbitrary) choices were made in the course of measurement, what effect these choices might have had on the measurements, and whether some plausible class of permissible transformations can be determined.



References

Achen, C., & Shively, W. P. (1995). Cross-level inference. Chicago: University of Chicago Press.

Baker, B. O., Hardyck, C., & Petrinovich, L. F. (1966). Weak measurement vs. strong statistics: An empirical critique of S. S. Stevens' proscriptions on statistics. Educational and Psychological Measurement, 26, 291-309.

Binder, A. (1984). Restrictions on statistics imposed by method of measurement: Some reality, some myth. Journal of Criminal Justice, 12, 467-481.

Cliff, N. (1992). Abstract measurement theory and the revolution that never happened. Psychological Science, 3(3), 186-190.

Gorsuch, R. L. (1983). Factor analysis. Hillsdale, NJ: Lawrence Erlbaum.

Jaccard, J., & Wan, C. K. (1996). LISREL approaches to interaction effects in multiple regression. Thousand Oaks, CA: Sage Publications.

Kim, J. O. (1975). Multivariate analysis of ordinal variables. American Journal of Sociology, 81, 261-298.

Labovitz, S. (1967). Some observations on measurement and statistics. Social Forces, 46, 151-160.

Labovitz, S. (1970). The assignment of numbers to rank order categories. American Sociological Review, 35, 515-524.

Lord, F. M. (1953). On the statistical treatment of football numbers. American Psychologist, 8, 750-751.

Luce, R. D., & Krumhansl, C. L. (1988). Measurement, scaling, and psychophysics. In R. C. Atkinson, R. J. Herrnstein, G. Lindzey, & R. D. Luce (Eds.), Stevens' handbook of experimental psychology (2nd ed., Ch. 1, pp. 3-74). New York: Wiley.

Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677-680.

Stevens, S. S. (1951). Mathematics, measurement and psychophysics. In S. S. Stevens (Ed.), Handbook of experimental psychology (Ch. 1, pp. 1-49). New York: Wiley.

Stevens, S. S. (1959). Measurement, psychophysics and utility. In C. W. Churchman & P. Ratoosh (Eds.), Measurement: Definitions and theories (Ch. 2). New York: John Wiley.

Torgerson, W. S. (1958). Theory and methods of scaling. New York: John Wiley.

Velleman, P. F., & Wilkinson, L. (1993). Nominal, ordinal, interval, and ratio typologies are misleading. The American Statistician, 47, 65-72.

Wilson, T. (1971). Critique of ordinal variables. In H. M. Blalock (Ed.), Causal models in the social sciences (Ch. 24). Chicago: Aldine.

Zumbo, B. D., & Zimmerman, D. W. (1993). Is the selection of statistical methods governed by level of measurement? Canadian Psychology, 34, 390-399. Defends robustness of parametric techniques even when using ordinal data.
