Correlation of data: Scatter plots
Sometimes two different characteristics of a population exhibit behavior that seems to indicate that the measured values of one characteristic for an individual can be used to predict the measured values of the other characteristic for that same individual. This may be due to one characteristic affecting the other, a third factor affecting both characteristics, or it may be a coincidence altogether.
Several statistical measures can be used to illustrate and better understand the relationship between two characteristics, or variables, of a population. We hope to present a glimpse of some of these measures, and to show how Mathematica can be used to apply them. This statistical study of the strength of the relationship between two variables is known as correlation analysis.
If it is suspected that one of two characteristics of a population influences the other, then this characteristic is referred to as the independent variable, while the other is referred to as the dependent variable. If it is unclear whether one characteristic might depend on the other, then the designation may be made arbitrarily, without assuming that any dependence exists, since we are trying to determine whether a correlation exists rather than assuming one beforehand.
Now, an investigation of the correlation between two characteristics of a population requires two pieces of information per individual of a sample of the population: measurements of the same two characteristics for each individual. This set of pairs can be treated as a set of points in a plane, which leads to a simple method of representing this type of data visually. A scatter plot is a plot of the points (x_{i}, y_{i}) of a set of data collected by the measurements, x_{i} and y_{i}, of two characteristics of the individuals of a sample of a population, the values x_{i} representing the independent variable and the values y_{i} representing the dependent variable. We can produce a scatter plot of a set of points contained in a list data
data={{x_{1},y_{1}},{x_{2},y_{2}},...,{x_{n},y_{n}}}
using the command ListPlot. This is done as follows:
ListPlot[data]
This yields a plot of the points contained in data on a rectangular section of the Cartesian plane. There are a number of optional modifications that can be made to a ListPlot command. They are incorporated into the command in the following manner:
ListPlot[data,option_{1}->value_{1},option_{2}->value_{2},...]
Here we understand that option_{1} is set to the value value_{1}, option_{2} is set to value_{2}, and so on. We can obtain a list of the options that ListPlot accepts with an Options command:
Options[ListPlot]
This will list the set of options along with their default values.
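As a minimal sketch of this option syntax, the following input plots a small list with three options that ListPlot accepts (PlotStyle, AxesLabel, and PlotRange). The list sample and its values are hypothetical, chosen only for illustration.

```mathematica
(* hypothetical sample data, for illustration only *)
sample = {{1, 2.1}, {2, 3.9}, {3, 6.2}, {4, 7.8}};

(* each option is supplied as a rule of the form name -> value *)
plot = ListPlot[sample,
  PlotStyle -> PointSize[0.02],     (* draw larger points *)
  AxesLabel -> {"x", "y"},          (* label the two axes *)
  PlotRange -> {{0, 5}, {0, 10}}]   (* explicit x and y ranges *)
```

Any option left out of the command keeps the default value reported by Options[ListPlot].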

Example 1: Suppose that the heights, in inches, and weights, in pounds, of a group of 10 males, aged 30 through 40 years, are listed in the following table:

height    weight
62.1      157
58.3      161
73.2      198
65.9      192
69.4      180
75.4      248
71.2      203
68.9      182
67.3      195
64.8      168
where the height of each individual is listed in the first column, and the corresponding weight of each individual is listed in the second column. Let us form a scatter plot of this data.
To accomplish this, we shall merely input the data as a list of paired values,
In[1]:= values={{62.1,157},{58.3,161},{73.2,198},
{65.9,192},{69.4,180},{75.4,248},{71.2,203},
{68.9,182},{67.3,195},{64.8,168}};
and then we will plot the resulting set of points using the ListPlot command. Since it seems more likely that weight depends on height than that height depends on weight, we have designated height as the independent variable and weight as the dependent variable. To better center the graph of points, we modify the plot by specifying the x- and y-dimensions of the graphics output with the option PlotRange. Note that the range chosen here, 55 to 75 in height and 150 to 200 in weight, leaves the points (75.4, 248) and (71.2, 203) outside the plotted region; a wider range is needed if every point is to appear.
In[2]:= ListPlot[values,PlotRange->{{55,75},{150,200}}]
Out[2]= -Graphics-
Thus we obtain a visual sense of how the height and weight of these individuals correlate to each other.
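A plot range can also be derived from the data itself, so that no point is cut off. The sketch below is one way to do this, not the only one: the helper pad and the 5% margin are our own assumptions, introduced purely for illustration. The built-in setting PlotRange->All is a simpler alternative when no margin is wanted.

```mathematica
values = {{62.1, 157}, {58.3, 161}, {73.2, 198}, {65.9, 192},
   {69.4, 180}, {75.4, 248}, {71.2, 203}, {68.9, 182},
   {67.3, 195}, {64.8, 168}};

(* split the pairs into a list of heights and a list of weights *)
{heights, weights} = Transpose[values];

(* hypothetical helper: the min-max interval of a list,
   widened by 5% of its length on each side *)
pad[list_] := {Min[list], Max[list]} +
   {-1, 1}*0.05*(Max[list] - Min[list]);

xRange = pad[heights];
yRange = pad[weights];

ListPlot[values, PlotRange -> {xRange, yRange},
  AxesLabel -> {"height (in)", "weight (lb)"}]
```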

Example 2: Suppose that the following set of paired values

1.32    2
2.37    5
0.81    4
0.63    3
1.03    8
3.21    4
consists of a list of data collected from six different households living in the same city. Each line of data represents a different household, with the first column containing the fraction of the national average of the annual income that the household earns in a year, and the second column containing the number of individuals living in that household. Let us form a scatter plot of this information.
We can input our data into the list info with the command
In[3]:= info={{1.32,2},{2.37,5},{0.81,4},{0.63,3},
{1.03,8},{3.21,4}};
To proceed with this problem, we must decide which characteristic of the households to use as the independent variable. An argument could be made for either characteristic, but we shall designate the second characteristic, the size of the household, as our independent variable. The first characteristic, the information on annual income, will then be the dependent variable. This means that we need the information in the second column as the x-coordinate of each point, and the information in the first column as the y-coordinate of each point. However, we have entered the data into the list with the coordinates in the opposite order. Thus we will need to reverse the coordinates of each point in the list before we form the scatter plot. The following commands switch the coordinates of the points in our list:
In[4]:= info=Transpose[info];
info=Reverse[info];
info=Transpose[info];
The scatter plot is then obtained with the command
In[5]:= ListPlot[info,PlotRange->{{0,10},{0,4}}]
Out[5]= -Graphics-
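The coordinate swap performed in In[4] can also be sketched in a single step by applying Reverse to each pair with Map. The names original and swapped are ours, introduced so that the list info is left untouched; the data is the household data as first entered.

```mathematica
(* household data as originally entered: {income fraction, household size} *)
original = {{1.32, 2}, {2.37, 5}, {0.81, 4}, {0.63, 3},
   {1.03, 8}, {3.21, 4}};

(* reverse each pair individually: {income, size} becomes {size, income} *)
swapped = Map[Reverse, original];

(* check that this agrees with the Transpose-Reverse-Transpose route *)
sameResult = (swapped === Transpose[Reverse[Transpose[original]]])
```

Both routes give the same list of points; the Map form simply makes it explicit that each point is being reversed on its own.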
By incorporating the function BarChart contained within the Graphics`Graphics` package, we can form a bar chart that compares two lists of numbers, each having the same number of elements. Suppose we have two sets of data, data_{1} and data_{2}, with which we wish to form some such comparison. Once the Graphics`Graphics` package is loaded
<< Graphics`Graphics`
the command
BarChart[data_{1},data_{2}]
will provide a bar chart for each data set, aligning the charts so that the graphs corresponding to the i^{th} pieces of data from each set will be side by side.

Example 3: Let us form a bar chart for the data from the previous problem.
Let us assume that the data has already been entered in the list info, with the number of individuals in each household listed first in each pair. We need the first coordinates and the second coordinates as two separate lists. Transposing info produces exactly that: a list whose two elements are the list of household sizes and the list of incomes.
In[6]:= info=Transpose[info];
We then form a bar chart from the two sublists of this data, the lists info[[1]] and info[[2]], by first loading the appropriate package
In[7]:= << Graphics`Graphics`
and then by utilizing the BarChart command as follows:
In[8]:= BarChart[info[[1]],info[[2]]]
Out[8]= -Graphics-
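The Graphics`Graphics` package belongs to older releases of Mathematica. In later releases BarChart is a built-in function with a different calling convention: it takes a single list, and a list of sublists is drawn as groups of adjacent bars. The sketch below assumes such a release (to our knowledge, version 7 onward) and is not the command used above; the names sizes, incomes, and groups are ours.

```mathematica
(* the two lists from Example 3, written out explicitly *)
sizes = {2, 5, 4, 3, 8, 4};
incomes = {1.32, 2.37, 0.81, 0.63, 1.03, 3.21};

(* pair the i-th entries so that each household forms one group of two bars *)
groups = Transpose[{sizes, incomes}];

(* built-in BarChart: each sublist of groups is drawn as adjacent bars *)
chart = BarChart[groups]
```

Passing {sizes, incomes} directly would instead draw all of the sizes as one group and all of the incomes as another, which is why the lists are transposed first.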
Scatter plots can also be made from ordered triples of data. This requires the ScatterPlot3D command from the Graphics`Graphics3D` package. Three-dimensional scatter plots are a little more difficult to visualize properly, but we can obtain a two-dimensional representation of a list data of ordered triples
data={{x_{1},y_{1},z_{1}},{x_{2},y_{2},z_{2}},...,{x_{n},y_{n},z_{n}}}
in three-dimensional space with a command of the form
ScatterPlot3D[data]
once the Graphics`Graphics3D` package has been loaded. Various options can be added to the input of this command in order to modify the output.

Example 4: Suppose that three measurements are taken from each member of a group of individuals, and are given in the following table:

35.1    62.1    43.1
35.6    58.3    44.2
30.2    73.2    45.6
34.6    65.9    45.8
31.9    69.4    44.0
37.8    75.4    50.5
33.0    71.2    48.6
39.3    68.9    44.3
32.7    67.3    46.1
38.8    64.8    42.7
Let us form a three-dimensional scatter plot of this information.
To do so we must input the data into a list:
In[9]:= points={{35.1,62.1,43.1},{35.6,58.3,44.2},
{30.2,73.2,45.6},{34.6,65.9,45.8},
{31.9,69.4,44.0},{37.8,75.4,50.5},
{33.0,71.2,48.6},{39.3,68.9,44.3},
{32.7,67.3,46.1},{38.8,64.8,42.7}};
We must next load the appropriate package:
In[10]:= << Graphics`Graphics3D`
Finally, we obtain our scatter plot with the command
In[11]:= ScatterPlot3D[points]
Out[11]= -Graphics3D-
Unfortunately, with this command we do not obtain a good sense of the location of each point in space. The following command allows us to put small cuboids in place of each point, and thus enables us to gain a better perspective on how the points lie in relation to each other:
In[12]:= Show[Graphics3D[Table[Cuboid[points[[i]]],
{i,1,10}]]]
Out[12]= -Graphics3D-
Note that a three-dimensional scatter plot does not graph well if the values of the different variables vary widely in magnitude.
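One possible remedy, sketched here under our own assumptions rather than taken from the text, is to rescale each variable to the interval from 0 to 1 before plotting, so that all three axes cover comparable ranges. The helper unitScale is hypothetical. The data from Example 4 is reused only to show the mechanics, since its three variables are already of similar magnitude.

```mathematica
points = {{35.1, 62.1, 43.1}, {35.6, 58.3, 44.2}, {30.2, 73.2, 45.6},
   {34.6, 65.9, 45.8}, {31.9, 69.4, 44.0}, {37.8, 75.4, 50.5},
   {33.0, 71.2, 48.6}, {39.3, 68.9, 44.3}, {32.7, 67.3, 46.1},
   {38.8, 64.8, 42.7}};

(* hypothetical helper: map the values of one variable linearly onto [0, 1] *)
unitScale[col_] := (col - Min[col])/(Max[col] - Min[col]);

(* rescale each variable (column) separately, then restore the list of triples *)
scaled = Transpose[Map[unitScale, Transpose[points]]];

(* draw the rescaled triples as points in three-dimensional space *)
Show[Graphics3D[{PointSize[0.02], Map[Point, scaled]}]]
```

The rescaled coordinates no longer carry the original units, so such a plot shows only the relative positions of the points.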
Last modified: Tue May 28 2002