Correlation of data: Scatter plots
Sometimes two different characteristics of a population exhibit behavior that seems to indicate that the measured values of one characteristic for an individual can be used to predict the measured values of the other characteristic for that same individual. This may be due to one characteristic affecting the other, a third factor affecting both characteristics, or it may be a coincidence altogether.
Several statistical measures can be used to illustrate and better understand the relationship between two characteristics, or variables, of a population. We present a glimpse of some of these measures and show how Mathematica can be used to apply them. The statistical study of the strength of the relationship between two variables is known as correlation analysis.
If it is suspected that one of two characteristics of a population influences the other, then that characteristic is referred to as the independent variable, while the other is referred to as the dependent variable. If it is unclear whether either characteristic depends on the other, the designation may be made arbitrarily, without assuming that any dependence exists, since the aim is to determine whether a correlation exists rather than to presume one beforehand.
Now, an investigation of the correlation between two characteristics of a population requires two pieces of information per individual of a sample of the population--measurements of the same two characteristics for each individual. This set of pairs of information can be treated as a set of points in a plane, leading us to a simple method of visually representing this type of data. A scatter plot is a plot of the points (xi, yi) of a set of data collected by the measurements, xi and yi, of two characteristics of the individuals of a sample of a population, the values xi representing the independent variable and the values yi representing the dependent variable. We can produce a scatter plot of a set of points contained in a list data
using the command ListPlot. This is done as follows:
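A minimal sketch of the basic call, assuming the points are stored as {x, y} pairs in a list named data. The values below are placeholders chosen only for illustration.

```mathematica
(* data is a list of {x, y} pairs; placeholder values, not taken from the text *)
data = {{1, 2}, {2, 3}, {3, 5}, {4, 4}};

(* Plot each pair as a point in the plane *)
ListPlot[data]
```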
This yields a plot of the points contained in data on a rectangular section of the Cartesian plane. There are a number of optional modifications that can be made to a ListPlot command. They are incorporated into the command in the following manner:
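The general shape is ListPlot[data, option1 -> value1, option2 -> value2, ...]. As a sketch of this pattern with two options that ListPlot accepts, PlotRange and AxesLabel, and with placeholder data:

```mathematica
(* Placeholder {x, y} pairs, not taken from the text *)
data = {{1, 2}, {2, 3}, {3, 5}, {4, 4}};

(* Each option is given as a rule of the form name -> value *)
ListPlot[data, PlotRange -> {{0, 5}, {0, 6}}, AxesLabel -> {"x", "y"}]
```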
Here we understand that option1 is set to the value value1, option2 is set to value2, and so on. We can obtain a list of the options that correspond to the ListPlot command with an Options command
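For instance, the following calls query the options of ListPlot. The second form, which asks for a single named option, is an addition here for illustration.

```mathematica
(* Full list of ListPlot options as rules name -> default value *)
Options[ListPlot]

(* Default setting of one particular option *)
Options[ListPlot, PlotRange]
```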
This will list the set of options along with their default values.
Example 1: Suppose that the heights, in inches, and weights, in pounds, of a group of 10 males, aged 30 through 40 years, are listed in the following table:
where the height of each individual is listed in the first column, and the corresponding weight of each individual is listed in the second column. Let us form a scatter plot of this data.
To accomplish this, we shall merely input the data as a list of paired values,
and then we will plot the set of resulting points using the ListPlot command. Since it seems more likely that weight depends on height than it does that height depends on weight, we have designated height as the independent variable, and weight as the dependent variable. To better center the graph of points, we modify the plot by specifying the x- and y-dimensions of the graphics output with the option PlotRange.
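As a sketch of these two steps, using made-up placeholder measurements rather than the values from the table, the input might look like the following. The PlotRange bounds are likewise assumptions chosen to frame the placeholder points.

```mathematica
(* Hypothetical {height in inches, weight in pounds} pairs; NOT the table's values *)
data = {{66, 150}, {68, 160}, {70, 172}, {71, 180}, {73, 195}};

(* PlotRange -> {{xmin, xmax}, {ymin, ymax}} sets the visible region *)
ListPlot[data, PlotRange -> {{64, 75}, {140, 200}}]
```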
Thus we obtain a visual sense of how the height and weight of these individuals correlate with each other.
Example 2: Suppose that the following set of paired values
consists of a list of data collected from six different households living in the same city. Each line of data represents a different household, with the first column containing the fraction of the national average of the annual income that the household earns in a year, and the second column containing the number of individuals living in that household. Let us form a scatter plot of this information.
We can input our data into the list info with the command
To proceed, we must decide which characteristic of the households to use as the independent variable. A case could be made for either one, but we shall designate the second characteristic--the size of the household--as our independent variable. The first characteristic, the information on annual income, will then be the dependent variable. This means that we need the information in the second column as the x-coordinate of each point, and the information in the first column as the y-coordinate of each point. However, we have input the data into a list as points with the coordinates switched. Thus we will need to reverse the coordinates of each point in the list before we form the scatter plot. Note that the following commands enable us to switch the coordinates of the points in our list:
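One way to do this, sketched here with a hypothetical placeholder list rather than the household data, is to map Reverse over the list. This is an assumed approach and may differ from the commands originally shown.

```mathematica
(* Hypothetical {income fraction, household size} pairs; placeholder values only *)
info = {{0.8, 2}, {1.1, 4}, {0.6, 3}};

(* Swap the two coordinates of every point *)
info = Map[Reverse, info]
(* info is now {{2, 0.8}, {4, 1.1}, {3, 0.6}} *)
```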
The scatter plot is then obtained with the command
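Assuming info already holds the reversed pairs (placeholder values below), a call of the following shape produces the plot. The axis labels are an optional addition for readability.

```mathematica
(* Hypothetical {household size, income fraction} pairs; placeholder values only *)
info = {{2, 0.8}, {4, 1.1}, {3, 0.6}};

ListPlot[info, AxesLabel -> {"household size", "income fraction"}]
```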
By incorporating the function BarChart contained within the Graphics`Graphics` package, we can form a bar chart that compares two lists of numbers, each having the same number of elements. Suppose we have two sets of data, data1 and data2, that we wish to compare in this way. Once the Graphics`Graphics` package is loaded, a command of the form BarChart[data1, data2] will provide a bar chart for each data set, aligning the charts so that the bars corresponding to the ith pieces of data from each set are side by side.
Example 3: Let us form a bar chart for the data from the previous problem.
Let us assume that the data has already been input to the list info, and is paired with the number of individuals in each household first. We need to express the data from each set of coordinates of the list info as a separate list. We can express this list as a set of two such lists by transposing it.
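A short sketch of the transposition, again with hypothetical placeholder pairs standing in for the household data:

```mathematica
(* Hypothetical {household size, income fraction} pairs; placeholder values only *)
info = {{2, 0.8}, {4, 1.1}, {3, 0.6}};

(* Turn a list of pairs into a pair of lists *)
info = Transpose[info]
(* info[[1]] is {2, 4, 3} and info[[2]] is {0.8, 1.1, 0.6} *)
```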
We then form a bar chart from the two sublists of this data, the lists info[[1]] and info[[2]], by first loading the appropriate package
In:= << Graphics`Graphics`
and then by utilizing the BarChart command as follows:
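A sketch of the call, assuming the legacy Graphics`Graphics` package described above is available and that info already holds the two transposed sublists (placeholder values here):

```mathematica
(* Placeholder data, already transposed into two sublists *)
info = {{2, 4, 3}, {0.8, 1.1, 0.6}};

(* Legacy package form: one argument per data set *)
<< Graphics`Graphics`
BarChart[info[[1]], info[[2]]]
```

In later versions of Mathematica, BarChart is built in and the package load is unnecessary. There the built-in function takes a single list of groups, so BarChart[Transpose[info]] would place the ith elements of the two sublists side by side.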
Scatter plots can also be made from ordered triples of data. This requires the ScatterPlot3D command from the Graphics`Graphics3D` package. Three-dimensional scatter plots are a little more difficult to visualize properly, but we can obtain a two-dimensional representation of a list data of ordered triples in three-dimensional space with a command of the form ScatterPlot3D[data] once the Graphics`Graphics3D` package has been loaded. Various options can be added to the input of this command in order to modify the output.
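A minimal sketch of the basic form, assuming the legacy Graphics`Graphics3D` package is available. The triples are placeholders chosen for illustration.

```mathematica
(* Placeholder {x, y, z} triples, not taken from the text *)
data = {{1, 2, 3}, {2, 1, 4}, {3, 3, 1}, {4, 2, 2}};

(* Load the legacy package, then plot the triples as points in space *)
<< Graphics`Graphics3D`
ScatterPlot3D[data]
```

In later versions of Mathematica, the built-in function ListPointPlot3D[data] serves the same purpose without loading a package.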
Example 4: Suppose that three measurements are made on each member of a group of individuals, and are given in the following table:
Let us form a three-dimensional scatter plot of this information.
To do so we must input the data into a list:
We must next load the appropriate package:
In:= << Graphics`Graphics3D`
Finally, we obtain our scatter plot with the command
Unfortunately, with this command we do not obtain a good sense of the location of each point in space. The following command allows us to put small cuboids in place of each point, and thus enables us to gain a better perspective on how the points lie in relation to each other:
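One way to draw such cuboids, offered as a sketch and not necessarily the command originally shown, is to build a small Cuboid around each point and display the result with Graphics3D. The placeholder triples and the half-width h are assumptions.

```mathematica
(* Placeholder {x, y, z} triples, not taken from the text *)
data = {{1, 2, 3}, {2, 1, 4}, {3, 3, 1}, {4, 2, 2}};

(* Half the edge length of each marker cube; an arbitrary choice *)
h = 0.1;

(* Cuboid[pmin, pmax] spans opposite corners, so each cube is centered on its point *)
Show[Graphics3D[Map[Cuboid[# - h, # + h] &, data], Axes -> True]]
```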
Note that a three-dimensional scatter plot does not graph well if the values of the different variables vary greatly in magnitude.
Last modified: Tue May 28 2002