A box plot is an effective way to represent groups of numerical data graphically through their quartiles. Box plots often have lines, commonly known as whiskers, that extend vertically from the boxes; they show variability outside the upper and lower quartiles. Box plots are strictly descriptive: they show variation in a sample without making any assumptions about the underlying statistical distribution. The spacing between the different parts of the box shows the spread and skewness of the data, and the plot also reveals outliers. Box-and-whisker plots are a good visual aid for assessing linear statistical measurements such as midhinge, range, midrange, interquartile range, and trimean; I briefly cover these concepts in the section on Basic Descriptive Statistics.
I’ll demonstrate how to create a box-and-whisker plot overlay on a basic scatter graph. For the sake of simplicity, I use a dataset familiar to many of you: the Iris data included with Arcadia Data’s samples.
- Basic Descriptive Statistics
- Building the basic graph
- Preparing the graph for the box plot overlay
- Creating a custom style for the box plot
- Applying the box plot overlay to the visual
- Customizing box-and-whisker plots
Basic Descriptive Statistics
Consider how descriptive statistics, represented by the box-and-whisker plots, quickly generate an understanding of data dispersion.
- The quartiles (Q1, Q2, and Q3) divide our rank-ordered data set into four equal parts. The relative heights of the upper part (Q2 to Q3) and lower part (Q1 to Q2) of the box indicate how the data is dispersed around the median.
- The interquartile range (IQR) is represented by the height of the box. IQR = Q3-Q1.
- Median is equivalent to Q2.
- The midhinge is the midpoint between the first and third quartiles, or Midhinge = (Q1 + Q3) / 2.
- Range is the difference between the maximum and minimum measurement.
- Midrange is the midpoint between the minimum and maximum measurement. Midrange = (Max + Min) / 2.
- Trimean (TM) is the weighted average of the distribution’s median and its two quartiles, where TM = (Q1 + 2Q2 + Q3) / 4. This is equivalent to the average of the median (Q2) and the midhinge.
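To make these definitions concrete, here is a minimal Python sketch that computes each statistic for a small, made-up sample. The function name describe and the sample values are assumptions for illustration only; they are not part of Arcadia Data or the Iris dataset. The quartiles come from the standard library's statistics.quantiles with the inclusive method, which is one of several common quartile conventions.

```python
from statistics import quantiles


def describe(values):
    """Return the descriptive statistics listed above for a list of numbers."""
    data = sorted(values)
    # method="inclusive" keeps the quartiles within the observed min and max.
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    lo, hi = data[0], data[-1]
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "iqr": q3 - q1,                      # height of the box
        "midhinge": (q1 + q3) / 2,           # midpoint between Q1 and Q3
        "range": hi - lo,
        "midrange": (lo + hi) / 2,           # midpoint between min and max
        "trimean": (q1 + 2 * q2 + q3) / 4,   # average of median and midhinge
    }


# Hypothetical sample with one large value; not taken from the Iris data.
stats = describe([1, 2, 3, 4, 5, 6, 7, 8, 20])
for name, value in stats.items():
    print(f"{name}: {value}")
```

For this sample, the single large value pulls the midrange up to 10.5, while the midhinge and the trimean both stay at 5. That contrast is what the box and its whiskers let you see at a glance.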
Building the basic graph
Let’s start by building the basic scatter graph that shows the distribution of dimensions across the three species of Iris used in the dataset: setosa, versicolor, and virginica.
- On the top menu, click Data.
- In the samples connection, find the dataset Iris, and click New Visual.
- In the new visual, select the scatter visual type.
- From the Fields menu, under Dimensions, select species and place it on the X shelf. Similarly, select one of the Measures, such as sepal_length, from the Fields menu; place it on the Y shelf.
- Click Refresh Visual to see the basic graph, which shows the average values of the sepal_length for the three species of Iris in our dataset.
- Note that Arcadia Data aggregates Y shelf values automatically; this is why you see only three marks on the graph, one for each species.
- Because we wish to examine the distribution of individual values, I am removing the avg() aggregation function from the sepal_length field on the Y shelf. Click the down arrow on avg(sepal_length), select Aggregates from the menu, and then select Remove Aggregate.
- Click Refresh Visual to see the graph with individual measurement values for each of the three iris species.
Note that the distribution of points is not regular. This means that the next steps we take will generate interesting results.
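The difference between the aggregated and unaggregated graph can be sketched outside the tool. The Python below is only an illustration of the idea, not how Arcadia Data computes its marks, and the rows are made-up values rather than the actual Iris measurements.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (species, sepal_length) rows, invented for illustration.
rows = [
    ("setosa", 5.1), ("setosa", 4.9), ("setosa", 5.0),
    ("versicolor", 7.0), ("versicolor", 6.4),
    ("virginica", 6.3), ("virginica", 5.8), ("virginica", 7.1),
]

# Group the measurements by species.
by_species = defaultdict(list)
for species, sepal_length in rows:
    by_species[species].append(sepal_length)

# With avg(sepal_length): one mark per species.
aggregated = {species: mean(values) for species, values in by_species.items()}

# With the aggregate removed: one mark per row.
raw_marks = list(rows)

print(len(aggregated), "aggregated marks")
print(len(raw_marks), "individual marks")
```

Only the unaggregated marks carry the spread within each species, which is what a box plot summarizes.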
Preparing the graph for the box plot overlay
Next, I will adjust the size and color intensity of the graph marks, to make it easier to see the box plot overlay later. These adjustments are handled by the visual’s Settings interface.
- Click Settings.
- In the Settings interface, click the Color tab, and reduce the color opacity.
- Click the Marks tab, and decrease Mark size range to 1-5. Click Apply.
- After you click Refresh Visual, you can see that the data points are much smaller, and very lightly colored. The darker points represent multiple data points with the same value.
- Let’s save the visual for now. I called mine Distribution of Measurements for Iris Data.
Creating a custom style for the box plot
Before enabling the box-and-whisker plot effect, I must first implement it as a Custom Style. I am using the code that you can download in the BoxWhiskerPlot file.
- Copy the code from the BoxWhiskerPlot file.
- Under the Gear menu, select Custom Styles.
- In the Manage Custom Styles interface, click New Custom Styles to start the new style.
Applying the box plot overlay to the visual
I can now overlay this basic graph with the box-and-whisker plot effect that I saved earlier as the BoxWhiskerPlot custom style. All that’s left now is to apply the style to the visual I created earlier.
- Go back to your saved visual, and open it in edit mode.
- In the visual Distribution of Measurements for Iris Data, click on the Settings menu.
- In the Settings menu, click the Custom Styling tab, and then click the Add Style button.
- When the Pick Custom CSS interface appears, select the style we saved earlier, BoxWhiskerPlot, and click Add.
- Note that the Settings Interface shows that BoxWhiskerPlot is included. Click Apply.
- Once again, click Refresh Visual, and observe that the box-and-whisker plot appears on top of the data we saw earlier.
- Save the visual.
Customizing box-and-whisker plots
I now have a box-and-whisker plot overlay on top of my basic data.
By default, the whiskers are set at 1.5x IQR, but I want to show you how to further configure the box plot overlay to change the location of whisker end-points, their visibility, and the color and shape of the box plot itself. We can even choose not to display the data points.
- Open the visual Settings interface, and choose the Custom Styling tab.
- In the list of Styles, hover over the gear symbol for BoxWhiskerPlot, and note the possible configuration settings: three options for whisker display, outline color, box fill colors for Q2-Q1 and for Q3-Q2, line width, box width, and box opacity.
- To change the settings, click the gear.
- The Custom JS Settings window allows you to change all the display options. As you can see, the whisker display default is at 1.5x IQR.
- Let’s change the display options for the box plot.
- To set it at Min/Max whiskers, deselect the option Display 1.5x IQR whiskers, and select Display Min/Max whiskers.
- In Box Fill Color Top, change the setting to #ff9.
- In Box Fill Color Bottom, change the setting to #cf9.
- In Box Width, change the setting to 15.
- In Box Opacity, change the setting to 80.
- Click Apply in the window Custom JS Settings for BoxWhiskerPlot.
- Click Apply again, in the Settings window.
- Click Refresh Visual.
- Notice that the whiskers now coincide with the minimum and maximum data points of each data series. The color changes are there, too.
- To see a simple box plot, without the whiskers, I deselect all the options for whisker display.
- Finally, if I choose not to show the data points on the graph, and only look at the box plot, I can open the visual Settings interface, choose the Color tab, and change the Color Opacity setting to 0.
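To illustrate the difference between the two whisker options, here is a minimal Python sketch that computes whisker end-points both ways. It assumes the common Tukey convention, where 1.5x IQR whiskers stop at the most extreme data point that still lies within 1.5 times the IQR of the box; the source does not state exactly how the BoxWhiskerPlot style places its whiskers, so treat this as an assumption. The function name whisker_ends, its parameters, and the sample values are hypothetical.

```python
from statistics import quantiles


def whisker_ends(values, mode="iqr", k=1.5):
    """Return (lower, upper) whisker end-points.

    mode="minmax": whiskers reach the minimum and maximum values.
    mode="iqr":    whiskers reach the most extreme values that lie
                   within k * IQR of the box (Tukey convention, assumed).
    """
    data = sorted(values)
    if mode == "minmax":
        return data[0], data[-1]
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    lower = min(x for x in data if x >= low_fence)
    upper = max(x for x in data if x <= high_fence)
    return lower, upper


# Hypothetical sample with one large value; not taken from the Iris data.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 20]
iqr_whiskers = whisker_ends(sample, mode="iqr")
minmax_whiskers = whisker_ends(sample, mode="minmax")
outliers = [x for x in sample if x < iqr_whiskers[0] or x > iqr_whiskers[1]]

print("1.5x IQR whiskers:", iqr_whiskers)
print("Min/Max whiskers:", minmax_whiskers)
print("Points beyond the 1.5x IQR whiskers:", outliers)
```

For this sample, the 1.5x IQR whiskers end at 1 and 8 and leave 20 outside as an outlier, while the Min/Max whiskers stretch from 1 to 20 and absorb it.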