A mosaic plot is a kind of stacked bar chart that shows the percentage of the data falling into each group. It is a graphical depiction of a contingency table.
Mosaic plots are used to demonstrate connections and to compare groupings visually.
A mosaic plot (also referred to as a Marimekko diagram) is a graphical method for visualizing data from two or more qualitative variables. It is a multidimensional version of the spineplot, which graphically represents the same information for only one variable. It provides a summary of the data and makes it possible to identify relationships between different variables. For example, independence is indicated when the boxes are split in the same proportions across all categories, so that they line up in a grid. Hartigan and Kleiner introduced mosaic plots in 1981, and Friendly expanded on them in 1994. Because of their similarity to Marimekko charts, mosaic plots are also known as Mekko charts. As with bar charts and spineplots, the area of each tile, also called the bin size, is proportional to the number of observations in that category.
A classic example of a mosaic plot uses data on the people aboard the Titanic. The data set has 2201 observations and three variables: Gender (male or female), Class (1st, 2nd, 3rd, or crew), and Survived (yes or no).
The observations were gathered into the following table:
The categorical variables are first placed in a chosen order, and each variable is then assigned to an axis. The order and axis assignment used for this data set are shown in the accompanying table. A different ordering produces a different mosaic plot; as in all multivariate plots, the order of the variables matters.
We begin by assigning the first variable, “Gender,” to the left edge, which divides the data vertically into two blocks: the bottom (much smaller) block corresponds to females, while the top (much larger) block corresponds to males. One can readily see that around one-quarter of the people on board were female and the remaining three-quarters were male.
The second variable, “Class,” is then assigned to the top edge. As a result, the four vertical columns represent the four values of that variable (1st, 2nd, 3rd, and crew). The columns vary in width because the width of each column shows the relative share of the corresponding value in the population. The crew is clearly the largest male group, whereas third-class passengers form the largest female group. The number of female crew members can also be seen to be very small.
Finally, the third variable (“Survived”) is applied, this time again along the left side, with the outcome highlighted by shading: dark grey rectangles represent those who did not survive the disaster, whereas light grey rectangles represent those who did. It is immediately visible that women in the first class had the best chance of surviving. Marginalizing over all classes, women had a higher survival probability than men. Similarly, marginalizing over gender shows that first-class passengers were the most likely to survive. In all, approximately one-third of the people on board survived (the proportion of light grey area).
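The construction described above can be sketched in a few lines of Python for the first two variables. This is only a minimal illustration, not the method of any particular plotting library: the function name `mosaic_tiles` is hypothetical, and the counts are the widely published Titanic figures by gender and class (they sum to 2201), supplied here as an assumption rather than copied from the article's table.

```python
# Widely published Titanic counts by gender and class (assumed here for
# illustration; they sum to 2201). "Female" comes first so that it forms
# the bottom band, as in the description above.
counts = {
    "Female": {"1st": 145, "2nd": 106, "3rd": 196, "Crew": 23},
    "Male": {"1st": 180, "2nd": 179, "3rd": 510, "Crew": 862},
}


def mosaic_tiles(table):
    """Return {(row, col): (x, y, width, height)} inside the unit square.

    The first variable splits the square into horizontal bands whose
    heights are its marginal shares. The second variable splits each band
    into tiles whose widths are the conditional shares within that band.
    """
    total = sum(sum(row.values()) for row in table.values())
    tiles = {}
    y = 0.0
    for row_label, row in table.items():
        row_total = sum(row.values())
        height = row_total / total          # marginal share of this band
        x = 0.0
        for col_label, n in row.items():
            width = n / row_total           # conditional share in the band
            tiles[(row_label, col_label)] = (x, y, width, height)
            x += width
        y += height
    return tiles


tiles = mosaic_tiles(counts)
total = sum(sum(row.values()) for row in counts.values())
for (gender, cls), (x, y, w, h) in tiles.items():
    share = counts[gender][cls] / total
    print(f"{gender:6s} {cls:4s} area={w * h:.3f} share={share:.3f}")
```

Each tile's area (width times height) equals that cell's share of all observations, which is what lets the eye read counts directly from the plot. A third variable such as “Survived” would split every tile once more in the same way.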
Mosaic plots are useful when:
A mosaic plot generally makes it clear whether two variables are independent. When they are independent, all proportions are the same, so the boxes line up in a grid. This approach is demonstrated using the UCBAdmissions dataset included with R. The following is a graph of student admissions by gender:
There appears to be a gender bias. However, there is a hidden variable: the department to which each student applied. What happens when we stratify by department?
Most departments appear to be gender neutral, and where there is a skew, it favors women. Notably, there are very few female applicants in departments A and B (the columns are narrow). It is also comparatively easy to get into those departments: the proportion of applicants who are rejected is smaller than in other departments, particularly F. One possible explanation is that more men are admitted overall because they apply to the departments that are hungry for students, perhaps the fastest-growing ones.
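The effect of stratifying can also be checked numerically. The sketch below compares the overall admission rate by gender with the rate inside each department. The counts are the (admitted, rejected) figures as recalled from R's UCBAdmissions dataset, so they are an assumption that should be verified against the dataset itself; the helper names `rate` and `overall_rate` are hypothetical.

```python
# (admitted, rejected) per department and gender, as recalled from R's
# UCBAdmissions dataset (an assumption; verify against the dataset).
ucb = {
    "A": {"Male": (512, 313), "Female": (89, 19)},
    "B": {"Male": (353, 207), "Female": (17, 8)},
    "C": {"Male": (120, 205), "Female": (202, 391)},
    "D": {"Male": (138, 279), "Female": (131, 244)},
    "E": {"Male": (53, 138), "Female": (94, 299)},
    "F": {"Male": (22, 351), "Female": (24, 317)},
}


def rate(admitted, rejected):
    """Share of applicants who were admitted."""
    return admitted / (admitted + rejected)


def overall_rate(data, gender):
    """Admission rate for one gender, pooled over all departments."""
    admitted = sum(dept[gender][0] for dept in data.values())
    rejected = sum(dept[gender][1] for dept in data.values())
    return rate(admitted, rejected)


print(f"overall  male {overall_rate(ucb, 'Male'):.1%}  "
      f"female {overall_rate(ucb, 'Female'):.1%}")

# Stratified view: one line per department.
for dept, by_gender in ucb.items():
    male = rate(*by_gender["Male"])
    female = rate(*by_gender["Female"])
    applicants = sum(by_gender["Female"])
    print(f"dept {dept}   male {male:.1%}  female {female:.1%}  "
          f"(female applicants: {applicants})")

# Departments in which women were admitted at a higher rate than men.
women_ahead = [d for d, g in ucb.items()
               if rate(*g["Female"]) > rate(*g["Male"])]
```

Pooled over departments, men are admitted at a clearly higher rate, yet within most departments the rates are close or favor women. The pooled difference comes largely from which departments each group applied to.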
Mosaic plots present the data exactly as it is, with no attempt to generalize to the entire population. To draw conclusions about the population, we need measures of statistical significance. We can define Pearson residuals, which are motivated by the chi-square test, to quantify each cell’s departure from independence. Because the residuals are measured in standard deviations, a residual greater than 2 or less than -2 indicates a significant departure from independence at roughly the 95 percent level.
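As a minimal sketch of that calculation: the Pearson residual of a cell is (observed - expected) / sqrt(expected), where the expected count under independence is the row total times the column total divided by the grand total. The function name `pearson_residuals` is hypothetical, and the example table holds admissions by gender pooled over departments, as recalled from R's UCBAdmissions dataset (an assumption to verify against the dataset).

```python
import math


def pearson_residuals(table):
    """Pearson residuals for a two-way table given as a list of rows.

    Each residual is (observed - expected) / sqrt(expected), where
    expected = row_total * column_total / grand_total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            res_row.append((observed - expected) / math.sqrt(expected))
        residuals.append(res_row)
    return residuals


# Rows: male, female. Columns: admitted, rejected.
# Pooled counts as recalled from R's UCBAdmissions (an assumption).
observed = [[1198, 1493],
            [557, 1278]]

res = pearson_residuals(observed)
for label, row in zip(["male", "female"], res):
    print(label, [round(r, 2) for r in row])

# The chi-square statistic is the sum of the squared Pearson residuals.
chi2 = sum(r * r for row in res for r in row)

# Cells beyond +/-2 are the ones residual shading would color.
flagged = [(i, j) for i, row in enumerate(res)
           for j, r in enumerate(row) if abs(r) > 2]
```

In a table whose rows are exactly proportional to each other, every expected count equals the observed count, so every residual is zero and nothing is shaded.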
Here is a mosaic plot of hair color versus eye color in a group of statistics students, with residual shading.
The shading can be read as follows. If we are confident that a cell is taller than the other cells in the same row, it is colored blue. If we are confident that a cell is shorter than the other cells in the same row, it is colored red. If a cell is plainly short but is not colored red, there is not enough data to conclude that the cell would remain short if we drew another sample. A blue cell is frequently accompanied by a red cell in the same row, although this is not always the case; see, for example, the bottom row of the figure (green eyes). It’s worth noting that the shading says nothing about the relative heights of the boxes in the same column.
Arguably, shading is unnecessary in a table with a lot of data, because all differences are significant and can be seen from the box heights. However, when boxes are not lined up, as in the “hazel eyes” row, it can be difficult to compare heights, and the shading helps. In addition, the coloring draws your attention to where the important relationships are.
One drawback is that boxes are hard to compare when they do not share a baseline. For example, in the graph below, the two highlighted rectangles are not aligned on a common baseline, so comparing their heights is more difficult than it would be if they were.