Homework #6: Stream Graphs
Everyone must work with a partner for Homework 6.
It must be a person you have NOT already worked with for an earlier homework in this class.
Preferably, it is also someone you have NOT worked with on an in-class worksheet.
The deadline for forming your team is the end of lecture on
Tuesday February 13th!
(Yes, over a week before the homework deadline!) You must form
your team on Submitty by this deadline. If you don't yet
have a partner, please add your name to the "Users Seeking
Team/Partner" list for the Homework 6 Gradeable. If necessary (if
we have an odd number of students in the course), the instructor will approve a
single team to have 3 members.
-
First, decide on an interesting, personal/individual,
multi-category, information-dense, time-based dataset for which
the two members of your team will likely have some subtle or
significant differences. For example:
Minutes spent in a typical week on sleep/class/homework/eat/sports/tv/etc.
Lines of code written during your time at RPI in different programming languages (python,c++,java,etc)
Money spent over a typical month/year on tuition/apartment/food/travel/clothing/movies
or something else?
Note on dataset choice: If this assignment fits well with a
dataset from an earlier assignment you are welcome to re-use that
dataset -- but you are expected to spend a non-trivial amount of time
either expanding or improving the dataset.
-
Next you will prepare the datasets. Your team should agree on
the time range, sampling frequency, number and definition
of the specific categories, etc. The data should be relatively
high-resolution on the time-axis relative to the interval
chosen. For example, if you are visualizing money spent over
the last year, you should have daily data (not just monthly
totals). You should also have an interesting number of
categories with sufficient data (e.g., probably at least 4
categories with a moderate number of samples). Then, each team
member will collect or prepare their own personal version of the
data.
If the data already exists in an easily obtainable and sufficient
detail and quantity, write scripts to scrape, parse, and organize
that data (e.g., if it is all in GitHub). If it is not possible or
practical to gather this data, then write a script to generate
synthetic data that has the realistic patterns you would expect to
find (e.g., if you had been wearing a GPS location & sleep
tracking watch 24/7 for the last 10 years).
If you have real-world data, but you have missing, incomplete, or
uncertain information, fill in those details with appropriate
synthetically-generated data. Be sure to add some random noise to
synthetic data so it's "interesting".
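As one possible starting point for a synthetic-data script, here is a minimal Python sketch that generates daily minutes per category with a weekday/weekend pattern plus random noise. The category names, the typical minute values, the noise level, and the start date are all made-up assumptions for illustration; replace them with patterns that match your own life.

```python
import random
from datetime import date, timedelta

# Hypothetical categories with typical minutes per day as (weekday, weekend).
# These numbers are illustrative assumptions, not real measurements.
BASELINE = {
    "sleep": (420, 540),
    "class": (240, 0),
    "homework": (180, 120),
}
MINUTES_PER_DAY = 24 * 60


def generate_days(start, num_days, noise_sd=30.0, seed=0):
    """Return one dict per day with noisy minutes for each category."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    rows = []
    for offset in range(num_days):
        day = start + timedelta(days=offset)
        is_weekend = day.weekday() >= 5
        row = {"date": day.isoformat()}
        used = 0
        for name, (weekday_min, weekend_min) in BASELINE.items():
            typical = weekend_min if is_weekend else weekday_min
            # Add Gaussian noise only to activities that happen that day.
            value = round(rng.gauss(typical, noise_sd)) if typical > 0 else 0
            # Clamp so a category is never negative and the running
            # total never exceeds the length of the day.
            value = max(0, min(value, MINUTES_PER_DAY - used))
            row[name] = value
            used += value
        # The remainder goes to "other" so each day sums to 24 hours.
        row["other"] = MINUTES_PER_DAY - used
        rows.append(row)
    return rows


# Four weeks of data starting on an arbitrary Monday.
rows = generate_days(date(2024, 1, 1), 28)
```

Because the generator is seeded, each teammate can pick a different seed and different baseline values and still reproduce exactly the same dataset later for the report.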
-
Now let's visualize these datasets! Generate 2 different visualizations of each dataset:
-
First, plot the two datasets using a "boring" stacked bar
graph over time (2 separate plots, 1 for each teammate).
These charts can be prepared with Google Sheets, Microsoft
Excel, LibreOffice Calc, or similar. Be sure to use the
same design/legend/scale/colors/ordering so that the
differences between the datasets can be easily compared. Be
sure to make good color choices (e.g., use ColorBrewer
and/or make appropriate linguistic choices).
Think carefully about the time discretization for the
stacked bar chart. Your dataset is highly dense on the time
axis, but you cannot easily show all of that data legibly on the
graph. You will need to group into buckets/windows by time.
Try different values for discretization, and save
examples where the time resolution is too dense, too sparse,
and just right. Also think about the vertical axis. Does
the information sum to a natural maximum (24 hours in the
day) or is it unbounded (lines of code or money spent)?
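To experiment with time discretization before charting, one option is to re-bucket the dense data with a short script and then paste the result into your spreadsheet. The sketch below is only an illustration: the function name bucket_by_window, the (date, values) input layout, and the sample spending numbers are all assumptions, not a required format.

```python
from collections import defaultdict
from datetime import date


def bucket_by_window(rows, window_days):
    """Sum each category over consecutive windows of `window_days` days.

    `rows` is a list of (date, {category: value}) pairs. Windows are
    counted from the earliest date; windows with no data are skipped.
    """
    if not rows:
        return []
    start = min(day for day, _ in rows)
    buckets = defaultdict(lambda: defaultdict(float))
    for day, values in rows:
        index = (day - start).days // window_days
        for category, amount in values.items():
            buckets[index][category] += amount
    return [dict(buckets[i]) for i in sorted(buckets)]


# Made-up daily spending in dollars, for illustration only.
daily = [
    (date(2024, 1, 1), {"food": 12.0, "travel": 0.0}),
    (date(2024, 1, 2), {"food": 8.0, "travel": 30.0}),
    (date(2024, 1, 3), {"food": 10.0, "travel": 0.0}),
    (date(2024, 1, 4), {"food": 15.0, "travel": 5.0}),
    (date(2024, 1, 5), {"food": 9.0, "travel": 0.0}),
    (date(2024, 1, 6), {"food": 11.0, "travel": 20.0}),
]

three_day = bucket_by_window(daily, 3)
# three_day == [{"food": 30.0, "travel": 30.0}, {"food": 35.0, "travel": 25.0}]
```

Calling the function with several window sizes (for example 1, 7, and 30 days on a year of data) is a quick way to produce the too dense, too sparse, and just right versions.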
-
Then, create a streamgraph version of this data. Again,
make 2 separate plots, 1 for each teammate. You should
attempt to match the colors and be consistent with the
labeling between the stacked bar chart and the streamgraph
so the strengths and weaknesses of these two methods in
analyzing data and making conclusions are more easily
comparable.
Streamgraphs can be created with D3. Here is an example with code:
https://d3-graph-gallery.com/streamgraph.
You may search for additional examples or references or you may use
another toolkit to create your streamgraph.
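If you want to understand what a streamgraph toolkit computes, or to check its output, the following Python sketch shows one common layout: a stacked chart whose baseline is shifted so the whole stack is centered on zero at every time step. This is the symmetric baseline, comparable to d3.stackOffsetSilhouette in D3 or baseline="sym" in matplotlib's stackplot. Other streamgraph baselines (such as the "wiggle" variants) differ. The function name and the sample values are assumptions for illustration.

```python
def symmetric_stream_layers(series):
    """Compute (lower, upper) boundaries for each streamgraph layer.

    `series` maps a category name to a list of non-negative values, one
    per time step, all of the same length. At each time step the stack
    is centered on zero, so the bottom edge sits at -total / 2.
    """
    names = list(series)
    steps = len(series[names[0]])
    layers = {name: [] for name in names}
    for t in range(steps):
        total = sum(series[name][t] for name in names)
        lower = -total / 2.0
        for name in names:
            upper = lower + series[name][t]
            layers[name].append((lower, upper))
            lower = upper  # the next layer sits on top of this one
    return layers


# Made-up lines-of-code counts for three time steps.
layers = symmetric_stream_layers({"python": [2, 4, 6], "java": [2, 0, 2]})
# layers["python"] == [(-2.0, 0.0), (-2.0, 2.0), (-4.0, 2.0)]
# layers["java"]   == [(0.0, 2.0), (2.0, 2.0), (2.0, 4.0)]
```

Note that the order of the categories changes which layers sit near the center and which sit at the wavy edges, which is one reason to try several orderings.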
-
Iterate as necessary to revise the colors, ordering, and
time discretization to maximize legibility and usefulness of
all 4 plots.
How to Submit:
-
Gather your plots in a .pdf report. Be sure to write detailed
and informative figure captions that will fully explain your
plots to a reader who hasn't seen streamgraphs before.
Describe the data collection process and/or method of generating
data.
It is essential that you are very clear about the
use of synthetic data for this visualization exercise.
Include a small human-readable sample of the data format that
shows the frequency and detail available in the datasets.
-
Analyze the effectiveness of the bar graph vs. streamgraph in
showing off the differences and similarities in the two
datasets. What might be confusing, unclear, or misleading about
these plots?
How well do these plots allow the viewer to make accurate
conclusions about the data? Point out interesting specific
observations that can be seen when comparing and contrasting the
datasets for the two team members. E.g., do both charts equally
well show how teammate A took AP Computer Science in high
school, while teammate B started coding with Computer Science I
in Python, but both teammates used Java when they took Principles
of Software?
If interaction with the streamgraph is a key part of the
enhanced analysis, include multiple screenshots in your report
to illustrate those features.
-
Submit your code for collecting & processing real-world data
and/or code for generating synthetic data. Also include your
code for the interactive streamgraph visualization (we may try
to run it as part of the manual grading process). Be sure to
acknowledge and cite any code samples or other references that
helped you produce your visualization.
-
Make one post for the team sharing your plots and a short &
concise caption / brief analysis about your conclusions.