
Homework #6: Stream Graphs

Everyone must work with a partner for Homework 6. It must be a person you have NOT already worked with for an earlier homework in this class. Preferably, it is also someone you have NOT worked with on an in-class worksheet.

The deadline for forming your team is the end of lecture on Tuesday February 13th! (Yes, over a week before the homework deadline!) You must form your team on Submitty by this deadline. If you don't yet have a partner, please add your name to the "Users Seeking Team/Partner" list for the Homework 6 Gradeable. If necessary (if we have an odd number of students in the course), the instructor will approve a single team to have 3 members.

  • First, decide on an interesting, personal/individual, multi-category, information-dense, time-based dataset for which the two members of your team will likely have some subtle or significant differences. For example:

    • Minutes spent in a typical week on sleep/class/homework/eat/sports/tv/etc.

    • Lines of code written during your time at RPI in different programming languages (Python, C++, Java, etc.)

    • Money spent over a typical month/year on tuition/apartment/food/travel/clothing/movies

    • or something else?

    Note on dataset choice: If this assignment fits well with a dataset from an earlier assignment you are welcome to re-use that dataset -- but you are expected to spend a non-trivial amount of time either expanding or improving the dataset.

  • Next you will prepare the datasets. Your team should agree on the time range, sampling frequency, number and definition of the specific categories, etc. The data should be relatively high-resolution on the time-axis relative to the interval chosen. For example, if you are visualizing money spent over the last year, you should have daily data (not just monthly totals). You should also have an interesting number of categories with sufficient data (e.g., probably at least 4 categories with a moderate number of samples). Then, each team member will collect or prepare their own personal version of the data.

    If the data already exists in an easily obtainable form and in sufficient detail and quantity, write scripts to scrape, parse, and organize that data (e.g., if it is all in GitHub). If it is not possible or practical to gather this data, then write a script to generate synthetic data that has the realistic patterns you would expect to find (e.g., if you had been wearing a GPS location & sleep tracking watch 24/7 for the last 10 years).

    If you have real-world data, but you have missing, incomplete, or uncertain information, fill in those details with appropriate synthetically-generated data. Be sure to add some random noise to synthetic data so it's "interesting".
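As a rough illustration of the synthetic-data option above, here is a minimal Python sketch that generates one row per day with a weekday/weekend pattern plus random noise. The category names, baseline minutes, weekend adjustments, noise level, and start date are all made-up assumptions for illustration; substitute the categories and patterns your team agreed on.

```python
import csv
import io
import random
from datetime import date, timedelta

# Hypothetical categories and average daily minutes; replace with your team's own.
BASE_MINUTES = {"sleep": 450, "class": 180, "homework": 150, "sports": 45}


def generate_daily_minutes(start, days, seed=0):
    """Return one row per day: {"date": ..., category: minutes, ...}."""
    rng = random.Random(seed)  # fixed seed keeps the synthetic data reproducible
    rows = []
    for offset in range(days):
        day = start + timedelta(days=offset)
        weekend = day.weekday() >= 5  # Saturday is 5, Sunday is 6
        row = {"date": day.isoformat()}
        for category, base in BASE_MINUTES.items():
            expected = base
            if weekend and category == "class":
                expected = 0          # assumed pattern: no class on weekends
            elif weekend and category == "sleep":
                expected = base + 60  # assumed pattern: sleeping in on weekends
            # Gaussian noise scaled to the baseline; none when nothing is expected.
            noise = rng.gauss(0, 0.15 * base) if expected > 0 else 0
            row[category] = max(0, round(expected + noise))
        rows.append(row)
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = generate_daily_minutes(date(2024, 1, 1), days=14, seed=42)
print(to_csv(rows))
```

Each teammate could run the same script with different baselines and a different seed, so the two datasets share a format but differ in content. Remember to state clearly in your report which values are synthetic.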

  • Now let's visualize these datasets! Generate 2 different visualizations of each dataset:

    • First, plot the two datasets using a "boring" stacked bar graph over time (2 separate plots, 1 for each teammate). These charts can be prepared with Google Sheets, Microsoft Excel, LibreOffice Calc, or similar. Be sure to use the same design/legend/scale/colors/ordering so that the differences between the datasets can be easily compared. Be sure to make good color choices (e.g., use ColorBrewer and/or make appropriate linguistic choices).

      Think carefully about the time discretization for the stacked bar chart. Your dataset is highly dense on the time axis, but you cannot easily show all of that data legibly on the graph. You will need to group into buckets/windows by time. Try different values for discretization, and save examples where the time resolution is too dense, too sparse, and just right. Also think about the vertical axis. Does the information sum to a natural maximum (24 hours in the day) or is it unbounded (lines of code or money spent)?

    • Then, create a streamgraph version of this data. Again, make 2 separate plots, 1 for each teammate. You should attempt to match the colors and be consistent with the labeling between the stacked bar chart and the streamgraph so the strengths and weaknesses of these two methods in analyzing data and making conclusions are more easily comparable.

      Streamgraphs can be created with D3. Here is an example with code: https://d3-graph-gallery.com/streamgraph. You may search for additional examples or references or you may use another toolkit to create your streamgraph.

    • Iterate as necessary to revise the colors, ordering, and time discretization to maximize legibility and usefulness of all 4 plots.
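To make the time discretization and the stacked-bar vs. streamgraph difference concrete, here is a small Python sketch under simplifying assumptions. The helper names `bucket_sums` and `centered_layers` and the tiny dataset are hypothetical. The first helper sums daily rows into fixed-size time buckets; the second stacks the categories around a centered baseline, which is one simple streamgraph layout (a stacked bar chart would instead start every stack at 0). Toolkits such as D3 offer this and other baseline choices, so treat this only as a way to see what the layout computes, not as the required method.

```python
def bucket_sums(daily_rows, categories, bucket_size):
    """Sum consecutive runs of `bucket_size` daily rows into one row per bucket."""
    buckets = []
    for start in range(0, len(daily_rows), bucket_size):
        chunk = daily_rows[start:start + bucket_size]
        buckets.append({c: sum(row[c] for row in chunk) for c in categories})
    return buckets


def centered_layers(buckets, categories):
    """For each bucket, return [(category, lower, upper), ...] centered on y = 0."""
    layout = []
    for bucket in buckets:
        total = sum(bucket[c] for c in categories)
        lower = -total / 2  # symmetric baseline: half the total below zero
        layers = []
        for c in categories:
            upper = lower + bucket[c]
            layers.append((c, lower, upper))
            lower = upper  # the next category stacks on top of this one
        layout.append(layers)
    return layout


# Tiny hand-made dataset (hypothetical values): 4 days, 2 categories.
categories = ["python", "java"]
daily = [
    {"python": 10, "java": 0},
    {"python": 30, "java": 20},
    {"python": 0, "java": 40},
    {"python": 20, "java": 40},
]

bucketed = bucket_sums(daily, categories, bucket_size=2)
layout = centered_layers(bucketed, categories)
print(bucketed)
print(layout[0])
```

Re-running `bucket_sums` with several values of `bucket_size` (for example 1, 7, and 30 for daily data) is a quick way to produce the "too dense", "just right", and "too sparse" variants before plotting.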

How to Submit:

  1. Gather your plots in a .pdf report. Be sure to write detailed and informative figure captions that will fully explain your plots to a reader who hasn't seen streamgraphs before.

    Describe the data collection process and/or method of generating data. It is essential that you are very clear about the use of synthetic data for this visualization exercise. Include a small human-readable sample of the data format that shows the frequency and detail available in the datasets.

  2. Analyze the effectiveness of the bar graph vs. streamgraph in showing off the differences and similarities in the two datasets. What might be confusing, unclear, or misleading about these plots?

    How well do these plots allow the viewer to make accurate conclusions about the data? Point out interesting specific observations that can be seen when comparing and contrasting the datasets for the two team members. E.g., do both charts equally well show how teammate A took AP Computer Science in high school, while teammate B started coding with Computer Science I in Python, but both teammates used Java when they took Principles of Software?

    If interaction with the streamgraph is a key part of the enhanced analysis, include multiple screenshots in your report to illustrate those features.

  3. Submit your code for collecting & processing real world data and/or code for generating synthetic data. Also include your code for the interactive streamgraph visualization (we may try to run it as part of the manual grading process). Be sure to acknowledge and cite any code samples or other references that helped you produce your visualization.

  4. Make one post for the team sharing your plots and a short & concise caption / brief analysis about your conclusions.