Homework 2: Time-Based Datasets & Chart Junk
You are strongly encouraged to do this assignment as a team of two.
You are welcome to work with the same person you met for
the Lecture 2 worksheet, or form a new team. You may use an idea you
brainstormed in class for the worksheet, or pick something entirely
different. Alternatively, you may work alone on this assignment.
Identify an interesting time-based dataset on a topic that is
familiar to at least one member of the team. Be sure to pick a topic
for which sufficient data can be found online without too much effort.
Examples include sports player or team wins and salaries, or movie
ticket sales and actor salaries. The dataset should be time-based,
meaning that an interesting component of the data is time (e.g., year,
day, hour, age, duration), and you can pose interesting questions about
the data that can be answered by plotting data relative to a time axis.
Review the categories of charts from
"Eenie, Meenie, Minie, Moe: Selecting the Right Graph for Your
Message", Stephen Few, 2004 and brainstorm how different
relationships in this data (or subsets of this data) could be plotted
to answer simple questions about the data. Make sure you follow
the paper's recommended best practices for displaying
information. Aim to produce at least 5 charts of different types.
Extra credit if you create one chart of each of the 7 different
types!
Now collect the data. Some online data sources are trivial to parse
-- just a click to download a single simply-structured, error-free
file. Other data sources might require collection from multiple
locations, complicated parsing, filtering, manual cleaning, and
elaborate association or post-processing. Re-evaluate your plan if
the data collection and data preparation process is either too simple
or more time-consuming than you expected. This is a one-week team
assignment, and one goal is to learn and practice something about the
data collection process.
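As a rough illustration of the "filtering, manual cleaning, and association" steps mentioned above, here is a minimal Python sketch that joins two small sources on a shared year column. The data, the column names, and the helper names (read_rows, clean_and_join) are all made up for this example; your own dataset will almost certainly need different cleaning rules.

```python
import csv
import io

# Hypothetical data from two different sources (values are made up).
WINS_CSV = """year,team,wins
2001,Example FC,12
2002,Example FC,n/a
2003,Example FC,15
"""

SALARY_CSV = """Year,Payroll
2003,"$1,500,000"
2001,"$1,200,000"
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def clean_and_join(wins_text, salary_text):
    """Drop unusable rows, normalize values, and join both sources by year."""
    # Clean the payroll values: strip "$" and thousands separators.
    payroll_by_year = {}
    for row in read_rows(salary_text):
        amount = row["Payroll"].replace("$", "").replace(",", "")
        payroll_by_year[int(row["Year"])] = int(amount)

    merged = []
    for row in read_rows(wins_text):
        # Filter: skip rows whose win count is missing or not a number.
        if not row["wins"].strip().isdigit():
            continue
        year = int(row["year"])
        # Associate: keep only the years present in both sources.
        if year not in payroll_by_year:
            continue
        merged.append(
            {"year": year, "wins": int(row["wins"]), "payroll": payroll_by_year[year]}
        )
    # Sort by time so the rows are ready for plotting on a time axis.
    return sorted(merged, key=lambda r: r["year"])


if __name__ == "__main__":
    for record in clean_and_join(WINS_CSV, SALARY_CSV):
        print(record)
```

Whatever rules you apply (which rows you dropped, how you normalized values), write them down, since the submission asks you to document your data processing steps.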
Tools you might use/learn for data collection:
Simple copy-paste from a website to a file.
wget to download files from websites.
UNIX utilities: grep / sort / uniq / sed / awk
Your favorite programming language to parse/strip out unnecessary HTML formatting.
Save as .csv (comma-separated values) files to upload to Microsoft Excel / Google Sheets / LibreOffice Calc.
Python has lots of packages for parsing (e.g., JSON format).
Selenium for automated browsing of websites.
Please share other ideas/tips for data collection on the Discussion Forum!
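To make the "strip out HTML formatting and save as .csv" idea concrete, here is a minimal sketch that uses only Python's standard library (html.parser and csv). The class name TableTextParser, the function html_table_to_csv, and the sample table are hypothetical and exist only for illustration; real pages are usually messier and may need extra handling (nested tables, merged cells, footnote markers).

```python
import csv
import io
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a cell; other markup is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html_text):
    """Return the table cells found in html_text as CSV text."""
    parser = TableTextParser()
    parser.feed(html_text)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Made-up sample input standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Year</th><th>Team</th><th>Wins</th></tr>
  <tr><td>2001</td><td><b>Example FC</b></td><td>12</td></tr>
  <tr><td>2002</td><td>Example FC</td><td> 15 </td></tr>
</table>
"""

if __name__ == "__main__":
    print(html_table_to_csv(SAMPLE_HTML), end="")
```

The resulting text can be saved to a .csv file and opened directly in Excel, Google Sheets, or LibreOffice Calc.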
Finally, use Microsoft Excel or Google Sheets or LibreOffice Calc
(yes, limit yourself to one of these simple visualization tools) to
create your charts. Make sure to carefully label the data, axes,
legend, and title as appropriate. Write a thoughtful, complete, and
well-written caption for each chart that ensures the viewer
understands the purpose of the chart and the conclusion that can be
drawn from the data.
OPTIONAL (for extra credit): Select one of your charts
rendered in Microsoft Excel or Google Sheets or LibreOffice Calc
and redraw it (using any tool) to make it more memorable, inspired
by the examples in
"Useful Junk? The Effects of Visual Embellishment on Comprehension and Memorability of Charts", Bateman et al., CHI 2010.
What to Submit
- Collect your charts into a single .pdf document. The charts plus
  their captions should mostly stand on their own. (You shouldn't
  need to write much additional text about the topic.)
- In the .pdf, formally cite the source(s) of your dataset and
  provide detailed documentation of the steps you took to collect
  and prepare your dataset.
- Include the source code for scripts or programs that you wrote to
  prepare the data. Note: We won't attempt to run these scripts,
  so you don't need to include helper libraries/programs that you did
  not write; just document those libraries/programs in the .pdf.
- In the .pdf, include a brief explanation of "who did what" on the
  team and "what you learned" (hopefully something about web scraping
  and/or data preparation tools). Note: we encourage pair programming
  so that everyone learns all of the tools used to complete the
  assignment.
- Make a single post for the team on the Submitty Discussion Forum
  sharing two of your diagrams.