Assignment #4: Data Collection and Preparation
For this homework, you may work in a team of 2 or individually.
You are encouraged to work with someone you hadn't met before this
course.
This week your primary task is to identify and collect a new and
interesting (to you!) data set that is also interestingly large. You
are expected to use your programming skills to obtain and/or wrangle
this data into a file format you can visualize and analyze.
Some examples of where to start:
-
Take a nontrivial computer program (for example a simulation or a
solver) you have written and add dense logging information. How often
does each function get called? How many times does an inner loop get
called? What is the pattern of data stored in a variable or passed
into a function?
-
Monitor your own computer activity: what keys do you press, where does
your mouse move, what files do you open, and so on?
-
Scrape the GPS data off of your phone to gather your location over
time, or your heart rate from a smart watch.
-
Set up a microphone or video camera and collect a stream of audio
and/or images.
Try to find a dataset that's not simply "download a file". You should
be doing a moderate amount of work (writing code) to either collect or
parse/reorganize/simplify/post-process this data.
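As one illustration of the "dense logging" idea above, here is a minimal
Python sketch that writes one CSV row per function call. The names
make_call_logger, run_demo, and step are hypothetical, and the columns
(function name, argument count, elapsed seconds) are only an assumption
about what you might want to record; adapt them to your own program.

```python
import csv
import functools
import io
import time


def make_call_logger(writer):
    """Return a decorator that writes one CSV row per function call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = func(*args, **kwargs)
            elapsed = time.perf_counter() - start
            # One row per call: name, number of arguments, wall-clock seconds.
            writer.writerow(
                [func.__name__, len(args) + len(kwargs), f"{elapsed:.6f}"]
            )
            return result
        return wrapper
    return decorator


def run_demo():
    """Log five calls of a toy function and return the CSV text."""
    buffer = io.StringIO()  # swap for open("calls.csv", "w", newline="")
    writer = csv.writer(buffer)
    writer.writerow(["function", "num_args", "seconds"])
    log_call = make_call_logger(writer)

    @log_call
    def step(x):  # hypothetical stand-in for a simulation step
        return x * 0.5 + 1.0

    value = 0.0
    for _ in range(5):
        value = step(value)
    return buffer.getvalue()


if __name__ == "__main__":
    print(run_demo())
```

Each call becomes one "row" of data, so a long-running program can
produce thousands of rows quickly.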
NOTE: Grad students working on a thesis or undergraduates working on a
research project are strongly encouraged (required?) to work with a
research-related data source.
Once you've selected a data source...
-
Write down at least 2 specific research questions that can be
answered by analyzing this data. The first should be "obvious" and may
simply communicate the overall quantity of data you've got your hands
on. The second should be more complex or subtle: it can be answered
by the data, but will involve rearranging, simplifying, or finding
correlations within the data.
-
What are your specific hypotheses related to these research
questions? What knowledge are you drawing on to make these
predictions?
-
With your research questions in mind, design the detailed format
for your raw data (the columns of your data "spreadsheet") and
decide on the action or sampling frequency for each "row" of the data.
Make sure you are able to acquire an "interesting" amount of data,
both number of samples (at least 1000 rows?) and dimensions per sample
(at least 3 columns?). Note: These estimates are not requirements.
If your data has many more columns, things can be quite interesting
even with far fewer rows.
-
Create (at least 2) simple visualization plots of this data
using a tool that's new to you (or you would like to learn more
about). Consider using: Excel, LineUp, Tableau, Google Analytics,
Plotly, or VTK. These plots should attempt to answer the research
questions you posed earlier.
You can revise your research questions as needed as you work with the
data.
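Before plotting, it can help to check the shape of what you collected
against the row and column estimates above. The following is a rough
Python sketch, assuming your data is CSV text with a header row; the
function name summarize_csv and the sample columns are made up for
illustration.

```python
import csv
import io


def summarize_csv(text):
    """Count rows, columns, and empty cells per column in CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames or []
    missing = {name: 0 for name in columns}
    num_rows = 0
    for row in reader:
        num_rows += 1
        for name in columns:
            # Short rows yield None; blank cells yield an empty string.
            if row[name] is None or row[name].strip() == "":
                missing[name] += 1
    return {"rows": num_rows, "columns": len(columns), "missing": missing}


if __name__ == "__main__":
    sample = "timestamp,x,y\n0.0,1,2\n0.5,,3\n1.0,4,5\n"
    print(summarize_csv(sample))
```

A column with many empty cells may be a sign that the sampling
frequency or the column design needs another look.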
When you're ready to submit:
-
Prepare a writeup for this assignment with the information requested
above as either a .pdf with inline images or a plaintext README.txt
with well-named image files. Additionally, your writeup should detail
the efforts you made to collect, parse, reorganize, simplify, and/or
post-process this data source.
-
In a code directory, include the source code you wrote to
collect the data. (Don't include 3rd party libraries; the code won't be
compiled or run for grading purposes.)
-
In a data directory, include interesting samples of
the data. Don't attempt to upload the entire dataset (it might be too
big!); instead, include a sample that shows the format and range of values.
Document the overall size of the data (# of rows and/or file size for
context). Depending on any work you had to do to wrangle the data
into an alternate format, include samples of the data at intermediate
and final stages as well.
-
Include a brief review of the tool you used to create the visualizations.
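One possible way to produce the data sample described above is to keep
the header plus evenly spaced rows, so the sample spans the whole
collection period rather than just the first few lines. This is only a
sketch: sample_rows is a hypothetical helper, and it assumes the data
fits in memory as a list of text lines with a header first.

```python
def sample_rows(lines, max_rows):
    """Keep the header plus up to max_rows evenly spaced data rows."""
    if max_rows < 2:
        raise ValueError("max_rows must be at least 2")
    header, body = lines[0], lines[1:]
    if len(body) <= max_rows:
        return [header] + body
    last = len(body) - 1
    # Integer spacing that always includes the first and last data rows.
    indices = [i * last // (max_rows - 1) for i in range(max_rows)]
    return [header] + [body[i] for i in indices]


if __name__ == "__main__":
    lines = ["t,v"] + [f"{i},{i * i}" for i in range(10)]
    sample = sample_rows(lines, 4)
    print(f"{len(lines) - 1} data rows total, {len(sample) - 1} sampled")
    print("\n".join(sample))
```

Printing the total row count alongside the sample gives you the size
figure to document in your writeup.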
Note: Teams of two should clearly label their submission with both
names, and both students should upload the full assignment.