Reading: Section 9.1 of Practical Programming, plus the last part of Section 5.10 on command-line arguments.
We are given a file called imdb_data.txt, extracted from the Internet Movie Database (IMDB), that contains on each line a person’s name, a movie name, and a year. For example:
Kishiro, Yukito | Battle Angel | 2016
Goal:
The challenge in doing this is that many names appear multiple times.
First solution: store names in a list. We’ll start from the following code, posted online as find_names_start.py
imdb_file = raw_input("Enter the name of the IMDB file ==> ").strip()
name_list = []
for line in open(imdb_file):
    words = line.strip().split('|')
    name = words[0].strip()
and complete the code in class.
The challenge is that we need to check that a name is not already in the list before adding it.
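One way the list-based check might look is sketched below. It is only an illustration, not necessarily the version completed in class: the helper name unique_names is hypothetical, and it works on a small in-memory list of lines instead of the real file. Only the first sample line comes from these notes; the other two are made up.

```python
# Sample lines in the "name | movie | year" format described above.
# Only the first line is from the notes; the other two are invented.
sample_lines = [
    "Kishiro, Yukito | Battle Angel | 2016",
    "Doe, Jane | Example Movie | 2001",
    "Kishiro, Yukito | Another Example | 2010",
]

def unique_names(lines):
    """Return the distinct names, in order of first appearance."""
    name_list = []
    for line in lines:
        words = line.strip().split('|')
        name = words[0].strip()
        # 'in' scans the list from the front, so this test gets slower
        # as name_list grows.
        if name not in name_list:
            name_list.append(name)
    return name_list

names = unique_names(sample_lines)
print(len(names))
```

Because every membership test may walk the entire list, the total work grows roughly with the square of the number of distinct names, which is what motivates looking for a better container.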
Initialization comes from a list, a range, or from just set():
>>> s1 = set()
>>> s1
set([])
>>> s2 = set(range(0,11,2))
>>> s2
set([0, 2, 4, 6, 8, 10])
>>> v = [4, 8, 4, 'hello', 32, 64, 'spam', 32, 256]
>>> s3 = set(v)
>>> s3
set([32, 64, 4, 'spam', 8, 256, 'hello'])
The actual methods are
s.add(x) — add an element if it is not already there
s.clear() — clear out the set, making it empty
s1.difference(s2) — create a new set with the values from s1 that are not in s2. Using Python’s operator syntax this is
s1 - s2
s1.intersection(s2) — create a new set that contains only the values that are in both sets. Operator syntax:
s1 & s2
s1.union(s2) — create a new set that contains values that are in either set. Operator syntax:
s1 | s2
s1.issubset(s2) — are all elements of s1 also in s2? Operator syntax:
s1 <= s2
s1.issuperset(s2) — are all elements of s2 also in s1? Operator syntax:
s1 >= s2
s1.symmetric_difference(s2) — create a new set that contains values that are in s1 or s2 but not in both. Operator syntax:
s1 ^ s2
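The methods and operators listed above can be tried on a small pair of sets. The sets evens and small below are chosen only for illustration and are different from the ones used in the exercises that follow.

```python
# Example sets chosen only for illustration.
evens = set([0, 2, 4, 6, 8])
small = set([0, 1, 2, 3])

# Each method has an operator form that gives the same result.
only_evens = evens - small   # in evens but not in small: 4, 6, 8
in_both = evens & small      # in both sets: 0, 2
in_either = evens | small    # in at least one set: 0, 1, 2, 3, 4, 6, 8
in_one = evens ^ small       # in exactly one set: 1, 3, 4, 6, 8

print(only_evens == evens.difference(small))  # True
print(in_both <= evens)                       # True: in_both is a subset of evens

# add() ignores a value that is already present; clear() empties the set.
s = set()
s.add(4)
s.add(4)
size_after_adds = len(s)    # 1
s.clear()
size_after_clear = len(s)   # 0
```

Note that the operators build new sets and leave evens and small unchanged, while add and clear modify the set they are called on.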
We will explore the intuitions behind these set operations by considering
and then consider who is in the sets
s1 - s2
s1 & s2
s1 | s2
s1 ^ s2
Sets should be relatively intuitive, so rather than demo them in class, we’ll work through these as an exercise:
>>> s1 = set(range(0,10))
>>> s1
>>> s1.add(6)
>>> s1.add(10)
>>> s2 = set(range(4,20,2))
>>> s2
>>> s1 - s2
>>> s1 & s2
>>> s1 | s2
>>> s1 <= s2
>>> s3 = set(range(4,20,4))
>>> s3 <= s2
These practice problems are to be used both in understanding sets and as a study aid for the next test.
What is the output of the following Python code? Write the answer by hand before you type it into the Python interpreter. Do not worry about getting the order of the values in a set correct:
>>> s1 = set([0,1,2])
>>> s2 = set(range(1,9,2))
>>> print 'A:', s1.union(s2)
>>> print 'B:', s1
>>> s1.add('1')
>>> s1.add(0)
>>> s1.add('3')
>>> s3 = s1 | s2
>>> print 'C:', s3
>>> print 'D:', s3 - s1
Note that this example does NOT cover all of the possible set operations. You should generate and test your own examples to ensure that you understand all of the basic set operations.
Write Python code that implements the following set functions using a combination of loops, the in operator, and the add function. In each case, s1 and s2 are sets and the function call should return a set.
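As a sketch of the general pattern, here is intersection written with only a loop, the in operator, and add. The function name my_intersection is hypothetical, and this is just one possible shape for such a function; try writing the others yourself before comparing against the built-in methods.

```python
def my_intersection(s1, s2):
    """Return a new set holding the values that are in both s1 and s2."""
    result = set()
    for x in s1:
        # Keep x only if it also appears in the other set.
        if x in s2:
            result.add(x)
    return result

a = set([1, 2, 3])
b = set([2, 3, 4])
common = my_intersection(a, b)
print(common == (a & b))  # True
```

The function builds and returns a fresh set, so neither argument is modified.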
Write a Python function to find all of the family names in the IMDB data set. Output them in alphabetical order. Assume the family name ends with the first ',' on each input line. Would you have noticed a significant difference in execution time if we used a list implementation? What if the data set were just the students in this class, or all students at RPI?