Our discussion is only loosely tied to Chapter 8 of the text.
Let’s review some common string operations that are particularly useful in parsing files.
Remove characters from the beginning, end or both sides of a string with lstrip, rstrip and strip:
>>> x = "red! Let's go red! Go red! Go red!"
>>> x.strip("red!")
" Let's go red! Go red! Go "
>>> x.lstrip("red!")
" Let's go red! Go red! Go red!"
>>> x.rstrip("red!")
"red! Let's go red! Go red! Go "
>>> " Go red! ".strip()
'Go red!'
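With no argument, whitespace (spaces, tabs and newlines) is removed by default. When an argument is given, it is treated as a set of characters to remove from the ends of the string, not as a substring.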
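A minimal sketch showing that the argument to strip acts as a set of characters rather than a substring, plus the common parsing idiom of stripping the end-of-line character from a line. The sample strings are made up for illustration:

```python
# strip, lstrip and rstrip remove any run of the given characters,
# in any order, from the ends of the string.
s = "xyxy-hello-yx"
both = s.strip("xy")     # '-hello-'
left = s.lstrip("xy")    # '-hello-yx'
right = s.rstrip("xy")   # 'xyxy-hello-'

# With no argument, whitespace (including the newline) is removed.
line = "New York City\n"
clean = line.strip()     # 'New York City'

print(both)
print(clean)
```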
Split a string using a delimiter, and get a list of strings. By default, the string is split on runs of whitespace:
>>> x = "Let's go red! Let's go red! Go red! Go red!"
>>> x.split()
["Let's", 'go', 'red!', "Let's", 'go', 'red!', 'Go', 'red!', 'Go', 'red!']
>>> x.split("!")
["Let's go red", " Let's go red", ' Go red', ' Go red', '']
>>> x.split("red!")
["Let's go ", " Let's go ", ' Go ', ' Go ', '']
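It returns the strings before and after each occurrence of the delimiter string in a list. When the string ends with the delimiter, the last item in the list is the empty string.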
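When parsing a line of a file, split is usually combined with strip to clean up each piece. The following is a small sketch with a made-up input line. It also shows how split() with no argument differs from split(" ") when there are repeated spaces:

```python
line = "red , green,  blue\n"

# Split on the comma, then strip the whitespace around each piece.
colors = []
for piece in line.split(","):
    colors.append(piece.strip())
# colors is ['red', 'green', 'blue']

# No argument: runs of whitespace count as a single delimiter.
words_default = "go  red".split()      # ['go', 'red']
# Explicit " ": every single space is a delimiter.
words_space = "go  red".split(" ")     # ['go', '', 'red']

print(colors)
```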
Find the first location of a substring in a string, return -1 if not found. You can also optionally give a starting and end point to search from:
>>> x
"Let's go red! Let's go red! Go red! Go red!"
>>> x.find('red')
9
>>> x.find('Red')
-1
>>> x.find('red',10)
23
>>> x.find('red',10,12)
-1
>>> 'red' in x
True
>>> 'Red' in x
False
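The optional starting point makes it possible to locate every occurrence of a substring, not just the first. Here is one way to sketch this; the function name find_all is our own and is not a built-in string method:

```python
def find_all(text, target):
    """Return a list of every index at which target starts in text."""
    positions = []
    start = text.find(target)
    while start != -1:
        positions.append(start)
        # Resume the search just past the occurrence we found.
        start = text.find(target, start + 1)
    return positions

x = "Let's go red! Let's go red! Go red! Go red!"
print(find_all(x, "red"))    # [9, 23, 31, 39]
print(find_all(x, "Red"))    # []
```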
Given the name of a file as a string, we can open it to read:
f = open('abc.txt')
This is the same as
f = open('abc.txt','r')
We can read in data through three primary methods. First,
line = f.readline()
reads in the next line up to and including the end-of-line character, and “advances” f to point to the next line of file abc.txt.
By contrast,
s = f.read()
reads the entire remainder of the input file as a single string.
When you are at the end of a file, f.read() and f.readline() will both return "" (empty string).
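The sketch below puts readline, read and the end-of-file behavior together. To keep it self-contained it first creates a small temporary file; the file contents and variable names are made up for illustration:

```python
import os
import tempfile

# Create a three-line sample file to read from.
path = os.path.join(tempfile.mkdtemp(), "abc.txt")
f_out = open(path, "w")
f_out.write("first\nsecond\nthird\n")
f_out.close()

f = open(path)
first = f.readline()       # 'first\n'
rest = f.read()            # 'second\nthird\n'
at_end_1 = f.readline()    # '' because nothing is left to read
at_end_2 = f.read()        # ''
f.close()

# The empty string can be used to detect the end of the file.
# A blank line in the file is '\n', not '', so it does not stop the loop.
lines = []
f = open(path)
line = f.readline()
while line != "":
    lines.append(line)
    line = f.readline()
f.close()
```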
The most common way to read a file is as follows:
f = open('abc.txt')
for line in f:
    print line
This for loop is equivalent to the following pseudocode:
f = open('abc.txt')
for each line in the file:
    line is assigned the string corresponding to the
    contents of the line, including the newline
You can combine the above steps into a single for loop:
for line in open('abc.txt'):
    ....
The code below closes and reopens a file:
f = open('abc.txt')
# Insert whatever code is needed to read from the file
# and use its contents ...
f.close()
f = open('abc.txt')
f now points again to the beginning of the file.
This can be used to read the same file multiple times.
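As an illustration of why a second pass can be useful, the sketch below reads a file of numbers once to compute the average, then reopens it to collect the values above that average. The sample file and its contents are made up:

```python
import os
import tempfile

# Create a sample file with one score per line.
path = os.path.join(tempfile.mkdtemp(), "scores.txt")
f_out = open(path, "w")
f_out.write("75\n98\n21\n66\n83\n")
f_out.close()

# First pass: compute the average.
f = open(path)
total = 0
count = 0
for line in f:
    total += int(line)      # int() ignores the trailing newline
    count += 1
f.close()
average = total / float(count)

# Second pass: reopen the file to start again from the beginning.
f = open(path)
above = []
for line in f:
    score = int(line)
    if score > average:
        above.append(score)
f.close()

print(above)
```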
In order to write to a file we must first open it and associate it with a file variable, e.g.
f_out = open("outfile.txt","w")
The "w" signifies write mode which causes Python to completely delete the previous contents of outfile.txt (if the file previously existed).
It is also possible to use append mode:
f_out = open("outfile.txt","a")
which means that the contents of outfile.txt are kept and new output is added to the end of the file.
Write mode is much more common than append mode.
To actually write to a file, we use the write method:
f_out.write("Hello world!")
You must close the files you write! Otherwise, some or all of your output may never actually be written to the file.
f_out.close()
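Two details are easy to miss: write does not add an end-of-line character for you, and it only accepts strings. A short sketch follows, using a temporary file path so that it can be run anywhere; the text written is made up:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "outfile.txt")

f_out = open(path, "w")
f_out.write("Hello world!")        # no newline is added automatically
f_out.write(" Same line.\n")       # so this continues the same line
f_out.write(str(42) + "\n")        # numbers must be converted with str()
f_out.close()

# Append mode keeps the existing contents and adds to the end.
f_out = open(path, "a")
f_out.write("appended\n")
f_out.close()

f = open(path)
contents = f.read()
f.close()
print(contents)
```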
Given the file census_data.txt:
Location 2000 2011
New York State 18,976,811 19,378,102
New York City 8,008,686 8,175,133
What are the values of the variables line1, line2, line3, and line4 after the following code executes?
f = open("census_data.txt")
line1 = f.readline()
line2 = f.read()
line3 = f.readline()
f.close()
f = open("census_data.txt")
line4 = f.readline()
For the same data above, what does the following program produce?
f = open('census_data.txt')
s = f.read()
line_list = s.split('\n')
print len(line_list)
Write code to print all the lines in the above file except for the header line (the first line).
Given a file containing test scores, one per line, write Python code to write a second file with the scores output in decreasing order, one per line, with the index on each line. For example, if the input file contains:
75
98
21
66
83
then the output file should contain:
0: 98
1: 83
2: 75
3: 66
4: 21
This can be done in 10 or fewer lines of Python code.
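One possible approach is sketched below; try the exercise yourself before reading it. The file names scores.txt and sorted_scores.txt are our own choices, and the sketch creates the input file itself so that it is self-contained:

```python
import os
import tempfile

folder = tempfile.mkdtemp()
in_path = os.path.join(folder, "scores.txt")
out_path = os.path.join(folder, "sorted_scores.txt")

# Set up the example input file.
f_out = open(in_path, "w")
f_out.write("75\n98\n21\n66\n83\n")
f_out.close()

# Read the scores as integers so that they sort numerically.
scores = []
for line in open(in_path):
    scores.append(int(line))
scores.sort()
scores.reverse()               # decreasing order

# Write each score with its index, one per line.
f_out = open(out_path, "w")
for i in range(len(scores)):
    f_out.write(str(i) + ": " + str(scores[i]) + "\n")
f_out.close()

result = open(out_path).read()
print(result)
```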
We can use the urllib module to access web pages.
We did this with our very first “real” example:
import urllib
words_file = urllib.urlopen(words_url)
Once we have words_file we can use the read, readline, and close methods just like we did with “ordinary” files.
When the web page is dynamic, we usually need to work through a separate API (application programming interface) to access the contents of the web site. Recall the Flickr example.
Python code:
HTML: Basic structure is a mix of text with commands that are inside “tags” < ... >.
Example:
<html>
<head>
<title>HTML example for CSCI-100</title>
</head>
<body>
This is a page about <a href="http://python.org">Python</a>.
It contains links and other information.
</body>
</html>
Despite the clean formatting of this example, HTML is in fact free-form, so that, for example, the following produces exactly the same web page:
<html><head><title>HTML example for CSCI-100</title>
</head> <body> This is a page about <a
href="http://python.org">Python</a>. It contains links
and other information. </body> </html>
JSON: used often with Python in many Web based APIs:
{
  "class_name": "CSCI 1100"
, "lab_sections" : [
    { "name": "Section 01"
    , "scheduled": "T 10AM-12PM"
    , "location": "Sage 2704"
    }
  , { "name": "Section 02"
    , "scheduled": "T 12PM-2PM"
    , "location": "Sage 2112"
    } ]
}
As in HTML, whitespace outside of the quoted strings does not matter.
Simplejson is a simple module for converting between a string in JSON format and a Python variable:
>>> import simplejson as sj
>>> x = ' [ "a", [ "b", 3 ] ] '
>>> sj.loads(x)
['a', ['b', 3]]
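Once a JSON string has been loaded, the result is an ordinary combination of Python dictionaries and lists. The sketch below uses the json module from the standard library, which provides the same loads and dumps functions as simplejson, so nothing extra needs to be installed. The JSON text is the class example from these notes, condensed:

```python
import json

text = '''
{ "class_name": "CSCI 1100"
, "lab_sections" : [
    { "name": "Section 01", "scheduled": "T 10AM-12PM", "location": "Sage 2704" }
  , { "name": "Section 02", "scheduled": "T 12PM-2PM", "location": "Sage 2112" } ]
}
'''

data = json.loads(text)            # a dictionary
class_name = data["class_name"]    # a string

# "lab_sections" is a list of dictionaries, one per section.
locations = []
for section in data["lab_sections"]:
    locations.append(section["location"])

# dumps goes the other way: from a Python value to a JSON string.
round_trip = json.loads(json.dumps(data))

print(locations)
```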
We will examine some simple formats that you have already seen in various homeworks.
Parsing files with fixed format in each line, delimited by a character
Often used: comma (csv), tab or space
Example: lego list:
2x1, 2
2x2, 3
Is there a header or not?
Pseudo code:
for each line of the file
    split using the separator
    read each column
Exercise: write a simple parser for the lego list that returns a list of the form:
['2x1', '2x1', '2x2', '2x2', '2x2']
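One possible parser following the pseudo code is sketched below. It assumes that there is no header line and that each line has exactly two comma-separated columns, a piece name and a count. The function name parse_lego is our own:

```python
def parse_lego(lines):
    """Return a list with each piece name repeated count times."""
    legos = []
    for line in lines:
        line = line.strip()
        if line == "":
            continue                     # skip blank lines
        fields = line.split(",")         # split using the separator
        piece = fields[0].strip()        # first column: piece name
        count = int(fields[1])           # second column: how many
        legos.extend([piece] * count)
    return legos

# A list of strings stands in for the file here; an open file
# works the same way, since looping over it also yields lines.
sample = ["2x1, 2\n", "2x2, 3\n"]
print(parse_lego(sample))
```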
Parsing files with one line per row of information, with different columns containing an unknown amount of information separated by a secondary delimiter
Example: Yelp from Lab 4:
Meka’s Lounge|42.74|-73.69|407 River Street+Troy, NY 12180|http://www.yelp.com/biz/mekas-lounge-troy|Bars|5|2|4|4|3|4|5
All fields after column 5 (counting from 0) are review scores
The address field (column 3) is itself divided by a secondary delimiter, the plus sign
Pseudo code:
for each line of the file
    split using the separator
    read column with secondary separator, split
    for each value in the column
        read value
Exercise: Return the address as a list of [street, city, state, zip] and the number of reviews for each line of the Yelp file.
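A sketch of one way to handle a single line is given below. It assumes that columns are counted from 0, that the address always has the form "street+city, state zip" as in the example, and that every column after column 5 is a review. The function name parse_yelp_line is our own:

```python
def parse_yelp_line(line):
    """Return ([street, city, state, zip], number of reviews)."""
    fields = line.strip().split("|")       # primary separator

    # Column 3 is the address; split it with the secondary separator.
    address_parts = fields[3].split("+")
    street = address_parts[0]
    city_part = address_parts[-1].split(",")
    city = city_part[0].strip()
    state_zip = city_part[1].split()       # e.g. ['NY', '12180']
    address = [street, city, state_zip[0], state_zip[1]]

    # Columns 0 through 5 are not reviews; everything after them is.
    num_reviews = len(fields) - 6
    return address, num_reviews

line = ("Meka's Lounge|42.74|-73.69|407 River Street+Troy, NY 12180|"
        "http://www.yelp.com/biz/mekas-lounge-troy|Bars|5|2|4|4|3|4|5\n")
address, num_reviews = parse_yelp_line(line)
print(address)
print(num_reviews)
```

The zip code is kept as a string here, which preserves any leading zeros.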
More complex file formats:
Blocks of data (of unknown length) separated by spaces making up a record:
4590 - Friday, July 16, 2004
Comments: Ken Jennings game 33.
Contestants:
Frank McNeil: a facilities management specialist from Louisville, Kentucky
Mary McCarthy: a homemaker from Las Vegas, Nevada
Ken Jennings: a software engineer from Salt Lake City, Utah (whose 32-day cash winnings total $1,050,460)
First Jeopardy! Round: AMERICAN WRITERS, IT'S A TEAM THING, NOT NO. 1, POPULATIONS, WHIRLED CAPITALS, THE ANATOMY OF EVEL
AMERICAN WRITERS | 8 days after publishing his first novel, "This Side of Paradise", he married Zelda Sayre | (F. Scott) Fitzgerald
right: Ken
Wrong:
Value: $200
Number: 1
AMERICAN WRITERS | His ability to imitate the family doctor earned him this playwright the nickname "Doc" | Neil Simon
right:
Wrong: Triple Stumper
Frank: Who was Eugene O'Neill?
Value: $400
Number: 2
Pseudo code:
for each line in the file
    if the line is in the same block as the previous
        add to the block of lines to process
    else
        process the current block
        start a new block
Exercise: Write a simple piece of code to decide when a new block is reached. What is the initial value?
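The excerpt above does not show exactly how blocks are delimited in the real file, so the sketch below makes an assumption: a blank line ends the current block. Under that assumption, the initial value is an empty block. The function name split_into_blocks and the sample lines are made up for illustration:

```python
def split_into_blocks(lines):
    """Group lines into blocks, assuming blank lines separate blocks."""
    blocks = []
    current = []                      # initial value: an empty block
    for line in lines:
        line = line.strip()
        if line != "":
            current.append(line)      # same block as the previous line
        else:
            if len(current) > 0:      # "process" the finished block
                blocks.append(current)
            current = []              # start a new block
    if len(current) > 0:              # do not forget the last block
        blocks.append(current)
    return blocks

sample = ["Value: $200\n", "Number: 1\n", "\n", "\n",
          "Value: $400\n", "Number: 2\n"]
print(split_into_blocks(sample))
```

Here "processing" a block just means saving it in a list. Note the check after the loop: without it, the final block would be lost whenever the file does not end with a blank line.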