Here s one approach.
I notice there is no delimiter between rows. In an effort to clean up the input data, I make a few assumptions:
- The first "row" is the "heading" of a "table", these will be our dictionary keys
- There are no empty fields in the first row (ie: no
""
)
- Any other field can be empty (ie:
""
)
- The first occurrence of two successive
"
indicates the end of the heading row
First I create a response based on your input string:
>>> response = "Subject";"Start Date";"Start Time";"End Date";"End Time";"All day event";"Description""Play football";"16/11/2009";"10:00 PM";"16/11/2009";"11:00 PM";"false";"""Watch 2012";"20/11/2009";"07:00 PM";"";"08:00 PM";"false";"""";"17/11/2009";"9:00 AM";"17/11/2009";"10:00 AM";"false";""
Note that
- the "End Date" for "Watch 2012" is empty
- there is a third event with an empty "Subject" heading
These two modifications illustrate some "edge cases" I m concerned about.
First I will replace all occurrences of two consecutive "
with a pipe (|
) and strip out all other "
characters because I don t need them:
>>> response.replace( "" , | ).replace( " , )
Subject;Start Date;Start Time;End Date;End Time;All day event;Description|Play football;16/11/2009;10:00 PM;16/11/2009;11:00 PM;false;|Watch 2012;20/11/2009;07:00 PM;|;08:00 PM;false;||;17/11/2009;9:00 AM;17/11/2009;10:00 AM;false;|
If we had any empty cells not at the start or end of a row (ie: Watch 2012 s End Date), it looks like this: ;|;
-- let s simply leave it blank:
>>> response.replace( "" , | ).replace( " , ).replace( ;|; , ;; )
Subject;Start Date;Start Time;End Date;End Time;All day event;Description|Play football;16/11/2009;10:00 PM;16/11/2009;11:00 PM;false;|Watch 2012;20/11/2009;07:00 PM;;08:00 PM;false;||;17/11/2009;9:00 AM;17/11/2009;10:00 AM;false;|
Now the |
indicates the split between the heading row and the next row. What happens if we split our string on |
?
>>> response.replace( "" , | ).replace( " , ).replace( ;|; , ;; ).split( | )
[ Subject;Start Date;Start Time;End Date;End Time;All day event;Description ,
Play football;16/11/2009;10:00 PM;16/11/2009;11:00 PM;false; ,
Watch 2012;20/11/2009;07:00 PM;;08:00 PM;false; ,
,
;17/11/2009;9:00 AM;17/11/2009;10:00 AM;false; ,
]
Looks like we re getting somewhere. There s a problem, though; there are two items in that list that are just the empty string
. They re there because we sometimes have a |
at the end of a row and the beginning of the next row, and splitting creates an empty element:
>>> "a|b||c".split( | )
[ a , b , , c ]
Same goes for a lone delimited at the end of a line, too:
>>> "a||b|c|".split( | )
[ a , , b , c , ]
Let s filter our list to drop those empty "rows":
>>> rows = [row for row in response.replace( "" , | ).replace( " , ).replace( ;|; , ;; ).split( | ) if row]
>>> rows
[ Subject;Start Date;Start Time;End Date;End Time;All day event;Description ,
Play football;16/11/2009;10:00 PM;16/11/2009;11:00 PM;false; ,
Watch 2012;20/11/2009;07:00 PM;;08:00 PM;false; ,
;17/11/2009;9:00 AM;17/11/2009;10:00 AM;false; ]
That s it for massaging the input; now we just need to build the dictionary. First, let s get the dictionary keys:
>>> dict_keys = rows[0].split( ; )
>>> dict_keys
[ Subject ,
Start Date ,
Start Time ,
End Date ,
End Time ,
All day event ,
Description ]
And build a list of dictionaries, one for each event:
>>> import itertools
>>> events = []
>>> for row in rows[1:]:
... d = {}
... for k, v in itertools.izip(dict_keys, row.split( ; )):
... d[k] = v
... events.append(d)
...
>>> events
[{ All day event : false ,
Description : ,
End Date : 16/11/2009 ,
End Time : 11:00 PM ,
Start Date : 16/11/2009 ,
Start Time : 10:00 PM ,
Subject : Play football },
{ All day event : false ,
Description : ,
End Date : ,
End Time : 08:00 PM ,
Start Date : 20/11/2009 ,
Start Time : 07:00 PM ,
Subject : Watch 2012 },
{ All day event : false ,
Description : ,
End Date : 17/11/2009 ,
End Time : 10:00 AM ,
Start Date : 17/11/2009 ,
Start Time : 9:00 AM ,
Subject : }]
Hope that helps!
Some notes:
- if you expect
|
to appear in your data, you might want to encode it first; or use a different delimiter
- supporting quotes in the data might be tricky (ie: Subject : Watching "2012" )
- I leave conversion of All day event values from string to boolean as an exercise to the reader :D