Here's something relevant I cooked up: a sed file, movies2xml.sed:
# ampersand etc ..
s|&|\&amp;|g
s|<|\&lt;|g
s|>|\&gt;|g
# last field, if range
s|\([12?][0189?][0-9?][0-9?]\)-\([12?][0189?][0-9?][0-9?]\)$|<when><f>\1</f><t>\2</t></when>|
# last field, if single
s|\([12?][0189?][0-9?][0-9?]\)$|<when><y>\1</y></when>|
# made-for tv/vid/vidgame ..
s|(\([TVG][TVG]*\)) *<when|<for>\1</for><when|
# episode
s|{\(.*\)} *|<ep>\1</ep>|
# ep season, number
s|<ep>\(.*\)(#\([0-9][0-9]*\)\.\([0-9][0-9]*\))</ep>|<ep s='\2' e='\3'>\1</ep>|
# release year / Number (when titles are duplicated in a year)
s| (\([12?][0189?][0-9?][0-9?]\)/*\([IVX]*\)) <|<y N='\2'>\1</y><|
s|<y N=''>|<y>|
# TV titles
s|^"\([^<]*\)"<y|<title type='tvseries'>\1</title><y|
# titles
s|^\(.[^<]*\)<y|<title type='film'>\1</title><y|
# vid game
s| type='film'\(.*<for>VG<\)| type='videogame'\1|
# wrap tag
s|^\(<.*>\)$|<entry>\1</entry>|
# rm other text
s|^\([^<].*\)$|<!-- \1 -->|
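To check the rules on a small input before running the full list, here is a minimal sketch. The three sample lines are made up (hypothetical titles in the movies.list layout, with tabs already turned into single spaces), and the rules are a subset of the script, written with the backslash escapes and single-quoted attributes that I am assuming the script uses.

```shell
# Hypothetical sample lines; titles and years are invented for the demo.
sample='Some Film (1999) 1999
"Some Show" (2001) 2001-2004
Some Game (2005) (VG) 2005'

# A subset of the movies2xml.sed rules, kept in the same order.
rules=$(mktemp)
cat > "$rules" <<'EOF'
s|\([12?][0189?][0-9?][0-9?]\)-\([12?][0189?][0-9?][0-9?]\)$|<when><f>\1</f><t>\2</t></when>|
s|\([12?][0189?][0-9?][0-9?]\)$|<when><y>\1</y></when>|
s|(\([TVG][TVG]*\)) *<when|<for>\1</for><when|
s| (\([12?][0189?][0-9?][0-9?]\)/*\([IVX]*\)) <|<y N='\2'>\1</y><|
s|<y N=''>|<y>|
s|^"\([^<]*\)"<y|<title type='tvseries'>\1</title><y|
s|^\(.[^<]*\)<y|<title type='film'>\1</title><y|
s| type='film'\(.*<for>VG<\)| type='videogame'\1|
s|^\(<.*>\)$|<entry>\1</entry>|
EOF

# Run the sample through the rules and keep the result for inspection.
demo=$(printf '%s\n' "$sample" | sed -f "$rules")
rm -f "$rules"
printf '%s\n' "$demo"
```

Each input line should come out as one `<entry>` element; the first one, for instance, becomes `<entry><title type='film'>Some Film</title><y>1999</y><when><y>1999</y></when></entry>`. The order of the rules matters: the trailing year is tagged first, so that the later rules can anchor on the `<when` and `<y` tags already in place.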
The XML tag names are a little terse, but (in June '14) there are 2,936,679 entries, which make up 334MB ..
I process the IMDb gzip file like this:
( F=movies.xml ; echo '<list>' > $F ;
zcat movies.list.gz |
tr '\t' ' ' | tr -s ' -' | recode l9..u8..xml |
sed -f movies2xml.sed >> $F ;
echo '</list>' >> $F ; ) &
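The whitespace clean-up in front of sed can be tried on its own. This is a small sketch under the assumption that the two tr stages are meant to turn tabs into spaces and then squeeze runs of spaces and hyphens; both input lines are made up.

```shell
# Tabs become spaces, then repeated spaces (and hyphens) are squeezed to one,
# so the sed rules can rely on a single space between fields.
line=$(printf 'Some Film (1999)\t\t\t1999\n' | tr '\t' ' ' | tr -s ' -')
printf '%s\n' "$line"

# Squeezing hyphens also keeps "--" out of lines that end up inside
# <!-- ... --> comments, where a double hyphen is not allowed in XML.
ruler=$(printf '%s\n' 'Copyright ---------- notice' | tr '\t' ' ' | tr -s ' -')
printf '%s\n' "$ruler"
```

The first line comes out as `Some Film (1999) 1999` and the second as `Copyright - notice`.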
This XML output then validates against this XSD:
<?xml version="1.0" encoding="UTF-8"?>
<!-- imdb_movies_list.xsd -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="list">
    <xs:complexType>
      <xs:sequence>
        <xs:element minOccurs="0" maxOccurs="unbounded" ref="entry"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="entry">
    <xs:complexType>
      <xs:sequence>
        <xs:element minOccurs="1" maxOccurs="1" ref="title"/>
        <xs:element minOccurs="1" maxOccurs="1" ref="y"/>
        <xs:choice>
          <xs:element minOccurs="0" maxOccurs="1" ref="for"/>
          <xs:element minOccurs="0" maxOccurs="1" ref="ep"/>
        </xs:choice>
        <xs:element minOccurs="1" maxOccurs="1" ref="when"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="title">
    <xs:complexType mixed="true">
      <xs:attribute name="type" use="required">
        <xs:simpleType>
          <xs:restriction base="xs:token">
            <xs:enumeration value="tvseries"/>
            <xs:enumeration value="videogame"/>
            <xs:enumeration value="film"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
  <xs:element name="y">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="yeartype">
          <xs:attribute name="N" use="optional">
            <xs:simpleType>
              <xs:restriction base="xs:token">
                <xs:enumeration value="I"/>
                <xs:enumeration value="II"/>
                <xs:enumeration value="III"/>
                <xs:enumeration value="IV"/>
                <xs:enumeration value="V"/>
                <xs:enumeration value="VI"/>
                <xs:enumeration value="VII"/>
                <xs:enumeration value="VIII"/>
                <xs:enumeration value="IX"/>
                <xs:enumeration value="X"/>
                <xs:enumeration value="XI"/>
                <xs:enumeration value="XII"/>
                <xs:enumeration value="XIII"/>
                <xs:enumeration value="XIV"/>
                <xs:enumeration value="XV"/>
                <xs:enumeration value="XVI"/>
                <xs:enumeration value="XVII"/>
                <xs:enumeration value="XVIII"/>
                <xs:enumeration value="XIX"/>
                <xs:enumeration value="XX"/>
                <xs:enumeration value="XXI"/>
                <xs:enumeration value="XXII"/>
                <xs:enumeration value="XXIII"/>
                <xs:enumeration value="XXIV"/>
                <xs:enumeration value="XXV"/>
                <xs:enumeration value="XXVI"/>
                <xs:enumeration value="XXVII"/>
                <xs:enumeration value="XXVIII"/>
                <xs:enumeration value="XXIX"/>
              </xs:restriction>
            </xs:simpleType>
          </xs:attribute>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
  <xs:element name="for">
    <xs:simpleType>
      <xs:restriction base="xs:token">
        <xs:enumeration value="TV"/>
        <xs:enumeration value="V"/>
        <xs:enumeration value="VG"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
  <xs:element name="ep">
    <xs:complexType mixed="true">
      <xs:attribute name="s" type="xs:integer" use="optional"/>
      <xs:attribute name="e" type="xs:integer" use="optional"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="when">
    <xs:complexType>
      <xs:choice>
        <xs:sequence>
          <xs:element name="y" type="yeartype" minOccurs="1" maxOccurs="1"/>
        </xs:sequence>
        <xs:sequence>
          <xs:element name="f" type="yeartype" minOccurs="1" maxOccurs="1"/>
          <xs:element name="t" type="yeartype" minOccurs="1" maxOccurs="1"/>
        </xs:sequence>
      </xs:choice>
    </xs:complexType>
  </xs:element>
  <xs:simpleType name="yeartype">
    <xs:restriction base="xs:string">
      <xs:pattern value="[12?][0189?][0-9?][0-9?]"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
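One way to run that validation, as a sketch: this assumes xmllint (from libxml2) is installed and that the schema was saved as imdb_movies_list.xsd, the name in its header comment. Any other XSD validator would do; the `validate_movies` helper is only a hypothetical wrapper for illustration.

```shell
# validate_movies XMLFILE XSDFILE
# Returns xmllint's exit status (0 when the file validates), or 127 when
# xmllint is not installed.
validate_movies() {
  if ! command -v xmllint >/dev/null 2>&1; then
    echo "xmllint not found" >&2
    return 127
  fi
  # --noout suppresses the echo of the parsed document; --schema selects the XSD.
  xmllint --noout --schema "$2" "$1"
}
```

Called as `validate_movies movies.xml imdb_movies_list.xsd`, it reports any validation errors on stderr and returns a non-zero status if the file does not validate.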
I expect there's an XML-to-JSON converter out there somewhere, for the believers ..