Create XML file from LIST file format

I have downloaded some .LIST files from the IMDb database, and I want to use them for social network analysis (research, with references), using SNA software whose input can be XML or CSV...

Answer

Here's something relevant I cooked up: a sed script, movies2xml.sed:

# ampersand etc ..
s|&|\&amp;|g
s|<|\&lt;|g
s|>|\&gt;|g
# last field, if range
s|\([12?][0189?][0-9?][0-9?]\)-\([12?][0189?][0-9?][0-9?]\)$|<when><f>\1</f><t>\2</t></when>|
# last field, if single
s|\([12?][0189?][0-9?][0-9?]\)$|<when><y>\1</y></when>|
# made-for tv/vid/vidgame ..
s|(\([TVG][TVG]*\)) *<when|<for>\1</for><when|
# episode
s|{\(.*\)} *|<ep>\1</ep>|
# ep season, number
s|<ep>\(.*\)(#\([0-9][0-9]*\)\.\([0-9][0-9]*\))</ep>|<ep s="\2" e="\3">\1</ep>|
# release year / Number (when titles are duplicated in a year)
s| (\([12?][0189?][0-9?][0-9?]\)/*\([IVX]*\)) <|<y N="\2">\1</y><|
s|<y N="">|<y>|
# TV titles
s|^"\([^<]*\)"<y|<title type="tvseries">\1</title><y|
# titles
s|^\(.[^<]*\)<y|<title type="film">\1</title><y|
# vid game
s| type="film"\(.*<for>VG<\)| type="videogame"\1|
# wrap tag
s|^\(<.*>\)$|<entry>\1</entry>|
# rm other text
s|^\([^<].*\)$|<!-- \1 -->|
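To see how the rules in movies2xml.sed build on one another, here is a minimal sketch that applies five of them (single year, release year, empty N cleanup, film title, wrap) to one line. The function name demo_rules and the sample line are made up for illustration. Backslashes and quotes in the rules are written the way sed's basic regular expressions need them, which is an assumption about the original script.

```shell
# demo_rules: hypothetical helper that applies five rules from movies2xml.sed
# to stdin, in the same order as the script.
demo_rules() {
  sed \
    -e 's|\([12?][0189?][0-9?][0-9?]\)$|<when><y>\1</y></when>|' \
    -e 's| (\([12?][0189?][0-9?][0-9?]\)/*\([IVX]*\)) <|<y N="\2">\1</y><|' \
    -e 's|<y N="">|<y>|' \
    -e 's|^\(.[^<]*\)<y|<title type="film">\1</title><y|' \
    -e 's|^\(<.*>\)$|<entry>\1</entry>|'
}

# Made-up input, shaped like a line after the tab has been turned into a space.
printf '%s\n' 'Alien (1979) 1979' | demo_rules
```

This prints <entry><title type="film">Alien</title><y>1979</y><when><y>1979</y></when></entry>. Each rule leaves a tag behind that the next rule anchors on, which is why the order of the rules matters.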

The XML tag names are a little terse, but (in June 14) there are 2,936,679 entries, which makes up 334 MB.

I process the imdb zip-file like this:

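# Wrap the converted lines in a <list> root element; the subshell runs in the background.
( F=movies.xml ; echo '<list>' > "$F" ;
zcat movies.list.gz |
    tr '\t' ' ' | tr -s ' -' | recode l9..u8..xml |
    sed -f movies2xml.sed >> "$F" ;
echo '</list>' >> "$F" ; ) &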
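Once the background job has finished, a quick sanity check is to count how many lines became entries and how many fell through to the catch-all comment rule. The sketch below assumes the one-element-per-line output that the sed script produces. The helper names count_entries and count_comments are hypothetical.

```shell
# Count lines that were converted into <entry> elements.
count_entries() {
  grep -c '^<entry>' "$1"
}

# Count lines that matched no rule and were wrapped in an XML comment.
count_comments() {
  grep -c '^<!--' "$1"
}
```

For example, count_entries movies.xml should report the number of entries, and count_comments movies.xml shows how much header and footer text was set aside.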

This XML output then validates against this XSD:

<?xml version="1.0" encoding="UTF-8"?>
<!-- imdb_movies_list.xsd -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="list">
    <xs:complexType>
      <xs:sequence>
        <xs:element minOccurs="0" maxOccurs="unbounded" ref="entry"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="entry">
    <xs:complexType>
      <xs:sequence>
        <xs:element minOccurs="1" maxOccurs="1" ref="title"/>
        <xs:element minOccurs="1" maxOccurs="1" ref="y"/>
        <xs:choice>
          <xs:element minOccurs="0" maxOccurs="1" ref="for"/>
          <xs:element minOccurs="0" maxOccurs="1" ref="ep"/>
        </xs:choice>
        <xs:element minOccurs="1" maxOccurs="1" ref="when"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="title">
    <xs:complexType mixed="true">
      <xs:attribute name="type" use="required">
        <xs:simpleType>
          <xs:restriction base="xs:token">
            <xs:enumeration value="tvseries"/>
            <xs:enumeration value="videogame"/>
            <xs:enumeration value="film"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
  <xs:element name="y">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="yeartype">
          <xs:attribute name="N" use="optional">
            <xs:simpleType>
              <xs:restriction base="xs:token">
                <xs:enumeration value="I"/>
                <xs:enumeration value="II"/>
                <xs:enumeration value="III"/>
                <xs:enumeration value="IV"/>
                <xs:enumeration value="V"/>
                <xs:enumeration value="VI"/>
                <xs:enumeration value="VII"/>
                <xs:enumeration value="VIII"/>
                <xs:enumeration value="IX"/>
                <xs:enumeration value="X"/>
                <xs:enumeration value="XI"/>
                <xs:enumeration value="XII"/>
                <xs:enumeration value="XIII"/>
                <xs:enumeration value="XIV"/>
                <xs:enumeration value="XV"/>
                <xs:enumeration value="XVI"/>
                <xs:enumeration value="XVII"/>
                <xs:enumeration value="XVIII"/>
                <xs:enumeration value="XIX"/>
                <xs:enumeration value="XX"/>
                <xs:enumeration value="XXI"/>
                <xs:enumeration value="XXII"/>
                <xs:enumeration value="XXIII"/>
                <xs:enumeration value="XXIV"/>
                <xs:enumeration value="XXV"/>
                <xs:enumeration value="XXVI"/>
                <xs:enumeration value="XXVII"/>
                <xs:enumeration value="XXVIII"/>
                <xs:enumeration value="XXIX"/>
              </xs:restriction>
            </xs:simpleType>
          </xs:attribute>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
  <xs:element name="for">
    <xs:simpleType>
      <xs:restriction base="xs:token">
        <xs:enumeration value="TV"/>
        <xs:enumeration value="V"/>
        <xs:enumeration value="VG"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
  <xs:element name="ep">
    <xs:complexType mixed="true">
      <xs:attribute name="s" type="xs:integer" use="optional"/>
      <xs:attribute name="e" type="xs:integer" use="optional"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="when">
    <xs:complexType>
      <xs:choice>
        <xs:sequence>
          <xs:element name="y" type="yeartype" minOccurs="1" maxOccurs="1"/>
        </xs:sequence>
        <xs:sequence>
          <xs:element name="f" type="yeartype" minOccurs="1" maxOccurs="1"/>
          <xs:element name="t" type="yeartype" minOccurs="1" maxOccurs="1"/>
        </xs:sequence>
      </xs:choice>
    </xs:complexType>
  </xs:element>
  <xs:simpleType name="yeartype">
    <xs:restriction base="xs:string">
      <xs:pattern value="[12?][0189?][0-9?][0-9?]"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
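The answer does not say which validator was used. One common option is xmllint from libxml2, sketched below under that assumption. The wrapper name validate_list is hypothetical. The exit status is 0 when the document validates and non-zero otherwise.

```shell
# Validate an XML file against an XSD using xmllint (libxml2).
# $1 = schema file, $2 = XML file; --noout suppresses the echoed document.
validate_list() {
  xmllint --noout --schema "$1" "$2"
}
```

Usage would be validate_list imdb_movies_list.xsd movies.xml. With a file of this size, memory use may be a concern, so it can be worth trying the command on a small excerpt first.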

I expect there's an XML-to-JSON converter out there somewhere, for the believers.
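Since the SNA software in the question also accepts CSV, the entries can be flattened with sed as well. This is a rough sketch, not part of the original answer. The function name entries_to_csv is hypothetical, it keeps only type, title and release year, and it assumes one <entry> per line with only the three entities &amp;, &lt; and &gt; in titles.

```shell
# entries_to_csv: read the generated XML on stdin, write type,"title",year rows.
# Lines that are not well-shaped <entry> elements (comments, <list>) are skipped.
entries_to_csv() {
  sed -n '\|^<entry><title type="[a-z]*">.*</title><y[^>]*>....</y>|{
    # CSV escapes a double quote by doubling it (this also doubles the
    # attribute quotes, which the next pattern accounts for).
    s|"|""|g
    s|^<entry><title type=""\([a-z]*\)"">\(.*\)</title><y[^>]*>\(....\)</y>.*$|\1,"\2",\3|
    # Undo the XML escaping; &amp; goes last so it cannot create new entities.
    s|&lt;|<|g
    s|&gt;|>|g
    s|&amp;|\&|g
    p
  }'
}
```

For example, ( echo 'type,title,year' ; entries_to_csv < movies.xml ) > movies.csv would add a header row. Episode and date-range information is dropped in this sketch.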




