First, if you are going to ignore the fact that the input is XML, then there is no need for Perl or Python or gawk or any other language. Just use
$ grep '<s1>' input_file > s1.txt
$ grep '<s2>' input_file > s2.txt
and be done with it. This seems inefficient because the file is read twice, but given the time it takes to write a script and then invoke it, the inefficiency is insignificant. What's worse, if you do not know how to write that particularly simple script, you have to post on SO and wait for an answer, and that wait exceeds the inefficiency of the grep solution by many, many orders of magnitude.
Now, if the fact that the input is XML matters in the slightest, you should use an XML parser. Contrary to the incorrect claim made elsethread, there are plenty of XML parsers which do not have to load the whole file into memory. Such a parser would have the advantage of being extensible and correct.
The example I give below is intended to replicate the structure of the answer you have already accepted to show you that it is no more complicated to use a proper solution.
Just to give fair warning, the script below is likely to be the slowest possible way. I wrote it to exactly mimic the accepted solution.
#!/usr/bin/perl
use strict; use warnings;
use autodie;

# One output file handle per tag, keyed by file name.
my %fh = map { open my $f, '>', $_; $_ => $f } qw{ s1.txt s2.txt };

use HTML::TokeParser::Simple;
my $parser = HTML::TokeParser::Simple->new(*DATA);
$parser->xml_mode(1);

# Skip ahead to each <s1> or <s2> start tag, then take the text up to its end tag.
while ( my $tag = $parser->get_tag('s1', 's2') ) {
    my $type = $tag->get_tag;
    my $text = $parser->get_text("/$type");
    print { $fh{"$type.txt"} } $text, "\n";
}
__DATA__
<link type="1-1" xtargets="1;1">
<s1>bunch of text here</s1>
<s2>some more here</s2>
</link>
<link type="1-1" xtargets="1;1">
<s1>bunch of text here</s1>
<s2>some more here</s2>
</link>
<link type="1-1" xtargets="1;1">
<s1>bunch of text here</s1>
<s2>some more here</s2>
</link>
Output:
C:\Temp> cat s1.txt
bunch of text here
bunch of text here
bunch of text here
C:\Temp> cat s2.txt
some more here
some more here
some more here
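To back up the claim above that an XML parser does not have to load the whole file into memory, here is a minimal sketch of one way to do it, not the only one. It assumes XML::Twig from CPAN is installed. The helper name stream_s1_s2 and the <links> wrapper element are made up for illustration: a strict XML parser wants a single root element, so this assumes the real file has one around the <link> records.

```perl
#!/usr/bin/perl
use strict; use warnings;
use XML::Twig;

# Hypothetical helper: parse $source (an XML string or an open filehandle)
# and call $emit->($tag, $text) for every <s1> and <s2> element.
sub stream_s1_s2 {
    my ($source, $emit) = @_;
    my $handler = sub {
        my ($twig, $elt) = @_;
        $emit->($elt->tag, $elt->text);
        $twig->purge;    # drop what has been parsed so far to keep memory flat
    };
    XML::Twig->new(
        twig_handlers => { s1 => $handler, s2 => $handler },
    )->parse($source);
    return;
}

# Assumption: the records sit inside a single root element (<links> is invented).
my $xml = <<'XML';
<links>
<link type="1-1" xtargets="1;1">
<s1>bunch of text here</s1>
<s2>some more here</s2>
</link>
</links>
XML

stream_s1_s2($xml, sub { my ($tag, $text) = @_; print "$tag: $text\n" });
```

For a large file, you would pass an open filehandle instead of a string and have the callback print to the s1.txt and s2.txt handles. The handlers fire as each element is closed, so only the current record needs to be held in memory.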