Question

I have packet capture code that writes the HTTP payload to a file. Now I want to extract the URL information from these dumps. For each packet, the payload begins like this:

GET /intl/en_com/images/logo_plain.png HTTP/1.1..Host: www.google.co.in..User-Agent: Mozilla/5.0

I would like to extract:

the string between "GET" and "HTTP/1.1"
the string between "Host:" and "User-Agent"

How can I do this in C? Are there any built-in string functions, or should I use regular expressions?

Answer 1

C doesn't have built-in regular expressions, though libraries are available: http://www.arglist.com/regex/ and http://www.pcre.org/ are the two I see most often.

For a task this simple, you can easily get away without using regexes though. Provided the lines are all less than some maximum length MAXLEN, just process them one line at a time:

char buf[MAXLEN];
char url[MAXLEN];
char host[MAXLEN];
int state = 0;      /* 0: haven't seen GET yet; 1: haven't seen Host yet */
FILE *f = fopen("my_input_file", "rb");

if (!f) {
    report_error_somehow();   /* and bail out here: f must not be used below */
}

while (fgets(buf, sizeof buf, f)) {
    /* Strip trailing \r and \n */
    int len = strlen(buf);
    if (len >= 2 && buf[len - 1] == '\n' && buf[len - 2] == '\r') {
        buf[len - 2] = 0;
    } else {
        if (feof(f)) {
            /* Last line was not \n-terminated: probably OK to ignore */
        } else {
            /* Either the line was too long, or ends with \n but not \r\n. */
            report_error_somehow();
        }
    }

    if (state == 0 && !memcmp(buf, "GET ", 4)) {
        strcpy(url, buf + 4);    /* We know url[] is big enough */
        ++state;
    } else if (state == 1 && !memcmp(buf, "Host: ", 6)) {
        strcpy(host, buf + 6);   /* We know host[] is big enough */
        break;
    }
}

fclose(f);

This solution doesn't require buffering the entire file in memory as KennyTM's answer does (though that is fine, by the way, if you know the files are small). Notice that we use fgets() instead of the unsafe gets(), which is prone to overflowing buffers on long lines.
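
One caveat about the loop above: url ends up holding everything after "GET ", so it still carries the trailing protocol version (for example "/intl/en_com/images/logo_plain.png HTTP/1.1"), whereas the question asks only for the text between GET and HTTP/1.1. Here is a minimal sketch of one way to cut it off, assuming the request line has the usual "METHOD target version" shape. trim_http_version is a hypothetical helper name, not part of the code above.

```c
#include <string.h>

/* Hypothetical helper: cut a trailing " HTTP/x.y" off a request target,
 * in place. The string is left untouched if the text after the last
 * space does not start with "HTTP/". */
void trim_http_version(char *target)
{
    char *last_space = strrchr(target, ' ');

    if (last_space != NULL && strncmp(last_space + 1, "HTTP/", 5) == 0) {
        *last_space = '\0';
    }
}
```

Calling trim_http_version(url) right after the strcpy() into url would leave just the path in url.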

Answer 2

Look for the locations of the marker strings using strstr (or strchr for a single character). Since the strings GET, HTTP/1.1 and Host: are of fixed length, the start and end of the text in between can be computed easily.
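
A minimal sketch of that idea, assuming the payload has already been read into a NUL-terminated buffer (raw captured data is not necessarily NUL-terminated, so this is an assumption) and that the ".." in the dump stands for a CR LF pair. extract_between is a hypothetical helper name chosen for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the text found between the first occurrence
 * of `start` and the next occurrence of `end` into out (capacity
 * out_size). Returns 0 on success, -1 if either marker is missing or
 * out is too small. */
int extract_between(const char *text, const char *start, const char *end,
                    char *out, size_t out_size)
{
    const char *begin = strstr(text, start);
    if (begin == NULL) {
        return -1;
    }
    begin += strlen(start);              /* skip past the start marker */

    const char *stop = strstr(begin, end);
    if (stop == NULL) {
        return -1;
    }

    size_t len = (size_t)(stop - begin);
    if (len >= out_size) {               /* need room for the terminator */
        return -1;
    }

    memcpy(out, begin, len);
    out[len] = '\0';
    return 0;
}
```

With this, extract_between(payload, "GET ", " HTTP/1.1", url, sizeof url) would yield the path, and extract_between(payload, "Host: ", "\r\n", host, sizeof host) the host name. Ending the host at the line break rather than at "User-Agent" avoids depending on the order of the headers.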

If you want to use regular expressions, on POSIX-compliant systems there is regcomp(3), but that's also quite hard to use.
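
For comparison, here is a rough sketch of the regcomp(3) route, assuming a POSIX system and a NUL-terminated payload. regex_capture is a hypothetical helper that returns the first parenthesised group of a POSIX extended regular expression; the patterns shown afterwards are illustrative and only cover HTTP/1.0 and HTTP/1.1 GET requests.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: match `pattern` (POSIX extended syntax) against
 * `text` and copy capture group 1 into out (capacity out_size).
 * Returns 0 on success, -1 if the pattern does not compile, does not
 * match, or the capture does not fit. */
int regex_capture(const char *text, const char *pattern,
                  char *out, size_t out_size)
{
    regex_t re;
    regmatch_t m[2];                     /* m[0]: whole match, m[1]: group 1 */
    int rc = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0) {
        return -1;
    }

    if (regexec(&re, text, 2, m, 0) == 0 && m[1].rm_so != -1) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
        if (len < out_size) {
            memcpy(out, text + m[1].rm_so, len);
            out[len] = '\0';
            rc = 0;
        }
    }

    regfree(&re);
    return rc;
}
```

The path could then be captured with the pattern "^GET ([^ ]+) HTTP/1\\.[01]" and the host with "Host: ([^\r\n]+)". Compiling the pattern on every call keeps the sketch short; code that processes many packets would compile each pattern once and reuse it.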
