Parsing text in C

Question

I have a file like this:

    ...
    words 13
    more words 21
    even more words 4
    ...

(The general format is a string containing no digits, then a space, then any number of digits and a newline.)

I would like to parse each line, putting the words into one field of a structure and the number into another. Right now I'm using an ugly hack: I read the line while the characters are not digits, and then read the rest as the number. I believe there is a cleaner way.

c parsing

Sebastian A. Sep 05 '09 at 21:03

7 answers

Rob Jones · Answer 1 · 2009-09-05T21:07:04+0000

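Edit: you can use pNum - buf to get the length of the alphabetic part of the string and use strncpy() to copy it to another buffer. Remember to add '\0' to the end of the destination buffer, because strncpy() does not terminate it here. I would insert this code before pNum++, while pNum still points at the space.

    /* pNum points at the last space, so len is the length of the text part */
    size_t len = (size_t)(pNum - buf);
    strncpy(newBuf, buf, len);
    newBuf[len] = '\0';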

You can read the entire line into a buffer, and then use:

    char *pNum;
    if ((pNum = strrchr(buf, ' ')) != NULL)
    {
        pNum++;
    }

to get a pointer to the number field.

Jason Williams · Answer 2 · 2009-09-05T21:28:15+0000

 fscanf(file, "%s %d", word, &value);

This gets the values directly into a string and an integer, and copes with variations in white space and number formats, etc.

Edit

Oh, I forgot that there are spaces between the words. In that case, I would do the following. (Note that it modifies the line in place, cutting the text off at the last space.)

    // Scan to find the last space in the line
    char *p = line;
    char *lastSpace = NULL;
    while (*p != '\0')
    {
        if (*p == ' ')
            lastSpace = p;
        p++;
    }

    if (lastSpace == NULL)
        return("parse error");

    // Replace the last space in the line with a NUL
    *lastSpace = '\0';

    // Advance past the NUL to the first character of the number field
    lastSpace++;

    char *word = line;   // the text part now ends at the NUL written above
    int number = atoi(lastSpace);

You can solve this with the stdlib functions, but the above will probably be more efficient, since you are only looking for the characters you are interested in.

Amber · Answer 3 · 2009-09-05T21:06:54+0000

You can try using strtok() to tokenize each line, and then check whether each token is a number or a word (a fairly trivial check once you have the token string: just look at the first character of the token).
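A minimal sketch of that idea follows. The struct entry type, the parse_tokens name, and the 128-byte limit are assumptions made for illustration; they do not come from the question. Tokens that begin with a digit are treated as the number, and everything else is joined back into the word field.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record type; the field names and the 128-byte limit are
   assumptions made for this sketch. */
struct entry {
    char words[128];
    long number;
};

/* Splits `line` in place with strtok(). A token that starts with a digit
   is treated as the number; every other token is appended to `words`.
   Returns 0 on success, -1 if no number was found or the words do not fit. */
int parse_tokens(char *line, struct entry *out)
{
    assert(line != NULL && out != NULL);

    int have_number = 0;
    size_t used = 0;
    out->words[0] = '\0';
    out->number = 0;

    for (char *tok = strtok(line, " \t\r\n"); tok != NULL;
         tok = strtok(NULL, " \t\r\n")) {
        if (isdigit((unsigned char)tok[0])) {
            out->number = strtol(tok, NULL, 10);
            have_number = 1;
        } else {
            size_t len = strlen(tok);
            size_t sep = (used > 0) ? 1 : 0;   /* space between words */
            if (used + sep + len >= sizeof out->words)
                return -1;
            if (sep)
                out->words[used++] = ' ';
            memcpy(out->words + used, tok, len + 1);  /* +1 copies the NUL */
            used += len;
        }
    }
    return have_number ? 0 : -1;
}
```

Because strtok() discards the delimiters, runs of several spaces between words collapse to a single space in the result, and the input buffer is modified.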

Liran Orevi · Answer 4 · 2009-09-05T21:27:37+0000

Assuming the number is immediately followed by '\n', you can read each line into a character buffer, use sscanf() with "%d" on the entire line to get the number, and then calculate how many characters that number occupies at the end of the line.
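One way to make the sscanf() idea concrete, assuming (as the question states) that the words contain no digits, is to let a scanset consume the text and let "%d" convert what follows. This is a variation on the suggestion rather than a literal transcription of it; the parse_scanset name and the WORD_SIZE limit are invented for the sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define WORD_SIZE 128   /* assumed limit; must match the %127 width below */

/* Reads the leading non-digit text into `word` and the number that follows
   into `*number`. Returns 0 on success, -1 if the line does not match. */
int parse_scanset(const char *line, char word[WORD_SIZE], int *number)
{
    assert(line != NULL && word != NULL && number != NULL);

    /* %127[^0-9] stores up to 127 characters that are not digits;
       %d then converts the digits that stopped the scanset. */
    if (sscanf(line, "%127[^0-9]%d", word, number) != 2)
        return -1;

    /* The scanset also captured the separating space, so trim it. */
    size_t len = strlen(word);
    while (len > 0 && isspace((unsigned char)word[len - 1]))
        word[--len] = '\0';

    return len > 0 ? 0 : -1;
}
```

A limitation of this sketch: a minus sign is not a digit, so it would end up in the text part and the number would be read as positive.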

Kfro · Answer 5 · 2009-09-05T21:34:35+0000

Depending on how complex your lines are, you may want to use the PCRE library. At the very least, you can compile a Perl-style regex to split your lines. It may be overkill, though.
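To illustrate the regex approach without an external dependency, the sketch below uses the POSIX regex.h functions instead of PCRE, which has its own, different API. The pattern and the parse_regex name are assumptions chosen for this example; the pattern expects some text, whitespace, then a run of decimal digits at the end of the line.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Matches "<text> <digits>" with optional trailing whitespace.
   Group 1 is the text, group 2 the number.
   Returns 0 on success, -1 on no match or if the text does not fit. */
int parse_regex(const char *line, char *word, size_t word_size, long *number)
{
    assert(line != NULL && word != NULL && number != NULL);

    regex_t re;
    regmatch_t m[3];   /* whole match plus two groups */
    int rc = -1;

    if (regcomp(&re, "^(.*[^[:space:]])[[:space:]]+([0-9]+)[[:space:]]*$",
                REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, line, 3, m, 0) == 0) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
        if (len < word_size) {
            memcpy(word, line + m[1].rm_so, len);
            word[len] = '\0';
            *number = strtol(line + m[2].rm_so, NULL, 10);
            rc = 0;
        }
    }
    regfree(&re);
    return rc;
}
```

Compiling the pattern on every call keeps the sketch short; real code that parses many lines would compile it once and reuse it.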

John Bode · Answer 6 · 2009-09-06T00:41:37+0000

Given the description, here is what I would do: read each line as a single string using fgets() (make sure the target buffer is large enough), and then split the line with strtok(). To determine whether each token is a word or a number, I would use strtol() to attempt the conversion and check the error condition. Example:

    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>

    /* Example limits; adjust them to your data. */
    #define MAX_LINE_LENGTH 256
    #define MAX_STR_SIZE     64
    #define MAX_NUM_STRINGS  32

    /**
     * Read the next line from the file, splitting the tokens into
     * multiple strings and a single integer.  Assumes input lines
     * never exceed MAX_LINE_LENGTH, each individual string never
     * exceeds MAX_STR_SIZE, and no line holds more than
     * MAX_NUM_STRINGS words.  Otherwise things get a little more
     * interesting.  Also assumes that the integer is the last
     * thing on each line.
     */
    int getNextLine(FILE *in, char (*strs)[MAX_STR_SIZE], int *numStrings, int *value)
    {
      char buffer[MAX_LINE_LENGTH];
      int rval = 1;
      if (fgets(buffer, sizeof buffer, in))
      {
        /* Treat the newline as a delimiter so it never ends up in a token */
        char *token = strtok(buffer, " \n");
        *numStrings = 0;
        while (token)
        {
          char *chk;
          long n = strtol(token, &chk, 10);
          if (*chk != 0)
          {
            /* Conversion stopped early: the token is a word */
            strcpy(strs[(*numStrings)++], token);
          }
          else
          {
            /* The whole token was numeric */
            *value = (int) n;
          }
          token = strtok(NULL, " \n");
        }
      }
      else
      {
        /**
         * fgets() hit either EOF or error; either way return 0
         */
        rval = 0;
      }
      return rval;
    }

    /**
     * sample main
     */
    int main(void)
    {
      FILE *input;
      char strings[MAX_NUM_STRINGS][MAX_STR_SIZE];
      int numStrings;
      int value;

      input = fopen("datafile.txt", "r");
      if (input)
      {
        while (getNextLine(input, strings, &numStrings, &value))
        {
          /**
           * Do something with strings and value here
           */
        }
        fclose(input);
      }
      return 0;
    }

Jonathan Leffler · Answer 7 · 2009-09-06T00:44:46+0000

Given the description, I think I would use a variant of this (now tested) C99 code:

    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <ctype.h>

    struct word_number
    {
        char word[128];
        long number;
    };

    int read_word_number(FILE *fp, struct word_number *wnp)
    {
        char buffer[140];
        if (fgets(buffer, sizeof(buffer), fp) == NULL)
            return EOF;
        size_t len = strlen(buffer);
        if (len == 0 || buffer[len-1] != '\n')  // Error if line too long to fit
            return EOF;
        buffer[--len] = '\0';
        if (len == 0)  // Empty line
            return EOF;
        char *num = &buffer[len-1];
        while (num > buffer && !isspace((unsigned char)*num))
            num--;
        if (num == buffer)  // No space in input data
            return EOF;
        char *end;
        wnp->number = strtol(num+1, &end, 0);
        if (end == num+1 || *end != '\0')  // Missing or invalid number as last word on line
            return EOF;
        *num = '\0';
        if ((size_t)(num - buffer) >= sizeof(wnp->word))  // Non-number part too long
            return EOF;
        memcpy(wnp->word, buffer, num - buffer + 1);  // +1 copies the terminating NUL
        return(0);
    }

    int main(void)
    {
        struct word_number wn;
        while (read_word_number(stdin, &wn) != EOF)
            printf("Word <<%s>> Number %ld\n", wn.word, wn.number);
        return(0);
    }

You can improve the error reporting by returning different values for different problems. You can make it work with dynamically allocated memory for the word part of each line. You can make it work with longer lines than I assume. You could scan backwards over digits instead of over non-spaces, but scanning over non-spaces allows the user to write "abc 0x123" and have the hexadecimal value processed correctly. You may prefer that there be no digits in the word part; this code does not care.
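As a sketch of the first suggestion, the backwards scan can report each failure with its own status code. The enum names and the split_line function are hypothetical, invented for this example, and the function works on a line whose trailing newline has already been removed rather than on a FILE.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical status codes; the names are invented for this sketch. */
enum parse_status {
    PARSE_OK = 0,
    PARSE_NO_SPACE,      /* no separator between words and number */
    PARSE_BAD_NUMBER,    /* last field is empty or not a valid number */
    PARSE_WORD_TOO_LONG  /* word part does not fit in the destination */
};

/* Scans backwards over non-space characters to find the last field,
   converts it with strtol(), and copies the rest into `word`. */
enum parse_status split_line(const char *line, char *word, size_t word_size,
                             long *number)
{
    assert(line != NULL && word != NULL && number != NULL);

    const char *num = line + strlen(line);
    while (num > line && !isspace((unsigned char)num[-1]))
        num--;                        /* num -> first char of last field */
    if (num == line)
        return PARSE_NO_SPACE;

    char *end;
    long value = strtol(num, &end, 0);   /* base 0 also accepts 0x... */
    if (end == num || *end != '\0')
        return PARSE_BAD_NUMBER;

    size_t word_len = (size_t)(num - 1 - line);   /* exclude the separator */
    if (word_len >= word_size)
        return PARSE_WORD_TOO_LONG;

    memcpy(word, line, word_len);
    word[word_len] = '\0';
    *number = value;
    return PARSE_OK;
}
```

The caller can then switch on the returned status to print a specific message, and the outputs are left untouched unless the whole line parsed successfully.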
