Sunday, March 20, 2011

GCC: G++ ifstream reads extra unwanted characters

I've been having trouble with G++ 4.5.2 (GCC, via MinGW) on Windows. When reading a file, I kept getting unwanted characters that should not be there in the first place.

Given that I have a hello.txt text file that contains,
Hello World
and
Hello Universe


when opened with Notepad or any other text editor, my program built with G++ keeps giving me,
Hello World
and
Hello Universe
zpzksmsm3jk4509s


Of course, "zpzksmsm3jk4509s" is just a stand-in for the garbage data I'm getting. Most of the time, that garbage data is readable text, so I usually get something like,
Hello World
and
Hello Universe
He


So there is basically no way for me to know which part is garbage and which is still part of the file. Somehow, getting the file size doesn't seem to work correctly.

string loadFile(string p_fileName) {
  const char *l_fileName = p_fileName.c_str();
  char *bufPath = new char[strlen(l_fileName) + 1];  // +1 for the terminating '\0'
  strcpy(bufPath,l_fileName);

  ifstream file(bufPath, ios::in|ios::ate);
  ifstream::pos_type size;
  char * memblock;
  string ret;

  size = file.tellg();
  memblock = new char [size];

  file.seekg (0, ios::beg);
  file.read (memblock, size);
  file.close();

  ret = memblock;
  delete []memblock;
  delete []bufPath;

  return ret;
}


Just so you can follow me further: I am trying to make a simple interpreter. I am barely at the part where I tokenize the text content, with only a basic check of what is a command and what is not.

When I run my interpreter, I get the "expected" error, "Unknown command ` zpzksmsm3jk4509s'". But it's the "zpzksmsm3jk4509s" part of the message that I was not expecting. As I said, the text file contains nothing more than
Hello World
and
Hello Universe


It took me a couple of days to finally figure out the problem.

According to a Cygwin FAQ entry, there seems to be an oddity in how Windows deals with CR/LF.

From Notepad's perspective, the hello.txt file is only 32 bytes long. That counts Hello World (11 bytes), and (3 bytes), Hello Universe (14 bytes), plus two instances of CR and LF (thus, 4 bytes).

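G++ on Windows, on the other hand, also sees 32 bytes when asked for the file size. But! As the Cygwin FAQ page says, it seems that Windows, when a file is read in text mode, gives you each CR/LF pair as one single character. Meaning, you, as the coder, should only expect 30 characters from the read. My buffer was sized for 32, so the last 2 slots were never filled and presumably still held leftover memory. That's why G++ was giving me the extra "He" content.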
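
To make the 32 versus 30 arithmetic concrete, here is a small sketch. The helper name translatedLength is made up for illustration, and it only simulates the CR/LF folding on an in-memory string rather than going through Windows file handling.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the original loadFile): given the raw
// bytes of a file, return how many characters remain once every CR/LF
// pair is counted as a single character.
std::size_t translatedLength(const std::string& raw) {
    std::size_t length = raw.size();
    for (std::size_t i = 1; i < raw.size(); ++i) {
        if (raw[i - 1] == '\r' && raw[i] == '\n') {
            --length;  // the pair collapses into one character
        }
    }
    return length;
}
```

For the bytes "Hello World\r\nand\r\nHello Universe", the raw size is 32 while translatedLength returns 30, which matches the 2 stray characters at the end of the buffer.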

So I had to change my loadFile from reading the content all at once to reading it one character at a time. Upon reading an LF, I check whether the previous character was a CR, which gives me a CR/LF. For each CR/LF found, I reduce my expected length by one.

My code looked something like this:
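  int i = 0;
  char c;
  file.seekg (0, ios::beg);
  // Test the result of get() itself. Checking eof() before reading
  // would store one extra bogus character after the last real one.
  while (file.get(c)) {
    memblock[i] = c;
    if (memblock[i] == '\n') {
      // Collapse a CR/LF pair into a single LF.
      if (i > 0 && memblock[i-1] == '\r') {
        memblock[i-1] = '\n';
        i--;
      }
    }
    i++;
  }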


Then make sure the character after the last one read (memblock[i]) is a string-terminating "\0", since memblock is not a C++ string but a char*. That also means memblock needs room for one more char than the file size.
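
As an alternative to terminating the buffer by hand, the stream itself can report how many characters a read() really delivered through gcount(). The sketch below is not the code I ended up with. The helper name readUpTo is hypothetical, and it takes any std::istream so it can be tried without a file.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: ask for up to `expected` characters and return
// only the ones that were actually extracted from the stream.
std::string readUpTo(std::istream& in, std::size_t expected) {
    if (expected == 0) {
        return std::string();
    }
    std::vector<char> buffer(expected);
    in.read(&buffer[0], static_cast<std::streamsize>(expected));
    // gcount() is the number of characters the last read() delivered,
    // which can be smaller than the number requested.
    const std::size_t got = static_cast<std::size_t>(in.gcount());
    // Building the string from an explicit length means no '\0' is needed.
    return std::string(buffer.begin(), buffer.begin() + got);
}
```

Given a stream holding "Hello" and an expected size of 10, readUpTo returns just "Hello", so the unused tail of the buffer never leaks into the result.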

Now, you might ask: why did you get a CR and then an LF if Windows only gives you CR/LF as a single character? Well, it seems that when you read the file one character at a time, the CR is not folded into a CR/LF but arrives as a plain CR, making the next character a plain LF. You could assume that it's more of a G++ problem than a Windows problem. But as the Cygwin FAQ page suggests, it is a Windows file-handling problem.
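
For comparison, a different route from the one I took is to open the file with ios::binary, so that no CR/LF translation happens and the bytes read match the size on disk, and then fold the line endings yourself. This is only a sketch under that assumption. loadFileBinary and collapseCrLf are illustrative names, not functions from my interpreter.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical alternative to loadFile: read the raw bytes in binary mode.
// Returns an empty string if the file cannot be opened.
std::string loadFileBinary(const std::string& fileName) {
    std::ifstream file(fileName.c_str(), std::ios::in | std::ios::binary);
    if (!file) {
        return std::string();
    }
    // Copy everything up to end-of-file; no size calculation is needed.
    return std::string((std::istreambuf_iterator<char>(file)),
                       std::istreambuf_iterator<char>());
}

// Hypothetical helper: turn every CR/LF pair into a single LF.
std::string collapseCrLf(const std::string& raw) {
    std::string out;
    out.reserve(raw.size());
    for (std::size_t i = 0; i < raw.size(); ++i) {
        // Skip a CR only when an LF follows it directly.
        if (raw[i] == '\r' && i + 1 < raw.size() && raw[i + 1] == '\n') {
            continue;
        }
        out.push_back(raw[i]);
    }
    return out;
}
```

With the hello.txt from above, collapseCrLf(loadFileBinary("hello.txt")) should give the 30-character text with plain LF line endings, and because the result is a std::string there is no buffer to terminate.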