I did not think there was a difference between an InputStream read from a local file and one read from a network source (Amazon S3 in this case), so hopefully someone can enlighten me.
These programs were run on a virtual machine running CentOS 6.3. The test file in both cases is 10 MB.
Local File Code:
    InputStream is = new FileInputStream("/home/anyuser/test.jpg");

    int read = 0;
    int buf_size = 1024 * 1024 * 2;
    byte[] buf = new byte[buf_size];
    ByteArrayOutputStream baos = new ByteArrayOutputStream(buf_size);

    long t3 = System.currentTimeMillis();
    int i = 0;
    while ((read = is.read(buf)) != -1) {
        baos.write(buf, 0, read);
        System.out.println("reading for the " + i + "th time");
        i++;
    }
    long t4 = System.currentTimeMillis();
    System.out.println("Time to read = " + (t4 - t3) + "ms");
The output of this code is shown below. The loop reads 5 times, which makes sense, since the buffer is 2 MB and the file is 10 MB.
    reading for the 0th time
    reading for the 1th time
    reading for the 2th time
    reading for the 3th time
    reading for the 4th time
    Time to read = 103ms
Now here is the same code run against the same 10 MB test file, except that this time the source is Amazon S3. We do not start reading until the call that gets the stream from S3 has returned. However, this time the read loop executes well over a thousand times when it should only need 5 reads.
    InputStream is;
    long t1 = System.currentTimeMillis();
    is = getS3().getFileFromBucket(S3Path, input);
    long t2 = System.currentTimeMillis();

    System.out.print("Time to get file " + input + " from S3: ");
    System.out.println((t2 - t1) + "ms");

    int read = 0;
    int buf_size = 1024 * 1024 * 2;
    byte[] buf = new byte[buf_size];
    ByteArrayOutputStream baos = new ByteArrayOutputStream(buf_size);

    long t3 = System.currentTimeMillis();
    int i = 0;
    while ((read = is.read(buf)) != -1) {
        baos.write(buf, 0, read);
        if ((i % 100) == 0)
            System.out.println("reading for the " + i + "th time");
        i++;
    }
    long t4 = System.currentTimeMillis();
    System.out.println("Time to read = " + (t4 - t3) + "ms");
The output is as follows:
    Time to get file test.jpg from S3: 2456ms
    reading for the 0th time
    reading for the 100th time
    reading for the 200th time
    reading for the 300th time
    reading for the 400th time
    reading for the 500th time
    reading for the 600th time
    reading for the 700th time
    reading for the 800th time
    reading for the 900th time
    reading for the 1000th time
    reading for the 1100th time
    reading for the 1200th time
    reading for the 1300th time
    reading for the 1400th time
    Time to read = 14471ms
The time taken to read the stream varies from run to run. Sometimes it takes 60 seconds, sometimes 15 seconds; it is never faster than 15 seconds. The read loop still iterates 1400+ times in every run, although I think it should iterate only 5 times, as in the local file example.
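To make the symptom reproducible without S3, here is a minimal sketch of what I suspect might be going on. It relies on the documented behaviour that `InputStream.read(byte[])` may return fewer bytes than the buffer can hold. `ChunkedInputStream` is a hypothetical wrapper I made up for illustration, not part of the S3 client, and the 7 KB cap is an assumption chosen only because 10 MB spread over roughly 1400 reads works out to about 7 KB per read.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class ReadCountDemo {

    /** Hypothetical wrapper: never hands back more than maxChunk bytes per read call. */
    static class ChunkedInputStream extends FilterInputStream {
        private final int maxChunk;

        ChunkedInputStream(InputStream in, int maxChunk) {
            super(in);
            this.maxChunk = maxChunk;
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            // Ask the underlying stream for at most maxChunk bytes, whatever the caller requested.
            return super.read(b, off, Math.min(len, maxChunk));
        }
    }

    /** Same loop as in the question; returns how many read calls delivered data. */
    static int countReads(InputStream is, int bufSize) throws IOException {
        byte[] buf = new byte[bufSize];
        ByteArrayOutputStream baos = new ByteArrayOutputStream(bufSize);
        int read;
        int i = 0;
        while ((read = is.read(buf)) != -1) {
            baos.write(buf, 0, read);
            i++;
        }
        return i;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[10 * 1024 * 1024]; // stand-in for the 10 MB test file
        int bufSize = 2 * 1024 * 1024;            // same 2 MB buffer as in the question

        // In-memory source that can always fill the whole buffer.
        System.out.println("full reads:    " + countReads(new ByteArrayInputStream(data), bufSize));

        // Same data, but each read is capped at 7 KB (assumed value).
        InputStream capped = new ChunkedInputStream(new ByteArrayInputStream(data), 7 * 1024);
        System.out.println("chunked reads: " + countReads(capped, bufSize));
    }
}
```

With the capped stream the loop count is driven by how many bytes each call returns rather than by the size of `buf`, which looks a lot like what I see with S3.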
Is this how an InputStream works when the source is the network, even though the call that gets the file from the network source has already returned? Thanks in advance for your help.