Typical URL Lengths for Storage Design (URL-shortener)

After reading a few hits from a quick Google search, it seems there is not much consensus on the average length of a URL.

I know that IE has a maximum URL length of 2083 characters (from here), so I have a good upper bound to work with.

My concern is that I am writing a URL shortener in PHP (similar to some other questions on SO), and I want to make sure that it cannot exceed the storage capacity of the server on which it is hosted.

If every URL were at the IE maximum length, then 2^32 of them would not fit conveniently anywhere: it would take 2K x 4B ~= 8 TB of storage, which is an unrealistic expectation.

Without adding a pruning feature (i.e. cleaning up old "shortened URLs"), what is the safest way to estimate the application's storage usage?

Is ~34 characters a safe guess? If so, then a full database (using an int for the primary key) would consume 292 GB of space (double the 146 GB of raw URL data, to allow for any metadata that might be stored).

What is the best guess for such an application?
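The arithmetic in the question can be checked with a short script. This is only a rough sketch: it counts raw URL bytes at one byte per character and ignores row, index, and metadata overhead. The names `KEY_SPACE`, `raw_url_bytes`, and `to_gb` are made up for illustration.

```python
# Back-of-the-envelope storage estimate for a full 32-bit key space.
# Counts raw URL bytes only; database row and index overhead are ignored.

KEY_SPACE = 2 ** 32  # number of rows addressable by a 32-bit integer key


def raw_url_bytes(avg_url_length: int, rows: int = KEY_SPACE) -> int:
    """Return the bytes needed to store `rows` URLs of the given average length."""
    return avg_url_length * rows


def to_gb(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 10 ** 9


if __name__ == "__main__":
    for label, length in [("IE maximum", 2083), ("guess", 34)]:
        print(f"{label}: {to_gb(raw_url_bytes(length)):,.0f} GB")
```

With the exact 2083-character limit the worst case comes out at roughly 8.9 TB rather than the rounded 8 TB, and 34 characters per URL gives about 146 GB before doubling for metadata.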

+7
4 answers

Well, you do not need to know the average length of a URL. This is an assumption, but I believe a URL shortener is mainly used to shorten long URLs. Why bother shortening one that is already short? :)

However, there is another problem. The database will also add its own overhead, so you cannot just calculate the average and say that this is the storage size in bytes.

I wrote a URL shortener and it already contains about 45 entries. Therefore, I suggest you write yours, and by the time it actually contains 2^32 URLs, buying an 8 TB hard drive will probably no longer be a problem. ;-)

+2

This is probably unknowable without indexing the entire Internet, but according to Kelvin Tan's analysis of a data set of 6,627,999 unique URLs from 78,764 unique domains, the average is 76.97 characters.

Mean: 76.97

Standard deviation: 37.41

95% confidence interval: 157

99.5% confidence interval: 218
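To get comparable figures for your own data, something like the following sketch could be run over a list of URL lengths. It assumes the last two figures above can be read as percentiles, which is an interpretation and not something the analysis states here. The function name `summarize_lengths` is made up for illustration.

```python
import math
import statistics


def summarize_lengths(lengths: list[int]) -> dict[str, float]:
    """Return mean, sample standard deviation, and nearest-rank percentiles.

    Needs at least two values, because the sample standard deviation is
    undefined for a single data point.
    """
    ordered = sorted(lengths)

    def percentile(p: float) -> int:
        # Nearest-rank method: the smallest value such that at least
        # p percent of the data is less than or equal to it.
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]

    return {
        "mean": statistics.mean(ordered),
        "stdev": statistics.stdev(ordered),
        "p95": percentile(95),
        "p99.5": percentile(99.5),
    }
```

Calling `summarize_lengths([len(u) for u in urls])` on your own list of URLs (`urls` being whatever sample you have) gives numbers to size a column or a disk against, instead of relying on someone else's crawl.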

+20

I'm not sure what is typical, but out of the 11,000 URLs in our query database, the average length is 62 characters. We may be an exception, because every month we receive hundreds of requests from our clients for goods from Japan. Our database contains hundreds of URLs that are several hundred characters long. The longest is a link to a Google translation, at 1689 characters.

Top 10 len(producturl): 1689, 792, 707, 693, 647, 606, 574, 569, 562, 560

Sample URL of 647 characters:

http://www.amazon.co.jp/%E9%AD%94%E7%95%8C%E6%88%A6%E8%A8%98%E3%83%87%E3%82%A3%E3%82%B9%E3%82%AC%E3%82%A4%E3%82%A24-%E5%88%9D%E5%9B%9E%E9%99%90%E5%AE%9A%E7%89%88-%E5%A0%95%E5%A4%A9%E4%BD%BF%E3%83%95%E3%83%AD%E3%83%B3-%E3%83%97%E3%83%AD%E3%83%80%E3%82%AF%E3%83%88%E3%82%B3%E3%83%BC%E3%83%89%E4%BB%98%E3%81%8D%E7%89%B9%E8%A3%BD%E3%82%AB%E3%83%BC%E3%83%89-%E3%83%88%E3%83%AC%E3%83%BC%E3%83%87%E3%82%A3%E3%83%B3%E3%82%B0%E3%82%AB%E3%83%BC%E3%83%89%E3%80%8C%E3%83%B4%E3%82%A1%E3%82%A4%E3%82%B9%E3%82%B7%E3%83%A5%E3%83%B4%E3%82%A1%E3%83%AB%E3%83%84%E3%80%8D%E9%99%90%E5%AE%9APR%E3%82%AB%E3%83%BC%E3%83%89%E4%BB%98%E3%81%8D/dp/B0043RT8UO/ref=pd_rhf_p_t_1

For estimation purposes, you should extrapolate from some data set after applying the standard deviation to throw out the outliers that would otherwise distort your average.
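One way to read that advice is sketched below: compute the mean and standard deviation, drop anything further than `k` standard deviations from the mean, and average what is left. The cutoff `k = 2.0` and the name `trimmed_mean` are assumptions made for illustration, not something this answer specifies.

```python
import statistics


def trimmed_mean(lengths: list[int], k: float = 2.0) -> float:
    """Mean of the values within k sample standard deviations of the raw mean.

    Needs at least two values so that the standard deviation is defined.
    """
    mean = statistics.mean(lengths)
    stdev = statistics.stdev(lengths)
    # Keep only the lengths that are not extreme relative to the spread.
    kept = [n for n in lengths if abs(n - mean) <= k * stdev]
    return statistics.mean(kept)
```

A single pass like this is crude: a very long URL inflates the standard deviation it is being judged against, so on heavily skewed data a median or a percentile may be a more robust planning figure.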

+4

From RFC 2068 Section 3.2.1:

The HTTP protocol does not place any a priori limit on the length of a URI. Servers MUST be able to handle the URI of any resource they serve, and SHOULD be able to handle URIs of unbounded length if they provide GET-based forms that could generate such URIs. A server SHOULD return 414 (Request-URI Too Long) status if a URI is longer than the server can handle (see section 10.4.15).

Note: Servers should be cautious about depending on URI lengths above 255 bytes, because some older client or proxy implementations may not properly support these lengths.

Although IE (and probably most other browsers) supports much longer URIs, I don't think most forms or client applications rely on anything longer than 255 bytes. Your server logs should give you some statistics on the URLs you actually see.
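As a rough sketch of pulling such statistics out of a log, the snippet below extracts the request target from each line and measures its length. It assumes the log is in the Common Log Format, with the request line in double quotes. The names `REQUEST_LINE` and `request_uri_lengths` are made up for illustration, and a real log may need a different pattern.

```python
import re
from typing import Iterable

# Matches the quoted request line of a Common Log Format entry,
# e.g. "GET /some/path HTTP/1.0", and captures the request target.
REQUEST_LINE = re.compile(r'"[A-Z]+ (\S+) HTTP/[\d.]+"')


def request_uri_lengths(lines: Iterable[str]) -> list[int]:
    """Return the length of the request target for every line that parses."""
    lengths = []
    for line in lines:
        match = REQUEST_LINE.search(line)
        if match:  # silently skip lines that do not look like requests
            lengths.append(len(match.group(1)))
    return lengths
```

Counting how many of the returned lengths exceed 255 shows how often the caution in the RFC note would actually apply to your traffic.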

+3
