I have a table that collects data for web page performance. There are several machines that test multiple sites in 10 minutes, so I currently have about 700,000 lines (920 MB) with +/- 50,000 new lines per day.
Table Source:
SET SQL_MODE="NO_AUTO_VALUE_ON_ZERO"; CREATE TABLE `http_perf_raw_log` ( `run_dt` int(11) DEFAULT NULL, `dataset` varchar(64) DEFAULT NULL, `runner` varchar(64) DEFAULT NULL, `site` varchar(128) DEFAULT NULL, `machine` varchar(32) DEFAULT NULL, `called_url` varchar(1024) DEFAULT NULL, `method` varchar(8) DEFAULT NULL, `url` varchar(1024) DEFAULT NULL, `content_type` varchar(64) DEFAULT NULL, `http_code` int(11) DEFAULT NULL, `header_size` int(11) DEFAULT NULL, `request_size` int(11) DEFAULT NULL, `filetime` int(11) DEFAULT NULL, `ssl_verify_result` int(11) DEFAULT NULL, `redirect_count` int(11) DEFAULT NULL, `total_time` decimal(6,4) DEFAULT NULL, `namelookup_time` decimal(6,4) DEFAULT NULL, `connect_time` decimal(6,4) DEFAULT NULL, `pretransfer_time` decimal(6,4) DEFAULT NULL, `starttransfer_time` decimal(6,4) DEFAULT NULL, `redirect_time` decimal(6,4) DEFAULT NULL, `size_upload` int(11) DEFAULT NULL, `size_download` int(11) DEFAULT NULL, `speed_download` int(11) DEFAULT NULL, `speed_upload` int(11) DEFAULT NULL, `download_content_length` int(11) DEFAULT NULL, `upload_content_length` int(11) DEFAULT NULL, `certinfo` varchar(1024) DEFAULT NULL, `request_header` varchar(1024) DEFAULT NULL, `return_content` varchar(4096) DEFAULT NULL, `return_headers` varchar(2048) DEFAULT NULL, KEY `run_dt_idx` (`run_dt`), KEY `dataset_idx` (`dataset`), KEY `runner_idx` (`runner`), KEY `site_idx` (`site`), KEY `machine_idx` (`machine`), KEY `total_time_idx` (`total_time`) ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
To aggregate statistics (with a resolution of 1 hour) I created a view:
CREATE OR REPLACE VIEW http_perf_stats (dataset, runner, site, machine, day, hour, calls, total_time, namelookup_time, connect_time, pretransfer_time, starttransfer_time, size_download) AS SELECT dataset, runner, site, machine, DATE_FORMAT(run_dt, '%Y-%m-%d') AS day, DATE_FORMAT(run_dt, '%k') AS hour, COUNT(*) AS calls, SUM(total_time), SUM(namelookup_time), SUM(connect_time), SUM(pretransfer_time), SUM(starttransfer_time), SUM(size_download) FROM http_perf_raw_log GROUP BY runner, site, machine, day, hour ORDER BY `day` DESC
But the performance of VIEW (and the underlying SELECT) is terrible - it takes about 4 seconds.
So my questions are:
1. Uses GROUP BY in a good idea? And if not, then which alternative is better?
2. Is there (I think, yes, I am not an SQL expert: /) a way to optimize this SELECT (changing the request or the http_perf_raw_log structure)?