Abhishek Singh Bailoo edited this page Jun 18, 2015 · 3 revisions

To incorporate the Lambda architecture, we need a second storage layer besides Cassandra. This second storage should be a network file system or similar store, e.g. Amazon S3 or Hadoop's HDFS.

Each data point is stored as one line in a text file, in a CSV-like format that uses semicolons as the separator. A sample data point in the file looks like this:

N;v1.45C;1;26.25148;79.86157;0.06;2015-01-29@00:00:09;2;5;3;5;6;6;3;5;0;12.88

The data in the second storage will be used to restore the data in Cassandra if a disaster happens. It also serves as the data store for queries that are rarely executed. Frequently executed queries should be served by Cassandra.

Please use this calculator to estimate AWS pricing: http://calculator.s3.amazonaws.com/index.html#r=SIN&s=EC2&key=calc-38C1AF83-B6F8-4A9E-B427-78CEAD0F65B9

This page describes how much storage space is required to implement this system. There are two kinds of data in this system: full data and last data.

Each full data record takes around 820 bytes on Cassandra.

The following calculation is based on the assumptions:

  • 100,000 GPS devices
  • Each device sends data every 10 seconds, i.e. 6 times every minute
  • Data will be kept for 1 year (365 days)

Total data per device = 820 * 6 * 60 * 24 * 365 bytes ~ 2.6 GB, rounded up to 3 GB / year

With replication factor = 3, space for 100,000 devices = 3 GB * 100,000 * 3 = 900 TB
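The estimate above can be reproduced with a short script. This is only a sketch of the arithmetic using the figures on this page; the constant names are made up for illustration, and decimal units (1 GB = 10^9 bytes) are an assumption.

```python
# Figures taken from this page; names are illustrative only.
BYTES_PER_RECORD = 820      # compressed size of one full data record in Cassandra
RECORDS_PER_MINUTE = 6      # one record every 10 seconds
DEVICES = 100_000
REPLICATION_FACTOR = 3
GB = 10**9                  # assumption: decimal units
TB = 10**12

# Bytes written by a single device over 365 days.
per_device_year = BYTES_PER_RECORD * RECORDS_PER_MINUTE * 60 * 24 * 365

# Total using the exact per-device figure.
exact_total = per_device_year * DEVICES * REPLICATION_FACTOR

# Total using the rounded-up 3 GB per device, as on this page.
rounded_total = 3 * GB * DEVICES * REPLICATION_FACTOR

print(f"per device per year: {per_device_year / GB:.2f} GB")
print(f"exact total:         {exact_total / TB:.1f} TB")
print(f"rounded total:       {rounded_total / TB:.0f} TB")
```

Rounding 2.6 GB up to 3 GB per device is what takes the total from roughly 776 TB to 900 TB, so the 900 TB figure already includes some headroom.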

We will need to implement the Lambda architecture here. Older data will go on cheaper and slower storage and more recent data will go on costlier and faster storage.

Cost of storing 900 TB on Amazon = ?

Estimate Monthly cost of bandwidth writing 75 TB = ?

The calculation above is only a rough space estimate. Some overhead may occur, but most probably it will not be significant.

Please note that by default Cassandra already applies data compression to its tables. The size above was calculated from the size of the compressed Cassandra table.

Cassandra is best operated as a multi-node cluster, especially in production. This provides load balancing and fault tolerance. The number of nodes in a Cassandra cluster may vary from 2 to, theoretically, unlimited. For development purposes, a single-node cluster is enough. For a production environment it is recommended to have at least 3 nodes in the Cassandra cluster. This is just a rule of thumb; the general rule is to find the optimum number based on the available budget and how critical the data is.

Storing all devices' data for one year in Cassandra is not an option, since it is very large: 900 TB for 100,000 devices.

Some strategy is needed. Since 80% of all queries are for today's data, today's data needs to be in Cassandra for fast access by PHP. The one-year full data will be kept as files.

The data calculation for Cassandra:

  • 820 bytes per data point
  • Data is received every 10 seconds
  • Only the last 24 hours of data are stored in Cassandra

Total size per device for 24 hours = 820 * 6 * 60 * 24 = 7,084,800 bytes ~ 7 MB

Total size for 100,000 devices = 7 MB * 100,000 = 700 GB
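As a quick check of the 24-hour figures, here is a sketch of the same arithmetic. The replication factor of 3 is carried over from the one-year estimate earlier on this page, and decimal units are an assumption.

```python
# Figures taken from this page; names are illustrative only.
BYTES_PER_RECORD = 820
RECORDS_PER_DAY = 6 * 60 * 24     # one record every 10 seconds
DEVICES = 100_000
REPLICATION_FACTOR = 3            # carried over from the one-year estimate

per_device_day = BYTES_PER_RECORD * RECORDS_PER_DAY   # bytes per device per 24 hours
all_devices_day = per_device_day * DEVICES            # one copy of the data
replicated_day = all_devices_day * REPLICATION_FACTOR # all replicas

print(f"per device:  {per_device_day / 10**6:.2f} MB")
print(f"all devices: {all_devices_day / 10**9:.2f} GB")
print(f"replicated:  {replicated_day / 10**12:.2f} TB")
```

Without rounding, one copy of the data comes to about 708 GB, and three replicas to a little over 2.1 TB.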

The calculation from AWS: http://calculator.s3.amazonaws.com/index.html#r=SIN&s=EC2&key=calc-C537046A-8DC7-413A-9424-7C52892AADE0

There are 5 instances for EC2 and one S3 for storage.

EC2:

  • 5 m1.medium nodes for Cassandra, around 2.1 TB in total (about 3 replicas)
  • 2 c1.medium nodes (one each for PHP and Java)

S3:

  • 1 instance of S3 with 3 TB

Total cost for AWS is 759.34 USD / month.

The price includes 75 TB of inbound data transfer.

Let me know if you have other things in mind, for example if Java will not be hosted on AWS.

The strategy uses the following schema for full data:

CREATE TABLE full_data (
  imeih ascii,
  dtime timestamp,
  data ascii,
  PRIMARY KEY (imeih, dtime)
); 

insert into full_data (imeih, dtime,data)
values ('862170011627815@2015-01-29@00', 
        '2015-01-29', 
        'N;v1.45C;1;26.25148;79.86157;0.06;2015-01-29@00:00:09;2;5;3;5;6;6;3;5;0;12.88');
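The imeih value in the insert above has the shape IMEI@date@hour. A minimal sketch of how such a key could be built is shown here; the function name make_imeih is hypothetical, and which timestamp (device or server) decides the hour bucket is an assumption.

```python
from datetime import datetime

def make_imeih(imei: str, dtime: datetime) -> str:
    """Build a row key of the form '<imei>@<YYYY-MM-DD>@<HH>'.

    Assumption: the hour bucket is taken from the same timestamp
    that goes into the dtime column.
    """
    return f"{imei}@{dtime:%Y-%m-%d}@{dtime:%H}"

key = make_imeih("862170011627815", datetime(2015, 1, 29, 0, 0, 9))
print(key)
```

With this layout, one partition holds at most one hour of data for one device, which is 360 rows at one record every 10 seconds.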

Here is how it works:

  1. The primary key consists of only 2 columns. The first one is the sharding key / row key (imeih; the partition key in CQL terms). The second one is the column key (dtime; the clustering column in CQL terms). imeih is the combination of IMEI + date & hour. Sample data: '862170011627815@2015-01-29@00'. dtime is just a column with the timestamp datatype.

  2. All text datatypes are changed to ascii, since all characters in the data will be ASCII. Can you confirm, Abhishek?

  3. The data column contains the XML data. However, we compacted the XML to minimize the space used in Cassandra. Sample of full data XML: <x a="NORMAL" b="v1.45C" c="1" d="26.25148N" e="79.86157E" f="0.06" g="2015-01-29 00:00:09" h="2015-01-29 00:00:08" i="2" j="5" k="3" l="5" m="6" n="6" o="3" p="5" q="0" r="12.88"/>

    Sample of the compacted XML data, as stored in the data column: N;v1.45C;1;26.25148;79.86157;0.06;2015-01-29@00:00:09;2;5;3;5;6;6;3;5;0;12.88

Note the changes:

  • No opening and closing tags of x ('<x' and '/>')
  • No quotes (")
  • No letters like a, b, c to distinguish the elements. The elements are separated by semicolons. The order must be maintained by the Java listener and PHP
  • No device time in the data, only server time. This is because the device time is already stored as the column key
  • Message type is changed from 'NORMAL' to just 'N'
  • The letters 'N' and 'E' in lat and lon are removed. It is better to write just the coordinate
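The changes listed above can be sketched as a small conversion function. This is not the Java listener's actual code, just an illustration in Python. Several details are assumptions read off the two samples: attribute h is taken to be the dropped timestamp, the message type is shortened to its first letter, and the space in the remaining timestamp becomes '@' (visible in the sample but not in the list). Handling of 'S' and 'W' coordinates is not specified on this page and is left out.

```python
import xml.etree.ElementTree as ET

# Attribute keys in the order they appear in the compacted string.
# 'h' is skipped: it is the timestamp that is missing from the compacted sample.
KEPT_KEYS = "abcdefgijklmnopqr"

def compact(xml_text: str) -> str:
    """Convert one full data XML element into the semicolon-separated form."""
    attrs = ET.fromstring(xml_text).attrib
    fields = []
    for key in KEPT_KEYS:
        value = attrs[key]
        if key == "a":
            value = value[0]                 # 'NORMAL' -> 'N' (assumed rule)
        elif key in ("d", "e"):
            value = value.rstrip("NE")       # drop the hemisphere letter
        elif key == "g":
            value = value.replace(" ", "@")  # as seen in the compacted sample
        fields.append(value)
    return ";".join(fields)

SAMPLE_XML = (
    '<x a="NORMAL" b="v1.45C" c="1" d="26.25148N" e="79.86157E" f="0.06" '
    'g="2015-01-29 00:00:09" h="2015-01-29 00:00:08" i="2" j="5" k="3" '
    'l="5" m="6" n="6" o="3" p="5" q="0" r="12.88"/>'
)
print(compact(SAMPLE_XML))
```

Run on the sample XML, this produces the same compacted string as the sample stored in the data column.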

By applying this strategy, the size I read on Cassandra is 215 bytes. This covers the data and also the overhead. I calculated that the overhead is around 50 bytes, so the data size is roughly only 165 bytes.

I believe there are still several steps that could push the size down even further, for example:

  • Change the version inside the data ('1.45C') into a shorter value, for example 1. Later we put a reference table elsewhere to translate what 1 means as a version in the XML data
  • Change the server time in the data into the time difference (in seconds) between the device time and the server time
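The second idea can be sketched as follows. The function names are hypothetical, and the sketch assumes that both times use the 'YYYY-MM-DD HH:MM:SS' format from the XML sample and that the device time stays available (for example from the dtime column) so that the server time can be restored.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"  # timestamp format used in the XML sample

def encode_server_time(server_time: str, device_time: str) -> int:
    """Return server time as whole seconds relative to device time."""
    delta = datetime.strptime(server_time, FMT) - datetime.strptime(device_time, FMT)
    return int(delta.total_seconds())

def decode_server_time(delta_seconds: int, device_time: str) -> str:
    """Restore the server time from the stored difference."""
    restored = datetime.strptime(device_time, FMT) + timedelta(seconds=delta_seconds)
    return restored.strftime(FMT)

# The two timestamps from the XML sample; which one is the server time
# and which is the device time is an assumption here.
delta = encode_server_time("2015-01-29 00:00:09", "2015-01-29 00:00:08")
print(delta)
print(decode_server_time(delta, "2015-01-29 00:00:08"))
```

A 19-character timestamp then shrinks to a short integer whenever the two clocks are close, which is where the saving would come from.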

With the additional steps above, we can push the data size down to 144 bytes and the total size (including overhead) to 194 bytes.

However, this approach requires the Java listener and PHP to change the way they store data in, and parse data from, Cassandra.
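To illustrate what the parsing side has to do, here is a sketch of a positional parser for the compacted string. It is written in Python for illustration only and is not the actual PHP code. It maps each position back to the attribute letters of the original XML, assuming h is the attribute that was dropped.

```python
# Attribute letters of the original XML, in compacted order ('h' is not stored).
FIELD_KEYS = list("abcdefgijklmnopqr")

def parse_compact(data: str) -> dict:
    """Split a compacted record and map each position back to its letter."""
    values = data.split(";")
    if len(values) != len(FIELD_KEYS):
        # Position is the only thing that identifies a field, so a wrong
        # count means the record cannot be interpreted safely.
        raise ValueError(f"expected {len(FIELD_KEYS)} fields, got {len(values)}")
    return dict(zip(FIELD_KEYS, values))

record = parse_compact(
    "N;v1.45C;1;26.25148;79.86157;0.06;2015-01-29@00:00:09;2;5;3;5;6;6;3;5;0;12.88"
)
print(record["d"], record["e"], record["g"])
```

Because the fields are identified only by position, any change in field order or count has to be rolled out to the writer and the reader at the same time.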
