This has been posted (by me ofc) on the Java Advent Calendar originally. Check it out for more interesting articles:
http://www.javaadvent.com/2013/12/big-data-reactive-way.html
A metatrend going on in the IT industry is the shift from query-based, batch-oriented systems to (soft) realtime updated systems. While this is often associated only with financial trading, there are many other examples, such as "just-in-time" logistics systems, airlines doing realtime pricing of passenger seats based on demand and load, C2C auction systems like eBay, realtime traffic control and many more.
It is likely this trend will continue, as the (commercial) value of information is time dependent: it decreases with the age of the information.
Automated trading in the finance sector is just a forerunner in this area, because a time advantage of a few microseconds can be worth millions of dollars. It is natural that realtime processing systems evolve faster in this domain.
However, large parts of traditional IT infrastructure are not designed for reactive, event-based systems. From query-based databases to the request-response-based HTTP protocol, the common paradigm is to store and query data "when needed".
Current Databases are static and query-oriented

Current approaches to data management such as SQL and NoSQL databases focus on data transactions and static queries of data. Databases provide convenience in slicing and dicing data, but they do not support keeping the results of complex queries up to date in real time. The up-and-coming NoSQL databases still focus on computing a static result.
Databases are clearly not "reactive".
Current Messaging Products provide poor query/filtering options

Current messaging products are weak at filtering. Messages are separated into different streams (or topics), so clients can do a coarse preselection of the data they receive. However, this frequently means a client application receives roughly ten times more data than needed and has to do fine-grained filtering "on top".
A big disadvantage is that the topic approach builds filter capabilities "into" the system's data design.
For example, if a stock exchange system splits streams on a per-stock basis, a client application still needs to subscribe to all streams in order to provide a dynamically updated list of the "most active" stocks. Querying usually means "replay + search the complete message history".
I had the enjoyment of doing the conceptual and technical design of a large-scale realtime system, so I'd like to share a generic, scalable solution for continuous query processing at high volume and large scale.

A scalable, "continuous query" distributed Datagrid
It is common for realtime processing systems to be designed "event sourced": persistence is replaced by journaling transactions. System state is kept in memory; the transaction journal is required only for historic analysis and crash recovery.
Client applications do not query, but listen to event streams instead. A common issue with event-sourced systems is the "late joining client" problem: a late client would have to replay the whole system event journal in order to get an up-to-date snapshot of the system state.
In order to support late-joining clients, a kind of "Last Value Cache" (LVC) component is required. The LVC holds the current system state and allows late joiners to bootstrap by querying it.
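To make the bootstrap concrete, here is a minimal Java sketch of how a late-joining client could combine an LVC snapshot with the live event stream; all names (lvcClient, Snapshot, eventStream, localMirror) are hypothetical and only for illustration:

    // 1. query a snapshot of the current state from the LVC
    Snapshot snap = lvcClient.querySnapshot("Orders");
    // 2. remember the journal position the snapshot corresponds to
    long seqNo = snap.lastAppliedSequence();
    // 3. build the local in-memory state from the snapshot
    snap.rows().forEach(localMirror::apply);
    // 4. subscribe to the event stream, replaying only events newer than
    //    the snapshot, so no update is lost or applied twice
    eventStream.subscribeFrom("Orders", seqNo + 1, localMirror::apply);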
In a high-performance, large-data system, the LVC component becomes a bottleneck as the number of clients rises.
Generalizing the Last Value Cache: Continuous Queries
In a continuous query data cache, a query result is kept up to date automatically. Queries are replaced by subscriptions.
    subscribe * from Orders where
        symbol in ['ALV', 'BMW'] and
        volume > 1000 and
        owner = 'MyCompany'

creates a message stream which initially performs a query operation and afterwards updates the result set whenever a data change affecting the query result happens (transparently to the client application). The system ensures each subscriber receives exactly the change notifications necessary to keep its "live" query result up to date.
The data access patterns differ from those of static data management in several ways:
- High write volume: realtime systems create a high volume of write accesses/changes to the data.
- Fewer full table scans: only late-joining clients or a change of a query's condition require a full data scan. Because continuous queries make "refreshing" a query result obsolete, the read/write ratio is roughly 1:1 (if one counts the change notification resulting from a transaction as a "read access").
- The majority of the load is generated when evaluating the conditions of active continuous subscriptions on each change of data. Consider a transaction load of 100,000 changes per second with 10,000 active continuous queries: this requires 100,000 * 10,000 = 1 billion evaluations of query conditions per second. And that is still an underestimate: when a record gets updated, it must be tested whether the record matched the query condition before the update and whether it matches after the update. A record's update may therefore result in an "add" notification (the record matches after the change) or a "remove" notification (the record no longer matches after the change) for a query subscription, as the sketch below illustrates.
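To make the last bullet concrete, here is a minimal Java sketch of the before/after evaluation; the Order record mirrors the Orders table of the subscription example above, while the Notification kinds and the evaluate method are assumptions for illustration:

    import java.util.function.Predicate;

    // Hypothetical types, modeled on the Orders example above.
    record Order(String symbol, int volume, String owner) {}
    enum Notification { ADD, REMOVE, UPDATE, NONE }

    final class ContinuousQueryEval {
        // A record change must be tested against the query condition twice:
        // once with the state before the change, once with the state after it.
        static Notification evaluate(Predicate<Order> query, Order before, Order after) {
            boolean matchedBefore = before != null && query.test(before);
            boolean matchesAfter  = after != null && query.test(after);
            if (!matchedBefore && matchesAfter) return Notification.ADD;    // enters the result set
            if (matchedBefore && !matchesAfter) return Notification.REMOVE; // leaves the result set
            if (matchedBefore && matchesAfter)  return Notification.UPDATE; // stays in it, but changed
            return Notification.NONE;                                       // irrelevant to this subscriber
        }
    }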
Data Cluster Nodes ("LastValueCache Nodes")
Data is organized in tables and stored column-oriented. Each table's data is evenly partitioned among all data grid nodes (= Last Value Cache nodes, "LVC nodes"). By adding data nodes to the cluster, capacity is increased, and snapshot queries (which initialize a subscription) are sped up through increased concurrency.
There are three basic transactions/messages processed by the data grid nodes:
- AddRow(table, newRow)
- RemoveRow(table, rowId)
- UpdateRow(table, rowId, diff)
The data grid nodes provide a lambda-alike (row iterator) interface that supports iterating a table's rows with plain Java code. This can be used to perform map-reduce jobs and, as a specialization, the initial query required by newly subscribing clients. Since the ongoing evaluation of continuous queries is done in the "Gateway" nodes, the load on the data nodes correlates only weakly with the number of clients.
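A minimal Java sketch of what such a data node interface might look like; the first three methods mirror the transactions listed above, while the type names and the interface itself are assumptions for illustration:

    import java.util.function.Consumer;

    // Placeholder types, details omitted.
    record Row(Object... columns) {}
    record RowId(long id) {}
    record Diff(String column, Object newValue) {}

    // Hypothetical sketch of a data grid node's interface.
    interface DataNode {
        // the three basic transactions from above
        void addRow(String table, Row newRow);
        void removeRow(String table, RowId rowId);
        void updateRow(String table, RowId rowId, Diff diff);

        // lambda-alike row iteration: runs plain (JIT-compiled) Java code
        // against this node's partition, e.g. for the initial snapshot query
        // of a new subscription or for map-reduce style jobs
        void forEachRow(String table, Consumer<Row> action);
    }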
All transactions processed by a data grid node are (re-)broadcast as multicast "Change Notification" messages.
Gateway Nodes
Gateway nodes track the subscriptions/connections of client applications. They listen to the global stream of change notifications and check whether a change influences the result of a continuous query (= a subscription). This is very CPU-intensive.
Two things make this work:
- Because queries are defined in plain Java, query conditions profit from JIT compilation; there is no need to parse and interpret a query language. HotSpot is one of the best optimizing JIT compilers on the planet (see the sketch below).
- Since multicast is used for the stream of global changes, additional Gateway nodes can be added with virtually no impact on the throughput of the cluster.
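As a sketch of the first point, the subscribe statement from above can be expressed as a plain Java predicate that HotSpot will JIT-compile; the gateway.subscribe call is hypothetical, and Order is the record from the evaluation sketch above:

    import java.util.function.Predicate;

    // The query condition as plain Java code, no query-language interpreter involved.
    Predicate<Order> condition = o ->
            (o.symbol().equals("ALV") || o.symbol().equals("BMW"))
            && o.volume() > 1000
            && o.owner().equals("MyCompany");

    // Hypothetical Gateway API: the condition is evaluated against every change
    // notification, and only relevant changes are forwarded to the subscriber.
    gateway.subscribe("Orders", condition, change -> client.send(change));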
Processor (or Mutator) Nodes
These nodes implement logic on top of the cluster data. E.g. a statistics processor runs a continuous query for each table, incrementally counts each table's rows, and writes the results back to a "statistics" table, so a monitoring client application can subscribe to realtime data on current table sizes. Another example would be a "matcher processor" in a stock exchange: it listens to the orders for a stock, and if two orders match, it removes them and adds a trade to the "trades" table.
If one sees the whole cluster as a kind of giant spreadsheet, the processors implement the formulas of this spreadsheet.
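As an illustration, the statistics processor could look roughly like the following; the subscription and write-back calls reuse the hypothetical APIs from the sketches above:

    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical statistics processor: maintains a live row count for the
    // "Orders" table and writes it back to a "statistics" table, which
    // monitoring clients can in turn subscribe to.
    AtomicLong orderCount = new AtomicLong();
    gateway.subscribe("Orders", row -> true, change -> {
        switch (change.kind()) {               // kinds as in the evaluation sketch
            case ADD    -> orderCount.incrementAndGet();
            case REMOVE -> orderCount.decrementAndGet();
            default     -> { }                 // UPDATE leaves the count unchanged
        }
        // write the result back into the grid: the processor acts as one of
        // the "spreadsheet formulas" over the cluster data
        dataNode.updateRow("statistics", new RowId(1),
                           new Diff("rowCount", orderCount.get()));
    });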
Scaling Out
- With data size: increase the number of LVC nodes.
- With the number of clients: increase the number of subscription-processing (Gateway) nodes.
- With transactions per second (TP/S): scale up the processor nodes and LVC nodes.
Conclusion
Building realtime processing software backed by a continuous query system simplifies application development a lot.
- It is model-view-controller at large scale. Astonishingly, patterns used in GUI applications for decades have not regularly been extended to the backing data storage systems.
- Any server-side processing can be partitioned in a natural way. A processor node creates an in-memory mirror of its data partition using continuous queries; processing results are streamed back to the data grid. Computation-intensive jobs, e.g. the risk computation of derivatives, can be scaled by adding processor instances that subscribe to distinct partitions of the data ("sharding").
- The size of the code base shrinks significantly (both business logic and front end). A lot of code in handcrafted systems deals with keeping data up to date.
About me
I am a technical architect/senior developer consultant at a European company heavily involved in stock & derivative trading systems.

This post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on!