Single or multi-game database?

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Single or multi-game database?

grampajohn
Administrator

Carsten and I have had a disagreement on this topic from the beginning, and I think it's time to raise the issue to the rest of the group and make a decision. The question is whether the server's runtime database needs to hold data from multiple games. Here are the issues as I see them; I invite Carsten and others to give this issue some thought and post comments. I want a final decision before next Monday 2/14.

  • I assume data from a single game can fit in a fast in-memory database. That's not true, in general, for multi-game datasets.
  • The simulation timeline is agressive, and server latency could become a major hurdle if we are not careful. We are already using a relatively high-overhead framework and a scripting language that is 1-2 orders of magnitude slower than Java. Interaction with an external database will only add to that latency. Of course, any decent database has local cache, but that does not help much with transaction commits. There will be large numbers of commits during simulator operation, especially if we use the auditing plugin to track state changes.
  • Supporting a multi-game database requires that every data element contain a reference to the current competition, and therefore virtually all queries will have to include competition ID as a criterion, except for single-instance retrievals that use uuid values as keys.
  • We will need to do a database dump at the end of every game anyway, in order to make data available to users. It makes no sense for external users to be retrieving game data by querying the active database while a game is running.
  • Anyone who wants to build their own multi-game database can do that using any sort of data transformation they like, which might indeed include adding a competition id to the records, as well as adding indices and possibly denormalizing to enhance query performance. It is not our intent to support database queries for researchers other than for retrieval of summary statistics and access to game data dumps, and that will come from a different database.

As a result, I propose we decide that the grails database used by the simulation be scoped to the current game only. This means that the competition id does not need to be part of every record in the database, queries are simplified, and a fair amount of redundant data is eliminated (I assume there will be on the order of 10^5 records in a single competition database). This means that the database will always be empty at the start of a game, and that the database needs to be dumped and cleared at the end of each game.

There is, of course, a second database, which is durable. That is the one that holds agent registrations and competition schedules, as well as game summaries and the locations of database dumps from past games.

Please add your thoughts to this thread. It's possible I'm missing something fundamental, of course.

John

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Single or multi-game database?

chris.flath
I find John's line of argument pretty compelling. A competition-to-competition analysis would anyhow require some preprocessing and this could very well be used to merge the two databases.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Single or multi-game database?

Carsten Block
Administrator
In reply to this post by grampajohn
Hi!

Thanks John for the nice summary. I add my points in the text below.
  • I assume data from a single game can fit in a fast in-memory database. That's not true, in general, for multi-game datasets.
What is a "fast in-memory database" for you? In our current setting that's probably H2 / HSQLDB, right? Both feature  asynchronous / buffered write to disk. Also, grails comes with hibernate and its internal query optimization as well as ehcache (simply speaking a hash-map based in-memory read write cache for database queries). Both are readily configured to work nicely together in order to automatically (and transparently) take care of buffering / efficiently executing queries (read and write!) and we use this for example to efficiently pull the current competition instance from the cache (and not from the database). 

  • The simulation timeline is agressive, and server latency could become a major hurdle if we are not careful. 
I don't particularly like this line of argumentation because the claim (aggressive timeline, server latency, major hurdle) is both frightening and unproven on the one hand and it leads to potentially lots of additional work and disruptive design changes on the other before we ever really ran into any performance issue. In general I agree that we have to keep performance in mind but my claim is that we have a lot of optimization levers built in to the framework (including the mentioned query caches, the possibility to refactor components into pure Java, lazy and eager fetching modes, but also classical index optimization for the database). We should exploit all this first (and only if we actually experience performance issues in a running system). My experience from previous projects (such as eix-market.de that still runs app-sever and database on a single machine) is that the latter ("non-intrusive") optimization potential helps us a long way without changing business or domain logic at all. 

In other words I propose: 
(i) Design PowerTAC according to our "business" needs
(ii) Get it running as soon as possible, stress test it, determine its real hard limits
----
If (and only if) results from (ii) are not acceptable performance wise:
---
(iii) profile the system to see where it *really* is slow, i.e. what the bottlenecks are
(iv) optimize this part (e.g. by switching to java for a time critical component or by optimizing the queries used to retrieve data)
(v) buy faster hardware or cluster the system (e.g. db and app server on distinct machines - at least temporarily and for tournament setup)
(v) only if all previous optimization fails, change business logic and fundamental competition design     
  • We are already using a relatively high-overhead framework and a scripting language that is 1-2 orders of magnitude slower than Java.
This is true and false. 
(i) if you find that a particular component written in groovy is too slow you can simply replace it with a more efficient Java equivalent (e.g. a service class can be written under src/java and only needs to be marked with @Service annotation in order to be usable as its groovy counterpart before). But remember: That's only step (iv) in the above process.
(ii) The "high-overhead framework" comes with some nice and useful overhead such as query caching and optimistic locking for database transactions that could be readily exploited for performance tweaking.
(iii) The type of "slowness" I most often found in the code of previous projects was related to inefficient queries and missing database indexes. These can be quite easily found during code reviews and with slow query analyzers. 
  • Supporting a multi-game database requires that every data element contain a reference to the current competition, and therefore virtually all queries will have to include competition ID as a criterion, except for single-instance retrievals that use uuid values as keys.
What's the problem? We just have to set the right db indexes that include competition and there's no loss in speed at all. In my last commit I changed all domain classes to be automatically populated with the current competition id upon construction so e.g. a plugin developer never has to manually do this. For queries we could define named queries (see http://grails.org/doc/latest/ref/Domain Classes/namedQueries.html) that make the addition of competition reference transparent to developers though I don't think it's a big problem.
  • We will need to do a database dump at the end of every game anyway, in order to make data available to users. It makes no sense for external users to be retrieving game data by querying the active database while a game is running.
I suggest we simply allow the operators of powertac server instances to flexibly switch between two operating modes: 

(1) Webapp-connected mode (the powertac server periodically "calls home" to the central webapp at powertac.org): 
  (i)   The local powertac server starts up, connects to our central webapp e.g. via a REST call and asks for a competition to be executed
  (ii)  The webapp responds with a competition (basically the competition domain class generated by a competition participant through webapp's web interface that is serialized as XML)
  (iii) The local powertac server executes the competition, dumps the database into a zipped file, sends this back to the web-app and periodically keeps asking for new competitions to executed until stopped locally (the user might do that any time he wants) or provided with another competition instance to execute

(2) Standalone mode (no calling of the central webapp):
  (i)   The local powertac server and all required agents are locally started and nothing happens
  (ii)  The owner of the local server uses a minimal web interface (just scaffolded views for the competition domain class or maybe a command line parameter and a local xml file during server startup) to define one or more competitions that the server should execute.
  (iii) The local powertac server runs the competitions and stores all data in his database. 
  (iv) The server and the agents are stopped and the owner of the local server uses an off-the-shelf sql tool such as dbvisualizer to query the competition database and to execute his game analysis
  • Anyone who wants to build their own multi-game database can do that using any sort of data transformation they like, which might indeed include adding a competition id to the records, as well as adding indices and possibly denormalizing to enhance query performance. It is not our intent to support database queries for researchers other than for retrieval of summary statistics and access to game data dumps, and that will come from a different database.
There are three more important things which should also be considered before making a final decision:

1.) Random seed: 
The powertac-random plugin also uses a database table to store random seed values. This table at least has to be preserved then and there has to be some competition reference somewhere in order to determine which competition's seed to use  

2.) Multi-Competition winner determination: 
To my understanding we never finalized the discussion if and how we want to jump in competition time in order to expose participants to different seasons of a year. I really doubt that there's any simple solution to do these time jumps within a single competition. One alternative could be to establish a competition rule stating that a "PowerTAC match" consists of a set of four competitions (with each single competition happening in a different simulated period of time). The overall winner determined across all four competitions wins the match and reaches the next round of the tournament. In this situation dumping and resetting the database at the end of each single competition makes it impossible (or at least quite hard - almost nothing is impossible, right?) to determine the winner within the powertac server instance itself as it does not have information on how won before.      

3.) Central competition visualizer:
Adis and Veran are working on a central competition visualizer, which should be able to visualize multiple competitions. At this end we will need the competition reference in the domain classes anyway in order to distinguish the different instances. The current solution here is to serialize the domain instances in the server as xml, send them to the visualizer and deserialize them again over there. Not having the competition reference in the domain instances would require additional work at this end (some form of on the fly refactoring .

As a result, I propose we decide that the grails database used by the simulation be scoped to the current game only. This means that the competition id does not need to be part of every record in the database, queries are simplified, and a fair amount of redundant data is eliminated (I assume there will be on the order of 10^5 records in a single competition database). This means that the database will always be empty at the start of a game, and that the database needs to be dumped and cleared at the end of each game.

My conclusion is to not throw away the competition reference in the domain classes. The overhead from this feature is modest but provides us with the freedom to also run the server in continuous competition mode (with dump/reset as alternative mode always also being possible) and eases (i) random-seed reuse across competitions, (ii) multi-competitio winner determination, (iii) simple multi-competition visualization . Concerning the performance issues I refer to our solid system setup that allows us to do a lot of performance optimization under the hood before we have to start changing business logic. 

Ultimately, I strongly suggest not to spend time on these types of "performance triggered" refactorings unless we have *hard evidence* from stress testing a running system instance that shows us (i) that, (ii) where, and (iii) how much we have to optimize performance. Having a first version of the server running is way more important right now (and in this light, removing competition references from the domain classes and all related business logic is additional work that can be avoided right now!). Once we have the first server out there's time to do performance tweaking. 

Cheers,
Carsten
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Single or multi-game database?

asymeon
Hi everybody!

I would not argue in favor of each one of the approaches, since they both have their pros and cons.
From my point of view, though, I would prefer the multi-game database only if it provides more functionality and flexibility.
In any other case, I would go for the single-game database.

Best regards,

Andreas
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Single or multi-game database?

grampajohn
Administrator
In reply to this post by Carsten Block
Carsten Block wrote
Thanks John for the nice summary. I add my points in the text below.
And thank you, Carsten, for laying out your thoughts clearly and logically. Before I address your points individually, I'll start with the most compelling, in my opinion. The current structure makes Competition dependent on all the other persistent types -- here's the relevant line from Competition.groovy:
  static hasMany = [brokers: Broker, cashUpdates: CashUpdate, customers: CustomerInfo, meterReadings: MeterReading, orderbooks: Orderbook, positionUpdates: PositionUpdate, products: Product, shouts: Shout, tariffs: Tariff, timeslots: Timeslot, transactionLogs: TransactionLog]
The result is that Competition is involved in a circular dependency with each of the other persistent types, because they also must refer to Competition in order to do their queries. This unnecessary dependency makes it impossible to define types that are stored in the database in any plugin module besides the one that includes Competition.

If we are going to continue using the Grails plugin architecture effectively, then it seems it will be necessary to drop the competition field from database tables.

> The simulation timeline is agressive, and server latency could become a major hurdle if we are not careful.
I don't particularly like this line of argumentation because the claim (aggressive timeline, server latency, major hurdle) is both frightening and unproven on the one hand and it leads to potentially lots of additional work and disruptive design changes on the other before we ever really ran into any performance issue. ...
I also am a strong advocate of the "first make it work, then make it fast" school. But including a feature that neither adds value nor improves performance makes no sense.

In other words I propose:
(i) Design PowerTAC according to our "business" needs
I agree, but I have not yet been convinced that running the server on a multi-game database is a business need.

(ii) Get it running as soon as possible, stress test it, determine its real hard limits
----
If (and only if) results from (ii) are not acceptable performance wise:
---
(iii) profile the system to see where it *really* is slow, i.e. what the bottlenecks are
(iv) optimize this part (e.g. by switching to java for a time critical component or by optimizing the queries used to retrieve data)
Of course this is a reasonable process, but it can be extraordinarily time-consuming. Time is not something we have in surplus.

(v) buy faster hardware or cluster the system (e.g. db and app server on distinct machines - at least temporarily and for tournament setup)
A critical requirement is that the server be easily installable and usable by researchers on simple desktop machines. This means that throwing hardware at performance problems is not even a good short-term solution.

(v) only if all previous optimization fails, change business logic and fundamental competition design    
Yes, but a multi-game database is not a part of the fundamental competition design.

> We are already using a relatively high-overhead framework and a scripting language that is 1-2 orders of magnitude slower than Java.

This is true and false.
(i) if you find that a particular component written in groovy is too slow you can simply replace it with a more efficient Java equivalent (e.g. a service class can be written under src/java and only needs to be marked with @Service annotation in order to be usable as its groovy counterpart before). But remember: That's only step (iv) in the above process.
Of course, but as I said this is a lot of work. Better if we take steps now, such as removing the need for database indices, that will avoid future problems.

(ii) The "high-overhead framework" comes with some nice and useful overhead such as query caching and optimistic locking for database transactions that could be readily exploited for performance tweaking.
All this just makes the relational database nearly as fast as a solution that does not use a database at all. The simulation server does not require a relational database. We are using it because it's convenient, and because the KIT group had already started down this path. The SCM server uses its database only for agent registrations and game-summary data, not for in-game data.

(iii) The type of "slowness" I most often found in the code of previous projects was related to inefficient queries and missing database indexes. These can be quite easily found during code reviews and with slow query analyzers.
Again, these are consequences of mapping non-relational data to a relational model. They do not arise from the requirements of this project. It is indeed possible to speed up query processing by adding indices, but the tradeoff is that this slows down inserts. The simulator is an insert-intensive process, and therefore should use as few indices as possible. One way to reduce the number of indices is to omit a data element from every table that has no role at all in the simulation - it's only there to allow cross-game analysis. That sort of analysis should not be done in the simulation environment anyway.

> Supporting a multi-game database requires that every data element contain a reference to the current competition, and therefore virtually all queries will have to include competition ID as a criterion, except for single-instance retrievals that use uuid values as keys.

What's the problem? We just have to set the right db indexes that include competition and there's no loss in speed at all. In my last commit I changed all domain classes to be automatically populated with the current competition id upon construction so e.g. a plugin developer never has to manually do this. For queries we could define named queries (see http://grails.org/doc/latest/ref/Domain Classes/namedQueries.html) that make the addition of competition reference transparent to developers though I don't think it's a big problem.
The problem is additional indices and additional complexity that adds no value.

> We will need to do a database dump at the end of every game anyway, in order to make data available to users. It makes no sense for external users to be retrieving game data by querying the active database while a game is running.

I suggest we simply allow the operators of powertac server instances to flexibly switch between two operating modes:

(1) Webapp-connected mode (the powertac server periodically "calls home" to the central webapp at powertac.org):
  (i)   The local powertac server starts up, connects to our central webapp e.g. via a REST call and asks for a competition to be executed
  (ii)  The webapp responds with a competition (basically the competition domain class generated by a competition participant through webapp's web interface that is serialized as XML)
  (iii) The local powertac server executes the competition, dumps the database into a zipped file, sends this back to the web-app and periodically keeps asking for new competitions to executed until stopped locally (the user might do that any time he wants) or provided with another competition instance to execute

(2) Standalone mode (no calling of the central webapp):
  (i)   The local powertac server and all required agents are locally started and nothing happens
  (ii)  The owner of the local server uses a minimal web interface (just scaffolded views for the competition domain class or maybe a command line parameter and a local xml file during server startup) to define one or more competitions that the server should execute.
  (iii) The local powertac server runs the competitions and stores all data in his database.
  (iv) The server and the agents are stopped and the owner of the local server uses an off-the-shelf sql tool such as dbvisualizer to query the competition database and to execute his game analysis
I agree with #1 here, but in #2 you are putting too much responsibility on the server, in my opinion. We run SCM experiments frequently that involve more than 100 simulations. We can run as many as 3 simulations in parallel, using whatever machines are available from a large collection of lab machines and a few spare VMs. Games are started whenever there are enough free machines (it takes 7 machines for an SCM simulation) to run one. We have a simple php web-app that lets you set up your experiments and retrieve the results, and a large collection of analysis tools to extract data from the logfiles and do statistical analysis. Some of those require dumping the logs into a database and running queries, but most do not, and they run much faster than the identical analysis done by running a database query and analyzing the results.

> Anyone who wants to build their own multi-game database can do that using any sort of data transformation they like, which might indeed include adding a competition id to the records, as well as adding indices and possibly denormalizing to enhance query performance. It is not our intent to support database queries for researchers other than for retrieval of summary statistics and access to game data dumps, and that will come from a different database.

There are three more important things which should also be considered before making a final decision:

1.) Random seed: ...

2.) Multi-Competition winner determination: ...

3.) Central competition visualizer: ...
I agree with these requirements, but I do not see that any of them will require the competition to be a part of every database table.

...
Ultimately, I strongly suggest not to spend time on these types of "performance triggered" refactorings unless we have *hard evidence* from stress testing a running system instance that shows us (i) that, (ii) where, and (iii) how much we have to optimize performance. Having a first version of the server running is way more important right now (and in this light, removing competition references from the domain classes and all related business logic is additional work that can be avoided right now!). Once we have the first server out there's time to do performance tweaking.
It's a modest amount of extra work either way. Sure, it's a bit of work to go through the existing classes and remove the competition reference. But it's also extra work to include it in new types, because it complicates testing, and because the competition object ends up depending on all the other types, as I pointed out at the beginning.

I think it's time for me to make an executive decision that the competition will not be a required element in our database tables.

Thanks for your thoughtful input.

John
Loading...