Slides from my talk at CloudTech III in early October 2012:
As more startups use Amazon Web Services, the following scenario becomes increasingly frequent – the startup is acquired but required by the parent company to move away from AWS and into the parent's own data centers. Given the all-encompassing nature of AWS, this is not a trivial task and requires careful planning at both the application and systems level. In this presentation, I recount my experiences at Delve, a video publishing SaaS platform, with our post-acquisition migration to Limelight Networks, a global CDN, during a period of tremendous growth in traffic. In particular, I share some of the tips/techniques we employed during this process to reduce AWS dependence and evolve to a hybrid private/AWS global architecture that allowed us to compete effectively with other digital video leaders.
Last month, Lenz Grimmer, Community Manager at MySQL (now Oracle, I suppose), gave an overview of a number of MySQL HA techniques at the SF MySQL Meetup group. My notes from that talk:
MySQL is not trying to be an Oracle replacement; rather, the goal is to make it better for its own specific needs and requirements.
HA Concepts and Considerations:
- something can and will fail
- can’t afford downtime, e.g. for maintenance
- adding HA to an existing system is complex (Note: from my experience, this is definitely the case with MySQL!)
- a heartbeat checker is definitely necessary to check whether services are still alive
- components: individual servers, services, network etc
- HA monitoring
- have to be able to add and remove services
- have to allow shutdown/startup operations on services
- have to allow manual control if necessary
- Shared storage/replication
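To make the heartbeat and monitoring points above a little more concrete, here is a minimal Python sketch. It is not from the talk: the class name, method names and the timeout value are all hypothetical, and a real checker would probe services over the network rather than rely on callers reporting in.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Tracks the last time each registered service reported a heartbeat."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the logic can be tested
        self.last_seen: Dict[str, float] = {}

    def add_service(self, name: str) -> None:
        # A newly added service gets a full timeout window to report in.
        self.last_seen[name] = self.clock()

    def remove_service(self, name: str) -> None:
        # Services must be removable, e.g. for planned maintenance.
        self.last_seen.pop(name, None)

    def beat(self, name: str) -> None:
        # Record a successful heartbeat for a known service.
        if name in self.last_seen:
            self.last_seen[name] = self.clock()

    def dead_services(self) -> List[str]:
        # Any service silent for longer than the timeout is presumed dead.
        now = self.clock()
        return sorted(name for name, seen in self.last_seen.items()
                      if now - seen > self.timeout_s)
```

Deciding what to do about a dead service (automatic failover or a page to a human) is deliberately left out, since the notes above call for manual control to remain possible.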
One of the possible failure scenarios for a distributed system is the split-brain syndrome, whereby communication failures lead to cluster partitioning. Further problems can ensue when each partition tries to take control of the entire cluster. Approaches such as fencing or moderation/arbitration are needed to avoid this.
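One common form of arbitration is a majority quorum: a partition may only act as the cluster if it can see more than half of the nodes. The sketch below illustrates that idea only; it is an assumption on my part rather than something shown in the talk, and the function name is hypothetical.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """Return True only if this partition holds a strict majority.

    `reachable_nodes` counts the local node plus every peer it can reach.
    """
    if cluster_size <= 0 or not 0 <= reachable_nodes <= cluster_size:
        raise ValueError("node counts are inconsistent")
    return reachable_nodes > cluster_size // 2
```

With an even number of nodes, a clean half-and-half split leaves neither side with a majority, so both sides stand down. That is one reason odd cluster sizes or an external tie-breaker are often preferred.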
Some notes on MySQL replication:
- replicates the changes made on the master, using either a statement-based or a row-based approach
- can be asynchronous so slaves can lag
- new in MySQL 5.5 – semi-sync replication
- not fully synchronous, but replication is included in the transaction, i.e. the transaction will not proceed until the master receives an OK from at least one slave
- master maintains binary log and index
- replication on slave is single threaded i.e. parallel transactions are serialized
- there is no automated fail-over
- a new feature in 5.5 – replication heartbeat
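The semi-sync behaviour described in the list above can be modelled in a few lines of Python. This is a toy model, not MySQL's implementation: slaves are represented by hypothetical callables that return True when they acknowledge an event, and the function name is made up for illustration.

```python
from typing import Callable, List


def semi_sync_commit(event: str,
                     binlog: List[str],
                     slaves: List[Callable[[str], bool]]) -> bool:
    """Write to the master's binary log, then wait for one slave's ack."""
    binlog.append(event)  # the master always records the event first
    for send_to_slave in slaves:
        if send_to_slave(event):
            # One acknowledgement is enough; the others may still lag.
            return True
    # Nobody acknowledged: the caller decides whether to block or degrade.
    return False
```

Note that an acknowledgement only means a slave has received the event, not that it has applied it, so reads from slaves can still be stale.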
The master-master configuration is not suitable for write load balancing. Don’t write to both masters at the same time; use sharding/partitioning instead, e.g. auto-increment columns are a PITA when both masters accept writes (audience query).
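To see why auto-increment columns are awkward with two writable masters, it helps to look at the usual workaround: MySQL's `auto_increment_increment` and `auto_increment_offset` system variables, which give each master its own interleaved ID sequence. The Python below only simulates the resulting sequences; the helper name is hypothetical and nothing here talks to a real server.

```python
from typing import List


def auto_increment_ids(offset: int, increment: int, count: int) -> List[int]:
    """IDs one master would hand out for a given offset and shared increment."""
    return [offset + i * increment for i in range(count)]


# Two masters sharing increment=2 but using different offsets never collide.
master_a = auto_increment_ids(offset=1, increment=2, count=4)
master_b = auto_increment_ids(offset=2, increment=2, count=4)
```

This avoids duplicate keys, but it does nothing for conflicting updates to the same row, which is why the advice above is still to write to one master at a time.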
Disk replication is another HA technique. This is not MySQL-level replication. Instead, files are replicated to another disk via block-level replication. DRBD (Distributed Replicated Block Device) is one such technology. Some features:
- raid-1 over network
- synchronous/async block replication
- automatic resync
- application agnostic since operating at the disk level
- can mask local I/O issues
By default, DRBD operates in an active-passive configuration such that the block device on the second system isn’t accessible. DRBD has since changed to allow writes on the passive device as well, but this only really works with a clustered file system such as GFS or OCFS2 underneath. Even so, it remains a dangerous practice.
When using DRBD with MySQL:
- really bad with MyISAM tables since the replication occurs at the block level. Failover could lead to an inconsistent file system, so an integrity check and repair would be required. Hence, only a journaled file system should be used with DRBD. Also, InnoDB is more easily repaired than MyISAM.
- MySQL server runs only on primary drbd node, not on secondary.
Instead of replication, another possibility is to use a storage area network to secure your data. However, in that case, the SAN can become a single point of failure. In addition, following a switchover, new MySQL instances can have cold caches – since they have not had time to warm up yet.
MySQL Cluster technology, on the other hand, is not good with cross-table JOINs. In addition, owing to its architecture, it may not be suitable for all applications.
A number of companies that are active in the space were mentioned in the talk. One such example is Galera, from the Finnish company Codership, which provides its own take on MySQL replication. Essentially, they have produced a patch for InnoDB as well as an external library. This allows single- or multi-master setups as well as multicast-based replication.
From Edd Dumbill at O’Reilly Radar come some nice thoughts on key data trends for 2011. First, the emergence of a data marketplace:
Marketplaces mean two things. Firstly, it’s easier than ever to find data to power applications, which will enable new projects and startups and raise the level of expectation—for instance, integration with social data sources will become the norm, not a novelty. Secondly, it will become simpler and more economic to monetize data, especially in specialist domains.
The knock-on effect of this commoditization of data will be that good quality unique databases will be of increasing value, and be an important competitive advantage. There will also be key roles to play for trusted middlemen: if competitors can safely share data with each other they can all gain an improved view of their customers and opportunities.
A number of companies are emerging that crawl the general web, Facebook and Twitter to extract raw data, process/cross-reference that data and sell access. The article mentions Infochimps and Gnip. Other practitioners include BackType, Klout, RapLeaf etc. Their success indicates a growing hunger for this type of information. I am definitely seeing this need where I am currently. Limelight, by virtue of its massive CDN infrastructure and customers such as Netflix, collects massive amounts of user data. Such data could greatly increase in value when cross-referenced against other databases which provide additional dimensions such as demographic information. This is something that might best be obtained from some sort of third party exchange.
Another trend that seems familiar is the rise of real time analytics:
This year’s big data poster child, Hadoop, has limitations when it comes to responding in real-time to changing inputs. Despite efforts by companies such as Facebook to pare Hadoop’s MapReduce processing time down to 30 seconds after user input, this still remains too slow for many purposes.
It’s important to note that MapReduce hasn’t gone away, but systems are now becoming hybrid, with both an instant element in addition to the MapReduce layer.
The drive to real-time, especially in analytics and advertising, will continue to expand the demand for NoSQL databases. Expect growth to continue for Cassandra and MongoDB. In the Hadoop world, HBase will be ever more important as it can facilitate a hybrid approach to real-time and batch MapReduce processing.
Having built Delve’s (near) real time analytics last year, I am familiar with the pain points of leveraging Hadoop to fit into this kind of role. In addition to NoSQL-based solutions, I’d note that other approaches are emerging:
It’s interesting to see how a new breed of companies has evolved from treating their actual code as a valuable asset to giving away their code and tools and treating their data (and the models they extract from that data) as major assets instead. With that in mind, I would add a third trend to this list: the rise of cloud based data processing. Many of the startups in the data space use Amazon’s cloud infrastructure for storage and processing. Amazon’s Elastic MapReduce, which I’ve written about before, is a very well put together and stable system that obviates the need to maintain a continuously running Hadoop cluster. Obviously, not every application fits these criteria, but for one that does, it can be very cost effective.
A key advantage associated with cloud computing is that of scalability. Theoretically, it should be easy to provision new machines or decommission older ones from an existing application and scale thus. In reality, things are not so simple. The application has to be suitably structured from the ground up in order to best leverage this feature. Merely adding more CPUs or storage will not deliver linear performance improvements unless the application was explicitly designed with that goal. Most legacy systems were not and, consequently, as traffic and usage grow, must be continually monitored and patched to keep performing at an acceptable level. This is not optimal. Consequently, extracting maximum utility from the cloud requires that applications follow a set of architectural guidelines. Some thoughts on what those should be:
Stateless, immutable components
An important guideline for linear scalability is to have relatively lightweight, independent stateless processes which can execute anywhere and run on newly deployed resources (threads/nodes/CPUs) as appropriate in order to serve an increasing number of requests. These services share nothing with others, merely processing asynchronous messages. At Delve, we make extensive use of this technique for multimedia operations such as thumbnail extraction, transcoding and transcription that fit well into this paradigm. Scaling for these services involves spinning up, automatically configuring and deploying additional dedicated instances which can be put to work immediately and subsequently taken down once they are no longer needed. Without planning for this type of scenario, however, it is difficult for legacy applications to leverage this type of functionality.
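As a rough illustration of the share-nothing pattern, the Python sketch below runs the same stateless worker in a pool of any size, fed by a message queue. It is not Delve's code: the function names are hypothetical, threads stand in for separate instances, and the "thumbnail" step is a placeholder for real work.

```python
import queue
import threading
from typing import Iterable, List


def make_thumbnail(video_id: str) -> str:
    # Placeholder for real work; the result depends only on the message.
    return f"thumb-{video_id}.jpg"


def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    # Workers keep no state between messages, so any worker can take any job.
    while True:
        message = inbox.get()
        if message is None:  # sentinel: no more work for this worker
            break
        outbox.put(make_thumbnail(message))


def run_pool(messages: Iterable[str], n_workers: int) -> List[str]:
    inbox: queue.Queue = queue.Queue()
    outbox: queue.Queue = queue.Queue()
    threads = [threading.Thread(target=worker, args=(inbox, outbox))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for m in messages:
        inbox.put(m)
    for _ in threads:
        inbox.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    results = []
    while not outbox.empty():
        results.append(outbox.get())
    return sorted(results)
```

Because the output is the same whether one worker or ten are running, capacity can be added or removed without touching the application logic.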
Reduced reliance on relational databases
Relational databases are primarily designed for managing updates and transactions on a single instance. They scale well, but usually on a single node. When the capacity of that single node is reached, it is necessary to scale out and distribute the load across multiple machines. While there are best practices such as clustering, replication and sharding to allow this type of functionality, they have to be incorporated into the system design from the beginning for the application to benefit. Moving into the cloud does not get rid of this problem.
Furthermore, even if these techniques are utilized by the application, their complexity makes it very difficult to scale to hundreds or thousands of nodes, drastically reducing their viability for large distributed systems. Legacy applications are more likely to be reliant on relational databases and moving the actual database system to the cloud does not eliminate any of these issues.
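To show why sharding has to be designed in rather than bolted on, here is a minimal sketch of hash-based shard routing in Python. The function name and key format are hypothetical, and simple modulo hashing is only one of several possible schemes.

```python
import hashlib


def shard_for(key: str, n_shards: int) -> int:
    """Map a key to a shard number in the range [0, n_shards)."""
    if n_shards <= 0:
        raise ValueError("n_shards must be positive")
    # A stable hash is needed: every application server must agree on the
    # routing, and Python's built-in hash() is randomized per process.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards
```

Every query path in the application has to call something like this before it touches the database, and changing `n_shards` later remaps most keys to different shards, which is a large part of the complexity described above.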
Alternatively, applications designed for the cloud have the opportunity to leverage a number of cloud based storage systems to reduce their dependence on RDBMS systems. For example, we use Amazon’s SimpleDB as the core of a persistent key/value store instead of MySQL. Our use case does not require relational database features such as joining multiple tables. However, scalability is essential and SimpleDB provides a quick and uncomplicated way for us to implement this feature. Similarly, we use Amazon’s Simple Storage Service (S3) to store Write Once Read Many data such as very large video file backups and our analytics reports. Both of these requirements, were we to use MySQL like many legacy applications providing similar functionality, would require a heavy initial outlay of nodes and management infrastructure. By using SimpleDB and S3, we are able to provide functionality comparable to or better than legacy systems at lower cost.
There are caveats, however, with using NoSQL-type systems. They have their own constraints and using them effectively requires understanding those limitations. For example, S3 works under a version of the eventual consistency model which does not provide the same guarantees as a standard file system. Treating it as such would lead to problems. Similarly, SimpleDB provides limited database functionality – treating it as a MySQL equivalent would be a mistake.
Integration with other cloud based applications
A related advantage of designing for the cloud is the ability to leverage systems offered by the cloud computing provider. In our case, we extensively use Amazon’s Elastic MapReduce (EMR) service for our analytics. EMR, like Amazon’s other cloud offerings, is a pay as you go system. It is also tightly coupled with the rest of Amazon’s cloud infrastructure such that transferring data within Amazon is free. At periodic intervals, our system spins up a number of nodes within EMR, transfers data from S3, performs computations, saves the results and tears down the instances. The functionality we thus achieve is similar to constantly maintaining a large dedicated map-reduce cluster, such as would be required by a legacy application, but at a fraction of the cost.
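For readers unfamiliar with the computation model behind a service like EMR, the following is a tiny in-process map/reduce sketch in Python. It is purely illustrative: the log format (`video_id,event`) and function names are assumptions, and it does not use EMR or Hadoop at all.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def map_phase(log_lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    # Emit one (video_id, 1) pair per log line, ignoring the event type.
    for line in log_lines:
        video_id, _event = line.split(",", 1)
        yield video_id, 1


def reduce_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    # Sum the counts for each key, as a reducer would per key group.
    totals: Dict[str, int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)
```

Since both phases only need their input records, the job can run on a cluster that exists just for the duration of the computation, which is what makes the spin-up and tear-down cycle described above practical.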
Deploying an application to the cloud demands special preparation. Cloud machines are typically commodity hardware – preparing an environment able to run the different types of services required by an application is time consuming. In addition, deployment is not a one time operation. The service may need additional capacity to be added later, fast. Consequently, it is important to be able to quickly commission, customize and deploy a new set of boxes as necessary. Existing tools do not provide the functionality required. As cloud computing is relatively new, tools to deploy and administer in the cloud are similarly nascent and must be developed in addition to the actual application. Furthermore, developing, using and maintaining such tools requires skills typically not found in the average sysadmin. The combination of tools and personnel required to develop and run them poses yet another hurdle for moving existing applications to the cloud. For new applications, these considerations must be part of any resource planning.
Typically, cloud infrastructure providers do not guarantee uptime for a node. This implies a box can go down at any time. Additionally, providers such as Amazon will provide a certain percentage uptime confidence level for data centers. While in reality nodes are usually stable, an application designed for the cloud has to have redundancy built in such that a) backups are running and b) they are running in separate data centers. These backup systems also must meet other application requirements such as scalability. Their data must also be in sync with that of the primary stores. Deploying and coordinating such a system imposes additional overhead in terms of design, implementation, deployment and maintenance, particularly when relational databases are involved. Consequently, applications designed from the ground up with these constraints in mind are much more likely to have an easier transition to the cloud.
MongoDB, Tokyo Cabinet, Project Voldemort … Systems that provide distributed, persistent key value store functionality are proliferating like kudzu. Sometimes it seems that not a day goes by without my hearing about a new one. Case in point: just now, while browsing, I came across Riak, a “decentralized key-value store, a flexible map/reduce engine, and a friendly HTTP/JSON query interface to provide a database ideally suited for Web applications.”
I understand the motivations behind the NoSQL movement: one of them has to be backlash at the problems associated with MySQL. It’s one of those beasts that is very easy to get started on but, if you don’t start with the right design decisions and growth plan, can be hellacious to scale and maintain. Sadly, this is something I’ve seen happen repeatedly throughout my tenure at various companies. Small wonder then that developers and companies have identified one of the most frequent use cases with modern web applications and MySQL – that of needing to quickly look up key value pairs reliably – and have built or are building numerous pieces of software optimized for doing precisely that at very high performance levels.
The trouble is if you need one yourself. Which one to pick? There are some nice surveys out there (here’s one from highscalability with many good links) but most are in various stages of development, usually with version numbers less than 1 and qualifiers like “alpha” or “beta” appended. Some try to assuage your fears:
Version 0.1? Is it ready for my production data?
That kind of decision depends on many factors, most of which cannot be answered in general but depend on your business. We gave it a low version number befitting a first public appearance, but Riak has robustly served the needs of multiple revenue-generating applications for nearly two years now.
In other words, “we’ve had good experiences with it but caveat emptor. You get what you pay for.”
This is why I really enjoyed the following entry in BJ Clark’s recent survey:
type: key/value store
Conclusion: Scales amazingly well
You’re probably all like “What?!?”. But guess what, S3 is a killer key/value store. It is not as performant as any of the other options, but it scales *insanely* well. It scales so well, you don’t do anything. You just keep sticking shit in it, and it keeps pumping it out. Sometimes it’s faster than other times, but most of the time it’s fast enough. In fact, it’s faster than hitting mysql with 10 queries (for us). S3 is my favorite k/v data store of any out there.
I couldn’t agree more. Recently, I finished a major project at Delve (and I hope to write more about this later) where one of our goals was to keep all the reports we compute for our customers available indefinitely. Our current system stores all the reports in, you guessed it, MySQL. The trouble is that this eats up MySQL resources and, since we don’t run any queries on these reports, we are in essence simply using MySQL as a repository. By moving our reporting storage to S3 (and setting up a simple indexing scheme to list and store lookup keys), we have greatly increased the capacity of our current MySQL installation and are now able to keep and look up reports for customers indefinitely. We are reliant on S3 lookup times and availability – but, for this use case, the former is not as big an issue, and having Amazon take care of the latter frees us to worry about other pressing problems, which are fairly plentiful at a startup!
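For the curious, the general shape of a "reports in a key/value store plus a simple index" scheme might look like the Python sketch below. This is a simplified illustration and not our production code: the class name, key layout and JSON encoding are all assumptions, and a plain dict stands in for the S3 bucket so the example runs anywhere.

```python
import json
from typing import Dict, List


class ReportStore:
    """Stores reports under predictable keys, with a per-customer index."""

    def __init__(self, blobs: Dict[str, str]):
        # `blobs` stands in for a bucket: string keys mapped to string bodies.
        self.blobs = blobs

    def put_report(self, customer: str, report_date: str, report: dict) -> str:
        key = f"reports/{customer}/{report_date}.json"
        self.blobs[key] = json.dumps(report)
        # Keep a small index object listing every report key for the customer.
        index_key = f"index/{customer}.json"
        index = json.loads(self.blobs.get(index_key, "[]"))
        if key not in index:
            index.append(key)
        self.blobs[index_key] = json.dumps(index)
        return key

    def list_reports(self, customer: str) -> List[str]:
        return json.loads(self.blobs.get(f"index/{customer}.json", "[]"))

    def get_report(self, key: str) -> dict:
        return json.loads(self.blobs[key])
```

One caveat with this sketch is that the index update is a read-modify-write, so two writers updating the same customer at once could lose an entry. A single writer per customer, or an index kept elsewhere, would be needed in practice.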