P.Linux Laboratory

DIY Multi-Master Replication

February 14th, 2012 | Posted by P.Linux | Filed under Uncategorized

First published at: http://www.mysqlops.com/2012/02/14/diy_multi_master_replication.html

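To this day, MySQL only allows a slave to replicate from a single master. Topologies such as one master with several slaves (M->S) or dual-master replication (M<->M) are possible, but the limitation is still severe.
For example, we recently needed to build an online delayed backup for a production cluster, that is, to hang an extra set of slaves off the production dual-master cluster in case both the master and the standby of an important cluster go down. With MySQL's current architecture, the only way to build such a backup is to start one instance per data set: if 128 instances serve production traffic, I need another 128 instances to replicate from them, and the management cost is huge.
We previously had a solution based on a Perl script (described in an earlier article). Its biggest problem is that it is awkward to manage: there is nothing to monitor, the script cannot be stopped at will, and so on. Filling in those gaps would take so much code that it would practically amount to reimplementing MySQL replication, so it is better to reuse MySQL's own replication management and implement multi-master inside MySQL.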

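Reading the source shows that MySQL manages each replication channel through a Master_info class (defined in sql/rpl_mi.h). The functions start_slave/change_master/stop_slave/show_slave/end_slave all take a Master_info pointer, which makes the multi-master change much easier: basically we only need to pass in the Master_info that belongs to each replication channel.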

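Besides finding the function entry points, the grammar also has to support multiple masters; otherwise statements such as CHANGE MASTER TO cannot address a particular channel. I modified sql_yacc.yy to support the following syntax:
CHANGE MASTER 'channel_name' TO, START SLAVE 'channel_name', STOP SLAVE 'channel_name', SHOW SLAVE 'channel_name' STATUS.
With that, the syntax supports multiple masters.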

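The next question is how to store the information for several channels. In the default single-channel case, master.info stores the master's information and relay-log.info stores the state of replication apply, so the file names have to change as well. My approach is to append the channel name as a suffix to master.info and relay-log.info: a channel named "plx" is stored as master.info.plx and relay-log.info.plx. Relay logs carry a sequence number, so "-channel_name" is inserted before the sequence number.
Another issue is that every command identifies a channel by its channel name, so the names of the channels in use must be persisted, and once a channel is created its Master_info must be retrievable by name. So I added a MASTER_INFO_INDEX class (in sql/rpl_mi.h) containing a hash table that maps channel names to Master_info pointers, plus the IO_CACHE needed for persistence; the existing channel names are stored in the file master.info.index.
Example file names: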

-rw-rw---- 1 mysql mysql 10 Feb 13 20:40 master.info.index
-rw-rw---- 1 mysql mysql 76 Feb 14 17:27 master.info.plx1
-rw-rw---- 1 mysql mysql 71 Feb 14 17:27 master.info.plx2
-rw-rw---- 1 mysql mysql 90 Feb 14 17:25 relay-log.info.plx1
-rw-rw---- 1 mysql mysql 90 Feb 14 17:27 relay-log.info.plx2

-rw-rw---- 1 mysql mysql 160 Feb 14 10:16 mysql-relay-bin-plx1.000011
-rw-rw---- 1 mysql mysql 83765425 Feb 14 17:27 mysql-relay-bin-plx1.000012
-rw-rw---- 1 mysql mysql 106 Feb 14 10:16 mysql-relay-bin-plx1.index
-rw-rw---- 1 mysql mysql 160 Feb 14 10:16 mysql-relay-bin-plx2.000014
-rw-rw---- 1 mysql mysql 83455792 Feb 14 17:27 mysql-relay-bin-plx2.000015
-rw-rw---- 1 mysql mysql 106 Feb 14 10:16 mysql-relay-bin-plx2.index
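
The naming scheme in the listing above can be sketched as a few small helpers. This is only an illustration of the convention, not code from the patch: the function names are hypothetical, and treating an empty channel name as the default single-channel case is an assumption.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helpers (not the patch's actual functions) that reproduce
// the per-channel file naming convention described in the post.

// master.info -> master.info.<channel>
// Assumption: an empty channel name means the default, unnamed channel.
std::string master_info_file(const std::string &channel) {
  std::string name = "master.info";
  if (!channel.empty()) name += "." + channel;
  return name;
}

// relay-log.info -> relay-log.info.<channel>
std::string relay_log_info_file(const std::string &channel) {
  std::string name = "relay-log.info";
  if (!channel.empty()) name += "." + channel;
  return name;
}

// Relay logs carry a six-digit sequence number, so "-<channel>" is placed
// before it: mysql-relay-bin.000011 -> mysql-relay-bin-<channel>.000011
std::string relay_log_file(const std::string &base,
                           const std::string &channel, unsigned seq) {
  assert(seq <= 999999);  // keep the sequence within six digits
  char suffix[16];
  std::snprintf(suffix, sizeof(suffix), ".%06u", seq);
  std::string name = base;
  if (!channel.empty()) name += "-" + channel;
  return name + suffix;
}
```

For the channel "plx1", these helpers yield master.info.plx1, relay-log.info.plx1 and mysql-relay-bin-plx1.000011, matching the files shown above.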

Download the patch here: http://bugs.mysql.com/file.php?id=18020

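What can we do once we have multi-master? Here are two use cases.
The first is one backup instance for many masters. Because of our sharding strategy, a cluster consists of many instances, each holding a few schemas, and no schema is ever duplicated. For example, the first instance holds schemas 1 to 3 and the second holds schemas 4 to 6, so applying their binlogs to the same instance does not cause data conflicts. This is the online backup scheme we are testing.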

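The second is cross-datacenter HA. For disaster recovery or lower latency, many companies deploy databases in different datacenters, which brings up data synchronization. To make sure data generated in each datacenter does not conflict, we generally use the auto_increment_increment and auto_increment_offset parameters to control the step. With dual masters, for example, we configure the master to generate odd IDs and the standby to generate even IDs, so that even if a little binlog has not been applied at switchover time, no data conflict results. Across datacenters, say each of two datacenters has a dual-master pair and the two datacenters must also stay in sync: this used to require a third-party script or program. With multi-master, the two pairs can be connected directly and, with the step set to 4, every datacenter keeps its dual-master HA while data is synchronized between the datacenters.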
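
To see why a step of 4 keeps four masters from colliding, here is a minimal sketch of the ID sequences produced by auto_increment_increment and auto_increment_offset. The function names are hypothetical, and assigning offsets 1 to 4 to the four masters is an assumption; the post only states that the step is 4.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the first `count` IDs a server generates on an empty table with
// the given auto_increment_offset and auto_increment_increment:
// offset, offset + increment, offset + 2 * increment, ...
std::vector<std::uint64_t> generated_ids(std::uint64_t offset,
                                         std::uint64_t increment,
                                         std::size_t count) {
  assert(increment >= 1 && offset >= 1 && offset <= increment);
  std::vector<std::uint64_t> ids;
  ids.reserve(count);
  for (std::size_t i = 0; i < count; ++i) {
    ids.push_back(offset + i * increment);
  }
  return ids;
}

// Two servers that share the same increment can never generate the same ID
// as long as their offsets differ modulo that increment.
bool never_collide(std::uint64_t offset_a, std::uint64_t offset_b,
                   std::uint64_t increment) {
  assert(increment >= 1);
  return offset_a % increment != offset_b % increment;
}
```

With an increment of 4, offset 1 gives 1, 5, 9, ... and offset 2 gives 2, 6, 10, ..., and so on for offsets 3 and 4, so the four masters write disjoint ID sequences.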

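Known limitations:
1. I have not implemented the RESET SLAVE 'channel_name' command yet, so a replication channel cannot be reset and can only be changed with CHANGE MASTER. It is not that it cannot be done; we simply do not need it for now, and I will look at this detail once things are stable.
2. Data conflicts are not detected. This cannot be solved: I simply call the slave startup function to start several replication threads, and each fetches binlog and applies it locally. A data conflict cannot be detected in advance and is only reported when the conflicting event is executed. You can set slave-skip-errors, which applies globally. Other replication-related settings are global as well.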

Latest version of the patch
Limitation 1 has been fixed; RESET SLAVE now works.

Tags: Database, Multi-Master, MySQL, Percona, Replication

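A Tuning Case Where InnoDB Outperformed Oracle

January 23rd, 2012 | Posted by P.Linux | Filed under Uncategorized

6 comments

Before the New Year holiday I took some time to help a sister company test a migration from Oracle to MySQL. I had only hoped to tune MySQL to roughly Oracle's performance, but with the guidance and help of @何_登成 @淘宝丁奇 @淘宝褚霸 @淘伯松 (in no order of merit, simply the order in which I first pestered each of them about this case), we kept revising the plan and ended up with better performance than Oracle. It is a special scenario, but I think the lessons apply widely and are worth recording here.
Everything involving table structures and the actual business model has been omitted. Please do not ask about them; they cannot be disclosed. Thank you for understanding.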

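I. Test model:

There are 12 business tables. Each transaction consists of 12 SQL statements, each of which INSERTs into one table; once all 12 have run, the transaction is complete.

A program written against the MySQL C API connects to MySQL and repeats the following steps:

Begin the transaction: START TRANSACTION;
Insert one row into each table: INSERT INTO xxx VALUES (val1,val2,…); # 12 times in total
Commit the transaction: COMMIT;

A shell script starts 32 copies of the test program to run the test concurrently.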
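
The shape of one transaction in this workload can be sketched as below. The real table names and column values are not disclosed in the post, so the t1..tN names and the VALUES list here are placeholders, and build_transaction is a hypothetical helper rather than part of the actual test program.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Builds the statement sequence for one transaction of the test workload:
// START TRANSACTION, one INSERT per table, then COMMIT.
// Table names (t1..tN) and the VALUES list are placeholders only.
std::vector<std::string> build_transaction(int tables) {
  assert(tables > 0);
  std::vector<std::string> stmts;
  stmts.push_back("START TRANSACTION");
  for (int i = 1; i <= tables; ++i) {
    stmts.push_back("INSERT INTO t" + std::to_string(i) +
                    " VALUES (val1, val2)");
  }
  stmts.push_back("COMMIT");
  return stmts;
}
```

A client would send each of the 14 statements returned by build_transaction(12) in order, for example with mysql_query() from the MySQL C API, and loop over that sequence for the duration of the test.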

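II. Test environment:

1. Hardware:

R510
CPU: Intel(R) Xeon(R) CPU E5645 @ 2.40GHz, dual-socket, 24 threads
Memory: 6 × 8 GB = 48 GB
Storage: FusionIO 320 GB MLC

R910
CPU: Intel(R) Xeon(R) CPU E7530 @ 1.87GHz, quad-socket, 48 threads
Memory: 32 × 4 GB = 128 GB
Storage: FusionIO 640 GB MLC

2. Linux configuration:

Single-instance startup: add numa=off to the kernel boot parameters in /boot/grub/menu.lst
Multi-instance startup: numactl --cpunodebind=$BIND_NO --localalloc $MYSQLD

RHEL 5.4 with the stock 2.6.18 kernel
RHEL 6.1 with the Taobao build of the 2.6.32 kernel

fs.aio-max-nr = 1048576 # maximum number of concurrent asynchronous I/O requests allowed system-wide
vm.nr_hugepages = 18000 # number of huge pages
vm.hugetlb_shm_group = 601 # ID of the group allowed to use huge pages, i.e. the mysql user's group
vm.swappiness = 0 # avoid using swap

Read more…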

Tags: Database, AIO, InnoDB, Kernel, Linux, MySQL, Percona, XtraDB

Implementing Kill Idle Transaction in the Server Layer

December 23rd, 2011 | Posted by P.Linux | Filed under Uncategorized

1 comment

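In the previous post, "How to Kill Idle Transactions", we described how to clean up idle transactions for InnoDB. Following a hint from @sleebin9, the same feature does not have to be limited to InnoDB; it can work for every transactional engine in MySQL.

How do we implement it in the server layer? The do_command() function in sql/sql_parse.cc is a good place. The connection thread calls do_command() in a loop to read and execute commands, and inside do_command() it calls my_net_set_read_timeout(net, thd->variables.net_wait_timeout) to set the timeout on the thread's socket connection, so this is where we can hook in.
The main code: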

  /*
    This thread will do a blocking read from the client which
    will be interrupted when the next command is received from
    the client, the connection is closed or "net_wait_timeout"
    number of seconds has passed
  */
  /* Add For Kill Idle Transaction By P.Linux */
  if (thd->active_transaction())
  {
    if (thd->variables.trx_idle_timeout > 0)
    {
      my_net_set_read_timeout(net, thd->variables.trx_idle_timeout);
    } else if (thd->variables.trx_readonly_idle_timeout > 0 && thd->is_readonly_trx)
    {
      my_net_set_read_timeout(net, thd->variables.trx_readonly_idle_timeout);
    } else if (thd->variables.trx_changes_idle_timeout > 0 && !thd->is_readonly_trx)
    {
      my_net_set_read_timeout(net, thd->variables.trx_changes_idle_timeout);
    } else {
      my_net_set_read_timeout(net, thd->variables.net_wait_timeout);
    }
  } else {
    my_net_set_read_timeout(net, thd->variables.net_wait_timeout);
  }
  /* End */

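Do you see what is going on? It is really a substitution trick: this is the spot where wait_timeout would normally be applied, so by first checking whether the thread is inside a transaction we can apply an idle-transaction timeout instead.

trx_idle_timeout controls the timeout for all transactions and has the highest priority
trx_changes_idle_timeout controls the timeout for transactions that are not read-only
trx_readonly_idle_timeout controls the timeout for read-only transactions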
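
The precedence among these variables can be isolated from the server code as a small pure function. This is a minimal sketch for reasoning about the rules, not part of the patch: the TimeoutVars struct and effective_read_timeout are hypothetical stand-ins for thd->variables and the logic in do_command(), with 0 meaning "not set".

```cpp
#include <cassert>

// Hypothetical stand-in for the session variables the patch reads from
// thd->variables; a value of 0 means the timeout is not set.
struct TimeoutVars {
  unsigned long net_wait_timeout;
  unsigned long trx_idle_timeout;
  unsigned long trx_readonly_idle_timeout;
  unsigned long trx_changes_idle_timeout;
};

// Returns the read timeout that would be applied to the connection,
// mirroring the if/else chain in the patch.
unsigned long effective_read_timeout(const TimeoutVars &v,
                                     bool in_transaction,
                                     bool readonly_trx) {
  assert(v.net_wait_timeout > 0);
  if (!in_transaction) return v.net_wait_timeout;
  // trx_idle_timeout wins over the two more specific settings.
  if (v.trx_idle_timeout > 0) return v.trx_idle_timeout;
  if (readonly_trx && v.trx_readonly_idle_timeout > 0)
    return v.trx_readonly_idle_timeout;
  if (!readonly_trx && v.trx_changes_idle_timeout > 0)
    return v.trx_changes_idle_timeout;
  // Inside a transaction with nothing configured: the usual wait timeout.
  return v.net_wait_timeout;
}
```

For instance, with net_wait_timeout = 28800, trx_idle_timeout = 0, trx_readonly_idle_timeout = 60 and trx_changes_idle_timeout = 30, a read-only transaction gets 60 seconds, a writing transaction gets 30, and a connection outside any transaction keeps 28800.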

Result:

root@localhost : (none) 08:39:49> set autocommit = 0 ;
Query OK, 0 rows affected (0.00 sec)

root@localhost : (none) 08:39:56> set trx_idle_timeout = 5;
Query OK, 0 rows affected (0.00 sec)

root@localhost : (none) 08:40:17> use perf 
Database changed
root@localhost : perf 08:40:19> insert into perf (info ) values('11');
Query OK, 1 row affected (0.00 sec)

root@localhost : perf 08:40:26> select * from perf;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    6
Current database: perf

+----+------+
| id | info |
+----+------+
|  7 | aaaa |
|  9 | aaaa |
| 11 | aaaa |
+----+------+
3 rows in set (0.00 sec)

The complete patch can be downloaded here:

server_kill_idle_trx.patch (5.7 KiB, 3,275 hits)

Tags: Database, InnoDB, MySQL, Percona, XtraDB, kill_idle_transaction, patch

How to Kill Idle Transactions

November 29th, 2011 | Posted by P.Linux | Filed under Uncategorized

1 comment

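We often run into a situation where a network break or an application bug means the COMMIT/ROLLBACK statement never reaches the database and the connection thread is never released, while transactions pile up waiting on locks and the connection count shoots up. This happens a lot on test databases and occasionally in production, so I wanted to add a feature to MySQL that kills idle transactions.

How should it be implemented? Going through the MySQL server layer involves many uncertainties, so the safest option is to implement it in the storage engine layer. Almost everything we run is InnoDB/XtraDB, so I based the change on Percona; Oracle's MySQL can be modified in the same way.

Requirements:
1. Once a transaction has started, if more than a given time (innodb_idle_trx_timeout) passes after the last statement in the transaction finished executing, the connection should be closed.
2. If the transaction is purely read-only, it takes no locks and is harmless, so it does not need to be closed and can simply be kept.
Although Alexey Kopytov of Percona pointed out a possible problem with this idea ("Even though SELECT queries do not place row locks by default (there are exceptions), they can still block undo log records from being purged."), we do have scenarios in which a SELECT must never be killed unless a later INSERT/UPDATE/DELETE has occurred, so I made the change according to our own business needs.
I raised with Percona's Yasufumi Kinoshita and Alexey Kopytov that pure-SELECT transactions should not be killed, but the proposal to control this with a single parameter has not been accepted by Alexey Kopytov. As a general solution I proposed two variables that separately control the idle timeout of read-only transactions and that of lock-holding transactions, and I am still waiting for Percona's reply. Because this scheme is still being tested, I am not publishing the change yet; if you know the MySQL source well, this description should be enough to see how to split the control into those two parameters.

With these two requirements in mind, let's design the approach. The most convenient place for this feature is the InnoDB master thread: it is scheduled once per second, so it can check for idle transactions along the way and close them. Because operating on trx->mysql_thd inside a transaction is not safe, it is generally best to work with the thread ID inside InnoDB. Also, within InnoDB only ha_innodb.cc may reference THD, so any thread-related values the master thread needs must be computed in ha_innodb.cc and returned to the master thread as integers or booleans.
Read more…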

Tags: Database, MySQL, Percona

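A Few Things About MySQL Timeouts

November 24th, 2011 | Posted by P.Linux | Filed under Uncategorized

7 comments

I recently ran into some timeout problems, so I took the opportunity to go through all the timeout parameters. First, let's query the database to see which timeouts exist: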

root@localhost : test 12:55:50> show global variables like "%timeout%";
+----------------------------+--------+
| Variable_name              | Value  |
+----------------------------+--------+
| connect_timeout            | 10     |
| delayed_insert_timeout     | 300    |
| innodb_lock_wait_timeout   | 120    |
| innodb_rollback_on_timeout | ON     |
| interactive_timeout        | 172800 |
| net_read_timeout           | 30     |
| net_write_timeout          | 60     |
| slave_net_timeout          | 3600   |
| table_lock_wait_timeout    | 50     | # this variable is no longer used
| wait_timeout               | 172800 |
+----------------------------+--------+

Let's go through them one by one.

connect_timeout

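Manual description:
The number of seconds that the mysqld server waits for a connect packet before responding with Bad handshake. The default value is 10 seconds as of MySQL 5.1.23 and 5 seconds before that.
Increasing the connect_timeout value might help if clients frequently encounter errors of the form Lost connection to MySQL server at 'XXX', system error: errno.
Explanation: the timeout for the handshake while a connection is being established. It only matters during login; once login succeeds this parameter no longer plays any role. Its main purpose is to keep the connection count from growing too fast when applications reconnect over a poor network. The default is usually fine.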

delayed_insert_timeout

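Manual description:
How many seconds an INSERT DELAYED handler thread should wait for INSERT statements before terminating.
Explanation: this timeout is for INSERT DELAYED (designed for MyISAM): it is how long the INSERT DELAYED handler thread waits for further INSERT statements before it terminates.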

innodb_lock_wait_timeout

Manual description:
The timeout in seconds an InnoDB transaction may wait for a row lock before giving up. The default value is 50 seconds. A transaction that tries to access a row that is locked by another InnoDB transaction will hang for at most this many seconds before issuing the following error:

ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

When a lock wait timeout occurs, the current statement is not executed. The current transaction is not rolled back. (To have the entire transaction roll back, start the server with the --innodb_rollback_on_timeout option, available as of MySQL 5.1.15. See also Section 13.6.12, "InnoDB Error Handling".)
innodb_lock_wait_timeout applies to InnoDB row locks only. A MySQL table lock does not happen inside InnoDB and this timeout does not apply to waits for table locks.
InnoDB does detect transaction deadlocks in its own lock table immediately and rolls back one transaction. The lock wait timeout value does not apply to such a wait.
For the built-in InnoDB, this variable can be set only at server startup. For InnoDB Plugin, it can be set at startup or changed at runtime, and has both global and session values.
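Explanation: the description is long, but in short this is the timeout for a statement that is waiting for a lock inside a transaction. It is different from a deadlock: as soon as InnoDB detects a deadlock it immediately rolls back the cheaper of the transactions involved, whereas a lock wait is the case where, with no deadlock, one transaction holds a lock that another needs. On timeout, it is always the statement requesting the lock that gets rolled back.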

innodb_rollback_on_timeout

Manual description:
In MySQL 5.1, InnoDB rolls back only the last statement on a transaction timeout by default. If --innodb_rollback_on_timeout is specified, a transaction timeout causes InnoDB to abort and roll back the entire transaction (the same behavior as in MySQL 4.1). This variable was added in MySQL 5.1.15.
Explanation: when this option is off or absent, a timeout rolls back only the last statement of the transaction; when it is on, a timeout rolls back the whole transaction.

interactive_timeout/wait_timeout

Manual description:
The number of seconds the server waits for activity on an interactive connection before closing it. An interactive client is defined as a client that uses the CLIENT_INTERACTIVE option to mysql_real_connect(). See also
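Explanation: how long a connection may stay idle in the Sleep state before the server closes it. Every time the connection executes a query it becomes active, and when the query finishes it goes back to sleep and the timer starts over. interactive_timeout applies to interactive clients, that is, clients that connect with the CLIENT_INTERACTIVE option; wait_timeout has the same meaning but applies to non-interactive connections.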

net_read_timeout / net_write_timeout

Manual description:
The number of seconds to wait for more data from a connection before aborting the read. Before MySQL 5.1.41, this timeout applies only to TCP/IP connections, not to connections made through Unix socket files, named pipes, or shared memory. When the server is reading from the client, net_read_timeout is the timeout value controlling when to abort. When the server is writing to the client, net_write_timeout is the timeout value controlling when to abort. See also slave_net_timeout.
On Linux, the NO_ALARM build flag affects timeout behavior as indicated in the description of the net_retry_count system variable.
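Explanation: these are, respectively, how long the server waits to receive more network data from the client and how long it waits when sending data to the client. They only matter while the connection is active (executing a command), and as the manual text above notes, before MySQL 5.1.41 they applied only to TCP/IP connections.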

slave_net_timeout

Manual description:
The number of seconds to wait for more data from the master before the slave considers the connection broken, aborts the read, and tries to reconnect. The first retry occurs immediately after the timeout. The interval between retries is controlled by the MASTER_CONNECT_RETRY option for the CHANGE MASTER TO statement or --master-connect-retry option, and the number of reconnection attempts is limited by the --master-retry-count option. The default is 3600 seconds (one hour).
Explanation: this is the timeout the slave uses to decide whether the master is down: if it has still received nothing from the master within this time, it considers the master dead.

Tags: Database, MySQL, timeout