InnoDB一定会在索引中加上主键吗

9月 20th, 2012 | Posted by | Filed under 数据库

DBA群里在讨论一个问题,到底InnoDB会不会在索引末尾加上主键,什么时候会加?

我之前看代码记得是如果索引末尾就是主键,那么InnoDB就不再添加主键了,如果索引末尾不是主键,那么会添加主键,但是这跟测试结果不符:

CREATE TABLE t (
  a char(32) not null primary key,
  b char(32) not null,
  KEY idx1 (a,b),
  KEY idx2 (b,a)
) Engine=InnoDB;

插入部分数据后可以看到idx1和idx2两个索引的大小相同。这说明idx1和idx2的内部结构是一样的,因此 不可能 是idx1在内部存为(a,b,a)。

在登博的指导下看了 dict0dict.cc:dict_index_build_internal_non_clust() 这个函数,就是构造索引的数据字典的过程,理解了这个过程就明白了,我们接下来解读下这个函数(基于5.6最近trunk):

阅读全文…

标签: , ,

InnoDB实现独立表空间多数据文件 (InnoDB multiple datafiles per single-tablespace)

9月 12th, 2012 | Posted by | Filed under 数据库

我们知道,在Oracle中,每个表空间都可以由很多文件组成,这样文件的IO就可以分散在很多存储路径上。虽然MySQL的服务器一般来说不会配置多路径存储,但是,很多老式文件系统(例如EXT3)对大文件的IO操作支持不好,性能很差,所以对MySQL/InnoDB来说,把数据文件大小控制在比较小的范围,也是有好处的。

As we know that Oracle can let one tablespace contains many datafiles, so file IO can distribute to multiple storage paths. Most of MySQL servers will not use multiple storage paths, but many old filesystems can’t support large files well, if datafiles too larger, performance will be lower, such as EXT3. So keeping MySQL/InnoDB datafiles size in a relatively small range is beneficial, too.

InnoDB在共享表空间模式下,是支持多文件的,用innodb_data_file_path选项可以配置:

InnoDB supports multiple datafiles in the shared tablespace mode, and we can configure it with innodb_data_file_path:

innodb_data_file_path = /disk1/ibdata1:2G;/disk2/ibdata2:2G:autoextend

这样配置就把数据文件分散在了disk1和disk2两个路径下,第一个文件固定2G大小,第二个文件初始化2G,可以自增长。

Configuring innodb_data_file_path like this, InnoDB can distribute the two datafiles to disk1 & disk2. The first file is fiexed size, 2GB. And the second file is auto extened, initial size id 2GB.

但是如果设置了 innodb_file_per_table 选项,每张表都会有一个独立的表空间文件,就不能再对每个独立表空间使用多数据文件了。但是,即使每张表分配一个独立的文件,还是可能有某些表变得非常大,例如我们就有几百GB的表,在XFS文件系统上这还没什么问题,有些系统为了“安全”依然使用EXT3,大文件的操作性能就堪忧了。

But if innodb_file_per_table = TRUE, each tables will have their single tablespace datafile, and innodb_data_file_path only used for system-tablespace, single-tablespace can’t use mutiple datafiles. Even thought each tables will have one single datafile, file maybe become very large, too.

当然,可以通过分库分表分区来让数据文件变小,对于大部分小公司都没有中间件来完成分库分表的工作,而大表随处可见,业务变化快,用分区也不合理,因此,通过为独立表空间增加多数据文件的功能,是很好的选择。

Of course, we can split databases / tables, or use partition, it can let datafiles become small. But most of small companies haven’t middleware to split these, and they also have many big tables. So it’s best for them to use mutile datafiles per single-tablespace feature.

如何通过尽可能少的改动,来为InnoDB独立表空间也增加多数据文件的功能呢,经过一段时间调研,可以发现,大多数地方,InnoDB并没有用特别的方式来判断是独立表空间还是共享表空间,并且表空间描述符并没有因共享表空间还是独立表空间而有差异,都是使用 fil_space_t,并且其中 fil_space_t->chain 就是记录从属于表空间的所有文件,用 fil_node_t 描述。

How to implement multiple datafiles per single-tablespace feature with modifying source code as little as possible? I found something useful through research, InnoDB haven’t do many special judge for shared/single-tablesapce, and tablespace descriptor is the same for them (fil_space_t). And fil_space_t->chain (fil_node_t) is the list of the files belong to this tablespace.

尤其当我看到这个注释时:

Especially when I saw this comment:

  /* TODO: The following code must change when InnoDB supports
  multiple datafiles per tablespace. */ 

我觉得InnoDB团队在开发时,也已经考虑到了未来需要增加表空间多文件的支持,更让我确信这是可以实现的。

I think InnoDB team want to do it, too. And they are already do enough preparation when they code. So I’m sure I can implement this feature.

因此基于5.6的源码树修改代码测试,我觉得如下思路是靠谱的,正按着这个方案重新整理代码:

And then I try to modify code on MySQL 5.6 source code, I found a practical way, I’m coding with this design:

用户接口 (User Interface):

CREATE TABLE语法新增两个参数:DATAFILE_INITIAL_SIZE, DATAFILE_NUM,分别表示数据文件初始大小和数据文件数量。

I added two options in CREATE TABLE syntax: DATAFILE_INITIAL_SIZE & DATAFILE_NUM. They represent the initial size of the data files and the number of data files.

CREATE TABLE table_name (...) ENGINE=InnoDB 
  DATAFILE_INITIAL_SIZE=1000000, DATAFILE_NUM=100;

这样就会建100个包含1000000个页面的文件,命名方式采用 “table_name#num.ibd”,都建在默认数据目录下,最多允许初始化255个文件,每个文件都是固定大小,如果还需要增加文件,需要使用ALTER TABLESPACE命令。

This SQL will let MySQL create a table with 100 datafiles, and each datafiles have 1000000 pages. The auto created datafiles named “table_name#num.ibd” in the default datadir. Allowed to contain up to 255 data files, each datafiles are fixed size. If you want to add datafile after created table, you need to use ALTER TABLESPACE command.

ALTER TABLESPACE `db_name/table_name` 
  ADD DATAFILE '/diskN/table_name#256' 
  INITIAL_SIZE = 5000 AUTOEXTEND_SIZE=1000 ENGINE=InnoDB;

这个命令会为db_name下的table_name表增加一个数据文件,位置在”/diskN/table_name#256.ibd”(后缀自动加),初始大小为5000个页面,每次自动扩展1000个页面。

This SQL will add a datafile for db_name.tablename, datafile path is “/diskN/table_name#256.ibd” (suffix .ibd is added automatically), initial size is 5000 pages, each autoextend operation will extend 1000 pages.

阅读全文…

MySQL下实现闪回的设计思路 (MySQL Flashback Feature)

9月 9th, 2012 | Posted by | Filed under 数据库

用过Oracle数据库的同学都知道,Oracle有一个Flash Recovery Area,可以把变更的块写入这块区域,当数据操作错误,需要恢复的时候,可以利用闪回空间中存储的数据块覆盖回去,也可以重构回滚段,恢复到需要的一致点。
As we know, There has a Flash Recovery Area in Oracle DB, Which allows the modified blocks been written into. So that, if there’s any incorrect deletion of data, and need to recover, DBA can use the data blocks which were stored in the Flash Recovery Area ,or reconstructed rollback segments, to restore the data to the consistent point.

而MySQL/InnoDB暂时没有提供这些功能,但是InnoDB很多设计都参考了Oracle,因此我觉得InnoDB也可以实现Flashback功能。
MySQL / InnoDB haven’t performed this great and useful function before I worked on it , though many designs of InnoDB are referred to Oracle. In this case, I think InnoDB should implement Flashback as well.

最开始我是想仿照Oracle,利用undo log来闪回,通过把COMMITTED的TRX标记为UNCOMMITTED,让InnoDB认为已经提交的事务没有提交,从而进行回滚。
具体方案是这样:
At first, I want to implement this feature, Oracle of reference. I can set COMMITTED transactions to UNCOMMITTED status during InnoDB starting with processing undo log. Then InnoDB will regard these committed transactions as uncommitted one, and rollback it.
Here are the details:

1. 在my.cnf中配置一个InnoDB_Flashback_Trx_ID的参数,标识回滚到这个trx_id的一致状态。
1. Add an option on my.cnf named InnoDB_Flashback_Trx_ID. It mean InnoDB need rollback to this trx snapshot.

2. 在InnoDB启动读取回滚段构造回滚事务时,凡是比InnoDB_Flashback_Trx_ID大的事务,都标记为UNCOMMITTED。
2. When InnoDB starting, and reading undo segments, I will set all transactions that trx_id > InnoDB_Flashback_Trx_ID to UNCOMMITTED.

3. InnoDB会把这些提交的事务认为没有提交,进而构造未提交事务,利用InnoDB自己的机制,将会在打开数据库前回滚这些事务。
3. InnoDB will consider these committed transactions are uncommitted, so construction the trx, and after construction all uncommitted transactions, InnoDB will rollback these transactions.

但这个方案有明显的弊端,首先只能适用于InnoDB,然后闪回操作需要重启,并且在实际编码实现这个方案的测试中发现,如果发生了DDL,再做一次闪回到DDL之前的TRX_ID,那么InnoDB会崩溃,并且无法再启动,应该是数据文件已经损坏,因为InnoDB的undo是逻辑记录,而非物理记录。
But this way have an Obvious disadvantages, it can only used by InnoDB. And flashback need restart MySQL. In the actual coding I found that if InnoDB did DDL, and I will rollback to the TRX_ID before DDL, InnoDB will crash, and can’t start again. I think the datafiles is corrupted, because InnoDB undo is logical records, not physical records.

因此想到了第二个方案,就是利用binlog,因为如果是ROW格式的binlog,其中记录了每个ROW的完整信息,INSERT会包含每个字段的值,DELETE也会包含每个字段的值,UPDATE会在SET和WHERE部分包含所有的字段值。因此binlog就是个完整的逻辑redo,把它的操作逆过来,就是需要的“undo”。
具体方案是这样:
So I think another way that use binlog. Because the ROW format binlog will record whole information about modified rows. INSERT/DELETE will contain all columns’ values. UPDATE will contain all columns’ on SET/WHERE part. So binlog like a whole logical redo log, reversed them can get the “undo” I need. Detail:

1. 修改Row_log_event的print的结果,将Event_type逆转:WRITE_ROWS_EVENT转为DELETE_ROWS_EVENT / DELETE_ROWS_EVENT转为WRITE_ROWS_EVENT,这只要改一个标记位即可,就是第4个字节ptr[4]。
1. Modifying the result of Row_log_event::print that reversed Event_type: Modifying WRITE_ROWS_EVENT to DELETE_ROWS_EVENT / DELETE_ROWS_EVENT to WRITE_ROWS_EVENT, this change need only modify a byte, that’s ptr[4].

2. 对于UPDATE_ROWS_EVENT,需要对调SET和WHERE部分,这是唯一相对有点麻烦的地方,我增加了个exchange_update_rows函数来完成。主要是利用print_verbose_one_row函数来解析出SET和WHERE部分的长度,以此来推断SET和WHERE的分割点,然后用memcpy交换。
2. With UPDATE_ROWS_EVENT, it need swap SET/WHERE parts. This is the only place has little trouble, I added an exchange_update_rows() function to do it. It will use print_verbose_one_row() to parse the length of SET/WHERE parts, so I can get the cut-point of SET/WHERE parts, and then swap it with memcpy().

3. 得到了逆转后的Event,就需要逆转输出。因此我在内存中拦截输出,我修改了Write_on_release_cache类,并且在Log_event中增加了一个buff,可以把Event的print结果打印在buff中,因此mysqlbinlog可以得到每个event的输出,并且存在内存中。
3. After get the reversed Event, it need reverse the sequence of Events. So I intercepted event output in memory by modifying Write_on_release_cache class, and I added a buff member on Log_event to save the print output. So mysqlbinlog can get all events’ output, and store in memory.

4. mysqlbinlog中我用动态数组存下所有的event输出,然后就从末尾向前逆向输出所有的事件,这样就可以获得闪回的逆操作文件,把这个文件导入目标库既可以完成闪回。
4. I used DYNAMIC_ARRAY to cache all events’ output in mysqlbinlog. and then I print the events’ output from end to begin, so I get the flashback file. You can import this file to MYSQL, data can flashback.

这个方案的好处很明显,通用于所有的存储引擎,因为binlog是Server层的。另外可以利用mysqlbinlog已有的各种filter来筛选部分日志输出为回滚日志,这样可以灵活选择闪回某一段操作,闪回某一个库的操作,某一个时间段的操作等等。
The advantage of this way is that all store engines can use it, because binlog is the log of Server. And then, mysqlbinlog have many filters, such as start-position/start-datatime and so on.

补丁可以看这里(Patch here):http://mysql.taobao.org/index.php/Patch_source_code#Add_flashback_feature_for_mysqlbinlog

百度AStar2008的一道题:成语纠错

6月 17th, 2012 | Posted by | Filed under 程序设计

有个小盆友正好问我这个问题,当年这个题在Astar我是满分pass了,贴出来参考下,无技术含量。

问题背景
成语是中华民族的文化瑰宝,作为历史的缩影、智慧的结晶、汉语言的精华,闪烁着睿智的光芒。
你的任务是给一个错误的四字成语进行纠错,找到它的正确写法。具体来说,你只允许修改四个汉字中的其中一个,使得修改后的成语在给定的成语列表中出现。原先的错误成语保证不在成语列表中出现。

有时,这样的“纠错”结果并不惟一。例如“一糯千金”可以改为“一字千金”也可以改成“一诺千金”。但由于“糯”和“诺”是同音字,“一糯千金”实为“一诺千金”的可能性比较大。
因此,我们还将提供一个汉字分类表,要求修改前后的两个字必须属于同一个分类。
在这样的限制下,我们保证成语纠错的结果惟一。
注意
1、汉字均采用GBK编码(参见FAQ)
2、每个汉字分类至少包含两个汉字,同一个汉字可能出现在多个类别中。
3、成语列表中的成语都是真实存在的四字成语。成语列表和待纠错成语中的所有汉字均在汉字分类表中的至少一个分类中出现。
输入格式
输入第一行包含两个整数n, m(1<=n<=200, 1<=m<=20000)。n表示汉字类别的个数,m表示成语的个数。 以下n行每行用一个无空白分隔符(空格、TAB)的汉字串表示一个分类中的所有汉字。注意,该汉字串最多可能包含200个汉字。 以下m行为成语列表,每行一个成语,恰好四个汉字。 最后一行为待纠错的成语,恰好四个汉字,且不在成语列表中出现。 输出格式 仅一行,为一个四字成语。在“修改必须在同一分类中进行”的限制下,输入数据保证纠错结果惟一。 样例输入 7 3 糯诺挪喏懦 字自子紫籽 前钱千牵浅 进近今仅紧金斤尽劲 完万 水睡税 山闪衫善扇杉 一诺千金 一字千金 万水千山 一糯千金 样例输出 一诺千金

#include 
#include 
#include 
using namespace std;

int hashkey(char ch1,char ch2)
{
    return ((unsigned char)ch1-129)*190 + ((unsigned char)ch2-64) – (unsigned char)ch2/128;
}

int checksame(string str1,string str2,string &strn,string &strm)
{
    int count = 0;
    for(int i = 0; i < 4; ++i)
        if((str1[i*2]==str2[i*2])&&(str1[i*2+1]==str2[i*2+1]))
            ++count;
            else{
                strn.resize(2);
                strm.resize(2);
                strn[0] = str1[i*2];
                strn[1] = str1[i*2+1];
                strm[0] = str2[i*2];
                strm[1] = str2[i*2+1];
            }
    //cout << "Check:" << str1 << "," << str2
    //<< ":" << count
    //<< "--" << strn[0] << strn[1]
    //<< "|" << strm[0] << strm[1] < > hashn;
    vector< vector < vector > > hashm;
    string str,tmp1,tmp2;
    string kind[200],word[20000];

    cin >> n >> m;

    hashn.resize(25000);
    for(int i = 0; i < 25000; ++i)
        hashn[i].resize(1,0);
    for(int i = 0; i < n; ++i){
        cin >> kind[i];
        //cout << kind[i].size() << endl;
        for(int j = 0; j < (kind[i].size()/2); ++j){
            GBKindex = hashkey(kind[i][j*2],kind[i][j*2+1]);
            count = ++hashn[GBKindex][0];
            hashn[GBKindex].resize(count+1);
            hashn[GBKindex][count] = i;
            //cout << GBKindex << ":" << hashn[GBKindex][0] << ":"
            //<< i << ":" <> word[i];
        for(int j = 0; j < 4; ++j){
            GBKindex = hashkey(word[i][j*2],word[i][j*2+1]);
            count = ++hashm[j][GBKindex][0];
            hashm[j][GBKindex].resize(count+1);
            hashm[j][GBKindex][count] = i;
            //cout << GBKindex << ":" << hashm[j][GBKindex][0] << ":"
            //<< i << ":" << word[i][j*2] << word[i][j*2+1] <> str;

    for(int i = 0; i < 4; ++i ){
        GBKindex = hashkey(str[i*2],str[i*2+1]);
        //cout << GBKindex <		
标签:

为MySQL增加线程内存监控 (MySQL Thread Memory Usage Monitor)

4月 27th, 2012 | Posted by | Filed under 数据库

为了国际友人看得懂,以后我的博客都同时提供中英文版。:)
For foreign friends can understand, all of my blog at the same time in English in the future.

使用MySQL中我经常发现mysqld的内存使用会涨的很快(Buffer Pool是大页分配的),以至于使用SWAP,到底Server层用了多少内存,没有一个监控机制,所以第一步我编写了个patch(基于5.6.6)来监控每个线程用了多少内存,一旦mysqld进程使用太多内存,就去看哪些线程用的多,杀掉这些线程。
I often found mysqld process use memory will grow up very fast(InnoDB Buffer Pool used large page), lead to mysqld use SWAP. How many memory MySQL Server(Threads) used? no monitor now! So I write a patch based on MySQL 5.6.6 first, it can monitor how many memory used each threads. If I found mysqld process used too many memory, I can watch which threads used more memory, and kill them.

打上补丁后的效果像这样:
This is the effect after patched:
MySQL Thread Memory Usage

代码可以看patch
This is the patch:

  5.6_thread_mem_usage.patch (4.1 KiB, 1,833 hits)

基本方法就是在my_malloc和my_free中增加回调函数(@淘宝丁奇 提供的思路,太帅了),获取调用my_malloc和my_free函数的THD描述符,用THD中新加的malloc_size字段去记录申请和释放内存,其实my_realloc也应该去更新malloc_size,暂时还没加进去。
The method is add callback function on my_malloc/my_free(Xiaobin Lin give me this Callback idea) to get the THD which call my_malloc/my_free. And use a variable named “malloc_size” on THD to record how many memory malloc/free. In fact, my_realloc is also need calc malloc_size, but I have not add it on this version.

然后使用malloc_usable_size函数在free时判断指针申请了多少内存,在GCC 4.2以上可以使用malloc_size(pointor)去判断。
And then, I use malloc_usable_size function to get the size of pointor which will be free. After GCC 4.2, we can use malloc_size to get it.

下一步我会分类监控,把每个线程sort_buffer/join_buffer/net_buffer等线程级内存都分类统计出来占用多少,方便更直观的监控。
Next step, I will monitor the size of sort_buffer/join_buffer/net_buffer in each threads, not only total size each threads.

这是新版补丁,计算了my_realloc重分配的内存:
This is the new patch, calc my_realloc size:

  5.6_thread_mem_usage_ver2.patch (5.0 KiB, 1,575 hits)

(基于mysql-5.6.6)

  5.5_thread_mem_usage.patch (5.2 KiB, 1,431 hits)

(基于percona-5.5.22)

标签: ,