Phoenix 安装与使用

Phoenix是构建在Apache HBase(列式大数据存储)之上的一个SQL中间层。Phoenix完全使用Java编写,代码位于GitHub上,并且提供了一个客户端可嵌入的JDBC驱动。

Phoenix查询引擎会将SQL查询转换为一个或多个HBase scan,并编排执行以生成标准的JDBC结果集。直接使用HBase API、协同处理器与自定义过滤器,对于简单查询来说,其性能量级是毫秒,对于百万级别的行数来说,其性能量级是秒。

安装

不能使用 Phoenix 二进制文件应用于 CDH 的 HBase上,需要从源码编译安装。

不过 CDH 自带 Phoenix 组件,可通过远程 Parcel 存储库 URL安装,http://archive.cloudera.com/cloudera-labs/phoenix/parcels/latest/

在HBase 的 Region Server 中增加配置:

<property>
  <name>hbase.regionserver.wal.codec</name>
  <value>org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec</value>
</property>

在使用shell客户端时候,有时候会超时,修改配置文件,在当前目录执行:

vi /opt/cloudera/parcels/CLABS_PHOENIX/lib/phoenix/bin/hbase-site.xml
<property>
    <name>phoenix.query.timeoutMs</name>
    <value>3600000</value>
</property>
cd /opt/cloudera/parcels/CLABS_PHOENIX/lib/phoenix/bin
phoenix-sqlline.py hadoop-01:2181

如果你有一张Hbase的表,并且以前已经有数据,最好建立VIEW,因为一旦建立了TABLE, Phoenix 会认为这张table完全是属于Phoenix的,就算你只是删除Phoenix里面的表,Hbase里面对应的表也会一起删除掉

如果你之前没有这张Hbase的表,你想建立全新的Phoenix表,就用Table,但是要记住所有的操作都要经过Phoenix,那张同时被建立的表,只是一张附属表,不要试图往里面手动维护数据,忘记它的存在。

Squirrel GUI

在Squirrel安装目录的lib下,添加CDH自带的client jar包:phoenix-1.2.0-client.jar
该文件位置在CDH中:/opt/cloudera/parcels/CLABS_PHOENIX/lib/phoenix/

在Squirrel GUI中添加驱动:
Name: Phoenix
Example: jdbc:phoenix:localhost
ClassName: org.apache.phoenix.jdbc.PhoenixDriver

添加连接测试:
jdbc:phoenix:hadoop-05

视图

创建视图关联已存在 HBase 表,这种是只读表,表字段与列族必须对应

视图的数据和索引都是不会变化的,即使你在HBase中继续添加数据,看到的依然是旧的数据

CREATE VIEW IF NOT EXISTS "raw:51job" (
rowkey  VARCHAR PRIMARY KEY, 
"d"."url" VARCHAR, 
"d"."name" VARCHAR,
"d"."html" VARCHAR
);

CREATE VIEW IF NOT EXISTS "test:demo" (
rowkey  VARCHAR PRIMARY KEY, 
"d"."name" VARCHAR, 
"d"."age" VARCHAR,
"d"."time" VARCHAR
);

创建表关联已存在 HBase 表,这种是读写表,如果列族未定义,会在HBase中自动创建出来,注意一旦关联,在Phoenix中删除表同时会删除HBase中的表。

创建前:

{NAME => 'd', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCK
SIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}   

Phoenix执行创建表:

CREATE TABLE IF NOT EXISTS "test:demo" (
rowkey  VARCHAR PRIMARY KEY, 
"d"."name" VARCHAR, 
"d"."age" VARCHAR,
"d"."time" VARCHAR
);

创建后HBase表的变更:

test:demo, {TABLE_ATTRIBUTES => {coprocessor$1 => '|org.apache.phoenix.coprocessor.ScanRegionObserver|805306366|', coprocessor$2 => '|org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver|805306366|',
 coprocessor$3 => '|org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver|805306366|', coprocessor$4 => '|org.apache.phoenix.coprocessor.ServerCachingEndpointImpl|805306366|', coprocessor$5 => '|org.apa
che.phoenix.hbase.index.Indexer|805306366|index.builder=org.apache.phoenix.index.PhoenixIndexBuilder,org.apache.hadoop.hbase.index.codec.class=org.apache.phoenix.index.PhoenixIndexCodec', coprocessor$6 => '|org
.apache.hadoop.hbase.regionserver.LocalIndexSplitter|805306366|'}                                                                                                                                                
COLUMN FAMILIES DESCRIPTION                                                                                                                                                                                      
{NAME => 'd', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', COMPRESSION => 'SNAPPY', VERSIONS => '1', TTL => 'FOREVER', MIN_VERSIONS => '0', KEEP_DELETED_CELLS => 'FALSE', BLOCK
SIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'} 

系统会自动增加一个d:_0的空值,目的是为了更好的查询,所以表的查询性能比视图更好

直接通过 Phoenix 创建表

CREATE TABLE IF NOT EXISTS "test:demo2" (
"id" VARCHAR NOT NULL PRIMARY KEY, 
"d"."name"VARCHAR,
"d"."age" VARCHAR
) DATA_BLOCK_ENCODING='NONE',COMPRESSION='SNAPPY',VERSIONS=1;

创建索引、删除索引

CREATE INDEX idx_demo ON "test:demo"("d"."name") INCLUDE ("d"."age", "d"."time");

CREATE INDEX index.demo2 ON "test:demo2"("d"."name");

DROP INDEX idx_51job ON"raw:51job";

一些查询示例


SELECT * FROM "raw:51job" LIMIT 10;

select ROWKEY, "url","name" from "raw:51job"  where "url" = 'http://jobs.51job.com/all/co615357.html' OR "name" = '重庆皇洋房地产开发有限公司'

select ROWKEY, "url","name" from "raw:51job" where "url" like  'http://jobs.51job.com/all/co%' order by "url" desc limit 10;

select ROWKEY from "raw:51job" where "name" = '泰利明苑物业公司' 

select * from "raw:51job" where ROWKEY = (select ROWKEY from "raw:51job"  where "url" = 'http://jobs.51job.com/all/co2139048.html')

select * from "raw:51job" where ROWKEY = (select ROWKEY from "raw:51job" where "name" = '重庆皇洋房地产开发有限公司')

select COUNT(*) from "test:content_k"

select COUNT(*) from "raw:51job"

索引对比

MySQL 9000 万数据,容量以及索引创建耗时

-rw-rw---- 1 mysql mysql  23G 1月  19 22:20 content_b.MYD
-rw-rw---- 1 mysql mysql 1.9G 1月  19 22:26 content_b.MYI

mysql> ALTER TABLE `test`.`content_b` ADD INDEX `idx_published` (`published`);              
Query OK, 89999998 rows affected (20 min 29.78 sec)   26G
Records: 89999998  Duplicates: 0  Warnings: 0

Phoenix 相同数据索引创建与耗时

CREATE VIEW IF NOT EXISTS "test:content_m" (
rowkey  VARCHAR PRIMARY KEY, 
"d"."md5" VARCHAR, 
"d"."title" VARCHAR,
"d"."catid" VARCHAR,
"d"."modelid" VARCHAR,
"d"."published" VARCHAR,
"d"."publishedby" VARCHAR,
"d"."time" VARCHAR,
"d"."datetime" VARCHAR,
"d"."status" VARCHAR
);

CREATE INDEX index_content_b ON "test:content_b"("d"."published");

1.1 G   2.1 G    /hbase/data/default/INDEX_CONTENT_B
89,999,998 rows affected (935.629 seconds)

删除索引

DROP INDEX index_content_b  ON "test:content_b";