clickhouse sql优化,减少查询次数提升效率

    技术2022-07-13  69

    业务场景:计算sales/count/sku的值、同比、环比 用开窗函数neighbor计算yoy同比和chain环比 // 原本计算同比和环比需要使用三条sql聚合后相除 SELECT toStartOfYear(toDate('2020-05-01')) AS date, ROUND(SUM(sales), 2) AS sales, SUM(count) AS count, COUNT(DISTINCT product_id) AS sku FROM mars_rc_roma_wide WHERE (category_id IN (50015380))         AND month >= '2020-01-01'         AND month <= '2020-05-01' GROUP BY  date SELECT toStartOfYear(toDate('2019-05-01')) AS date, ROUND(SUM(sales), 2) AS sales, SUM(count) AS count, COUNT(DISTINCT product_id) AS sku FROM tmall_wide_idx2 WHERE (category_id IN (50010794))         AND month >= '2019-01-01'         AND month <= '2019-05-01' GROUP BY  date SELECT toStartOfYear(toDate('2019-03-01')) AS date, ROUND(SUM(sales), 2) AS sales, SUM(count) AS count, COUNT(DISTINCT product_id) AS sku FROM mars_rc_roma_wide WHERE (category_id IN (50015380))         AND month >= '2019-01-01'         AND month <= '2019-03-01' GROUP BY  date // 使用开窗函数后将其合并成一条查询 SELECT ROUND(SUM(if(month >= '2020-01-01’ AND month <= '2020-05-01', sales, 0)), 2) AS sales_last,      SUM(if(month >= '2020-01-01’ AND month <= '2020-05-01', count, 0)) AS count_last,      COUNT(DISTINCT(if(month >= '2020-01-01’ AND month <= '2020-05-01', product_id, 0))) AS sku_last,      (sales_last / ROUND(SUM(if(month >= '2019-01-01' AND month <= '2019-05-01', sales, 0)), 2) -1) * 100 AS sales_yoy,      (count_last / SUM(if(month >= '2019-01-01' AND month <= '2019-05-01', count, 0)) -1) * 100 AS count_yoy,                   (sku_last / COUNT(DISTINCT(if(month >= '2019-01-01' AND month <= '2019-05-01', product_id, 0))) -1) * 100 AS sku_yoy,  (sales_last / ROUND(SUM(if(month >= '2019-01-01' AND month <= '2019-03-01', sales, 0)), 2) -1) * 100 AS sales_ratio,  (count_last / SUM(if(month >= '2019-01-01' AND month <= '2019-03-01', count, 0)) -1) * 100 AS count_ratio, (sku_last / COUNT(DISTINCT(if(month >= '2019-01-01' AND month <= '2019-03-01', product_id, 0))) -1) * 100 AS sku_ratio FROM mars_rc_roma_wide WHERE (category_id IN (50015380)) // 注意⚠️:在计算sales销售额和count销量时,使用的是SUM;计算sku时,使用的是count // COUNT(DISTINCT(if(month >= '2019-01-01' AND month <= '2019-05-01', product_id, 0))) // 这里如果对精度要求很高的话,count(if(clause, a, b))中的b不要写0或者任何有意义的数 // 因为不管b是标成0还是-999,count后都会显示成1,这样没办法区分是业务值为1还是0。比较好的方法是写成NULL,这样得到的值会是[NULL]

     

          业务场景:clickhouse中array(T)中某个字符串类型的值,进行数值区间的计算 // 这里有n个区间就会进行 n*2 次查询 SELECT '100+' AS attr_value, SUM(sales) AS sales, SUM(count) AS count FROM     (SELECT LOWER(`a.attr_name`) AS attr_name,          toUInt64(toFloat64OrZero(a.attr_value)) AS value,          ROUND(SUM(sales), 2) AS sales,          SUM(count) AS count     FROM mars_rc_roma_wide ARRAY     JOIN attributes AS a     WHERE (category_id IN (50015380))             AND month >= '2019-01-01'             AND month <= '2019-05-01'             AND a.attr_name = 'mars_rc_roma_unit_price_kg'             AND a.attr_value != ''     GROUP BY  `a.attr_name`, `a.attr_value`     ORDER BY  sales DESC) WHERE (value >= 100) // 把原本多条 multiprocessing 的多线程函数不同的值,放到一起查询 // 这里原本有n个区间 [duration1, duration2],将区间前后通过其index进行命名取出 // 改进后将查询缩减成一次,但需要注意的是,if()条件中用到的值(month)需要再子查询中先查出来,以及必须先在子查询进行聚合,才能使用这样的优化 SELECT 'mars_mwc_roma_unit_price_kg' AS attr_value,     SUM(if((value > 0 AND value < 20) AND month >= '2020-01-01' AND month <= '2020-05-01' , sales, 0)) AS s0,      (s0 / SUM(if((value > 0 AND value < 20) AND month >= '2019-01-01' AND month <= '2019-05-01' , sales, 0)) - 1) * 100 AS s0_ratio,      (s0 / SUM(if(month >= '2020-01-01' AND month <= '2020-05-01' , sales, 0))) * 100 AS s0_share FROM     (SELECT toUInt64(toFloat64OrZero(a.attr_value)) AS value,          SUM(sales) AS sales,          month     FROM mars_mwc_roma_wide ARRAY     JOIN attributes AS a     WHERE (category_id IN (1))             AND (price >= 0)             AND (size >= 0)             AND a.attr_name = 'mars_mwc_roma_unit_price_kg'             AND a.attr_value != ''     GROUP BY  `a.attr_name`, `a.attr_value`, month     ORDER BY  sales DESC)

     

          业务场景:需要计算各个词的 tgi 的值。,主要是为了观察热词在该品牌的热度,并与其在该品类的热度进行比较。 解释 TGI:即Target Group Index(目标群体指数)  TGI指数= [目标群体中具有某一特征的群体所占比例/总体中具有相同特征的群体所占比例]*标准数100。 例如,在15-24岁的人群中,有8.9%的人过去一年内去过某电影网站看电影,而在总体人群中,去过该电影网站看电影的人数比例为6.6%,则该电影网站在15-24岁人群中的TGI指数是134.9(8.9%/6.6%×100),这说明,该电影网站主要定位在15-24岁的人群中。其数额越大,就表明目标群体吻合度度就越高。 TGI指数表征不同特征用户关注问题的差异情况,其中TGI指数等于100表示平均水平,高于100,代表该类用户对某类问题的关注程度高于整体水平。 // 查看整体的 sales_in_category_brand 和 sales_in_category // 1:sql_sales_in_category_brand SELECT ROUND(SUM(sales), 2) AS sales FROM tmall_wide_idx2 WHERE (category_id IN (50003695, 350407, 350402, 50008555, 50019790, 50012144, 50019649, 50008652, 120611, 121704, 50011867, 50008109, 50016107, 50008739, 50005266, 50024944, 1205, 50005050, 50005174, 50228001, 126412033, 125538002, 350203, 50011883, 1101))         AND LOWER(`brand_name`) = LOWER('HUAWEI\/\华\为')         AND (price >= 0)         AND (size >= 0)         AND `month` >= '2020-01-01'         AND `month` <= '2020-05-01’  // 2:sql_sales_in_category SELECT ROUND(SUM(sales), 2) AS sales FROM tmall_wide_idx2 WHERE (category_id IN (50003695, 350407, 350402, 50008555, 50019790, 50012144, 50019649, 50008652, 120611, 121704, 50011867, 50008109, 50016107, 50008739, 50005266, 50024944, 1205, 50005050, 50005174, 50228001, 126412033, 125538002, 350203, 50011883, 1101))         AND (price >= 0)         AND (size >= 0)         AND `month` >= '2020-01-01'         AND `month` <= '2020-05-01' // 上面那两种写法是为了便于理解,实质上对应的是这样一条sql,直接查出两个值 SELECT ROUND(SUM(sales), 2) AS category_sales,          ROUND(SUM(if(LOWER(brand_name)=LOWER('HUAWEI\/\华\为'), sales, 0)), 2) AS brand_sales FROM tmall_wide_idx2 WHERE (category_id IN (50003695, 350407, 350402, 50008555, 50019790, 50012144, 50019649, 50008652, 120611, 121704, 50011867, 50008109, 50016107, 50008739, 50005266, 50024944, 1205, 50005050, 50005174, 50228001, 126412033, 125538002, 350203, 50011883, 1101))         AND (price >= 0)         AND (size >= 0)         AND `month` >= '2020-01-01'         AND `month` <= '2020-05-01’ // 优化的重点在于对应的品牌关键词的不同。 // 这些词的来源于我们对商品标题的切词,同样放在Array(T)中,需要时可直接查出) // 原本的写法是带品牌查一次,不带品牌查一次;将两次的结果join;若需要观察40个词,相当于进行了80个查询 SELECT ROUND(SUM(sales), 2) AS sales FROM tmall_wide_idx2 WHERE (category_id IN (50003695, 350407, 350402, 50008555, 50019790, 50012144, 50019649, 50008652, 120611, 121704, 50011867, 50008109, 50016107, 50008739, 50005266, 50024944, 1205, 50005050, 50005174, 50228001, 126412033, 125538002, 350203, 50011883, 1101))         AND LOWER(`brand_name`) = LOWER('HUAWEI\/\华\为')         AND LOWER(title) LIKE LOWER('%%\集\显%%')         AND (price >= 0)         AND (size >= 0)         AND `month` >= '2020-01-01'         AND `month` <= '2020-05-01' SELECT ROUND(SUM(sales), 2) AS sales FROM tmall_wide_idx2 WHERE (category_id IN (50003695, 350407, 350402, 50008555, 50019790, 50012144, 50019649, 50008652, 120611, 121704, 50011867, 50008109, 50016107, 50008739, 50005266, 50024944, 1205, 50005050, 50005174, 50228001, 126412033, 125538002, 350203, 50011883, 1101))         AND (price >= 0)         AND (size >= 0)         AND LOWER(title) LIKE LOWER('%%\集\显%%')         AND `month` >= '2020-01-01'         AND `month` <= '2020-05-01’ // 优化写法是合并,不管查多少个词,就都对应一个查询; // 还有个注意点是需要在条件中加上multisearchany,减小查询范围 SELECT ROUND(SUM(if(LOWER(brand_name)=LOWER('HUAWEI\/\华\为') AND LOWER(title) LIKE LOWER('%%\集\显%%'), sales, 0)), 2) AS sales1_brands, ROUND(SUM(if(LOWER(title) LIKE LOWER('%%\集\显%%'), sales, 0)), 2) AS sales1_no_brands, ROUND(SUM(if(LOWER(brand_name)=LOWER('HUAWEI\/\华\为') AND LOWER(title) LIKE LOWER('%%\声\纹%%'), sales, 0)), 2) AS sales2_brands, ROUND(SUM(if(LOWER(title) LIKE LOWER('%%\声\纹%%'), sales, 0)), 2) AS sales2_no_brands, ROUND(SUM(if(LOWER(brand_name)=LOWER('HUAWEI\/\华\为') AND LOWER(title) LIKE LOWER('%%\摄\像\头%%'), sales, 0)), 2) AS sales3_brands, ROUND(SUM(if(LOWER(title) LIKE LOWER('%%\摄\像\头%%'), sales, 0)), 2) AS sales3_no_brands FROM tmall_wide_idx2 WHERE (category_id IN (50003695, 350407, 350402, 50008555, 50019790, 50012144, 50019649, 50008652, 120611, 121704, 50011867, 50008109, 50016107, 50008739, 50005266, 50024944, 1205, 50005050, 50005174, 50228001, 126412033, 125538002, 350203, 50011883, 1101))      AND (price >= 0) AND (size >= 0)         AND (multiSearchAny(LOWER(title), ['集显', '声纹', '摄像头']))     AND `month` >= '2020-01-01'     AND `month` <= '2020-05-01'

     

      总结 其他的优化思路都大同小异 主要就是使用clickhouse内置方法,减少查询次数,提高查询效率
    Processed: 0.018, SQL: 9