Skip to content

Latest commit

 

History

History
648 lines (588 loc) · 24 KB

20211002090535-100knocks.org

File metadata and controls

648 lines (588 loc) · 24 KB

100knocks

概要

データサイエンス100本ノック(構造化データ加工編)はデータ解析を各種言語(SQL, R, Python)で練習するためのリポジトリ。 環境構築、データのセットがDockerで簡単に準備できる。 標準はJupyter Notebookを使うようになっているが、使いにくいのでorg-modeを使用してSQLを書く。

環境構築

Emacs / org-modeで実行できるようにする。

  • Repository: データサイエンス100本ノック
  • READMEの通りに docker-compose を実行する。
  • 標準のJupyter Notebookで問題なくアクセスできることを確認する。
  • init.elに追加。babelでSQLをロードする。これをしないと実行できない。
(org-babel-do-load-languages
 'org-babel-load-languages
 '((sql . t)))
  • 実行したいorgファイル見出しにDB接続設定を追加。100本ノックのPostgreSQLのデフォルト設定になっている(READMEに記載されてる)。
    :PROPERTIES:
      :header-args+: :results table
      :header-args+: :engine postgresql
      :header-args+: :dbhost localhost
      :header-args+: :dbuser postgres
      :header-args+: :dbpassword postgres12345
      :header-args+: :database dsdojo_db
    :END:
        
  • コードブロックで C-c C-c で実行できる。
SELECT COUNT(1) FROM store CROSS JOIN product;
  
count
531590

Memo

:header-args+: :results table :header-args+: :engine postgresql :header-args+: :dbhost localhost :header-args+: :dbuser postgres :header-args+: :dbpassword postgres12345 :header-args+: :database dsdojo_db

等しい場合は同一順位: RANK

S-019: レシート明細テーブル(receipt)に対し、1件あたりの売上金額(amount)が高い順にランクを付与し、先頭10件を抽出せよ。項目は顧客ID(customer_id)、売上金額(amount)、付与したランクを表示させること。なお、売上金額(amount)が等しい場合は同一順位を付与するものとする。

SELECT customer_id, amount, RANK() OVER(ORDER BY amount DESC) AS ranking
FROM receipt
LIMIT 10
customer_idamountranking
CS011415000006109251
ZZ00000000000068002
CS02860500000257803
ZZ00000000000054804
ZZ00000000000054804
CS01551500003454804
CS02151500008954407
ZZ00000000000054407
CS02051500010252809
CS02151500008952809

等しい場合でも別順位: ROW_NUMBER

S-020: レシート明細テーブル(receipt)に対し、1件あたりの売上金額(amount)が高い順にランクを付与し、先頭10件を抽出せよ。項目は顧客ID(customer_id)、売上金額(amount)、付与したランクを表示させること。なお、売上金額(amount)が等しい場合でも別順位を付与すること。

SELECT customer_id, amount, ROW_NUMBER() OVER(ORDER BY amount DESC) AS ranking
FROM receipt
LIMIT 10
customer_idamountranking
CS011415000006109251
ZZ00000000000068002
CS02860500000257803
CS01551500003454804
ZZ00000000000054805
ZZ00000000000054806
ZZ00000000000054407
CS02151500008954408
ZZ00000000000052809
CS009415000038528010

ユニーク件数

S-022: レシート明細テーブル(receipt)の顧客ID(customer_id)に対し、ユニーク件数をカウントせよ。

SELECT count(distinct customer_id) FROM receipt
count
8307

GROUP BY, 日付でのMAX

S-024: レシート明細テーブル(receipt)に対し、顧客ID(customer_id)ごとに最も新しい売上日(sales_ymd)を求め、10件表示せよ。

SELECT customer_id, MAX(sales_ymd)
FROM receipt
GROUP BY customer_id
LIMIT 10
customer_idmax
CS00131100005920180211
CS00461400012220181228
CS00351200004320180106
CS01161500006120190503
CS02921200003320180621
CS00751500011920190511
CS03451500012320190708
CS00431500005820170517
CS02641400001420190720
CS00161500009920170729

集計結果に対する条件: HAVING

S-026: レシート明細テーブル(receipt)に対し、顧客ID(customer_id)ごとに最も新しい売上日(sales_ymd)と古い売上日を求め、両者が異なるデータを10件表示せよ。

SELECT customer_id, MAX(sales_ymd), MIN(sales_ymd)
FROM receipt
GROUP BY customer_id
HAVING MAX(sales_ymd) != MIN(sales_ymd)
LIMIT 10
customer_idmaxmin
CS0292120000332018062120170318
CS0075150001192019051120170201
CS0345150001232019070820170527
CS0264140000142019072020170718
CS0105150000822018120420180518
CS0193150000452017092020170423
CS0085130000992019030820170722
CS0076150000702019102520170929
CS0254150001552019102620170314
CS0164140000632019061720170109

中央値: PERCENTILE_CONT

S-028: レシート明細テーブル(receipt)に対し、店舗コード(store_cd)ごとに売上金額(amount)の中央値を計算し、降順でTOP5を表示せよ。

SELECT
store_cd,
PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY amount) as amount_50per
FROM receipt
GROUP BY store_cd
ORDER BY amount_50per desc
LIMIT 5
store_cdamount_50per
S13052190
S14010188
S14050185
S13003180
S13018180

サブクエリ: WITH

S-029: レシート明細テーブル(receipt)に対し、店舗コード(store_cd)ごとに商品コード(product_cd)の最頻値を求めよ。

WITH product_mode AS (
    SELECT store_cd,product_cd, COUNT(1) as mode_cnt,
        RANK() OVER(PARTITION BY store_cd ORDER BY COUNT(1) DESC) AS rnk
    FROM receipt
    GROUP BY store_cd,product_cd
)
SELECT store_cd,product_cd, mode_cnt
FROM product_mode
WHERE rnk = 1
ORDER BY store_cd,product_cd;
store_cdproduct_cdmode_cnt
S12007P06030300172
S12013P060303001107
S12014P06030300165
S12029P06030300192
S12030P060303001115
S13001P06030300167
S13002P06030300178
S13003P07140100165
S13004P06030300188
S13005P04050300136
S13008P06030300177
S13009P06030300164
S13015P07140100134
S13016P07110200132
S13017P06010100231
S13018P07140100147
S13019P07140100170
S13020P07140100179
S13031P060303001115
S13032P06030300185
S13035P04050300139
S13037P06030300188
S13038P06030300141
S13039P07140100136
S13041P07140100170
S13043P06030300156
S13044P06030300196
S13051P0501020015
S13051P0710030015
S13051P0808040015
S13052P0501010014
S14006P06030300170
S14010P06030300168
S14011P06010100151
S14012P060303001142
S14021P06010100130
S14022P06030300171
S14023P07140100170
S14024P06030300196
S14025P06030300146
S14026P07140100140
S14027P060303001152
S14028P060303001140
S14033P07140100168
S14034P06030300171
S14036P04050300119
S14036P06010100119
S14040P06030300180
S14042P05010100134
S14045P06030300133
S14046P06030300171
S14047P06030300136
S14048P05010100117
S14049P06030300155
S14050P0603030019

標本分散: VAR_SAMP

S-030: レシート明細テーブル(receipt)に対し、店舗コード(store_cd)ごとに売上金額(amount)の標本分散を計算し、降順でTOP5を表示せよ。

SELECT store_cd, var_samp(amount) as vars_amount
FROM receipt
GROUP BY store_cd
ORDER BY vars_amount desc
LIMIT 5
store_cdvars_amount
S13052441863.252526233968
S14011306442.242431568709
S14034297068.392740060738
S13001295558.842617712478
S13015295427.197085853584

標本標準偏差: STDDEV_SAMP

S-031: レシート明細テーブル(receipt)に対し、店舗コード(store_cd)ごとに売上金額(amount)の標本標準偏差を計算し、降順でTOP5を表示せよ。

SELECT store_cd, stddev_samp(amount) as stds_amount
FROM receipt
GROUP BY store_cd
ORDER BY stds_amount desc
LIMIT 5
store_cdstds_amount
S13052664.727953772244
S14011553.572255836190
S14034545.039808399406
S13001543.653237475610
S13015543.532149082144

%刻み: PERCENTILE_CONT

S-032: レシート明細テーブル(receipt)の売上金額(amount)について、25%刻みでパーセンタイル値を求めよ。

SELECT
    PERCENTILE_CONT(0.25) WITHIN GROUP(ORDER BY amount) as amount_25per,
    PERCENTILE_CONT(0.50) WITHIN GROUP(ORDER BY amount) as amount_50per,
    PERCENTILE_CONT(0.75) WITHIN GROUP(ORDER BY amount) as amount_75per,
    PERCENTILE_CONT(1.0) WITHIN GROUP(ORDER BY amount) as amount_100per
FROM receipt
amount_25peramount_50peramount_75peramount_100per
10217028810925

サブクエリ: WITH

S-034: レシート明細テーブル(receipt)に対し、顧客ID(customer_id)ごとに売上金額(amount)を合計して全顧客の平均を求めよ。ただし、顧客IDが”Z”から始まるのものは非会員を表すため、除外して計算すること。

WITH customer_amount AS (
    SELECT customer_id, SUM(amount) AS sum_amount
    FROM receipt
    WHERE customer_id not like 'Z%'
    GROUP BY customer_id
)
SELECT AVG(sum_amount) from customer_amount
avg
2547.7422345292559595

WHEREの条件に集約関数を使う

S-035: レシート明細テーブル(receipt)に対し、顧客ID(customer_id)ごとに売上金額(amount)を合計して全顧客の平均を求め、平均以上に買い物をしている顧客を抽出せよ。ただし、顧客IDが”Z”から始まるのものは非会員を表すため、除外して計算すること。なお、データは10件だけ表示させれば良い。

WITH customer_amount AS (
    SELECT customer_id, SUM(amount) AS sum_amount
    FROM receipt
    WHERE customer_id not like 'Z%'
    GROUP BY customer_id
)
SELECT customer_id, sum_amount
FROM customer_amount
WHERE sum_amount >= (SELECT AVG(sum_amount) from customer_amount)
limit 10
customer_idsum_amount
CS0292120000333604
CS0075150001197157
CS0345150001233699
CS0264140000146671
CS0076150000702975
CS0164140000636207
CS0125140000182562
CS0295150001423420
CS0152150000213090
CS0398140000118031

結合のスマートな書き方

S-036: レシート明細テーブル(receipt)と店舗テーブル(store)を内部結合し、レシート明細テーブルの全項目と店舗テーブルの店舗名(store_name)を10件表示させよ。

SELECT r.*, s.store_name
FROM receipt r
JOIN store s
ON r.store_cd = s.store_cd
LIMIT 10
sales_ymdsales_epochstore_cdreceipt_noreceipt_sub_nocustomer_idproduct_cdquantityamountstore_name
201811031541203200S140061121CS006214000001P0703050121158葛が谷店
201811181542499200S1300811322CS008415000097P070701017181成城店
201707121499817600S1402811021CS028414000014P0601010051170二ツ橋店
201902051549324800S1404211321ZZ000000000000P050301001125新山下店
201808211534809600S1402511022CS025415000050P060102007190大和店
201906051559692800S1300311121CS003515000195P0501020021138狛江店
201812051543968000S1402411022CS024514000042P080101005130三田店
201909221569110400S1404011021CS040415000178P0705010041128長津田店
201705041493856000S1302011122ZZ000000000000P0713020101770十条仲原店
201910101570665600S1402711021CS027514000015P0711010031680南藤沢店

値がないときのデフォルト値: COALESCE

S-038: 顧客テーブル(customer)とレシート明細テーブル(receipt)から、各顧客ごとの売上金額合計を求めよ。ただし、売上実績がない顧客については売上金額を0として表示させること。また、顧客は性別コード(gender_cd)が女性(1)であるものを対象とし、非会員(顧客IDが’Z’から始まるもの)は除外すること。なお、結果は10件だけ表示させれば良い。

WITH customer_amount AS (
    SELECT customer_id, SUM(amount) AS sum_amount
    FROM receipt
    GROUP BY customer_id
)
SELECT c.customer_id, COALESCE(a.sum_amount,0)
FROM customer c
LEFT JOIN customer_amount a
ON c.customer_id = a.customer_id
WHERE c.gender_cd = '1'
      and c.customer_id not like 'Z%'
LIMIT 10
customer_idcoalesce
CS0213130001140
CS0314150001725088
CS0288110000010
CS001215000145875
CS0154140001033122
CS033513000180868
CS0356140000140
CS0112150000483444
CS0094130000790
CS040412000191210

複数のWITH

S-039: レシート明細テーブル(receipt)から売上日数の多い顧客の上位20件と、売上金額合計の多い顧客の上位20件を抽出し、完全外部結合せよ。ただし、非会員(顧客IDが’Z’から始まるもの)は除外すること。

WITH customer_days AS (
    select customer_id, count(distinct sales_ymd) come_days
    FROM receipt
    WHERE customer_id NOT LIKE 'Z%'
    GROUP BY customer_id
    ORDER BY come_days DESC LIMIT 20
),
customer_amount AS (
    SELECT customer_id, sum(amount) buy_amount
    FROM receipt
    WHERE customer_id NOT LIKE 'Z%'
    GROUP BY customer_id
    ORDER BY buy_amount DESC LIMIT 20
)
SELECT COALESCE(d.customer_id, a.customer_id), d.come_days, a.buy_amount
FROM customer_days d
FULL JOIN customer_amount a
ON d.customer_id = a.customer_id;
coalescecome_daysbuy_amount
CS04021400000823
CS0154150001852220153
CS0102140000102218585
CS0284150000072119127
CS01021400000221
CS0174150000972023086
CS0164150001412018372
CS02151400004519
CS02251500022619
CS0314140000511919202
CS03941400005219
CS01421400002319
CS02151500017219
CS03141400007318
CS00751500010718
CS01441500007718
CS02151500005618
CS03241500020918
CS02151500021118
CS02251500002818
CS01141500000616094
CS01641500010116348
CS03041500003415468
CS02151500008917580
CS03441500004716083
CS00651500002318372
CS03841500010417847
CS01551500003415300
CS03241400007216563
CS01141400010618338
CS00160500000918925
CS00941400005915492
CS03541400002417615
CS00751400009415735

集計と、表示を分離するイメージ。with句で集計をやり、その結果をメインの句で好きに加工できる。

直積: CROSS JOIN

S-040: 全ての店舗と全ての商品を組み合わせると何件のデータとなるか調査したい。店舗(store)と商品(product)を直積した件数を計算せよ。

SELECT COUNT(1) FROM store CROSS JOIN product;
count
531590

差分: LAG

S-041: レシート明細テーブル(receipt)の売上金額(amount)を日付(sales_ymd)ごとに集計し、前日からの売上金額増減を計算せよ。なお、計算結果は10件表示すればよい。

WITH sales_amount_by_date AS (
    SELECT sales_ymd, SUM(amount) as amount FROM receipt
    GROUP BY sales_ymd
    ORDER BY sales_ymd
)
SELECT sales_ymd, LAG(sales_ymd, 1) OVER(ORDER BY sales_ymd) lag_ymd,
    amount,
    LAG(amount, 1) OVER(ORDER BY sales_ymd) as lag_amount,
    amount - LAG(amount, 1) OVER(ORDER BY sales_ymd) as diff_amount
FROM sales_amount_by_date
LIMIT 10;
sales_ymdlag_ymdamountlag_amountdiff_amount
2017010133723
20170102201701012416533723-9558
201701032017010227503241653338
201701042017010336165275038662
201701052017010437830361651665
20170106201701053238737830-5443
20170107201701062341532387-8972
201701082017010724737234151322
201701092017010826718247371981
20170110201701092014326718-6575

複数のJOIN条件

S-042: レシート明細テーブル(receipt)の売上金額(amount)を日付(sales_ymd)ごとに集計し、各日付のデータに対し、1日前、2日前、3日前のデータを結合せよ。結果は10件表示すればよい。

WITH sales_amount_by_date AS (
    SELECT sales_ymd, SUM(amount) as amount FROM receipt
    GROUP BY sales_ymd
    ORDER BY sales_ymd
),
sales_amount_lag_date AS (
    SELECT sales_ymd,
        COALESCE(LAG(sales_ymd, 3) OVER (ORDER BY sales_ymd),
        MIN(sales_ymd) OVER (PARTITION BY NULL)) as lag_date_3,
        amount
    FROM sales_amount_by_date
)
SELECT a.sales_ymd, b.sales_ymd as lag_ymd,
    a.amount as amount, b.amount as lag_amount
FROM sales_amount_lag_date a
JOIN sales_amount_lag_date b
ON b.sales_ymd >= a.lag_date_3
    and b.sales_ymd < a.sales_ymd
ORDER BY sales_ymd, lag_ymd
LIMIT 10;
sales_ymdlag_ymdamountlag_amount
20170102201701012416533723
20170103201701012750333723
20170103201701022750324165
20170104201701013616533723
20170104201701023616524165
20170104201701033616527503
20170105201701023783024165
20170105201701033783027503
20170105201701043783036165
20170106201701033238727503

日付の変換

S-045: 顧客テーブル(customer)の生年月日(birth_day)は日付型でデータを保有している。これをYYYYMMDD形式の文字列に変換し、顧客ID(customer_id)とともに抽出せよ。データは10件を抽出すれば良い。

SELECT customer_id, TO_CHAR(birth_day, 'YYYYMMDD') FROM customer LIMIT 10;
customer_idto_char
CS02131300011419810429
CS03761300007119520401
CS03141500017219761004
CS02881100000119330327
CS00121500014519950329
CS02040100001619740915
CS01541400010319770809
CS02940300000819730817
CS01580400000419310502
CS03351300018019620711

練習用

Tasks

1~50問

Reference

Archives