Polars fast UUID4 string generation

import polars as pl
import polars.selectors as cs
import numpy as np
import uuid

import polars_uuid4
pl.__version__
'0.20.7'
Make a dataframe with 10 million random numbers
df = pl.DataFrame({
    'Random numbers': np.random.rand(10000000),
    'A string column': "value",
}).with_row_index()
df.tail()
shape: (5, 3)

| index | Random numbers | A string column |
|---|---|---|
| u32 | f64 | str |
| 9999995 | 0.342875 | "value" |
| 9999996 | 0.283626 | "value" |
| 9999997 | 0.91639 | "value" |
| 9999998 | 0.299616 | "value" |
| 9999999 | 0.460211 | "value" |
Create 10 million UUID4s
  • with_uuid4() accepts an argument that sets the name of the new series; the name defaults to uuid
df.uuid.with_uuid4()
shape: (10_000_000, 4)

| index | Random numbers | A string column | uuid |
|---|---|---|---|
| u32 | f64 | str | str |
| 0 | 0.431903 | "value" | "{57cfa3fd-01a5… |
| 1 | 0.198707 | "value" | "{3e418a42-db42… |
| 2 | 0.626431 | "value" | "{1e16aeb2-0675… |
| 3 | 0.790102 | "value" | "{e1129c0a-38e1… |
| 4 | 0.907382 | "value" | "{8ad58341-ab23… |
| 5 | 0.995303 | "value" | "{83ed9d53-30a5… |
| 6 | 0.998931 | "value" | "{2ce35a0f-9981… |
| 7 | 0.836289 | "value" | "{655d0891-0f1b… |
| 8 | 0.872352 | "value" | "{77fec4e7-1a23… |
| 9 | 0.529137 | "value" | "{912c7ff7-0a12… |
| 10 | 0.322931 | "value" | "{d6402d0d-b5ab… |
| 11 | 0.456256 | "value" | "{26c89cc9-d740… |
| … | … | … | … |
| 9999988 | 0.006378 | "value" | "{ddc657e6-2fa7… |
| 9999989 | 0.50514 | "value" | "{3a7f87a4-23de… |
| 9999990 | 0.708277 | "value" | "{b51b0665-32a0… |
| 9999991 | 0.743679 | "value" | "{5fe2070b-9d4c… |
| 9999992 | 0.937289 | "value" | "{11b6f029-6d44… |
| 9999993 | 0.763785 | "value" | "{44b87135-d0a7… |
| 9999994 | 0.913705 | "value" | "{9127c91c-2a4f… |
| 9999995 | 0.342875 | "value" | "{4dcc6d5e-97da… |
| 9999996 | 0.283626 | "value" | "{3b34e5ff-1047… |
| 9999997 | 0.91639 | "value" | "{d32b1a17-50ba… |
| 9999998 | 0.299616 | "value" | "{71ad3545-fe92… |
| 9999999 | 0.460211 | "value" | "{5ca39c0a-9993… |
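The uuid column above is truncated in the display, so only the leading `{` and the first two groups are visible. Assuming each full value is a canonical, hyphenated UUID4 wrapped in curly braces, a spot check on a few values could look like the sketch below. The helper name `is_braced_uuid4` is hypothetical and is not part of polars_uuid4; it only uses Python's standard uuid module.

```python
import uuid


def is_braced_uuid4(value: str) -> bool:
    """Return True if value looks like '{xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx}'.

    Hypothetical helper for spot checks; the braced format is an assumption
    based on the truncated output, not a documented guarantee.
    """
    # 36 characters for the hyphenated UUID plus the two braces.
    if len(value) != 38 or not (value.startswith("{") and value.endswith("}")):
        return False
    inner = value[1:-1]
    try:
        parsed = uuid.UUID(inner)
    except ValueError:
        return False
    # Require the canonical hyphenated form and the version-4 marker.
    return str(parsed) == inner.lower() and parsed.version == 4


sample = "{" + str(uuid.uuid4()) + "}"
print(is_braced_uuid4(sample))
print(is_braced_uuid4("not-a-uuid"))
```

Running this over a small sample of the column (for example the first few rows converted to a Python list) is usually enough; checking all 10 million rows in pure Python would be slow.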

Works with a lazy frame too

df = pl.LazyFrame({
    'Random numbers': np.random.rand(10000000),
    'A string column': "value",
}).with_row_index().uuid.with_uuid4().collect()
df.tail()
shape: (5, 4)

| index | Random numbers | A string column | uuid |
|---|---|---|---|
| u32 | f64 | str | str |
| 9999995 | 0.185959 | "value" | "{c4baf1ce-98c5… |
| 9999996 | 0.005801 | "value" | "{172ddf3c-ea9b… |
| 9999997 | 0.606094 | "value" | "{3dc75c0d-19fd… |
| 9999998 | 0.268984 | "value" | "{f9a4f709-a2e9… |
| 9999999 | 0.22677 | "value" | "{75f6c83d-a693… |
My old way to generate a UUID4 for each row
  • Gets the job done: creates a UUID4 for each row.
  • Uses Python's uuid module.
  • Takes a long time (by Polars standards).
    • 20.4 s ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
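%%timeit
# Build one braced UUID4 string per row in pure Python.
uuids = ["{" + str(uuid.uuid4()) + "}" for _ in range(len(df))]
# Wrap the list in a Series and attach it as a new column.
uuid_series = pl.Series(name="python_UUID", values=uuids)
df.with_columns(uuid_series)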
20.4 s ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
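`%%timeit` is an IPython cell magic, so the measurement above only runs inside a notebook. A rough way to time the same pure-Python approach from a plain script is the standard timeit module, sketched below. The function name `make_braced_uuids` and the reduced row count are illustrative assumptions, chosen so the example finishes quickly; the numbers it prints are not comparable to the 10-million-row timing above.

```python
import timeit
import uuid


def make_braced_uuids(n: int) -> list[str]:
    """Generate n braced UUID4 strings with Python's uuid module."""
    return ["{" + str(uuid.uuid4()) + "}" for _ in range(n)]


# Use a small row count so the script finishes quickly (assumption:
# cost grows roughly linearly with the number of rows).
n_rows = 10_000

# Run the generator 5 times, once per measurement, and keep the best time.
best_seconds = min(
    timeit.repeat(lambda: make_braced_uuids(n_rows), number=1, repeat=5)
)
print(f"{n_rows} rows: {best_seconds * 1e3:.1f} ms (best of 5)")
```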
Using pl_uuid to generate a UUID4 for each row
  • Gets the job done: creates a UUID4 for each row.
  • Uses the Rust uuid crate.
  • Much simpler code that is easier to understand.
  • About 40x faster than generating UUID4s with Python's uuid module when the last column in the dataframe is already a string.
  • 512 ms ± 6.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df.uuid.with_uuid4()
512 ms ± 6.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Not quite as fast if there isn't an existing string column in the dataframe
  • 644 ms ± 6.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
df = pl.DataFrame({
    'Random numbers': np.random.rand(10000000),
}).with_row_index()
df.tail()
shape: (5, 2)

| index | Random numbers |
|---|---|
| u32 | f64 |
| 9999995 | 0.313362 |
| 9999996 | 0.679717 |
| 9999997 | 0.076164 |
| 9999998 | 0.853126 |
| 9999999 | 0.892428 |
%%timeit
df.uuid.with_uuid4()
644 ms ± 6.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
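To put the three timings side by side, the speedups can be worked out directly from the `%%timeit` outputs reported above. This is only arithmetic on those published means; the variable names are illustrative, and the figures will differ on other hardware.

```python
# Mean timings copied from the %%timeit outputs above, in seconds.
python_uuid_s = 20.4             # Python uuid module, list comprehension
plugin_with_str_col_s = 0.512    # with_uuid4(), string column already present
plugin_without_str_col_s = 0.644 # with_uuid4(), no existing string column


def speedup(baseline_s: float, candidate_s: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


print(f"with string column:    {speedup(python_uuid_s, plugin_with_str_col_s):.1f}x")
print(f"without string column: {speedup(python_uuid_s, plugin_without_str_col_s):.1f}x")
# → with string column:    39.8x
# → without string column: 31.7x
```

So even without an existing string column, the measured run is still roughly 32x faster than the pure-Python approach.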