Compute running time for pipelines in Pycharm (pstat) #55

Lillliant · 2022-11-01T00:15:38Z

Update

PyCharm can profile function execution time, call counts, and call graphs, so I've been using it to profile the different pipeline combinations. GitHub doesn't seem to support .pstat attachments, so I uploaded the files to MS Teams for the time being. They can be viewed in PyCharm via Tools > Open CProfile snapshot. Hovering over a function also shows the location of its code.

Current Issue

As of now, I cannot run any of the pipelines except the lda.gensim combinations after the new updates to the code. I initially thought it was because I hadn't installed bitermplus, so I tried to resolve the "C++ redistributable version must be 14 and above" issue by downloading the C++ build tools along with the redistributable. I don't think that solved the issue, so I'm looking through the code to see what the problem may be.

Planned Next Step

- Compute the lda.gensim pipeline execution time for the new revisions
- Identify potential issues with the pipeline
- Compute the execution time for the rest of the pipelines once they become executable
- Implement the profiling as a "native" feature, like what Sharjeel did with the carbon footprint computation. It seems that PyCharm uses the cProfile package, so using it within the code may be a good starting point
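
As a rough illustration of the last step, here is a minimal sketch of calling cProfile from inside the code and saving a .pstat snapshot. `run_pipeline`, `profile_to_file`, and the output file name are hypothetical stand-ins, not SEERa code; the saved file should be the same kind of snapshot that PyCharm opens.

```python
import cProfile
import os
import pstats
import tempfile


def run_pipeline():
    # Hypothetical stand-in for one pipeline run (not SEERa code).
    return sum(i * i for i in range(10_000))


def profile_to_file(func, out_path):
    """Run func under cProfile and save the raw stats to out_path."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func()
    finally:
        profiler.disable()
    profiler.dump_stats(out_path)
    return result


out_path = os.path.join(tempfile.mkdtemp(), "pipeline.pstat")
result = profile_to_file(run_pipeline, out_path)

# Reload the snapshot and show the five most expensive calls by cumulative time.
stats = pstats.Stats(out_path)
stats.sort_stats("cumulative").print_stats(5)
```

Wrapping the call in `enable()`/`disable()` keeps the profiler scoped to one pipeline run, so each run can write its own snapshot.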

hosseinfani · 2022-11-01T00:48:22Z

@Lillliant Thanks for the update

@soroush-ziaeinejad Why is the codeline not stable anymore?

soroush-ziaeinejad · 2022-11-02T00:46:01Z

Hi @hosseinfani and @Lillliant

I checked the latest code on GitHub and it worked for all combinations on the toy.synthetic dataset (except gsdmm, due to a bug that is now resolved; please pull). On bigger datasets, you might face a MemoryError.

One thing that you should consider is parameter tuning. For instance, setting the number of topics to 50 on toy.synthetic may produce very low weights in the user graphs, and then all connections are cut off because their weights fall below the threshold. An error may then be raised for trying to create a graph without any connections (it says the summation of weights should be non-negative and non-zero).
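
To make the failure mode concrete, here is a small sketch of threshold pruning followed by a guard that fails early with a readable message. The edge list, the threshold values, and both helper functions are hypothetical and only illustrate the idea; they are not taken from SEERa.

```python
def prune_edges(edges, threshold):
    """Keep only the edges whose weight is at or above the threshold."""
    return [(u, v, w) for u, v, w in edges if w >= threshold]


def check_graph_weights(edges):
    """Raise a readable error before building a graph with no usable weight."""
    total = sum(w for _, _, w in edges)
    if total <= 0:
        raise ValueError("no edges survived the threshold; lower it or use fewer topics")
    return total


# Hypothetical user-similarity edges: (user_a, user_b, weight).
edges = [("u1", "u2", 0.04), ("u2", "u3", 0.02), ("u1", "u3", 0.01)]

kept_low = prune_edges(edges, threshold=0.03)   # one edge survives
kept_high = prune_edges(edges, threshold=0.5)   # every edge is cut

print(check_graph_weights(kept_low))
try:
    check_graph_weights(kept_high)
except ValueError as err:
    print("skipping graph:", err)
```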

If you still have problems with running SEERa, please send me the ParamsTemplate.py and a screenshot of the error.

Thanks :)

hosseinfani · 2022-11-02T01:57:37Z

@Sharjeeliv fyi, see above.

hosseinfani · 2022-11-13T02:18:18Z

@Lillliant
Hi Christine, I expect you to update us weekly or so. Please work with @soroush-ziaeinejad to solve the performance issues. Thanks.

Lillliant · 2022-11-13T02:49:19Z

Hi @hosseinfani @soroush-ziaeinejad, sorry for not posting updates sooner. I think I'll be able to post more frequently now that most of my personal and school matters have been resolved.

So far:

gsdmm is working for me now, but lda.mallet and btm are still not working. I looked at Sharjeel's idea, so I've been checking my mallet setup on Windows.
- For mallet, I've been testing the different fields in TopicModelling.py to see how they work. It looks like the program can see my mallet.bat, though the slashes (/ vs. \) are inconsistent and might contribute to the errors.
- I've changed the number of topics in ParamsTemplate.py from 3 to other values (2, 1, 0, 10, ...) and haven't been able to resolve the issue. It looks like btm and mallet are both stuck on not being able to find 3Topics.pkl. I've attached the log files and ParamsTemplate.py (as a .txt file, since GitHub doesn't support .py files) below.
- Log_mallet.txt
- Log_btm.txt
- ParamsTemplate.txt
Uploaded the pstat runs to MS Teams. At first glance, lda.gensim is several minutes slower than the other runnable pipelines, especially at the TML stage. However, this might be because it prints the progress of the topic modelling.
Currently looking at different profiling packages (cProfile and line_profiler). I installed line_profiler first and am trying to get the profiling output into a readable format, since line_profiler produces .lprof files that I'm still looking for ways to open.
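
On the mixed slashes mentioned above: one way to rule them out is to normalize the path before it is used. This is only a sketch, and the path below is a hypothetical example, not the actual mallet location; whether the separators are the real cause of the error is not confirmed.

```python
from pathlib import PureWindowsPath


def normalize_windows_path(raw):
    """Return a Windows path string with consistent backslash separators."""
    return str(PureWindowsPath(raw))


# Hypothetical location; the real mallet.bat path depends on the local install.
mixed = "C:/tools/mallet/bin\\mallet.bat"
print(normalize_windows_path(mixed))
```

`PureWindowsPath` only manipulates the string, so it behaves the same on any operating system and does not check that the file exists.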

Lillliant · 2022-11-19T02:25:03Z

Update

Implemented line_profiler. Since it profiles every line, I decided to make it so that the result is stored in a separate file from Log.txt. The preliminary results can be seen here:
LineProfile.txt

Next Step

- Try to see if line_profiler can profile functions imported from other layers
- See if the output file can be sorted so that only significant lines are shown

Lillliant · 2022-11-26T03:35:51Z

Update

- Added libraries to requirement.txt and environment.yml
- Modified the previous code so that line profiling is performed on each combination of the tml and gel baselines, instead of on the aggregate of all combinations. The result is stored in the respective folder of each combination, rather than alongside Log.txt, to avoid clutter
- Read the documentation for line_profiler and cProfile. If line_profiler cannot trace calls to functions imported from other layers, then maybe cProfile can help sort the most common function calls
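
The per-combination idea can be sketched with cProfile alone. This is an assumption-laden illustration, not the code in the PR: `run_combination`, the gel names, the folder layout, and the file name `profile.pstat` are all hypothetical.

```python
import cProfile
import itertools
import os
import pstats
import tempfile


def run_combination(tml, gel):
    # Hypothetical stand-in for running one tml/gel combination.
    return sorted(f"{tml}-{gel}-{i}" for i in range(1_000))


def profile_combinations(tml_names, gel_names, out_root):
    """Profile every tml/gel pair and save one stats file per pair."""
    saved = {}
    for tml, gel in itertools.product(tml_names, gel_names):
        combo_dir = os.path.join(out_root, f"{tml}.{gel}")
        os.makedirs(combo_dir, exist_ok=True)
        profiler = cProfile.Profile()
        profiler.runcall(run_combination, tml, gel)
        stats_path = os.path.join(combo_dir, "profile.pstat")
        profiler.dump_stats(stats_path)
        saved[(tml, gel)] = stats_path
    return saved


out_root = tempfile.mkdtemp()
saved = profile_combinations(["lda.gensim", "gsdmm"], ["gel_a", "gel_b"], out_root)

# Rank functions by call count to surface the most frequently called ones.
for combo, path in saved.items():
    print(combo)
    pstats.Stats(path).sort_stats("calls").print_stats(3)
```

Using a fresh `Profile` object per pair keeps the numbers for one combination from leaking into the next.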

hosseinfani · 2022-11-27T08:28:11Z

@Lillliant
Thank you for the PR and clean code/documentation :)

soroush-ziaeinejad · 2022-11-29T02:57:29Z

Hey @Lillliant

Thanks for your PR. I didn't look at the code carefully but I get this error when I run the code with argument -p True.

main.py: error: unrecognized arguments: True

Should I change something?

Lillliant · 2022-11-29T03:00:44Z

Hi @soroush-ziaeinejad, just the command line flag -p is sufficient to activate the profiler:
python -u main.py -r toy -t [...] -g [...] -p
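
For context, this is how a value-less switch behaves in argparse. The sketch assumes `-p` is declared with `action="store_true"`; the long name `--profile` and the help text are hypothetical and may not match main.py.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="main.py")
    # store_true makes -p a switch: present means True, absent means False.
    parser.add_argument("-p", "--profile", action="store_true",
                        help="enable profiling")
    return parser


parser = build_parser()
print(parser.parse_args(["-p"]).profile)
print(parser.parse_args([]).profile)

# A store_true flag takes no value, so a trailing "True" is left over
# and argparse exits with an "unrecognized arguments" error.
try:
    parser.parse_args(["-p", "True"])
except SystemExit:
    print("rejected: -p True")
```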

Lillliant · 2022-12-02T23:35:34Z

Update

Adding the final touch: a way to report the top N functions by execution time. It looks like I can filter functions by module name with cProfile, so I want to see whether line_profiler can do this too and how its implementation differs from cProfile's. I expect to open another PR next week or the week after.

Lillliant · 2022-12-23T02:10:05Z

Update

- Added function profiling (currently filtered by the location of the function's code; it profiles only SEERa's own code for now, but can be customized with keywords to profile certain functions from external libraries)
- Updated the log file to include the date and time of baseline execution
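
One way to get this kind of location-based filtering with the standard library is the restriction arguments of `pstats.Stats.print_stats`, which match a regular expression against each entry's `file:line(function)` string. The sketch below is an assumption about the approach, not the code in the PR; `own_code` and `report` are hypothetical, and `json` stands in for an external library keyword.

```python
import cProfile
import io
import json
import pstats


def own_code():
    # Stand-in for project code; json.dumps plays the external library.
    return [json.dumps({"i": i}) for i in range(200)]


profiler = cProfile.Profile()
profiler.runcall(own_code)


def report(profiler, pattern, top_n=10):
    """Return a text report of the top_n entries whose location matches pattern."""
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Restrictions apply in order: regex filter first, then the count.
    stats.sort_stats("cumulative").print_stats(pattern, top_n)
    return buffer.getvalue()


external_only = report(profiler, r"json")
print(external_only)
```

Passing a fragment of the project's own path as `pattern` would limit the report to project code in the same way.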

hosseinfani assigned Lillliant Nov 1, 2022

hosseinfani added the "enhancement" (New feature or request) label Nov 1, 2022
