Compute running time for pipelines in Pycharm (pstat) #55

Open
4 tasks
Lillliant opened this issue Nov 1, 2022 · 12 comments
@Lillliant
Member

Lillliant commented Nov 1, 2022

@hosseinfani @soroush-ziaeinejad

Update

PyCharm can profile function execution time, call counts, and call graphs, so I've been using it to profile the different pipeline combinations. GitHub doesn't seem to support .pstat files, so I've uploaded them to MS Teams for the time being. They can be viewed in PyCharm via Tools>Open CProfile snapshot. Hovering over a function also shows where its code is located.

Current Issue

For now, I cannot run any pipeline except the lda.gensim combinations after the latest updates to the code. I initially thought this was because I hadn't installed bitermplus, so I tried to resolve the "C++ redistributable version must be 14 and above" issue by downloading the C++ build tools along with the redistributable. That doesn't seem to have solved it, so I'm looking through the code to see what else may be causing the problem.

Planned Next Step

  • Compute the lda.gensim pipeline execution time for the new revisions
  • Identify potential issues with the pipeline
  • Compute the execution time for the rest of the pipelines once they become executable
  • Implement the profiling as a "native" feature, like what Sharjeel did with the carbon footprint computation: it seems that PyCharm uses the cProfile package, so using it within the code may be a good starting point
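
As a rough sketch of the last step, the standard-library cProfile module can be called from within the code and its raw stats dumped to a file, which is the kind of snapshot PyCharm's viewer opens. The names `run_pipeline`, `profile_to_pstat`, and `pipeline.pstat` are hypothetical placeholders, not SEERa's actual entry points or output paths.

```python
import cProfile
import pstats
import tempfile
from pathlib import Path


def run_pipeline():
    # Hypothetical stand-in for one pipeline run (not SEERa's real entry point).
    return sum(i * i for i in range(10_000))


def profile_to_pstat(func, out_path):
    """Run func under cProfile and dump the raw stats to out_path."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func()
    finally:
        # Always stop profiling, even if the pipeline raises.
        profiler.disable()
    profiler.dump_stats(str(out_path))
    return result


# Assumed output location; a temporary folder is used here for illustration.
out_file = Path(tempfile.mkdtemp()) / "pipeline.pstat"
result = profile_to_pstat(run_pipeline, out_file)

# Reload the snapshot to confirm it can be read back with pstats.
stats = pstats.Stats(str(out_file))
```

The dumped file can then be sorted and printed with `pstats`, or opened in a viewer that reads cProfile snapshots.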
@hosseinfani hosseinfani added the enhancement New feature or request label Nov 1, 2022
@hosseinfani
Member

@Lillliant Thanks for the update

@soroush-ziaeinejad Why is the codeline not stable anymore?

@soroush-ziaeinejad
Contributor

Hi @hosseinfani and @Lillliant

I checked the latest code on GitHub and it worked for all combinations on the toy.synthetic dataset (except gsdmm, due to a bug which is now resolved; please pull). On bigger datasets, you might face a MemoryError.

One thing you should consider is parameter tuning. For instance, setting the number of topics to 50 on toy.synthetic may produce very low weights in the user graphs, so all connections are cut off because their weights fall under the threshold. An error may then be raised for trying to create a graph without any connections (it says the summation of weights should be non-negative and non-zero).
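
To illustrate the failure mode described above, here is a minimal sketch of a pre-check that could run before graph construction. It is not SEERa's code: `prune_edges`, `check_graph_weights`, the edge dictionaries, and the rule that an edge survives when its weight is at or above the threshold are all assumptions made for the example.

```python
def prune_edges(weights, threshold):
    """Keep only edges whose weight reaches the threshold (assumed rule)."""
    return {edge: w for edge, w in weights.items() if w >= threshold}


def check_graph_weights(weights, threshold):
    """Return the surviving edges, or raise a readable error if none remain."""
    kept = prune_edges(weights, threshold)
    if sum(kept.values()) <= 0:
        raise ValueError(
            f"No edges left after applying threshold {threshold}; "
            "try fewer topics or a lower threshold."
        )
    return kept


# Hypothetical user-to-user weights: many topics tend to spread weight thinly.
sparse = {("u1", "u2"): 0.02, ("u2", "u3"): 0.03}
dense = {("u1", "u2"): 0.70, ("u2", "u3"): 0.10}

kept = check_graph_weights(dense, threshold=0.5)

try:
    check_graph_weights(sparse, threshold=0.5)
    sparse_failed = False
except ValueError:
    sparse_failed = True
```

A check like this would surface the parameter problem with a clearer message than the downstream graph library's error.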

If you still have problems with running SEERa, please send me the ParamsTemplate.py and a screenshot of the error.

Thanks :)

@hosseinfani
Member

@Sharjeeliv fyi, see above.

@hosseinfani
Member

@Lillliant
Hi Christine, I expect you to update us weekly or so. Please work with @soroush-ziaeinejad to solve the performance issues. Thanks.

@Lillliant
Member Author

Hi @hosseinfani @soroush-ziaeinejad, sorry for not posting the updates sooner. I think I'll be able to post more frequently now that most of my personal and school matters have been resolved.

So far:

  • gsdmm is working for me now, but lda.mallet and btm are still not working. Following Sharjeel's idea, I've been looking at my mallet setup on Windows.
    • For mallet, I've been testing the different fields in TopicModelling.py to see how they work. It looks like the program can see my mallet.bat, though the slashes (/ vs. \) are inconsistent and might contribute to errors.
    • I've changed the number of topics in ParamsTemplate.py from 3 to other values (2, 1, 0, 10, ...) and haven't been able to resolve the issue. It looks like btm and mallet both get stuck on not being able to find 3Topics.pkl. I've attached the log files and ParamsTemplate.py (as a .txt file, since GitHub doesn't support .py files) below.
    • Log_mallet.txt
    • Log_btm.txt
    • ParamsTemplate.txt
  • Uploaded the pstat runs to MS Teams. At first glance, lda.gensim is several minutes slower than the other runnable pipelines, especially at the TML stage. This might be due to printing the topic modelling progress, however.
  • Currently looking at different profiling packages (cProfile and line_profiler). I've installed line_profiler first and am trying to get its output into a readable format, since it produces .lprof files that I'm still looking for a way to open.
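
On the mixed slashes noted above for mallet.bat, one way to rule them out as a cause is to normalize the path before it is used. This is only a sketch: the path below is made up, and whether SEERa's mallet setting is affected by separator style is not confirmed in this thread.

```python
from pathlib import PureWindowsPath


def normalize_windows_path(raw):
    """Rewrite a path with mixed separators using Windows backslashes only."""
    # PureWindowsPath accepts both '/' and '\' as separators on any OS.
    return str(PureWindowsPath(raw))


# Hypothetical mallet location with mixed separators.
mixed = "C:/tools/mallet\\bin/mallet.bat"
normalized = normalize_windows_path(mixed)
```

`PureWindowsPath` only manipulates the string, so it does not check that the file exists.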

@Lillliant
Member Author

Update

  • Implemented line_profiler. Since it profiles every line, its results are stored in a file separate from Log.txt. The preliminary results can be seen here:
    LineProfile.txt

Next Step

  • See if line_profiler can profile functions imported from other layers
  • See if the output can be sorted so that only significant lines are shown

@Lillliant
Member Author

Update

  • Added libraries to requirement.txt and environment.yml
  • Modified the previous code so that line profiling is performed on each combination of the tml and gel baselines, instead of once over all combinations in aggregate. The results are stored in the respective folders of the combinations rather than next to Log.txt, to avoid clutter
  • Read the documentation for line_profiler and cProfile. If line_profiler cannot trace function calls imported from other layers, then maybe cProfile can help sort the most common function calls
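
The per-combination layout described above can be sketched as follows. To keep the example free of third-party dependencies it uses the standard-library cProfile in place of line_profiler, so it is not the code from the PR. The gel baseline names, `run_combination`, the folder naming, and `profile.pstat` are placeholders.

```python
import cProfile
import itertools
import tempfile
from pathlib import Path

# The tml names appear in this thread; the gel names are placeholders.
TML_BASELINES = ["lda.gensim", "gsdmm"]
GEL_BASELINES = ["gel_a", "gel_b"]


def run_combination(tml, gel):
    # Hypothetical stand-in for running one tml/gel pipeline combination.
    return len(tml) + len(gel)


def profile_all(output_root):
    """Profile each combination separately and store the result in its own folder."""
    written = []
    for tml, gel in itertools.product(TML_BASELINES, GEL_BASELINES):
        combo_dir = Path(output_root) / f"{tml}.{gel}"
        combo_dir.mkdir(parents=True, exist_ok=True)
        # A fresh profiler per combination keeps the stats from mixing.
        profiler = cProfile.Profile()
        profiler.runcall(run_combination, tml, gel)
        out = combo_dir / "profile.pstat"
        profiler.dump_stats(str(out))
        written.append(out)
    return written


written = profile_all(tempfile.mkdtemp())
```

Using one profiler object per combination is what keeps each folder's result independent of the others.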

@hosseinfani
Member

@Lillliant
Thank you for the PR and clean code/documentation :)

@soroush-ziaeinejad
Contributor

Hey @Lillliant

Thanks for your PR. I didn't look at the code carefully but I get this error when I run the code with argument -p True.

main.py: error: unrecognized arguments: True

Should I change something?

@Lillliant
Member Author

Hi @soroush-ziaeinejad, just the command line flag -p is sufficient to activate the profiler:
python -u main.py -r toy -t [...] -g [...] -p
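
For context, this behavior is what argparse produces when a flag is declared as a boolean switch. The sketch below assumes `-p` is defined with `action="store_true"`; the actual definition in main.py is not shown in this thread, and `dest="profile"` is a made-up name.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="main.py")
    # store_true makes -p a switch that takes no value:
    # present means True, absent means False.
    parser.add_argument("-p", dest="profile", action="store_true",
                        help="activate the profiler")
    return parser


parser = build_parser()
with_flag = parser.parse_args(["-p"])
without_flag = parser.parse_args([])

# Passing a value after the switch leaves "True" unconsumed, so argparse
# exits with "error: unrecognized arguments: True".
try:
    parser.parse_args(["-p", "True"])
    rejected = False
except SystemExit:
    rejected = True
```

With a switch defined this way, `-p True` is rejected because nothing consumes the extra token.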

@Lillliant
Member Author

Lillliant commented Dec 2, 2022

Update

  • Adding the final touch: a way to report the top N functions by execution time. It looks like cProfile can filter functions by module name, so I want to see whether line_profiler can do this too and how its implementation differs from cProfile's. I expect to open another PR next week or the week after.

@Lillliant
Copy link
Member Author

Update

  • Added function profiling (currently filtered by the location of the function's code: it profiles only SEERa's own code for now, but can be customized with keywords to profile certain functions from external libraries)
  • Updated the log file to include the date and time of baseline execution
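
One possible way to get this kind of filtered report with the standard library is the restriction arguments of `pstats.Stats.print_stats`: a string is treated as a regular expression matched against the printed `filename:lineno(function)` entry, and an integer caps the number of rows. This is a sketch and not necessarily how the PR implements it; `own_work` and `top_functions` are hypothetical names.

```python
import cProfile
import io
import pstats


def own_work():
    # Hypothetical stand-in for a function from the project's own code.
    return sorted(range(1000), reverse=True)


def top_functions(profiler, pattern, limit):
    """Return a report of entries matching pattern, capped at limit rows."""
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Restrictions are applied in order: regex filter first, then the row cap.
    stats.sort_stats("cumulative").print_stats(pattern, limit)
    return stream.getvalue()


profiler = cProfile.Profile()
profiler.runcall(own_work)
report = top_functions(profiler, "own_work", 5)
```

Because the pattern is matched against the full printed entry, a path fragment such as a package folder name can be used in place of a function name to keep only code from that location.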
