What are search_pubs() best practices and expectations? #430
-
There's a small chance that … run `python -m unittest test_module.TestLuminati` and see if it passes. You should have …
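In case it helps, a minimal smoke test along those lines might look like the sketch below. It assumes scholarly's `ProxyGenerator` still exposes a `Luminati()` helper that takes BrightData (formerly Luminati) credentials and returns True on success; the credentials, port, and query are placeholders.

```python
from scholarly import scholarly, ProxyGenerator

# Placeholder credentials for a BrightData/Luminati data-center proxy.
pg = ProxyGenerator()
ok = pg.Luminati(usr="your-username", passwd="your-password", proxy_port=22225)
if not ok:
    raise RuntimeError("Proxy setup failed; check credentials and port")

scholarly.use_proxy(pg)

# Smoke test: any small query that should return at least one result.
first_hit = next(scholarly.search_pubs("machine learning"), None)
print("Proxy seems to work" if first_hit else "No results came back")
```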
-
Coming from issue #291. So I tried ScraperAPI like @stanleyrhodes and it worked so well. Bottom line: this won't be enough, but in my case I have no budget for paying for a scraping API. One Python-noob side question: I saved my results with

```python
with open(filename, "w", encoding="utf-8") as f:
    for pub in scholarly.search_pubs(query="my query"):
        f.write(str(pub) + "\n")
```

but it is not that easy to reuse these lines now (if I ever want to try …
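If the goal is to load those results back later, writing one JSON object per line (instead of `str(pub)`) keeps them reusable. A rough sketch, assuming each result from `search_pubs()` behaves like a plain, JSON-serializable dict; the filename and query are placeholders.

```python
import json

from scholarly import scholarly

filename = "results.jsonl"

# Write one JSON record per line so the file can be parsed again later.
with open(filename, "w", encoding="utf-8") as f:
    for pub in scholarly.search_pubs("my query"):
        f.write(json.dumps(pub) + "\n")

# Reload the saved records without hitting Google Scholar again.
with open(filename, encoding="utf-8") as f:
    saved = [json.loads(line) for line in f]
```

Each saved line can then be rebuilt with `json.loads`, which the `str(pub)` output does not support directly.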
-
One alternative would be to have multiple accounts, say one for each person in your research group, and pool all your credits together. If …
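One way to pool credits in code would be to keep a list of API keys (one per group member) and fall back to the next key whenever the current one stops working. A rough sketch, assuming `ProxyGenerator.ScraperAPI()` returns a truthy value when setup succeeds; the keys are placeholders.

```python
from scholarly import scholarly, ProxyGenerator

# Hypothetical pooled keys, one per group member.
API_KEYS = ["key-alice", "key-bob", "key-carol"]

def use_next_working_key(keys):
    """Try each ScraperAPI key in turn until one sets up successfully."""
    for key in keys:
        pg = ProxyGenerator()
        if pg.ScraperAPI(key):
            scholarly.use_proxy(pg)
            return key
    raise RuntimeError("No working ScraperAPI key left in the pool")

active_key = use_next_working_key(API_KEYS)
print(f"Using key: {active_key}")
```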
-
The captchas option might be interesting, but I am not sure in which context you suggest using it.
-
Good to know about that program. As you say, its built-in rate limiting for GS might be the key point for this issue.
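Even without that program, a crude form of client-side rate limiting can be wrapped around `search_pubs()`. A sketch, assuming a fixed pause between results is enough to slow the request rate; the delay and result cap are arbitrary placeholders, not recommendations.

```python
import time

from scholarly import scholarly

DELAY_SECONDS = 30   # arbitrary placeholder; tune to taste
MAX_RESULTS = 20     # stop after a small batch

results = []
for pub in scholarly.search_pubs("my query"):
    results.append(pub)
    if len(results) >= MAX_RESULTS:
        break
    # Results arrive in pages, so not every iteration triggers a new request,
    # but sleeping here still caps the overall request rate.
    time.sleep(DELAY_SECONDS)
```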
-
I have some general questions about search_pubs() that are more discussion prompts for people who use search_pubs than precise QnA-style questions. It would be useful to know what others have been able to do, to help set new users' expectations (including mine).
Background: I've had little success (or luck) using search_pubs to build a computationally aided lit review. In my case, I was trying to find papers containing particular phrases that defined a technical term whose usage varied enough that it wasn't quite yet a technical term. Terms like these are places where researchers tend to talk past one another, so working on them holds promise for advancing research in that area. Mapping their usage requires a lot of searching and a lot of papers: depending on the combination of phrases used and phrases disallowed, I could get roughly 5,000 to 30,000 results. I didn't expect to do all of that in one search; I was planning to do smaller sets (~500-1000) and then combine them, but that's still far too much to do by hand. That's why I started investigating and playing with Scholarly.

Although I did once succeed in retrieving ~500 results from a test search, I usually fail to get anything, even for small searches with fewer than 20 results. I went from roughly an 80% failure rate to a 99% failure rate. It may be that I'm doing a lot of things wrong, or it may be that GS is just too good at blocking anything and everything search_pubs() does. I don't recall having a single success with FreeProxies. I've been using BrightData (formerly Luminati) data center proxies, and I've tried both rotating and longer-term IPs. I've tried limiting the data centers to a particular country or leaving it global. I've had the most luck with global, rotating IPs.
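To make the "smaller sets" idea concrete, the sketch below is roughly what I have in mind: pull one batch, checkpointing to disk as it goes, so a blocked request partway through doesn't lose what was already retrieved. It assumes each result is JSON-serializable and that any exception from the iterator means GS or the proxy has cut the run off; the query, filename, and sizes are placeholders.

```python
import json

from scholarly import scholarly

QUERY = '"my exact phrase" -"excluded phrase"'  # placeholder query
BATCH_SIZE = 500         # target size for one run
CHECKPOINT_EVERY = 25    # flush to disk this often

def run_batch(query, out_path):
    """Pull up to BATCH_SIZE results for one query, checkpointing as we go."""
    saved = 0
    try:
        with open(out_path, "a", encoding="utf-8") as f:
            for pub in scholarly.search_pubs(query):
                f.write(json.dumps(pub) + "\n")
                saved += 1
                if saved % CHECKPOINT_EVERY == 0:
                    f.flush()
                if saved >= BATCH_SIZE:
                    break
    except Exception as exc:  # GS block, proxy failure, etc.
        print(f"Stopped early after {saved} results: {exc}")
    return saved

print(run_batch(QUERY, "batch_01.jsonl"))
```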
What is your success rate with search_pubs searches? Roughly how big is your expected number of results (e.g. 10, 100, 1000)? Do you use a proxy, and if so, which one, and with what settings?