Download Bank data into S3 location #4
Comments
You cannot save to S3, only download from S3. |
Thanks for the reply, @osallou. Is there any option to use some other database instead of the local file system? |
nope, the goal is to get local files. |
@osallou Thanks for help. |
@osallou can you please point me to an example of such a process and bank file? I thought about using the same script to copy, based on the destination location, but I am not able to get the source location on the fly for each different bank. |
Can I create my own setting, like s3.remote.location, for the destination folder and pass it as an argument to the process script? |
Property files support interpolation, so you can create your own variables and use them elsewhere in properties, like: myvar=myvalue. I have no process example for S3, but there are db and process examples at https://github.com/genouest/biomaj-data?files=1 |
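A minimal sketch of what such interpolation can look like, assuming the property files are read with Python's configparser and its `%(name)s` syntax (an assumption, not confirmed in this thread). `s3.remote.location` is the setting name proposed in the question above; `upload.target` and the bucket name are hypothetical and only illustrate reusing a custom variable.

```python
import configparser

# Hypothetical bank properties. Only "s3.remote.location" comes from the
# thread; "upload.target" and the bucket name are made up for illustration.
PROPERTIES = """
[GENERAL]
db.name=swissprot
s3.remote.location=s3://my-bucket/banks
upload.target=%(s3.remote.location)s/%(db.name)s
"""

parser = configparser.ConfigParser()
parser.read_string(PROPERTIES)

# Reading the value resolves the %(...)s references within the same section.
target = parser.get("GENERAL", "upload.target")
print(target)
```

A process script could then receive the resolved value as an argument instead of hard-coding the destination.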
It is really helpful. Thanks @osallou |
Hi @osallou, thanks for your earlier help. I have a use case that requires only some processing, with no download. How can I achieve this in the biomaj configuration? I tried setting the 'protocol' field to none, but that only works if the bank depends on other banks. |
Simply use the local protocol with a fake local file to "simulate" a download. |
But the local protocol will create a copy of the configured file before starting the post-process. I don't need two copies of the same file. |
Just create a "fake" file (/opt/fake/triggerbiomaj.txt, for example) and use it. |
ok, got it. Thanks |
@osallou the ftp download is failing even with your example file swissprot.properties (https://github.com/genouest/biomaj-data/blob/master/biomaj_data/db_properties/swissprot.properties) Traceback (most recent call last): The cause seems obvious, because the lines list has FTP Listing of /pub/databases/uniprot/knowledgebase/ at ftp.uniprot.org', '', 'Parent Directory ', ' ', 'Jul 03 2019 15:52 Link LICENSE -> ../../../LICENSE ', 'Jul 03 2019 14:00 3898 README', 'Jul 03 2019 14:00 8107 RELEASE.metalink', 'Jul 03 2019 14:00 Directory docs', 'Jul 03 2019 14:00 151 reldate.txt', 'Jul 03 2019 14:00 53536 uniprot.xsd', 'Jul 03 2019 14:00 576634324 uniprot_sprot.dat.gz', 'Jul 03 2019 14:00 88666136 uniprot_sprot.fasta.gz', 'Jul 03 2019 14:00 756218436 uniprot_sprot.xml.gz', 'Jul 03 2019 14:00 8288117 uniprot_sprot_varsplic.fasta.gz', 'Jul 03 2019 14:00 102618527180 uniprot_trembl.dat.gz', 'Jul 03 2019 14:00 37149489903 uniprot_trembl.fasta.gz', 'Jul 03 2019 14:00 120526763606 uniprot_trembl.xml.gz', '', ' ', '', '', ''] It fails when parsing the very first line. |
Hmm... an HTTP listing is returned, not an FTP one. It looks like the http protocol is used, even though the properties file specifies ftp. |
I just did a local test and it worked just fine |
I also made a quick test using https://raw.githubusercontent.com/genouest/biomaj-data/master/biomaj_data/db_properties/swissprot.properties on osallou/biomaj-docker:latest and it worked fine too. So if you are using this property file, I do not see why the http protocol would be used. |
I am installing the latest version of biomaj. |
OK, so you are using the monolith install. I am going to try with the latest code on the monolith to see if a protocol issue could occur in this case (though I do not see what the difference could be). |
I tested locally and it works fine too.
We can see in the logs
ftp is correctly used. In your global.properties, set (or change)
and set all logger/handler log levels to DEBUG. Then try to run your update and please send the resulting logs. |
Even in my case, I don't think it is going through the HTTP implementation: as you can see, the error trace points to ftp.h (list() function). All I can see is that HTML tag lines were also added to the list, along with the actual list of files/folders. Parsing works for the file lines, but it fails when it tries to parse an HTML-tagged line. I hope this helps. |
Sure, I will send the debug messages. |
2019-07-04 11:28:02,143 INFO [root][MainThread] Workflow:Skip:depends 2019-07-04 11:28:03,138 ERROR [root][MainThread] Error during task download |
Hmm, strange. The listing is an HTTP listing, not an FTP listing. The 'HTML//EN"> shows it is HTML. |
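The symptom described above (HTML lines mixed into what should be an FTP listing) can be detected before parsing. The following is only a sketch, not biomaj's actual code: `looks_like_html` and the two sample listings are hypothetical, and the marker list is a heuristic.

```python
# Heuristic markers that indicate an HTML page rather than a raw FTP listing.
HTML_MARKERS = ("<!doctype", "<html", "<a href", "</a>")


def looks_like_html(lines):
    """Return True if any listing line contains an HTML marker."""
    return any(marker in line.lower() for line in lines for marker in HTML_MARKERS)


# Hypothetical samples: a raw FTP LIST line and an HTML directory index.
ftp_listing = ["-rw-r--r--   1 ftp  ftp   3898 Jul 03 14:00 README"]
html_listing = [
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">',
    '<a href="README">README</a>',
]

for name, listing in (("ftp", ftp_listing), ("html", html_listing)):
    print(name, looks_like_html(listing))
```

A check like this would let a downloader fail with a clear message ("got HTML, expected an FTP listing") instead of a parsing traceback.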
[root@3cd1c9c09f59 /]# curl --version |
Package python-pycurl-7.19.0-19.el7.x86_64 |
Are you using Python 2 or 3? |
Using 3, as you sent the command pip3 .... :-) |
I did a test in a fresh docker and installed biomaj with pip3. It worked fine... :-(
In virtualenv I created to install biomaj packages
So we have the same libraries and the same install, and I cannot reproduce in any environment using https://raw.githubusercontent.com/genouest/biomaj-data/master/biomaj_data/db_properties/swissprot.properties Are you sure there is no problem in your global or properties file? Can you provide your global.properties? |
Looking back at the issue, I just saw in your result: "... /pub/databases/uniprot/knowledgebase/ at ftp.uniprot.org". This is not the swissprot example; it is a request to the uniprot server. Which config file are you using? Or perhaps we are not redirected to the same web site. From the biomaj location, what is the result of
|
I can easily see how to fix this parsing issue; I just wonder why we do not get the same results, and why there are uniprot references... |
@osallou sorry for the late reply. Thanks for your help.
< HTTP/1.1 200 OK Index of /blast/db/FASTAName Last modified Size* Connection #0 to host ftp.ncbi.nih.gov left intact |
global.properties file has
|
Debug report for swissprot dataset
|
I am going to check with your global.properties. And what is the result of curl ftp:
Please indent your results or attach files, they are hard to read... |
I got no issue with your global.properties... I think I remember someone in a company who had this kind of issue. The requests were going out through a proxy, and this proxy did not handle ftp directly: it turned the ftp requests into http requests/connections, leading to different answers.... |
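One quick thing to check on the server is whether a proxy is explicitly configured in the environment. This is a rough sketch only: `configured_proxies` is a hypothetical helper, the variable names are the conventional ones (which of them a given tool honours varies), and a transparent company proxy would not show up here at all.

```python
import os

# Conventional proxy variables; tools differ in which ones they honour.
PROXY_VARS = ("ftp_proxy", "http_proxy", "https_proxy", "all_proxy")


def configured_proxies(environ=None):
    """Return the proxy variables that are set, keyed by lowercase name."""
    environ = os.environ if environ is None else environ
    found = {}
    for name in PROXY_VARS:
        # Accept both lowercase and uppercase spellings of each variable.
        value = environ.get(name) or environ.get(name.upper())
        if value:
            found[name] = value
    return found


if __name__ == "__main__":
    for name, value in configured_proxies().items():
        print(f"{name} = {value}")
```

An empty result does not rule out a proxy; it only means none is declared through these variables.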
@osallou Thanks for your help. Sorry, next time I will take care of the indentation. |
So I think the proxy is the issue
If it is the issue, and I think it is, then you cannot use ftp from your company (or you can ask your IT team for true direct ftp access to the internet from your server).... The workaround is to use http, as most web sites for banks offer both ftp and http access. However, as http listings are not standardized, you may have to customize, in your bank property file, the http regexp properties set in global.properties. Regexps are used to analyse the web listing page and extract file and dir info. You can, however, try the default ones first and see if they match.
|
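To illustrate the regexp-based listing analysis described above, here is a small sketch that extracts file and directory info from an Apache-style index page. The two patterns and the sample HTML are hypothetical and are not biomaj's default regexps; a different server or proxy would need different patterns.

```python
import re

# Hypothetical patterns for an Apache-style index line (name, date, size).
FILE_LINE = re.compile(
    r'<a\s+href="([^"/]+)">.*?(\d{2}-\w{3}-\d{4}\s\d{2}:\d{2})\s+([\d.]+[KMG]?)'
)
DIR_LINE = re.compile(r'<a\s+href="([^"/]+)/">.*?(\d{2}-\w{3}-\d{4}\s\d{2}:\d{2})')


def parse_http_listing(html):
    """Split an HTML index page into (files, dirs) lists of dicts."""
    files, dirs = [], []
    for line in html.splitlines():
        match = DIR_LINE.search(line)
        if match:
            dirs.append({"name": match.group(1), "date": match.group(2)})
            continue
        match = FILE_LINE.search(line)
        if match:
            files.append(
                {"name": match.group(1), "date": match.group(2), "size": match.group(3)}
            )
    return files, dirs


# Made-up sample page, loosely modelled on the listing quoted in this thread.
SAMPLE = """<a href="docs/">docs/</a>                 03-Jul-2019 14:00    -
<a href="README">README</a>               03-Jul-2019 14:00  3.8K
<a href="uniprot_sprot.fasta.gz">uniprot_sprot.fasta.gz</a>  03-Jul-2019 14:00   85M
"""

files, dirs = parse_http_listing(SAMPLE)
print(files)
print(dirs)
```

The capture groups (name, date, size) are the pieces that would have to match for each listing format, which is why the patterns usually need adjusting per site.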
Yes. Can we contribute something to handle this type of case? Company proxies (policies) might not be changed for a single project. |
The problem is that proxies that do ftp -> http (though this is not always supported or allowed) follow no standard. It means that the returned HTML will look different depending on the proxy used. This prevents biomaj from handling this use case correctly. As I said, the workaround is to use the http protocol instead of ftp in those cases. The http regexp parser is not always cool/easy to set up, but usually you only need to define a few use cases. If you find a way to handle most cases, we'll be glad to get it in biomaj :-) |
I want to download bank-related data into an S3 bucket instead of the local file system. When I tried to configure an S3 path in the data.dir variable, it created that path in the current directory and downloaded the data into that folder.
Can anyone please help me configure an AWS S3 location for downloaded data?