Download Bank data into S3 location #4

Open
nsanilkumar-valluri opened this issue Feb 18, 2019 · 41 comments

@nsanilkumar-valluri

I want to download bank-related data into an S3 bucket instead of the local file system. When I tried to configure an S3 path in the data.dir variable, it created that path in the current directory and downloaded the data into that folder.
Can anyone please help me configure an AWS S3 location for the downloaded data?

@osallou
Contributor

osallou commented Feb 18, 2019

You cannot save to S3, only download from S3.
If you want to save to S3, you should still save to a local dir and then push the data to S3 via a post-process (and delete the local data, but in that case you will have to download everything again on the next update).
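
A minimal sketch of what such a post-process could look like, written in Python. It assumes the third-party boto3 package is installed and AWS credentials are already configured; the function names, the bucket name and the key prefix are hypothetical and not part of BioMAJ. The upload plan is computed separately from the upload itself so the mapping can be checked without touching S3.

```python
import os


def plan_uploads(local_dir, key_prefix):
    """Map every file under local_dir to the S3 key it would be stored under."""
    plan = []
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, local_dir).replace(os.sep, "/")
            # Join prefix and relative path without doubled or leading slashes.
            key = "/".join(part for part in (key_prefix.strip("/"), rel) if part)
            plan.append((path, key))
    return sorted(plan)


def push_to_s3(local_dir, bucket, key_prefix):
    """Upload each planned file; boto3 is imported lazily (assumed installed)."""
    import boto3  # third-party dependency, not bundled with BioMAJ

    client = boto3.client("s3")
    for path, key in plan_uploads(local_dir, key_prefix):
        client.upload_file(path, bucket, key)
```

A post-process script could then call something like `push_to_s3(release_dir, "my-bucket", "banks/swissprot")`, where `release_dir` is whatever local directory the bank was downloaded into and is passed to the script as an argument.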

@nsanilkumar-valluri
Author

Thanks for the reply, @osallou. Is there any option to use some other database instead of the local file system?

@osallou
Contributor

osallou commented Feb 18, 2019

Nope, the goal is to get local files.
The only other way (for the moment) is to use the above solution, i.e. push the data after the update via a post-process.

@nsanilkumar-valluri
Author

@osallou Thanks for help.

@nsanilkumar-valluri
Author

@osallou can you please help me out by pointing to an example of such a process and bank file? I thought about using the same script for the copy, based on the destination location. But I am not able to get the source location on the fly for each different bank.

@nsanilkumar-valluri
Author

Can I create my own setting, like s3.remote.location, for the destination folder and pass it as an argument to the process script?

@osallou
Contributor

osallou commented Mar 11, 2019

Property files support interpolation, so you can create your own variables and use them elsewhere in properties, like:

myvar=myvalue
myproc.args= %(myvar)s bla bla bla

I have no process example for s3 but you have db and process examples at https://github.com/genouest/biomaj-data?files=1
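
The `%(myvar)s` form is the same syntax that Python's standard `configparser` module resolves, so the interpolation can be tried outside BioMAJ. The sketch below assumes BioMAJ resolves properties the same way; `s3.remote.location` is the custom setting proposed in the question, and the `--delete-local` flag is a made-up argument for illustration.

```python
import configparser

# Hypothetical bank properties: a custom variable reused in process arguments.
BANK_PROPERTIES = """
[GENERAL]
s3.remote.location=s3://my-bucket/banks
myproc.args=%(s3.remote.location)s --delete-local
"""

parser = configparser.ConfigParser()
parser.read_string(BANK_PROPERTIES)

# get() applies the %(name)s interpolation before returning the value.
resolved_args = parser.get("GENERAL", "myproc.args")
print(resolved_args)
```

The process script then receives the resolved value as an ordinary command-line argument.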

@nsanilkumar-valluri
Author

It is really helpful. Thanks @osallou

@nsanilkumar-valluri
Author

Hi @osallou, thanks for your help before. I have a use case that requires only some processing, without any download. How can I achieve this in the biomaj configuration? I tried setting the 'protocol' field to none, but that only works if there are depends banks.

@osallou
Contributor

osallou commented May 16, 2019

Simply use the local protocol with a fake local file to "simulate" a download.
Then touch this file to update its last-modified date, so that a new "workflow" is considered.
Or create it as a pre-process.

@nsanilkumar-valluri
Author

But the local protocol will create a copy of the configured file before starting the post-process. I don't need two copies of the same file.
I have a file called 'samp1.fasta'. If I configure a local copy from it, another copy of samp1.fasta is created, and then my post-process script sample.sh is triggered. But I don't want another copy of samp1.fasta in my case.

@osallou
Contributor

osallou commented May 16, 2019

just create a "fake" file ( /opt/fake/triggerbiomaj.txt for example) and use it
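
A small sketch of the "touch" step in Python, for cases where it is done from a pre-process or a scheduler rather than by hand. The helper name is hypothetical; it only creates the placeholder file when missing and refreshes its modification time.

```python
from pathlib import Path


def touch_trigger(path):
    """Create the placeholder file if needed and refresh its modification time."""
    trigger = Path(path)
    trigger.parent.mkdir(parents=True, exist_ok=True)
    # touch() creates the file, or updates its mtime when it already exists.
    trigger.touch(exist_ok=True)
    return trigger.stat().st_mtime
```

For the path suggested above this would be called as `touch_trigger("/opt/fake/triggerbiomaj.txt")`, so only the tiny placeholder gets copied, not the real data file.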

@nsanilkumar-valluri
Author

ok, got it. Thanks

@nsanilkumar-valluri
Author

@osallou ftp download is failing even with your example file swissprot.properties (https://github.com/genouest/biomaj-data/blob/master/biomaj_data/db_properties/swissprot.properties)

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/biomaj/workflow.py", line 130, in start
    self.session._session['status'][flow['name']] = getattr(self, 'wf_' + flow['name'])()
  File "/usr/local/lib/python3.6/site-packages/biomaj/workflow.py", line 1168, in wf_download
    (file_list, dir_list) = downloader.list()
  File "/usr/local/lib/python3.6/site-packages/biomaj_download/download/ftp.py", line 296, in list
    rfile['size'] = int(parts[4])
ValueError: invalid literal for int() with base 10: 'HTML//EN">'

The failure seems expected, because the lines list contains:
['', '', '', '<TITLE>FTP Listing of /pub/databases/uniprot/knowledgebase/ at ftp.uniprot.org</TITLE>', '', '', '', '

FTP Listing of /pub/databases/uniprot/knowledgebase/ at ftp.uniprot.org

', '
', 'Parent Directory
', '
', 'Jul 03 2019 15:52         Link LICENSE -> ../../../LICENSE ', 'Jul 03 2019 14:00         3898 README', 'Jul 03 2019 14:00         8107 RELEASE.metalink', 'Jul 03 2019 14:00    Directory docs', 'Jul 03 2019 14:00          151 reldate.txt', 'Jul 03 2019 14:00        53536 uniprot.xsd', 'Jul 03 2019 14:00    576634324 uniprot_sprot.dat.gz', 'Jul 03 2019 14:00     88666136 uniprot_sprot.fasta.gz', 'Jul 03 2019 14:00    756218436 uniprot_sprot.xml.gz', 'Jul 03 2019 14:00      8288117 uniprot_sprot_varsplic.fasta.gz', 'Jul 03 2019 14:00 102618527180 uniprot_trembl.dat.gz', 'Jul 03 2019 14:00  37149489903 uniprot_trembl.fasta.gz', 'Jul 03 2019 14:00 120526763606 uniprot_trembl.xml.gz', '
', '
', '', '', '']

It fails while parsing the very first line.
Is there a parameter I am missing that would skip this error?

@osallou
Contributor

osallou commented Jul 4, 2019

Hmm.... HTTP is returned, not FTP. It looks like the http protocol is used, though the properties file specifies ftp.
I'm going to check.

@osallou
Contributor

osallou commented Jul 4, 2019

I just did a local test and it worked just fine
Which version of biomaj do you use?
Which setup: docker micro service or monolithic install?

@osallou
Contributor

osallou commented Jul 4, 2019

I also made a quick test using https://raw.githubusercontent.com/genouest/biomaj-data/master/biomaj_data/db_properties/swissprot.properties on osallou/biomaj-docker:latest and it worked fine too.

So if you are using this property file , I do not see why http protocol would be used

@nsanilkumar-valluri
Author

I am installing the latest version of biomaj.
I installed it using pip3 install biomaj biomaj-cli biomaj-daemon biomaj-process biomaj-download biomaj-ftp biomaj-release biomaj-user biomaj-zipkin biomaj-core

@osallou
Contributor

osallou commented Jul 4, 2019

ok, so you are using the monolith install.
I tried with the docker setup, but it should use the latest pip packages anyway.

I'm going to try the latest code on a monolith install to see if a protocol issue could occur in this case (though I do not see what the difference could be).

@osallou
Contributor

osallou commented Jul 4, 2019

I tested locally and it works fine too.

2019-07-04 13:17:46,091 DEBUG [root][MainThread] Download:List:ftp://ftp.ncbi.nih.gov/blast/db/FASTA/
2019-07-04 13:18:03,897 DEBUG [root][MainThread] Download:File:RegExp:['^swissprot\\.gz$']
2019-07-04 13:18:03,898 DEBUG [root][MainThread] Download:File:MatchRegExp:swissprot.gz
2019-07-04 13:18:03,898 INFO  [root][MainThread] Workflow:wf_download:nb_files_to_download:1
2019-07-04 13:18:03,899 INFO  [root][MainThread] Workflow:wf_download:release:remoterelease:2019-7-2
2019-07-04 13:18:03,899 INFO  [root][MainThread] Workflow:wf_download:release:release:2019-7-2
2019-07-04 13:18:03,909 DEBUG [root][MainThread] Workflow:wf_download:offline_check_dir:/home/osallou/Development/NOSAVE/genouest/biomaj-test/test/data/biomaj/OfflineDir/swissprot_tmp
2019-07-04 13:18:03,909 DEBUG [root][MainThread] Workflow:wf_download:offline_check_file:swissprot.gz
2019-07-04 13:18:03,910 INFO  [root][MainThread] Workflow:wf_download:nb_expected_files:1
2019-07-04 13:18:03,910 INFO  [root][MainThread] Workflow:wf_download:nb_files_in_offline_dir:0
2019-07-04 13:18:03,910 DEBUG [root][MainThread] Workflow:wf_download:create_dir_structure:start
2019-07-04 13:18:03,911 DEBUG [root][MainThread] Workflow:wf_download:create_dir_structure:done
2019-07-04 13:18:04,003 INFO  [root][MainThread] Use remote: False
2019-07-04 13:18:04,004 INFO  [root][MainThread] Workflow:wf_download:DownloadSession:69a2a620-bcd1-4074-87f2-1e273f1cd869
2019-07-04 13:18:04,005 INFO  [root][MainThread] Workflow:wf_download:Download:Waiting
2019-07-04 13:18:04,005 INFO  [root][MainThread] Workflow:wf_download:RemoteDownload:Waiting
2019-07-04 13:18:04,005 INFO  [root][MainThread] Workflow:wf_download:Download:Threads:FillQueue
2019-07-04 13:18:04,006 INFO  [root][MainThread] Workflow:wf_download:Download:Threads:Start
2019-07-04 13:18:04,006 INFO  [root][Thread-5] Start download thread
2019-07-04 13:18:04,007 DEBUG [root][Thread-5] swissprot request to download from ftp://ftp.ncbi.nih.gov
2019-07-04 13:18:04,007 DEBUG [biomaj][Thread-5] Download
2019-07-04 13:18:04,008 DEBUG [root][Thread-5] FTP:Download
2019-07-04 13:18:04,008 DEBUG [root][Thread-5] FTP:Download:Progress:1/1 downloading file swissprot.gz
2019-07-04 13:18:04,008 DEBUG [root][Thread-5] FTP:Download:Progress:1/1 save as swissprot.gz

We can see in logs

2019-07-04 13:18:04,007 DEBUG [root][Thread-5] swissprot request to download from ftp://ftp.ncbi.nih.gov

ftp is correctly used

In your global.properties, set (or change)

historic.logfile.level=DEBUG

and set all logger/handler log level to DEBUG

Then try to run your update and please send the resulting logs

@nsanilkumar-valluri
Author

In my case too, I don't think it is going through the HTTP implementation: as you can see, the error trace points to ftp.py (the list() function). All I can see is that the HTML tag lines were added to the list along with the actual list of files/folders. Parsing works for the file lines, but it fails when it tries to parse an HTML tag line. I hope this helps.

@nsanilkumar-valluri
Author

Sure, I will send the debug messages.

@nsanilkumar-valluri
Author

2019-07-04 11:28:02,143 INFO [root][MainThread] Workflow:Skip:depends
2019-07-04 11:28:02,143 INFO [root][MainThread] Workflow:Skip:preprocess
2019-07-04 11:28:02,143 INFO [root][MainThread] Workflow:Skip:release
2019-07-04 11:28:02,144 INFO [root][MainThread] Workflow:Start:download
2019-07-04 11:28:02,144 INFO [root][MainThread] Workflow:wf_download
2019-07-04 11:28:02,144 INFO [root][MainThread] Use remote: False
2019-07-04 11:28:02,144 INFO [root][MainThread] Workflow:wf_download:DownloadSession:500dce93-a43e-48be-b876-870c0f70f523
2019-07-04 11:28:02,144 DEBUG [biomaj][MainThread] Download
2019-07-04 11:28:02,145 INFO [root][MainThread] Workflow:DownloadService:CleanSession
2019-07-04 11:28:02,145 DEBUG [root][MainThread] Download:List:ftp://ftp.ncbi.nih.gov/blast/db/FASTA/
2019-07-04 11:28:03,135 ERROR [root][MainThread] Workflow:download:Exception:invalid literal for int() with base 10: 'HTML//EN">'
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/biomaj/workflow.py", line 130, in start
    self.session._session['status'][flow['name']] = getattr(self, 'wf_' + flow['name'])()
  File "/usr/local/lib/python3.6/site-packages/biomaj/workflow.py", line 1168, in wf_download
    (file_list, dir_list) = downloader.list()
  File "/usr/local/lib/python3.6/site-packages/biomaj_download/download/ftp.py", line 296, in list
    rfile['size'] = int(parts[4])
ValueError: invalid literal for int() with base 10: 'HTML//EN">'
2019-07-04 11:28:03,137 DEBUG [root][MainThread] Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/biomaj/workflow.py", line 130, in start
    self.session._session['status'][flow['name']] = getattr(self, 'wf_' + flow['name'])()
  File "/usr/local/lib/python3.6/site-packages/biomaj/workflow.py", line 1168, in wf_download
    (file_list, dir_list) = downloader.list()
  File "/usr/local/lib/python3.6/site-packages/biomaj_download/download/ftp.py", line 296, in list
    rfile['size'] = int(parts[4])
ValueError: invalid literal for int() with base 10: 'HTML//EN">'

2019-07-04 11:28:03,138 ERROR [root][MainThread] Error during task download
2019-07-04 11:28:03,138 INFO [root][MainThread] Workflow:wf_over
2019-07-04 11:28:03,175 INFO [root][MainThread] Notify:none
An error occured:

@osallou
Contributor

osallou commented Jul 4, 2019

Hmm, strange. The listing is an HTTP listing, not an FTP listing. The 'HTML//EN"> shows it is HTML.
And I do not experience the problem on either my computer (latest code) or our prod server (slightly older code).
Could it be a curl/pycurl issue? Which version of pycurl/curl are you using? Which OS?

@nsanilkumar-valluri
Author

[root@3cd1c9c09f59 /]# curl --version
curl 7.29.0 (x86_64-redhat-linux-gnu) libcurl/7.29.0 NSS/3.36 zlib/1.2.7 libidn/1.28 libssh2/1.4.3
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtsp scp sftp smtp smtps telnet tftp
Features: AsynchDNS GSS-Negotiate IDN IPv6 Largefile NTLM NTLM_WB SSL libz unix-sockets

@nsanilkumar-valluri
Author

Package python-pycurl-7.19.0-19.el7.x86_64

@osallou
Contributor

osallou commented Jul 4, 2019

are you using python2 or 3?

@osallou
Contributor

osallou commented Jul 4, 2019

using 3 as you sent cmd pip3 .... :-)

@osallou
Contributor

osallou commented Jul 4, 2019

I did a test in a fresh docker and installed biomaj with pip3. It worked fine... :-(

  • centos 7
  • curl-7.29.0-51.el7.x86_64
  • libcurl-devel-7.29.0-51.el7.x86_64
  • libcurl-7.29.0-51.el7.x86_64
  • python-pycurl-7.19.0-19.el7.x86_64

In virtualenv I created to install biomaj packages

  • pycurl==7.43.0 (install via pip, can be seen via a pip freeze | grep pycurl)

so we have same libraries, same install, and I cannot reproduce in any environment using https://raw.githubusercontent.com/genouest/biomaj-data/master/biomaj_data/db_properties/swissprot.properties

Are you sure there is no problem in your global or bank properties file?

Can you provide your global.properties?

@osallou
Contributor

osallou commented Jul 4, 2019

Looking back at issue, I just saw in your result:

"... /pub/databases/uniprot/knowledgebase/ at ftp.uniprot.org"

this is not the swissprot example.... it is a request to uniprot server. Which config file are you using???

or we are not redirected to the same web site....

from biomaj location, what is result of

curl -v https://ftp.ncbi.nih.gov/blast/db/FASTA/

@osallou
Contributor

osallou commented Jul 4, 2019

I can easily see how to fix this parsing issue; I just wonder why we do not get the same results, and why there are uniprot references...

@nsanilkumar-valluri
Author

@osallou sorry for the late reply. Thanks for your help.
Regarding the URL: yes, the first debug statements were from a different bank. Later I used the same swissprot.properties to confirm the issue. Sorry for the confusion, but I can assure you the problem also occurs with the swissprot dataset.
For the curl command, this is the output we get:
[root@fa19786b45e5 /]# curl -v https://ftp.ncbi.nih.gov/blast/db/FASTA/
* About to connect() to ftp.ncbi.nih.gov port 443 (#0)
*   Trying 130.14.250.13...
* Connected to ftp.ncbi.nih.gov (130.14.250.13) port 443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate:
*   subject: CN=*.ncbi.nih.gov,OU=Domain Control Validated
*   start date: Jun 11 12:56:31 2019 GMT
*   expire date: Jun 20 15:41:40 2020 GMT
*   common name: *.ncbi.nih.gov
*   issuer: CN=Go Daddy Secure Certificate Authority - G2,OU=http://certs.godaddy.com/repository/,O="GoDaddy.com, Inc.",L=Scottsdale,ST=Arizona,C=US
> GET /blast/db/FASTA/ HTTP/1.1
> User-Agent: curl/7.29.0
> Host: ftp.ncbi.nih.gov
> Accept: */*

< HTTP/1.1 200 OK
< Date: Thu, 04 Jul 2019 14:01:46 GMT
< Server: Apache
< Vary: Accept-Encoding
< Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
< Access-Control-Allow-Origin: *
< Access-Control-Allow-Methods: GET,POST,PUT,OPTIONS
< Access-Control-Allow-Headers: RANGE, Cache-control, If-None-Match, Content-Type
< Access-Control-Expose-Headers: Content-Length, Content-Range, Content-Type
< Content-Length: 4170
< Content-Type: text/html;charset=UTF-8
<

<title>Index of /blast/db/FASTA</title>

Index of /blast/db/FASTA

Name                    Last modified      Size
Parent Directory                           -
alu.a.gz                2003-11-26 11:44   89K
alu.a.gz.md5            2009-06-15 10:40   43
alu.n.gz                2003-11-26 11:44   24K
alu.n.gz.md5            2009-06-15 10:40   43
drosoph.aa.gz           2003-11-26 11:44   4.1M
drosoph.aa.gz.md5       2009-06-15 10:40   48
drosoph.nt.gz           2003-11-26 11:44   35M
drosoph.nt.gz.md5       2009-06-15 10:40   48
env_nr.gz               2019-06-22 20:04   923M
env_nr.gz.md5           2019-06-22 20:04   44
env_nt.gz               2019-06-23 07:06   40G
env_nt.gz.md5           2019-06-23 07:25   44
est_human.gz            2019-03-17 13:21   1.4G
est_human.gz.md5        2019-03-17 13:21   47
est_mouse.gz            2019-03-17 14:10   740M
est_mouse.gz.md5        2019-03-17 14:10   47
est_others.gz           2019-06-23 17:46   11G
est_others.gz.md5       2019-06-23 17:51   48
gss.gz                  2019-02-24 07:22   9.3G
gss.gz.md5              2019-02-24 07:27   41
htgs.gz                 2019-06-23 05:45   7.5G
htgs.gz.md5             2019-06-23 05:49   42
igSeqNt.gz              2013-02-01 00:15   32M
igSeqProt.gz            2013-02-01 00:15   4.4M
mito.aa.gz              2019-07-03 23:50   15M
mito.aa.gz.md5          2019-07-03 23:50   45
mito.nt.gz              2019-07-03 23:51   71M
mito.nt.gz.md5          2019-07-03 23:51   45
nr.gz                   2019-07-02 04:01   47G
nr.gz.md5               2019-07-02 04:20   40
nt.gz                   2019-06-23 09:27   54G
nt.gz.md5               2019-06-23 09:52   40
other_genomic.gz        2019-06-29 17:25   279G
other_genomic.gz.md5    2019-06-29 19:31   51
pataa.gz                2019-06-23 06:01   275M
pataa.gz.md5            2019-06-23 06:02   43
patnt.gz                2019-06-23 08:27   5.6G
patnt.gz.md5            2019-06-23 08:30   43
pdbaa.gz                2019-07-02 00:00   20M
pdbaa.gz.md5            2019-07-02 00:00   43
pdbnt.gz                2019-07-01 21:00   664K
pdbnt.gz.md5            2019-07-01 21:00   43
sts.gz                  2019-05-19 02:02   187M
sts.gz.md5              2019-05-19 02:03   41
swissprot.gz            2019-07-02 00:00   102M
swissprot.gz.md5        2019-07-02 00:00   47
vector.gz               2010-01-13 10:33   860K
yeast.aa.gz             2003-11-26 11:44   1.9M
yeast.aa.gz.md5         2009-06-15 10:40   46
yeast.nt.gz             2003-11-26 11:44   3.6M
yeast.nt.gz.md5         2009-06-15 10:40   46
* Connection #0 to host ftp.ncbi.nih.gov left intact

@nsanilkumar-valluri
Author

nsanilkumar-valluri commented Jul 4, 2019

global.properties file has

[GENERAL]
root.dir=/********/biomaj_data
conf.dir=%(root.dir)s/conf
log.dir=/***/log
process.dir=%(root.dir)s/process
cache.dir=%(root.dir)s/cache
lock.dir=%(root.dir)s/lock
#The root directory where all databases are stored.
#If your data is not stored under one directory hirearchy
#you can override this value in the database properties file.
data.dir=/***/data

db.url=mongodb://127.0.0.1:27017
db.name=biomaj

use_ldap=0
ldap.host=localhost
ldap.port=389
ldap.dn=nodomain

use_elastic=0
#Comma separated list of elasticsearch nodes  host1,host2:port2
elastic_nodes=elasticsearch
elastic_index=biomaj
# Calculate data.dir size stats
data.stats=1

celery.queue=biomaj
celery.broker=mongodb://127.0.0.1:27017/biomaj_celery


auto_publish=1

########################
# Global properties file
#To override these settings for a specific database go to its
#properties file and uncomment or add the specific line you want
#to override.
#----------------
# Mail Configuration
#---------------
#Uncomment thes lines if you want receive mail when the workflow is finished

mail.smtp.host=
#mail.stmp.host=
mail.admin=
mail.from=biomaj@localhost
mail.user=
mail.password=
mail.tls=

#---------------------
#Proxy authentification
#---------------------
#proxyHost=
#proxyPort=
#proxyUser=
#proxyPassword=

#---------------------
# PROTOCOL
#-------------------
#possible values : ftp, http, rsync, local
port=21
username=anonymous
password=anonymous@nowhere.com

#access user for production directories
production.directory.chmod=775
#Number of thread during the download
bank.num.threads=4

#Number of threads to use for downloading and processing
files.num.threads=4

#to keep more than one release increase this value
keep.old.version=0

#Link copy property
do.link.copy=true

#The historic log file is generated in log/
#define level information for output : DEBUG,INFO,WARN,ERR
historic.logfile.level=DEBUG

http.parse.dir.line=<a[\\s]+href="([\\S]+)\\/"[\\s]*>.*([\\d]{4}-[\\w\\d]{2,5}-[\\d]{2}\\s[\\d]{2}:[\\d]{2})
http.parse.file.line=<a[\\s]+href="([\\S]+)"[\\s]*>.*([\\d]{4}-[\\w\\d]{2,5}-[\\d]{2}\\s[\\d]{2}:[\\d]{2}).*([\d\.]+[MKG]{0,1})

http.group.dir.name=1
http.group.dir.date=2
http.group.file.name=1
http.group.file.date=2
http.group.file.size=3

#Needed if data sources are contains in an archive
log.files=true

local.files.excluded=\\.panfs.*

#~40mn
ftp.timeout=2000000
ftp.automatic.reconnect=5
ftp.active.mode=false

# Bank default access
visibility.default=public

#proxy=http://localhost:3128

[loggers]
keys = root, biomaj

[handlers]
keys = console

[formatters]
keys = generic

[logger_root]
level = DEBUG
handlers = console

[logger_biomaj]
level = DEBUG
handlers = console
qualname = biomaj
propagate=0

[handler_console]
class = StreamHandler
args = (sys.stderr,)
level = DEBUG
formatter = generic

[formatter_generic]
format = %(asctime)s %(levelname)-5.5s [%(name)s][%(threadName)s] %(message)s

@nsanilkumar-valluri
Author

nsanilkumar-valluri commented Jul 4, 2019

Debug report for swissprot dataset

2019-07-04 14:10:09,799 DEBUG [biomaj][MainThread] Download
2019-07-04 14:10:09,800 INFO  [root][MainThread] Workflow:DownloadService:CleanSession
2019-07-04 14:10:09,800 DEBUG [root][MainThread] Download:List:ftp://ftp.ncbi.nih.gov/blast/db/FASTA/
2019-07-04 14:10:10,493 ERROR [root][MainThread] Workflow:download:Exception:invalid literal for int() with base 10: 'HTML//EN">'
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/biomaj/workflow.py", line 130, in start
self.session._session['status'][flow['name']] = getattr(self, 'wf_' + flow['name'])()
  File "/usr/local/lib/python3.6/site-packages/biomaj/workflow.py", line 1168, in wf_download
(file_list, dir_list) = downloader.list()
  File "/usr/local/lib/python3.6/site-packages/biomaj_download/download/ftp.py", line 296, in list
rfile['size'] = int(parts[4])
ValueError: invalid literal for int() with base 10: 'HTML//EN">'
2019-07-04 14:10:10,494 DEBUG [root][MainThread] Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/biomaj/workflow.py", line 130, in start
self.session._session['status'][flow['name']] = getattr(self, 'wf_' + flow['name'])()
  File "/usr/local/lib/python3.6/site-packages/biomaj/workflow.py", line 1168, in wf_download
(file_list, dir_list) = downloader.list()
  File "/usr/local/lib/python3.6/site-packages/biomaj_download/download/ftp.py", line 296, in list
rfile['size'] = int(parts[4])
ValueError: invalid literal for int() with base 10: 'HTML//EN">'

2019-07-04 14:10:10,495 ERROR [root][MainThread] Error during task download
2019-07-04 14:10:10,495 INFO  [root][MainThread] Workflow:wf_over
2019-07-04 14:10:10,532 INFO  [root][MainThread] Notify:none
An error occured:

Bank update request sent for swissprot
Failed to send update request for swissprot

@osallou
Contributor

osallou commented Jul 4, 2019

I'm going to check with your global.properties.

And what is the result of curl over ftp:

curl -v ftp://ftp.ncbi.nih.gov/blast/db/FASTA/

please indent your results or attach files, they are hard to read...

@osallou
Contributor

osallou commented Jul 4, 2019

I got no issue with your global.properties...
You still get an HTTP answer to an FTP request.... Are you behind a proxy?

I think I remember someone in a company who had this kind of issue. The requests were going out through a proxy that did not handle ftp directly; it converted the ftp requests into http requests/connections, leading to different answers....
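
One quick way to see whether a proxy is configured on the machine is to look at the proxy environment variables that curl and libcurl conventionally honour. This is only a sketch: it assumes the proxy is set through the environment, which the thread does not confirm, and the helper name is hypothetical.

```python
import os

# Variable names conventionally read by curl/libcurl, checked in both letter cases.
PROXY_VARIABLES = ("ftp_proxy", "http_proxy", "https_proxy", "all_proxy", "no_proxy")


def proxy_settings(environ=None):
    """Return the proxy-related environment variables that are set."""
    environ = os.environ if environ is None else environ
    found = {}
    for name in PROXY_VARIABLES:
        for candidate in (name, name.upper()):
            value = environ.get(candidate)
            if value:
                found[candidate] = value
    return found


if __name__ == "__main__":
    for name, value in proxy_settings().items():
        print(f"{name}={value}")
```

An empty result does not rule out a proxy, since it can also be enforced at the network level or set in a tool's own configuration.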

@nsanilkumar-valluri
Author

@osallou Thanks for your help. Sorry, next time I will take care with the indentation.
Yes, I am behind my company proxy. Do you remember any resolution for that problem?

@osallou
Contributor

osallou commented Jul 5, 2019

So I think the proxy is the issue.
Can you try the curl ftp command to see what is returned?

curl -v ftp://ftp.ncbi.nih.gov/blast/db/FASTA/

If it is the issue, and I think it is, then you cannot use ftp from your company (or you can ask your IT team for true direct ftp access to the internet from your server).... The workaround is to use http, as most of the web sites for banks offer both ftp and http access. However, as http listings are not standardized, you may have to customize, in your bank property file, the http regexp properties set in global.properties.

The regexps are used to analyse the web listing page and extract file and dir info.

You can however try the default ones first and see if they match.
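
A regexp can be tried against a saved listing line in plain Python before it goes into the bank property file. In the sketch below both the sample line and the pattern are illustrative assumptions, not the BioMAJ defaults; the three capture groups follow the numbering used by the `http.group.file.name`, `http.group.file.date` and `http.group.file.size` settings quoted earlier in the thread.

```python
import re

# Hypothetical listing line; real servers format their index pages differently.
SAMPLE_LINE = '<a href="swissprot.gz">swissprot.gz</a>   2019-07-02 00:00  102M'

# Illustrative pattern: group 1 = file name, group 2 = date, group 3 = size.
FILE_LINE = re.compile(
    r'<a\s+href="([^"]+)"\s*>.*?'
    r'(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2})\s+'
    r'([\d.]+[KMG]?)'
)


def parse_file_line(line):
    """Return (name, date, size) for a listing line, or None when it does not match."""
    match = FILE_LINE.search(line)
    return match.groups() if match else None


print(parse_file_line(SAMPLE_LINE))
```

Note that the property files shown in this thread double the backslashes (`\\s`, `\\d`), so a pattern validated this way would need the same escaping when copied over.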

@nsanilkumar-valluri
Author

nsanilkumar-valluri commented Jul 5, 2019

* About to connect() to proxy **************.com port 8080 (#0)
*   Trying ****************
* Connected to ***********************.com (10.127.189.154) port 8080 (#0)
> GET ftp://ftp.ncbi.nih.gov/blast/db/FASTA/ HTTP/1.1
> User-Agent: curl/7.29.0
> Host: ftp.ncbi.nih.gov:21
> Accept: */*
> Proxy-Connection: Keep-Alive
>
< HTTP/1.1 200 OK
< Content-Type: text/html
< Transfer-Encoding: chunked
< Proxy-Connection: Keep-Alive
< Connection: Keep-Alive
< Date: Fri, 05 Jul 2019 05:41:44 GMT
<
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<HTML>
<HEAD>
<TITLE>FTP Listing of /blast/db/FASTA/ at ftp.ncbi.nih.gov</TITLE>
<BASE HREF="ftp://ftp.ncbi.nih.gov/blast/db/FASTA/">
</HEAD>
<BODY>
<H2>FTP Listing of /blast/db/FASTA/ at ftp.ncbi.nih.gov</H2>
<HR>
<A HREF="../">Parent Directory</A><BR>
<PRE>
Nov 26 2003 00:00        91553 <A HREF="alu.a.gz">alu.a.gz</A>
Jun 15 2009 00:00           43 <A HREF="alu.a.gz.md5">alu.a.gz.md5</A>
Nov 26 2003 00:00        24465 <A HREF="alu.n.gz">alu.n.gz</A>
Jun 15 2009 00:00           43 <A HREF="alu.n.gz.md5">alu.n.gz.md5</A>
Nov 26 2003 00:00      4283092 <A HREF="drosoph.aa.gz">drosoph.aa.gz</A>
Jun 15 2009 00:00           48 <A HREF="drosoph.aa.gz.md5">drosoph.aa.gz.md5</A>
Nov 26 2003 00:00     36924008 <A HREF="drosoph.nt.gz">drosoph.nt.gz</A>
Jun 15 2009 00:00           48 <A HREF="drosoph.nt.gz.md5">drosoph.nt.gz.md5</A>
Jun 23 2019 00:04    967779446 <A HREF="env_nr.gz">env_nr.gz</A>
Jun 23 2019 00:04           44 <A HREF="env_nr.gz.md5">env_nr.gz.md5</A>
Jun 23 2019 11:06  43086728486 <A HREF="env_nt.gz">env_nt.gz</A>
Jun 23 2019 11:25           44 <A HREF="env_nt.gz.md5">env_nt.gz.md5</A>
Mar 17 2019 17:21   1458715296 <A HREF="est_human.gz">est_human.gz</A>
Mar 17 2019 17:21           47 <A HREF="est_human.gz.md5">est_human.gz.md5</A>
Mar 17 2019 18:10    776046470 <A HREF="est_mouse.gz">est_mouse.gz</A>
Mar 17 2019 18:10           47 <A HREF="est_mouse.gz.md5">est_mouse.gz.md5</A>
Jun 23 2019 21:46  11779604082 <A HREF="est_others.gz">est_others.gz</A>
Jun 23 2019 21:51           48 <A HREF="est_others.gz.md5">est_others.gz.md5</A>
Feb 24 2019 12:22   9999571934 <A HREF="gss.gz">gss.gz</A>
Feb 24 2019 12:27           41 <A HREF="gss.gz.md5">gss.gz.md5</A>
Jun 23 2019 09:45   8044464017 <A HREF="htgs.gz">htgs.gz</A>
Jun 23 2019 09:49           42 <A HREF="htgs.gz.md5">htgs.gz.md5</A>
Feb 01 2013 00:00     33709040 <A HREF="igSeqNt.gz">igSeqNt.gz</A>
Feb 01 2013 00:00      4654020 <A HREF="igSeqProt.gz">igSeqProt.gz</A>
Jul 05 2019 03:50     15862667 <A HREF="mito.aa.gz">mito.aa.gz</A>
Jul 05 2019 03:50           45 <A HREF="mito.aa.gz.md5">mito.aa.gz.md5</A>
Jul 05 2019 03:51     73957465 <A HREF="mito.nt.gz">mito.nt.gz</A>
Jul 05 2019 03:51           45 <A HREF="mito.nt.gz.md5">mito.nt.gz.md5</A>
Jul 02 2019 08:01  50774575876 <A HREF="nr.gz">nr.gz</A>
Jul 02 2019 08:20           40 <A HREF="nr.gz.md5">nr.gz.md5</A>
Jun 23 2019 13:27  57513447669 <A HREF="nt.gz">nt.gz</A>
Jun 23 2019 13:52           40 <A HREF="nt.gz.md5">nt.gz.md5</A>
Jun 29 2019 21:25 299422788794 <A HREF="other_genomic.gz">other_genomic.gz</A>
Jun 29 2019 23:31           51 <A HREF="other_genomic.gz.md5">other_genomic.gz.md5</A>
Jun 23 2019 10:01    288307625 <A HREF="pataa.gz">pataa.gz</A>
Jun 23 2019 10:02           43 <A HREF="pataa.gz.md5">pataa.gz.md5</A>
Jun 23 2019 12:27   6000355688 <A HREF="patnt.gz">patnt.gz</A>
Jun 23 2019 12:30           43 <A HREF="patnt.gz.md5">patnt.gz.md5</A>
Jul 02 2019 04:00     21028400 <A HREF="pdbaa.gz">pdbaa.gz</A>
Jul 02 2019 04:00           43 <A HREF="pdbaa.gz.md5">pdbaa.gz.md5</A>
Jul 02 2019 01:00       679928 <A HREF="pdbnt.gz">pdbnt.gz</A>
Jul 02 2019 01:00           43 <A HREF="pdbnt.gz.md5">pdbnt.gz.md5</A>
May 19 2019 06:02    195858975 <A HREF="sts.gz">sts.gz</A>
May 19 2019 06:03           41 <A HREF="sts.gz.md5">sts.gz.md5</A>
Jul 02 2019 04:00    106473461 <A HREF="swissprot.gz">swissprot.gz</A>
Jul 02 2019 04:00           47 <A HREF="swissprot.gz.md5">swissprot.gz.md5</A>
Jan 13 2010 00:00       881144 <A HREF="vector.gz">vector.gz</A>
Nov 26 2003 00:00      1951194 <A HREF="yeast.aa.gz">yeast.aa.gz</A>
Jun 15 2009 00:00           46 <A HREF="yeast.aa.gz.md5">yeast.aa.gz.md5</A>
Nov 26 2003 00:00      3732371 <A HREF="yeast.nt.gz">yeast.nt.gz</A>
Jun 15 2009 00:00           46 <A HREF="yeast.nt.gz.md5">yeast.nt.gz.md5</A>
</PRE>
<HR>
</BODY>
</HTML>
* Connection #0 to host jdcproxy.phibred.com left intact
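
The response above explains the error in the traceback. Assuming the FTP lister splits each listing line on whitespace and reads the fifth field as the size (which is what `int(parts[4])` in the traceback suggests), the DOCTYPE line alone reproduces the failure. The helper names below are hypothetical, and `looks_like_html` is only a heuristic sketch of how such a response could be detected early.

```python
DOCTYPE_LINE = '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">'


def size_field(line):
    """Mimic the failing step: split on whitespace and read field 4 as an integer."""
    parts = line.split()
    return int(parts[4])


def looks_like_html(listing):
    """Heuristic check that a supposed FTP listing is really an HTML page."""
    head = listing.lstrip()[:200].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


try:
    size_field(DOCTYPE_LINE)
except ValueError as error:
    # Same message as in the traceback: the fifth token is 'HTML//EN">'.
    print(error)
```

A check like `looks_like_html` would let a downloader fail with a clearer message ("got HTML, is there a proxy?") instead of a parsing error.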

@nsanilkumar-valluri
Author

Yes. Can we contribute something to handle this type of case? Company proxies (and policies) are unlikely to be changed for a single project.

@osallou
Contributor

osallou commented Jul 5, 2019

The problem is that proxies doing ftp -> http conversion (which is not always supported or allowed) follow no standard. The returned HTML will look different depending on the proxy used, which prevents biomaj from handling this use case correctly.

As I said, the workaround is to use the http protocol instead of ftp in those cases. The http regexp parser is not always cool/easy to set up, but usually you only need to define a few cases.
The ones provided in global.properties match some servers, not all, as http file listings are not standardized.
If they do not match, you need to find the correct regexps and set them in your bank property file.

If you find a way to handle most cases, we'll be glad to get it in biomaj :-)
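
As a starting point for such a contribution, here is a sketch that parses the file lines of the one proxy-generated listing shown earlier in this thread. The pattern and function name are assumptions made for illustration; as noted above, another proxy may produce a different layout, and directory or link entries are not handled.

```python
import re

# Line copied from the proxy response shown earlier in the thread.
PROXY_LINE = 'Jul 02 2019 04:00    106473461 <A HREF="swissprot.gz">swissprot.gz</A>'

# Date, size in bytes, then the file name taken from the HREF attribute.
PROXY_FILE_LINE = re.compile(
    r'^(\w{3} \d{2} \d{4} \d{2}:\d{2})\s+(\d+)\s+<A HREF="([^"]+)">'
)


def parse_proxy_line(line):
    """Return name, date and size for a file line, or None for anything else."""
    match = PROXY_FILE_LINE.match(line.strip())
    if match is None:
        return None
    date, size, name = match.groups()
    return {"name": name, "date": date, "size": int(size)}


print(parse_proxy_line(PROXY_LINE))
```

Lines that do not match (the HTML header, `<HR>`, the parent directory link) simply return None, which avoids the `int()` failure on non-file lines.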

@osallou osallou self-assigned this Jul 5, 2019