File Manipulation with Azure Blob Storage
We try a few file manipulations between a local computer and an Azure blob storage. This requires azure-sdk-for-python and pyenbc. We first create a dummy file.
import pandas, random
mat = [ {"x":random.random(), "y":random.random()} for i in range(0,1000)]
df = pandas.DataFrame(mat)
df.to_csv("randomxy.txt", sep="\t", encoding="utf8")
We need credentials. To avoid leaving them in clear text in the notebook, we use an HTML form:
import pyquickhelper.ipythonhelper as ipy
params={"blob_storage":"hdblobstorage", "password":""}
ipy.open_html_form(params=params,title="credentials",key_save="blobservice")
The form saves the values in the variable blobservice. We copy them into two variables in the workspace:
blobstorage = blobservice["blob_storage"]
blobpassword = blobservice["password"]
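Outside a notebook the HTML form is not available. A minimal sketch of one alternative, assuming the credentials are exported as environment variables. The variable names AZURE_BLOB_STORAGE and AZURE_BLOB_PASSWORD and the helper load_credentials are hypothetical and are not part of pyenbc or pyquickhelper:

```python
import os


def load_credentials(env=None):
    """Return (blob_storage, password) read from an environment-like mapping.

    The two variable names are illustrative assumptions, not a convention
    of pyenbc or of the Azure SDK.
    """
    env = os.environ if env is None else env
    storage = env.get("AZURE_BLOB_STORAGE", "")
    password = env.get("AZURE_BLOB_PASSWORD", "")
    if not storage or not password:
        # Fail early instead of sending an empty key to the service.
        raise KeyError("AZURE_BLOB_STORAGE and AZURE_BLOB_PASSWORD must be set")
    return storage, password


# Demonstration with an explicit mapping, so nothing secret is needed here.
fake_env = {"AZURE_BLOB_STORAGE": "hdblobstorage", "AZURE_BLOB_PASSWORD": "not-a-real-key"}
blobstorage, blobpassword = load_credentials(fake_env)
print(blobstorage)
```

Calling load_credentials() without argument reads os.environ, which keeps the key out of the notebook just as the form does.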
We need pyensae >= 1.2:
import pyensae
import pyenbc
%load_ext pyensae
%load_ext pyenbc
pyensae.__version__, pyenbc.__version__
The pyensae extension is already loaded. To reload it, use:
%reload_ext pyensae
The pyenbc extension is already loaded. To reload it, use:
%reload_ext pyenbc
'1.2'
%blob_open --help
usage: blob_open [-h] [-b BLOBSTORAGE] [-p BLOBPASSWORD]
open a connection to an Azure blob storage, by default, the magic command
takes blobstorage and blobpassword local variables as default values
optional arguments:
-h, --help show this help message and exit
-b BLOBSTORAGE, --blobstorage BLOBSTORAGE
blob storage name
-p BLOBPASSWORD, --blobpassword BLOBPASSWORD
blob password
We open a connection to the blob storage:
cl, bs = %blob_open
cl, bs
(<pyenbc.remote.azure_connection.AzureClient at 0xa4a2a20>,
<azure.storage.blob.blobservice.BlobService at 0xa4a2b00>)
We extract the available containers:
l = %blob_containers
l
['clusterensaeazure1',
'clusterensaeazure2',
'clusterensaeazure2-1',
'hdblobstorage',
'petittest',
'sparkclus',
'sparkclus2',
'testhadoopensae']
We get the content of one container:
df = %blob_ls hdblobstorage
df.tail(n=5)
| | name | last_modified | content_type | content_length | blob_type |
|---|---|---|---|---|---|
| 4995 | velib_several_days/paris.2014-11-14_15-54-58.6... | Fri, 28 Nov 2014 10:34:15 GMT | application/octet-stream | 524941 | BlockBlob |
| 4996 | velib_several_days/paris.2014-11-14_15-55-57.8... | Fri, 28 Nov 2014 10:34:16 GMT | application/octet-stream | 524944 | BlockBlob |
| 4997 | velib_several_days/paris.2014-11-14_15-56-58.5... | Fri, 28 Nov 2014 10:34:17 GMT | application/octet-stream | 522499 | BlockBlob |
| 4998 | velib_several_days/paris.2014-11-14_15-57-57.8... | Fri, 28 Nov 2014 10:34:17 GMT | application/octet-stream | 524958 | BlockBlob |
| 4999 | velib_several_days/paris.2014-11-14_15-58-58.5... | Fri, 28 Nov 2014 10:34:18 GMT | application/octet-stream | 523757 | BlockBlob |
%hd_wasb_prefix
'wasb://hdblobstorage@hdblobstorage.blob.core.windows.net/'
cl.wasb_to_file("hdblobstorage", "velib_several_days")
'wasb://hdblobstorage@hdblobstorage.blob.core.windows.net/velib_several_days'
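The two outputs above follow the usual WASB pattern wasb://container@account.blob.core.windows.net/path. A minimal sketch of how such a URL can be assembled by hand; the helper wasb_url is hypothetical, and we assume the first name is the container and the second the storage account (both are called hdblobstorage in this example):

```python
def wasb_url(container, account, path=""):
    """Build a wasb:// URL for a path inside a container of a storage account."""
    # A leading slash in the path would produce a double slash in the URL.
    path = path.lstrip("/")
    return "wasb://{0}@{1}.blob.core.windows.net/{2}".format(container, account, path)


print(wasb_url("hdblobstorage", "hdblobstorage"))
print(wasb_url("hdblobstorage", "hdblobstorage", "velib_several_days"))
```

The magic command and cl.wasb_to_file remain the convenient way to get these strings; the sketch only makes the structure of the URL explicit.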
We upload the file we created in the first cell:
%blob_up randomxy.txt clusterensaeazure1/testpyenbc/randomxy.txt
'testpyenbc/randomxy.txt'
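The remote path given to the magic commands starts with the container name, and the command returns the rest, which is the blob name. A minimal sketch of that split, assuming the first component is always the container; split_remote_path is a hypothetical helper, not a function of pyenbc:

```python
def split_remote_path(remote):
    """Split 'container/some/blob/name' into (container, blob_name)."""
    # Ignore leading and trailing slashes, then cut at the first separator.
    container, _, blob_name = remote.strip("/").partition("/")
    if not container:
        raise ValueError("the remote path must start with a container name")
    return container, blob_name


print(split_remote_path("clusterensaeazure1/testpyenbc/randomxy.txt"))
```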
We check that the file is there:
%blob_ls clusterensaeazure1/testpyenbc
| | name | last_modified | content_type | content_length | blob_type |
|---|---|---|---|---|---|
| 0 | testpyenbc/randomxy.txt | Sat, 26 Sep 2015 22:05:12 GMT | application/octet-stream | 43483 | BlockBlob |
| 1 | testpyenbc/randomxy2.txt | Sat, 26 Sep 2015 21:50:55 GMT | application/octet-stream | 43456 | BlockBlob |
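A container has a flat namespace: testpyenbc is not a real folder, only the beginning of the blob names. A minimal sketch of what listing a "folder" amounts to, on a plain list of names. The helper list_with_prefix and the name other/file.txt are hypothetical, and how blob_ls matches the prefix internally is an assumption:

```python
def list_with_prefix(blob_names, prefix):
    """Return the sorted blob names located under a virtual folder."""
    prefix = prefix.strip("/")
    if prefix:
        # Add the separator so that 'test' does not match 'testpyenbc/...'.
        prefix += "/"
    return sorted(name for name in blob_names if name.startswith(prefix))


names = [
    "testpyenbc/randomxy2.txt",
    "testpyenbc/randomxy.txt",
    "other/file.txt",  # made-up name, only here to show the filtering
]
print(list_with_prefix(names, "testpyenbc"))
```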
The command blob_lsl returns an extended listing:
%blob_lsl clusterensaeazure1/testpyenbc
| | blob_type | content_encoding | content_language | content_length | content_md5 | content_type | copy_completion_time | copy_id | copy_progress | copy_source | copy_status | copy_status_description | etag | last_modified | lease_duration | lease_state | lease_status | name | url | xms_blob_sequence_number |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BlockBlob | | | 43483 | | application/octet-stream | | | | | | | 0x8D2C6BE8D4DEB43 | Sat, 26 Sep 2015 22:05:12 GMT | | available | unlocked | testpyenbc/randomxy.txt | | 0 |
| 1 | BlockBlob | | | 43456 | | application/octet-stream | | | | | | | 0x8D2C6BC8E2C38FB | Sat, 26 Sep 2015 21:50:55 GMT | | available | unlocked | testpyenbc/randomxy2.txt | | 0 |
If you need information that is not accessible through a magic command, you can use the variable bs (type azure.storage.blob.blobservice.BlobService):
l=bs.get_block_list("clusterensaeazure1", "testpyenbc/randomxy.txt")
for _ in l.committed_blocks:
    print("size=",_.size, "id=",_.id)
size= 43483 id= 00000000
We download the file back to the local computer:
%blob_down clusterensaeazure1/testpyenbc/randomxy.txt randomxx_copy.txt --overwrite
'randomxx_copy.txt'
%lsr r.*[.]txt
| | directory | last_modified | name | size |
|---|---|---|---|---|
| 0 | False | 2015-09-26 23:50:56.776239 | .\randomall.txt | 84.88 Kb |
| 1 | False | 2015-09-27 00:05:14.546891 | .\randomxx_copy.txt | 42.46 Kb |
| 2 | False | 2015-09-27 00:04:55.847278 | .\randomxy.txt | 42.46 Kb |
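The listing shows that both files have the same size, which does not prove they have the same content. A minimal sketch of a stricter check based on a checksum; the helpers file_sha256 and same_content are hypothetical and use only the standard library:

```python
import hashlib


def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read chunk by chunk."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter(callable, sentinel) stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """Tell whether two local files hold exactly the same bytes."""
    return file_sha256(path_a) == file_sha256(path_b)


# In this notebook, the call would be:
# same_content("randomxy.txt", "randomxx_copy.txt")
```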
PIG scripts usually produce more than one output file, and it is convenient to merge these files while downloading them. To test that, we upload our file a second time under a different name:
%blob_up randomxy.txt clusterensaeazure1/testpyenbc/randomxy2.txt
'testpyenbc/randomxy2.txt'
%blob_ls clusterensaeazure1/testpyenbc
| | name | last_modified | content_type | content_length | blob_type |
|---|---|---|---|---|---|
| 0 | testpyenbc/randomxy.txt | Sat, 26 Sep 2015 22:05:12 GMT | application/octet-stream | 43483 | BlockBlob |
| 1 | testpyenbc/randomxy2.txt | Sat, 26 Sep 2015 22:05:18 GMT | application/octet-stream | 43483 | BlockBlob |
And we merge them:
%blob_downmerge clusterensaeazure1/testpyenbc randomall.txt --overwrite
'randomall.txt'
We check that the file randomall.txt is twice the size of a single file:
%lsr r.*[.]txt
| | directory | last_modified | name | size |
|---|---|---|---|---|
| 0 | False | 2015-09-27 00:05:32.134221 | .\randomall.txt | 84.93 Kb |
| 1 | False | 2015-09-27 00:05:14.546891 | .\randomxx_copy.txt | 42.46 Kb |
| 2 | False | 2015-09-27 00:04:55.847278 | .\randomxy.txt | 42.46 Kb |
We finally remove the files from the blob storage:
%blob_delete clusterensaeazure1/testpyenbc/randomxy.txt
%blob_delete clusterensaeazure1/testpyenbc/randomxy2.txt
True
We check that they are gone:
%blob_ls clusterensaeazure1/testpyenbc/
| name | url |
|---|---|
And we close the connection:
%blob_close
True
END