File Manipulation with Azure Blob Storage


We try a few file manipulations between a local computer and an Azure blob storage. This requires azure-sdk-for-python and pyenbc. We first create a dummy file.

import pandas, random
mat = [ {"x":random.random(), "y":random.random()} for i in range(0,1000)]
df = pandas.DataFrame(mat)
df.to_csv("randomxy.txt", sep="\t", encoding="utf8")

We need credentials. To avoid storing them in clear text in the notebook, we use an HTML form:

import pyquickhelper.ipythonhelper as ipy
params={"blob_storage":"hdblobstorage", "password":""}
ipy.open_html_form(params=params,title="credentials",key_save="blobservice")

The form stores the values in the dictionary blobservice; we copy them into two variables in the workspace:

blobstorage = blobservice["blob_storage"]
blobpassword = blobservice["password"]
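
Outside a notebook, where no HTML form is available, the same two values could be read from environment variables. The sketch below is only an illustration: the helper load_credentials and the variable names AZURE_BLOB_STORAGE and AZURE_BLOB_PASSWORD are hypothetical and not part of pyquickhelper or pyenbc.

```python
import os


def load_credentials(env=None):
    """Return (storage_name, key) read from a mapping of environment variables.

    The variable names below are hypothetical; any names would do.
    """
    env = os.environ if env is None else env
    storage = env.get("AZURE_BLOB_STORAGE", "")
    key = env.get("AZURE_BLOB_PASSWORD", "")
    if not storage or not key:
        raise KeyError("set AZURE_BLOB_STORAGE and AZURE_BLOB_PASSWORD first")
    return storage, key


# Demonstration with an explicit mapping, so that no real secret is needed.
blobstorage, blobpassword = load_credentials(
    {"AZURE_BLOB_STORAGE": "hdblobstorage", "AZURE_BLOB_PASSWORD": "not-a-real-key"}
)
print(blobstorage)
```

Called without argument, load_credentials() reads os.environ, which keeps the key out of the notebook file just as the form does.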

We need pyensae >= 1.2:

import pyensae
import pyenbc
%load_ext pyensae
%load_ext pyenbc
pyensae.__version__, pyenbc.__version__
The pyensae extension is already loaded. To reload it, use:
  %reload_ext pyensae
The pyenbc extension is already loaded. To reload it, use:
  %reload_ext pyenbc
'1.2'
%blob_open --help
usage: blob_open [-h] [-b BLOBSTORAGE] [-p BLOBPASSWORD]
open a connection to an Azure blob storage, by default, the magic command
takes blobstorage and blobpassword local variables as default values
optional arguments:
  -h, --help            show this help message and exit
  -b BLOBSTORAGE, --blobstorage BLOBSTORAGE
                        blob storage name
  -p BLOBPASSWORD, --blobpassword BLOBPASSWORD
                        blob password
usage: blob_open [-h] [-b BLOBSTORAGE] [-p BLOBPASSWORD]
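
To make the role of the default values concrete, here is a minimal argparse sketch with the same two options as the usage message. It is not the pyenbc implementation; build_parser is a hypothetical helper, and its arguments stand in for the blobstorage and blobpassword notebook variables.

```python
import argparse


def build_parser(default_storage=None, default_password=None):
    """Build a parser with the same options as the usage message above."""
    parser = argparse.ArgumentParser(
        prog="blob_open",
        description="open a connection to an Azure blob storage",
    )
    parser.add_argument("-b", "--blobstorage", default=default_storage,
                        help="blob storage name")
    parser.add_argument("-p", "--blobpassword", default=default_password,
                        help="blob password")
    return parser


# The defaults play the role of the notebook variables (placeholder values).
parser = build_parser(default_storage="hdblobstorage", default_password="secret")
args = parser.parse_args([])                       # no option: defaults are used
override = parser.parse_args(["-b", "petittest"])  # -b replaces the default name
print(args.blobstorage, override.blobstorage)
```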

We open a connection to the blob storage:

cl, bs = %blob_open
cl, bs
(<pyenbc.remote.azure_connection.AzureClient at 0xa4a2a20>,
 <azure.storage.blob.blobservice.BlobService at 0xa4a2b00>)

We extract the available containers:

l = %blob_containers
l
['clusterensaeazure1',
 'clusterensaeazure2',
 'clusterensaeazure2-1',
 'hdblobstorage',
 'petittest',
 'sparkclus',
 'sparkclus2',
 'testhadoopensae']

We get the content of one container:

df = %blob_ls hdblobstorage
df.tail(n=5)
      name                                               last_modified                  content_type              content_length  blob_type
4995  velib_several_days/paris.2014-11-14_15-54-58.6...  Fri, 28 Nov 2014 10:34:15 GMT  application/octet-stream  524941          BlockBlob
4996  velib_several_days/paris.2014-11-14_15-55-57.8...  Fri, 28 Nov 2014 10:34:16 GMT  application/octet-stream  524944          BlockBlob
4997  velib_several_days/paris.2014-11-14_15-56-58.5...  Fri, 28 Nov 2014 10:34:17 GMT  application/octet-stream  522499          BlockBlob
4998  velib_several_days/paris.2014-11-14_15-57-57.8...  Fri, 28 Nov 2014 10:34:17 GMT  application/octet-stream  524958          BlockBlob
4999  velib_several_days/paris.2014-11-14_15-58-58.5...  Fri, 28 Nov 2014 10:34:18 GMT  application/octet-stream  523757          BlockBlob
%hd_wasb_prefix
'wasb://hdblobstorage@hdblobstorage.blob.core.windows.net/'
cl.wasb_to_file("hdblobstorage", "velib_several_days")
'wasb://hdblobstorage@hdblobstorage.blob.core.windows.net/velib_several_days'
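
The two outputs above follow the usual wasb layout, wasb://<container>@<account>.blob.core.windows.net/<path>. The sketch below rebuilds such a URL by hand. The function wasb_url is a hypothetical helper, not the code behind cl.wasb_to_file, and it assumes the storage account name is passed explicitly instead of being taken from the open connection.

```python
def wasb_url(container, account, path=""):
    """Build wasb://<container>@<account>.blob.core.windows.net/<path>."""
    return "wasb://{0}@{1}.blob.core.windows.net/{2}".format(
        container, account, path.lstrip("/"))


# Here the container and the storage account happen to share the same name.
print(wasb_url("hdblobstorage", "hdblobstorage", "velib_several_days"))
```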

We upload the file we created in the first cell:

%blob_up randomxy.txt clusterensaeazure1/testpyenbc/randomxy.txt
'testpyenbc/randomxy.txt'
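
The remote path given to the command starts with the container name, and the value returned is the blob name without it. Assuming this convention (first path segment is the container, the rest is the blob name), the split can be sketched as follows; split_blob_path is a hypothetical helper, not a pyenbc function.

```python
def split_blob_path(remote_path):
    """Split 'container/some/blob/name' into ('container', 'some/blob/name')."""
    container, _, blob_name = remote_path.strip("/").partition("/")
    return container, blob_name


print(split_blob_path("clusterensaeazure1/testpyenbc/randomxy.txt"))
```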

We check that the file was uploaded:

%blob_ls clusterensaeazure1/testpyenbc
   name                      last_modified                  content_type              content_length  blob_type
0  testpyenbc/randomxy.txt   Sat, 26 Sep 2015 22:05:12 GMT  application/octet-stream  43483           BlockBlob
1  testpyenbc/randomxy2.txt  Sat, 26 Sep 2015 21:50:55 GMT  application/octet-stream  43456           BlockBlob

We try the extended listing, which returns more properties for each blob:

%blob_lsl clusterensaeazure1/testpyenbc
blob_type content_encoding content_language content_length content_md5 content_type copy_completion_time copy_id copy_progress copy_source copy_status copy_status_description etag last_modified lease_duration lease_state lease_status name url xms_blob_sequence_number
0 BlockBlob 43483 application/octet-stream 0x8D2C6BE8D4DEB43 Sat, 26 Sep 2015 22:05:12 GMT available unlocked testpyenbc/randomxy.txt 0
1 BlockBlob 43456 application/octet-stream 0x8D2C6BC8E2C38FB Sat, 26 Sep 2015 21:50:55 GMT available unlocked testpyenbc/randomxy2.txt 0

If you need information that is not accessible through a magic command, you can use the variable bs (type azure.storage.blob.blobservice.BlobService):

blocks = bs.get_block_list("clusterensaeazure1", "testpyenbc/randomxy.txt")
for block in blocks.committed_blocks:
    print("size=", block.size, "id=", block.id)
size= 43483 id= 00000000

We download the file back to the local computer:

%blob_down clusterensaeazure1/testpyenbc/randomxy.txt randomxx_copy.txt --overwrite
'randomxx_copy.txt'
%lsr r.*[.]txt
   directory  last_modified               name                 size
0  False      2015-09-26 23:50:56.776239  .\randomall.txt      84.88 Kb
1  False      2015-09-27 00:05:14.546891  .\randomxx_copy.txt  42.46 Kb
2  False      2015-09-27 00:04:55.847278  .\randomxy.txt       42.46 Kb
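
The local size of 42.46 Kb agrees with the content_length of 43483 bytes reported by the blob storage, assuming 1 Kb = 1024 bytes. The sketch below does a similar listing with the standard library only. The helpers list_files and format_kb are hypothetical, the listing is not recursive, and it is not how %lsr is implemented.

```python
import os
import re
import tempfile


def list_files(folder, pattern):
    """Return (name, size in bytes) for files in folder whose name matches pattern."""
    regex = re.compile(pattern)
    found = []
    for name in sorted(os.listdir(folder)):
        full = os.path.join(folder, name)
        if os.path.isfile(full) and regex.search(name):
            found.append((name, os.path.getsize(full)))
    return found


def format_kb(size):
    """Format a size in bytes as kilobytes, assuming 1 Kb = 1024 bytes."""
    return "{0:.2f} Kb".format(size / 1024)


# Small demonstration in a temporary folder with one matching file.
folder = tempfile.mkdtemp()
for name, size in [("randomxy.txt", 43483), ("notes.md", 10)]:
    with open(os.path.join(folder, name), "wb") as f:
        f.write(b"0" * size)

for name, size in list_files(folder, r"r.*[.]txt"):
    print(name, format_kb(size))
```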

PIG scripts usually produce more than one output file, and it is convenient to merge them while downloading. To test this, we upload our file a second time under a different name:

%blob_up randomxy.txt clusterensaeazure1/testpyenbc/randomxy2.txt
'testpyenbc/randomxy2.txt'
%blob_ls clusterensaeazure1/testpyenbc
   name                      last_modified                  content_type              content_length  blob_type
0  testpyenbc/randomxy.txt   Sat, 26 Sep 2015 22:05:12 GMT  application/octet-stream  43483           BlockBlob
1  testpyenbc/randomxy2.txt  Sat, 26 Sep 2015 22:05:18 GMT  application/octet-stream  43483           BlockBlob

And we merge them:

%blob_downmerge clusterensaeazure1/testpyenbc randomall.txt --overwrite
'randomall.txt'

We check that randomall.txt is about twice the size of a single file:

%lsr r.*[.]txt
   directory  last_modified               name                 size
0  False      2015-09-27 00:05:32.134221  .\randomall.txt      84.93 Kb
1  False      2015-09-27 00:05:14.546891  .\randomxx_copy.txt  42.46 Kb
2  False      2015-09-27 00:04:55.847278  .\randomxy.txt       42.46 Kb
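
The merged size is consistent with a plain concatenation: 2 x 43483 = 86966 bytes, which is about 84.93 Kb. The sketch below shows the local equivalent of such a merge. The helper merge_files is hypothetical, and whether %blob_downmerge proceeds exactly this way is an assumption.

```python
import os
import shutil
import tempfile


def merge_files(sources, destination):
    """Concatenate the source files, in order, into destination."""
    with open(destination, "wb") as out:
        for src in sources:
            with open(src, "rb") as f:
                shutil.copyfileobj(f, out)
    return destination


# Two identical parts of 400 bytes each, written to a temporary folder.
folder = tempfile.mkdtemp()
parts = []
for i in range(2):
    part = os.path.join(folder, "part{0}.txt".format(i))
    with open(part, "wb") as f:
        f.write(b"x\ty\n" * 100)
    parts.append(part)

merged = merge_files(parts, os.path.join(folder, "merged.txt"))
print(os.path.getsize(merged), sum(os.path.getsize(p) for p in parts))
```

Reading and writing in binary mode keeps the bytes untouched, so the merged size is exactly the sum of the parts.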

We finally remove the files from the blob storage:

%blob_delete clusterensaeazure1/testpyenbc/randomxy.txt
%blob_delete clusterensaeazure1/testpyenbc/randomxy2.txt
True

We check that they are gone:

%blob_ls clusterensaeazure1/testpyenbc/
name url

And we close the connection:

%blob_close
True

END