
Monday, July 28, 2025

Differences between Parquet and Delta Lake (Delta)

 


Parquet is an open, columnar file format optimized for analytical reads, while Delta Lake stores data as Parquet files plus a transaction log that adds ACID guarantees, versioning (time travel), and schema enforcement. These are the key differences between Parquet and Delta Lake (Delta) in the context of data storage and processing in systems like Databricks.





  What is ACID?

  ACID is a set of properties that ensures reliable processing of database transactions. It stands for:

     A – Atomicity

  • "All or nothing"

  • A transaction is treated as a single unit, which either completes entirely or does not happen at all.

  • Example: In an e-commerce purchase, when a payment moves money from one account to another, the debit and credit must both succeed. If either fails, the whole transaction is rolled back.
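The rollback behaviour described above can be sketched with Python's built-in sqlite3 module; the account names and balances here are made up purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
cur.executemany("INSERT INTO accounts VALUES (?, ?)",
                [("buyer", 500), ("vendor", 100)])
conn.commit()

try:
    # Debit and credit must succeed together, or not at all
    cur.execute("UPDATE accounts SET balance = balance - 200 WHERE name = 'buyer'")
    raise RuntimeError("simulated crash before the credit step")
    cur.execute("UPDATE accounts SET balance = balance + 200 WHERE name = 'vendor'")
    conn.commit()
except RuntimeError:
    conn.rollback()  # atomicity: the partial debit is undone

balances = dict(cur.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'buyer': 500, 'vendor': 100} -- unchanged after rollback
```

Because the "crash" happens after the debit but before the credit, the rollback restores both accounts to their original balances; nothing partial is ever committed.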

C – Consistency
    • Ensures that a transaction brings the database from one valid state to another.

    • All rules (constraints, cascades, etc.) are maintained before and after the transaction.

    • Prevents invalid data or corruption.

I – Isolation

  • Ensures that concurrent transactions do not interfere with each other.

  • Each transaction should appear as if it's the only one running, even when others are executing at the same time.

  • Prevents issues like dirty reads, non-repeatable reads, or phantom reads (depending on isolation level).

D – Durability

  • Once a transaction is committed, the changes are permanent, even in the event of a system crash or power failure.

  • Typically achieved through write-ahead logs, checkpoints, or journaling.

Conclusion:

·         Use Parquet when you need a fast, compact, read-optimized format without transactional support.

·         Use Delta when you need reliability, auditability, and flexibility (like schema evolution, versioning, upserts, deletes).



Databricks mount point creation step by step

Creating a mount point in Databricks allows you to access external storage (like Azure Blob Storage, ADLS, or AWS S3) as if it were part of the Databricks file system (DBFS). Here is a step-by-step guide to creating a mount point in Databricks.

For Azure Blob Storage or Azure Data Lake Storage Gen2

1. Get your storage account information and credentials. You will need:

   o Storage account name
   o Container name
   o Access key, SAS token, or OAuth credentials

2. Mount the storage from notebook code. You typically do this in a Databricks notebook using the dbutils.fs.mount() function. Here is an example mounting an Azure Blob Storage container using an access key:
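The access-key code for this example appears to be missing from the post; below is a minimal sketch following the standard dbutils.fs.mount() pattern. All bracketed names are placeholders you must replace, and the snippet only runs inside a Databricks notebook, where dbutils is predefined:

```python
# Mount a Blob Storage container with an account access key
# kept in a Databricks secret scope (placeholders are illustrative)
dbutils.fs.mount(
  source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
  mount_point = "/mnt/<mount-name>",
  extra_configs = {
    "fs.azure.account.key.<storage-account-name>.blob.core.windows.net":
      dbutils.secrets.get(scope="<scope-name>", key="<key-name>")
  }
)
```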
For Azure Data Lake Storage Gen2 with OAuth (more secure)

You configure a service principal and use OAuth tokens for mounting. Example code snippet:

configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": "<application-id>",
  "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<secret-key>"),
  "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"
}

dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs
)

Thursday, October 31, 2024

Azure Synapse Course





What is Azure Synapse Analytics?

Azure Synapse is an enterprise analytics service that accelerates time to insight across data warehouses and big data systems. Azure Synapse brings together the best of SQL technologies used in enterprise data warehousing, Spark technologies used for big data, Data Explorer for log and time series analytics, Pipelines for data integration and ETL/ELT, and deep integration with other Azure services such as PowerBI, CosmosDB, and AzureML.








Azure Synapse supports the full range of analytics:

Descriptive analytics, which answers the question “What is happening in my business?”. This question is typically answered through the creation of a data warehouse, in which historical data is persisted in relational tables for multidimensional modeling and reporting.
Diagnostic analytics, which deals with answering the question “Why is it happening?”. This may involve exploring information that already exists in a data warehouse, but typically involves a wider search of your data estate to find more data to support this type of analysis.
Predictive analytics, which enables you to answer the question “What is likely to happen in the future based on previous trends and patterns?”
Prescriptive analytics, which enables autonomous decision making based on real-time or near real-time analysis of data, using predictive analytics.


Tuesday, December 28, 2021

AzureDataFactory Overview

 Azure Data Factory


1.Introduction

Azure Data Factory is a cloud-based ETL and data integration service to create workflows for moving and transforming data. With Data Factory you can create scheduled workflows (pipelines) in a code-free manner.


2.Azure ADF Architecture


The architecture diagram contains the objects below:

                        1. Data source

                        2. Orchestration

                        3. Analysis service

                        4. Power BI reports



3.ADF Relationship between Data Factory entities

     It consists of the entities below:

                  1.Pipeline

                  2.Data Set

                  3.Activity

                  4.Linked Service


Tuesday, July 6, 2021

Azure Data Factory Fundamentals

 1. Azure Data Factory is an ETL tool like SSIS, hosted in the Azure environment.

2. Many companies are now trying to migrate their SSIS applications into Azure.

3. It is scalable, reliable, and flexible.

Sunday, March 21, 2021

How to identify duplicates in SQL

 Step1.

There are multiple ways to identify duplicate records.

Using the ROW_NUMBER function is the most efficient way.

Example:

Table: EMP

ID   Name   LOC   Amount

1    GHJ    UA    200

1    GHJ    UA    200

2    MKL    IN    300

3    PER    Aus   400


Identify duplicate records using the query below:

Query:

Select ID, Name, LOC, count(*) from EMP

Group by ID, Name, LOC Having count(*) > 1


Query:

With Duplicate as
(
    select ID, Name, LOC,
           Row_Number() over (partition by ID, Name, LOC order by ID) as dup
    from EMP
)
delete from Duplicate where dup > 1

(Deleting through a CTE like this works in SQL Server, where a CTE over a single table is updatable.)
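The detection query above can be run end to end with Python's built-in sqlite3 module and the sample EMP data. Note that SQLite does not support deleting through a CTE, so this sketch removes duplicates by keeping the lowest rowid per group instead:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE EMP (ID INT, Name TEXT, LOC TEXT, Amount INT)")
cur.executemany("INSERT INTO EMP VALUES (?, ?, ?, ?)", [
    (1, "GHJ", "UA", 200),
    (1, "GHJ", "UA", 200),
    (2, "MKL", "IN", 300),
    (3, "PER", "Aus", 400),
])

# The GROUP BY ... HAVING query from the post: which rows are duplicated?
dupes = cur.execute("""
    SELECT ID, Name, LOC, COUNT(*) FROM EMP
    GROUP BY ID, Name, LOC HAVING COUNT(*) > 1""").fetchall()
print(dupes)  # [(1, 'GHJ', 'UA', 2)]

# SQLite cannot DELETE through a CTE, so keep one row (lowest rowid) per group
cur.execute("""
    DELETE FROM EMP WHERE rowid NOT IN (
        SELECT MIN(rowid) FROM EMP GROUP BY ID, Name, LOC)""")
remaining = cur.execute("SELECT COUNT(*) FROM EMP").fetchone()[0]
print(remaining)  # 3
```

After the delete, only one of the two (1, GHJ, UA, 200) rows survives, leaving three distinct records.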

Resource List

 1. Monitor resources as below.



Azure Sql database config

 Step 1

1. Select the database and click Create.

2. Select the resource group name and select the subscription.

3. Enter the database name and server name.

4. If you don't already have a server, create one by following the steps below.

5. There is a question asking whether you want to use the SQL elastic pool option (Y/N).

6. Select the pool based on your requirement. If you select Yes, by default it will assign the database configuration below, with 100 GB.

7. If you select the No option, by default it assigns 250 GB and 10 eDTUs.

8. Database storage can also be adjusted by selecting Configure database.

9. Note that the larger storage options (100 GB and above) are cost-expensive and intended for organizational use.

10. For study or learning purposes, select 1 GB; it costs approximately 1080.

11. Configuration.

12. Below is the cost summary (the number of DTUs multiplied by the storage cost).








Azure Analysis service Steps

 Step 1:

1. Connect to the Azure portal and create a resource.

2. Select --> Analytics --> Analysis Services.

3. Select the Analysis Services server name and storage pricing option. Here I have selected D1, which can process 20 queries per unit; there are many other options depending on your Azure licence.

4. Administrator account details can also be added here.

5. Once the above changes are done, click Create.

6. Once the SSAS model is successfully deployed to Azure, it will show as below.






Friday, March 19, 2021

Data Modelling In power bi

 1. Power BI data modeling produces a model similar to an SSAS cube.

2. You can assign relationships (one-to-many or many-to-one).

3. If you need to filter data across multiple columns, you can apply bi-directional cross-filters.

4. You can manually create calculated measures and customized columns, which are helpful at run time.