Unstructured Notes on Data Engineering

January 06, 2022

Data engineering is a method, technique, and position concerned with maintaining data, handling data, storing data, and converting raw data into useful information.

Data Modeling

It is the concept of breaking down an entity: a complex problem is split into smaller problems.

What are design schemas?

1) Star and snowflake schemas describe how data can be stored. In a star schema, the fact table sits in the middle with the dimension tables around it. There is a lot of redundancy, but processing is fast.

In a snowflake schema, the main table in the center has dimension tables that branch into multiple other tables. Data processing is slower.

Snowflake schemas are spread out to work with scaling. Think of it as star 2.0: it can work with more dimensions, like more stars connected, and there is far less data redundancy.
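As a rough illustration of the star layout, the sketch below builds one fact table and one dimension table in an in-memory SQLite database. The table and column names (fact_sales, dim_product) and the sample rows are made up for the example.

```python
import sqlite3

def build_star_schema():
    """Create a tiny star schema: one fact table pointing at one dimension table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_product (
            product_id INTEGER PRIMARY KEY,
            name TEXT,
            category TEXT  -- kept inline (denormalized), as in a star schema
        );
        CREATE TABLE fact_sales (
            sale_id INTEGER PRIMARY KEY,
            product_id INTEGER REFERENCES dim_product(product_id),
            amount REAL
        );
        INSERT INTO dim_product VALUES (1, 'Pen', 'Stationery'), (2, 'Lamp', 'Home');
        INSERT INTO fact_sales VALUES (1, 1, 2.5), (2, 1, 2.5), (3, 2, 20.0);
    """)
    return conn

def sales_by_category(conn):
    """A single join from the fact table to the dimension answers the question."""
    return conn.execute("""
        SELECT d.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS d ON d.product_id = f.product_id
        GROUP BY d.category
        ORDER BY d.category
    """).fetchall()
```

In a snowflake version, category would move into its own table, which removes the repeated category text but adds one more join to the query.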

Structured Data: Data in a structured schema in a database.

Semi-Structured Data: CSV files, JSON files

Unstructured Data: Images, videos, social media text posts (tweets, Reddit, etc), and so on.

Structured and unstructured data

Structured data is sorted based on specific objects and organized in a file; a DBMS is used to work with structured data.

ODBC and SQL are used for structured data. XML and CSV are used for data without a fixed database schema (semi-structured, as listed above). Unstructured data can be expanded easily.

Hadoop

Very important and one of the most used frameworks for handling large volumes of data, both structured and unstructured. Data manipulation runs on clusters.

Hadoop Common: The shared utilities, tools, libraries, and sub-frameworks used by the other modules

Hadoop Distributed File System (HDFS): A distributed file system that manages data efficiently

YARN: Resource management and job scheduling

Hadoop MapReduce: Lets users process large-scale data in parallel

NameNode: The entity that holds the metadata of all the files stored on the DataNodes

Hadoop Streaming: A utility that lets users write the map and reduce steps themselves and work with the data, converting raw data into information
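To make the map and reduce steps concrete, here is a minimal word-count sketch in the style of a Hadoop Streaming job, where records are tab-separated key/value lines. The function names are made up for the example, and run_locally only imitates the sort that Hadoop performs between the two phases. In a real job each function would live in its own script reading standard input.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_records):
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

def run_locally(lines):
    """Simulate map, then sort, then reduce on a single machine."""
    return list(reducer(sorted(mapper(lines))))
```

Calling run_locally on a couple of short lines is a quick way to check the logic before submitting anything to a cluster.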

1) It is open source and used by millions of users

2) Parallel, distributed computing, with multiple machines holding the data

3) Separate clusters

Volume: Amount of data

Veracity: Quality of data

Velocity: Frequency of data arrival

Variety: Various types of data

Block and block scanner: A block is a single unit of data when storing data in Hadoop. Hadoop slices data into smaller pieces, and these are stored on the DataNodes. The block scanner checks the blocks on a DataNode for corruption.
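The slicing can be sketched with simple arithmetic. The function below is only an illustration: the name split_into_blocks is made up, and the 128 MB default is an assumption (it is the usual HDFS default in recent releases, but it is configurable).

```python
def split_into_blocks(file_size_bytes, block_size_bytes=128 * 1024 * 1024):
    """Return the sizes of the blocks a file of the given size would occupy."""
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("file size must be >= 0 and block size must be > 0")
    full_blocks, remainder = divmod(file_size_bytes, block_size_bytes)
    sizes = [block_size_bytes] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds what is left over
    return sizes
```

For example, a 300 MB file with 128 MB blocks gives two full blocks and one 44 MB block.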

Corrupted file handling: The DataNode reports the corrupted block to the NameNode.

DataNode and NameNode communication: This is done using messages.

COSHH: Classification and Optimization based Scheduling for Heterogeneous Hadoop systems; it handles scheduling operations.

XML configuration file names: hdfs-site.xml, among others

FSCK: File system check, used to check the Hadoop file system for problems when analyzing the data.

Methods of Reducer:

Setup: Configures the parameters of the input data

Cleanup: Removal of temporary files

Reduce: The actual reduce operation
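The three methods run in a fixed order: setup once, reduce once per key, cleanup once at the end. The class below is a Python stand-in for that lifecycle, not the Hadoop Java API; SumReducer, run_reducer, and the scratch file name are hypothetical.

```python
class SumReducer:
    """Python stand-in for a reducer with setup, reduce, and cleanup hooks."""

    def setup(self):
        # Runs once before any key is processed: prepare state and scratch space.
        self.temp_files = ["scratch.tmp"]  # hypothetical temporary resource
        self.output = {}

    def reduce(self, key, values):
        # Runs once per key, with all the values grouped under that key.
        self.output[key] = sum(values)

    def cleanup(self):
        # Runs once after the last key: release temporary resources.
        self.temp_files.clear()

def run_reducer(reducer, grouped):
    """Drive the lifecycle in order: setup, reduce for each key, cleanup."""
    reducer.setup()
    for key, values in grouped.items():
        reducer.reduce(key, values)
    reducer.cleanup()
    return reducer.output
```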

Hadoop's three modes:

Standalone mode: Runs on a local machine with local data

Pseudo-distributed mode: Runs on the local system, using configuration files on the local system

Distributed mode: Runs across multiple machines in a cluster

How data is secured: Data needs to be secured.

Create a channel for data flow, and secure that channel.

A time stamp is issued to the client, which then uses it in its service request for the data.

Authentication is done using service tickets.

What are the default port numbers? (These are the defaults in older Hadoop releases.)

JobTracker: 50030

TaskTracker: 50060

NameNode: 50070

Primary concerns of revenue: How effectively data is used can decide the success or failure of a company. It helps structure growth, customer retention, manpower usage, and HRM methodologies, and big data analytics can reduce production cost significantly.

What a data engineer does

Responsible for handling the inflow of data and creating data pipelines, maintaining data, transforming entities, removing noise and redundancies, and preprocessing any irrelevant data (ETL).

Technologies and skills

Mathematics, probability and statistics, machine learning, Python, R, Hadoop, SQL

Data Architect

A data architect is responsible for data coming from any source (e.g., social media, sensors) and creates the architecture so that the data pipeline runs smoothly. This covers data warehousing pipelines, data hubs, and protocols for working with data.
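A very small extract, transform, load pipeline might look like the sketch below. Everything in it is assumed for illustration: the sample CSV, the rule that a row with an empty field counts as noise, and the people table in an in-memory SQLite database.

```python
import csv
import io
import sqlite3

RAW_CSV = """id,name,city
1,Ana,Lisbon
2,Ben,
1,Ana,Lisbon
3,Cy,Oslo
"""

def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with empty fields (noise) and exact duplicates."""
    seen, clean = set(), []
    for row in rows:
        key = tuple(row.values())
        if all(row.values()) and key not in seen:
            seen.add(key)
            clean.append(row)
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into a SQLite table."""
    conn.execute("CREATE TABLE people (id INTEGER, name TEXT, city TEXT)")
    conn.executemany("INSERT INTO people VALUES (:id, :name, :city)", rows)
    conn.commit()

def run_pipeline(text=RAW_CSV):
    """Run extract, transform, and load, then read back what was stored."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(text)), conn)
    return conn.execute("SELECT id, name, city FROM people ORDER BY id").fetchall()
```

With the sample input, the row with the missing city and the repeated row for Ana are both dropped before loading.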

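Distance between nodes in Hadoop

Nodes are arranged in a topology, so there is a distance between them. The distance between the current node and the next node is the sum of their distances to their closest common ancestor, and it is computed with the getDistance() method.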
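One way to picture the calculation is to give each node a path such as /datacenter/rack/node and count the hops from each node up to the first shared ancestor. The Python function below is a sketch of that idea, not Hadoop's own implementation, and the path layout is an assumption.

```python
def get_distance(path_a, path_b):
    """Distance = hops from each node up to their closest common ancestor."""
    parts_a = path_a.strip("/").split("/")
    parts_b = path_b.strip("/").split("/")
    common = 0
    for a, b in zip(parts_a, parts_b):
        if a != b:
            break
        common += 1  # still on the shared part of the two paths
    return (len(parts_a) - common) + (len(parts_b) - common)
```

Under this layout, two nodes on the same rack are 2 apart, nodes on different racks in the same data center are 4 apart, and nodes in different data centers are 6 apart.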

What data is in the NameNode?

It holds information about the data you are working with, i.e., metadata.

Rack awareness

The NameNode uses the rack information of the DataNodes. For a read or write operation, it associates each DataNode with a rack and chooses the closest rack on which to perform the data operation.

Heartbeat method

This is the communication between the NameNode and the DataNode, and it is sent by the DataNode. The DataNode is telling the NameNode that it is working fine.
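The receiving side of a heartbeat can be sketched as a table of last-seen times plus a timeout. The class below is illustrative only: HeartbeatMonitor is a made-up name, the timeout value is arbitrary, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time per DataNode, as a NameNode might."""

    def __init__(self, timeout_seconds=30.0):
        self.timeout_seconds = timeout_seconds  # assumed value for this sketch
        self.last_seen = {}

    def receive_heartbeat(self, node_id, now):
        # Called whenever a DataNode reports in; 'now' is a time in seconds.
        self.last_seen[node_id] = now

    def dead_nodes(self, now):
        # A node is presumed dead if it has been silent longer than the timeout.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

A node that keeps sending heartbeats stays off the dead list, while one that goes quiet appears on it once the timeout has passed.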

The Context object in Hadoop is used with the Mapper class. It creates a communication path, handles data communication with other entities, and sends information to other methods.

Hive is an interface with its own query language that runs on MapReduce. The user writes a simple query, which is translated into MapReduce jobs.

Metastore in Hive: It is used to store schemas and Hive tables.

Hive components: Bucket, table, and partition

Can there be more than one table for one data file?

Yes, more than one table can be created for the same data.

Meaning of a skewed table

Entities in Hive that contain data in which some values are repeated far more often than others. More skewness means more repetition.

Collections in Hive:

Arrays, maps, structs, and unions

SerDe: Serialization and deserialization. It converts between stored records and Java objects.

Table-generating functions of Hive: explode(array), explode(map), json_tuple(), stack()

Role of .hiverc

It is the first file loaded when working with the Hive shell. It holds initialization commands.

*args and **kwargs

*args: Lets a function accept an ordered set of positional arguments

**kwargs: Denotes a set of named (keyword) arguments passed to a function
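A short Python example shows how the two arrive inside a function. The function names here are made up for the illustration.

```python
def describe_call(*args, **kwargs):
    """Show how positional and keyword arguments arrive inside a function."""
    # args is a tuple in call order; kwargs is a dict of name -> value.
    return {"positional": args, "named": kwargs}

def total(*numbers, scale=1):
    """Sum any number of positional values, then apply an optional scale."""
    return sum(numbers) * scale
```

For instance, describe_call(1, 2, mode="fast") receives (1, 2) as args and {"mode": "fast"} as kwargs.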

Viewing table structure in MySQL

Use the DESCRIBE command:

DESCRIBE table_name;

Searching for a string in a column

Use the regular expression operator (REGEXP).
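The idea can be tried without a MySQL server. The sketch below uses SQLite as a stand-in, which is an assumption of this example: SQLite accepts the REGEXP syntax but ships no implementation, so one is registered from Python first. The users table and the function name find_matching are made up.

```python
import re
import sqlite3

def find_matching(names, pattern):
    """Return the names in a column that match a regular expression."""
    conn = sqlite3.connect(":memory:")
    # SQLite evaluates 'value REGEXP pattern' by calling regexp(pattern, value).
    conn.create_function(
        "REGEXP", 2,
        lambda pat, value: int(value is not None and re.search(pat, value) is not None),
    )
    conn.execute("CREATE TABLE users (name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?)", [(n,) for n in names])
    rows = conn.execute(
        "SELECT name FROM users WHERE name REGEXP ? ORDER BY name", (pattern,)
    ).fetchall()
    return [name for (name,) in rows]
```

In MySQL the operator is built in, so the registration step is not needed there and only the WHERE clause carries over.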

Difference between a data warehouse and a database

Data warehouse: Focuses on certain functions such as aggregations (min, max, avg) performed over the data

Database: Focuses on data manipulation and deletion operations; more related to efficiency and speed
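The two kinds of work can be put side by side in a few lines. This sketch runs both against the same in-memory SQLite table purely for illustration; the orders table and its values are made up, and a real warehouse would be a separate system.

```python
import sqlite3

def transactional_then_analytical():
    """Run row-level changes first, then an aggregate query over the table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany(
        "INSERT INTO orders (amount) VALUES (?)", [(10.0,), (20.0,), (60.0,)]
    )

    # Database-style work: change or delete individual rows.
    conn.execute("UPDATE orders SET amount = 30.0 WHERE id = 2")
    conn.execute("DELETE FROM orders WHERE id = 3")

    # Warehouse-style work: aggregate over many rows at once.
    return conn.execute(
        "SELECT MIN(amount), MAX(amount), AVG(amount) FROM orders"
    ).fetchone()
```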

certification:-
