Unstructured Notes on Data Engineering

Data engineering is a method, technique, and role concerned with maintaining, handling, and storing data, and with converting raw data into useful information.


Data Modeling


It is the concept of breaking down an entity: a complex problem is split into smaller problems.


What are design schemas?


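1) Star schema: describes how data can be stored, with the fact table in the middle and the dimension tables around it. The dimension tables are denormalized, so there is a lot of redundancy, but processing is fast.


2) Snowflake schema: the main table in the center is linked to dimension tables that are themselves split into multiple related tables. The extra joins make data processing slower.


Snowflake schemas spread the dimensions out so they scale; they work like a star 2.0, with more dimensions and more stars connected together. Because the dimensions are normalized, there is very little data redundancy.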
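
As a rough illustration of the star layout described above, the sketch below builds one fact table and two dimension tables in an in-memory SQLite database and answers a question with a single join. The table names, column names, and rows (fact_sales, dim_product, dim_store) are made up for the example.

```python
import sqlite3

def build_star_schema():
    """Create a tiny star schema and return total sales per product category."""
    con = sqlite3.connect(":memory:")
    cur = con.cursor()
    # Dimension tables: descriptive attributes, denormalized (category is repeated).
    cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT)")
    cur.execute("CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, city TEXT)")
    # Fact table: the measure (amount) plus a foreign key to each dimension.
    cur.execute(
        "CREATE TABLE fact_sales ("
        "product_id INTEGER REFERENCES dim_product(product_id), "
        "store_id INTEGER REFERENCES dim_store(store_id), "
        "amount REAL)"
    )
    cur.executemany(
        "INSERT INTO dim_product VALUES (?, ?, ?)",
        [(1, "pen", "stationery"), (2, "pencil", "stationery"), (3, "mug", "kitchen")],
    )
    cur.executemany("INSERT INTO dim_store VALUES (?, ?)", [(1, "north"), (2, "south")])
    cur.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?)",
        [(1, 1, 10.0), (2, 1, 5.0), (3, 2, 20.0), (1, 2, 10.0)],
    )
    # One join from the fact table to a dimension is enough here.
    rows = cur.execute(
        "SELECT p.category, SUM(f.amount) "
        "FROM fact_sales f JOIN dim_product p ON f.product_id = p.product_id "
        "GROUP BY p.category ORDER BY p.category"
    ).fetchall()
    con.close()
    return rows
```

In a snowflake version of the same data, category would move out of dim_product into its own table, which removes the repeated "stationery" value but adds one more join to the query.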


Structured Data: Data in a structured schema in a database.

Semi-Structured Data: CSV files, JSON files

Unstructured Data: Images, videos, social media text posts (tweets, Reddit, etc), and so on.


Structured and unstructured data


Structured data is sorted around specific objects and organized in a file or table; a DBMS is used to work with structured data.


ODBC and SQL are used for structured data; XML and CSV are used for data that is not fully structured. Unstructured data can be scaled easily.


Hadoop


A very important and widely used framework for handling big data; storage and data manipulation run on clusters of machines.


Hadoop Common: The shared utilities, tools, and libraries used by the other Hadoop modules.


Hadoop Distributed File System (HDFS): A distributed file system that stores and manages data efficiently across the cluster.


YARN: Resource management and job scheduling.


Hadoop MapReduce: Lets users process large-scale data in parallel.


NameNode: The entity that holds the metadata of all the files stored on the DataNodes.


Hadoop Streaming: A utility that lets users write the map and reduce steps themselves and work with the data, converting raw data into information.
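
The map and reduce steps mentioned above can be sketched as a word count. With Hadoop Streaming the mapper and reducer would normally be two separate scripts that read lines from standard input and print tab-separated key/value lines; here they are plain functions so the whole flow can be tried on one machine. The function names and the run_locally helper are illustrative, not part of Hadoop.

```python
from itertools import groupby

def mapper(lines):
    """Map step: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    """Reduce step: sum the counts per word; the input must be sorted by key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

def run_locally(lines):
    """Imitate map -> sort (shuffle) -> reduce on a single machine."""
    return list(reducer(sorted(mapper(lines))))
```

Calling run_locally(["raw data in", "Data out"]) counts "data" twice and every other word once.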


1) It is open source and widely used.


2) Parallel, distributed computing: the data is spread across multiple machines.


3) Separate clusters


Volume: Amount of data


Veracity: Quality of data


Velocity:- Frequency of data arrival


Variety:- Various types of data


Block and block scanner:- A block is the smallest unit of data when storing data in Hadoop. Hadoop slices large files into smaller pieces (blocks), which are stored on the DataNodes. The block scanner verifies the blocks held on a DataNode.
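
A small sketch of the slicing arithmetic: given a file size, it lists the sizes of the blocks the file would be cut into. The 128 MB block size is an assumption for the example (it is a common default, but the value is configurable), and split_into_blocks is a made-up helper, not a Hadoop call.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (128 MB)

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds what is left over
    return sizes
```

Under this assumption a 300 MB file becomes three blocks: two full 128 MB blocks and one 44 MB block.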


Corrupted Files Handling:- The DataNode reports the corrupted block to the NameNode, which then creates a new replica from a healthy copy.


DataNode and NameNode Communication:- This is done using messages (block reports and heartbeats).


COSHH:- Classification and Optimization based Scheduling for Heterogeneous Hadoop systems; it handles scheduling operations.


XML configuration file names:- hdfs-site.xml, along with core-site.xml, mapred-site.xml, and yarn-site.xml.


FSCK:- File system check, a command used with the Hadoop file system to analyze the data and report any problems.


Methods of Reducer:-


Setup:- Configures the parameters for the input data


Cleanup:-Removal of temporary files


Reduce:-Actual reduce operation
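
The order in which the three methods run can be sketched in Python. SumReducer and run_reducer are hypothetical stand-ins written for these notes; the real Reducer is a Java class, and this only mirrors the sequence of setup once, reduce once per key, and cleanup once.

```python
from collections import defaultdict

class SumReducer:
    """Hypothetical stand-in for a reducer with setup, reduce, and cleanup."""

    def setup(self):
        # Runs once before any key: prepare parameters and temporary state.
        self.scratch = []
        self.results = {}

    def reduce(self, key, values):
        # Runs once per key with all the values collected for that key.
        self.scratch.append(key)
        self.results[key] = sum(values)

    def cleanup(self):
        # Runs once at the end: throw away the temporary state.
        self.scratch.clear()

def run_reducer(reducer, pairs):
    """Group (key, value) pairs by key and call the three methods in order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    reducer.setup()
    for key in sorted(grouped):
        reducer.reduce(key, grouped[key])
    reducer.cleanup()
    return reducer.results
```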


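Hadoop's three modes:- Standalone mode runs on the local machine and works with local data.


Pseudo-distributed mode:- All the daemons run on one local system, set up through the configuration files on that system.


Fully distributed mode:- The daemons run across a cluster of separate machines.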


How data is secured:- Data needs to be secured in the following steps.


First, the channel through which the data flows is created and secured.


The client is then issued a time-stamped ticket, which it uses to request a service ticket for the data it wants.


Finally, the client authenticates itself to the server using the service ticket.


What are the default port numbers? (web UI ports in older Hadoop releases)


JobTracker: 50030


TaskTracker: 50060


NameNode: 50070


Big data and revenue:- How effectively a company uses its data can decide its success or failure. Big data analytics helps structure growth, improve the customer retention rate, make better use of manpower, inform HRM methodologies, and reduce production costs significantly.


What a data engineer does


Responsible for handling the inflow of data and creating data pipelines, maintaining data, transforming entities, removing noise and redundancies, and preprocessing out any irrelevant data (ETL).
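
A minimal ETL sketch of the duties listed above, assuming a small CSV input and an in-memory SQLite table as the destination. The sample rows, the people table, and the three function names are invented for the example.

```python
import csv
import io
import sqlite3

RAW = """id,name,age
1,Ann,34
1,Ann,34
2,Bob,
3,Cy,abc
4,Dee,29
"""

def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop duplicate ids and rows whose age is missing or not a number."""
    seen, clean = set(), []
    for row in rows:
        age = (row["age"] or "").strip()
        if row["id"] in seen or not age.isdigit():
            continue  # redundancy or noise, skip it
        seen.add(row["id"])
        clean.append((int(row["id"]), row["name"], int(age)))
    return clean

def load(rows):
    """Load: write the clean rows into an in-memory SQLite table."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
    con.executemany("INSERT INTO people VALUES (?, ?, ?)", rows)
    con.commit()
    return con
```

Running load(transform(extract(RAW))) leaves only the rows for Ann and Dee in the table: the duplicate, the row with a missing age, and the row with a non-numeric age are removed in the transform step.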


Technologies and skills


Mathematics, probability and statistics, machine learning, Python, R, Hadoop, SQL.


Data Architect


A data architect is responsible for the data coming in from any source, e.g., social media and sensors, and creates the architecture that keeps the data pipeline running smoothly. This covers data warehousing pipelines, data hubs, and the protocols for working with data.


Distance between nodes in Hadoop


Nodes are arranged in a network topology, and the distance between two nodes is the sum of their distances to their closest common ancestor. It is calculated using the getDistance() method.
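
The distance rule can be sketched as a tree calculation. This Python function only illustrates the idea and is not Hadoop's own method; it assumes each node is described by a path of the form /datacenter/rack/node, and the paths in the usage note are made up.

```python
def get_distance(path_a, path_b):
    """Distance between two nodes given as '/datacenter/rack/node' paths."""
    a = path_a.strip("/").split("/")
    b = path_b.strip("/").split("/")
    # Count the leading components both paths share (their common ancestor).
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Hops from each node up to the common ancestor, added together.
    return (len(a) - common) + (len(b) - common)
```

With these assumptions, a node is at distance 0 from itself, 2 from a node on the same rack, 4 from a node on another rack in the same data center, and 6 from a node in another data center.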


What data is in the NameNode


It holds the data about the data you are working with, i.e., the metadata.


Rack awareness


The NameNode uses the rack information of the DataNodes. For a read or write operation it considers which rack each DataNode is on and picks the closest rack to perform the data operation.


Heartbeat method


The heartbeat is a signal used in NameNode and DataNode communication. It is sent by the DataNode to tell the NameNode that it is working fine.
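
A sketch of the bookkeeping on the receiving side: the NameNode remembers when it last heard from each DataNode and treats a node as dead once it has been silent for too long. The class name, the method names, and the 30-second timeout are assumptions made for illustration, not Hadoop's actual settings.

```python
class NameNodeMonitor:
    """Hypothetical tracker of the last heartbeat received from each DataNode."""

    def __init__(self, timeout_seconds=30.0):
        self.timeout_seconds = timeout_seconds  # assumed value, for illustration only
        self.last_seen = {}

    def receive_heartbeat(self, datanode_id, now):
        # The DataNode reports in; remember when it was last heard from.
        self.last_seen[datanode_id] = now

    def dead_nodes(self, now):
        # A node counts as dead once it has been silent longer than the timeout.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

Time is passed in explicitly as a number of seconds so the behavior can be checked without waiting on a real clock.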


The Context object in Hadoop is used with the Mapper class to create a communication path: it handles data communication with other entities and sends information to the other methods.


Hive is an interface with a SQL-like query language that runs on top of MapReduce. The user writes a simple query, which is translated into MapReduce jobs.


Metastore in Hive:- It is used to store the schemas and metadata of Hive tables.


Hive components:- Bucket, table, and partition


Can more than one table be created for one data file?


Yes, more than one table can be created for the same data file.


Meaning of a skewed table


A skewed table in Hive is one where some column values appear far more often than others; the more repetition, the greater the skewness.


Collections in Hive:-


Arrays, maps, structs, and unions.


SerDe:- Serialization and deserialization; it converts between the stored data and Java objects.

Table-generating functions of Hive:- explode(array), explode(map), json_tuple(), and stack().
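
To show what a table-generating function does, here is a plain Python imitation of the idea behind exploding an array column: one input row with a list turns into one output row per element. This is only a sketch of the behavior, not Hive code, and the row and column names in the test data are made up.

```python
def explode(rows, column):
    """Yield one output row per element of the list stored in `column`."""
    for row in rows:
        for item in row[column]:
            out = dict(row)      # copy the other columns unchanged
            out[column] = item   # replace the list with a single element
            yield out
```

In this sketch a row whose list is empty produces no output rows at all.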

Role of the .hiverc file

It is the first file loaded when the Hive command line starts; it holds the initialization commands.

*args and **kwargs

*args:- Passes a variable number of positional arguments to a function, in the order given.

**kwargs:- Passes a variable number of keyword (named) arguments to a function.
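
A short Python sketch of the two forms. Both function names are made up for the example; the first simply shows what the function receives, and the second uses both kinds of argument together.

```python
def describe_call(*args, **kwargs):
    """Return the positional arguments as a tuple and the keyword arguments as a dict."""
    return args, kwargs

def total(*numbers, **options):
    """Sum the positional numbers, optionally multiplied by a 'scale' keyword."""
    scale = options.get("scale", 1)
    return sum(numbers) * scale
```

Inside the function, args is a tuple that keeps the order of the call, while kwargs is a dictionary keyed by argument name.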

Seeing the structure of a table using MySQL

Use the DESCRIBE command:

DESCRIBE table_name;

Searching for a string in a column

Use the regular expression operator (REGEXP).
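
A runnable sketch of a regular expression search on a column. SQLite stands in for MySQL here so the example needs no server; SQLite has no built-in REGEXP implementation, so one is registered from Python, whereas MySQL provides the operator itself. The users table and its rows are invented for the example.

```python
import re
import sqlite3

def regexp(pattern, value):
    # SQLite evaluates "value REGEXP pattern" by calling regexp(pattern, value).
    if value is None:
        return 0
    return 1 if re.search(pattern, value) else 0

def names_matching(pattern):
    """Return the names in a small sample table that match the pattern."""
    con = sqlite3.connect(":memory:")
    con.create_function("REGEXP", 2, regexp)
    con.execute("CREATE TABLE users (name TEXT)")
    con.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",), ("alina",)])
    rows = con.execute(
        "SELECT name FROM users WHERE name REGEXP ? ORDER BY name", (pattern,)
    ).fetchall()
    con.close()
    return [name for (name,) in rows]
```

For instance, the pattern "^al" keeps the names that start with "al" and drops "bob".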

Difference between a data warehouse and a database

Data warehouse:- Focuses on analytical functions such as the aggregations min, max, and avg, and on the calculations performed with them.

Database:- Focuses on data manipulation and deletion operations, and is more concerned with efficiency and speed.

certification:-

