Unstructured Notes on Data Engineering
Data engineering is a method, technique, and role in which data is maintained, handled, stored, and converted from raw data into useful information.
Data Modeling
It is the concept of breaking down an entity: a complex problem is split into smaller problems.
What are design schemas?
1) Star schema: one way data can be stored, with a fact table in the middle and dimension tables around it. The dimension tables carry a lot of redundancy, but processing is fast.
2) Snowflake schema: the main table in the center has dimension tables that branch into multiple other tables. Data processing is slower because more joins are needed.
Snowflake schemas spread dimensions out to work with scaling. They work like a star 2.0: with more dimensions, more stars are connected, and there is no data redundancy.
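A minimal sketch of a star schema, using Python's built-in sqlite3 module. The table and column names (fact_sales, dim_product, dim_store) and the rows are made up for illustration. Note how the category text is repeated in dim_product: that is the redundancy a snowflake schema would move into its own table.

```python
import sqlite3

# Hypothetical tables: one fact table in the middle, two dimension tables around it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id INTEGER REFERENCES dim_store(store_id),
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'pen', 'stationery'), (2, 'notebook', 'stationery');
INSERT INTO dim_store VALUES (1, 'Austin'), (2, 'Boston');
INSERT INTO fact_sales VALUES (1, 1, 1, 2.0), (2, 2, 1, 5.0), (3, 1, 2, 2.0);
""")

# In a star schema, each dimension is reached with a single join from the fact table.
rows = conn.execute("""
    SELECT s.city, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_store AS s ON s.store_id = f.store_id
    GROUP BY s.city
    ORDER BY s.city
""").fetchall()
print(rows)
```

In a snowflake version, category would live in a separate table referenced from dim_product, so the same kind of query would need one more join.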
Structured Data: Data in a structured schema in a database.
Semi-Structured Data: CSV files, JSON files
Unstructured Data: Images, videos, social media text posts (tweets, Reddit, etc), and so on.
Structured and unstructured data
Structured data is sorted around specific objects and organized in a fixed layout; a DBMS is used to work with structured data.
ODBC and SQL are used for structured data. XML and CSV are used for less structured data. Unstructured data can be expanded easily.
Hadoop
A very important and widely used framework for handling big data; data manipulation runs on clusters of machines.
Hadoop Common: All the utilities, tools, libraries, and sub-frameworks used by the other modules
Hadoop Distributed File System (HDFS):- A distributed file system that manages data efficiently
YARN: Resource management and job scheduling
Hadoop MapReduce: Lets users process large-scale data in parallel
Name Node: The entity that holds the metadata of all the files stored on the data nodes
Hadoop Streaming: A utility that lets users write the map and reduce steps as scripts and work with the data, converting raw data to information
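As a rough sketch of what the map and reduce steps look like in a streaming-style job, here is a word count in Python. Hadoop Streaming passes records as text lines, with key and value separated by a tab by default. The function names and sample input are made up, and the sorted() call only stands in for Hadoop's shuffle-and-sort phase so the example runs locally.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts per key; records must arrive sorted by key."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


# Local simulation: sorted() plays the role of the shuffle-and-sort step.
records = sorted(mapper(["big data", "Big clusters"]))
result = dict(reducer(records))
print(result)
```

In a real job the mapper and reducer would be two separate scripts reading standard input and writing standard output.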
1) It is open source and widely used
2) Parallel, distributed computing, with the data spread across multiple machines
3) Runs on separate clusters
Volume: Amount of data
Veracity: Quality of data
Velocity:- Frequency of data arrival
Variety:- Various types of data
Block and block scanner:- A block is the smallest unit of data when storing data in Hadoop. Hadoop slices files into smaller pieces (blocks), which are stored on the data nodes. The block scanner runs on each data node and verifies the stored blocks to detect corruption.
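The slicing into blocks is simple arithmetic. A small sketch, assuming a block size of 128 MB (a common default; the real value is configurable per cluster) and a made-up helper name:

```python
BLOCK_SIZE_MB = 128  # assumed block size; configurable in a real cluster


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes of the blocks a file of the given size would occupy."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds the leftover data
    return sizes


blocks = split_into_blocks(300)
print(blocks)
```

A 300 MB file ends up as two full blocks plus one smaller block for the remainder.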
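Corrupted Files Handling:- The data node reports the corrupted block to the name node, which then creates a new replica from a healthy copy
Data and Name Node Communication:- This is done using messages: block reports and heartbeats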
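COSHH:- Classification and Optimization based Scheduling for Heterogeneous Hadoop systems; it handles scheduling operations
XML configuration file names:- hdfs-site.xml, along with core-site.xml, mapred-site.xml, and yarn-site.xml
FSCK:- File system check, a command used to inspect the Hadoop file system and report any problems with the data, such as missing or corrupt blocks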
Methods of Reducer:-
Setup:- Configures the parameters, such as those describing the input data
Cleanup:- Removal of temporary files
Reduce:- The actual reduce operation, called once per key
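To see how the three methods fit together, here is a plain-Python imitation of the lifecycle. This is not the Hadoop API: the class, the run_reducer helper, and the min_total parameter are all made up for illustration. In a real job the framework calls setup once, reduce once per key, and cleanup once at the end.

```python
class SumReducer:
    """Plain-Python imitation of the setup / reduce / cleanup lifecycle."""

    def setup(self, config):
        # Runs once before any key is processed: read the job parameters.
        self.min_total = config.get("min_total", 0)  # hypothetical parameter
        self.output = []

    def reduce(self, key, values):
        # Runs once per key, with all values grouped under that key.
        total = sum(values)
        if total >= self.min_total:
            self.output.append((key, total))

    def cleanup(self):
        # Runs once after the last key: release temporary state.
        self.min_total = None


def run_reducer(reducer, grouped, config):
    """Drive the lifecycle in the order the framework would."""
    reducer.setup(config)
    for key, values in grouped:
        reducer.reduce(key, values)
    reducer.cleanup()
    return reducer.output


result = run_reducer(SumReducer(), [("a", [1, 2]), ("b", [1])], {"min_total": 2})
print(result)
```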
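Hadoop's three modes:-
Standalone mode:- Runs on the local machine with local data
Pseudo-distributed mode:- All daemons run on a single local system, using the configuration files in that system
Fully distributed mode:- Daemons run on separate machines in a cluster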
How data is secured:- Data needs to be secured in three steps
Create a channel for the data flow; it needs to be secured, and the client authenticates over it and receives a time stamp
The client uses the time stamp to request a service ticket
The client authenticates to the service using the service ticket
What are the default port numbers?
JobTracker: 50030
TaskTracker: 50060
NameNode: 50070
Primary concerns of revenue:- Using data effectively can decide the success or failure of a company. It helps structure growth, customer retention rate, manpower usage, and HRM methodologies, and big data analytics can reduce production cost significantly.
What a data engineer does
Responsible for handling the inflow of data and creating data pipelines, maintaining data, transforming entities, removing noise and redundancies, and preprocessing out any irrelevant data (ETL).
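The duties above map onto a small extract, transform, load pipeline. A minimal sketch using only the standard library; the CSV content, field names, and the list used as a "warehouse" are made up for illustration.

```python
import csv
import io

# Hypothetical raw input with a duplicate, a missing field, and a bad value.
RAW_CSV = """user_id,country,amount
1,US,10.5
2,,7.0
1,US,10.5
3,DE,not_a_number
"""


def extract(text):
    """Read raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Remove redundancies and noise, then convert fields to proper types."""
    seen, clean = set(), []
    for row in rows:
        key = (row["user_id"], row["country"], row["amount"])
        if key in seen:
            continue  # drop duplicate records
        seen.add(key)
        if not row["country"]:
            continue  # drop incomplete records
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop noisy records
        clean.append(
            {"user_id": int(row["user_id"]), "country": row["country"], "amount": amount}
        )
    return clean


def load(rows, target):
    """Append the cleaned rows to the target store and report how many were loaded."""
    target.extend(rows)
    return len(rows)


warehouse = []
loaded = load(transform(extract(RAW_CSV)), warehouse)
print(loaded, warehouse)
```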
Technologies and skills
Mathematics, probability and statistics, machine learning, Python, R, Hadoop, SQL
Data Architect
A data architect is responsible for data coming from any source, e.g., social media or sensors, and creates an architecture that keeps the data pipeline smooth: data warehousing pipelines, data hubs, and protocols for working with the data.
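Distance between the nodes in Hadoop
Nodes are arranged so that there is a distance between them: the distance is the sum of the distances from the current node and the next node to their closest common ancestor, computed using the getDistance() method.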
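A sketch of that distance calculation in Python, assuming each node is described by a path of the form /datacenter/rack/node. The paths and the get_distance helper are made up for illustration; this is not Hadoop's own method.

```python
def get_distance(path_a, path_b):
    """Sum of the hops from each node up to their closest common ancestor."""
    a = path_a.strip("/").split("/")
    b = path_b.strip("/").split("/")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


print(get_distance("/d1/r1/n1", "/d1/r1/n1"))  # same node
print(get_distance("/d1/r1/n1", "/d1/r1/n2"))  # same rack
print(get_distance("/d1/r1/n1", "/d1/r2/n3"))  # same data center, different rack
print(get_distance("/d1/r1/n1", "/d2/r3/n4"))  # different data centers
```

The closer two nodes sit in the tree, the smaller the number, which is why a lower distance is preferred when choosing where to read from.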
What data is in the name node
The data about the data you are working with, i.e., metadata
Rack awareness
The name node uses data node information for read and write operations: it associates each data node with a rack, and the data operation is performed on the closest rack.
Heartbeat method
Name node and data node communication; it is sent by the data node. The data node is telling the name node that it is working fine.
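A toy sketch of the name node side of this: remember when each data node was last heard from, and treat a node as dead once it has been silent for too long. The class name, node ids, and the 30-second timeout are made up for illustration, and time is passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Toy tracker of the last heartbeat received from each data node."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}

    def receive_heartbeat(self, node_id, now):
        # Record the time of the latest "I am working fine" message.
        self.last_seen[node_id] = now

    def dead_nodes(self, now):
        # A node is considered dead if it has been silent longer than the timeout.
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=30)  # hypothetical timeout
monitor.receive_heartbeat("datanode-1", now=0)
monitor.receive_heartbeat("datanode-2", now=0)
monitor.receive_heartbeat("datanode-1", now=25)
print(monitor.dead_nodes(now=40))
```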
Context object in Hadoop:- Used with the mapper class to create a communication path with the rest of the system; it handles data communication with other entities and sends information to other methods.
Hive is an interface with its own query language, built on MapReduce. A simple query is written and then translated into MapReduce jobs.
Metastore in Hive:- It is used to store schemas and Hive table metadata
Hive components:- Bucket, table, and partition
Can there be more than one table for one data file?
Yes, more than one table can be created for one data file
Meaning of skewed table
A table in Hive in which some values are repeated far more often than others; the more repetition, the more skewness
Collections in Hive:-
Arrays, maps, structs, and unions
SerDe:- Serializer/Deserializer
Serialization and deserialization convert between stored records and Java objects
Table-generating functions of Hive:- explode(array), explode(map), json_tuple(), stack()
Role of .hiverc
It is the first file loaded when working with the Hive shell; it holds initialization commands
*args, **kwargs
*args:- Lets a function accept an ordered set of positional arguments
**kwargs:- Denotes a set of named (keyword) arguments passed to a function
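A short Python example of both. The function names and values are made up for illustration. Inside the function, the positional arguments arrive as a tuple and the keyword arguments as a dict; the same star syntax can also unpack a list or dict at the call site.

```python
def describe_call(*args, **kwargs):
    """Collect positional arguments in a tuple and keyword arguments in a dict."""
    return args, kwargs


positional, keyword = describe_call(1, 2, mode="fast", retries=3)
print(positional)
print(keyword)


def total(*numbers, scale=1):
    """Sum any number of positional arguments, then apply an optional scale."""
    return sum(numbers) * scale


values = [1, 2, 3]
options = {"scale": 10}
# * unpacks the list into positional arguments, ** unpacks the dict into keywords.
print(total(*values, **options))
```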
See table structure using MySQL
Use the DESCRIBE command
DESCRIBE table_name;
Search for a string in a column
Use the regular expression operator (REGEXP)
Difference between data warehouse and database
Data warehousing:- Focuses on aggregation functions such as MIN, MAX, and AVG, and on performing calculations and analysis over the data
Database:- Data manipulation and deletion operations; more related to efficiency and speed
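The two kinds of workload can be shown side by side with Python's built-in sqlite3 module. This is only a sketch on a single in-memory table; the orders table and its rows are made up, and a real warehouse and a real transactional database would be separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO orders (amount) VALUES (?)", [(10.0,), (20.0,), (30.0,)])

# Warehouse-style work: aggregate over many rows to summarize the data.
summary = conn.execute(
    "SELECT MIN(amount), MAX(amount), AVG(amount) FROM orders"
).fetchone()
print(summary)

# Database-style work: fast manipulation and deletion of individual rows.
conn.execute("UPDATE orders SET amount = 25.0 WHERE order_id = 2")
conn.execute("DELETE FROM orders WHERE order_id = 3")
remaining = conn.execute(
    "SELECT order_id, amount FROM orders ORDER BY order_id"
).fetchall()
print(remaining)
```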
certification:-