Emerging Technologies Chapter 2 Note

Chapter 2: Data Science

2.1. An Overview of Data Science

  • Data science
    • multi-disciplinary field that uses scientific methods, processes, algorithms,
      and systems to extract knowledge and insights from structured,
      semi-structured, and unstructured data
    • is much more than simply analyzing data
    • one of the most promising and in-demand career paths for skilled
      professionals
  • Data scientists need:
    • curiosity and a results-oriented mindset
    • exceptional industry-specific knowledge
    • communication skills
    • a strong quantitative background in statistics and linear algebra
    • programming knowledge focused on data:
      • warehousing
      • mining
      • modeling

2.1.1. What are data and information?

  • Data
    • unprocessed facts and figures
    • represented with the help of characters
  • Information
    • processed/interpreted data
    • base for decisions and actions

2.1.2. Data Processing Cycle

  • Data processing: re-structuring or re-ordering of data
  • Data processing consists of three steps:
    • Input
    • Processing
    • Output
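
The three steps can be sketched in a few lines of Python. This is only an illustration: the exam scores and the `process` function are made-up examples, not part of the chapter.

```python
# Input: raw, unprocessed facts (hypothetical exam scores read as text)
raw_scores = ["72", "85", "90", "64"]


def process(scores):
    """Processing: convert the raw strings to numbers and summarize them."""
    numbers = [int(s) for s in scores]
    return {"count": len(numbers), "average": sum(numbers) / len(numbers)}


# Output: information that can serve as a base for decisions
summary = process(raw_scores)
print(f"{summary['count']} scores, average {summary['average']:.2f}")
```

The raw strings are data; the count and average are information derived from them.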

2.3. Data types and their representation

2.3.1. Data types from Computer programming perspective

  • a data type is simply an attribute of data that tells the compiler or
    interpreter how the programmer intends to use the data
  • Common data types include:
    • Booleans (bool) – used to represent values restricted to one of two
      options: true or false
    • Characters (char) – used to store a single character
    • Floating-point numbers (float) – used to store real numbers
    • Alphanumeric strings (string) – used to store a combination of characters and
      numbers
  • a data type defines:
    • operations that can be done on the data
    • meaning of the data
    • the way values of that type can be stored
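
A minimal sketch of these types in Python follows. The variable names and values are invented for illustration. Python has no separate char type, so a one-character string stands in for it here.

```python
is_valid = True      # Boolean: one of two values, True or False
grade = "A"          # single character (a 1-character str in Python)
price = 19.99        # floating-point number
room = "room101"     # alphanumeric string

# The type determines which operations are allowed
doubled = price * 2          # arithmetic works on floats
shouted = room.upper()       # text operations work on strings

# Mixing incompatible types is rejected by the interpreter
try:
    price + room
except TypeError as err:
    mixing_error = type(err).__name__

type_names = [type(v).__name__ for v in (is_valid, grade, price, room)]
print(type_names, shouted, mixing_error)
```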

2.3.2. Data types from Data Analytics perspective

  • there are three common data types/structures:
  • Structured Data
    • adheres to a pre-defined data model
    • straightforward to analyze
    • conforms to a tabular format with rows and columns
    • Example: Excel files or SQL databases
  • Semi-structured Data
    • also known as a self-describing structure
    • form of structured data that does not conform to the formal structure
      of the data models associated with relational databases or other data tables
    • contains tags or other markers
    • Example: JSON and XML
  • Unstructured Data
    • information that either does not have a predefined data model or is not
      organized in a pre-defined manner
    • typically text-heavy
    • Examples: audio files, video files, NoSQL databases
  • Metadata – data about data
    • this is not a separate data structure
    • provides additional information about a specific set of data
    • Example: a photograph's metadata provides fields for the date and
      location where the photograph was taken
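
The difference between structured data, semi-structured data, and metadata can be seen in a short Python sketch. All names, values, and field names below are hypothetical sample data.

```python
import csv
import io
import json

# Structured: a fixed set of columns, every row has the same fields
table_text = "name,age\nAbebe,30\nSara,25\n"
rows = list(csv.DictReader(io.StringIO(table_text)))

# Semi-structured: self-describing keys, records may differ in shape
json_text = '[{"name": "Abebe", "age": 30}, {"name": "Sara", "tags": ["new"]}]'
records = json.loads(json_text)

# Metadata: data about data, here describing a hypothetical photo file
photo_metadata = {"taken_on": "2024-01-15", "location": "Addis Ababa"}

print(rows[0]["name"], sorted(records[1].keys()), photo_metadata["location"])
```

Every CSV row has the same two fields, while the second JSON record carries a `tags` key and no `age` at all.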

2.4. Data Value Chain

Describes the information flow within a big data system as a series of steps:
  • Data Acquisition
    • process of gathering, filtering, and cleaning data before it is put in a
      data warehouse or any other storage
    • one of the major big data challenges in terms of infrastructure requirements
  • Data Analysis
    • making the raw data acquired amenable to use in decision-making as well as
      domain-specific usage
    • involves exploring, transforming, and modeling data
  • Data Curation
    • the active management of data over its life cycle to ensure it meets the
      necessary data quality requirements for its effective usage
    • ensuring that data are trustworthy, discoverable, accessible, reusable, and
      fit for their purpose
  • Data Storage
    • the persistence and management of data in a scalable way that satisfies the
      needs of applications that require fast access to the data
    • NoSQL technologies present a wide range of solutions based on
      alternative data models
  • Data Usage
    • covers the data-driven business activities that need access to data, its
      analysis, and the tools needed to integrate the data analysis within the
      business activity
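
The five steps of the chain can be sketched as a toy pipeline in Python. Everything here is an assumption made for illustration: the function names, the sensor readings, and the use of a plain list as "storage" are not from the chapter, and a real big data system would use dedicated tools for each step.

```python
def acquire(raw):
    """Acquisition: gather, filter, and clean the raw readings."""
    cleaned = []
    for item in raw:
        try:
            cleaned.append(float(item.strip()))
        except ValueError:
            continue  # drop values that cannot be parsed
    return cleaned


def analyze(values):
    """Analysis: transform the cleaned values into a summary."""
    return {"count": len(values), "mean": sum(values) / len(values)}


def curate(summary):
    """Curation: attach quality information so the result can be trusted."""
    return {**summary, "quality_checked": summary["count"] > 0}


def store(record, storage):
    """Storage: persist the record (a list stands in for a database)."""
    storage.append(record)


def use(storage):
    """Usage: feed the stored result into a business activity."""
    return f"Average reading: {storage[-1]['mean']:.1f}"


raw_readings = [" 21.5", "", "19.0 ", "bad", "22.5"]  # hypothetical input
database = []
store(curate(analyze(acquire(raw_readings))), database)
report = use(database)
print(report)
```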

2.5. Basic concepts of big data

  • a blanket term for the non-traditional strategies and technologies needed to
    gather, organize, process, and draw insights from large datasets
  • data that exceeds the computing power or storage of a single computer

2.5.1. What Is Big Data?

  • a collection of data so large and complex that it becomes difficult to
    process using on-hand database management tools or traditional data
    processing applications
  • Large dataset: a dataset too large to process or store on a single computer
  • Big data is characterized by the 3 Vs and more:
    • Volume – large amounts of data
    • Velocity – data is live-streaming or in motion
    • Variety – data comes in many different forms from diverse sources
    • Veracity – can we trust the data? How accurate is it?

2.5.2. Clustered Computing and Hadoop Ecosystem

2.5.2.1. Clustered Computing

  • Big data clustering software combines the resources of many smaller
    machines
  • provides a number of benefits:
    • Resource Pooling – combining the available storage, CPU, and memory
    • High Availability – provides varying levels of fault tolerance and availability
      guarantees
    • Easy Scalability – easy to scale horizontally by adding additional machines
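
Resource pooling and horizontal scaling can be illustrated with a small Python sketch. The `Machine` class, the node names, and the capacities are hypothetical; real clustering software also handles scheduling and failures, which this sketch ignores.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    cpu_cores: int
    memory_gb: int
    storage_tb: int


def pooled_resources(machines):
    """Resource pooling: add up what every machine contributes."""
    return {
        "cpu_cores": sum(m.cpu_cores for m in machines),
        "memory_gb": sum(m.memory_gb for m in machines),
        "storage_tb": sum(m.storage_tb for m in machines),
    }


cluster = [Machine("node1", 4, 16, 2), Machine("node2", 4, 16, 2)]
before = pooled_resources(cluster)

# Horizontal scaling: add another machine instead of upgrading one
cluster.append(Machine("node3", 8, 32, 4))
after = pooled_resources(cluster)

print(before, after)
```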

2.5.2.2. Hadoop and its Ecosystem

  • an open-source framework that allows for the distributed processing of large
    datasets across clusters of computers
  • four key characteristics of Hadoop are:
    • Economical
    • Reliable – stores copies of the data on different machines and is resistant to hardware failure
    • Scalable – easily scalable both horizontally and vertically
    • Flexible – you can store as much structured and unstructured data as you need
  • Hadoop has an ecosystem that has evolved from its four core components:
    • data management – e.g., ZooKeeper
    • access – e.g., Pig, Hive
    • processing – e.g., YARN
    • storage – e.g., HDFS
  • It is continuously growing to meet the needs of Big Data
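
The "Reliable" characteristic, keeping copies of the data on different machines, can be sketched with a toy placement function. This is not HDFS's actual placement policy: the round-robin rule, the function names, and the machine and block names are all assumptions made for illustration. The replication factor of 3 matches the HDFS default.

```python
def place_blocks(blocks, machines, replication=3):
    """Assign each block to `replication` distinct machines, round-robin."""
    if replication > len(machines):
        raise ValueError("need at least as many machines as replicas")
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [
            machines[(i + k) % len(machines)] for k in range(replication)
        ]
    return placement


def readable_blocks(placement, failed):
    """A block is still readable if at least one copy is on a healthy machine."""
    return [
        block
        for block, hosts in placement.items()
        if any(host not in failed for host in hosts)
    ]


machines = ["m1", "m2", "m3", "m4"]
placement = place_blocks(["b0", "b1", "b2"], machines)

# Even with two machines down, every block still has a surviving copy
survivors = readable_blocks(placement, failed={"m1", "m2"})
print(placement, survivors)
```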

2.5.3. Big Data Life Cycle with Hadoop

2.5.3.1. Ingesting data into the system

  • data is ingested or transferred to Hadoop from various sources

2.5.3.2. Processing the data in storage

  • data is stored and processed

2.5.3.3. Computing and analyzing data

  • data is analyzed by processing frameworks

2.5.3.4. Visualizing the results

  • analyzed data can be accessed by users
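
The four stages of section 2.5.3 can be traced on a tiny in-memory example. This is only a sketch: the log lines and source names are hypothetical, and plain Python stands in for Hadoop's ingestion, storage, processing, and visualization tools.

```python
from collections import Counter

# 1. Ingest: collect records from several hypothetical sources
source_a = ["INFO start", "ERROR disk full"]
source_b = ["INFO login", "WARN slow query", "ERROR timeout"]
ingested = source_a + source_b

# 2. Store and process: keep the records and split each into fields
stored = [dict(zip(("level", "message"), line.split(" ", 1))) for line in ingested]

# 3. Compute and analyze: count the records per level
counts = Counter(record["level"] for record in stored)

# 4. Visualize: a plain-text bar chart that users can read
chart = "\n".join(f"{level:<5} {'#' * n}" for level, n in sorted(counts.items()))
print(chart)
```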
