
Big Data in banking

One of Poland’s largest banks decided to launch a Big Data project. The aim of the project was to strengthen the bank’s market position by enabling corporate departments to create additional value through data analysis. This would not have been possible without comprehensive access to the hundreds of terabytes of data generated by dozens of IT systems.

Owing to our competence and experience gained in earlier projects based on Apache Hadoop, 3Soft was selected by the Bank to design and launch an HDP (Hortonworks Data Platform) cluster serving as a key component of the analytics environment.

Launching a Data Lake cluster that consolidates data from multiple source systems and gives Data Scientists unified access to it required taking into account the specific needs of the financial industry, security requirements in particular. The design stage involved selecting the essential components of the Hadoop ecosystem which, once installed and configured, serve as the basis for the operational start-up of subsequent analytical models and, at the same time, give Data Scientists access to the data for quick verification of analytical hypotheses.

The multi-tenancy requirement for the Hadoop ecosystem was met through the following:

  • configuration of YARN schedulers that ensure specific levels of resources for respective types of operations
  • integration with the central provider of credentials (Active Directory)
  • implementation of tools and policies governing access to data stored on HDFS and exposed through Hive (see the policy sketch after this list)
  • configuration of mechanisms for monitoring and auditing user activities within the cluster.
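
Access policies of this kind are typically managed centrally in Apache Ranger. The sketch below shows how a read-only Hive policy for a Data Scientist group might be created through Ranger’s public REST API; the host, credentials, service name and group name are illustrative assumptions, not values from the project.

```python
# Minimal sketch: creating a read-only Hive policy through the Apache Ranger
# public REST API. Host, credentials, service and group names are illustrative.
import requests

RANGER_URL = "https://ranger.example.bank:6080"   # assumed Ranger endpoint
AUTH = ("admin", "***")                           # placeholder credentials

policy = {
    "service": "hdp_hive",                        # assumed Ranger Hive service name
    "name": "analytics_read_only",
    "description": "Read-only access to the analytics database for Data Scientists",
    "resources": {
        "database": {"values": ["analytics"]},
        "table":    {"values": ["*"]},
        "column":   {"values": ["*"]},
    },
    "policyItems": [
        {
            "groups":   ["data_scientists"],      # AD group synchronized to Ranger
            "accesses": [{"type": "select", "isAllowed": True}],
        }
    ],
}

resp = requests.post(f"{RANGER_URL}/service/public/v2/api/policy",
                     json=policy, auth=AUTH)
resp.raise_for_status()
print("Created policy id:", resp.json().get("id"))
```

Because the groups are synchronized from Active Directory, such a policy follows the central credential provider mentioned above without duplicating user management inside the cluster.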

The Data Lake architecture shortens the so-called analytical life cycle, i.e. the time from formulating a hypothesis, through implementation of the analytical model, to its operationalization and evaluation. This is possible because the data is available in a single place and processes launched in the production environment are guaranteed an appropriate level of resources.

Services provided by 3Soft

Data modelling and analysis
  • workshops on data use with department representatives
  • defining access to data collected on the Hadoop platform for the needs of analytical systems
  • determining requirements in terms of NoSQL databases implemented within the Hadoop ecosystem (e.g. HBase)
  • defining data structures on HDFS
  • specifying the file encoding on HDFS and the optimal file size for the volume of data in the cluster (a sizing sketch follows this list)
  • specifying requirements in terms of cluster security and availability
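
To illustrate the encoding and file-size decisions listed above, the following PySpark sketch lands a raw dataset as Snappy-compressed Parquet and controls output file size by repartitioning; the paths and the partition count are assumptions that would be tuned to the actual data volume.

```python
# Minimal sketch: writing a dataset to HDFS as Snappy-compressed Parquet while
# keeping output files close to the HDFS block size. Paths are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("land-transactions").getOrCreate()

raw = spark.read.option("header", True).csv("hdfs:///landing/raw/transactions/")

# Repartition so that each output file is roughly block-sized (e.g. 128-256 MB);
# the right partition count depends on the measured size of the dataset.
(raw.repartition(64)
    .write
    .mode("overwrite")
    .option("compression", "snappy")
    .parquet("hdfs:///data/clean/transactions/"))
```
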
Supplying the cluster with data from domain systems and data warehouses
  • specifying how data is extracted from domain systems
  • design and implementation of communication interfaces with external systems (an ingestion sketch follows this list)
  • migration of archived data
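
A minimal sketch of the ingestion path from a domain system: the snippet reads a table over JDBC with Spark and appends it to the raw area on HDFS. The connection string, credentials, table and target path are assumptions, and the relevant JDBC driver must be available on the Spark classpath.

```python
# Minimal sketch: pulling a table from a domain system over JDBC and storing it
# on HDFS. All connection details and names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest-core-accounts").getOrCreate()

accounts = (spark.read.format("jdbc")
            .option("url", "jdbc:oracle:thin:@corebank.example:1521/CORE")
            .option("dbtable", "CORE.ACCOUNTS")
            .option("user", "etl_user")
            .option("password", "***")
            .option("fetchsize", "10000")
            .load())

accounts.write.mode("append").parquet("hdfs:///data/raw/core/accounts/")
```
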
Implementation and testing of distributed data processing algorithms
  • staging of data stored in the cluster
  • stream data processing (e.g. with Apache Spark; see the streaming sketch after this list)
  • implementation and verification of data processing mechanisms in the test environment, including unit tests
  • verification of the performance of Hadoop frameworks
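
As a sketch of the stream processing mentioned above, the snippet below reads an event topic from Kafka with Spark Structured Streaming and persists it on HDFS. Broker, topic and path names are illustrative, and the spark-sql-kafka connector package has to be available on the cluster.

```python
# Minimal sketch: Kafka -> HDFS with Spark Structured Streaming.
# Broker, topic and paths are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-card-events").getOrCreate()

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "kafka01.example:9092")
          .option("subscribe", "card-transactions")
          .load())

query = (events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
         .writeStream
         .format("parquet")
         .option("path", "hdfs:///data/stream/card_transactions/")
         .option("checkpointLocation", "hdfs:///checkpoints/card_transactions/")
         .start())

query.awaitTermination()
```
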
Apache Hadoop administration
  • configuration of the roles of Hadoop ecosystem components (NameNode, DataNode, RegionServer, etc.)
  • implementation of High Availability mode for individual services
  • implementation of permission mechanisms (access to data on HDFS and in Hive, integration with Active Directory, policy configuration in Apache Ranger, etc.)
  • orchestration of data processing (execution of workflows in Oozie and Luigi)
  • deployment of scripts implementing the data retention and archiving policy (a retention sketch follows this list)
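
The retention scripts mentioned above can be orchestrated, for example, as Luigi tasks. The sketch below removes daily HDFS partitions older than a configured number of days by shelling out to the hdfs CLI; the paths, the dt=YYYY-MM-DD partition naming and the retention window are assumptions.

```python
# Minimal sketch: a Luigi task applying a simple retention policy on HDFS.
# Paths, partition naming and the retention window are illustrative.
import datetime
import subprocess

import luigi


class ApplyRetention(luigi.Task):
    """Drop daily partitions older than `keep_days` from a raw data area."""
    base_path = luigi.Parameter(default="hdfs:///data/raw/core/accounts")
    keep_days = luigi.IntParameter(default=365)

    def run(self):
        cutoff = datetime.date.today() - datetime.timedelta(days=self.keep_days)
        # List partition directories (dt=YYYY-MM-DD) and delete those past the cutoff.
        listing = subprocess.run(
            ["hdfs", "dfs", "-ls", "-C", self.base_path],
            capture_output=True, text=True, check=True,
        ).stdout.splitlines()
        for path in listing:
            parts = path.rsplit("dt=", 1)
            if len(parts) == 2 and datetime.date.fromisoformat(parts[1]) < cutoff:
                subprocess.run(["hdfs", "dfs", "-rm", "-r", "-skipTrash", path],
                               check=True)


if __name__ == "__main__":
    luigi.run()
```
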
Monitoring and maintenance of Apache Hadoop cluster infrastructure
  • network and hardware maintenance
  • operating system maintenance (compliance with OSG, patching, maintaining consistency, etc.)
  • maintenance of Hadoop components (HDFS, Hive, YARN, HBase, Kafka, Spark, etc.), including upgrades to newer versions of HDP
  • maintenance of custom applications run on the Hadoop platform
  • monitoring cluster status and providing maintenance services under the required SLA (a monitoring sketch follows this list)
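
As part of such monitoring, cluster state can be polled through the Ambari REST API. A minimal sketch with an assumed host, cluster name and credentials:

```python
# Minimal sketch: checking HDP service state through the Ambari REST API.
# Host, cluster name and credentials are illustrative.
import requests

AMBARI_URL = "https://ambari.example.bank:8443/api/v1"
CLUSTER = "prod_hdp"
AUTH = ("monitor", "***")

for service in ("HDFS", "YARN", "HIVE", "HBASE", "KAFKA"):
    resp = requests.get(f"{AMBARI_URL}/clusters/{CLUSTER}/services/{service}",
                        auth=AUTH)
    resp.raise_for_status()
    print(service, resp.json()["ServiceInfo"]["state"])
```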

Advantages of Hadoop implementation

CIO/CTO

  • shorter time needed to implement new business needs
  • low cost of entry (easy scalability later on)
  • quick results (iterative model implementation)

Data Scientist

  • quick access to all data produced by the organization, stored in one place
  • unified interfaces (e.g. JDBC) allow connecting to the data with the analyst’s tool of choice
  • instant verification of hypotheses through ad-hoc queries (e.g. in HiveQL; see the query sketch after this list)
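
As an illustration of such an ad-hoc query, the sketch below connects to HiveServer2 with PyHive and runs a simple aggregation; the host, database, table and column names are assumptions.

```python
# Minimal sketch: an ad-hoc HiveQL query run against HiveServer2 with PyHive.
# Host, database, table and column names are illustrative.
from pyhive import hive

conn = hive.Connection(host="hiveserver2.example.bank", port=10000,
                       username="analyst", database="analytics")
cursor = conn.cursor()
cursor.execute("""
    SELECT customer_segment, COUNT(*) AS cnt
    FROM transactions
    WHERE tx_date >= '2017-01-01'
    GROUP BY customer_segment
""")
for segment, cnt in cursor.fetchall():
    print(segment, cnt)
```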

Data Steward

  • built-in Data Governance tools
  • mechanisms allowing for the implementation of retention policies
  • data access monitoring and auditing

Architect

  • integrated data platform based on a coherent policy and harmonized standards
  • uncomplicated physical architecture, with software mechanisms ensuring data consistency and availability
  • easy scalability with low TCO