The Quality Assurance Engineer's Role in a Data Science Project

Introduction:

Data science is not just statistics. Like bioinformatics, it is an interdisciplinary field that combines mathematics, statistics, computer science and information science. And, like Big Data, it has become a buzzword.

A QA company carrying out data-centric testing today relies on data science to solve complex analytical problems through algorithm development, data inference and other emerging techniques.

Despite the existence of best practices in software testing for operational applications, there is a remarkable lack of established QA practices for advanced analytics and data science. Today, as the practice of data science proliferates across businesses, conducted by a broadening variety of analytics specialists and data scientists, the number of insufficiently tested solutions is growing rapidly.

QA Role In Data Science:

QA’s role in big data is radically different from its role in other streams of work. Here, QA engineers have to build counter systems that validate whether raw data and any aggregated data are being folded correctly. That involves:

·    Simple count checks and mismatch detection.

·    Checking that any other calculations are done correctly, for example by running the same computation against raw and aggregated data and comparing the results.

·    Missing data: if the ingestion source has new timestamped data available, is that data also available in the application’s data source?

·    Big data ad hoc queries and web applications are radically different from regular queries. Most big data queries read from columnar storage, because they project a few columns and pull a massive dataset over a date range with some aggregations.

·    As a corollary to the point above, many big data queries are interactive, so testing often has to run over websockets. This needs flexible tooling built around the use case (again, engineering skills are required).

·    Knowledge of how histograms and some basic statistics work.

·    Simulation of datasets: knowing, for instance, how to run a Monte Carlo simulation to produce an artificial dataset for smoke testing.

·    Streaming: if the data is streamed, knowing what a sample is (for example, reservoir sampling).
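
To make the first few checks in the list concrete, here is a minimal sketch in Python. It generates an artificial dataset with a seeded random generator, aggregates it, and then reconciles the aggregate against the raw events (row counts, sums and missing time buckets). The function names, the (hour, amount) event shape and the `aggregate_by_hour` stand-in for the pipeline under test are all assumptions made for illustration; a real check would read from the actual raw and aggregated stores.

```python
import random
from collections import defaultdict


def make_raw_events(n, seed=0):
    """Generate artificial (hour, amount) events with a seeded RNG."""
    rng = random.Random(seed)
    return [(rng.randrange(24), rng.randint(1, 100)) for _ in range(n)]


def aggregate_by_hour(events):
    """Hypothetical stand-in for the pipeline under test: count and sum per hour."""
    totals = defaultdict(lambda: {"count": 0, "sum": 0})
    for hour, amount in events:
        totals[hour]["count"] += 1
        totals[hour]["sum"] += amount
    return dict(totals)


def reconcile(events, aggregated):
    """Return a list of mismatches between raw events and the aggregate."""
    problems = []
    if sum(row["count"] for row in aggregated.values()) != len(events):
        problems.append("row count mismatch")
    if sum(row["sum"] for row in aggregated.values()) != sum(a for _, a in events):
        problems.append("sum mismatch")
    missing = {hour for hour, _ in events} - set(aggregated)
    if missing:
        problems.append(f"hours missing from aggregate: {sorted(missing)}")
    return problems


events = make_raw_events(1_000)
aggregated = aggregate_by_hour(events)
clean_report = reconcile(events, aggregated)

# Simulate a faulty pipeline that silently dropped one hour of data.
broken = {hour: row for hour, row in aggregated.items() if hour != 3}
broken_report = reconcile(events, broken)
```

An empty report means the aggregate reconciles with the raw data. The deliberately broken aggregate shows that the check reports a dropped bucket instead of passing silently.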

In short, a QA engineer in the big data space is not a test engineer. She or he is a software development engineer who understands functional programming constructs and knows the technologies currently at hand, such as Spark, HDFS, Parquet and Kafka. At the very least, she or he can write code in one particular language (Scala, Python, Java, Go, Rust, C or whatever the language of the day is) and can easily write automated testing templates for the scenarios above.
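
As one example of such a template, the streaming scenario can be sketched with reservoir sampling (the classic "Algorithm R"), which keeps a uniform random sample of fixed size from a stream whose length is not known in advance. The function name and the `seed` parameter are choices made here for illustration, and the stream is simulated with a plain iterator rather than a real Kafka topic.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# A simulated stream; only k items are ever held in memory.
sample = reservoir_sample(iter(range(10_000)), k=5, seed=42)

# A stream shorter than k is returned in full.
short = reservoir_sample(iter(range(3)), k=5, seed=42)
```

Fixing the seed makes the sample reproducible, which is useful when a failing test needs to be rerun on exactly the same records.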

Challenges of testing with data science:

Many advanced analytics practitioners and data scientists rely on code reviews by team members, because typical software testing methodologies cannot accommodate the special needs of their models and applications.

As an example, simple changes in data can adversely affect the performance of analytics models. The uniqueness and size of an advanced analytics software solution can make it very challenging to test scalability and prepare for successful implementation.
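
One way to catch such changes in data before they reach a model is to compare the histogram of a new batch against a reference batch. The sketch below uses the Population Stability Index (PSI), a common drift measure; it is only one possible choice, and the bin count, the small floor for empty bins and the synthetic Gaussian data are assumptions made for illustration.

```python
import math
import random


def histogram_shares(values, edges):
    """Share of values falling into each bin defined by consecutive edges."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        # Values outside the range are clamped into the first or last bin.
        idx = 0
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    total = len(values)
    # A small floor avoids log(0) for empty bins.
    return [max(c / total, 1e-6) for c in counts]


def population_stability_index(reference, current, bins=10):
    """PSI between two samples, using equal-width bins over the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(bins + 1)]
    ref = histogram_shares(reference, edges)
    cur = histogram_shares(current, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


rng = random.Random(1)
reference = [rng.gauss(0, 1) for _ in range(5_000)]
same = [rng.gauss(0, 1) for _ in range(5_000)]     # same distribution
shifted = [rng.gauss(1, 1) for _ in range(5_000)]  # mean has moved

psi_same = population_stability_index(reference, same)
psi_shifted = population_stability_index(reference, shifted)
```

A frequently quoted rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a significant shift. Those thresholds are a convention, not a guarantee, and should be tuned to the model in question.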

Regular testing of production analytics is required, as models may not have been examined for many years while the business processes and software environments around them have evolved.

The following steps can be recommended for advanced analytics QA:

  • interview stakeholders from business and analytics development to understand the business problem and context
  • review existing models and procedures
  • review data sources
  • implement models in alternative technologies to compare results — languages, solvers, analytics engines
  • experiment with models and a variety of test data sets to uncover issues and stress the model implementation
  • suggest improvements and recommend possible further investigation.
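
The step of implementing a model in an alternative technology and comparing results can be sketched as follows. Here the "model" is a simple straight-line fit, implemented once with the closed-form least-squares solution and once with gradient descent, and the two answers are compared within a tolerance. The model, the learning rate, the step count and the tolerance are all assumptions made for illustration; in practice the second implementation would typically live in a different language, solver or analytics engine.

```python
def fit_closed_form(xs, ys):
    """Least-squares fit of y = slope * x + intercept via the closed-form solution."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_gradient_descent(xs, ys, lr=0.01, steps=20_000):
    """The same model fitted independently: batch gradient descent on squared error."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        grad_slope = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = 2 / n * sum(errors)
        slope -= lr * grad_slope
        intercept -= lr * grad_intercept
    return slope, intercept


def agree(a, b, tol=1e-6):
    """True if every pair of fitted parameters matches within tol."""
    return all(abs(p - q) <= tol for p, q in zip(a, b))


# Small test data set: y = 2x + 1 with alternating +/- 0.5 noise.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 + (0.5 if i % 2 else -0.5) for i, x in enumerate(xs)]

reference_fit = fit_closed_form(xs, ys)
alternative_fit = fit_gradient_descent(xs, ys)
results_match = agree(reference_fit, alternative_fit)
```

When the two implementations disagree beyond the tolerance, at least one of them has a defect or the problem is numerically unstable, and both findings are worth reporting.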

An advanced analytics QA team requires expertise in modelling, advanced analytics algorithms, numerical computing, commercial and open source packages for analytics and data science, and deployment of systems embedding advanced analytics.

Conducting a review entails vital questions about the correctness of the model, data sourcing and integration, publishing and use of solutions in the business, sensitivity of the answers to the inputs, and other issues. These questions often can’t be answered internally for a variety of reasons. An independent testing team may need to be supplemented by third-party experts.
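
The question of how sensitive the answers are to the inputs can be approached with a simple one-at-a-time perturbation test: nudge each input by a small percentage and record how much the output moves. The sketch below assumes a hypothetical `revenue_forecast` model and a 1% perturbation, both invented for illustration, and it assumes a non-zero baseline output. It is not a substitute for a full sensitivity study.

```python
def sensitivity(model, inputs, rel_change=0.01):
    """Relative output change per relative input change, one input at a time."""
    baseline = model(**inputs)
    report = {}
    for name, value in inputs.items():
        # Bump a single input and leave the others at their baseline values.
        bumped = dict(inputs, **{name: value * (1 + rel_change)})
        report[name] = ((model(**bumped) - baseline) / baseline) / rel_change
    return report


def revenue_forecast(price, volume, discount):
    """Hypothetical model, used only to exercise the check."""
    return price * volume * (1 - discount)


report = sensitivity(
    revenue_forecast,
    {"price": 20.0, "volume": 500.0, "discount": 0.1},
)
```

A value near 1 means the output moves in proportion to that input, while a value near 0 means the input barely matters. Inputs with unexpectedly large values are the ones whose data sourcing deserves the closest review.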

Any organisation that relies on advanced analytics for core processes must determine whether suitable quality assurance has been conducted. A formal process should be established for testing advanced analytics, in line with the testing of other operational software. Failure to do so could reduce the potential impact of advanced analytics and data science in the business environment.