
Talking Data: Some Basic yet Useful Statistics in Data Analysis

· 5 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

"All models are wrong, but some are useful."--George Box

For many people, data analysis is familiar but inaccessible. Data rookies think that fancy dashboards look cool. Operational managers think that time-series charts can help them make business decisions. Programmers think that data analysis is nothing more than fetching target fields from the database according to certain requirements. These views are all correct, but incomplete. Truly useful data analysis does not just present charts and numbers; it combines data insights with business knowledge so that it adds value to the business. Understanding some basic statistics helps in discovering those insights.

Unreliable Averages

Many data reports display daily, weekly, or monthly averages, such as the daily average sales of the current month or the monthly average number of visits last year. The average is helpful in some specific situations, such as the time you get up every morning or the offset when hitting a target. More often, though, you are likely to be skeptical about the average, because the underlying values fluctuate a lot. The root cause is that many real-world quantities follow skewed (non-normal) distributions. Website response times, web page visit counts, and stock price movements are typical examples. In these skewed distributions the average fails as a summary, because a long tail of outliers pulls it away from the typical value. As shown in the figure below, the normal (Gaussian) distribution is symmetric, so its average sits at the peak in the middle. The gamma distribution is skewed, so its average deviates from its peak, and the more outliers there are, the further the average drifts from the center.

Gaussian and Gamma Distributions

Therefore, for these skewed, non-normal distributions, the average is not a reasonable indicator, and the median is a better description of the typical value. There are many tools for dealing with such distributions, one of which is the Box Plot. As shown in the figure below, each of the two distributions is abstracted into a box and several lines, where the center line of the box is the median and the edges of the box are the first and third quartiles. This gives a quick, general idea of the distribution without much complicated analysis.
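To make the gap between average and median concrete, here is a minimal sketch using only the Python standard library. The gamma parameters (shape 2.0, scale 1.0), the sample size, and the seed are assumptions chosen for illustration; they do not come from any real dataset.

```python
import random
import statistics

# Assumed parameters for illustration only: a gamma distribution with
# shape 2.0 and scale 1.0, standing in for something like response times.
rng = random.Random(42)
samples = [rng.gammavariate(2.0, 1.0) for _ in range(10_000)]

mean_value = statistics.mean(samples)
median_value = statistics.median(samples)

# statistics.quantiles with n=4 returns the three quartile cut points
# (Q1, Q2, Q3). Q1 and Q3 are the edges of a box plot's box, Q2 is the median.
q1, q2, q3 = statistics.quantiles(samples, n=4)

print(f"mean={mean_value:.3f} median={median_value:.3f}")
print(f"Q1={q1:.3f} Q2={q2:.3f} Q3={q3:.3f}")
```

For a right-skewed sample like this one, the mean lands above the median because the long tail pulls it upward, while the three quartiles summarize the bulk of the data in the same way a box plot does.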

Talking Data: Why data governance is so important in digital transformation?

· 5 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

"Organisms are living upon negentropy."--What is Life? --by Erwin Schrödinger

In today's Internet era, data is an important asset for enterprises. We generate data all the time: every time we open a mobile app, place an order online, or even drive through traffic lights, data is generated. Data is everywhere, and this is especially true in enterprises. With so much raw data and increasingly mature data analysis technology, entrepreneurs are excited, because it looks like a gold mine piled up for their companies. However, things are not as simple as imagined. It is not easy to extract valuable treasures from these messy so-called "gold mines" that look like garbage dumps. My previous article, Talking Data: What do we need for engaging data analytics?, mentioned the concept of Data Governance. This article introduces how data governance creates value out of the chaos of enterprise data, from the perspective of enterprise data management.

Isolated Data Island

Large and medium-sized enterprises (generally more than 100 people with multiple departments) often run into management chaos while their business grows fast. The sales department has its own sales statistics, typically in the form of large and scattered Excel files; the IT department manages the inventory systems by itself; the HR department maintains its own personnel statistics. This situation leads to many headaches: the boss complains that management reports arrive only once a week; managers look at the fluctuating numbers in the report and doubt the data integrity; employees work overtime to sort out the data for the reports, yet are often questioned about data quality. Sound familiar? These are very common issues in companies. The direct cause is the so-called Isolated Data Island problem.


The main reason for isolated data islands is that data from various departments or teams is disconnected. Often, because the business is growing rapidly, a team needs a data system quickly but cannot develop one in time, so it falls back on Excel or other quick data entry tools to keep business operations efficient. As the business grows, these workarounds keep spawning new internal processes, and bottlenecks gradually emerge once a certain scale is reached, especially when integration with other departments or external systems is required.

Golang in Action: How to implement a simple distributed system

· 9 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

Nowadays, many cloud-native and distributed systems such as Kubernetes are written in Go. This is because Go natively supports concurrent programming and is statically typed, which helps keep systems stable. My open-source project Crawlab, a web crawler management platform, uses a distributed architecture. This article introduces how to design and implement a simple distributed system.

Ideas

Before we start to code, we need to think about what we need to implement.

  • Master Node: A central control system, similar to a troop commander who issues orders
  • Worker Node: Executors, similar to soldiers who execute tasks
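The two roles can be sketched in a few lines. This is only an illustration of the idea, not the article's Go implementation: it is written in Python, the "nodes" are threads inside one process rather than machines talking over a network, and the names worker, run_master, and the squaring "task" are all hypothetical.

```python
import queue
import threading


def worker(name, tasks, results):
    """Worker node: keep executing tasks until the master sends None."""
    while True:
        task = tasks.get()
        if task is None:  # sentinel meaning "no more orders"
            break
        # The "work" here is just squaring a number.
        results.put((name, task, task * task))


def run_master(task_list, worker_count=2):
    """Master node: start workers, hand out tasks, then collect the results."""
    tasks = queue.Queue()
    results = queue.Queue()
    workers = [
        threading.Thread(target=worker, args=(f"worker-{i}", tasks, results))
        for i in range(worker_count)
    ]
    for w in workers:
        w.start()
    for task in task_list:
        tasks.put(task)
    for _ in workers:
        tasks.put(None)  # one stop signal per worker
    for w in workers:
        w.join()
    # Every task produced exactly one result; sort so the output is stable.
    return sorted(results.get()[2] for _ in task_list)


print(run_master([1, 2, 3, 4]))  # prints [1, 4, 9, 16]
```

In a real distributed system the queues would be replaced by network communication between machines, but the division of responsibility stays the same: the master decides what to run, and the workers run it.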

Apart from the concepts above, we need to implement some simple functionality.

CI/CD in Action: Manage auto builds of large open-source projects with GitHub Actions?

· 5 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

In the previous article, CI/CD in Action: How to use Microsoft's GitHub Actions in a right way?, we introduced how to use GitHub Actions workflows with a practical Python project. However, that example is quite simple and not comprehensive enough for large projects.

This article introduces how GitHub Actions is applied to CI/CD in practice in my open-source project Crawlab. If you are not familiar with Crawlab, you can refer to the official site or documentation. In short, Crawlab is a web crawler management platform for efficient data collection.

Overall CI/CD Architecture

The new version, Crawlab v0.6, splits general functionality into separate modules, so the whole project consists of a few interdependent sub-projects. For example, the main project crawlab depends on the front-end project crawlab-ui and the back-end project crawlab-core. The benefits are better decoupling and maintainability.

Below is the diagram of the overall CI/CD architecture.

Talking Algorithm: The hidden secret of nature in the divide-and-conquer algorithm

· 3 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

"The empire, long divided, must unite; long united, must divide. " -- The Romance of the Three Kingdoms

It is hard to deny the importance of algorithms in the modern IT industry. A well-designed algorithm lets software run efficiently with minimal resources. This is why IT companies set high standards for algorithms in their hiring process; think about the algorithm tests in technical interviews. Many people might feel that algorithms are quite distant from us, but I think their efficiency gains come naturally, as the reason behind them can be found in nature.

From a snowflake


As we all know, a snowflake is beautiful, not only because of its shining appearance, but also because of its shape. It looks like a polished hexagon, and each branch is a snowflake-like sub-hexagon. The term for this kind of recursively self-similar structure is Fractal. Fractals are common in nature: tree roots and branches, pulmonary capillaries, natural coastlines, and even broken glass.
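Recursive self-similarity maps directly onto recursive code. As a small illustration, the sketch below counts the line segments of a Koch snowflake, a mathematical idealization of a snowflake that this article does not itself discuss; the function names are hypothetical. Each straight segment is replaced by four smaller copies of the same shape, so the larger problem is defined in terms of smaller versions of itself.

```python
def koch_segments(depth):
    """Segments on one side of a Koch curve after `depth` refinements."""
    if depth == 0:
        return 1  # base case: a single straight line
    # Recursive case: every segment becomes 4 smaller, self-similar segments.
    return 4 * koch_segments(depth - 1)


def snowflake_segments(depth):
    """A Koch snowflake starts from a triangle, so it has 3 such sides."""
    return 3 * koch_segments(depth)


print([snowflake_segments(d) for d in range(4)])  # prints [3, 12, 48, 192]
```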

So, why? What is the reason that fractals are so common in nature? Is it a design by the gods, or are there fundamental mathematical laws behind it? Academic researchers believe in the latter. According to thermodynamics, snowflakes are formed when water vapor encounters an abrupt drop in temperature; according to fluid mechanics, the tree-like structure of pulmonary capillaries allows oxygen to be absorbed by red blood cells. In conclusion, fractals exist for efficiency.

CI/CD in Action: How to use Microsoft's GitHub Actions in a right way?

· 6 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

GitHub Actions is the official CI/CD workflow service provided by GitHub. It aims to make it easy for open-source project contributors to manage operational maintenance, and to enable open-source communities to embrace cloud-native DevOps. GitHub Actions is integrated into most of my open-source projects, including Crawlab and ArtiPub. As a contributor, I think GitHub Actions is not only easy to use, but also free (which is the most important). Therefore, I hope this article helps open-source project contributors who are not familiar with GitHub Actions get ideas on how to use it and make an impact.

Starting from documentation

For those who are not familiar with GitHub Actions, it is strongly recommended that you read the official documentation first, where you can find the introduction video, quick start, examples, concepts, how it works, and so on. If you read through the docs, you can easily do GitHub DevOps building on your own CI/CD experience. References for all the code in this article can be found in the official documentation:

GitHub Actions Docs

Ideas

Let's first figure out what we would like to implement: using GitHub Actions to run a web crawler that gets the daily ranking from GitHub Trending.

Talking Testing: the love and hate of Unit Tests

· 4 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

"No code is the best way to write secure and reliable applications."--Kelsey Hightower

Many developers have heard of Unit Tests, and some have written them and know them well. However, in a volatile and fast-changing environment, unit tests seem to be in an awkward position. Developers know they are useful, but neglect them. "The schedule is tight. What time do we have for unit tests?" Does it sound familiar?

What is Unit Test?

A Unit Test is a piece of code written by developers to validate whether their own functional code runs as expected. If a test does not pass, the functional code has a problem.
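As a minimal sketch of what this looks like in practice, the example below uses Python's built-in unittest module. The add function and the AddTest class are hypothetical stand-ins for real functional code and its tests.

```python
import unittest


def add(a, b):
    """Functional code under test (a hypothetical example)."""
    return a + b


class AddTest(unittest.TestCase):
    """Unit tests that check add() against known answers."""

    def test_adds_two_numbers(self):
        self.assertEqual(add(1, 2), 3)

    def test_adds_negative_numbers(self):
        self.assertEqual(add(-1, -2), -3)


# Load and run the tests explicitly so the snippet also works when pasted
# into a larger script or an interactive session.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(AddTest)
result = unittest.TextTestRunner(verbosity=2).run(suite)
print("all passed:", result.wasSuccessful())
```

If someone later changes add so that it no longer returns the sum, at least one of these tests fails, which is the signal that the functional code has a problem.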

This self-testing method may look like self-deception, similar to taking an exam with the official answers in hand. In the testing field, this approach is called White Box Testing. Its counterpart is Black Box Testing, which validates behavior from the outside without looking at the code. Unit Tests are White Box Tests, while higher-level testing methods such as Integration Tests, End to End Tests, and UI Tests are all Black Box Tests. Unit Tests only test the code itself.

Testing Pyramid

What are the benefits of unit testing?

Unit testing is a very useful tool in Agile Development. Some agile frameworks, such as eXtreme Programming (XP), require every feature to be covered by unit test cases. My previous article Talking Agile: Are you sure your team is practicing Agile properly? mentioned the importance of unit tests.

Talking Data: What do we need for engaging data analytics?

· 5 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

"According to incomplete statistics, the proportion of white hair of data workers is higher than the average of the same age group." by a data worker

Data, a familiar but mysterious word, has become a totem pursued by everyone. Managers love fancy data reports, data analysts are keen on building complicated statistical models, and salespeople use dashboards as compasses to see whether they can hit their KPIs. Over the past ten-plus years, the data industry has developed fast and produced novel yet formidable jargon such as Big Data, Data Science, Data Lake, Data Mesh, and Data Governance. Yet the "traditional" terms remain abstruse: Data Warehouse, Business Intelligence, Data Mart, Data Mining. An even bigger headache is that many people still do not understand how these relate to recently popular concepts such as Artificial Intelligence, Machine Learning, and Deep Learning. These buzzwords are the result of aggressive development in the data area.

Professional Doctor or Fortune Teller?

Years ago, with the rapid development of the Internet industry, the bubble of the data industry kept growing. Data, the by-product of Internet applications, comes in large volumes and great variety. Data owners wanted to get the most out of it and regarded it as a gold mine. Therefore, data mining engineers became some of the most sought-after professionals. Later, a brand new and even more popular position, Data Scientist, emerged as "the sexiest job in the 21st century".


Data scientists are in demand because the role requires abilities and experience in various areas:

  • Programming Skills: at least able to use Python or R to do data cleansing, analysis and modeling.
  • Mathematics and Statistics: familiar with probability theory, calculus, and discrete mathematics.
  • Business Knowledge: deep understanding of market, process and macro trends in related areas.
  • Communication Skills: able to convey insights and analysis results in a human-friendly way.

Golang in Action: How to quickly implement a minimal task scheduling system

· 5 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

Task Scheduling is one of the most important features in software systems. It means assigning and executing long-running tasks or scripts according to certain specifications. In the web crawler management platform Crawlab, task scheduling is a core module, and you may wonder how to build one from scratch. This article shows how to build a simple but useful task scheduler with Go.

Idea

Let's focus on what we need for the task scheduling system.

  • User Interface: API
  • Scheduler: Cron
  • Execute Tasks: Executor

Below is the basic process.

Basic process of the task scheduling system

We can use an HTTP API to create scheduled tasks, and the executor will run the scripts periodically based on their specifications.
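The interplay of the three parts can be sketched with Python's standard sched module. This is a simplified illustration under several assumptions, not the Go implementation the article builds: a plain method call (add_task) stands in for the HTTP API, fixed intervals in seconds stand in for cron expressions, and the "executor" only records the task name instead of running a script. MiniScheduler and all of its members are hypothetical names.

```python
import sched
import time


class MiniScheduler:
    def __init__(self):
        # sched.scheduler needs a clock function and a delay function.
        self._scheduler = sched.scheduler(time.monotonic, time.sleep)
        self.log = []  # names of executed tasks, in execution order

    def add_task(self, name, interval, runs):
        """Stand-in for the API: run `name` every `interval` seconds, `runs` times."""
        self._scheduler.enter(interval, 1, self._execute, (name, interval, runs))

    def _execute(self, name, interval, remaining):
        """Stand-in for the executor: record the run, then reschedule if needed."""
        self.log.append(name)
        if remaining > 1:
            self._scheduler.enter(
                interval, 1, self._execute, (name, interval, remaining - 1)
            )

    def run(self):
        """Block until no scheduled events remain."""
        self._scheduler.run()


scheduler = MiniScheduler()
scheduler.add_task("fast", 0.01, runs=3)
scheduler.add_task("slow", 0.05, runs=1)
scheduler.run()
print(scheduler.log)
```

Capping each task with a runs count keeps the sketch finite; a real scheduler would keep rescheduling until the task is removed through the API.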

Talking Agile: Are you sure your team is practicing Agile properly?

· 6 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

"You'd better stay late tonight. Our boss is angry about the progress." -- An ambitious but worrying project manager

Agile, a word that once stood for agility and speed, has become one of the sexiest yet seemingly abstruse pieces of jargon used by software developers in the IT industry over the past 20 years. Agile Development is the project management methodology of choice for many teams, because it is simple, agile (literally), and, more importantly, implies speed. Think about the countless programmers who are willing to work overtime voluntarily because of hard deadlines.

Agile = Deliver Fast?

When your dev team suffers from deferred releases, degrading product quality, poor team morale and fading customer relationships, your friends will probably recommend agile development: "Hey, buddy. I heard that agile is pretty awesome. You should probably take a shot." However, after your team has been practicing agile for a while, you are very likely to find it not good enough: loads of bugs after production releases, an increasing number of overtime hours, and endless new requirements, each of which is the highest priority. You tremble with fear in the regular meeting with your boss, who may ask, "I heard that you guys are improving our delivery capability with Agile. This is great! So can those important features be available next week?"

That Agile is the same as delivering fast is one of the biggest myths among agile practitioners. Agile doesn't necessarily mean fast; on the contrary, introducing Agile leads to more work and therefore lower delivery speed. Yes, you read that right. Technically speaking, Agile reduces the speed of delivery. For example, Extreme Programming (XP, one of the most popular agile frameworks) requires unit test cases to cover all features and functions, which can result in 1-2 times as much extra code.

So, why do people claim that Agile works great in software development? If it neither increases development efficiency nor reduces workload, what on earth is it useful for? This is a typical and critical question for Agile practitioners. If you cannot provide timely answers that satisfy the team, they will ultimately drop Agile and head back to their old ways.