
12 posts tagged with "Data Analysis"

Data analysis and data science


POML: The Rise of Structured Prompt Engineering and the Prospect of AI Application Architecture's 'New Trinity'

· 11 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

In today's rapidly advancing artificial intelligence (AI) landscape, prompt engineering is transforming from an intuition-based "art" into a systematic "engineering" practice. POML (Prompt Orchestration Markup Language), launched by Microsoft in 2025 as a structured markup language, injects new momentum into this transformation. POML not only addresses the chaos and inefficiency of traditional prompt engineering but also heralds the potential for AI application architecture to embrace a paradigm similar to web development's "HTML/CSS/JS trinity." Based on an in-depth research report, this article provides a detailed analysis of POML's core technology, analogies to web architecture, practical application scenarios, and future potential, offering actionable insights for developers and enterprises.

POML Ushers in a New Era of Prompt Engineering

POML, launched by Microsoft Research, draws inspiration from HTML and XML: it decomposes complex prompts into clear components through modular, semantic tags (such as <role> and <task>), addressing the pain points of traditional "prompt spaghetti." It reshapes prompt engineering through the following features (a rough sketch follows the list below):

  • Semantic tags: Improve prompt readability, maintainability, and reusability.
  • Multimodal support: Seamlessly integrate text, tables, images, and other data.
  • Style system: Inspired by CSS, separate content from presentation, simplifying A/B testing.
  • Dynamic templates: Support variables, loops, and conditions for automation and personalization.
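As an illustration of these ideas, here is a rough sketch of what a POML-style prompt might look like, written as a plain Python string. This is not official POML syntax: the tag names, attributes, templating syntax, and file name below are assumptions for illustration and should be checked against Microsoft's documentation.

```python
# A hedged sketch of a POML-style prompt combining semantic tags, a data
# reference, and a template variable. Tag names and syntax are illustrative.
poml_style_prompt = """
<poml>
  <role>You are a senior data analyst.</role>
  <task>Summarize the quarterly sales table and flag anomalies.</task>
  <table src="sales_q3.csv" />            <!-- multimodal data reference -->
  <output-format>Return a bullet list with at most {{max_bullets}} items.</output-format>
</poml>
"""

# In a real pipeline a POML toolchain would render this markup; here we simply
# substitute the template variable by hand and print the result.
print(poml_style_prompt.replace("{{max_bullets}}", "5"))
```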

POML is not just a language but the structural layer of AI application architecture, forming the "new trinity" together with optimization tools (like PromptPerfect) and orchestration frameworks (like LangChain). This architecture aligns closely with the academically proposed "Prompt-Layered Architecture" (PLA), elevating prompt management to "first-class citizen" status on a par with traditional software development practices.

In the future, POML is expected to become the "communication protocol" and "configuration language" for multi-agent systems, laying the foundation for building scalable and auditable AI applications. While the community debates its complexity, its potential cannot be ignored. This article will provide practical advice to help enterprises embrace this transformation.

Stanford University Study Reveals Real Impact of AI on Developer Productivity: Not a Silver Bullet

· 8 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

This article is based on a presentation by Stanford University researcher Yegor Denisov-Blanch at the AIEWF 2025 conference, which analyzed real data from nearly 100,000 developers across hundreds of companies. Those interested can watch the full presentation on YouTube.

Recently, claims that "AI will replace software engineers" have been gaining momentum. Meta's Mark Zuckerberg even stated earlier this year that he plans to replace all mid-level engineers in the company with AI by the end of the year. While this vision is undoubtedly inspiring, it also puts pressure on technology decision-makers worldwide: "How far are we from replacing all developers with AI?"

The latest findings from Stanford University's software engineering productivity research team provide a more realistic and nuanced answer to this question. After an in-depth analysis of nearly 100,000 software engineers at over 600 companies, tens of millions of commits, and billions of lines of private codebase data, this large-scale study shows that AI does improve developer productivity, but it is far from a one-size-fits-all solution, and its impact is highly contextual. While average productivity increased by about 20%, in some cases AI can even be counterproductive and reduce productivity.

Crawlab AI: Building Intelligent Web Scrapers with Large Language Models (LLM)

· 6 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

"If I had asked people what they wanted, they would have said faster horses" -- Henry Ford

Preface

When I first entered the workforce as a data analyst, I stumbled upon the ability of web crawlers to automatically extract webpage data, and I have been fascinated by this magical technology ever since. As I dug deeper into web scraping, I gradually came to understand its core techniques, including web parsing: analyzing a page's HTML structure to build data extraction rules based on XPath or CSS Selectors. This process has long required manual work. While it is relatively simple for a scraping engineer, it becomes very time-consuming for large-scale extraction, and every change in page structure adds to crawler maintenance costs. This article introduces my LLM-based intelligent web scraping product, Crawlab AI. Although it is still in early development, it has already shown great potential and promises to make data acquisition easy for data practitioners.
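To make the manual parsing step concrete, here is a minimal Python sketch using lxml; the HTML snippet, field names, and selectors are invented for illustration, not taken from any real site.

```python
# Hand-written extraction rules: one XPath expression and one CSS selector,
# both tied to this particular (made-up) page structure.
from lxml import html

page = html.fromstring("""
<ul class="products">
  <li><span class="name">Laptop</span><span class="price">$999</span></li>
  <li><span class="name">Phone</span><span class="price">$599</span></li>
</ul>
""")

# XPath rule for the product names.
names = page.xpath('//li/span[@class="name"]/text()')
# Equivalent CSS selector rule for the prices (requires the cssselect package).
prices = [el.text for el in page.cssselect("li span.price")]

print(list(zip(names, prices)))  # [('Laptop', '$999'), ('Phone', '$599')]
```

Every time the page layout changes, rules like these must be rewritten by hand, which is exactly the maintenance burden an intelligent scraper aims to remove.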

As the founder of the web scraping management platform Crawlab, I've always been passionate about making data acquisition simple and easy. Through constant communication with data practitioners, I came to appreciate the massive demand for intelligent scrapers (or universal scrapers): extracting target data from any website without manually writing parsing rules. Of course, I'm not the only one trying to solve this problem. In January 2020, Qingnan released GeneralNewsExtractor, a universal article parsing library based on punctuation density that can implement a universal news crawler in four lines of code; in July 2020, Cui Qingcai released GerapyAutoExtractor, which extracts list-page data using SVM algorithms; and in April 2023, I developed Webspot, which automatically extracts list pages using high-dimensional vector clustering. The main problem with these open-source tools is that their recognition accuracy still lags behind hand-written crawler rules.

Additionally, commercial scraping software Diffbot and Octoparse have also implemented some universal data extraction functionality through proprietary machine learning algorithms. Unfortunately, their usage costs are relatively high. For example, Diffbot's lowest plan requires a monthly subscription fee of $299.

Exploring Crawlab: Your New Enterprise Web Scraping Management Choice

· 3 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

In the modern data-driven era, acquiring and managing online information has become crucial. To provide powerful support for enterprises and developers, Crawlab has emerged as an enterprise-level web scraping management platform that works out of the box. Whatever your team size, Crawlab can provide a professional and efficient web scraping management solution.

Core Features

Crawlab's core features include distributed system management, spider task management and scheduling, file editing, message notifications, dependency management, Git integration, and performance monitoring, among others. Its distributed node management allows spider programs to run efficiently across multiple servers. No more worrying about manual uploading, monitoring, and deployment hassles - Crawlab automates all of this, ensuring you can easily schedule spider tasks and view spider program running status and task logs in real-time.

Spider List

Key Highlights

On Generative AI Technology: Retrieval-Augmented Generation (RAG)

· 4 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

Nowadays, generative AI applications are springing up like mushrooms after rain. Large Language Models (LLMs), which became hugely popular with the release of ChatGPT, are a typical example of generative AI. However, LLMs have flaws. One significant problem is hallucination: for unfamiliar questions, an LLM fabricates answers that sound professional but have no factual basis. To address this problem, many AI-based knowledge Q&A systems adopt Retrieval-Augmented Generation (RAG), enabling LLMs to give fact-based answers and greatly reduce hallucinations. This article briefly introduces how RAG works in knowledge Q&A systems.

LLMs

To understand RAG, we first need a brief understanding of LLMs. Through training on enormous numbers of parameters, LLMs can already complete many impressive NLP tasks, such as Q&A, writing, translation, and code understanding. However, since an LLM's "memory" is frozen at pre-training time, there will inevitably be knowledge and questions it does not know. For example, OpenAI's ChatGPT cannot answer questions about events after its training cutoff of September 2021. In addition, because of hallucinations, LLMs can appear highly imaginative yet lack factual grounding. We can therefore compare an LLM to a knowledgeable and versatile sage who can do many things but suffers from amnesia: his memories stop at a certain point in time, and he cannot form new ones.

To help this sage achieve high scores in modern exams, what should we do? The answer is RAG.
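As a rough sketch of the idea, the flow is retrieve, augment, then generate. In the snippet below the knowledge snippets are made up, TF-IDF similarity stands in for a real vector database, and the final LLM call is left as a placeholder.

```python
# A minimal RAG-style sketch: retrieve the most relevant snippet, then build
# a grounded prompt for an LLM. Documents and question are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_base = [
    "Crawlab is a distributed web crawler management platform.",
    "Apache Superset is an open source self-service data analytics platform.",
    "RAG retrieves relevant documents and feeds them to an LLM as context.",
]
question = "What is Apache Superset?"

# 1. Retrieval: find the snippet most similar to the question.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(knowledge_base)
query_vector = vectorizer.transform([question])
best_doc = knowledge_base[cosine_similarity(query_vector, doc_vectors).argmax()]

# 2. Augmentation: build a prompt that grounds the answer in retrieved text.
prompt = (
    "Answer the question using only the context below.\n"
    f"Context: {best_doc}\n"
    f"Question: {question}"
)

# 3. Generation: hand `prompt` to any LLM API of your choice (omitted here).
print(prompt)
```

In effect, the retrieval step hands the forgetful sage an up-to-date reference book right before the exam.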

Practical Data Science: How to Easily Rank in Kaggle Beginner NLP Competition Using sklearn

· 6 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

Kaggle is an online community and data science competition platform for data scientists, machine learning engineers, and data analysts, featuring many prize-awarding competitions and public datasets. The Kaggle community is very well known in the data science field, and major internet companies publish competitions with prizes ranging from tens of thousands to millions of dollars. This article describes my recent participation in a beginner-level Kaggle NLP competition, which offers no cash prize but is a good way to learn NLP-related machine learning.

Kaggle Competition

Competition Overview

This data science competition asks participants to determine, from a given tweet, whether it describes a real disaster. The image below shows a tweet containing the keyword "ABLAZE," indicating that it is about a house catching fire.

Disaster Tweet
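As a minimal sketch of the kind of sklearn baseline this competition invites: TF-IDF features fed into logistic regression. The training tweets and test tweet below are illustrative stand-ins; in the real competition you would load the provided train.csv.

```python
# Tiny illustrative baseline: TF-IDF + logistic regression in a pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "Forest fire near La Ronge Sask. Canada",   # about a real disaster -> 1
    "I love fruits",                            # not a disaster        -> 0
]
train_labels = [1, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# Predict on an unseen tweet containing the keyword "ABLAZE".
print(model.predict(["Three houses ABLAZE on the corner of 5th street"]))
```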

Practical Data Analysis: Open Source Automated Data Exploration Tool Rath

· 5 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

Exploratory Data Analysis (EDA) is a task that data analysts and data scientists frequently face when working with a new dataset. Python tools such as Pandas and Seaborn make univariate, bivariate, and multivariate analysis straightforward, but using them for data exploration still involves a technical barrier and a fair amount of hand-written scripting for data manipulation and analysis. This article introduces Rath, a very cool open source tool for automated data exploration that can complete EDA automatically, acting as the Autopilot or Copilot of the data analysis world.

Rath

Installing Rath

Since Rath is still in rapid iteration and its documentation isn't very complete, the fastest way to experience it is through the demo website provided on the official site.

However, if you know some frontend technology, you can still install it locally, though the steps are slightly more cumbersome.

Before starting, ensure you have Node.js 16 and Yarn installed.

Practical Data Analysis: Building a Self-Service Data Analytics Platform with Open Source Superset

· 7 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

Data Analytics and Business Intelligence are important building blocks for many enterprises implementing a digital strategy. In a previous article, "Talking Data: What Do We Need to Master in the Data Field?", we introduced an indispensable part of the data field: the software tools and services that support data architecture and processes. Apache Superset, the open source data analysis platform introduced in this article, can provide exactly such services. This article briefly covers how to install, deploy, and use Superset.

Superset Official Site

Superset Introduction

Superset is an open source self-service data analytics platform incubated by the Apache Foundation. It can be seen as an open source alternative to Power BI or Tableau, though Superset's interactive interface is web-only. The system is built on Python Flask and integrates with mainstream relational databases such as MySQL, Postgres, and SQL Server, as well as modern data stores such as Elasticsearch, ClickHouse, and Snowflake. Its visual analysis interface is very similar to Power BI and Tableau and is relatively simple to operate. So if you need to build an enterprise-level data analytics platform like Power BI or Tableau without spending money, Superset is an excellent choice.

Superset Dashboard

Talking Data: Some Basic yet Useful Statistics in Data Analysis

· 5 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

"All models are wrong, but some are useful."--George Box

For many people, data analysis is familiar yet out of reach. Data rookies think fancy dashboards look cool. Operations managers think time-series charts help them make business decisions. Programmers think data analysis is nothing more than fetching the target fields from a database according to some requirements. These views are all correct, but incomplete. Truly useful data analysis does not just present charts and numbers; it combines data insights with business knowledge so that they add value to the business. Understanding some basic statistics helps in discovering such insights.

Unreliable Averages

We often see data reports that display daily, weekly, or monthly averages, such as the average daily sales for the current month or the average monthly number of visits last year. Averages are helpful in some specific situations, such as the time you get up every morning or the offset of shots at a target. More often, though, you should be skeptical of an average, because it can fluctuate wildly. The root cause is the prevalence of non-linear (skewed) distributions in the real world: website response times, web page visits, and stock movements are all distributed this way. In such distributions the average fails, because a large number of outliers skew it badly. As shown in the figure below, the normal (Gaussian) distribution is symmetric, so its average sits at the central peak; the gamma distribution, being skewed, has an average that deviates noticeably from its peak, and the more outliers there are, the further the average drifts from the center.

Gaussian and Gamma Distributions

Therefore, for these non-linear distributions, the average is not a reasonable indicator; we can use the median instead to describe the overall distribution. There are many tools for dealing with such distributions, one of which is the box plot. As shown in the figure below, each distribution is abstracted into a box and several lines: the line in the middle of the box is the median, and the box edges are the first and third quartiles. This way, you can quickly get a general sense of a distribution without much complicated analysis.
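A small numpy sketch makes the point; the distribution parameters below are arbitrary and chosen only for illustration.

```python
# Compare mean, median, and box-plot quartiles on a symmetric (normal) sample
# versus a skewed (gamma) sample. Parameters are illustrative only.
import numpy as np

rng = np.random.default_rng(42)
samples = {
    "normal": rng.normal(loc=50, scale=10, size=100_000),
    "gamma": rng.gamma(shape=2.0, scale=20.0, size=100_000),
}

for name, sample in samples.items():
    q1, median, q3 = np.percentile(sample, [25, 50, 75])  # box-plot quartiles
    print(f"{name:>6}: mean={sample.mean():6.1f}  median={median:6.1f}  "
          f"Q1={q1:6.1f}  Q3={q3:6.1f}")

# The normal sample's mean and median nearly coincide, while the gamma
# sample's mean is pulled above its median by the long right tail.
```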

Talking Data: Why is Data Governance so Important in Digital Transformation?

· 5 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

"Organisms are living upon negentropy."--What is Life? --by Erwin Schrödinger

In today's Internet era, data is an important asset for enterprises. We generate data all the time: every time we open a mobile app, place an order online, or even drive through a traffic light, data is generated. Data is everywhere, and this is especially true inside enterprises. With so much raw data and increasingly mature data analysis technology, entrepreneurs are excited: this looks like a gold mine piling up for their companies. However, things are not as simple as they seem. It is not easy to extract valuable treasure from these messy "gold mines," which often look more like garbage dumps. In my previous article Talking Data: What do we need for engaging data analytics?, the concept of Data Governance was mentioned. This article looks at how data governance creates value out of the chaos of enterprise data, from the perspective of enterprise data management.

Isolated Data Island

Large and medium-sized enterprises (generally more than 100 people, with multiple departments) run into management chaos as their business grows fast. The sales department keeps its own sales statistics, typically as large, scattered Excel files; the IT department manages the inventory systems by itself; the HR department maintains its own personnel roster. This situation leads to many headaches: the boss complains that management reports arrive only once a week; managers look at figures jumping up and down in those reports and doubt the data's integrity; employees work overtime to pull the data together, only to be questioned about its quality. Sound familiar? These are very common issues in companies, and the direct cause is the so-called Isolated Data Island problem.

Isolated Data Island

The main reason for isolated data islands is that data from different departments or teams is disconnected. Often, because the business is growing rapidly, a team needs a data system quickly but cannot develop one in time, so it falls back on Excel or some other fast, effective data-entry tool to keep operations running. As the business grows, these stopgap workarounds spawn new internal processes, and bottlenecks gradually emerge once a certain scale is reached, especially when integration with other departments or external systems is required.