7 posts tagged with "GitHub"

GitHub platform and version control

Unattended AI Programming: My Experience Using GitHub Copilot Agent for Content Migration

· 7 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

Recently, I successfully used GitHub Copilot Agent to migrate all my archived markdown articles to this Docusaurus-based blog, and the experience was surprisingly smooth and efficient. What impressed me most wasn't just the AI's ability to handle repetitive tasks, but also how I could guide it to work autonomously while I focused on higher-level decisions. Even more fascinating was that I could review and guide the AI agent's work using my phone during commutes or breaks. This experience fundamentally changed my perspective on AI-assisted development workflows.

Here's a look at the bilingual blog after the migration was complete:

Figure 1: Migration results overview (Chinese)

Figure 2: Migration results overview (English)

Stanford University Study Reveals Real Impact of AI on Developer Productivity: Not a Silver Bullet

· 8 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

This article is based on a presentation by Stanford University researcher Yegor Denisov-Blanch at the AIEWF 2025 conference, which analyzed real data from nearly 100,000 developers across hundreds of companies. Those interested can watch the full presentation on YouTube.

Recently, claims that "AI will replace software engineers" have been gaining momentum. Meta's Mark Zuckerberg even stated earlier this year that he plans to replace all mid-level engineers in the company with AI by the end of the year. While this vision is undoubtedly inspiring, it also puts pressure on technology decision-makers worldwide: "How far are we from replacing all developers with AI?"

The latest findings from Stanford University's software engineering productivity research team provide a more realistic and nuanced answer to this question. After in-depth analysis of nearly 100,000 software engineers, over 600 companies, tens of millions of commits, and billions of lines of private codebase data, this large-scale study shows that artificial intelligence does indeed improve developer productivity, but it is far from a one-size-fits-all solution: its impact is highly contextual and nuanced. While average productivity increased by about 20%, in some cases AI can even be counterproductive and reduce productivity.

Crawlab AI: Building Intelligent Web Scrapers with Large Language Models (LLM)

· 6 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

"If I had asked people what they wanted, they would have said faster horses" -- Henry Ford

Preface

When I first entered the workforce as a data analyst, I stumbled upon the ability of web crawlers to automatically extract webpage data, and I have been fascinated by this technology ever since. As I dug deeper into web scraping, I gradually came to understand its core techniques, including web parsing: analyzing a page's HTML structure to build data extraction rules based on XPath or CSS Selectors. This process has long required manual work. While relatively simple for scraping engineers, it becomes very time-consuming when extraction has to be done at scale, and every change in page structure adds to crawler maintenance costs. This article introduces my LLM-based intelligent web scraping product, Crawlab AI. Although it's still in early development, it has already shown great potential and promises to make data acquisition easy for data practitioners.
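To make the manual process above concrete, here is a minimal sketch (not part of Crawlab AI) of the kind of hand-written parsing rules described, using the parsel library; the URL, CSS classes, and field names are hypothetical:

```python
import requests
from parsel import Selector

# Hypothetical target page; in practice such rules are written per site.
html = requests.get("https://example.com/articles", timeout=10).text
sel = Selector(text=html)

articles = []
for item in sel.css("div.article"):                    # CSS Selector rule
    articles.append({
        "title": item.xpath(".//h2/a/text()").get(),   # XPath rule
        "url": item.xpath(".//h2/a/@href").get(),
        "date": item.css("span.date::text").get(),
    })

print(articles)
```

Every rule here is tied to one site's HTML structure, which is exactly the maintenance burden an intelligent scraper aims to remove.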

As the founder of the web scraping management platform Crawlab, I've always been passionate about making data acquisition simple and easy. Through constant communication with data practitioners, I realized the massive demand for intelligent scrapers (or universal scrapers) that can extract target data from any website without manually written parsing rules. Of course, I'm not the only one researching this problem: in January 2020, Qingnan released GeneralNewsExtractor, a universal article parsing library based on punctuation density that can implement a universal news crawler in 4 lines of code; in July 2020, Cui Qingcai released GerapyAutoExtractor, which extracts list-page data using SVM algorithms; in April 2023, I developed Webspot, which automatically extracts list pages through high-dimensional vector clustering. The main problem with these open-source tools is that their recognition accuracy still falls short of manually written crawler rules.
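For reference, the "4 lines of code" usage of GeneralNewsExtractor mentioned above looks roughly like this, based on the library's documented API (verify against the current version; the URL is a placeholder):

```python
import requests
from gne import GeneralNewsExtractor

# Fetch any news article page (placeholder URL).
html = requests.get("https://example.com/news/article-1", timeout=10).text

extractor = GeneralNewsExtractor()
result = extractor.extract(html)
print(result)  # e.g. {'title': ..., 'author': ..., 'publish_time': ..., 'content': ...}
```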

Additionally, commercial scraping software Diffbot and Octoparse have also implemented some universal data extraction functionality through proprietary machine learning algorithms. Unfortunately, their usage costs are relatively high. For example, Diffbot's lowest plan requires a monthly subscription fee of $299.

Exploring Crawlab: Your New Enterprise Web Scraping Management Choice

· 3 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

In the modern data-driven era, acquiring and managing online information has become crucial. To provide powerful support for enterprises and developers, Crawlab has emerged as an enterprise-level web scraping management platform that works out of the box. Regardless of your team size, Crawlab can provide a professional and efficient web scraping management solution.

Core Features

Crawlab's core features include distributed system management, spider task management and scheduling, file editing, message notifications, dependency management, Git integration, and performance monitoring, among others. Its distributed node management allows spider programs to run efficiently across multiple servers. No more worrying about manual uploading, monitoring, and deployment hassles - Crawlab automates all of this, ensuring you can easily schedule spider tasks and view spider program running status and task logs in real-time.

Spider List

Key Highlights

Practical Data Analysis: Open Source Automated Data Exploration Tool Rath

· 5 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

Exploratory Data Analysis (EDA) is a task that data analysts and data scientists frequently face when working with datasets. Python tools like Pandas and Seaborn make it easy to perform univariate, bivariate, and multivariate analysis, but using them for data exploration not only carries a technical barrier but also requires manually writing scripts for each data operation and analysis. This article introduces Rath, a very cool open source automated data exploration tool that can complete EDA automatically and act as the Autopilot or Copilot of the data analysis world.
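For context, this is roughly the kind of manual Pandas/Seaborn scripting that Rath aims to automate; a minimal sketch assuming a hypothetical sales.csv with revenue and ad_spend columns:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical dataset: any CSV with numeric columns will do.
df = pd.read_csv("sales.csv")

# Univariate analysis: distribution of a single numeric column.
sns.histplot(df["revenue"], kde=True)
plt.show()

# Bivariate analysis: relationship between two columns.
sns.scatterplot(data=df, x="ad_spend", y="revenue")
plt.show()

# Multivariate analysis: pairwise relationships across all numeric columns.
sns.pairplot(df.select_dtypes("number"))
plt.show()
```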

Rath

Installing Rath

Since Rath is still in rapid iteration and its documentation isn't very complete, the fastest way to experience it is through the demo website provided on the official site.

However, if you know some frontend technology, you can still install it locally, though the steps are slightly more cumbersome.

Before starting, ensure you have Node.js 16 and Yarn installed.

Practical Data Analysis: Building a Self-Service Data Analytics Platform with Open Source Superset

· 7 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

Data Analytics and Business Intelligence are important building blocks for many enterprises implementing a digital strategy. In a previous article, 《浅谈数据:数据领域需要掌握些什么?》 ("A Brief Look at Data: What Do You Need to Master in the Data Field?"), we introduced an indispensable part of the data field: the software tools that support its architecture and processes. The open source data analytics platform Apache Superset introduced in this article provides exactly this kind of service. This article will briefly cover how to install, deploy, and use Superset.

Superset Official Site

Superset Introduction

Superset is an open source self-service data analytics platform incubated by the Apache Foundation. It can be seen as an open source alternative to Power BI or Tableau, though Superset's interactive interface is web-only. The system is built on Python Flask and integrates with mainstream relational databases such as MySQL, Postgres, and SQL Server, as well as modern engines like Elasticsearch, ClickHouse, and Snowflake. The frontend visualization and analysis interface is very similar to Power BI and Tableau and relatively simple to operate. Therefore, if you need to build an enterprise-level data analytics platform like Power BI or Tableau without spending money, Superset is an excellent choice.

Superset Dashboard

CI/CD in Action: How to Use Microsoft's GitHub Actions the Right Way?

· 6 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

GitHub Actions is the official CI/CD workflow service provided by GitHub. It aims to make operational maintenance easy for open-source project contributors and to help open-source communities embrace cloud-native DevOps. GitHub Actions is integrated into most of my open-source projects, including Crawlab and ArtiPub. As a contributor, I find GitHub Actions not only easy to use but also free (which is the most important part). I hope this article helps open-source contributors who are not yet familiar with GitHub Actions get practical ideas on how to use it and make an impact.

Starting from documentation

For those who are not familiar with GitHub Actions, it is strongly recommended that you read the official documentation first, where you can find an introduction video, a quick start, examples, core concepts, how it works, and more. If you read through the docs, you can easily apply your own CI/CD experience to GitHub DevOps. References for all the code in this article can be found in the official documentation.

GitHub Actions Docs

Ideas

Let's first figure out what we would like to implement: using GitHub Actions to run a web crawler that fetches the daily ranking from GitHub Trending.
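Such a crawler could look roughly as follows; this is a sketch rather than the article's own code, and the CSS selectors are assumptions about GitHub Trending's current HTML layout:

```python
import requests
from bs4 import BeautifulSoup

def fetch_trending(language: str = "") -> list[str]:
    """Return today's trending repositories as 'owner/repo' strings."""
    url = f"https://github.com/trending/{language}"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    repos = []
    # Each trending entry is assumed to sit in an <article class="Box-row">
    # with the repository link inside its <h2> heading.
    for link in soup.select("article.Box-row h2 a"):
        repos.append(link["href"].strip("/"))
    return repos

if __name__ == "__main__":
    for rank, repo in enumerate(fetch_trending(), start=1):
        print(f"{rank}. {repo}")
```

In GitHub Actions, a script like this would typically be triggered on a schedule (a cron expression in the workflow file) so the ranking is collected once a day.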