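A 502 caused by a single space

Symptom:
Today, requests to https://login.demo.com/ajax-user.html returned 502 Bad Gateway, while every other page under https://login.demo.com/*.html worked normally. No code had been deployed recently, this part had been tested and verified before, and it had been running in production all along. So I suspected that ops had "tampered" with something on nginx. When asked, ops said that the only change made the day before was replacing F5 with LVS + HAProxy, which should not affect functionality. In principle that is a fair point: 1. the other pages work correctly; 2. the change is entirely at the TCP level and should not affect application behaviour.

But it had definitely worked before.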

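Locating the problem:
1. First bypass LVS and HAProxy and use curl against ip:port to hit nginx directly. Result: normal.
2. Review the HAProxy configuration. Result: normal.
3. Send the request directly to HAProxy. Result: abnormal.

The suspicion: nginx returns something that HAProxy cannot parse.

Compare the normal and the abnormal responses.

The headers of the abnormal response contain:
'Access-Control-Allow-Origin : http://demo.com'
There is an extra space before the colon. Without it, the header should look like this:
'Access-Control-Allow-Origin: http://demo.com'

Verification:
curl -x ip:port -H 'key : value' 'url' through HAProxy. Result: abnormal.

So the problem was caused by this tiny space. F5 was not this strict, whereas HAProxy is stricter and reports an error. HTTP/1.1 (RFC 7230, section 3.2.4) does not allow whitespace between the header field name and the colon, so HAProxy is within its rights.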

So when using statements such as

header("Cache-Control: no-cache, must-revalidate");
header("Expires: Sat, 26 Jul 1997 05:00:00 GMT");

make sure there is no space before the ":", to avoid falling into this trap.
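To make the rule concrete, here is a minimal sketch in C of the kind of check a strict proxy could apply to a header line. The function name header_name_is_strict is hypothetical and is not taken from HAProxy; it only illustrates "no whitespace between the field name and the colon".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns true when the header field name is
 * immediately followed by ':' with no space or tab in between. */
bool header_name_is_strict(const char *line)
{
    assert(line != NULL);

    const char *colon = strchr(line, ':');   /* first colon ends the name */
    if (colon == NULL || colon == line) {
        return false;                        /* no colon, or empty name */
    }
    for (const char *p = line; p < colon; p++) {
        if (*p == ' ' || *p == '\t') {
            return false;                    /* whitespace in or after the name */
        }
    }
    return true;
}
```

With this check, 'Access-Control-Allow-Origin: http://demo.com' passes and 'Access-Control-Allow-Origin : http://demo.com' is rejected.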

nginx location matching rules (repost)

Syntax

location [=|~|~*|^~] /uri/ { … }

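Symbol   Meaning
=        exact match
^~       the URI starts with the given plain string; think of it as a prefix match on the URL path. nginx matches against the decoded URI, so a request for /static/%20/aa can be matched by the rule ^~ "/static/ /aa" (note the space)
~        case-sensitive regular expression match
~*       case-insensitive regular expression match
!~ and !~*   negated case-sensitive and case-insensitive regular expression matches. Note that nginx accepts these two operators in if conditions only, not as location modifiers
/        generic match: any request matches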

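When several locations are configured, the matching order is as follows (taken from reference material and not verified in practice; try it and see, this is only a guide):

  • First, = (exact match)
  • Then ^~
  • Then regular expressions, in the order they appear in the file
  • Finally the generic / match
  • As soon as a match succeeds, matching stops and the request is handled by that rule

Example, given the following rules: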

location = / {
   # rule A
}
location = /login {
   # rule B
}
location ^~ /static/ {
   # rule C
}
location ~ \.(gif|jpg|png|js|css)$ {
   # rule D
}
location ~* \.png$ {
   # rule E
}
location !~ \.xhtml$ {
   # rule F
}
location !~* \.xhtml$ {
   # rule G
}
location / {
   # rule H
}

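The resulting behaviour is:

A request for http://localhost/category/id/1111 ends up matching rule H, because none of the rules above match. At this point nginx should forward the request to a backend application server such as FastCGI (PHP) or Tomcat (JSP), with nginx acting as a reverse proxy.

So in practice, I think at least three location rules should be defined: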

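# Match the site root directly. The home page is requested frequently through the domain name, and the official documentation says this speeds up processing.
# Here the request is forwarded straight to the backend application server; a static home page would work as well.
# First required rule
location = / {
    proxy_pass http://tomcat:8080/index;
}
# The second required rule handles static file requests, which is where nginx shines as an HTTP server.
# There are two styles, matching by directory or by suffix; use either one or both.
location ^~ /static/ {
    root /webroot/static/;
}
location ~* \.(gif|jpg|jpeg|png|css|js|ico)$ {
    root /webroot/res/;
}
# The third rule is the generic one, which forwards dynamic requests to the backend application server.
# Anything that is not a static file is treated as a dynamic request; adjust to your actual situation.
# After all, with today's popular frameworks, URLs ending in .php or .jsp are rare.
location / {
    proxy_pass http://tomcat:8080/;
}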

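Rewrite syntax

  • last – the flag used in most cases
  • break – stops rewriting; no further rules are matched
  • redirect – returns a temporary redirect, HTTP status 302
  • permanent – returns a permanent redirect, HTTP status 301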

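1. Expressions that can be used in conditions:

-f and !-f test whether a file exists
-d and !-d test whether a directory exists
-e and !-e test whether a file or directory exists
-x and !-x test whether a file is executable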

2. Global variables that can be used in conditions

Example: http://localhost:88/test1/test2/test.php
$host: localhost
$server_port: 88
$request_uri: /test1/test2/test.php
$document_uri: /test1/test2/test.php
$document_root: D:\nginx/html
$request_filename: D:\nginx/html/test1/test2/test.php

Redirect syntax

server {
    listen 80;
    server_name start.igrow.cn;
    index index.html index.php;
    root html;
    if ($http_host !~ "^star\.igrow\.cn$") {
        rewrite ^(.*) http://star.igrow.cn$1 redirect;
    }
}

Hotlink protection

location ~* \.(gif|jpg|swf)$ {
    valid_referers none blocked start.igrow.cn sta.igrow.cn;
    if ($invalid_referer) {
       rewrite ^/ http://$host/logo.png;
    }
}

Setting the expiry time by file type

location ~* \.(js|css|jpg|jpeg|gif|png|swf)$ {
    if (-f $request_filename) {
        expires 1h;
        break;
    }
}

Denying access to a directory

location ~* \.(txt|doc)$ {
    root /data/www/wwwroot/linuxtone/test;
    deny all;
}

PHP strings (repost)

Introduction

String management has always been a "problem" to consider when designing a C program. If you think about a tiny self-contained C program, just don't bother with strings: use libc functions, or if you need to support Unicode, use libraries such as ICU.
You may also use libraries such as the well-known bstring, the APR strings or even glib strings. Many libraries exist, and once you are a senior C developer, you can easily build your own library fitting your exact needs.

In C, strings are simple NULL-terminated char arrays, as you know. However, when designing a scripting language such as PHP, the need to manage those strings arises. By management, we mean classical operations such as concatenation, extension or truncation, and eventually more advanced concepts such as special allocation mechanisms, string interning or string compression. Libc covers the easy operations (concat, search, truncate), but complex operations have to be developed by yourself.

Let's see together how strings are implemented in PHP 5, and the main differences with PHP 7.

The PHP 5 way

In PHP 5, strings don't have their own C structure. Yes I know, that may seem extremely surprising, but that's the case.
We keep playing with traditional C NULL-terminated char arrays, also often written as char *.
However, we support what is called "binary strings": strings that can embed the NULL character.
Strings embedding the NULL char can't be passed directly to the classical libc string functions anymore; they need to be handled by functions that take the length of the string into account (such functions exist in libc as well).

Hence PHP 5 memorizes the size of the string, together with the string itself (the char *).
Obviously, as PHP doesn't support Unicode natively, the string stores each C ASCII character and the length stores the number of characters, as we always assume one char = one byte: the plain 50-year-old C concept of a "string". If one graphical character (what Unicode calls a "grapheme") were to be stored in more than one byte, then every concept presented here would simply be wrong. We always assume that one graphical character equals one byte (no Unicode support). Also, remember that char * buffers can contain any byte, not only printable characters.

One char equals one byte. This statement is always true, worldwide, whatever the machine / the platform. When not talking about Unicode but plain ASCII, one char = one printable character.
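As a small illustration of why the length has to travel with the buffer, here is a sketch in C. The counted_str type and both helper names are hypothetical stand-ins for the (char *, length) pair; they are not PHP's own definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the (char *, length) pair PHP 5 keeps. */
typedef struct {
    const char *val;   /* buffer, may contain embedded NUL bytes */
    size_t      len;   /* number of bytes, not counting the final NUL */
} counted_str;

/* Length as classical libc sees it: stops at the first NUL byte. */
size_t libc_length(counted_str s)
{
    assert(s.val != NULL);
    return strlen(s.val);
}

/* Binary-safe comparison: relies on the stored length, not on NUL. */
int counted_equal(counted_str a, counted_str b)
{
    return a.len == b.len && memcmp(a.val, b.val, a.len) == 0;
}
```

For a value built from the literal "foo\0bar" with a stored length of 7, libc_length() reports 3 while the stored length still says 7. That is why a binary string can only be handled correctly through length-aware functions such as memcmp().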

So, we end up with something like:

typedef union _zvalue_value {
    long lval;
    double dval;
    struct {
        char *val;     /* C string buffer, NULL terminated */
        int len;      /* String length : num of ASCII chars */
    } str;            /* string structure */
    HashTable *ht;
    zend_object_value obj;
    zend_ast *ast;
} zvalue_value;

I lied a little bit when I said PHP doesn't manage strings using its own structure. In fact, in PHP 5, a string is used in the str field of the zval (the PHP variable container).
Graphically, this could give :

[Figure: strings_php5]

Problems in the PHP 5 way

This model has several problems, some are addressed in later PHP 5 versions, and others in the new PHP 7 implementation we'll talk about later.

Integer size

The first thing is that we store the length of the string in an integer, which is platform dependent.
On LP64 (most Linux/Unix systems), an integer weighs 4 bytes, but on ILP64 (SPARC64), it weighs 8 bytes. Things are much the same for the 32-bit variants. So, depending on your platform, PHP will behave differently. Also, though I assume this is pretty uncommon, you can't store in a PHP string anything larger than what the integer can count: on Linux LP64, you can't have strings whose length exceeds 2^31, even though your CPU is 64 bits wide.

No uniform structure

Second problem: as soon as we don't use the zval container (and its str field), we end up being back in classical C and managing strings the traditional way.
As we support binary strings nearly everywhere in the engine, it is not rare, in PHP 5, to see the same kind of structure again, but taken out of the zval container.
Examples :

struct _zend_class_entry {
    char type;
    const char *name;
    zend_uint name_length;
...

The code above shows a snippet of how a PHP class is represented internally: a zend_class_entry structure.
As you can see, it also has its own definition of a string: a char *name, and a zend_uint name_length that stores the length.

What about hashtables ?

typedef struct _zend_hash_key {
    const char *arKey;
    uint nKeyLength;
    ulong h;
} zend_hash_key;

Here again, this zend_hash_key structure is used often when it comes to playing with hash tables (a very, very common use case); and here again, we notice that the concept of "PHP string" is duplicated once more: a const char *arKey and its uint nKeyLength.

So to sum up this second problem, there is no unified way of representing a string in PHP 5. The concept is always the same: a char * buffer and its length; but this concept is not "globally shared and assumed" across the PHP source code, so a lot of code duplication happens, and it leads to one last problem.

Memory usage

The last problem is memory consumption.
If PHP meets the same string twice during its life, it will likely store that string twice in memory, or even more than twice.
Add up every piece of string you can think of, and you'll notice that this is not so little: memory consumption suffers from the number of different copies of the same string in memory.

For example, if two internal layers communicate with each other, the top layer passes a string to the bottom layer. If that bottom layer wants to keep the string for itself, to reuse it later for example (with no intention to modify it), well, in PHP 5 it has no other way than copying the whole string.
It can't keep the pointer, because the layer above is free to free the pointer whenever it wants to; and also, the bottom layer would like to free that string whenever it wants to, without having the top layer crash in a use-after-free situation.

Just one quick and easy example to illustrate :

static PHP_FUNCTION(session_id)
{
    char *name = NULL;
    int name_len, argc = ZEND_NUM_ARGS();

    if (zend_parse_parameters(argc TSRMLS_CC, "|s", &name, &name_len) == FAILURE) {
        return;
    }

    /* ... ... */

    if (name) {
        if (PS(id)) {
            efree(PS(id));
        }
        PS(id) = estrndup(name, name_len);
    }
}

This is the session_id() PHP function. It is given a string as input (name), and must store it internally in the session module (into PS(id)), to remember and use it later.
This is done by fully duplicating the passed string (the function is estrndup() , this function duplicates a string into memory).
Why duplicate it ? Because if we change the session id again, well, we'll free the last id, but if this last id string were used elsewhere (in your PHP variables), then you will crash reading free'ed memory.
Opposite problem : if we just store the pointer into the session module without duplicating the string it points to, what happens if you - PHP user - destroy the variable this $id was stored in ? We'll end up in the exact same (upside down) situation : Session module is going to use a free()'ed pointer, and then is going to crash.

Those situations, where strings are duplicated just to be kept hot in memory for further usage, happen often in the PHP source code, and if you dump the heap memory of a PHP process in the middle of its lifetime, you will notice lots of memory bytes that contain the exact same string. This is a pure waste of main memory.

To solve this last problem, PHP 5.4 added a well-known yet clever concept to the recipe: interned strings.
PHP 7, on its side, deeply reworked the global "string" concept to finally reach a very consistent and memory-fair solution.

Before PHP 5.4, there was neither a solution nor any consistency regarding the management of strings in memory. This resulted in poor performance, both in terms of CPU and memory usage, especially when the web application is big.

Solutions implemented in PHP 5 for strings management

PHP 5 tried to address the memory consumption problem, and managed to find a clever solution to it.
Every other problem described in the previous section is only solved by PHP 7, because the solutions require massive breaks in the PHP source code and massive code rewrites.

Interned strings

Before PHP 5.4, every problem related to string management is present.
Starting from PHP 5.4, the concept of "interned strings" was implemented and resulted in less memory consumption, thus solving one of the problems exposed above (the biggest one in my opinion).

But, as you will see, interned strings require sharing a global buffer, something which is by definition not thread safe.
Thus, at the moment, interned strings are not supported by ZTS PHP.

Interned strings are fully disabled if you run PHP in ZTS mode. You'll then suffer from memory waste in string management compared to non-ZTS.

What are interned strings ?

If you use your preferred search engine with those terms ("string interning"), you'll land on many pages defining interned strings. That means that we are facing a general solution here, one that is implemented in other programming languages such as Python or Java, and even in BIG stand-alone software such as your favorite IDE or your favorite video games (yes, the latter are "just" some BIG C++ programs).

Interning basically says that the same string (e.g. "bar") should never be stored more than once in memory, for one process. Easy enough, isn't it?
But, how did PHP implement this concept, back in PHP 5.4 ?
Let's see that together.

Interning a string is ensuring this string will never be stored in memory more than once per process. For software managing big strings, or many strings, this can represent a nice memory saving as well as better performance regarding any string manipulation.

The concept is easy. Whenever we meet a string, instead of creating it with a classical malloc() (we are assuming dynamic strings that can't really be stack allocated), we store the string into a bound buffer and add it to a dictionary (which is a hashtable). If the string is already present in the dictionary, the API returns its pointer, effectively preventing us from creating yet another copy of such a string in memory.

However, only persistent strings can stay in this interned strings buffer for the whole life of PHP. As you know, PHP frees request-bound memory between requests, so interned strings created while a request is being handled must be released at the end of that request. To accomplish that, we use a snapshot: at the end of the current request, the buffer is reset to the position it had when the request started. So interned strings do work for request-scope allocations, but they are less efficient than "persistent" strings that get allocated at PHP startup and reused until PHP dies, because request-allocated interned strings are freed at the end of the current request and thus have a shorter lifecycle.
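The idea of "a bound buffer plus a dictionary" can be sketched in a few lines of C. This is a deliberately simplified illustration and not PHP's implementation: the names intern, is_interned, ARENA_SIZE and MAX_ENTRIES are hypothetical, the buffer is tiny, and a linear scan stands in for the hashtable lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ARENA_SIZE  4096   /* hypothetical size for this sketch */
#define MAX_ENTRIES 64     /* hypothetical cap on distinct strings */

static char   arena[ARENA_SIZE];   /* the bound buffer */
static size_t arena_top;           /* first free byte in the buffer */
static struct { const char *str; size_t len; } entries[MAX_ENTRIES];
static size_t entry_count;

/* A pointer is interned if its address lies inside the buffer bounds. */
int is_interned(const char *s)
{
    uintptr_t p = (uintptr_t)s, start = (uintptr_t)arena;
    return p >= start && p < start + ARENA_SIZE;
}

/* Returns the shared copy of (s, len), or s itself when there is no room. */
const char *intern(const char *s, size_t len)
{
    assert(s != NULL);
    if (is_interned(s)) {
        return s;                      /* already the shared copy */
    }
    /* Linear scan for brevity; a real dictionary would be a hashtable. */
    for (size_t i = 0; i < entry_count; i++) {
        if (entries[i].len == len && memcmp(entries[i].str, s, len) == 0) {
            return entries[i].str;
        }
    }
    if (entry_count == MAX_ENTRIES || len + 1 > ARENA_SIZE - arena_top) {
        return s;                      /* buffer full: nothing is interned */
    }
    char *copy = arena + arena_top;    /* carve the copy out of the buffer */
    memcpy(copy, s, len);
    copy[len] = '\0';
    arena_top += len + 1;
    entries[entry_count].str = copy;
    entries[entry_count].len = len;
    entry_count++;
    return copy;
}
```

Interning "bar" from two different source buffers returns the same pointer both times, so later equality checks can compare addresses instead of bytes.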

Here is how we prepare the buffer, at the very early startup stage of PHP (truncated) :

void zend_interned_strings_init(TSRMLS_D)
{
    size_t size = 1024 * 1024;

    CG(interned_strings_start) = malloc(size);

    CG(interned_strings_top) = CG(interned_strings_start);
    CG(interned_strings_snapshot_top) = CG(interned_strings_start);
    CG(interned_strings_end) = CG(interned_strings_start) + size;

    zend_hash_init(&CG(interned_strings), 0, NULL, NULL, 1);

    CG(interned_strings).nTableMask = CG(interned_strings).nTableSize - 1;
    CG(interned_strings).arBuckets = (Bucket **) pecalloc(CG(interned_strings).nTableSize, sizeof(Bucket *), CG(interned_strings).persistent);
}

I did truncate some code to ease understanding. What you must spot first is that the interned string buffer is 1 MB large (1024*1024) and cannot be changed by the PHP user (there is no INI setting). When this buffer is full, it will NOT be resized, and from that point on the interned strings API will behave as if there were no interned strings: we fall back to classical malloc()ed strings.

Now, to create an interned string, we internally use zend_new_interned_string() , and not malloc() or strdup() or anything else :

#define IS_INTERNED(s) \
    (((s) >= CG(interned_strings_start)) && ((s) < CG(interned_strings_end)))

static const char *zend_new_interned_string_int(const char *arKey, int nKeyLength, int free_src TSRMLS_DC)
{
    ulong h;
    uint nIndex;
    Bucket *p;

    if (IS_INTERNED(arKey)) {
        return arKey;
    }

    h = zend_inline_hash_func(arKey, nKeyLength);
    nIndex = h & CG(interned_strings).nTableMask;
    p = CG(interned_strings).arBuckets[nIndex];
    while (p != NULL) {
        if ((p->h == h) && (p->nKeyLength == nKeyLength)) {
            if (!memcmp(p->arKey, arKey, nKeyLength)) {
                if (free_src) {
                    efree((void *)arKey);
                }
                return p->arKey;
            }
        }
        p = p->pNext;
    }
    /* ... ... */

As you can see, this API immediately returns the string you want to intern if it already lives in the interned string buffer.
If not, it looks up the interned string hashtable to find out whether the same string is already stored in it. This is a heavy operation, as the hashtable needs to be browsed and the string needs to be compared byte by byte. Both operations are likely to evict data from your L1 CPU cache, and possibly from your L2 cache as well; but we only create interned strings when this is necessary and will pay off later (heavily used strings). This is a tradeoff between the work needed to create the string (heavy) and the later work done to fetch this string back and use it.
If the same string is found in the hashtable, its pointer is returned, and the original string buffer can also be freed by the API if we asked for it by passing '1' as last parameter.

Let's continue :

if (CG(interned_strings_top) + ZEND_MM_ALIGNED_SIZE(sizeof(Bucket) + nKeyLength) >=
    CG(interned_strings_end)) {
    /* no memory */
    return arKey;
}

Like I said, if the memory buffer for storing the strings is full, the API returns the pointer you passed to it, effectively doing nothing.
Let's end :

p = (Bucket *) CG(interned_strings_top);
CG(interned_strings_top) += ZEND_MM_ALIGNED_SIZE(sizeof(Bucket) + nKeyLength);
h = zend_inline_hash_func(arKey, nKeyLength);

p->arKey = (char*)(p+1);
memcpy((char*)p->arKey, arKey, nKeyLength);
if (free_src) {
    efree((void *)arKey);
}
p->nKeyLength = nKeyLength;
p->h = h;
p->pData = &p->pDataPtr;
p->pDataPtr = p;

/* ... ... */

return p->arKey;

And finally, the string is duplicated (memcpy()) from the pointer you provided (arKey) into the HashTable, whose Bucket is allocated from the interned string buffer itself.

As you can see, there is nothing really complicated, just some clever tricks to make any future manipulation of the related string more efficient.
For example, the string hash (h variable) is computed every time we intern a string. This hash will be needed any time the string is used as a key in a hashtable, and this is very likely to happen, so we prefer eating some more CPU cycles now (by computing the hash) than later at runtime, when performance will be critical to the user.

Now we must think about string destruction. As you probably understood, interned strings are shared pointers, and thus should never be freed randomly by someone, because every other place where the pointer is used would then run into a use-after-free bug.

Thus, when we are given a string and want to free it, we must first ask if this latter is interned.
For this, we don't use efree() (the free() equivalent in PHP's source), but str_efree(), which takes care of interned strings.

#define str_efree(s) do { \
        if (!IS_INTERNED(s)) { \
            efree((char*)s); \
        } \
    } while (0)

We can see that interned strings are effectively NOT destroyed when asked for (because they are shared and used elsewhere).

Also, if you want to duplicate a string pointer for read-only purposes, use str_estrndup() instead of estrndup() (which duplicates the string in memory in any case). Look:

#define str_estrndup(str, len) \
    (IS_INTERNED(str) ? (str) : estrndup((str), (len)))

Obviously, interned strings are shared strings and thus read-only strings. Any time you want to modify one of these strings for your own usage, you are required to fully duplicate it in memory and work on your copy. But often the string is used in a read-only manner, so there should be no need at all to duplicate it in memory if the string is interned (this is the concept).

Interned strings are both shared + read-only concepts, don't attempt to free() an interned string, nor to modify it directly. The main API call zend_new_interned_string_int() returns a const char* to hint you. If you need to write to the interned string : create a private full copy before, and work on that copy (and efree() it when finished if heap-allocated).

Have you seen how cleverly interned strings are stored into memory ?
Here is a picture of what the memory layout could look like:

[Figure: interned_string_buffer_layout_php5]

Interned strings are stored into a well known bound buffer, so to check if a pointer holds an interned string, we just have to check if its address is inside the interned string buffer bounds :

#define IS_INTERNED(s) \
    (((s) >= CG(interned_strings_start)) && ((s) < CG(interned_strings_end)))

This is highly performant, as looking up a HashTable any time we want to know if a string is interned or not, is just a no-go for performance, and would slow down the language a lot, as strings are used everywhere into PHP's heart.

So, how to really free an interned string ? You don't do that as a PHP extension writer. The engine will free the whole interned string buffer at once, once it shuts down (end of PHP life, after having treated several requests) :

void zend_interned_strings_dtor(TSRMLS_D)
{
    free(CG(interned_strings).arBuckets);
    free(CG(interned_strings_start));
}

This is once more highly optimized, because the buffer is one contiguous range of addresses, and a single free operation destroys every interned string at once, instead of one free operation per string (like in PHP < 5.4).

For request-bound interned strings, the border of the string buffer is simply snapshotted when a request is about to come, and restored when the request has gone. Thus, every string created between both operations (during a request) is taken out of the buffer by simply moving its border (CG(interned_strings_top)).
At the beginning of the request, we snapshot the upper border of the interned strings buffer :

static void zend_interned_strings_snapshot_int(TSRMLS_D)
{
    CG(interned_strings_snapshot_top) = CG(interned_strings_top);
}

At the end of the request, we restore it, and delete any reference of the strings from our hashtable:

static void zend_interned_strings_restore_int(TSRMLS_D)
{
    Bucket *p;
    int i;

    CG(interned_strings_top) = CG(interned_strings_snapshot_top);

    for (i = 0; i < CG(interned_strings).nTableSize; i++) {
        p = CG(interned_strings).arBuckets[i];
        while (p && p->arKey > CG(interned_strings_top)) {
            CG(interned_strings).nNumOfElements--;
            if (p->pListLast != NULL) {
                p->pListLast->pListNext = p->pListNext;
            } else {
                CG(interned_strings).pListHead = p->pListNext;
            }
            /* ... */
        }
    }
}

Now, to have a concrete example of interned strings usage, let's have a look together at the PHP compiler (which allocates request-bound interned strings) :

void zend_do_begin_class_declaration(const znode *class_token, znode *class_name, const znode *parent_class_name TSRMLS_DC)
{
    /* ... ... */

    new_class_entry = emalloc(sizeof(zend_class_entry));
    new_class_entry->type = ZEND_USER_CLASS;
    new_class_entry->name = zend_new_interned_string(Z_STRVAL(class_name->u.constant), Z_STRLEN(class_name->u.constant) + 1, 1 TSRMLS_CC);
    new_class_entry->name_length = Z_STRLEN(class_name->u.constant);

    /* ... ... */

The code above is triggered when PHP compiles a user class, like class Bar { }.
As you can read, the name of the class is interned: zend_new_interned_string() is used. It is passed a pointer storing the name of the class as it comes from the parser (class_name->u.constant), and it returns a new pointer to the same string, but interned. Also, the old pointer is freed and becomes invalid (1 is passed as last parameter to zend_new_interned_string(), which means "free the original pointer so that I don't have to do it myself").

If we continue analyzing the compiler, we'll notice that it does the same thing for many concepts: function names, variable names, constant names, userland PHP strings and more advanced concepts, such as OPArray literals.

Here is what the string-related memory layout of some PHP script could look like :

<?php
class Bar {
    const FOO = "Bar";
    public function foo($var = "foo") {
        return "var";
    }
}

[Figure: memory_layout_compiler]

Each piece of string is effectively stored once and only once in memory, and they are all stored in the same contiguous buffer rather than scattered everywhere in memory, which improves CPU cache efficiency.

Summary of interned strings

Interned strings are a programmatic mechanism designed to store any piece of string (char *) only once in memory. This is a read-only, globally shared concept : one should not change an interned string (usually given as a const char* pointer to prevent that), nor free it.
Also, as the concept involves a global buffer, care should be taken when using a threaded environment. In PHP ZTS, interned strings are simply disabled.

Interned strings have many advantages :

  • They save memory - this is their first goal - and the savings can be huge on big applications, with tons of functions / classes / variables...
  • The hash of an interned string is computed once and for all, and reused when needed (and it is often needed). This saves lots of CPU cycles at PHP runtime.
  • Comparing two interned strings comes down to comparing two pointers, with no memory scanning at all, which is very fast.

And also drawbacks :

  • Creating an interned string often requires scanning a hashtable, and sometimes two string buffers (memcmp()); this effort must be worth the expected later gains.
  • Internal developers must remember that any string could be interned, and thus should never try to free() it directly (this will lead to a crash) nor modify it.
  • Because the compiler makes use of interned strings, it is slower, but this is likely to accelerate your runtime. You must use an OPCode cache solution to fully benefit from the advantages of interned strings.

The OPCache extension for PHP pushes the interned strings concept even further, by sharing the same interned strings buffer across several PHP processes, and by allowing the user to configure the size of the buffer (using an INI setting), whereas traditional PHP doesn't allow that.
You can read more about this on the dedicated blog post.

The PHP 7 way

PHP 7 changed many things in the way PHP manipulates strings internally.

Finally a real shared structure

PHP 7 finally centralized the concept of "strings" into PHP, by designing a structure which is used everywhere PHP uses strings :

struct _zend_string {
    zend_refcounted_h gc;
    zend_ulong        h;
    size_t            len;
    char              val[1];
};

3 things are noticeable in the structure above :

  • The length of the string is stored using a size_t type.
  • The real C string is not declared as a char* but as char[1].
  • It embeds a refcount : gc

As lengths are typed as size_t, they weigh the platform word size, whatever the platform. One of the PHP 5 problems is then solved: on a 64-bit CPU, the string length is stored on 8 bytes (64 bits) on the usual 64-bit platforms (this follows from how the C size_t type is defined).

The string is not stored in a char * but in a char[1]. This is a C trick called the "struct hack" (look for that term if needed). It allows us to allocate the string buffer together with the zend_string header, which saves one pointer and gives us one contiguous area of memory. Performance++

Notice that strings now embed their hash (h) by default. So we compute the hash for a given string only once (usually at compile time), and never after that. In PHP, mainly before interned strings (< 5.4), the same string hash was recomputed every time it was needed, which led to tons of CPU cycles burnt for nothing... Pre-computed hashes have improved overall PHP performance.
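To show what "allocating the string together with its header" means, here is a minimal sketch in C. The mini_string type and mini_string_new() are hypothetical miniatures written for this illustration; the sketch uses a C99 flexible array member (val[]) rather than the char[1] form, which is an assumption made for portability of the example and not what the structure above declares.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of a length-prefixed string. */
typedef struct {
    size_t len;
    char   val[];      /* the characters live right after the header */
} mini_string;

/* One malloc() covers both the header and the characters. */
mini_string *mini_string_new(const char *src, size_t len)
{
    assert(src != NULL);

    mini_string *s = malloc(offsetof(mini_string, val) + len + 1);
    if (s == NULL) {
        return NULL;
    }
    s->len = len;
    memcpy(s->val, src, len);
    s->val[len] = '\0';    /* keep it usable as a C string too */
    return s;
}
```

A single free(s) then releases the header and the characters at once, and reading s->val does not require following a second pointer.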

Strings are refcounted! In PHP 7, strings are refcounted (as well as many other primitive types). That means that interned strings are still relevant, but less so: PHP layers can now pass strings from one to the other, and as strings are refcounted, we can be sure that no one will accidentally free a string that is still used elsewhere (unless someone makes the mistake on purpose).

This is the concept behind "reference counting" (search the term if needed) : we now count every place where a string is used (stored), so that it will be freed when nobody uses it anymore.

Let's go back to our session module example. Here is its PHP 5 code recalled to you, followed by its PHP 7 code, just for the part managing strings (what we are looking after) :

/* PHP 5 */
static PHP_FUNCTION(session_id)
{
    char *name = NULL;
    int name_len, argc = ZEND_NUM_ARGS();

    /* ... */

    if (name) {
        if (PS(id)) {
            efree(PS(id));
        }
        PS(id) = estrndup(name, name_len);
    }
}

/* PHP 7 */
static PHP_FUNCTION(session_id)
{
    zend_string *name = NULL;
    int argc = ZEND_NUM_ARGS();

    /* ... */

    if (name) {
        if (PS(id)) {
            zend_string_release(PS(id));
        }
        PS(id) = zend_string_copy(name);
    }
}

As you can see, we use a zend_string structure in PHP 7, together with its API: zend_string_release() and zend_string_copy().

static zend_always_inline zend_string *zend_string_copy(zend_string *s)
{
    if (!ZSTR_IS_INTERNED(s)) {
        GC_REFCOUNT(s)++;
    }
    return s;
}
static zend_always_inline void zend_string_release(zend_string *s)
{
    if (!ZSTR_IS_INTERNED(s)) {
        if (--GC_REFCOUNT(s) == 0) {
            pefree(s, GC_FLAGS(s) & IS_STR_PERSISTENT);
        }
    }
}

Just a matter of refcounting : the string is passed from external world to PHP session module, and this latter keeps a reference to the string, never needing to fully copy it in memory, like PHP 5 does.
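The same copy/release pattern can be reproduced outside of PHP with a few lines of C. The rc_string type and its three functions are hypothetical names for this sketch; they ignore interned and persistent strings, which the real API handles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical refcounted string, loosely modelled on the idea above. */
typedef struct {
    unsigned refcount;
    size_t   len;
    char     val[];
} rc_string;

/* Creates a string owned by exactly one holder (refcount == 1). */
rc_string *rc_string_new(const char *src, size_t len)
{
    assert(src != NULL);
    rc_string *s = malloc(offsetof(rc_string, val) + len + 1);
    if (s == NULL) {
        return NULL;
    }
    s->refcount = 1;
    s->len = len;
    memcpy(s->val, src, len);
    s->val[len] = '\0';
    return s;
}

/* "Copying" only registers one more holder; no bytes are duplicated. */
rc_string *rc_string_copy(rc_string *s)
{
    assert(s != NULL);
    s->refcount++;
    return s;
}

/* Drops one holder; returns 1 when the memory was actually freed. */
int rc_string_release(rc_string *s)
{
    assert(s != NULL && s->refcount > 0);
    if (--s->refcount == 0) {
        free(s);
        return 1;
    }
    return 0;
}
```

A caller and a module can each hold the same pointer: the first release only decrements the counter, and the memory goes away when the last holder releases it.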

That is a huge step forward in string management in PHP.

PHP 7 added a real structure and API for string management internally. This is a huge step forward in consistency, memory savings and performances.

If you want to grab the zend_string API, it is inlined (for compilation performance reasons) and stored in zend_string.h.

Interned strings still matter

Like we saw, strings in PHP 7 are now reference counted, and that prevents us from needing to fully duplicate them when wanting to store a "copy" of a string from a layer to the other.

But, interned strings still matter. They are still used in PHP 7, and that works nearly the same way as in PHP 5; except that we don't use a special buffer anymore, because we can flag a zend_string as being interned (the structure now allows us to do so).

So, creating an interned string in PHP 7 means creating a zend_string and flagging it with IS_STR_INTERNED.
When releasing a zend_string using zend_string_release(), the API checks if the string is interned and just does nothing if it is the case.
The interned strings are destroyed barely the same way as in PHP 5, simply the process in PHP 7 is optimized thanks to new allocation and garbage collection mechanisms.

A heavy migration

You got it? Replacing every place in the PHP source code (~750K lines) where we used a char * / int pair with a zend_string structure and its API... was not an easy job.
You can find some huge commit diffs about that.

This could definitely not happen in PHP 5 source codebase, because zend_string simply breaks the ABI, and we don't break the ABI until major versions of PHP.

Migrating PHP extensions is not an easy task either: zend_string is not the only change in PHP 7, and many big centralized structures have changed significantly. As the ABI is broken between PHP 5 and PHP 7, you obviously have to rebuild / redownload your favorite extensions for PHP 7; PHP 5 ones won't load into PHP 7 at all.

Conclusions

You now have an overview of how PHP manages strings at its core. Most strings come from the PHP compiler, that is, from user PHP scripts. PHP 5, starting with 5.4, introduced interned strings, a concept meant to save memory by not duplicating strings in memory. Before PHP 5.4, string management was plainly missing in PHP.

Starting with PHP 7, PHP added a new structure and a nice reference-counting-based API for string management, resulting in even more memory savings and nice consistency across the language.

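Internal value representation in PHP 7, part 2 (translated)

The previous article discussed the major changes to variables between PHP 5 and PHP 7. To recap, the biggest change is that zvals are no longer individually allocated and no longer store a refcount themselves. Simple types such as integers and floats are stored directly in the zval, while complex types are still represented by a pointer to a separate structure.

All complex types share a common header, zend_refcounted: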

struct _zend_refcounted {
    uint32_t refcount;
    union {
        struct {
            ZEND_ENDIAN_LOHI_3(
                zend_uchar    type,
                zend_uchar    flags,
                uint16_t      gc_info)
        } v;
        uint32_t type_info;
    } u;
};

This header stores the refcount, the type of the value, cycle collection info (gc_info), and type-specific flags.
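The union lets the engine read the type, flags and gc_info either as separate fields or as one 32-bit type_info word. A rough sketch of that packing follows, assuming the fields are laid out from the low byte upwards (type in bits 0-7, flags in bits 8-15, gc_info in bits 16-31). The toy_* helpers are hypothetical and not part of the Zend API.

```c
#include <stdint.h>

/* Pack type (bits 0-7), flags (bits 8-15) and gc_info (bits 16-31)
 * into a single word. The bit positions are an assumption of this sketch. */
static uint32_t toy_pack_type_info(uint8_t type, uint8_t flags, uint16_t gc_info) {
    return (uint32_t)type | ((uint32_t)flags << 8) | ((uint32_t)gc_info << 16);
}

/* Extract the individual fields again. */
static uint8_t  toy_type(uint32_t type_info)    { return (uint8_t)(type_info & 0xff); }
static uint8_t  toy_flags(uint32_t type_info)   { return (uint8_t)((type_info >> 8) & 0xff); }
static uint16_t toy_gc_info(uint32_t type_info) { return (uint16_t)(type_info >> 16); }
```

Initializing or comparing all three fields at once then becomes a single 32-bit operation on the packed word.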

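The following sections look at how each complex type is implemented and how it differs from PHP 5. Reference types were already covered in the previous article. Resources are not covered here, because I don't find them interesting.

String

Strings in PHP 7 have a dedicated zend_string type, defined as follows: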

struct _zend_string {
    zend_refcounted   gc;   /* the common zend_refcounted header described above */
    zend_ulong        h;        /* hash value */
    size_t            len;
    char              val[1];
};

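Apart from the refcounted header, the string structure contains a hash cache h, a length len and a value val.
The hash cache avoids recomputing the hash every time the string is used as a key in a hashtable lookup. It is initialized to a (non-zero) hash value on first use.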
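The hash cache can be sketched as follows. This is an illustration, not the engine's code: toy_hstring and toy_hash() are hypothetical, the multiply-by-33 hash is just a stand-in, and setting the top bit is one possible way to guarantee that 0 can mean "not computed yet".

```c
#include <stddef.h>
#include <stdint.h>

typedef struct toy_hstring {
    uint64_t    h;    /* cached hash, 0 means "not computed yet" */
    size_t      len;
    const char *val;
} toy_hstring;

/* Counts how often the hash is actually computed (for demonstration). */
static int toy_hash_computations = 0;

/* Return the hash of s, computing and caching it on first use. */
static uint64_t toy_hash(toy_hstring *s) {
    if (s->h == 0) {
        uint64_t h = 5381;
        for (size_t i = 0; i < s->len; i++) {
            h = h * 33 + (unsigned char)s->val[i];
        }
        s->h = h | (UINT64_C(1) << 63);  /* force a non-zero value */
        toy_hash_computations++;
    }
    return s->h;
}
```

Looking up the same key twice only walks the characters once; the second call returns the cached value.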

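If you are not very familiar with C, the definition of val may look odd: how can a single char hold a string? This uses the "struct hack": the array is declared with only one element, but when a zend_string is created we allocate enough memory for the whole string, which can then be accessed through the val member.

Strictly speaking this is undefined behavior, because we read and write past the end of a one-character array, but in practice C compilers leave such code alone. C99 supports this pattern explicitly in the form of "flexible array members".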
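For comparison, this is roughly what the same layout looks like with a C99 flexible array member. It is a sketch under assumptions: toy_fstring and toy_fstring_from_cstr() are hypothetical names, and the real zend_string also carries the refcounted header and the hash.

```c
#include <stdlib.h>
#include <string.h>

typedef struct toy_fstring {
    size_t len;
    char   val[];   /* C99 flexible array member */
} toy_fstring;

/* Header and character data live in a single allocation.
 * The C string has to be copied into it. */
static toy_fstring *toy_fstring_from_cstr(const char *s) {
    size_t len = strlen(s);
    toy_fstring *str = malloc(sizeof *str + len + 1);
    if (!str) return NULL;
    str->len = len;
    memcpy(str->val, s, len + 1);  /* includes the trailing NUL */
    return str;
}
```

Going from toy_fstring to a C string is just str->val, while the other direction needs an allocation and a copy.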

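The new string type has some advantages over plain C strings. First, it stores the length, so the length does not have to be computed by scanning the string. Second, the string itself is refcounted, so the same string can be shared in several places without going through a zval. This is important for sharing hashtable keys.

The new string type also has one big disadvantage compared with C strings: getting a C string from a zend_string is easy (just use str->val), but a C string cannot be turned into a zend_string directly. You have to allocate a new zend_string and copy the C string into it, which is inconvenient when handling strings in C code.

These are the string-specific flags (stored in the flags field of the gc header):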

#define IS_STR_PERSISTENT           (1<<0) /* allocated using malloc */
#define IS_STR_INTERNED             (1<<1) /* interned string */
#define IS_STR_PERMANENT            (1<<2) /* interned string surviving request boundary */

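Persistent strings use the system allocator rather than the Zend memory manager (ZMM), so they can live longer than a single request. Recording the allocator as a flag makes it possible to use persistent strings transparently in zvals, whereas PHP 5 required copying them into the ZMM first.

Interned strings are not destroyed until the end of the request and do not need refcounting. They are also deduplicated: when a new interned string is created, the engine first checks whether one with the same content already exists. Strings that appear in PHP code (string literals, variable names, function names and so on) are usually interned.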
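The deduplication step can be sketched like this. It is a toy model under stated assumptions: the engine uses a hashtable, whereas this sketch uses a small fixed array with a linear scan for brevity, and toy_intern() and toy_intern_shutdown() are hypothetical names.

```c
#include <stdlib.h>
#include <string.h>

#define TOY_INTERN_MAX 64

static char  *toy_interned[TOY_INTERN_MAX];
static size_t toy_interned_count = 0;

/* Return the unique stored copy of s, adding it on first sight.
 * Returns NULL if the table is full or allocation fails. */
static const char *toy_intern(const char *s) {
    for (size_t i = 0; i < toy_interned_count; i++) {
        if (strcmp(toy_interned[i], s) == 0) return toy_interned[i];
    }
    if (toy_interned_count == TOY_INTERN_MAX) return NULL;
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (!copy) return NULL;
    memcpy(copy, s, n);
    toy_interned[toy_interned_count++] = copy;
    return copy;
}

/* End of "request": destroy all interned strings at once. */
static void toy_intern_shutdown(void) {
    for (size_t i = 0; i < toy_interned_count; i++) free(toy_interned[i]);
    toy_interned_count = 0;
}
```

Because equal contents always map to the same pointer, no per-string refcount is needed; everything is freed together at shutdown.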

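Permanent strings are interned strings created before a request starts. Unlike normal interned strings, they are not destroyed when the request ends.

If Opcache is used, interned strings are stored in shared memory (SHM) and shared by all PHP worker processes. In that case the notion of a permanent string becomes irrelevant, because interned strings are never destroyed.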

Array

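Arrays were already covered in the previous article, so I won't go into detail here. Some recent small changes have affected a few details, but the overall concepts are still the same.

I will only mention one new concept: immutable arrays. They are essentially the array counterpart of interned strings: they are not refcounted and are not destroyed before the end of the request (or possibly later).

Because of some memory management concerns, immutable arrays are only used when Opcache is enabled. The following code shows what difference that makes: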

for ($i = 0; $i < 1000000; ++$i) {
    $array[] = ['foo'];
}
var_dump(memory_get_usage());

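With Opcache the memory usage is 32 MB; without it, usage jumps to 390 MB, because every element of $array gets a new copy of ['foo']. The reason a real copy is made (instead of a refcount increment) is that the VM avoids refcounting here to prevent SHM corruption. I hope this currently catastrophic case can be improved in the future to work well without Opcache.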

Objects in PHP 5

Let's first look at how objects worked in PHP 5 and point out some of the inefficiencies. The zval itself stored a zend_object_value, defined as follows:

typedef struct _zend_object_value {
    zend_object_handle handle;
    const zend_object_handlers *handlers;
} zend_object_value;

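handle is the unique ID of the object and is used to look up the object's data. handlers is a virtual function table of pointers to the functions implementing the object's various behaviors. For ordinary PHP objects the handler table is always the same, but objects created by extensions can use custom handlers to change how the object behaves.

The object handle is an index into the object store. The object store is an array of buckets that look like this: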

typedef struct _zend_object_store_bucket {
    zend_bool destructor_called;
    zend_bool valid;
    zend_uchar apply_count;
    union _store_bucket {
        struct _store_object {
            void *object;
            zend_objects_store_dtor_t dtor;
            zend_objects_free_object_storage_t free_storage;
            zend_objects_store_clone_t clone;
            const zend_object_handlers *handlers;
            zend_uint refcount;
            gc_root_buffer *buffered;
        } obj;
        struct {
            int next;
        } free_list;
    } bucket;
} zend_object_store_bucket;

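There is a lot going on here. The first three members are metadata (whether the destructor has been called, whether the bucket is in use, and how many times the object has been visited recursively). The union distinguishes between a bucket that is currently in use and one that is on the free list. What matters for us is the struct _store_object case.

The first member, object, is a pointer to the actual object. The object is not embedded directly in the object store bucket, because objects have no fixed size. The pointer is followed by three handlers that take care of destruction, freeing and cloning. Note that destroying and freeing an object are separate steps in PHP, and the former is skipped in some cases (unclean shutdown). The clone handler is practically never used. Because these handlers are (for whatever reason) not part of the ordinary object handlers, they are duplicated for every single object rather than shared.

Next comes a pointer to the ordinary object handlers. It is stored here for the case where the object is destroyed without a zval being known (the zval is what normally stores the handlers).

The bucket also contains a refcount, which looks odd given that the zval already stores a refcount. Why is it needed? Normally "copying" a zval only increments its refcount, but in some cases a real copy occurs, for example when a completely new zval is allocated with the same zend_object_value. You then have two distinct zvals sharing the same object store bucket, so the bucket needs a refcount of its own. This "double refcounting" is an inherent weakness of the PHP 5 zval implementation. The buffered pointer into the GC root buffer is duplicated for the same reason.

Now let's look at the actual object that the object pointer refers to. For userland objects it usually looks like this: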

typedef struct _zend_object {
    zend_class_entry *ce;
    HashTable *properties;
    zval **properties_table;
    HashTable *guards;
} zend_object;

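The zend_class_entry points to the class the object is an instance of. properties and properties_table store the object's properties. Dynamic properties (those added at runtime and not declared in the class) are stored in properties. Declared properties get an optimization: at compile time every property is assigned an index, and its value is stored at that index in properties_table. The mapping between property names and indexes is stored in a hashtable in the class entry, so individual objects avoid the memory overhead of a hashtable. In addition, the property index is cached at runtime.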
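The split between a per-class name table and per-object value slots can be sketched as follows. This is a simplified model with hypothetical names (toy_class, toy_object, toy_prop_index()); the values are plain long instead of zvals, and the lookup is a linear scan where the engine uses a hashtable plus a runtime cache.

```c
#include <string.h>

#define TOY_MAX_PROPS 8

/* Per class: property names, each at a fixed index. */
typedef struct toy_class {
    const char *prop_names[TOY_MAX_PROPS];
    int         prop_count;
} toy_class;

/* Per object: only the values, addressed by index. */
typedef struct toy_object {
    const toy_class *ce;
    long             slots[TOY_MAX_PROPS];
} toy_object;

/* Look up the slot index of a declared property, or -1 if undeclared. */
static int toy_prop_index(const toy_class *ce, const char *name) {
    for (int i = 0; i < ce->prop_count; i++) {
        if (strcmp(ce->prop_names[i], name) == 0) return i;
    }
    return -1;
}
```

Every object of the same class shares one name table, so each object only pays for its value slots.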

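The guards hashtable is used to implement magic methods such as __get and is not discussed here.

Apart from the double refcounting already mentioned, the object implementation also uses a lot of memory: an object of a class with a single property takes 136 bytes. There is also a lot of indirection. To fetch a property from an object zval, you first fetch the object store bucket, then the zend_object, then properties_table, and finally the zval it points to. That is already four levels of indirection (and in practice it is no fewer than seven).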

Objects in PHP 7

PHP 7 tries to fix these problems by removing the double refcounting, reducing memory usage and reducing indirection. Here is the new zend_object:

struct _zend_object {
    zend_refcounted   gc;
    uint32_t          handle;
    zend_class_entry *ce;
    const zend_object_handlers *handlers;
    HashTable        *properties;
    zval              properties_table[1];
};

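Note that this structure is now nearly all that is left of an object: zend_object_value has been replaced by a direct pointer to the object, and the object store, while still present, matters much less.

Apart from the usual zend_refcounted header, you can see that handle and handlers have moved into zend_object. properties_table now also uses the struct hack, so the zend_object and the properties table are allocated in one block of memory. And of course the properties table now stores zvals directly rather than pointers to them.

guards is no longer a direct member of the zend_object structure. If it is used, it is stored in the first properties_table slot; if magic methods such as __get are not used, guards is omitted entirely.

The former dtor, free_storage and clone handlers have moved into handlers: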

struct _zend_object_handlers {
    /* offset of real object header (usually zero) */
    int                                     offset;
    /* general object functions */
    zend_object_free_obj_t                  free_obj;
    zend_object_dtor_obj_t                  dtor_obj;
    zend_object_clone_obj_t                 clone_obj;
    /* individual object functions */
    // ... rest is about the same in PHP 5
};

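offset relates to how internal objects are represented. An internal object always embeds a standard zend_object but typically adds further members. In PHP 5 this was done like this: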

struct custom_object {
    zend_object std;
    uint32_t something;
    // ...
};

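This means that given a zend_object* you can simply cast it to your struct custom_object*, which is the usual way of doing structure inheritance in C. In PHP 7, however, this approach breaks: because zend_object uses the struct hack to store the properties table, PHP stores properties past the end of zend_object, which would overwrite the additional custom members. In PHP 7 the extra members therefore have to go first: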

struct custom_object {
    uint32_t something;
    // ...
    zend_object std;
};

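The consequence is that you can no longer convert between zend_object* and struct custom_object* with a plain cast, because the two pointers are separated by an offset. This is the value stored in the offset member of the handlers; at compile time it can be determined with the offsetof() macro.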
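The pointer adjustment can be sketched like this. toy_std_object stands in for zend_object, and toy_custom_from_std() is a hypothetical helper; extensions typically wrap the same arithmetic in their own small inline function.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the standard object. */
typedef struct toy_std_object {
    uint32_t handle;
} toy_std_object;

typedef struct toy_custom_object {
    uint32_t       something;
    toy_std_object std;   /* the standard object goes last */
} toy_custom_object;

/* Recover the enclosing custom object from a pointer to its std member. */
static toy_custom_object *toy_custom_from_std(toy_std_object *obj) {
    return (toy_custom_object *)((char *)obj - offsetof(toy_custom_object, std));
}
```

Subtracting the offset moves the pointer back from the embedded standard object to the start of the enclosing structure.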

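You may wonder why PHP 7 objects still contain a handle. After all, a zval now holds a direct pointer to the zend_object, so the handle is no longer needed to look the object up in the object store.

The handle is still necessary, however, because the object store still exists, even though it has been simplified considerably. It is now just an array of pointers to objects. When an object is created, a pointer to it is inserted into the object store at the index given by its handle, and the pointer is removed again when the object is freed.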
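A simplified picture of such a store is shown below. This is a sketch only: the fixed size, the first-free-slot search and the names toy_store_put() and toy_store_del() are assumptions made for illustration, not the engine's implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_STORE_SIZE 16

typedef struct toy_obj {
    uint32_t handle;   /* index of this object in the store */
} toy_obj;

/* The store is just an array of pointers to live objects. */
static toy_obj *toy_store[TOY_STORE_SIZE];

/* Register obj in the first free slot. Returns 0 on success, -1 if full. */
static int toy_store_put(toy_obj *obj) {
    for (uint32_t i = 0; i < TOY_STORE_SIZE; i++) {
        if (toy_store[i] == NULL) {
            toy_store[i] = obj;
            obj->handle = i;
            return 0;
        }
    }
    return -1;
}

/* Remove obj from the store when it is freed. */
static void toy_store_del(toy_obj *obj) {
    toy_store[obj->handle] = NULL;
}
```

Walking the array visits every live object, which is what makes a pass over all objects possible.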

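Why is the object store still needed? During request shutdown there comes a point where it is no longer safe to run userland code, because the executor has already been partially shut down. To avoid this, PHP runs all object destructors early during shutdown and prevents them from running later. This requires a list of all live objects, which is what the object store provides.

The handle is also useful for debugging, because it gives every object a unique ID, making it easy to see whether two objects are really the same or merely have the same contents. HHVM still stores an object handle even though it has no concept of an object store.

Compared with PHP 5 there is now only one refcount (the zval itself no longer has one), and memory usage is much smaller: 40 bytes for the object structure and 16 bytes per property (including its zval). There is also much less indirection, because many of the intermediate structures were either dropped or embedded. Reading a property now takes a single level of indirection instead of four.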

Indirect zvals

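At this point all the regular zval types have been covered. There are two more types that only appear in specific situations, both new in PHP 7. One of them is IS_INDIRECT.

An indirect zval means that the actual value is stored somewhere else. Note that this differs from IS_REFERENCE: an indirect zval points directly to another zval, whereas a reference points to a zend_reference structure that embeds a zval.

To understand when this is necessary, consider how PHP implements variables:

All variables known at compile time are assigned an index, and their values are stored at that index in the compiled variables (CV) table. PHP also lets you reference variables dynamically, for example with $$val. When that happens, PHP creates a symbol table for the function or script, which maps variable names to their values.

This raises a question: how can both forms of access be supported at the same time? We need the CV table for ordinary variable access and the symbol table for $$ variables. In PHP 5 the CV table used doubly indirected zval** pointers. Normally these pointed into a second table of zval* pointers, which in turn pointed to the actual zvals: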

+------ CV_ptr_ptr[0]
| +---- CV_ptr_ptr[1]
| | +-- CV_ptr_ptr[2]
| | |
| | +-> CV_ptr[0] --> some zval
| +---> CV_ptr[1] --> some zval
+-----> CV_ptr[2] --> some zval

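Once a symbol table came into use, the second table of zval* pointers was left unused and the zval** pointers were updated to point into the hashtable buckets instead, for example: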

CV_ptr_ptr[0] --> SymbolTable["a"].pDataPtr --> some zval
CV_ptr_ptr[1] --> SymbolTable["b"].pDataPtr --> some zval
CV_ptr_ptr[2] --> SymbolTable["c"].pDataPtr --> some zval

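PHP 7 cannot use the same approach, because pointers into hashtable buckets become invalid when the hashtable is rehashed. Instead PHP 7 uses the reverse strategy: for variables stored in the CV table, the symbol table contains an INDIRECT entry pointing to the CV entry. The CV table is not reallocated for the lifetime of the symbol table, so there is no problem with invalidated pointers.

So if a function uses the CVs $a, $b and $c and also dynamically creates $d, the symbol table looks something like this: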

SymbolTable["a"].value = INDIRECT --> CV[0] = LONG 42
SymbolTable["b"].value = INDIRECT --> CV[1] = DOUBLE 42.0
SymbolTable["c"].value = INDIRECT --> CV[2] = STRING --> zend_string("42")
SymbolTable["d"].value = ARRAY --> zend_array([4, 2])

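An indirect zval can also point to an IS_UNDEF zval, in which case it is treated as if the hashtable did not contain the associated key. So if unset($a) writes UNDEF into CV[0], this is treated as if the symbol table no longer had a key "a".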
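The lookup rule can be sketched as follows. This is a toy model, not the engine's zval: toy_zval, the TOY_* type tags and toy_resolve() are hypothetical, and only the long type is modeled.

```c
#include <stddef.h>

typedef enum { TOY_UNDEF, TOY_LONG, TOY_INDIRECT } toy_kind;

typedef struct toy_zval {
    toy_kind type;
    union {
        long             lval;
        struct toy_zval *ind;   /* target when type == TOY_INDIRECT */
    } value;
} toy_zval;

/* Resolve a symbol table entry to the zval that holds the value.
 * Returns NULL when the variable should be treated as missing. */
static toy_zval *toy_resolve(toy_zval *entry) {
    if (entry->type == TOY_INDIRECT) {
        entry = entry->value.ind;   /* follow the pointer into the CV table */
    }
    return entry->type == TOY_UNDEF ? NULL : entry;
}
```

Writing TOY_UNDEF into the CV slot is enough to make the symbol table entry look absent, without touching the hashtable itself.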

Constants and ASTs

Both PHP 5 and PHP 7 have two special types, IS_CONSTANT and IS_CONSTANT_AST. What are they for? Consider the following example:

function test($a = ANSWER,
              $b = ANSWER * ANSWER) {
    return $a + $b;
}

define('ANSWER', 42);
var_dump(test()); // int(42 + 42 * 42)

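Both default values of test() use the constant ANSWER, but the constant is not yet defined when the function is declared. It only becomes available once define() has run.

For this reason, default values of parameters and properties, constants, and everything else that accepts a static expression have to postpone evaluation of the expression until first use.

If the value is a constant (or class constant), which is the most common case of late evaluation, this is signaled by a zval of type IS_CONSTANT. If the value is a constant expression, an IS_CONSTANT_AST zval pointing to an abstract syntax tree (AST) is used.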
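Late evaluation of a constant default can be sketched like this. The sketch only models the simple "constant name" case, not the AST case, and everything in it (toy_define(), toy_default, toy_default_value()) is hypothetical and much simpler than the engine's implementation.

```c
#include <stddef.h>
#include <string.h>

#define TOY_MAX_CONSTANTS 8

typedef struct toy_constant { const char *name; long value; } toy_constant;

static toy_constant toy_constants[TOY_MAX_CONSTANTS];
static size_t       toy_constant_count = 0;

/* Counterpart of define(): register a constant at runtime. */
static void toy_define(const char *name, long value) {
    if (toy_constant_count < TOY_MAX_CONSTANTS) {
        toy_constants[toy_constant_count].name = name;
        toy_constants[toy_constant_count].value = value;
        toy_constant_count++;
    }
}

typedef struct toy_default {
    int         is_constant;  /* 1 while only the constant name is known */
    const char *name;
    long        value;
} toy_default;

/* Resolve the default on first use. Returns 0 on success,
 * -1 if the constant is still undefined. */
static int toy_default_value(toy_default *d, long *out) {
    if (d->is_constant) {
        size_t i;
        for (i = 0; i < toy_constant_count; i++) {
            if (strcmp(toy_constants[i].name, d->name) == 0) break;
        }
        if (i == toy_constant_count) return -1;
        d->value = toy_constants[i].value;
        d->is_constant = 0;   /* keep the resolved value for later uses */
    }
    *out = d->value;
    return 0;
}
```

The default can be declared before the constant exists; it only has to be defined by the time the value is first needed.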

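That is all on the implementation of variables, after two articles. Later I will write about some of the VM optimizations (in particular the calling convention) and the compiler optimizations.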

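The difference between spinlock and mutex (reposted)

First, the spinlock discussed here is the kernel one. You can of course implement a spinlock yourself in user space, but the spinlock_t type is only available in kernel code. Semaphores, by contrast, exist in both kernel and user space, and a mutex is a special kind of semaphore.

A spinlock busy-waits: the process does not sleep, it just keeps looping. A mutex sleep-waits: if the process cannot get the critical resource, it goes to sleep. So when should you use a spinlock and when a mutex? First, where sleeping is not allowed, such as in interrupt context, only a spinlock can be used. Second, if the code in the critical section takes less time to run than a process context switch, use a spinlock. Otherwise use a mutex.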
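To illustrate the busy-wait behavior, here is a minimal user-space spinlock built on C11 atomics. It is a sketch, not the kernel's spinlock_t: the toy_spin_* names are hypothetical, and a production lock would at least add a pause or backoff inside the loop.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct toy_spinlock {
    atomic_flag locked;
} toy_spinlock;

#define TOY_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Try to take the lock once. Returns true if we got it. */
static bool toy_spin_trylock(toy_spinlock *l) {
    return !atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire);
}

/* Busy-wait until the lock is free; the caller never sleeps. */
static void toy_spin_lock(toy_spinlock *l) {
    while (!toy_spin_trylock(l)) {
        /* spin */
    }
}

static void toy_spin_unlock(toy_spinlock *l) {
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

The waiting thread burns CPU in the loop instead of being descheduled, which only pays off when the critical section is shorter than a context switch.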

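And what is the difference between a mutex and a semaphore? A mutex is used for mutual exclusion, while a semaphore is used for synchronization. In other words, a mutex is always initialized to 1, whereas a semaphore can be initialized to any number. With a mutex, the first process to enter the critical section can always run and the others have to wait. With a semaphore that is not necessarily so: if it is initialized to 0, all processes have to wait. Another difference is that the process that acquires a mutex must release it itself, whereas a semaphore can be acquired by one process and released by another.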