09:23
Digital Ocean, who I assume are very nice people and meant well, did a Hacktoberfest event in which people were encouraged to submit a pull request to any open source GitHub project. In exchange, “contributors” would get a t-shirt.
You can probably guess what happened:
sigmavirus24: Hey @digitalocean your October pull request event has done nothing but wreak havoc on projects I maintain. I’ve had to reject too many PRs
Reviewing pull requests is a lot of work. Getting buried under trivial pull requests by people who are only after a t-shirt is a great way to get people to take projects private. Please don’t use open source projects as free marketing.
12:41
SERIAL (32 bit integer) or BIGSERIAL (64 bit integer) are the first choice for most people for a synthetic primary key. They’re easy, they’re comprehensible, and they’re transaction-safe. The values that come out of them are, at least to start, manageable and human-scale. They can also provide an easy sortation on creation order.
They’re not perfect, though: If you have to merge together two tables that were generated using SERIALs, you have a massive key update ahead of you to avoid conflicts. If you use SERIAL, exhausting the range is a possibility. If you have a sharded database, you need some way of keeping the sequences independent, such as different starting offsets (but what if you get the offset wrong?) or creating them using different increments (but what if you add another server)
A good alternative is using UUIDs, generated using the uuid_generate_v4()
function in PostgreSQL’s uuid-ossp contrib module. This makes mergers much easier, and guarantees independence across multiple tables.
But UUIDs are 128 bits, not 64, and require a function call to generate. How much of a problem is that, really?
As a test, I created a table with a single primary key column and a single float field:
CREATE TABLE test(
id <type>,
f float,
PRIMARY KEY (id)
)
<type> could be one of three possibilities:
- BIGSERIAL.
- UUID, using
uuid_generate_v4()
.
- BIGINT, using the
next_id
function from Instagram.
The test inserted 10,000,000 rows into the table. In one run, it did a COMMIT after each INSERT; in the other, a single COMMIT after all INSERTs. This was on PostgreSQL 9.4.4 on an AWS i2.2xlarge instance, with the two SSDs in a RAID-0 as the database volume.
The results were:
COMMITing after each INSERT:
column type | time (s) | size (MB) |
BIGSERIAL | 4262 | 636.7 |
UUID | 4367 | 890.0 |
BIGINT | 4624 | 636.7 |
Bulk COMMIT:
column type | time (s) | size (MB) |
BIGSERIAL | 898 | 636.7 |
UUID | 991 | 890.0 |
BIGINT | 1147 | 636.7 |
Overall, the INSERT time for UUIDs was slightly longer than that for BIGSERIAL, but not appreciably. The BIGINT column was notably slower, due to the PL/pgSQL function generating the new keys.
The UUID tables were bigger, of course, although this is an extremely example in that the primary key was only one of two fields in the table; more realistic tables with more columns would not show the same percentage increase.
The conclusion I draw is that it is fine to use UUIDs unless you are faced with a very tight INSERT performance requirement; they are surprisingly efficient compared to BIGSERIAL. My supposition is that the increased computation for the UUID is balanced agains the I/O to maintain the SERIAL.
23:36
One common source of query problems in PostgreSQL results an unexpectedly-bad query plan when a LIMIT clause is included in a query. The typical symptom is that PostgreSQL picks an index-based plan that actually takes much, much longer than if a different index, or no index at all, had been used.
Here’s an example. First, we create a simple table and an index on it:
xof=# CREATE TABLE sample (
xof(# i INTEGER,
xof(# f FLOAT
xof(# );
CREATE TABLE
xof=# CREATE INDEX ON sample(f);
CREATE INDEX
And fill it with some data:
xof=# INSERT INTO sample SELECT 0, random() FROM generate_series(1, 10000000);
INSERT 0 10000000
xof=# ANALYZE;
ANALYZE
Then, for about 5% of the table, we set i
to 1:
xof=# UPDATE sample SET i=1 WHERE f<0.05;
UPDATE 499607
xof=# ANALYZE;
ANALYZE
Now, let’s find all of the entires where i
is 1, in descending order of f
.
xof=# EXPLAIN ANALYZE SELECT * FROM sample WHERE i=1 ORDER BY f DESC;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Sort (cost=399309.76..401406.04 rows=838509 width=12) (actual time=1415.166..1511.202 rows=499607 loops=1)
Sort Key: f
Sort Method: quicksort Memory: 35708kB
-> Seq Scan on sample (cost=0.00..316811.10 rows=838509 width=12) (actual time=1101.836..1173.262 rows=499607 loops=1)
Filter: (i = 1)
Rows Removed by Filter: 9500393
Total runtime: 1542.529 ms
(7 rows)
So, 1.5 seconds to do a sequential scan on the whole table. So, just getting the first 10 entries from that should be much faster, right?
xof=# EXPLAIN ANALYZE SELECT * FROM sample WHERE i=1 ORDER BY f DESC LIMIT 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..277.33 rows=10 width=12) (actual time=12710.612..12710.685 rows=10 loops=1)
-> Index Scan Backward using sample_f_idx on sample (cost=0.43..23218083.52 rows=838509 width=12) (actual time=12710.610..12710.682 rows=10 loops=1)
Filter: (i = 1)
Rows Removed by Filter: 9500393
Total runtime: 12710.714 ms
(5 rows)
Oh. 12.7 seconds. What happened?
PostgreSQL doesn’t keep correlated statistics about columns; each column’s statistics are kept independently. Thus, PostgreSQL made an assumption about the distribution of values of i
in the table: they were scattered more or less evenly throughout. Thus, walking the index backwards meant that, to get 10 “hits,” it would have to scan about 100 index entries… and the index scan would be a big win.
It was wrong, however, because all of the i=1
values were clustered right at the beginning. If we reverse the order of the scan, we can see that was a much more efficient plan:
xof=# EXPLAIN ANALYZE SELECT * FROM sample WHERE i=1 ORDER BY f LIMIT 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..277.33 rows=10 width=12) (actual time=0.029..0.046 rows=10 loops=1)
-> Index Scan using sample_f_idx on sample (cost=0.43..23218083.52 rows=838509 width=12) (actual time=0.027..0.044 rows=10 loops=1)
Filter: (i = 1)
Total runtime: 0.071 ms
(4 rows)
So, what do we do? There’s no way of telling PostgreSQL directly to pick one plan over the other. We could just turn off index scans for the query, but that could well have bad side effects.
In this particular case, where a predicate (like the WHERE i=1
) picks up a relatively small number of entries, we can use a Common Table Expression, or CTE. Here’s the example rewritten using a CTE:
xof=# EXPLAIN ANALYZE
xof-# WITH inner_query AS (
xof(# SELECT * FROM sample WHERE i=1
xof(# )
xof-# SELECT * FROM inner_query ORDER BY f LIMIT 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=351701.16..351701.18 rows=10 width=12) (actual time=1371.946..1371.949 rows=10 loops=1)
CTE inner_query
-> Seq Scan on sample (cost=0.00..316811.10 rows=838509 width=12) (actual time=1168.034..1244.785 rows=499607 loops=1)
Filter: (i = 1)
Rows Removed by Filter: 9500393
-> Sort (cost=34890.06..36986.33 rows=838509 width=12) (actual time=1371.944..1371.944 rows=10 loops=1)
Sort Key: inner_query.f
Sort Method: top-N heapsort Memory: 25kB
-> CTE Scan on inner_query (cost=0.00..16770.18 rows=838509 width=12) (actual time=1168.040..1325.496 rows=499607 loops=1)
Total runtime: 1381.472 ms
(10 rows)
A CTE is an “optimization fence”: The planner is prohibited from pushing the ORDER BY or LIMIT down into the CTE. In this case, that means that it is also prohibited from picking the index scan, and we’re back to the sequential scan.
So, when you see a query come completely apart, and it has a LIMIT clause, check to see if PostgreSQL is guessing wrong about the distribution of data. If the total number of hits before the LIMIT are relatively small, you can often use a CTE to isolate that part, and only apply the LIMIT thereafter. (Of course, you might be better off just doing the LIMIT operation in your application!)
14:10
Assignment: Load a remote URL via https, verifying the site certificate against a local root CA certificate.
Java, Android SDK:
// Load CAs from an InputStream
// (could be from a resource or ByteArrayInputStream or ...)
CertificateFactory cf = CertificateFactory.getInstance("X.509");
// From https://www.washington.edu/itconnect/security/ca/load-der.crt
InputStream caInput = new BufferedInputStream(new FileInputStream("load-der.crt"));
Certificate ca;
try {
ca = cf.generateCertificate(caInput);
System.out.println("ca=" + ((X509Certificate) ca).getSubjectDN());
} finally {
caInput.close();
}
// Create a KeyStore containing our trusted CAs
String keyStoreType = KeyStore.getDefaultType();
KeyStore keyStore = KeyStore.getInstance(keyStoreType);
keyStore.load(null, null);
keyStore.setCertificateEntry("ca", ca);
// Create a TrustManager that trusts the CAs in our KeyStore
String tmfAlgorithm = TrustManagerFactory.getDefaultAlgorithm();
TrustManagerFactory tmf = TrustManagerFactory.getInstance(tmfAlgorithm);
tmf.init(keyStore);
// Create an SSLContext that uses our TrustManager
SSLContext context = SSLContext.getInstance("TLS");
context.init(null, tmf.getTrustManagers(), null);
// Tell the URLConnection to use a SocketFactory from our SSLContext
URL url = new URL("https://certs.cac.washington.edu/CAtest/");
HttpsURLConnection urlConnection =
(HttpsURLConnection)url.openConnection();
urlConnection.setSSLSocketFactory(context.getSocketFactory());
InputStream in = urlConnection.getInputStream();
copyInputStreamToOutputStream(in, System.out);
Python:
import requests
x = requests.get('https://certs.cac.washington.edu/CAtest/', verify='load-der.crt')