We have the data we have. The NY Times data has been recorded using the same protocol all along. Everything else is conjecture. That is how science works. We constantly re-examine hypotheses when new data comes in. The provenance of the data sets are important. With all its limitations, the NYT data set is about as good as it gets. If you want to learn more Udacity has a nice, free introduction to Data Science course
here.